From Playwright Scraper to Clinical Intelligence Platform in One Week
How we turned Heidi Health session data — trapped in a third-party app — into a searchable, embeddable, auditable clinical record system using DynamoDB, Bedrock, and React.
A week ago, the Heidi scraper was a useful but limited tool: a Playwright script that logged into Heidi Health, pulled clinical session data, and saved JSON files to disk. The data existed. You couldn’t do anything with it.
This week, we shipped five interconnected services that turned that flat file dump into a proper clinical intelligence layer. Here’s what we built and why the pieces fit together the way they do.
The Starting Problem
Heidi Health is the AI clinical documentation tool Dr. Ray uses to generate session notes and consult summaries. It’s good at what it does. It does not have a usable API.
So we have a Playwright scraper that impersonates the browser session and extracts the data — session transcripts, consult notes, patient names, dates. The scraper worked. But writing to a local JSON file means the data is only accessible on the machine that ran the script. One misplaced laptop and six months of clinical sessions vanish.
Step 1: Persistence That Survives Reboots
The first change was adding a real persistence layer to the scraper itself. Sessions now write to two places: DynamoDB (HeidiSessions table) for structured metadata, and S3 for full transcript JSON. The scraper deduplicates on sessionId before writing, so repeated runs are safe.
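The dedup-then-write flow can be sketched with a DynamoDB conditional put. This is an illustrative reconstruction, not the repo's actual code: the attribute names, the `heidi-transcripts` bucket, and the `save_session` helper are assumptions.

```python
import json

try:
    from botocore.exceptions import ClientError  # ships with boto3
except ImportError:
    # Minimal stand-in so the sketch runs without boto3 installed.
    class ClientError(Exception):
        def __init__(self, error_response, operation_name):
            super().__init__(operation_name)
            self.response = error_response


def save_session(table, session: dict, s3=None, bucket: str = "heidi-transcripts") -> bool:
    """Write session metadata; return False if this sessionId was already scraped."""
    try:
        table.put_item(
            Item={
                "sessionId": session["sessionId"],
                "patientNameLower": session["patientName"].lower(),
                "sessionDate": session["date"],
            },
            # Dedup on sessionId: the put only succeeds if no item with
            # this key exists, so repeated scraper runs are idempotent.
            ConditionExpression="attribute_not_exists(sessionId)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # already persisted on an earlier run
        raise
    if s3 is not None:
        # Full transcript JSON goes to S3; only queryable metadata lives in DynamoDB.
        s3.put_object(
            Bucket=bucket,
            Key=f"sessions/{session['sessionId']}.json",
            Body=json.dumps(session).encode("utf-8"),
        )
    return True
```

The conditional expression pushes deduplication into the database itself, so two concurrent scraper runs can't race each other into a duplicate write.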
Three GSIs on the DynamoDB table make the data actually queryable: by patient name (lowercase, for case-insensitive lookups), by calendar date, and by month. The scraper also triggers an async Lambda invoke for sessions with consult notes — kicking off the embedding pipeline.
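The shape of those two pieces, sketched under assumptions: the GSI name `patientNameLower-index` and the `{"action": ..., "sessionId": ...}` payload are hypothetical, but the `InvocationType="Event"` async-invoke mechanism is standard Lambda behavior.

```python
import json


def sessions_for_patient(table, patient_name: str) -> list:
    """Case-insensitive lookup via the lowercased-patient-name GSI."""
    resp = table.query(
        IndexName="patientNameLower-index",  # illustrative GSI name
        KeyConditionExpression="patientNameLower = :n",
        ExpressionAttributeValues={":n": patient_name.strip().lower()},
    )
    return resp.get("Items", [])


def trigger_embedding(lambda_client, session_id: str) -> None:
    """Fire-and-forget invoke: the scraper doesn't wait for the embedding Lambda."""
    lambda_client.invoke(
        FunctionName="heidi-embed-session",
        InvocationType="Event",  # async; the call returns immediately
        Payload=json.dumps({"action": "embed", "sessionId": session_id}),
    )
```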
Step 2: Semantic Search via Bedrock Embeddings
The heidi-embed-session Lambda handles the AI layer. It calls Bedrock Titan Text Embeddings V2 to generate 256-dimensional vectors from consult note text, then stores them back in DynamoDB alongside the session record.
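The Titan V2 call itself is compact. The `inputText`, `dimensions`, and `normalize` request fields are real parameters of `amazon.titan-embed-text-v2:0`; the `embed_consult_note` wrapper is an illustrative sketch, not the Lambda's actual code.

```python
import json


def embed_consult_note(bedrock, text: str, dims: int = 256) -> list:
    """Generate an embedding with Bedrock Titan Text Embeddings V2."""
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({
            "inputText": text,
            "dimensions": dims,   # Titan V2 supports 256, 512, or 1024
            "normalize": True,    # unit-length vectors simplify cosine similarity
        }),
    )
    return json.loads(resp["body"].read())["embedding"]
```

Choosing 256 dimensions over Titan's 1024 default trades a little recall for a 4x smaller item footprint, which matters when the vectors live inline in DynamoDB records.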
The Lambda supports three actions: embed (process a single session), search (vector similarity query), and backfill (run over all historical sessions that don’t yet have embeddings). The backfill was necessary — once we had the embedding infrastructure, we wanted the full historical corpus searchable, not just sessions going forward.
A query like “patient with gestational diabetes concerns” returns the semantically closest sessions, ranked by cosine similarity. That’s a qualitatively different capability from keyword search, which would miss a note that says “GDM screening” but never the word “diabetes.”
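The ranking step reduces to scoring every stored vector against the query vector. A minimal sketch (function names are illustrative; at this corpus size a brute-force scan over DynamoDB items is entirely reasonable):

```python
import math


def cosine_similarity(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank_by_similarity(query_vec, sessions, top_k: int = 5):
    """Return (score, sessionId) pairs, best match first."""
    scored = [
        (cosine_similarity(query_vec, s["embedding"]), s["sessionId"])
        for s in sessions
        if s.get("embedding")  # skip sessions not yet embedded
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```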
Step 3: The API Layer
heidi-session-api is a single Lambda function with a multi-action dispatch pattern — the same approach used throughout the Dr. Ray stack. It handles: listing patients, fetching a patient’s full session history, semantic search, session detail retrieval (including S3 transcript fetch), analytics, and patient name correction.
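The dispatch pattern is a dictionary from action name to handler function. The stubs below are placeholders for the real handlers, and the exact response envelope is an assumption, but the shape is the point: one Lambda, one routing table, no API Gateway route sprawl.

```python
def list_patients(params):  # illustrative stub for the real handler
    return {"patients": []}


def semantic_search(params):  # illustrative stub
    return {"results": [], "query": params.get("query")}


ACTIONS = {
    "listPatients": list_patients,
    "search": semantic_search,
    # ...getSession, analytics, auditSessions, updatePatientName, etc.
}


def handler(event, context=None):
    """Single-Lambda dispatch: the request body names the action to run."""
    action = event.get("action")
    if action not in ACTIONS:
        return {"statusCode": 400, "error": f"unknown action: {action}"}
    return {"statusCode": 200, "body": ACTIONS[action](event.get("params", {}))}
```

Adding a capability means adding one function and one dictionary entry, which is why late additions like auditSessions were cheap to ship.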
The auditSessions action was a late addition that turned out to be important. Heidi doesn’t always capture patient names cleanly — abbreviated names, typos, and partial matches accumulate. The audit action surfaces sessions where the name looks suspect, so a human can verify and use updatePatientName to fix the record in place. Renaming updates every session for that patient in a single pass.
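The “looks suspect” check is necessarily heuristic. The rules below are illustrative guesses at what such a filter might flag, not the audit action's actual logic:

```python
import re


def looks_suspect(name: str) -> bool:
    """Flag patient names worth a human look. Heuristics are illustrative."""
    name = name.strip()
    if len(name) < 3:
        return True   # "J." or empty: almost certainly truncated
    if re.search(r"\d", name):
        return True   # digits leaked into the name field
    if len(name.split()) < 2:
        return True   # single token: likely an abbreviation, not a full name
    return False
```

The point of a filter like this is recall over precision: false positives just cost a human a glance, while a false negative leaves a bad name in the canonical record.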
Calendar reconciliation was also added: the API cross-references session dates against Google Calendar events (via the calendar wrapper running on the VPS) to match sessions to the actual appointment they belong to. The localDate field normalizes timestamps to America/Los_Angeles before comparison, which matters because Heidi’s timestamps come back in UTC while calendar events are stored in local time — a late-evening appointment would otherwise match the wrong day.
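The normalization itself is a few lines with the standard library’s zoneinfo (Python 3.9+). The `local_date` helper name is an assumption; the conversion is the part that matters:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # stdlib since Python 3.9


def local_date(utc_iso: str, tz: str = "America/Los_Angeles") -> str:
    """Normalize a UTC timestamp to the clinic's local calendar date."""
    dt = datetime.fromisoformat(utc_iso.replace("Z", "+00:00"))
    return dt.astimezone(ZoneInfo(tz)).date().isoformat()
```

Using a named zone rather than a fixed -08:00 offset means daylight saving is handled for free: the same code gives UTC-8 dates in January and UTC-7 dates in July.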
Step 4: The UI
Heidi Explorer is the React front-end that ties all of this together. It follows the same brand system as the rest of the Dr. Ray apps — cream backgrounds, terracotta accents, Fraunces headers.
The core interaction: search by patient name or use semantic search to find sessions by clinical concept. Select a patient to see their full timeline — every Heidi session, with dates, consult note summaries, and expandable transcripts. Click into a session to see the SessionReview panel: the full structured data, the raw transcript, and any calendar events that were matched.
The inline patient name editing feature came out of using the tool for the first time. The first real session we pulled up had a misspelled name. Navigating away to fix it in a separate workflow was friction we didn’t need. Now there’s a pencil icon next to the name, an inline edit field, and a call to updatePatientName — one interaction to fix the canonical name across the entire patient history.
Password authentication gates the whole app via the existing fax-auth Lambda, so there’s no new auth infrastructure to maintain.
What Else Shipped
The Clarius scraper also got a significant upgrade this week: the flat exam-ids.json file was replaced with a DynamoDB table (ClariusExams), adding proper deduplication, timestamp tracking, and resume capability after interruptions. The scraper now paginates up to three pages and stops when it hits exams that are already processed — so daily runs are fast instead of exhaustive.
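The stop-when-seen loop is worth spelling out, since it's what makes daily runs cheap. A sketch under assumptions (the `fetch_page` callable and exam shape are illustrative; in the real scraper, `seen_ids` would come from the ClariusExams table):

```python
def collect_new_exams(fetch_page, seen_ids, max_pages: int = 3) -> list:
    """Walk pages newest-first; stop at the first already-processed exam."""
    new_exams = []
    for page in range(1, max_pages + 1):
        exams = fetch_page(page)
        if not exams:
            break  # ran out of results before hitting the page cap
        for exam in exams:
            if exam["id"] in seen_ids:
                # Results are newest-first, so everything beyond this
                # point was processed on a previous run.
                return new_exams
            new_exams.append(exam)
    return new_exams
```

This only works because results come back newest-first: the first seen ID is a watermark, and everything past it can be skipped without checking.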
On the infrastructure side, 65 service repos in the grqg-dev GitHub org received their first pushed commits, after a batch fix for stale core.worktree paths that had left git non-functional in 37 of them following a directory move.
The Pattern
The Heidi platform is a case of compounding infrastructure. The scraper alone was useful but fragile. Adding DynamoDB made it durable. Adding embeddings made it searchable by meaning. Adding the API made it composable. Adding the UI made it usable by a non-engineer.
Each layer is independently useful. The embedding Lambda can be called from anywhere. The session API doesn’t care who calls it. The Explorer is just a client of the API. That loose coupling means the next capability — whatever it is — has a clean place to slot in.