My last post ended with a plan. I'd picked MedGemma as my note-generation model after evaluating multiple open-source medical models across clinical accuracy, entity recognition, and note formatting. I'd picked MedASR for voice transcription. Both models showed strong performance in evaluation, and fine-tuning MedGemma was the clear next step to push the model toward clinical-grade quality.

This post covers what happened next: building a synthetic data pipeline, training the model on cloud GPUs, and shipping an offline desktop app that handles transcription, encryption, and edge deployment on Apple Silicon.

Building the Training Data

Fine-tuning takes a pre-trained model and shows it examples of what "good" looks like for your specific task until it learns the pattern. I needed doctor-patient transcripts paired with high-quality clinical notes.

The transcripts came from MTS-Dialog, a public dataset of 492 medical conversations covering 19 clinical categories. I selected 200 conversations, ensuring balanced representation across all specialties so the model wouldn't skew toward common visits like wellness checks while missing rarer ones like oncology or hematology.
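Balanced selection across categories can be done with a simple stratified pick. This is an illustrative sketch, not the actual selection script; the `category` field name and the `per_category` knob are assumptions.

```python
import random
from collections import defaultdict

def balanced_sample(conversations, per_category, seed=0):
    """Pick an equal number of transcripts from each clinical category."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for conv in conversations:
        by_cat[conv["category"]].append(conv)
    picked = []
    for cat, items in sorted(by_cat.items()):
        rng.shuffle(items)
        picked.extend(items[:per_category])
    return picked
```

With 19 categories, picking roughly 10-11 per category lands near the 200-conversation target.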

For the notes, I built a synthetic data generation pipeline using knowledge distillation, a technique where you use a large, capable model as a "teacher" to generate training data for a smaller "student" model. My teacher was Gemini 2.5 Pro. I fed it each of the 200 transcripts along with each of our 7 note templates (SOAP, H&P, Progress Note, Discharge Summary, Procedure Note, Consultation Note, DAP), generating 1,400 clinical notes total.

One discovery proved important: the format of instructions in the training data must exactly match the format the live app sends. An early experiment used template names in training but full template text in production, and performance dropped because the model was seeing input patterns it wasn't trained on.

The final dataset: 1,190 training examples and 204 validation examples. I split them by conversation, not by individual note, to prevent data leakage. If the model sees a SOAP note for conversation #42 in training and then gets tested on a different note type for the same conversation, it has an unfair advantage. Splitting by conversation means every transcript in validation is completely unseen during training.
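The grouped split described above can be sketched in a few lines. This is a minimal illustration, assuming each example carries a `conversation_id` field; the real pipeline's field names may differ.

```python
import random

def split_by_conversation(examples, val_frac=0.15, seed=0):
    """Hold out whole conversations so no transcript leaks across splits."""
    conv_ids = sorted({ex["conversation_id"] for ex in examples})
    rng = random.Random(seed)
    rng.shuffle(conv_ids)
    n_val = max(1, int(len(conv_ids) * val_frac))
    val_ids = set(conv_ids[:n_val])
    train = [ex for ex in examples if ex["conversation_id"] not in val_ids]
    val = [ex for ex in examples if ex["conversation_id"] in val_ids]
    return train, val
```

Because all 7 notes for a conversation travel together, a validation transcript is guaranteed to be unseen during training.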

Fine-Tuning on Cloud GPUs

Training an AI model is much more memory-intensive than simply running one. When you run a model (called inference), you only need the model itself in memory. When you train it, you also need to track how wrong each prediction was and which direction to adjust every parameter. That roughly triples the memory requirement.
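The "roughly triples" claim can be seen with back-of-envelope arithmetic. The byte counts below are simplifying assumptions (16-bit weights and gradients, optimizer state about the size of the weights); real numbers vary with precision and optimizer choice.

```python
# Rough memory estimate for full fine-tuning of a 4B-parameter model.
params = 4_000_000_000
bytes_per_param = 2  # bf16

weights = params * bytes_per_param  # all that inference needs
gradients = weights                 # one "how wrong" value per parameter
optimizer_state = weights           # e.g. momentum terms, roughly

inference_gb = weights / 1e9
training_gb = (weights + gradients + optimizer_state) / 1e9
```

Under these assumptions inference needs about 8GB while training needs about 24GB, before counting activations.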

I trained on Modal, a cloud GPU platform, using an NVIDIA A100 with 80GB of memory. I originally planned on 40GB, but MedGemma's unusually large vocabulary (about 262,000 tokens) makes its embedding and output layers so big that the training state wouldn't fit.


The training used LoRA (Low-Rank Adaptation), a technique that makes fine-tuning practical. Instead of updating all 4 billion of the model's parameters (which would require enormous memory and risk breaking what the model already knows), LoRA freezes the entire base model and attaches small, trainable "adapter" layers alongside it. Think of it like putting a thin lens in front of a camera rather than rebuilding the entire camera. I trained just 38 million parameters, roughly 1% of the model. Each training run took 15-30 minutes, so I could experiment freely.
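A LoRA setup like this is typically expressed as a `LoraConfig` from Hugging Face's peft library. This is a hedged sketch: the rank, alpha, and target module names below are illustrative assumptions, not the exact values used in these runs.

```python
from peft import LoraConfig

# Pure-LoRA configuration: the base model stays frozen and only small
# low-rank adapters attached to the attention projections are trained.
lora_config = LoraConfig(
    r=16,                # adapter rank (assumed; the post doesn't state it)
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```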

Run A: Full Output Layer Training

Following Google's MedGemma fine-tuning notebook, I enabled a setting called modules_to_save that also trains the model's embedding layer and output head. These are the layers responsible for converting words into numbers (input) and numbers back into words (output). With this enabled, the trainable parameter count jumped from 38 million (1%) to 1.38 billion (24%).
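The Run A configuration differs from pure LoRA only in the `modules_to_save` line. A sketch, assuming the Gemma-style layer names `embed_tokens` and `lm_head` (other values as before, illustrative):

```python
from peft import LoraConfig

run_a_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
    # Also fully train the embedding layer and output head, so the model
    # can learn to emit brand-new output patterns like {{drug:metformin}}.
    modules_to_save=["embed_tokens", "lm_head"],
)
```

Because the vocabulary is ~262,000 tokens, those two layers alone account for most of the jump from 38 million to 1.38 billion trainable parameters.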

The model peaked at step 50 and then started memorizing the training data instead of learning general patterns. This is called overfitting: the model had so much capacity relative to the dataset that it was easier to memorize every example than to learn the underlying rules.

Step   Epoch   Training Loss   Validation Loss   Status
 25    0.34    -               0.578             Improving
 50    0.67    0.444           0.479             Best checkpoint
 75    1.00    0.346           0.512             Getting worse
100    1.34    0.206           0.666             Memorizing
150    2.00    0.083           0.891             Fully memorized

In this table, training loss measures how well the model performs on examples it has already seen (lower is better).

Validation loss measures how well it performs on held-out examples it hasn't seen. The healthy pattern is both losses decreasing together, with training loss slightly lower (the model will always perform a bit better on data it's trained on, which is expected).

At step 50, they're close: 0.444 vs 0.479. That's the sweet spot. After step 50, the gap blows open: training loss plummets to 0.083 while validation loss climbs to 0.891. The model is getting better at reciting its homework but worse at answering new questions. That divergence is how you spot overfitting in practice.
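Checkpoint selection from a loss table like this reduces to "take the step with the lowest validation loss." A minimal sketch, using the numbers from the table above:

```python
def best_checkpoint(history):
    """history: list of (step, train_loss, val_loss). Returns the step
    with the lowest validation loss, i.e. before overfitting sets in."""
    return min(history, key=lambda row: row[2])[0]

history = [
    (25, None, 0.578),
    (50, 0.444, 0.479),
    (75, 0.346, 0.512),
    (100, 0.206, 0.666),
    (150, 0.083, 0.891),
]
```

In practice an early-stopping callback does the same thing automatically, halting training once validation loss fails to improve for a few evaluations.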

Run A showed early signs of entity tagging, wrapping medications and conditions in structured markup like {{drug:metformin}} so the app can highlight and link them. It produced 9 entity tags on test cases where the base model produced 0. Not yet reliable, but a promising signal.

Run B: Pure LoRA

For Run B, I removed modules_to_save entirely. Pure LoRA, 38.5 million trainable parameters. The tradeoff was immediate: overfitting was delayed and the model generalized better (validation loss improved from 0.479 to 0.453). But entity tagging nearly disappeared, dropping to just 2 tags on the same test cases.

Metric                   Base Model   Run A (24% params)   Run B (0.89% params)
Entity tags (HTN case)   0            3                    2
Entity tags (DM2 case)   0            6                    0
Total entities           0            9                    2
Validation loss          -            0.479                0.453

Run A learned entity tagging but memorized the data. Run B generalized better but couldn't learn new output patterns. I shipped Run A's best checkpoint (step 50, before overfitting set in) because entity tagging was the higher-priority capability, even if not yet fully reliable.

What the Comparison Revealed

Running both configurations side by side exposed a genuine tradeoff. Entity tagging requires the model to produce output it has never seen before, like {{drug:metformin}}. The only way to teach it new output patterns is to train the layers responsible for generating words. But training those layers means training 24% of the model instead of 1%, and with only 1,190 examples, that much capacity leads to memorization instead of learning.

I shipped Run A's best checkpoint because entity tagging, even partially working, was more valuable than slightly better generalization without it. Making entity tagging fully reliable remains a key goal for the next fine-tuning round, where more data should let the model learn new output patterns without memorizing. The insight from these two runs would have been invisible without running both.

Beyond Fine-Tuning: Shipping the App

While fine-tuning was the core ML challenge, shipping an offline medical app required solving several other problems.

Transcription. I chose MLX, Apple's machine learning framework (50MB), over PyTorch (2GB) to run speech recognition natively on the Mac's GPU. Two production bugs shaped the design: the GPU crashes when two AI models try to use it at the same time (fixed by queuing them so only one runs at a time), and long recordings exceed memory limits (fixed by transcribing in 20-second chunks instead of all at once).
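Both fixes are structural rather than model-specific. A hedged sketch of the pattern (the `run_asr` callable stands in for the real MLX transcription call, which is not shown in the post):

```python
import threading

gpu_lock = threading.Lock()  # only one model may touch the GPU at a time

CHUNK_SECONDS = 20

def chunk_audio(samples, sample_rate):
    """Split a long recording into fixed-length chunks to bound memory."""
    step = CHUNK_SECONDS * sample_rate
    return [samples[i:i + step] for i in range(0, len(samples), step)]

def transcribe(samples, sample_rate, run_asr):
    parts = []
    for chunk in chunk_audio(samples, sample_rate):
        with gpu_lock:  # serialize GPU access across models
            parts.append(run_asr(chunk))
    return " ".join(parts)
```

Any other model in the app (note generation, for example) acquires the same lock before running, so the two can never collide on the GPU.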

Encryption. Medical data must be encrypted on disk. I used envelope encryption: instead of encrypting the database directly with the user's password, the password protects a random key, and that random key encrypts the database. This means changing your password is instant because only the small wrapper needs updating, not the entire database. The database is encrypted with SQLCipher (AES-256). The password goes through Argon2id, a hashing algorithm designed to be slow on purpose so that guessing passwords by trying millions of combinations is impractical. When the app locks, keys are wiped from memory.
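The key-wrapping structure can be sketched in a few lines. This is an illustration of the envelope pattern only, not the app's code: it uses the `cryptography` library's AES-GCM for wrapping, and scrypt as a stdlib stand-in for Argon2id (both are deliberately slow key-derivation functions).

```python
import os
import hashlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def derive_kek(password: bytes, salt: bytes) -> bytes:
    # Stand-in KDF: the app uses Argon2id, which is not in Python's stdlib.
    return hashlib.scrypt(password, salt=salt, n=2**14, r=8, p=1, dklen=32)

def wrap_dek(password: bytes, dek: bytes, salt: bytes) -> bytes:
    """Encrypt the random database key (DEK) under the password-derived key."""
    nonce = os.urandom(12)
    return nonce + AESGCM(derive_kek(password, salt)).encrypt(nonce, dek, None)

def unwrap_dek(password: bytes, blob: bytes, salt: bytes) -> bytes:
    return AESGCM(derive_kek(password, salt)).decrypt(blob[:12], blob[12:], None)
```

Changing the password just re-wraps the 32-byte DEK under a new key; the database, encrypted under the DEK itself, never has to be touched.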

Speaker diarization. I tried to automatically identify who is speaking, doctor or patient. It wasn't reliable enough, and wrong labels would make notes worse. I removed it entirely. Shipping nothing beats shipping something unreliable in a medical context.

Lesson

Run experiments, not plans. The fine-tuning plan said to use modules_to_save. Running both configurations showed it overfits on small data but is necessary for entity tagging. Without both runs, I wouldn't have known which tradeoff to make. Plans are hypotheses. Experiments are evidence.

The infrastructure decisions determined whether the app could ship: MLX over PyTorch for on-device transcription, ONNX over PyTorch for voice activity detection, SQLCipher over plain SQLite, Modal over local training. These are deployment-critical engineering tradeoffs that shaped the entire architecture.

Correct yourself publicly. My first eval had rubric flaws. My first training run overfitted. My diarization feature wasn't reliable enough to ship. In each case, the correction was more valuable than the original attempt. The pattern of "try, measure, correct" is the actual skill.

Ship before it's perfect. The beta runs completely offline, encrypts all patient data, generates 7 types of clinical notes (plus custom templates), transcribes medical speech with domain-specific accuracy, and fits in a 4GB download. There's more to build, but the foundation is solid.

What's Next

The next phase focuses on:

  • Fine-tuning Run C with 2.5x more data to resist memorization and make entity tagging reliable
  • Speaker diarization revisited with real user feedback
  • Lazy model loading to reduce idle memory
  • An update mechanism for incremental model updates
  • Apple notarization for seamless distribution

KasaMD is still in beta. It sits at the intersection of the hardest problems in Applied AI: making models useful under real-world constraints, deploying to edge hardware, and building for a domain where reliability is non-negotiable.