A real-time system that converts live doctor–patient conversations into structured clinical notes, smart suggestions, and print-ready Hindi PDF reports.
Capabilities
The system covers the full documentation workflow, from live audio capture to structured notes.
Low-latency Hindi speech-to-text powered by Vosk. Runs fully offline, streamed over WebSocket for instant display as the doctor speaks.
Gemini incrementally extracts symptoms, medications, and diagnoses in real time, so the structured fields populate as the consultation progresses.
ChromaDB vector search over historical consultations surfaces relevant diagnoses, missed tests, and common medications based on similar past cases.
Auto-generates a print-ready OPD slip with proper Devanagari rendering using NotoSansDevanagari fonts via ReportLab. One click to export.
Every session stores raw audio, raw transcripts, and structured JSON in date-partitioned folders, so any consultation can be reviewed or replayed later.
Speech recognition runs entirely on-device with Vosk, so no audio ever leaves the local machine. Only transcript text is sent to Gemini for structuring; recordings stay within the clinic's infrastructure.
Workflow
A five-step pipeline from raw audio to a structured, PDF-ready OPD report with AI-powered insights.
The frontend uses the Web Audio API to capture microphone input as 16-bit PCM at 16 kHz, a format suited to speech recognition models.
WebSocket Stream
FastAPI receives raw PCM frames and feeds them to the Vosk ASR engine (vosk-model-hi-0.22), which emits both partial and final results for low-latency display.
Offline · On-Device
Raw transcript fragments are sent to Google Gemini with a strict JSON schema at temperature 0.0, incrementally building the clinical state: symptoms, medications, BP, temperature, diagnosis.
Deterministic · Schema-Locked
At session end, the finalized consultation is embedded into ChromaDB. A similarity search returns the top-N historical cases, suggesting likely diagnoses and commonly ordered tests.
ChromaDB · Semantic Search
ReportLab renders a formatted, Hindi-compatible clinical PDF using NotoSansDevanagari. The full session (audio, transcript, JSON) is archived for auditability.
ReportLab · Devanagari
System Pipeline
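The third step of the pipeline builds the clinical state incrementally. The following is a minimal sketch of how successive partial extractions might be folded into one running state. The field names follow the list in that step, but the merge rules (list fields accumulate, scalar fields take the latest non-null value) and the helper names `empty_state` and `merge_clinical_state` are assumptions for illustration, not the project's actual code.

```python
from copy import deepcopy

# Hypothetical field split: list fields accumulate, scalar fields keep the latest value.
LIST_FIELDS = ("symptoms", "medications")
SCALAR_FIELDS = ("bp", "temperature", "diagnosis")


def empty_state() -> dict:
    """Return a blank clinical state with every expected field present."""
    state = {name: [] for name in LIST_FIELDS}
    state.update({name: None for name in SCALAR_FIELDS})
    return state


def merge_clinical_state(state: dict, update: dict) -> dict:
    """Fold one partial extraction into the running state without mutating it."""
    merged = deepcopy(state)
    for name in LIST_FIELDS:
        for item in update.get(name) or []:
            if item not in merged[name]:  # keep first-seen order, skip duplicates
                merged[name].append(item)
    for name in SCALAR_FIELDS:
        value = update.get(name)
        if value is not None:  # a missing or null field never erases earlier data
            merged[name] = value
    return merged


state = empty_state()
state = merge_clinical_state(state, {"symptoms": ["fever"], "temperature": "101 F"})
state = merge_clinical_state(state, {"symptoms": ["fever", "cough"], "bp": "120/80"})
print(state["symptoms"])  # ['fever', 'cough']
```

Returning a fresh copy on each merge keeps earlier snapshots intact, which fits a design where every intermediate state may be pushed to the frontend or archived.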
Architecture
Each module owns a single responsibility, making the system easy to extend, swap, or harden for production.
Dumb renderer — captures mic audio and displays live JSON updates pushed from the backend. All logic lives server-side.
WebSocket layer handles bi-directional streaming, session management, and triggers the suggestion engine on finalization.
Vosk adapter streams PCM chunks and emits partial + final results. Fully offline, privacy-preserving, no external API calls.
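As a rough illustration of the partial and final results mentioned above: Vosk returns JSON strings, with in-progress text under the `partial` key and finished utterances under the `text` key. The sketch below converts such a string into a message for the frontend. The `to_client_message` helper and the `{"type", "text"}` message shape are hypothetical; the recognizer calls appear only in a comment so that the example runs without a model installed.

```python
import json
from typing import Optional


def to_client_message(raw_json: str) -> Optional[dict]:
    """Turn one Vosk JSON result string into a frontend message.

    The {"type", "text"} shape is an assumed wire format, not the
    project's actual protocol.
    """
    payload = json.loads(raw_json)
    if payload.get("text"):
        return {"type": "final", "text": payload["text"]}
    if payload.get("partial"):
        return {"type": "partial", "text": payload["partial"]}
    return None  # silence or empty result: nothing worth sending


# In a real adapter the strings would come from a vosk.KaldiRecognizer:
#   if recognizer.AcceptWaveform(chunk):
#       raw = recognizer.Result()          # finished utterance
#   else:
#       raw = recognizer.PartialResult()   # in-progress hypothesis
print(to_client_message('{"partial": "mujhe bukhar"}'))
print(to_client_message('{"text": "mujhe bukhar hai"}'))
print(to_client_message('{"partial": ""}'))
```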
Incremental clinical structuring via Gemini Flash/Pro. Temperature 0.0 and a strict JSON schema keep the output consistent and machine-parseable.
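One way to enforce a strict schema on the application side is to validate every model reply before it touches the clinical state. The sketch below uses only the standard library. The `CLINICAL_SCHEMA` table and the `parse_structured_reply` helper are assumptions for illustration, and this is not the Gemini SDK's own schema mechanism.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s) after json.loads.
CLINICAL_SCHEMA = {
    "symptoms": list,
    "medications": list,
    "bp": (str, type(None)),
    "temperature": (str, type(None)),
    "diagnosis": (str, type(None)),
}


def parse_structured_reply(reply_text: str) -> dict:
    """Parse a model reply and reject anything that strays from the schema."""
    data = json.loads(reply_text)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    unknown = set(data) - set(CLINICAL_SCHEMA)
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    for name, expected in CLINICAL_SCHEMA.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], expected):
            raise ValueError(f"wrong type for field: {name}")
    return data


reply = (
    '{"symptoms": ["fever"], "medications": [], "bp": null, '
    '"temperature": "101 F", "diagnosis": null}'
)
print(parse_structured_reply(reply)["symptoms"])  # ['fever']
```

Rejecting a malformed reply outright, rather than repairing it, keeps a bad extraction from silently corrupting the note; the caller can retry or skip that fragment.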
ChromaDB stores embeddings of past consultations. Similarity search returns the top-N matches to surface relevant clinical hints.
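To show what a top-N similarity search does conceptually, here is a pure-Python stand-in that ranks stored embeddings by cosine similarity. It is not ChromaDB's API, and the case IDs and two-dimensional vectors are toy values chosen for illustration; real embeddings have hundreds of dimensions and ChromaDB handles indexing itself.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n_similar(query: list, cases: dict, n: int = 3) -> list:
    """Return the IDs of the n stored cases most similar to the query vector."""
    scored = [(cosine_similarity(query, vec), case_id) for case_id, vec in cases.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return [case_id for _, case_id in scored[:n]]


# Toy embeddings for three hypothetical past consultations.
past_cases = {
    "case-a": [1.0, 0.0],
    "case-b": [0.9, 0.1],
    "case-c": [0.0, 1.0],
}
print(top_n_similar([1.0, 0.0], past_cases, n=2))  # ['case-a', 'case-b']
```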
Date-partitioned session files archive raw audio, transcripts, and structured JSON. ReportLab renders Devanagari-compatible PDFs.
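A date-partitioned layout could look like the sketch below, which derives the file paths for one session from its date and ID. The `sessions/YYYY-MM-DD/<session_id>/` structure, the file names, and the `session_paths` helper are all hypothetical; the project's actual layout may differ.

```python
from datetime import date
from pathlib import Path


def session_paths(root: Path, session_id: str, day: date) -> dict:
    """Build the archive paths for one session under a per-day folder.

    Nothing is created on disk; the caller decides when to write.
    """
    folder = root / day.isoformat() / session_id
    return {
        "audio": folder / "audio.wav",
        "transcript": folder / "transcript.txt",
        "structured": folder / "structured.json",
        "report": folder / "report.pdf",
    }


paths = session_paths(Path("sessions"), "abc123", date(2024, 1, 15))
print(paths["report"].as_posix())  # sessions/2024-01-15/abc123/report.pdf
```

Partitioning by ISO date keeps folders sorted chronologically in a plain directory listing, which helps when auditing or purging old sessions.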
Tech Stack
Quick Start
Project Status
This project demonstrates a complete pipeline from audio ingestion to vector-backed clinical insights. The architecture is modular and suitable for further hardening, security auditing, and integration into real hospital workflows. It is a documentation assistant only — not a diagnostic tool.
Open Source
Dive into the codebase, fork it, and adapt it for your own clinical documentation needs.