# Space Talkers

A diarization viewer for Whisper transcription output

![Demo Screenshot](screenshots/demo.jpg)

## Features

- **Speaker visualization**: Speakers displayed as animated orbs in a starfield
- **Real-time transcription**: Live transcript panel following audio playback
- **Waveform navigation**: Click/drag on the waveform to seek through the audio
- **Keyboard controls**: Space to play/pause, Arrow keys to seek

## Keyboard Shortcuts

| Key | Action |
|-----|--------|
| `Space` | Play/Pause |
| `←` / `A` | Seek back 10 seconds |
| `→` / `D` | Seek forward 10 seconds |
| `Shift` + `←`/`→` | Seek 60 seconds |

## Quick Start

1. Place your audio file in `input/`
2. Place your Whisper transcript JSON in `outputs/float32/`
3. Generate the waveform data (see below)
4. Start a local server and open in browser:

```bash
npx serve -p 5000
```

Then navigate to http://localhost:5000

## Waveform Generation

For optimal performance with long audio files (1-3 hours), waveform data is pre-generated as JSON rather than decoded in the browser.

### Prerequisites

- [Node.js](https://nodejs.org/) (v14+)
- [FFmpeg](https://ffmpeg.org/) installed and available in PATH

### Generate Waveform Data

```bash
node scripts/generate-waveform.js <input-audio> [output-json] [columns]
```

**Arguments:**

- `input-audio` - Path to the audio file (opus, mp3, wav, etc.)
- `output-json` - Output path for waveform JSON (default: `<input-audio>.waveform.json`)
- `columns` - Number of waveform columns/peaks (default: 1000)

**Example:**

```bash
# Generate waveform for a single file
node scripts/generate-waveform.js input/amuta_2026-01-12_1.opus outputs/float32/amuta_2026-01-12_1.waveform.json

# Or let it auto-generate the output path
node scripts/generate-waveform.js input/meeting.opus
```

## Configuration

Edit the paths at the top of `app.js`:

```javascript
const transcriptPath = "outputs/float32/amuta_2026-01-12_1.json";
const waveformPath = "outputs/float32/amuta_2026-01-12_1.waveform.json";
```

### Speaker Labels

Map speaker IDs to display names in `app.js`. Mapping several IDs to the same name merges them into one speaker:

```javascript
const SPEAKER_LABELS = {
  "SPEAKER_01": "Maya",
  "SPEAKER_02": "Maya", // merged with SPEAKER_01
};
```

## File Structure

```
amuta-meetings/
├── index.html                 # Main HTML page
├── app.js                     # Application logic
├── styles.css                 # Styles
├── scripts/
│   └── generate-waveform.js   # Waveform generator script
├── input/                     # Audio files (gitignored)
├── outputs/
│   └── float32/               # Transcript and waveform JSON
```

## Transcription with WhisperX (GPU or CPU)

For transcribing audio with speaker diarization, we used [WhisperX](https://github.com/m-bain/whisperX) on a rented GPU service (verda); alternatively, book one of tami's P40s, or run whisper.cpp on CPU.
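If running WhisperX on your own machine rather than a rented box, it is typically installed with pip (a sketch; check the WhisperX repository for current install instructions and CUDA/PyTorch requirements):

```shell
pip install whisperx
```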
### WhisperX CLI Command

The command saves output as JSON and converts it to SRT for quick anima runs. It was adapted from [Nic's notes](https://notes.nicolasdeville.com/python/library-whisperx/) to add diarization (see below for obtaining a HuggingFace token).

### Key Arguments

| Argument | Description |
|----------|-------------|
| `--model` | Whisper model: `large-v3` (best quality), `turbo` (fastest), `large-v2` |
| `--language` | Source language code (e.g., `en` for English; ISO 639-1 codes) |
| `--diarize` | Enable speaker diarization (requires HuggingFace token) |
| `--compute_type` | `float16` (GPU), `int8` (CPU/low memory), `float32` (highest accuracy) |
| `--batch_size` | Higher = faster but uses more VRAM (16-32 for 24GB GPUs) |
| `--hf_token` | HuggingFace token for PyAnnote diarization models |

### Performance Benchmarks (from [Nic's notes](https://notes.nicolasdeville.com/python/library-whisperx/))

| Configuration | Speed Ratio |
|---------------|-------------|
| `turbo`, `int8`, `batch_size=16` | ~2.3x realtime |
| `large-v3`, `int8`, `batch_size=16` | ~1.2x realtime |
| `large-v2`, `float16`, `batch_size=32` | ~1.5x realtime (GPU) |

> **Tip**: For Hebrew transcription, `large-v3` typically provides better accuracy than `turbo`.

### Getting a HuggingFace Token

1. Create an account at [huggingface.co](https://huggingface.co)
2. Accept the model terms at:
   - [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)
   - [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0)
3. Generate a token at [Settings → Access Tokens](https://huggingface.co/settings/tokens)

### Output Format

WhisperX outputs JSON with word-level timestamps and speaker labels:

```json
{
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "שלום לכולם",
      "speaker": "SPEAKER_01",
      "words": [
        { "word": "שלום", "start": 0.0, "end": 0.8 },
        { "word": "לכולם", "start": 0.9, "end": 2.5 }
      ]
    }
  ]
}
```
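Putting the key arguments from the table above together, a representative invocation might look like the following (the input path, model choice, language, and `$HF_TOKEN` are placeholder assumptions, not our exact command; `--output_format all` writes JSON, SRT, and the other supported formats):

```shell
whisperx input/meeting.opus \
  --model large-v3 \
  --language he \
  --diarize \
  --compute_type float16 \
  --batch_size 16 \
  --hf_token "$HF_TOKEN" \
  --output_dir outputs/float32 \
  --output_format all
```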
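For downstream processing, the `segments` array can be consumed directly. A minimal Node.js sketch (the `textBySpeaker` helper is illustrative, not part of `app.js`) that groups segment text by speaker:

```javascript
// Group WhisperX segment texts by speaker label.
// Segments without a speaker field are bucketed under "UNKNOWN".
function textBySpeaker(data) {
  const out = {};
  for (const seg of data.segments) {
    const speaker = seg.speaker || "UNKNOWN";
    (out[speaker] = out[speaker] || []).push(seg.text);
  }
  return out;
}

// Example input mirroring the JSON shape shown above.
const data = {
  segments: [
    { start: 0.0, end: 2.5, text: "שלום לכולם", speaker: "SPEAKER_01" },
    { start: 2.6, end: 4.0, text: "hello", speaker: "SPEAKER_02" },
  ],
};

console.log(textBySpeaker(data));
```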