# Space Talkers
A diarization viewer for Whisper transcription output
## Features
- Speaker visualization: Speakers displayed as animated orbs in a starfield
- Real-time transcription: Live transcript panel following audio playback
- Waveform navigation: Click/drag on the waveform to seek through the audio
- Keyboard controls: Space to play/pause, Arrow keys to seek
## Keyboard Shortcuts
| Key | Action |
|---|---|
| Space | Play/Pause |
| ← / A | Seek back 10 seconds |
| → / D | Seek forward 10 seconds |
| Shift + ←/→ | Seek back/forward 60 seconds |
## Quick Start

- Place your audio file in `input/`
- Place your Whisper transcript JSON in `outputs/float32/`
- Generate the waveform data (see below)
- Start a local server and open it in a browser:

```bash
npx serve -p 5000
```

Then navigate to http://localhost:5000
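Any static file server works; a sketch using Python's built-in server instead of `npx serve` (serves the current directory on the same port):

```bash
# Alternative: Python's built-in static server
python3 -m http.server 5000
```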
## Waveform Generation
For optimal performance with long audio files (1-3 hours), waveform data is pre-generated as JSON rather than decoded in the browser.
### Prerequisites

- Node.js (the generator script and the `npx serve` dev server both run on Node)

### Generate Waveform Data
```bash
node scripts/generate-waveform.js <input-audio> [output-json] [columns]
```
Arguments:

- `input-audio` - Path to the audio file (opus, mp3, wav, etc.)
- `output-json` - Output path for the waveform JSON (default: `<input>.waveform.json`)
- `columns` - Number of waveform columns/peaks (default: 1000)
Example:

```bash
# Generate waveform for a single file
node scripts/generate-waveform.js input/amuta_2026-01-12_1.opus outputs/float32/amuta_2026-01-12_1.waveform.json

# Or let it auto-generate the output path
node scripts/generate-waveform.js input/meeting.opus
```
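To process a whole folder, the script can be run in a loop; a minimal sketch assuming `.opus` inputs and the `outputs/float32/` naming used above:

```bash
# Generate waveform JSON for every .opus file in input/
for f in input/*.opus; do
  name=$(basename "$f" .opus)
  node scripts/generate-waveform.js "$f" "outputs/float32/${name}.waveform.json"
done
```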
## Configuration

Edit the paths at the top of `app.js`:
```js
const transcriptPath = "outputs/float32/amuta_2026-01-12_1.json";
const waveformPath = "outputs/float32/amuta_2026-01-12_1.waveform.json";
```
### Speaker Labels

Map speaker IDs to display names in `app.js`; mapping several IDs to the same name merges those speakers:

```js
const SPEAKER_LABELS = {
  "SPEAKER_01": "Maya",
  // Merging: point additional diarization IDs at the same display name
  "SPEAKER_02": "Maya",
  "SPEAKER_23": "Maya",
  "SPEAKER_4": "Maya",
};
```
## File Structure

```
amuta-meetings/
├── index.html                 # Main HTML page
├── app.js                     # Application logic
├── styles.css                 # Styles
├── scripts/
│   └── generate-waveform.js   # Waveform generator script
├── input/                     # Audio files (gitignored)
└── outputs/
    └── float32/               # Transcript and waveform JSON
```
## Transcription with WhisperX (GPU or CPU)

For transcribing audio with speaker diarization, we used WhisperX on a rented GPU service (verda); alternatives are booking one of tami's P40s or running whisper.cpp on CPU.
### WhisperX CLI Command

The command that saves output as JSON and converts it to SRT for quick anima runs is based on https://notes.nicolasdeville.com/python/library-whisperx/; we adapted it to add diarization (see the HuggingFace token section below).
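A sketch of the adapted invocation, using the file names from this README and a token exported as `HF_TOKEN` (model, language, and paths are illustrative; the flags are standard WhisperX options):

```bash
# Transcribe with diarization; --output_format all writes JSON, SRT, and the other formats
whisperx input/amuta_2026-01-12_1.opus \
  --model large-v3 \
  --language he \
  --diarize \
  --compute_type float16 \
  --batch_size 16 \
  --hf_token "$HF_TOKEN" \
  --output_format all \
  --output_dir outputs/float32
```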
### Key Arguments

| Argument | Description |
|---|---|
| `--model` | Whisper model: `large-v3` (best quality), `turbo` (fastest), `large-v2` |
| `--language` | Source language code (ISO 639-1, e.g., `en` for English) |
| `--diarize` | Enable speaker diarization (requires a HuggingFace token) |
| `--compute_type` | `float16` (GPU), `int8` (CPU/low memory), `float32` (highest accuracy) |
| `--batch_size` | Higher = faster but uses more VRAM (16-32 for 24 GB GPUs) |
| `--hf_token` | HuggingFace token for the PyAnnote diarization models |
### Performance Benchmarks (from Nic's notes)

| Configuration | Speed Ratio |
|---|---|
| `turbo`, `int8`, `batch_size=16` | ~2.3x realtime |
| `large-v3`, `int8`, `batch_size=16` | ~1.2x realtime |
| `large-v2`, `float16`, `batch_size=32` | ~1.5x realtime (GPU) |
Tip: For Hebrew transcription, `large-v3` typically provides better accuracy than `turbo`.
### Getting a HuggingFace Token

- Create an account at huggingface.co
- Accept the model terms for the PyAnnote models used by WhisperX diarization (on their HuggingFace model pages)
- Generate a token under Settings → Access Tokens
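One way to pass the token is via an environment variable (the variable name here is just a convention):

```bash
# Store the token once, then reference it in the WhisperX command
export HF_TOKEN=hf_xxxxxxxxxxxx   # placeholder; paste your own token
whisperx input/meeting.opus --diarize --hf_token "$HF_TOKEN"
```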
### Output Format
WhisperX outputs JSON with word-level timestamps and speaker labels:
```json
{
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "שלום לכולם",
      "speaker": "SPEAKER_01",
      "words": [
        { "word": "שלום", "start": 0.0, "end": 0.8 },
        { "word": "לכולם", "start": 0.9, "end": 2.5 }
      ]
    }
  ]
}
```
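To sanity-check which speaker IDs appear in a transcript before filling in `SPEAKER_LABELS`, `jq` (if installed) can list them; the path below is the example transcript from the Configuration section:

```bash
# List the distinct speaker IDs found in the transcript JSON
jq -r '[.segments[].speaker] | unique | .[]' outputs/float32/amuta_2026-01-12_1.json
```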
