# Space Talkers

A diarization viewer for Whisper transcription output.
## Features
- Speaker visualization: Speakers displayed as animated orbs in a starfield
- Real-time transcription: Live transcript panel following audio playback
- Waveform navigation: Click/drag on the waveform to seek through the audio
- Keyboard controls: Space to play/pause, Arrow keys to seek
## Keyboard Shortcuts
| Key | Action |
|---|---|
| Space | Play/Pause |
| ← / A | Seek back 10 seconds |
| → / D | Seek forward 10 seconds |
| Shift + ←/→ | Seek 60 seconds |
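These bindings can be wired with a single keydown listener. The sketch below is hypothetical (not the actual `app.js` handler) and assumes `audio` is the page's `<audio>` element:

```js
// Hypothetical key bindings (not the actual app.js); assumes `audio` is the
// page's <audio> element.
document.addEventListener("keydown", (e) => {
  const step = e.shiftKey ? 60 : 10; // Shift extends the seek step to 60 s
  if (e.code === "Space") {
    e.preventDefault(); // keep Space from scrolling the page
    audio.paused ? audio.play() : audio.pause();
  } else if (e.code === "ArrowLeft" || e.code === "KeyA") {
    audio.currentTime = Math.max(0, audio.currentTime - step);
  } else if (e.code === "ArrowRight" || e.code === "KeyD") {
    audio.currentTime = Math.min(audio.duration, audio.currentTime + step);
  }
});
```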
## Quick Start
- Place your audio file in `input/` (an example is pre-configured)
- Place your Whisper transcript JSON in `outputs/float32/` (an example is pre-configured)
- Generate the waveform data (see below; an example is pre-configured)
- Start a local server and open it in a browser:

```bash
npx serve -p 5000
```

Then navigate to http://localhost:5000
## Waveform Generation
For optimal performance with long audio files (1-3 hours), waveform data is pre-generated as JSON rather than decoded in the browser.
### Prerequisites

- Node.js (the generator script runs with `node`)
### Generate Waveform Data
```bash
node scripts/generate-waveform.js <input-audio> [output-json] [columns]
```
Arguments:
- `input-audio` - Path to the audio file (opus, mp3, wav, etc.)
- `output-json` - Output path for the waveform JSON (default: `<input>.waveform.json`)
- `columns` - Number of waveform columns/peaks (default: 1000)
Example:
```bash
# Generate waveform for a single file
node scripts/generate-waveform.js input/amuta_2026-01-12_1.opus outputs/float32/amuta_2026-01-12_1.waveform.json

# Or let it auto-generate the output path
node scripts/generate-waveform.js input/meeting.opus
```
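At its core, such a generator reduces the decoded samples to a fixed number of peak values. A minimal sketch of the idea, assuming the audio has already been decoded to a `Float32Array` (this is not the actual `generate-waveform.js`):

```js
// Reduce raw PCM samples to `columns` peaks, keeping the maximum absolute
// amplitude per block. The resulting array is what gets serialized to JSON.
function computePeaks(samples, columns = 1000) {
  const blockSize = Math.ceil(samples.length / columns);
  const peaks = new Array(columns).fill(0);
  for (let i = 0; i < samples.length; i++) {
    const col = Math.floor(i / blockSize);
    const v = Math.abs(samples[i]);
    if (v > peaks[col]) peaks[col] = v;
  }
  return peaks; // values in [0, 1] for float PCM input
}
```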
## Configuration
Edit the paths at the top of `app.js`:

```js
const transcriptPath = "outputs/float32/amuta_2026-01-12_1.json";
const waveformPath = "outputs/float32/amuta_2026-01-12_1.waveform.json";
```
### Speaker Labels
Map speaker IDs to display names in `app.js`. Assigning the same name to more than one ID merges those speakers into one:

```js
const SPEAKER_LABELS = {
  "SPEAKER_01": "Maya",
  "SPEAKER_23": "Maya", // same name as SPEAKER_01, so the two are merged
  "SPEAKER_4": "Maya",
};
```
## File Structure
```
amuta-meetings/
├── index.html                # Main HTML page
├── app.js                    # Application logic
├── styles.css                # Styles
├── scripts/
│   └── generate-waveform.js  # Waveform generator script
├── input/                    # Audio files (gitignored)
└── outputs/
    └── float32/              # Transcript and waveform JSON
```
## Transcription with WhisperX (GPU or CPU)
To transcribe audio with speaker diarization, we used WhisperX on a rented GPU service (Verda). Alternatively, book one of Tami's P40s, or set `--device cpu` on machines without CUDA.
### WhisperX CLI Command
The command saves the output as JSON and converts it to SRT for quick animation runs. Adapted from https://notes.nicolasdeville.com/python/library-whisperx/; we added diarization (see the HuggingFace token section below).
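The exact command we ran isn't recorded here; below is a representative invocation assembled from the key arguments that follow (the audio path, language, and token are placeholders):

```bash
# Representative WhisperX run; --output_format all writes JSON and SRT together.
whisperx input/amuta_2026-01-12_1.opus \
  --model large-v3 \
  --language he \
  --diarize \
  --hf_token <YOUR_HF_TOKEN> \
  --device cuda \
  --compute_type float16 \
  --batch_size 16 \
  --output_format all \
  --output_dir outputs/float32
```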
### Key Arguments
| Argument | Description |
|---|---|
| `--device` | Device to use for inference (`cpu` or `cuda`) |
| `--model` | Whisper model: `large-v3` (best quality), `turbo` (fastest), `large-v2` |
| `--language` | Source language code (e.g., `en` for English; ISO 639-1 codes) |
| `--diarize` | Enable speaker diarization (requires a HuggingFace token) |
| `--compute_type` | `float16` (GPU), `int8` (CPU/low memory), `float32` (highest accuracy) |
| `--batch_size` | Higher = faster but uses more VRAM (16-32 for 24 GB GPUs) |
| `--hf_token` | HuggingFace token for the PyAnnote diarization models |
### Performance Benchmarks (from Nic's notes)
| Configuration | Speed Ratio |
|---|---|
| `turbo`, `int8`, `batch_size=16` | ~2.3x realtime |
| `large-v3`, `int8`, `batch_size=16` | ~1.2x realtime |
| `large-v2`, `float16`, `batch_size=32` | ~1.5x realtime (GPU) |
**Tip:** For Hebrew transcription, `large-v3` typically provides better accuracy than `turbo`.
### Getting a HuggingFace Token
- Create an account at huggingface.co
- Accept the model terms for the PyAnnote diarization models (required by `--diarize`)
- Generate a token at Settings → Access Tokens
### Output Format
WhisperX outputs JSON with word-level timestamps and speaker labels:
```json
{
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "שלום לכולם",
      "speaker": "SPEAKER_01",
      "words": [
        { "word": "שלום", "start": 0.0, "end": 0.8 },
        { "word": "לכולם", "start": 0.9, "end": 2.5 }
      ]
    }
  ]
}
```
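To drive the orbs and transcript panel, the viewer has to map the current playback time back to a segment and its speaker. A minimal lookup sketch (a hypothetical helper, not taken from `app.js`):

```js
// Find the segment covering playback time t (seconds), or null if none does;
// segment boundaries come straight from the WhisperX JSON above.
function activeSegment(segments, t) {
  return segments.find((s) => t >= s.start && t < s.end) ?? null;
}

// Usage: look up the speaker talking 1.2 s into the recording.
// const seg = activeSegment(transcript.segments, 1.2);
// if (seg) console.log(seg.speaker); // e.g. "SPEAKER_01"
```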
## LLM notes
This is the work of Claude Opus 4.5 with the Roo Code VS Code extension.

Initial prompt:

> A web player that shows an animation of different talkers in space.
> It is the output of WhisperX with diarization.
> @/outputs/float32/amuta_2026-01-12_1.json
> At the bottom there is an audio spectrogram that allows the user to scrub the timeline.
> The JSON is aligned with @/input/amuta_2026-01-12_1.opus
