# Amuta Space Talkers
A diarization viewer for Whisper transcription output, featuring a visual "space" display of speakers and waveform-based audio navigation.
## Features
- Speaker visualization: Speakers displayed as animated orbs in a starfield
- Real-time transcription: Live transcript panel following audio playback
- Waveform navigation: Click/drag on the waveform to seek through the audio
- Keyboard controls: Space to play/pause, Arrow keys to seek
## Quick Start

- Place your audio file in `input/`
- Place your Whisper transcript JSON in `outputs/float32/`
- Generate the waveform data (see below)
- Start a local server and open in browser:

  ```sh
  npx serve -p 5000
  ```

  Then navigate to http://localhost:5000
## Waveform Generation
For optimal performance with long audio files (1-3 hours), waveform data is pre-generated as JSON rather than decoded in the browser.
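The pre-generation step boils down to peak extraction: downsampling decoded PCM samples into a fixed number of min/max columns. A minimal sketch of the idea (a hypothetical helper, not the actual `generate-waveform.js`):

```javascript
// Reduce an array of PCM samples (floats in [-1, 1]) to `columns`
// min/max pairs -- a hypothetical sketch of what the generator computes.
function computePeaks(samples, columns) {
  const window = Math.ceil(samples.length / columns);
  const peaks = [];
  for (let c = 0; c < columns; c++) {
    let min = Infinity;
    let max = -Infinity;
    const end = Math.min((c + 1) * window, samples.length);
    for (let i = c * window; i < end; i++) {
      if (samples[i] < min) min = samples[i];
      if (samples[i] > max) max = samples[i];
    }
    if (min === Infinity) break; // no samples left for this column
    peaks.push({ min, max });
  }
  return peaks;
}
```

Each column covers `samples.length / columns` samples, so even a 3-hour file reduces to a constant-size JSON payload.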
### Prerequisites

- Node.js (the generator script is run with `node`)
### Generate Waveform Data

```sh
node scripts/generate-waveform.js <input-audio> [output-json] [columns]
```
Arguments:

- `input-audio` - Path to the audio file (opus, mp3, wav, etc.)
- `output-json` - Output path for waveform JSON (default: `<input>.waveform.json`)
- `columns` - Number of waveform columns/peaks (default: 1000)
Example:

```sh
# Generate waveform for a single file
node scripts/generate-waveform.js input/amuta_2026-01-12_1.opus outputs/float32/amuta_2026-01-12_1.waveform.json

# Or let it auto-generate the output path
node scripts/generate-waveform.js input/meeting.opus
```
## Waveform JSON Format

The generated JSON file has this structure:

```json
{
  "version": 1,
  "source": "meeting.opus",
  "duration": 7200.5,
  "sampleRate": 48000,
  "columns": 1000,
  "peaks": [
    { "min": -0.82, "max": 0.91 },
    { "min": -0.45, "max": 0.52 }
  ]
}
```
| Field | Description |
|---|---|
| `version` | Schema version for future compatibility |
| `source` | Original audio filename |
| `duration` | Audio duration in seconds |
| `sampleRate` | Original sample rate |
| `columns` | Number of data points |
| `peaks` | Array of min/max amplitude pairs |
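Given `duration` and `columns`, click-to-seek on the waveform is a simple proportional mapping. A sketch (function names are illustrative, not from `app.js`):

```javascript
// Map a click at pixel `x` on a waveform of `width` pixels to a seek
// time in seconds, clamped to [0, duration].
function xToSeconds(x, width, duration) {
  const frac = Math.min(Math.max(x / width, 0), 1);
  return frac * duration;
}

// The matching peak-column index, e.g. for hover highlighting.
function xToColumn(x, width, columns) {
  return Math.min(Math.floor((x / width) * columns), columns - 1);
}
```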
## Configuration

Edit the paths at the top of `app.js`:

```js
const transcriptPath = "outputs/float32/amuta_2026-01-12_1.json";
const waveformPath = "outputs/float32/amuta_2026-01-12_1.waveform.json";
```
### Speaker Labels

Map speaker IDs to display names in `app.js`:

```js
const SPEAKER_LABELS = {
  "SPEAKER_01": "Maya",
  "SPEAKER_02": "David",
};
```
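IDs without a mapping can fall back to the raw speaker ID, so unlabeled speakers still render (an assumed convention, not necessarily what `app.js` does):

```javascript
const SPEAKER_LABELS = {
  "SPEAKER_01": "Maya",
  "SPEAKER_02": "David",
};

// Display name for a speaker ID, falling back to the ID itself.
const labelFor = (id) => SPEAKER_LABELS[id] ?? id;
```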
## Keyboard Shortcuts

| Key | Action |
|---|---|
| Space | Play/Pause |
| ← / A | Seek back 10 seconds |
| → / D | Seek forward 10 seconds |
| Shift + ←/→ | Seek 60 seconds |
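The seek offsets in the table reduce to a small pure function (a sketch; the real handler in `app.js` may differ):

```javascript
// Seek offset in seconds for a keydown: arrows/A/D move 10s,
// Shift enlarges the step to 60s; other keys are ignored.
function seekDelta(key, shiftKey) {
  const step = shiftKey ? 60 : 10;
  if (key === "ArrowLeft" || key.toLowerCase() === "a") return -step;
  if (key === "ArrowRight" || key.toLowerCase() === "d") return step;
  return 0;
}
```

In the browser this would be wired to a `keydown` listener that adds the delta to `audio.currentTime`.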
## File Structure

```
amuta-meetings/
├── index.html                   # Main HTML page
├── app.js                       # Application logic
├── styles.css                   # Styles
├── scripts/
│   └── generate-waveform.js     # Waveform generator script
├── input/                       # Audio files (gitignored)
├── outputs/
│   └── float32/                 # Transcript and waveform JSON
└── plans/
    └── waveform-optimization.md # Architecture documentation
```
## Performance Notes

- Waveform JSON (~20KB) loads in milliseconds, versus 5-15 seconds to decode 50-100MB of audio in the browser
- The waveform is loaded immediately on page load for instant display
- Audio is only downloaded once (by the `<audio>` element)
## Transcription with WhisperX (Rented GPU)
For transcribing audio with speaker diarization, use WhisperX on a rented GPU service (e.g., RunPod, Vast.ai, Lambda Labs).
### Recommended GPU Configuration

- GPU: NVIDIA A10, A100, or RTX 4090 (24GB+ VRAM recommended)
- compute_type: `float16` (optimal for GPU speed/quality balance)
- batch_size: `16`-`32` (increase for faster processing on high-VRAM GPUs)
### WhisperX CLI Command

```sh
whisperx input.opus \
  --model large-v3 \
  --language he \
  --task transcribe \
  --diarize \
  --compute_type float16 \
  --batch_size 16 \
  --device cuda \
  --output_format json \
  --output_dir ./outputs/float32/ \
  --hf_token YOUR_HUGGINGFACE_TOKEN
```
### Key Arguments

| Argument | Description |
|---|---|
| `--model` | Whisper model: `turbo` (fastest), `large-v2`, `large-v3` (best quality) |
| `--language` | Source language code (e.g., `he` for Hebrew, `en` for English) |
| `--diarize` | Enable speaker diarization (requires HuggingFace token) |
| `--compute_type` | `float16` (GPU), `int8` (CPU/low memory), `float32` (highest accuracy) |
| `--batch_size` | Higher = faster but uses more VRAM (16-32 for 24GB GPUs) |
| `--hf_token` | HuggingFace token for PyAnnote diarization models |
| `--min_speakers` / `--max_speakers` | Hint for expected speaker count |
### Performance Benchmarks (from Nic's notes)

| Configuration | Speed Ratio |
|---|---|
| `turbo`, `int8`, `batch_size=16` | ~2.3x realtime |
| `large-v3`, `int8`, `batch_size=16` | ~1.2x realtime |
| `large-v2`, `float16`, `batch_size=32` | ~1.5x realtime (GPU) |

Tip: For Hebrew transcription, `large-v3` typically provides better accuracy than `turbo`.
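The ratios translate directly into wall-clock estimates, e.g. a 2-hour recording at ~2.3x realtime takes roughly 52 minutes:

```javascript
// Estimated processing time in minutes, given audio length in seconds
// and a realtime speed ratio from the table above.
function estimateMinutes(audioSeconds, speedRatio) {
  return audioSeconds / speedRatio / 60;
}

console.log(estimateMinutes(7200, 2.3).toFixed(0)); // → "52"
```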
### Getting a HuggingFace Token

- Create an account at huggingface.co
- Accept the model terms at:
- Generate a token at Settings → Access Tokens
### Output Format

WhisperX outputs JSON with word-level timestamps and speaker labels:

```json
{
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "שלום לכולם",
      "speaker": "SPEAKER_01",
      "words": [
        { "word": "שלום", "start": 0.0, "end": 0.8 },
        { "word": "לכולם", "start": 0.9, "end": 2.5 }
      ]
    }
  ]
}
```
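A consumer of this JSON can, for example, tally words per speaker (a sketch assuming the segment shape shown above):

```javascript
// Count words per speaker in a WhisperX transcript object.
function wordsPerSpeaker(transcript) {
  const counts = {};
  for (const seg of transcript.segments) {
    counts[seg.speaker] = (counts[seg.speaker] ?? 0) + seg.words.length;
  }
  return counts;
}
```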