Space Talkers

A diarization viewer for Whisper transcription output

Demo Screenshot

Features

  • Speaker visualization: Speakers displayed as animated orbs in a starfield
  • Real-time transcription: Live transcript panel following audio playback
  • Waveform navigation: Click/drag on the waveform to seek through the audio
  • Keyboard controls: Space to play/pause, Arrow keys to seek

Keyboard Shortcuts

Key             Action
Space           Play/Pause
← / A           Seek back 10 seconds
→ / D           Seek forward 10 seconds
Shift + ← / →   Seek 60 seconds

Quick Start

  1. Place your audio file in input/
  2. Place your Whisper transcript JSON in outputs/float32/
  3. Generate the waveform data (see below)
  4. Start a local server and open in browser:
    npx serve -p 5000
    
    Then navigate to http://localhost:5000

Waveform Generation

For optimal performance with long audio files (1-3 hours), waveform data is pre-generated as JSON rather than decoded in the browser.

Prerequisites

  • Node.js (the generator script is run with node)

Generate Waveform Data

node scripts/generate-waveform.js <input-audio> [output-json] [columns]

Arguments:

  • input-audio - Path to the audio file (opus, mp3, wav, etc.)
  • output-json - Output path for waveform JSON (default: <input>.waveform.json)
  • columns - Number of waveform columns/peaks (default: 1000)

Example:

# Generate waveform for a single file
node scripts/generate-waveform.js input/amuta_2026-01-12_1.opus outputs/float32/amuta_2026-01-12_1.waveform.json

# Or let it auto-generate the output path
node scripts/generate-waveform.js input/meeting.opus
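
Under the hood, the generator reduces the decoded audio to one peak value per waveform column. The sketch below is illustrative only; the real scripts/generate-waveform.js also decodes the audio file, parses the CLI arguments, and may write a different JSON shape:

// Illustrative sketch of the peak-extraction idea, not the actual script:
// reduce decoded PCM samples to `columns` peak values and save them as JSON.
const fs = require("fs");

function computePeaks(samples, columns = 1000) {
  const bucketSize = Math.ceil(samples.length / columns);
  const peaks = [];
  for (let i = 0; i < columns; i++) {
    const start = i * bucketSize;
    const end = Math.min(start + bucketSize, samples.length);
    let peak = 0;
    for (let j = start; j < end; j++) {
      const v = Math.abs(samples[j]);
      if (v > peak) peak = v;
    }
    peaks.push(peak);
  }
  return peaks;
}

// Synthetic one-second example; the real script decodes the audio file first.
const samples = Float32Array.from({ length: 48000 }, () => Math.random() * 2 - 1);
fs.writeFileSync("example.waveform.json", JSON.stringify(computePeaks(samples)));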

Configuration

Edit the paths at the top of app.js:

const transcriptPath = "outputs/float32/amuta_2026-01-12_1.json";
const waveformPath = "outputs/float32/amuta_2026-01-12_1.waveform.json";
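
Both files are presumably fetched over HTTP when the page loads, which is why the Quick Start runs a local server. A minimal sketch of such a loader (app.js may structure this differently):

// Assumed loading pattern, not necessarily how app.js does it:
// fetch both JSON files before the viewer starts rendering.
async function loadData() {
  const [transcript, waveform] = await Promise.all([
    fetch(transcriptPath).then((r) => r.json()),
    fetch(waveformPath).then((r) => r.json()),
  ]);
  return { transcript, waveform };
}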

Speaker Labels

Map speaker IDs to display names in app.js. Several IDs can share the same name, which merges them into a single displayed speaker:

const SPEAKER_LABELS = {
  "SPEAKER_01": "Maya",
  // Merging: these diarization IDs are displayed as the same speaker
  "SPEAKER_02": "Maya",
  "SPEAKER_23": "Maya",
};
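
Speaker IDs without an entry presumably fall back to the raw diarization label, along these lines:

// Assumed lookup: show the configured name, or the raw ID when unmapped.
function speakerName(speakerId) {
  return SPEAKER_LABELS[speakerId] || speakerId;
}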

File Structure

amuta-meetings/
├── index.html              # Main HTML page
├── app.js                  # Application logic
├── styles.css              # Styles
├── scripts/
│   └── generate-waveform.js  # Waveform generator script
├── input/                  # Audio files (gitignored)
├── outputs/
│   └── float32/            # Transcript and waveform JSON

Transcription with WhisperX (GPU or CPU)

For transcribing audio with speaker diarization, we used WhisperX on a rented GPU service (verda). Alternatives: book one of tami's P40 GPUs, or run whisper.cpp on CPU.

WhisperX CLI Command

The code that saves the output as JSON and converts it to SRT for quick runs comes from https://notes.nicolasdeville.com/python/library-whisperx/.

We adapted it to add diarization (this requires a HuggingFace token; see below).

Key Arguments

Argument        Description
--model         Whisper model: large-v3 (best quality), turbo (fastest), large-v2
--language      Source language code (ISO 639-1, e.g. en for English, he for Hebrew)
--diarize       Enable speaker diarization (requires a HuggingFace token)
--compute_type  float16 (GPU), int8 (CPU/low memory), float32 (highest accuracy)
--batch_size    Higher = faster but uses more VRAM (16-32 for 24 GB GPUs)
--hf_token      HuggingFace token for the PyAnnote diarization models

Performance Benchmarks (from Nic's notes)

Configuration                      Speed ratio
turbo, int8, batch_size=16         ~2.3x realtime
large-v3, int8, batch_size=16      ~1.2x realtime
large-v2, float16, batch_size=32   ~1.5x realtime (GPU)

Tip: For Hebrew transcription, large-v3 typically provides better accuracy than turbo.

Getting a HuggingFace Token

  1. Create account at huggingface.co
  2. Accept the terms of use for the PyAnnote diarization models on their HuggingFace model pages
  3. Generate token at Settings → Access Tokens

Output Format

WhisperX outputs JSON with word-level timestamps and speaker labels:

{
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "שלום לכולם",
      "speaker": "SPEAKER_01",
      "words": [
        { "word": "שלום", "start": 0.0, "end": 0.8 },
        { "word": "לכולם", "start": 0.9, "end": 2.5 }
      ]
    }
  ]
}
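
This is the structure the viewer walks during playback. A hedged sketch of how the word-level timestamps can drive a live transcript panel (illustrative; app.js may implement this differently):

// Illustrative only: find the segment and word covering the current playback time.
function findActiveWord(transcript, time) {
  for (const segment of transcript.segments) {
    if (time < segment.start || time > segment.end) continue;
    const word = (segment.words || []).find(
      (w) => time >= w.start && time <= w.end
    );
    return { speaker: segment.speaker, word: word ? word.word : null };
  }
  return null;
}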