# Space Talkers
A diarization viewer for Whisper transcription output
## Features
- Speaker visualization: Speakers displayed as animated orbs in a starfield
- Real-time transcription: Live transcript panel following audio playback
- Waveform navigation: Click/drag on the waveform to seek through the audio
- Keyboard controls: Space to play/pause, Arrow keys to seek
## Keyboard Shortcuts
| Key | Action |
|---|---|
| Space | Play/Pause |
| ← / A | Seek back 10 seconds |
| → / D | Seek forward 10 seconds |
| Shift + ←/→ | Seek back/forward 60 seconds |
## Quick Start

- Place your audio file in `input/`
- Place your Whisper transcript JSON in `outputs/float32/`
- Generate the waveform data (see below)
- Start a local server and open it in a browser:

```bash
npx serve -p 5000
```

Then navigate to http://localhost:5000
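Any static file server works; a sketch using Python's built-in server instead of `npx serve` (serves the current directory on the same port):

```bash
# Alternative: Python's built-in static server
python3 -m http.server 5000
```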
## Waveform Generation
For optimal performance with long audio files (1-3 hours), waveform data is pre-generated as JSON rather than decoded in the browser.
### Prerequisites

- Node.js (the generator script and the `npx serve` dev server both run on Node)

### Generate Waveform Data
```bash
node scripts/generate-waveform.js <input-audio> [output-json] [columns]
```
Arguments:

- `input-audio` - Path to the audio file (opus, mp3, wav, etc.)
- `output-json` - Output path for the waveform JSON (default: `<input>.waveform.json`)
- `columns` - Number of waveform columns/peaks (default: 1000)
Example:

```bash
# Generate waveform for a single file
node scripts/generate-waveform.js input/amuta_2026-01-12_1.opus outputs/float32/amuta_2026-01-12_1.waveform.json

# Or let it auto-generate the output path
node scripts/generate-waveform.js input/meeting.opus
```
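To process a whole folder, the script can be run in a loop; a minimal sketch assuming `.opus` inputs and the `outputs/float32/` naming used above:

```bash
# Generate waveform JSON for every .opus file in input/
for f in input/*.opus; do
  name=$(basename "$f" .opus)
  node scripts/generate-waveform.js "$f" "outputs/float32/${name}.waveform.json"
done
```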
## Configuration

Edit the paths at the top of `app.js`:
```js
const transcriptPath = "outputs/float32/amuta_2026-01-12_1.json";
const waveformPath = "outputs/float32/amuta_2026-01-12_1.waveform.json";
```
### Speaker Labels

Map speaker IDs to display names in `app.js`; mapping several IDs to the same name merges those speakers:

```js
const SPEAKER_LABELS = {
  "SPEAKER_01": "Maya",
  // Merging: point additional diarization IDs at the same display name
  "SPEAKER_02": "Maya",
  "SPEAKER_23": "Maya",
  "SPEAKER_4": "Maya",
};
```
## File Structure

```
amuta-meetings/
├── index.html                 # Main HTML page
├── app.js                     # Application logic
├── styles.css                 # Styles
├── scripts/
│   └── generate-waveform.js   # Waveform generator script
├── input/                     # Audio files (gitignored)
└── outputs/
    └── float32/               # Transcript and waveform JSON
```
## Transcription with WhisperX (GPU or CPU)

For transcribing audio with speaker diarization, we used WhisperX on a rented GPU service (verda); alternatives are booking one of tami's P40s or running whisper.cpp on CPU.
### WhisperX CLI Command

The command that saves output as JSON and converts it to SRT for quick anima runs is based on https://notes.nicolasdeville.com/python/library-whisperx/; we adapted it to add diarization (see the HuggingFace token section below).
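A sketch of the adapted invocation, using the file names from this README and a token exported as `HF_TOKEN` (model, language, and paths are illustrative; the flags are standard WhisperX options):

```bash
# Transcribe with diarization; --output_format all writes JSON, SRT, and the other formats
whisperx input/amuta_2026-01-12_1.opus \
  --model large-v3 \
  --language he \
  --diarize \
  --compute_type float16 \
  --batch_size 16 \
  --hf_token "$HF_TOKEN" \
  --output_format all \
  --output_dir outputs/float32
```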
### Key Arguments

| Argument | Description |
|---|---|
| `--model` | Whisper model: `large-v3` (best quality), `turbo` (fastest), `large-v2` |
| `--language` | Source language code (ISO 639-1, e.g., `en` for English) |
| `--diarize` | Enable speaker diarization (requires a HuggingFace token) |
| `--compute_type` | `float16` (GPU), `int8` (CPU/low memory), `float32` (highest accuracy) |
| `--batch_size` | Higher = faster but uses more VRAM (16-32 for 24 GB GPUs) |
| `--hf_token` | HuggingFace token for the PyAnnote diarization models |
### Performance Benchmarks (from Nic's notes)

| Configuration | Speed Ratio |
|---|---|
| `turbo`, `int8`, `batch_size=16` | ~2.3x realtime |
| `large-v3`, `int8`, `batch_size=16` | ~1.2x realtime |
| `large-v2`, `float16`, `batch_size=32` | ~1.5x realtime (GPU) |
Tip: For Hebrew transcription, `large-v3` typically provides better accuracy than `turbo`.
### Getting a HuggingFace Token

- Create an account at huggingface.co
- Accept the model terms for the PyAnnote models used by WhisperX diarization (on their HuggingFace model pages)
- Generate a token under Settings → Access Tokens
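One way to pass the token is via an environment variable (the variable name here is just a convention):

```bash
# Store the token once, then reference it in the WhisperX command
export HF_TOKEN=hf_xxxxxxxxxxxx   # placeholder; paste your own token
whisperx input/meeting.opus --diarize --hf_token "$HF_TOKEN"
```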
### Output Format
WhisperX outputs JSON with word-level timestamps and speaker labels:
```json
{
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "שלום לכולם",
      "speaker": "SPEAKER_01",
      "words": [
        { "word": "שלום", "start": 0.0, "end": 0.8 },
        { "word": "לכולם", "start": 0.9, "end": 2.5 }
      ]
    }
  ]
}
```
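To sanity-check which speaker IDs appear in a transcript before filling in `SPEAKER_LABELS`, `jq` (if installed) can list them; the path below is the example transcript from the Configuration section:

```bash
# List the distinct speaker IDs found in the transcript JSON
jq -r '[.segments[].speaker] | unique | .[]' outputs/float32/amuta_2026-01-12_1.json
```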
