# Space Talkers

A diarization viewer for Whisper transcription output.
## Features
- Speaker visualization: Speakers displayed as animated orbs in a starfield
- Real-time transcription: Live transcript panel following audio playback
- Waveform navigation: Click/drag on the waveform to seek through the audio
- Keyboard controls: Space to play/pause, Arrow keys to seek
## Keyboard Shortcuts
| Key | Action |
|---|---|
| Space | Play/Pause |
| ← / A | Seek back 10 seconds |
| → / D | Seek forward 10 seconds |
| Shift + ←/→ | Seek 60 seconds |
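These bindings can be wired with a single keydown listener. The sketch below is hypothetical (not the actual `app.js` handler) and assumes `audio` is the page's `<audio>` element:

```js
// Hypothetical key bindings (not the actual app.js); assumes `audio` is the
// page's <audio> element.
document.addEventListener("keydown", (e) => {
  const step = e.shiftKey ? 60 : 10; // Shift extends the seek step to 60 s
  if (e.code === "Space") {
    e.preventDefault(); // keep Space from scrolling the page
    audio.paused ? audio.play() : audio.pause();
  } else if (e.code === "ArrowLeft" || e.code === "KeyA") {
    audio.currentTime = Math.max(0, audio.currentTime - step);
  } else if (e.code === "ArrowRight" || e.code === "KeyD") {
    audio.currentTime = Math.min(audio.duration, audio.currentTime + step);
  }
});
```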
## Quick Start
- Place your audio file in `input/` (an example is pre-configured)
- Place your Whisper transcript JSON in `outputs/float32/` (an example is pre-configured)
- Generate the waveform data (see below; an example is pre-configured)
- Start a local server and open it in a browser:

```bash
npx serve -p 5000
```

Then navigate to http://localhost:5000
## Waveform Generation
For optimal performance with long audio files (1-3 hours), waveform data is pre-generated as JSON rather than decoded in the browser.
### Prerequisites

- Node.js (the generator script runs with `node`)
### Generate Waveform Data
```bash
node scripts/generate-waveform.js <input-audio> [output-json] [columns]
```
Arguments:
- `input-audio` - Path to the audio file (opus, mp3, wav, etc.)
- `output-json` - Output path for the waveform JSON (default: `<input>.waveform.json`)
- `columns` - Number of waveform columns/peaks (default: 1000)
Example:
```bash
# Generate waveform for a single file
node scripts/generate-waveform.js input/amuta_2026-01-12_1.opus outputs/float32/amuta_2026-01-12_1.waveform.json

# Or let it auto-generate the output path
node scripts/generate-waveform.js input/meeting.opus
```
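At its core, such a generator reduces the decoded samples to a fixed number of peak values. A minimal sketch of the idea, assuming the audio has already been decoded to a `Float32Array` (this is not the actual `generate-waveform.js`):

```js
// Reduce raw PCM samples to `columns` peaks, keeping the maximum absolute
// amplitude per block. The resulting array is what gets serialized to JSON.
function computePeaks(samples, columns = 1000) {
  const blockSize = Math.ceil(samples.length / columns);
  const peaks = new Array(columns).fill(0);
  for (let i = 0; i < samples.length; i++) {
    const col = Math.floor(i / blockSize);
    const v = Math.abs(samples[i]);
    if (v > peaks[col]) peaks[col] = v;
  }
  return peaks; // values in [0, 1] for float PCM input
}
```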
## Configuration
Edit the paths at the top of `app.js`:

```js
const transcriptPath = "outputs/float32/amuta_2026-01-12_1.json";
const waveformPath = "outputs/float32/amuta_2026-01-12_1.waveform.json";
```
### Speaker Labels
Map speaker IDs to display names in `app.js`. Assigning the same name to more than one ID merges those speakers into one:

```js
const SPEAKER_LABELS = {
  "SPEAKER_01": "Maya",
  "SPEAKER_23": "Maya", // same name as SPEAKER_01, so the two are merged
  "SPEAKER_4": "Maya",
};
```
## File Structure
```
amuta-meetings/
├── index.html                # Main HTML page
├── app.js                    # Application logic
├── styles.css                # Styles
├── scripts/
│   └── generate-waveform.js  # Waveform generator script
├── input/                    # Audio files (gitignored)
└── outputs/
    └── float32/              # Transcript and waveform JSON
```
## Transcription with WhisperX (GPU or CPU)
To transcribe audio with speaker diarization, we used WhisperX on a rented GPU service (Verda). Alternatively, book one of Tami's P40s, or set `--device cpu` on machines without CUDA.
### WhisperX CLI Command
The command saves the output as JSON and converts it to SRT for quick animation runs. Adapted from https://notes.nicolasdeville.com/python/library-whisperx/; we added diarization (see the HuggingFace token section below).
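The exact command we ran isn't recorded here; below is a representative invocation assembled from the key arguments that follow (the audio path, language, and token are placeholders):

```bash
# Representative WhisperX run; --output_format all writes JSON and SRT together.
whisperx input/amuta_2026-01-12_1.opus \
  --model large-v3 \
  --language he \
  --diarize \
  --hf_token <YOUR_HF_TOKEN> \
  --device cuda \
  --compute_type float16 \
  --batch_size 16 \
  --output_format all \
  --output_dir outputs/float32
```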
### Key Arguments
| Argument | Description |
|---|---|
| `--device` | Device to use for inference (`cpu` or `cuda`) |
| `--model` | Whisper model: `large-v3` (best quality), `turbo` (fastest), `large-v2` |
| `--language` | Source language code (e.g., `en` for English; ISO 639-1 codes) |
| `--diarize` | Enable speaker diarization (requires a HuggingFace token) |
| `--compute_type` | `float16` (GPU), `int8` (CPU/low memory), `float32` (highest accuracy) |
| `--batch_size` | Higher = faster but uses more VRAM (16-32 for 24 GB GPUs) |
| `--hf_token` | HuggingFace token for the PyAnnote diarization models |
### Performance Benchmarks (from Nic's notes)
| Configuration | Speed Ratio |
|---|---|
| `turbo`, `int8`, `batch_size=16` | ~2.3x realtime |
| `large-v3`, `int8`, `batch_size=16` | ~1.2x realtime |
| `large-v2`, `float16`, `batch_size=32` | ~1.5x realtime (GPU) |
**Tip:** For Hebrew transcription, `large-v3` typically provides better accuracy than `turbo`.
### Getting a HuggingFace Token
- Create an account at huggingface.co
- Accept the model terms for the PyAnnote diarization models (required by `--diarize`)
- Generate a token at Settings → Access Tokens
### Output Format
WhisperX outputs JSON with word-level timestamps and speaker labels:
```json
{
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "שלום לכולם",
      "speaker": "SPEAKER_01",
      "words": [
        { "word": "שלום", "start": 0.0, "end": 0.8 },
        { "word": "לכולם", "start": 0.9, "end": 2.5 }
      ]
    }
  ]
}
```
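To drive the orbs and transcript panel, the viewer has to map the current playback time back to a segment and its speaker. A minimal lookup sketch (a hypothetical helper, not taken from `app.js`):

```js
// Find the segment covering playback time t (seconds), or null if none does;
// segment boundaries come straight from the WhisperX JSON above.
function activeSegment(segments, t) {
  return segments.find((s) => t >= s.start && t < s.end) ?? null;
}

// Usage: look up the speaker talking 1.2 s into the recording.
// const seg = activeSegment(transcript.segments, 1.2);
// if (seg) console.log(seg.speaker); // e.g. "SPEAKER_01"
```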
## LLM notes
This is the work of Claude Opus 4.5 with the Roo Code VS Code extension.

Initial prompt:

> A web player that shows an animation of different talkers in space.
> It is the output of WhisperX with diarization.
> @/outputs/float32/amuta_2026-01-12_1.json
> At the bottom there is an audio spectrogram that allows the user to scrub the timeline.
> The JSON is aligned with @/input/amuta_2026-01-12_1.opus
