
Amuta Space Talkers

A diarization viewer for Whisper transcription output, featuring a visual "space" display of speakers and waveform-based audio navigation.

Demo Screenshot

Features

  • Speaker visualization: Speakers displayed as animated orbs in a starfield
  • Real-time transcription: Live transcript panel following audio playback
  • Waveform navigation: Click/drag on the waveform to seek through the audio
  • Keyboard controls: Space to play/pause, Arrow keys to seek

Quick Start

  1. Place your audio file in input/
  2. Place your Whisper transcript JSON in outputs/float32/
  3. Generate the waveform data (see below)
  4. Start a local server and open in browser:
    npx serve -p 5000
    
    Then navigate to http://localhost:5000

Waveform Generation

For optimal performance with long audio files (1-3 hours), waveform data is pre-generated as JSON rather than decoded in the browser.
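The pre-generation step is essentially a min/max reduction over the decoded PCM samples. A minimal sketch of that reduction (simplified; the actual `scripts/generate-waveform.js` may differ):

```javascript
// Reduce an array of PCM samples (floats in [-1, 1]) into `columns`
// min/max pairs, one pair per waveform column.
function computePeaks(samples, columns) {
  const peaks = [];
  const samplesPerColumn = Math.ceil(samples.length / columns);
  for (let c = 0; c < columns; c++) {
    let min = 0;
    let max = 0;
    const start = c * samplesPerColumn;
    const end = Math.min(start + samplesPerColumn, samples.length);
    for (let i = start; i < end; i++) {
      if (samples[i] < min) min = samples[i];
      if (samples[i] > max) max = samples[i];
    }
    peaks.push({ min, max });
  }
  return peaks;
}
```

Keeping both the minimum and the maximum per column (rather than a single RMS value) preserves the asymmetry of the waveform, which reads better visually.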

Prerequisites

  • Node.js (the generator script runs via the node CLI)

Generate Waveform Data

node scripts/generate-waveform.js <input-audio> [output-json] [columns]

Arguments:

  • input-audio - Path to the audio file (opus, mp3, wav, etc.)
  • output-json - Output path for waveform JSON (default: <input>.waveform.json)
  • columns - Number of waveform columns/peaks (default: 1000)

Example:

# Generate waveform for a single file
node scripts/generate-waveform.js input/amuta_2026-01-12_1.opus outputs/float32/amuta_2026-01-12_1.waveform.json

# Or let it auto-generate the output path
node scripts/generate-waveform.js input/meeting.opus

Waveform JSON Format

The generated JSON file has this structure:

{
  "version": 1,
  "source": "meeting.opus",
  "duration": 7200.5,
  "sampleRate": 48000,
  "columns": 1000,
  "peaks": [
    { "min": -0.82, "max": 0.91 },
    { "min": -0.45, "max": 0.52 }
  ]
}
| Field | Description |
| --- | --- |
| `version` | Schema version for future compatibility |
| `source` | Original audio filename |
| `duration` | Audio duration in seconds |
| `sampleRate` | Original sample rate |
| `columns` | Number of data points |
| `peaks` | Array of min/max amplitude pairs |
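Each peak pair maps to one vertical bar in the rendered waveform. A sketch of that mapping for a canvas of a given height (`peaksToBars` is a hypothetical helper, not necessarily the code in app.js):

```javascript
// Map each {min, max} peak to a vertical bar (y, height) in pixels
// for a canvas of the given height; amplitude 0 sits at the centre line.
function peaksToBars(peaks, canvasHeight) {
  const mid = canvasHeight / 2;
  return peaks.map(({ min, max }) => {
    const top = mid - max * mid;    // positive amplitude maps upward
    const bottom = mid - min * mid; // negative amplitude maps downward
    return { y: top, height: Math.max(1, bottom - top) };
  });
}
```

Clamping the height to at least 1 px keeps silent columns visible as a thin centre line.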

Configuration

Edit the paths at the top of app.js:

const transcriptPath = "outputs/float32/amuta_2026-01-12_1.json";
const waveformPath = "outputs/float32/amuta_2026-01-12_1.waveform.json";

Speaker Labels

Map speaker IDs to display names in app.js:

const SPEAKER_LABELS = {
  "SPEAKER_01": "Maya",
  "SPEAKER_02": "David",
};
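Speakers without an entry in the map should still render; a small lookup with a fallback to the raw ID handles that (a sketch — `speakerName` is a hypothetical helper, not necessarily app.js's actual code):

```javascript
// Map diarization speaker IDs to display names.
const SPEAKER_LABELS = {
  SPEAKER_01: "Maya",
  SPEAKER_02: "David",
};

// Resolve an ID to a display name, falling back to the raw ID
// for speakers not listed in SPEAKER_LABELS.
function speakerName(id) {
  return SPEAKER_LABELS[id] ?? id;
}
```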

Keyboard Shortcuts

| Key | Action |
| --- | --- |
| Space | Play/Pause |
| ← / A | Seek back 10 seconds |
| → / D | Seek forward 10 seconds |
| Shift + ← / → | Seek 60 seconds |
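The shortcut handling can be kept testable by separating the key mapping from the DOM wiring. A sketch of that mapping, mirroring the table above (pure function; the actual event wiring in app.js may differ):

```javascript
// Translate a keydown into a player action, per the shortcuts table.
// Returns null for keys the player does not handle.
function keyToAction(key, shiftKey) {
  switch (key) {
    case " ":
      return { type: "toggle-play" };
    case "ArrowLeft":
    case "a":
      return { type: "seek", delta: shiftKey ? -60 : -10 };
    case "ArrowRight":
    case "d":
      return { type: "seek", delta: shiftKey ? 60 : 10 };
    default:
      return null;
  }
}
```

Returning `null` for unhandled keys lets the caller decide whether to call `event.preventDefault()`.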

File Structure

amuta-meetings/
├── index.html              # Main HTML page
├── app.js                  # Application logic
├── styles.css              # Styles
├── scripts/
│   └── generate-waveform.js  # Waveform generator script
├── input/                  # Audio files (gitignored)
├── outputs/
│   └── float32/            # Transcript and waveform JSON
└── plans/
    └── waveform-optimization.md  # Architecture documentation

Performance Notes

  • The waveform JSON (~20 KB) loads in milliseconds, versus 5-15 seconds to decode 50-100 MB of audio in the browser
  • The waveform is loaded immediately on page load for instant display
  • Audio is only downloaded once (by the <audio> element)

Transcription with WhisperX (Rented GPU)

For transcribing audio with speaker diarization, use WhisperX on a rented GPU service (e.g., RunPod, Vast.ai, Lambda Labs).

  • GPU: NVIDIA A10, A100, or RTX 4090 (24GB+ VRAM recommended)
  • compute_type: float16 (optimal for GPU speed/quality balance)
  • batch_size: 16-32 (increase for faster processing on high-VRAM GPUs)

WhisperX CLI Command

whisperx input.opus \
  --model large-v3 \
  --language he \
  --task transcribe \
  --diarize \
  --compute_type float16 \
  --batch_size 16 \
  --device cuda \
  --output_format json \
  --output_dir ./outputs/float32/ \
  --hf_token YOUR_HUGGINGFACE_TOKEN

Key Arguments

| Argument | Description |
| --- | --- |
| `--model` | Whisper model: `turbo` (fastest), `large-v2`, `large-v3` (best quality) |
| `--language` | Source language code (e.g., `he` for Hebrew, `en` for English) |
| `--diarize` | Enable speaker diarization (requires HuggingFace token) |
| `--compute_type` | `float16` (GPU), `int8` (CPU/low memory), `float32` (highest accuracy) |
| `--batch_size` | Higher = faster but uses more VRAM (16-32 for 24GB GPUs) |
| `--hf_token` | HuggingFace token for PyAnnote diarization models |
| `--min_speakers` / `--max_speakers` | Hint for expected speaker count |

Performance Benchmarks (from Nic's notes)

| Configuration | Speed Ratio |
| --- | --- |
| turbo, int8, batch_size=16 | ~2.3x realtime |
| large-v3, int8, batch_size=16 | ~1.2x realtime |
| large-v2, float16, batch_size=32 | ~1.5x realtime (GPU) |

Tip: For Hebrew transcription, large-v3 typically provides better accuracy than turbo.

Getting a HuggingFace Token

  1. Create account at huggingface.co
  2. Accept the terms of the gated PyAnnote models on their HuggingFace model pages
  3. Generate token at Settings → Access Tokens

Output Format

WhisperX outputs JSON with word-level timestamps and speaker labels:

{
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "שלום לכולם",
      "speaker": "SPEAKER_01",
      "words": [
        { "word": "שלום", "start": 0.0, "end": 0.8 },
        { "word": "לכולם", "start": 0.9, "end": 2.5 }
      ]
    }
  ]
}
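The viewer needs per-speaker aggregates from this JSON (for example, to size each speaker orb). A sketch of one such aggregation over the segments array (`speakerTotals` is a hypothetical helper, not part of WhisperX or app.js):

```javascript
// Sum total speaking time per speaker from WhisperX segments.
function speakerTotals(segments) {
  const totals = {};
  for (const seg of segments) {
    const id = seg.speaker ?? "UNKNOWN";
    totals[id] = (totals[id] ?? 0) + (seg.end - seg.start);
  }
  return totals;
}
```

Segments missing a `speaker` field (which diarization can produce for short or overlapping speech) are bucketed under `"UNKNOWN"` rather than dropped.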