# Space Talkers

A diarization viewer for Whisper transcription output

![Demo Screenshot](screenshots/demo.jpg)

## Features

- **Speaker visualization**: Speakers displayed as animated orbs in a starfield
- **Real-time transcription**: Live transcript panel following audio playback
- **Waveform navigation**: Click/drag on the waveform to seek through the audio
- **Keyboard controls**: Space to play/pause, Arrow keys to seek

## Keyboard Shortcuts

| Key | Action |
|-----|--------|
| `Space` | Play/Pause |
| `←` / `A` | Seek back 10 seconds |
| `→` / `D` | Seek forward 10 seconds |
| `Shift` + `←`/`→` | Seek 60 seconds |

## Quick Start

1. Place your audio file in `input/`
2. Place your Whisper transcript JSON in `outputs/float32/`
3. Generate the waveform data (see below)
4. Start a local server and open in browser:

```bash
npx serve -p 5000
```

Then navigate to http://localhost:5000

## Waveform Generation

For optimal performance with long audio files (1-3 hours), waveform data is pre-generated as JSON rather than decoded in the browser.

### Prerequisites

- [Node.js](https://nodejs.org/) (v14+)
- [FFmpeg](https://ffmpeg.org/) installed and available in PATH

### Generate Waveform Data

```bash
node scripts/generate-waveform.js <input-audio> [output-json] [columns]
```

**Arguments:**

- `input-audio` - Path to the audio file (opus, mp3, wav, etc.)
- `output-json` - Output path for waveform JSON (default: `<input-audio>.waveform.json`)
- `columns` - Number of waveform columns/peaks (default: 1000)

**Example:**

```bash
# Generate waveform for a single file
node scripts/generate-waveform.js input/amuta_2026-01-12_1.opus outputs/float32/amuta_2026-01-12_1.waveform.json

# Or let it auto-generate the output path
node scripts/generate-waveform.js input/meeting.opus
```

## Configuration

Edit the paths at the top of `app.js`:

```javascript
const transcriptPath = "outputs/float32/amuta_2026-01-12_1.json";
const waveformPath = "outputs/float32/amuta_2026-01-12_1.waveform.json";
```

### Speaker Labels

Map speaker IDs to display names in `app.js`. Mapping several IDs to the same name merges them into one speaker:

```javascript
const SPEAKER_LABELS = {
  "SPEAKER_01": "Maya",
  "SPEAKER_02": "Maya", // merged with SPEAKER_01
};
```

## File Structure

```
amuta-meetings/
├── index.html                 # Main HTML page
├── app.js                     # Application logic
├── styles.css                 # Styles
├── scripts/
│   └── generate-waveform.js   # Waveform generator script
├── input/                     # Audio files (gitignored)
├── outputs/
│   └── float32/               # Transcript and waveform JSON
```

## Transcription with WhisperX (GPU or CPU)

For transcribing audio with speaker diarization, we used [WhisperX](https://github.com/m-bain/whisperX) on a rented GPU service (verda); alternatively, book one of tami's P40s, or run whisper.cpp on CPU.
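If running WhisperX on your own machine rather than a rented box, it is typically installed with pip (a sketch; check the WhisperX repository for current install instructions and CUDA/PyTorch requirements):

```shell
pip install whisperx
```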
### WhisperX CLI Command

The command saves output as JSON and converts it to SRT for quick anima runs. It was adapted from [Nic's notes](https://notes.nicolasdeville.com/python/library-whisperx/) to add diarization (see below for obtaining a HuggingFace token).

### Key Arguments

| Argument | Description |
|----------|-------------|
| `--model` | Whisper model: `large-v3` (best quality), `turbo` (fastest), `large-v2` |
| `--language` | Source language code (e.g., `en` for English; ISO 639-1 codes) |
| `--diarize` | Enable speaker diarization (requires HuggingFace token) |
| `--compute_type` | `float16` (GPU), `int8` (CPU/low memory), `float32` (highest accuracy) |
| `--batch_size` | Higher = faster but uses more VRAM (16-32 for 24GB GPUs) |
| `--hf_token` | HuggingFace token for PyAnnote diarization models |

### Performance Benchmarks (from [Nic's notes](https://notes.nicolasdeville.com/python/library-whisperx/))

| Configuration | Speed Ratio |
|---------------|-------------|
| `turbo`, `int8`, `batch_size=16` | ~2.3x realtime |
| `large-v3`, `int8`, `batch_size=16` | ~1.2x realtime |
| `large-v2`, `float16`, `batch_size=32` | ~1.5x realtime (GPU) |

> **Tip**: For Hebrew transcription, `large-v3` typically provides better accuracy than `turbo`.

### Getting a HuggingFace Token

1. Create an account at [huggingface.co](https://huggingface.co)
2. Accept the model terms at:
   - [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)
   - [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0)
3. Generate a token at [Settings → Access Tokens](https://huggingface.co/settings/tokens)

### Output Format

WhisperX outputs JSON with word-level timestamps and speaker labels:

```json
{
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "שלום לכולם",
      "speaker": "SPEAKER_01",
      "words": [
        { "word": "שלום", "start": 0.0, "end": 0.8 },
        { "word": "לכולם", "start": 0.9, "end": 2.5 }
      ]
    }
  ]
}
```
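Putting the key arguments from the table above together, a representative invocation might look like the following (the input path, model choice, language, and `$HF_TOKEN` are placeholder assumptions, not our exact command; `--output_format all` writes JSON, SRT, and the other supported formats):

```shell
whisperx input/meeting.opus \
  --model large-v3 \
  --language he \
  --diarize \
  --compute_type float16 \
  --batch_size 16 \
  --hf_token "$HF_TOKEN" \
  --output_dir outputs/float32 \
  --output_format all
```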
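For downstream processing, the `segments` array can be consumed directly. A minimal Node.js sketch (the `textBySpeaker` helper is illustrative, not part of `app.js`) that groups segment text by speaker:

```javascript
// Group WhisperX segment texts by speaker label.
// Segments without a speaker field are bucketed under "UNKNOWN".
function textBySpeaker(data) {
  const out = {};
  for (const seg of data.segments) {
    const speaker = seg.speaker || "UNKNOWN";
    (out[speaker] = out[speaker] || []).push(seg.text);
  }
  return out;
}

// Example input mirroring the JSON shape shown above.
const data = {
  segments: [
    { start: 0.0, end: 2.5, text: "שלום לכולם", speaker: "SPEAKER_01" },
    { start: 2.6, end: 4.0, text: "hello", speaker: "SPEAKER_02" },
  ],
};

console.log(textBySpeaker(data));
```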