read nic

README.md

# Space Talkers

A diarization viewer for Whisper transcription output, featuring a visual "space" display of speakers and waveform-based audio navigation.

- **Waveform navigation**: Click/drag on the waveform to seek through the audio
- **Keyboard controls**: Space to play/pause, Arrow keys to seek

## Keyboard Shortcuts

| Key | Action |
|-----|--------|
| `Space` | Play/Pause |
| `←` / `A` | Seek back 10 seconds |
| `→` / `D` | Seek forward 10 seconds |
| `Shift` + `←`/`→` | Seek 60 seconds |

## Quick Start

1. Place your audio file in `input/`

```bash
node scripts/generate-waveform.js input/amuta_2026-01-12_1.opus outputs/float32/amuta_2026-01-12_1.waveform.json

# Or let it auto-generate the output path
node scripts/generate-waveform.js input/meeting.opus
```

### Waveform JSON Format

The generated JSON file has this structure:

```json
{
  "version": 1,
  "source": "meeting.opus",
  "duration": 7200.5,
  "sampleRate": 48000,
  "columns": 1000,
  "peaks": [
    { "min": -0.82, "max": 0.91 },
    { "min": -0.45, "max": 0.52 }
  ]
}
```

| Field | Description |
|-------|-------------|
| `version` | Schema version for future compatibility |
| `source` | Original audio filename |
| `duration` | Audio duration in seconds |
| `sampleRate` | Original sample rate |
| `columns` | Number of data points |
| `peaks` | Array of min/max amplitude pairs |
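
For illustration, here is a minimal Python sketch of consuming these fields (helper names are hypothetical; the actual viewer logic lives in `app.js`), mapping a clicked waveform column back to a seek position:

```python
import json

def load_waveform(path):
    """Load a waveform JSON file and sanity-check its shape."""
    with open(path) as f:
        wf = json.load(f)
    assert wf["version"] == 1, "unexpected schema version"
    assert len(wf["peaks"]) == wf["columns"]
    return wf

def column_to_seconds(wf, column):
    """Map a waveform column index (e.g., from a click) to a seek time in seconds."""
    return (column / wf["columns"]) * wf["duration"]

wf = load_waveform("outputs/float32/meeting.waveform.json")
print(f"column 500 -> {column_to_seconds(wf, 500):.1f}s")  # halfway through the audio
```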

## Configuration

Edit the paths at the top of `app.js`, e.g. `const waveformPath = "outputs/float32/amuta_2026-01-12_1.waveform.json";`.

### Speaker Labels

Map speaker IDs to display names in `app.js`. Mapping several IDs to the same name merges speakers that diarization split apart:
```javascript
const SPEAKER_LABELS = {
  "SPEAKER_01": "Maya",
  "SPEAKER_02": "David",
  "SPEAKER_23": "David", // merged: same display name as SPEAKER_02
  "SPEAKER_4": "David",  // merged
};
```
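
The same merge can also be applied offline by rewriting speaker IDs in the WhisperX JSON before the viewer loads it; a sketch (file names illustrative, `LABELS` mirrors `SPEAKER_LABELS` above):

```python
import json

# Several raw diarization IDs collapse into one display name
LABELS = {"SPEAKER_01": "Maya", "SPEAKER_02": "David", "SPEAKER_23": "David"}

with open("outputs/float32/meeting.json") as f:
    data = json.load(f)

for seg in data["segments"]:
    if "speaker" in seg:  # present only on diarized transcripts
        seg["speaker"] = LABELS.get(seg["speaker"], seg["speaker"])

with open("outputs/float32/meeting.labeled.json", "w") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
```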
## File Structure

```
amuta-meetings/
├── input/                  # Audio files (gitignored)
├── outputs/
│   └── float32/            # Transcript and waveform JSON
└── plans/
    └── waveform-optimization.md   # Architecture documentation
```

## Performance Notes

- **Waveform JSON (~20KB)** loads in milliseconds vs decoding 50-100MB audio in 5-15 seconds
- The waveform is loaded immediately on page load for instant display
- Audio is only downloaded once (by the `<audio>` element)

## Transcription with WhisperX (GPU or CPU)

For transcribing audio with speaker diarization, we use [WhisperX](https://github.com/m-bain/whisperX) on a rented GPU service (e.g., verda, [RunPod](https://runpod.io), [Vast.ai](https://vast.ai), [Lambda Labs](https://lambdalabs.com)), book one of Tami's P40 machines, or fall back to whisper.cpp on CPU.

### Recommended GPU Configuration

- **GPU**: NVIDIA A10, A100, or RTX 4090 (24GB+ VRAM recommended)
- **compute_type**: `float16` (optimal for GPU speed/quality balance)
- **batch_size**: `16-32` (increase for faster processing on high-VRAM GPUs)

### WhisperX CLI Command

```bash
whisperx input.opus \
  --model large-v3 \
  --language he \
  --task transcribe \
  --diarize \
  --compute_type float16 \
  --batch_size 16 \
  --device cuda \
  --output_format json \
  --output_dir ./outputs/float32/ \
  --hf_token YOUR_HUGGINGFACE_TOKEN
```

The script below (`whisperX-nic.py`) automates this: it saves the WhisperX output as JSON and converts it to SRT for quick anima runs. It is adapted from https://notes.nicolasdeville.com/python/library-whisperx/, which we extended to add diarization (see the HuggingFace token notes below).

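With `--diarize`, each segment in the output JSON carries a `speaker` field alongside `start`, `end`, and `text`. A minimal sketch of printing that JSON as a labeled transcript (path illustrative):

```python
import json

with open("outputs/float32/meeting.json") as f:
    data = json.load(f)

for seg in data["segments"]:
    speaker = seg.get("speaker", "UNKNOWN")  # missing if diarization was skipped
    print(f'[{seg["start"]:7.1f}s] {speaker}: {seg["text"].strip()}')
```
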
### Key Arguments

| Argument | Description |
|----------|-------------|
| `--model` | Whisper model: `large-v3` (best quality), `large-v2`, `turbo` (fastest) |
| `--language` | Source language code (e.g., `he` for Hebrew, `en` for English) |
| `--diarize` | Enable speaker diarization (requires HuggingFace token) |
| `--compute_type` | `float16` (GPU), `int8` (CPU/low memory), `float32` (highest accuracy) |
| `--batch_size` | Higher = faster but uses more VRAM (16-32 for 24GB GPUs) |
| `--hf_token` | HuggingFace token for PyAnnote diarization models |
| `--min_speakers` / `--max_speakers` | Hint for expected speaker count |

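For example, when the expected speaker count is known, the hint flags can be appended to the same argument-list pattern `whisperX-nic.py` uses below (values illustrative):

```python
# Bound the diarizer's speaker search before launching whisperx
cmd = ["whisperx", "input.opus", "--diarize", "--hf_token", "YOUR_HUGGINGFACE_TOKEN"]
cmd += ["--min_speakers", "2", "--max_speakers", "5"]
print(" ".join(cmd))  # inspect, then run with subprocess.run(cmd, check=True)
```
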
### Performance Benchmarks (from [Nic's notes](https://notes.nicolasdeville.com/python/library-whisperx/))

whisperX-nic.py (new file)

```python
import json
import os
import subprocess
import sys

# Fix cuDNN library path for CUDA support
cudnn_path = os.path.join(
    os.path.dirname(os.path.abspath(__file__)),
    ".venv/lib/python3.10/site-packages/nvidia/cudnn/lib"
)
if os.path.exists(cudnn_path):
    original_ld_path = os.environ.get("LD_LIBRARY_PATH", "")
    os.environ["LD_LIBRARY_PATH"] = cudnn_path + ":" + original_ld_path

def generate_en_srt(mp4_path):

    output_dir = os.path.dirname(mp4_path) or "."

    # WhisperX Configuration
    LANGUAGE = "en"
    VERBOSE = "False"
    MODEL = "turbo"
    MODEL_CACHE_ONLY = "False"
    MODEL_DIR = None
    DEVICE = "cuda"
    DEVICE_INDEX = "0"
    ALIGN_MODEL = "WAV2VEC2_ASR_LARGE_LV60K_960H"
    BATCH_SIZE = "16"
    COMPUTE_TYPE = "float32"
    MAX_LINE_WIDTH = "45"
    MAX_LINE_COUNT = "1"
    TASK = "transcribe"
    INTERPOLATE_METHOD = "nearest"
    # NO_ALIGN = "False"
    # RETURN_CHAR_ALIGNMENTS = "False"
    VAD_METHOD = "pyannote"  # or "silero": pyannote gives robust, precise segmentation on challenging audio (at higher compute cost); silero is lighter and faster but can be less accurate in complex scenarios
    VAD_ONSET = "0.500"
    VAD_OFFSET = "0.363"
    CHUNK_SIZE = "30"
    DIARIZE = "True"
    # MIN_SPEAKERS = None
    # MAX_SPEAKERS = None
    TEMPERATURE = "0"
    BEST_OF = "5"
    BEAM_SIZE = "5"
    PATIENCE = "1.0"
    LENGTH_PENALTY = "1.0"
    SUPPRESS_TOKENS = "-1"
    SUPPRESS_NUMERALS = "False"
    # INITIAL_PROMPT = None
    # CONDITION_ON_PREVIOUS_TEXT = "False"
    # FP16 = "True"
    TEMPERATURE_INCREMENT_ON_FALLBACK = "0.2"
    COMPRESSION_RATIO_THRESHOLD = "2.4"
    LOGPROB_THRESHOLD = "-1.0"
    NO_SPEECH_THRESHOLD = "0.6"
    # HIGHLIGHT_WORDS = "False"
    SEGMENT_RESOLUTION = "chunk"
    THREADS = "8"
    HF_TOKEN = "hf_fdgdfgdfg"  # replace with a valid HuggingFace token; you also need to sign in and request approval on the three linked gated model pages
    OUTPUT_FORMAT = "json"  # "all", "json", or "srt": output JSON only and post-process it into a new .srt with a more reasonable number of words per segment, or output all formats to also get the clean .txt and delete unneeded files in post-processing
    PRINT_PROGRESS = "False"

    cmd = [
        "whisperx",
        mp4_path,
        "--verbose", VERBOSE,
        "--model", MODEL,
        "--device", DEVICE,
        "--device_index", DEVICE_INDEX,
        # "--align_model", ALIGN_MODEL,  # 250329-1534 removing to try to fix the overlapping segments issue
        "--batch_size", BATCH_SIZE,
        "--compute_type", COMPUTE_TYPE,
        "--max_line_width", MAX_LINE_WIDTH,
        "--max_line_count", MAX_LINE_COUNT,
        "--language", LANGUAGE,
        "--task", TASK,
        # "--interpolate_method", INTERPOLATE_METHOD,  # 250329-1534 removing to try to fix the overlapping segments issue
        # "--no_align", NO_ALIGN,
        # "--return_char_alignments", RETURN_CHAR_ALIGNMENTS,
        # "--vad_method", VAD_METHOD,  # 250329-1534 removing to try to fix the overlapping segments issue
        # "--vad_onset", VAD_ONSET,  # 250329-1534 removing to try to fix the overlapping segments issue
        # "--vad_offset", VAD_OFFSET,  # 250329-1534 removing to try to fix the overlapping segments issue
        # "--chunk_size", CHUNK_SIZE,  # 250329-1534 removing to try to fix the overlapping segments issue
        "--diarize",  # flag only, no value - enables speaker diarization
        "--hf_token", HF_TOKEN,  # required for diarization
        # "--min_speakers", MIN_SPEAKERS,
        # "--max_speakers", MAX_SPEAKERS,
        # "--temperature", TEMPERATURE,
        # "--best_of", BEST_OF,
        # "--beam_size", BEAM_SIZE,
        # "--patience", PATIENCE,
        # "--length_penalty", LENGTH_PENALTY,
        # "--suppress_tokens", SUPPRESS_TOKENS,
        # "--initial_prompt", INITIAL_PROMPT,
        # "--condition_on_previous_text", CONDITION_ON_PREVIOUS_TEXT,
        # "--fp16", FP16,
        # "--temperature_increment_on_fallback", TEMPERATURE_INCREMENT_ON_FALLBACK,
        # "--compression_ratio_threshold", COMPRESSION_RATIO_THRESHOLD,
        # "--logprob_threshold", LOGPROB_THRESHOLD,
        # "--no_speech_threshold", NO_SPEECH_THRESHOLD,
        # "--highlight_words", HIGHLIGHT_WORDS,
        # "--segment_resolution", SEGMENT_RESOLUTION,
        # "--threads", THREADS,
        # "--hf_token", HF_TOKEN,
        # "--print_progress", PRINT_PROGRESS,
        "--output_dir", output_dir,
        "--output_format", OUTPUT_FORMAT,
    ]

    # Add boolean flags without values
    # if MODEL_CACHE_ONLY.lower() == "true":
    #     cmd.append("--model_cache_only")
    # if VERBOSE.lower() == "true":
    #     cmd.append("--verbose")
    # if SUPPRESS_NUMERALS.lower() == "true":
    #     cmd.append("--suppress_numerals")

    # print(f"\n🔊 Generating 🇬🇧 English SRT for: {os.path.basename(mp4_path)}\n")
    subprocess.run(cmd, check=True)

    # Determine the output SRT path
    base_name = os.path.basename(mp4_path).rsplit(".", 1)[0]
    srt_path = os.path.join(output_dir, f"{base_name}.srt")
    json_path = os.path.join(output_dir, f"{base_name}.json")

    # Post-processing to create a new .srt file with a more reasonable number
    # of words per segment, using the json output from whisperx
    with open(json_path, "r") as f:
        data = json.load(f)

    # data["segments"] has a list of segments, each with "words" that have individual timestamps.
    # (Alignment can occasionally omit a word's timestamps, e.g. for numerals; this assumes they are present.)
    new_segments = []
    max_words_per_segment = 10

    for seg in data["segments"]:
        words = seg["words"]
        current_chunk = []
        for word_info in words:
            current_chunk.append(word_info)
            if len(current_chunk) >= max_words_per_segment:
                # finalize chunk
                start_ts = current_chunk[0]["start"]
                end_ts = current_chunk[-1]["end"]
                text = " ".join([w["word"] for w in current_chunk])
                new_segments.append((start_ts, end_ts, text))
                current_chunk = []

        # leftover words in this segment
        if current_chunk:
            start_ts = current_chunk[0]["start"]
            end_ts = current_chunk[-1]["end"]
            text = " ".join([w["word"] for w in current_chunk])
            new_segments.append((start_ts, end_ts, text))

    # now write new_segments to SRT format:
    def srt_time(sec):
        """Convert float seconds to SRT time format (HH:MM:SS,mmm)"""
        hours = int(sec // 3600)
        minutes = int((sec % 3600) // 60)
        seconds = int(sec % 60)
        milliseconds = int((sec * 1000) % 1000)
        return f"{hours:02d}:{minutes:02d}:{seconds:02d},{milliseconds:03d}"

    with open(srt_path, "w") as srt:
        for i, (start, end, text) in enumerate(new_segments, start=1):
            srt.write(f"{i}\n")
            srt.write(f"{srt_time(start)} --> {srt_time(end)}\n")
            srt.write(text.strip() + "\n\n")

    print(f"\n✅ Generated SRT file: {srt_path}\n")

    # Create a clean .txt version of the SRT file
    txt_path = os.path.join(output_dir, f"{base_name}.txt")

    try:
        with open(srt_path, "r") as srt_file, open(txt_path, "w") as txt_file:
            for line in srt_file:
                line = line.strip()
                # Skip empty lines, timestamp lines (containing '-->'), and
                # subtitle index lines (digits only). Checking isdigit() rather than
                # startswith(digit) avoids dropping dialogue that begins with a number.
                if line and not line.isdigit() and '-->' not in line:
                    txt_file.write(line + "\n")

        print(f"✅ Generated clean TXT file: {txt_path}")
    except Exception as e:
        print(f"❌ Error creating TXT file: {str(e)}")

    # Delete the .tsv and .vtt files created in the same folder
    base_path = os.path.join(output_dir, base_name)
    tsv_path = f"{base_path}.tsv"
    vtt_path = f"{base_path}.vtt"

    if os.path.exists(tsv_path):
        try:
            os.remove(tsv_path)
            # print(f"🗑️ Deleted TSV file: {tsv_path}")
        except Exception as e:
            print(f"❌ Error deleting TSV file: {str(e)}")

    if os.path.exists(vtt_path):
        try:
            os.remove(vtt_path)
            # print(f"🗑️ Deleted VTT file: {vtt_path}")
        except Exception as e:
            print(f"❌ Error deleting VTT file: {str(e)}")

    return srt_path

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python whisperX-nic.py <audio_file_path>")
        sys.exit(1)

    audio_path = sys.argv[1]
    generate_en_srt(audio_path)
```
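
A typical run (path illustrative): `python whisperX-nic.py input/meeting.opus` transcribes the file and leaves `meeting.json`, `meeting.srt`, and a clean `meeting.txt` next to the input.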