This commit is contained in:
5shekel
2026-01-18 02:15:57 +02:00
parent 03615b9702
commit 2a8d6273ee
2 changed files with 241 additions and 71 deletions


@@ -1,6 +1,6 @@
# Amuta Space Talkers
# Space Talkers
A diarization viewer for Whisper transcription output, featuring a visual "space" display of speakers and waveform-based audio navigation.
A diarization viewer for Whisper transcription output
![Demo Screenshot](screenshots/demo.jpg)
@@ -11,6 +11,16 @@ A diarization viewer for Whisper transcription output, featuring a visual "space
- **Waveform navigation**: Click/drag on the waveform to seek through the audio
- **Keyboard controls**: Space to play/pause, Arrow keys to seek
## Keyboard Shortcuts
| Key | Action |
|-----|--------|
| `Space` | Play/Pause |
| `←` / `A` | Seek back 10 seconds |
| `→` / `D` | Seek forward 10 seconds |
| `Shift` + `←`/`→` | Seek 60 seconds |
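A minimal sketch of how these bindings might be wired up (the element lookup and exact key handling here are illustrative, not the actual `app.js` code):

```javascript
// Illustrative sketch only; the real app.js wiring may differ.
const audio = document.querySelector("audio");

document.addEventListener("keydown", (e) => {
  const seek = e.shiftKey ? 60 : 10;            // Shift + arrow seeks 60s instead of 10s
  if (e.code === "Space") {
    e.preventDefault();                         // keep Space from scrolling the page
    audio.paused ? audio.play() : audio.pause();
  } else if (e.key === "ArrowLeft" || e.key.toLowerCase() === "a") {
    audio.currentTime = Math.max(0, audio.currentTime - seek);
  } else if (e.key === "ArrowRight" || e.key.toLowerCase() === "d") {
    audio.currentTime = Math.min(audio.duration || Infinity, audio.currentTime + seek);
  }
});
```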
## Quick Start
1. Place your audio file in `input/`
@@ -51,34 +61,6 @@ node scripts/generate-waveform.js input/amuta_2026-01-12_1.opus outputs/float32/
# Or let it auto-generate the output path
node scripts/generate-waveform.js input/meeting.opus
```
### Waveform JSON Format
The generated JSON file has this structure:
```json
{
"version": 1,
"source": "meeting.opus",
"duration": 7200.5,
"sampleRate": 48000,
"columns": 1000,
"peaks": [
{ "min": -0.82, "max": 0.91 },
{ "min": -0.45, "max": 0.52 }
]
}
```
| Field | Description |
|-------|-------------|
| `version` | Schema version for future compatibility |
| `source` | Original audio filename |
| `duration` | Audio duration in seconds |
| `sampleRate` | Original sample rate |
| `columns` | Number of data points |
| `peaks` | Array of min/max amplitude pairs |
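As a rough sketch of how such peaks can be produced (not the actual `generate-waveform.js`), slice the decoded samples into `columns` buckets and keep each bucket's min/max:

```javascript
// Rough sketch of peak extraction, not the actual generate-waveform.js.
// `samples` is assumed to be a Float32Array of decoded PCM in [-1, 1].
function computePeaks(samples, columns = 1000) {
  const bucket = Math.ceil(samples.length / columns);
  const peaks = [];
  for (let c = 0; c < columns; c++) {
    let min = 0, max = 0;
    for (let i = c * bucket; i < Math.min((c + 1) * bucket, samples.length); i++) {
      if (samples[i] < min) min = samples[i];
      if (samples[i] > max) max = samples[i];
    }
    peaks.push({ min: +min.toFixed(2), max: +max.toFixed(2) });
  }
  return peaks;
}
```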
## Configuration
Edit the paths at the top of `app.js`:
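The config block looks roughly like this; only `waveformPath` is confirmed by the diff below, the other constant names are assumptions:

```javascript
// Assumed shape of the config block; only waveformPath appears in the diff.
const audioPath = "input/amuta_2026-01-12_1.opus";                // hypothetical name
const transcriptPath = "outputs/float32/amuta_2026-01-12_1.json"; // hypothetical name
const waveformPath = "outputs/float32/amuta_2026-01-12_1.waveform.json";
```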
@@ -91,22 +73,15 @@ const waveformPath = "outputs/float32/amuta_2026-01-12_1.waveform.json";
### Speaker Labels
Map speaker IDs to display names in `app.js`:
Supports merging: map several speaker IDs to the same display name to fold them into one speaker.
```javascript
const SPEAKER_LABELS = {
  "SPEAKER_01": "Maya",
  "SPEAKER_02": "David",
  // Merging: IDs mapped to the same display name are shown as one speaker
  "SPEAKER_23": "David",
  "SPEAKER_4": "David",
};
```
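A lookup along these lines makes the merge work, since any IDs resolving to the same name collapse into one speaker in the viewer (sketch only; `segment.speaker` is the assumed field name):

```javascript
// Sketch: resolve a diarization ID to its display name, falling back to the raw ID.
const name = SPEAKER_LABELS[segment.speaker] ?? segment.speaker;
```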
## Keyboard Shortcuts
| Key | Action |
|-----|--------|
| `Space` | Play/Pause |
| `←` / `A` | Seek back 10 seconds |
| `→` / `D` | Seek forward 10 seconds |
| `Shift` + `←`/`→` | Seek 60 seconds |
## File Structure
@@ -120,53 +95,28 @@ amuta-meetings/
├── input/ # Audio files (gitignored)
├── outputs/
│ └── float32/ # Transcript and waveform JSON
└── plans/
└── waveform-optimization.md # Architecture documentation
```
## Performance Notes
## Transcription with WhisperX (GPU or CPU)
- **Waveform JSON (~20KB)** loads in milliseconds vs decoding 50-100MB audio in 5-15 seconds
- The waveform is loaded immediately on page load for instant display
- Audio is only downloaded once (by the `<audio>` element)
## Transcription with WhisperX (Rented GPU)
For transcribing audio with speaker diarization, use [WhisperX](https://github.com/m-bain/whisperX) on a rented GPU service (e.g., [RunPod](https://runpod.io), [Vast.ai](https://vast.ai), [Lambda Labs](https://lambdalabs.com)).
### Recommended GPU Configuration
- **GPU**: NVIDIA A10, A100, or RTX 4090 (24GB+ VRAM recommended)
- **compute_type**: `float16` (optimal for GPU speed/quality balance)
- **batch_size**: `16-32` (increase for faster processing on high-VRAM GPUs)
For transcribing audio with speaker diarization, we used [WhisperX](https://github.com/m-bain/whisperX) on a rented GPU service (verda); alternatively, book one of tami's P40s or run whisper.cpp on CPU.
### WhisperX CLI Command
```bash
whisperx input.opus \
--model large-v3 \
--language he \
--task transcribe \
--diarize \
--compute_type float16 \
--batch_size 16 \
--device cuda \
--output_format json \
--output_dir ./outputs/float32/ \
--hf_token YOUR_HUGGINGFACE_TOKEN
```
The script below saves the output as JSON and converts it to SRT for quick anima runs.
It was adapted from https://notes.nicolasdeville.com/python/library-whisperx/ to add diarization (see the HuggingFace token note below).
### Key Arguments
| Argument | Description |
|----------|-------------|
| `--model` | Whisper model: `turbo` (fastest), `large-v2`, `large-v3` (best quality) |
| `--language` | Source language code (e.g., `he` for Hebrew, `en` for English) |
| `--model` | Whisper model: `large-v3` (best quality), `turbo` (fastest), `large-v2` |
| `--language` | Source language code (e.g., `en` for English; ISO 639-1 codes) |
| `--diarize` | Enable speaker diarization (requires HuggingFace token) |
| `--compute_type` | `float16` (GPU), `int8` (CPU/low memory), `float32` (highest accuracy) |
| `--batch_size` | Higher = faster but uses more VRAM (16-32 for 24GB GPUs) |
| `--hf_token` | HuggingFace token for PyAnnote diarization models |
| `--min_speakers` / `--max_speakers` | Hint for expected speaker count |
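With `--output_format json` and `--diarize`, each entry in the output's `segments` array carries start/end times, text, and a `speaker` field (the same structure `whisperX-nic.py` below re-chunks into SRT). A hedged sketch of how the viewer might consume it; the grouping logic and `transcriptPath` are illustrative:

```javascript
// Illustrative sketch; assumes the diarized WhisperX JSON shape used by whisperX-nic.py below.
const transcript = await fetch(transcriptPath).then((r) => r.json());

// Group consecutive segments by speaker for display as conversation turns.
const turns = [];
for (const seg of transcript.segments) {
  const speaker = SPEAKER_LABELS[seg.speaker] ?? seg.speaker ?? "Unknown";
  const last = turns[turns.length - 1];
  if (last && last.speaker === speaker) {
    last.end = seg.end;
    last.text += " " + seg.text.trim();
  } else {
    turns.push({ speaker, start: seg.start, end: seg.end, text: seg.text.trim() });
  }
}
```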
### Performance Benchmarks (from [Nic's notes](https://notes.nicolasdeville.com/python/library-whisperx/))

whisperX-nic.py Normal file

@@ -0,0 +1,220 @@
import json
import os
import subprocess
import sys
# Fix cuDNN library path for CUDA support
cudnn_path = os.path.join(
os.path.dirname(os.path.abspath(__file__)),
".venv/lib/python3.10/site-packages/nvidia/cudnn/lib"
)
if os.path.exists(cudnn_path):
original_ld_path = os.environ.get("LD_LIBRARY_PATH", "")
os.environ["LD_LIBRARY_PATH"] = cudnn_path + ":" + original_ld_path
def generate_en_srt(mp4_path):
output_dir = os.path.dirname(mp4_path) or "."
# WhisperX Configuration
LANGUAGE = "en"
VERBOSE = "False"
MODEL = "turbo"
MODEL_CACHE_ONLY = "False"
MODEL_DIR = None
DEVICE = "cuda"
DEVICE_INDEX = "0"
ALIGN_MODEL = "WAV2VEC2_ASR_LARGE_LV60K_960H"
BATCH_SIZE = "16"
COMPUTE_TYPE = "float32"
MAX_LINE_WIDTH = "45"
MAX_LINE_COUNT = "1"
TASK = "transcribe"
INTERPOLATE_METHOD = "nearest"
# NO_ALIGN = "False"
# RETURN_CHAR_ALIGNMENTS = "False"
VAD_METHOD = "pyannote" # or "silero" / # pyannote provides robust, precise segmentation for challenging audio (at higher computational cost) while silero is lighter and faster but may be less accurate in complex scenarios.
VAD_ONSET = "0.500"
VAD_OFFSET = "0.363"
CHUNK_SIZE = "30"
DIARIZE = "True"
# MIN_SPEAKERS = None
# MAX_SPEAKERS = None
TEMPERATURE = "0"
BEST_OF = "5"
BEAM_SIZE = "5"
PATIENCE = "1.0"
LENGTH_PENALTY = "1.0"
SUPPRESS_TOKENS = "-1"
SUPPRESS_NUMERALS = "False"
# INITIAL_PROMPT = None
# CONDITION_ON_PREVIOUS_TEXT = "False"
# FP16 = "True"
TEMPERATURE_INCREMENT_ON_FALLBACK = "0.2"
COMPRESSION_RATIO_THRESHOLD = "2.4"
LOGPROB_THRESHOLD = "-1.0"
NO_SPEECH_THRESHOLD = "0.6"
# HIGHLIGHT_WORDS = "False"
SEGMENT_RESOLUTION = "chunk"
THREADS = "8"
HF_TOKEN = "hf_fdgdfgdfg" #replace with valid hf key, you also need to signin and ask the three linked url for apporval
OUTPUT_FORMAT = "json" # "all" or "json" or "srt" / Output json-only and do post-processing to create a new .srt file with a more reasonable number of words per segment. Or output all formats to get also the clean .txt and delete unneeded files in post-processing.
PRINT_PROGRESS = "False"
cmd = [
"whisperx",
mp4_path,
"--verbose", VERBOSE,
"--model", MODEL,
"--device", DEVICE,
"--device_index", DEVICE_INDEX,
# "--align_model", ALIGN_MODEL, # 250329-1534 removing to try to fix the overlapping segments issue
"--batch_size", BATCH_SIZE,
"--compute_type", COMPUTE_TYPE,
"--max_line_width", MAX_LINE_WIDTH,
"--max_line_count", MAX_LINE_COUNT,
"--language", LANGUAGE,
"--task", TASK,
# "--interpolate_method", INTERPOLATE_METHOD, # 250329-1534 removing to try to fix the overlapping segments issue
# "--no_align", NO_ALIGN,
# "--return_char_alignments", RETURN_CHAR_ALIGNMENTS,
# "--vad_method", VAD_METHOD, # 250329-1534 removing to try to fix the overlapping segments issue
# "--vad_onset", VAD_ONSET, # 250329-1534 removing to try to fix the overlapping segments issue
# "--vad_offset", VAD_OFFSET, # 250329-1534 removing to try to fix the overlapping segments issue
# "--chunk_size", CHUNK_SIZE, # 250329-1534 removing to try to fix the overlapping segments issue
"--diarize", # Flag only, no value - enables speaker diarization
"--hf_token", HF_TOKEN, # Required for diarization
# "--min_speakers", MIN_SPEAKERS,
# "--max_speakers", MAX_SPEAKERS,
# "--temperature", TEMPERATURE,
# "--best_of", BEST_OF,
# "--beam_size", BEAM_SIZE,
# "--patience", PATIENCE,
# "--length_penalty", LENGTH_PENALTY,
# "--suppress_tokens", SUPPRESS_TOKENS,
# "--initial_prompt", INITIAL_PROMPT,
# "--condition_on_previous_text", CONDITION_ON_PREVIOUS_TEXT,
# "--fp16", FP16,
# "--temperature_increment_on_fallback", TEMPERATURE_INCREMENT_ON_FALLBACK,
# "--compression_ratio_threshold", COMPRESSION_RATIO_THRESHOLD,
# "--logprob_threshold", LOGPROB_THRESHOLD,
# "--no_speech_threshold", NO_SPEECH_THRESHOLD,
# "--highlight_words", HIGHLIGHT_WORDS,
# "--segment_resolution", SEGMENT_RESOLUTION,
# "--threads", THREADS,
# "--hf_token", HF_TOKEN,
# "--print_progress", PRINT_PROGRESS,
"--output_dir", output_dir,
"--output_format", OUTPUT_FORMAT
]
# Add boolean flags without values
# if MODEL_CACHE_ONLY.lower() == "true":
# cmd.append("--model_cache_only")
# if VERBOSE.lower() == "true":
# cmd.append("--verbose")
# if SUPPRESS_NUMERALS.lower() == "true":
# cmd.append("--suppress_numerals")
# print(f"\n🔊 Generating 🇬🇧 English SRT for: {os.path.basename(mp4_path)}\n")
subprocess.run(cmd, check=True)
# Determine the output SRT path
base_name = os.path.basename(mp4_path).rsplit(".", 1)[0]
srt_path = os.path.join(output_dir, f"{base_name}.srt")
json_path = os.path.join(output_dir, f"{base_name}.json")
# Post-processing to create a new .srt file with a more reasonable number of words per segment using the json output from whisperx
with open(json_path, "r") as f:
data = json.load(f)
# data["segments"] has a list of segments, each with "words" that have individual timestamps.
new_segments = []
max_words_per_segment = 10
for seg in data["segments"]:
words = seg["words"]
current_chunk = []
for word_info in words:
current_chunk.append(word_info)
if len(current_chunk) >= max_words_per_segment:
# finalize chunk
start_ts = current_chunk[0]["start"]
end_ts = current_chunk[-1]["end"]
text = " ".join([w["word"] for w in current_chunk])
new_segments.append((start_ts, end_ts, text))
current_chunk = []
# leftover words in this segment
if current_chunk:
start_ts = current_chunk[0]["start"]
end_ts = current_chunk[-1]["end"]
text = " ".join([w["word"] for w in current_chunk])
new_segments.append((start_ts, end_ts, text))
# now write new_segments to SRT format:
def srt_time(sec):
"""Convert float seconds to SRT time format (HH:MM:SS,mmm)"""
hours = int(sec // 3600)
minutes = int((sec % 3600) // 60)
seconds = int(sec % 60)
milliseconds = int((sec * 1000) % 1000)
return f"{hours:02d}:{minutes:02d}:{seconds:02d},{milliseconds:03d}"
with open(srt_path, "w") as srt:
for i, (start, end, text) in enumerate(new_segments, start=1):
srt.write(f"{i}\n")
srt.write(f"{srt_time(start)} --> {srt_time(end)}\n")
srt.write(text.strip() + "\n\n")
print(f"\n✅ Generated SRT file: {srt_path}\n")
# Create a clean .txt version of the SRT file
txt_path = os.path.join(output_dir, f"{base_name}.txt")
try:
with open(srt_path, "r") as srt_file, open(txt_path, "w") as txt_file:
for line in srt_file:
line = line.strip()
# Skip empty lines, lines with timestamps (containing '-->'), and lines starting with digits (subtitle numbers)
if line and not line.startswith(tuple('0123456789')) and '-->' not in line:
txt_file.write(line + "\n")
print(f"✅ Generated clean TXT file: {txt_path}")
except Exception as e:
print(f"❌ Error creating TXT file: {str(e)}")
# Delete the .tsv and .vtt files created in the same folder
base_path = os.path.join(output_dir, base_name)
tsv_path = f"{base_path}.tsv"
vtt_path = f"{base_path}.vtt"
if os.path.exists(tsv_path):
try:
os.remove(tsv_path)
# print(f"🗑️ Deleted TSV file: {tsv_path}")
except Exception as e:
print(f"❌ Error deleting TSV file: {str(e)}")
if os.path.exists(vtt_path):
try:
os.remove(vtt_path)
# print(f"🗑️ Deleted VTT file: {vtt_path}")
except Exception as e:
print(f"❌ Error deleting VTT file: {str(e)}")
return srt_path
if __name__ == "__main__":
if len(sys.argv) < 2:
print("Usage: python run-nic.py <audio_file_path>")
sys.exit(1)
audio_path = sys.argv[1]
generate_en_srt(audio_path)