StellaScript — Béranger Thomas

Context

StellaScript was initially developed as part of sovereign AI initiatives at the Metropolis of Lyon, and I have since continued developing it independently. It is an end-to-end audio transcription pipeline designed to ensure data privacy and sovereignty.

Approach

The pipeline chains several specialized modules:

Audio enhancement (optional): noise reduction via DeepFilterNet or vocal source separation via Demucs, to improve input audio clarity
Voice Activity Detection (VAD) with Silero-VAD, to isolate speech segments and prevent transcription model hallucinations
Speaker diarization – two methods available:
- pyannote (default): end-to-end pyannote/speaker-diarization-3.1 pipeline, robust on overlapping speech
- cluster: speaker embeddings extracted by SpeechBrain (ECAPA-TDNN), then grouped via agglomerative clustering on cosine similarity
Transcription via WhisperX (optimized implementation of OpenAI’s Whisper), with timestamping at the block, segment, or word level depending on the chosen mode
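
The cluster diarization method above can be illustrated with a minimal, dependency-free sketch. It assumes the speaker embeddings have already been extracted (in StellaScript they come from SpeechBrain's ECAPA-TDNN) and groups them by average-linkage agglomerative clustering on cosine distance. The function names, the choice of average linkage, and the `threshold` value are assumptions made for illustration, not the project's actual code or settings.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def cluster_speakers(embeddings, threshold=0.5):
    """Average-linkage agglomerative clustering; returns one label per embedding.

    `threshold` is a hypothetical cosine-distance cutoff: merging stops once
    the two closest clusters are farther apart than this value.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters
        dists = [cosine_distance(embeddings[i], embeddings[j])
                 for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the closest pair of clusters
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break
        clusters[a].extend(clusters[b])
        del clusters[b]

    # Number speakers by order of first appearance
    labels = [0] * len(embeddings)
    for label, members in enumerate(sorted(clusters, key=min)):
        for i in members:
            labels[i] = label
    return labels
```

Called on four toy 2-D vectors such as `[[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]`, the function assigns the first two and the last two to separate speakers. Real embeddings have far more dimensions, and a library implementation would be preferable at scale, since this sketch recomputes every pairwise distance on each merge.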

The pipeline runs in live mode (microphone input) or file mode (.wav), with a chunking system that balances transcription quality against latency.
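
How the chunking works is not detailed here, so the following is only a sketch of one common scheme: fixed-length chunks with a small overlap, so that words cut at a boundary appear whole in the next chunk. Longer chunks give the transcription model more context, while shorter ones reduce the delay before text appears. The function name `chunk_bounds` and the default values for `sample_rate`, `chunk_s`, and `overlap_s` are hypothetical.

```python
def chunk_bounds(n_samples, sample_rate=16000, chunk_s=10.0, overlap_s=1.0):
    """Return (start, end) sample indices of overlapping chunks covering the audio."""
    size = int(chunk_s * sample_rate)
    step = size - int(overlap_s * sample_rate)
    if size <= 0 or step <= 0:
        raise ValueError("chunk_s must be positive and larger than overlap_s")

    bounds = []
    start = 0
    while start < n_samples:
        end = min(start + size, n_samples)
        bounds.append((start, end))
        if end == n_samples:
            break  # the last chunk reached the end of the audio
        start += step
    return bounds
```

With the defaults, 25 seconds of 16 kHz audio (400,000 samples) is split into three chunks, each sharing one second with its neighbor.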

Features

100% local: no data leaves the machine after the initial model download (a Hugging Face token is required only for the pyannote models)
Three output modes: block (per-speaker paragraphs, maximum readability), segment (timestamped subtitles), word (word-level timestamps)
Multilingual: all languages supported by Whisper
Optional GPU acceleration via PyTorch CUDA
Open source under MIT license, with full documentation on GitHub Pages
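
To illustrate the block output mode listed above, here is a minimal sketch that merges consecutive segments from the same speaker into per-speaker paragraphs. The dictionary keys (`speaker`, `start`, `end`, `text`) and the function name `to_blocks` are assumptions chosen for the example and may not match StellaScript's internal data structures.

```python
def to_blocks(segments):
    """Merge consecutive same-speaker segments into paragraphs.

    Each segment is a dict with hypothetical keys:
    'speaker', 'start', 'end' (seconds), and 'text'.
    """
    blocks = []
    for seg in segments:
        text = seg["text"].strip()
        if blocks and blocks[-1]["speaker"] == seg["speaker"]:
            # Same speaker keeps talking: extend the current paragraph
            blocks[-1]["text"] += " " + text
            blocks[-1]["end"] = seg["end"]
        else:
            # Speaker change: start a new paragraph
            blocks.append({
                "speaker": seg["speaker"],
                "start": seg["start"],
                "end": seg["end"],
                "text": text,
            })
    return blocks
```

Segment mode would emit the input list as is, one timestamped entry per segment, while block mode trades timestamp granularity for readability.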

Impact

This project demonstrates that professional-quality transcriptions can be produced without relying on cloud services, which makes the approach suitable for contexts with strict confidentiality requirements.