Context
StellaScript was initially developed as part of sovereign AI initiatives at the Métropole de Lyon; I have since continued its development independently. It is an end-to-end audio transcription pipeline designed to ensure data privacy and sovereignty.
Architecture
The pipeline chains several specialized modules:
- Audio enhancement (optional): noise reduction via DeepFilterNet or vocal source separation via Demucs, to improve input audio clarity
- Voice Activity Detection (VAD) with Silero-VAD, to isolate speech segments and prevent transcription model hallucinations
- Speaker diarization, with two methods available:
  - pyannote (default): end-to-end pyannote/speaker-diarization-3.1 pipeline, robust on overlapping speech
  - cluster: speaker embeddings extracted by SpeechBrain (ECAPA-TDNN), then grouped via agglomerative clustering on cosine similarity
- Transcription via WhisperX (optimized implementation of OpenAI's Whisper), with timestamping at the block, segment, or word level depending on the chosen mode
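The cluster diarization method can be illustrated with a small, dependency-free sketch. It assumes each speech segment has already been turned into an embedding vector (in the pipeline, by SpeechBrain's ECAPA-TDNN model). The function name cluster_speakers, the average-linkage rule, and the 0.7 similarity threshold are illustrative assumptions, not StellaScript's actual code or settings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_speakers(embeddings, threshold=0.7):
    """Average-linkage agglomerative clustering on cosine similarity.

    Returns one integer speaker label per embedding. Merging stops once no
    pair of clusters has an average similarity of at least `threshold`
    (a hypothetical value chosen for this toy example).
    """
    # Start with every segment in its own cluster.
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        sims = [cosine_similarity(embeddings[i], embeddings[j])
                for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        best_sim, best_pair = -2.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = linkage(clusters[a], clusters[b])
                if sim > best_sim:
                    best_sim, best_pair = sim, (a, b)
        if best_sim < threshold:
            break  # remaining clusters are treated as distinct speakers
        a, b = best_pair
        clusters[a].extend(clusters[b])
        del clusters[b]

    # Number speakers in order of first appearance.
    labels = [0] * len(embeddings)
    for label, members in enumerate(sorted(clusters, key=min)):
        for i in members:
            labels[i] = label
    return labels


# Toy 2-D "embeddings": segments 0, 1 and 4 point one way, 2 and 3 another.
toy_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [1.0, 0.0]]
speaker_labels = cluster_speakers(toy_embeddings)
print(speaker_labels)  # [0, 0, 1, 1, 0]
```

Real speaker embeddings have hundreds of dimensions, and a library implementation would normally replace the naive pairwise search used here; the stopping rule is what lets the number of speakers emerge from the data instead of being fixed in advance.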
The pipeline runs in live mode (microphone input) or file mode (.wav), with an intelligent chunking system to balance quality and latency.
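The text does not detail how chunking works, so the following is only one plausible sketch of the quality/latency trade-off: speech segments reported by the VAD are grouped into chunks that never exceed a maximum duration, and chunk boundaries always fall in the silence between segments. The function plan_chunks and its max_chunk_s parameter are hypothetical names, not part of StellaScript.

```python
def plan_chunks(speech_segments, max_chunk_s=30.0):
    """Group (start, end) speech segments, in seconds, into chunks.

    A chunk grows until adding the next segment would make it longer than
    `max_chunk_s`; the cut then happens in the silence before that segment.
    A single segment longer than `max_chunk_s` becomes its own chunk.
    """
    chunks = []
    current_start = current_end = None
    for start, end in speech_segments:
        if current_start is None:
            current_start, current_end = start, end
        elif end - current_start <= max_chunk_s:
            current_end = end  # segment still fits in the current chunk
        else:
            chunks.append((current_start, current_end))
            current_start, current_end = start, end
    if current_start is not None:
        chunks.append((current_start, current_end))
    return chunks


vad_segments = [(0.0, 4.0), (5.0, 12.0), (13.0, 29.0), (31.0, 40.0), (41.0, 75.0)]
chunks = plan_chunks(vad_segments, max_chunk_s=30.0)
print(chunks)  # [(0.0, 29.0), (31.0, 40.0), (41.0, 75.0)]
```

Under this scheme, a smaller max_chunk_s would return text sooner in live mode, while a larger one would give the transcription model more context per chunk.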
Features
- 100% local: no data leaves the machine after initial model download (Hugging Face token only required for pyannote)
- Three output modes: block (per-speaker paragraphs, maximum readability), segment (timestamped subtitles), word (word-level timestamps)
- Multilingual: French, English, Spanish, German, and all languages supported by Whisper
- Optional GPU acceleration via PyTorch CUDA
- Open source under MIT license, with full documentation on GitHub Pages
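The difference between the block and segment output modes can be sketched as follows. The segment dictionaries (speaker, start, end, text), the function names, and the timestamp layout are assumptions made for illustration and do not reflect StellaScript's actual data structures or output format.

```python
def format_timestamp(seconds):
    """Render a duration in seconds as HH:MM:SS.mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def to_segment_mode(segments):
    """One timestamped line per segment, subtitle style."""
    return [
        f"[{format_timestamp(s['start'])} - {format_timestamp(s['end'])}] "
        f"{s['speaker']}: {s['text']}"
        for s in segments
    ]


def to_block_mode(segments):
    """Merge consecutive segments from the same speaker into paragraphs."""
    blocks = []
    for seg in segments:
        if blocks and blocks[-1]["speaker"] == seg["speaker"]:
            blocks[-1]["text"] += " " + seg["text"]
            blocks[-1]["end"] = seg["end"]
        else:
            blocks.append(dict(seg))  # copy so the input is not mutated
    return blocks


segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.5, "text": "Hello everyone."},
    {"speaker": "SPEAKER_00", "start": 2.5, "end": 5.0, "text": "Thanks for joining."},
    {"speaker": "SPEAKER_01", "start": 5.2, "end": 7.0, "text": "Happy to be here."},
]

segment_lines = to_segment_mode(segments)
blocks = to_block_mode(segments)

print(segment_lines[0])   # [00:00:00.000 - 00:00:02.500] SPEAKER_00: Hello everyone.
print(blocks[0]["text"])  # Hello everyone. Thanks for joining.
```

Word mode would follow the same idea at a finer granularity, with one start and end time per word.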
Impact
This project demonstrates that professional-quality transcriptions (interviews, focus groups, live presentations) can be produced without relying on cloud services, which helps meet the confidentiality requirements of sensitive contexts.