Béranger Thomas
Python Speech Processing WhisperX Diarization Pyannote SpeechBrain Open Source

StellaScript

Local Python audio transcription pipeline with speaker diarization, usable in real time from a microphone or on audio files. It works offline once the models have been downloaded.

Context

StellaScript was initially developed as part of sovereign AI initiatives at the Métropole de Lyon, and I have continued developing it independently. It is an end-to-end audio transcription pipeline designed to keep data private and under local control.

Architecture

The pipeline chains several specialized modules:

  1. Audio enhancement (optional): noise reduction via DeepFilterNet or vocal source separation via Demucs, to improve input audio clarity
  2. Voice Activity Detection (VAD) with Silero-VAD, to isolate speech segments so that the transcription model is not fed silence or noise, a common cause of hallucinated text
  3. Speaker diarization — two methods available:
    • pyannote (default): end-to-end pyannote/speaker-diarization-3.1 pipeline, robust on overlapping speech
    • cluster: speaker embeddings extracted by SpeechBrain (ECAPA-TDNN), then grouped via agglomerative clustering on cosine similarity
  4. Transcription via WhisperX (optimized implementation of OpenAI’s Whisper), with timestamping at the block, segment, or word level depending on the chosen mode
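The cluster method in step 3 can be illustrated with a minimal sketch. It assumes one embedding vector per speech segment has already been extracted (by ECAPA-TDNN in the real pipeline; short hand-written vectors here), and it groups them by average-linkage agglomerative clustering on cosine distance. The function names, the linkage choice, and the `threshold` value are assumptions made for illustration, not StellaScript's actual code or settings.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def cluster_speakers(embeddings, threshold=0.5):
    """Average-linkage agglomerative clustering on cosine distance.

    Returns one speaker label per embedding. `threshold` is a
    hypothetical stopping distance, not a value taken from the project.
    """
    # Start with every segment in its own cluster.
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        dists = [cosine_distance(embeddings[i], embeddings[j])
                 for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        dist, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if dist > threshold:
            break  # remaining clusters are treated as distinct speakers
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected

    # Number speakers by order of first appearance.
    labels = [0] * len(embeddings)
    for label, members in enumerate(sorted(clusters, key=min)):
        for idx in members:
            labels[idx] = label
    return labels


# Toy 2-D "embeddings": segments 0, 1 and 4 point one way, 2 and 3 another.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05]]
speaker_labels = cluster_speakers(toy_embeddings)
```

Because merging stops at a distance threshold rather than at a fixed cluster count, the number of speakers does not need to be known in advance; the cost is that the threshold has to be tuned for the embedding model in use.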

The pipeline runs in live mode (microphone input) or file mode (.wav input), with a chunking system that balances transcription quality against latency.
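The page does not describe the chunking logic, so the following is only one plausible way to approach it: merge consecutive VAD speech segments into a chunk until either the silence between them is too long or the chunk would exceed a maximum duration. The function name `build_chunks` and both parameters are hypothetical. Longer chunks give the transcription model more context, while shorter ones return text sooner in live mode.

```python
def build_chunks(speech_segments, max_chunk_s=30.0, max_gap_s=0.5):
    """Group (start, end) speech segments, in seconds, into chunks.

    A new chunk starts when the silence since the previous segment
    exceeds max_gap_s, or when adding the segment would make the
    chunk longer than max_chunk_s. Both limits are illustrative.
    """
    chunks = []
    for start, end in speech_segments:
        if chunks:
            chunk_start, chunk_end = chunks[-1]
            fits = end - chunk_start <= max_chunk_s
            close = start - chunk_end <= max_gap_s
            if fits and close:
                # Extend the current chunk to cover this segment.
                chunks[-1] = (chunk_start, end)
                continue
        chunks.append((start, end))
    return chunks


# Four speech segments separated by a short pause, a long pause, a short pause.
segments = [(0.0, 2.0), (2.2, 5.0), (7.0, 9.0), (9.1, 12.0)]
file_chunks = build_chunks(segments)                   # favours context
live_chunks = build_chunks(segments, max_chunk_s=4.0)  # favours latency
```

With the default limits the four segments collapse into two chunks split at the long pause; with the tighter 4-second limit each segment is emitted on its own.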

Features

Impact

This project demonstrates that professional-quality transcriptions of interviews, focus groups, and live presentations can be produced without relying on cloud services, which suits contexts where recordings must not leave the local machine.

View demo → GitHub ↗