Context
Developed during my internship at the Metropolis of Lyon, SmartWatch automates the updating of opening hours for public facilities on the data.grandlyon.com portal. Previously, several technicians from the Metropolitan Data Service performed this check manually for nearly 200 sites (town halls, swimming pools, media libraries).
Architecture
The pipeline combines several technologies:
- Web scraping with Playwright to fetch website content
- Markdown conversion with cleanup of superfluous characters
- Semantic filtering via embeddings to identify relevant sections
- Structured extraction via LLM (devstral) with constrained JSON output
- Comparison with existing data in data.grandlyon.com (OSM format)
- Reporting: standalone HTML report sent by email
Results & Impact
- 75% reduction in manual processing time (from ~4 days to ~1 day)
- Uses sovereign, open-source models (devstral, hosted on local cluster)
- CO2 footprint measurement via CodeCarbon
- Comprehensive documentation (Sphinx) and unit tests (Pytest)
- MIT-licensed open-source project — replicable for other automation use cases
Challenges
- Stabilizing LLM outputs (temperature at 0, fixed seed, refined system prompt)
- Prompt optimisation to avoid hallucinations
- Working around local cluster limits via semantic filtering
- Handling edge cases (pop-ups, conflicting schedules, complex tables)