Béranger Thomas
Python, LLM, Web Scraping, Embeddings, NLP

SmartWatch

Scraping pipeline for the opening hours of public facilities in Lyon. It uses embeddings and an LLM to produce structured, unambiguous output, which is then compared against the data published on data.grandlyon.com.

Context

Developed during my internship at the Metropolis of Lyon, SmartWatch automates updating the opening hours of public facilities on the data.grandlyon.com portal. Previously, several technicians from the Metropolitan Data Service checked these hours by hand for nearly 200 sites (town halls, swimming pools, media libraries).

Architecture

The pipeline combines several technologies:

  1. Web scraping with Playwright to fetch website content
  2. Markdown conversion with cleanup of superfluous characters
  3. Semantic filtering via embeddings to identify relevant sections
  4. Structured extraction via LLM (devstral) with constrained JSON output
  5. Comparison with existing data in data.grandlyon.com (OSM format)
  6. Reporting: standalone HTML report sent by email
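
As a rough illustration of steps 3 and 5, here is a minimal sketch, not the project's actual code. It assumes a bag-of-words `embed` function as a stand-in for a real embedding model, a hypothetical similarity threshold, and opening hours stored as OSM-style `opening_hours` strings; all function names are invented for this example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Stand-in for a real embedding model (assumption): token counts only.
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_sections(sections: list[str], query: str, threshold: float = 0.2) -> list[str]:
    # Step 3: keep only the sections that are close enough to the query.
    # The threshold value is hypothetical and would need tuning.
    query_vec = embed(query)
    return [s for s in sections if cosine(embed(s), query_vec) >= threshold]


def normalize_osm(value: str) -> str:
    # Collapse whitespace around each rule so cosmetic differences are ignored.
    rules = [re.sub(r"\s+", " ", rule).strip() for rule in value.split(";")]
    return "; ".join(rule for rule in rules if rule)


def hours_differ(extracted: str, published: str) -> bool:
    # Step 5: flag a site only when the normalized strings disagree.
    return normalize_osm(extracted) != normalize_osm(published)


sections = [
    "Opening hours: Monday to Friday 9:00 to 17:00",
    "Contact us by phone or email",
    "Legal notice and cookies",
]
relevant = filter_sections(sections, "opening hours monday friday")
changed = hours_differ("Mo-Fr 09:00-17:00;  Sa 09:00-12:00",
                       "Mo-Fr 09:00-17:00; Sa 09:00-12:00")
```

Filtering before the LLM call keeps the prompt short and focused on the text most likely to contain opening hours. A string comparison after normalization is the simplest possible check; a real comparison would probably need to parse the rules, since two different strings can describe the same schedule.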

Results & Impact

Challenges
