SEIN Contract Intelligence

Scraping, OCR and agentic RAG over Peru’s free-market electricity contracts — answers that cite or abstain.

Timeline: Mar 2026 to present
Status: Open source · Local AI platform
Type: Agentic RAG · OCR · Data engineering
$0 marginal cost per question
3 repos: scraper → OCR → RAG
Cite/abstain: every answer, verified

A three-repo platform that turns Osinergmin’s public archive of free-market (SEIN) electricity supply contracts, thousands of scanned PDFs, into a chat interface that answers questions with citations down to document, section and page. It grew out of my energy origination work, where contract intelligence took hours of manual reading per deal. The entire stack runs on a 16 GB Mac mini with open-source models, so the marginal cost per question is zero.

Highlights

  • Incremental acquisition. An async scraper (httpx + Polars) reads the regulator’s contract registry and anti-joins against append-only download logs, so each run fetches only new contracts — bounded concurrency, auditable per-run logs, email summaries.
  • OCR you can re-run blindly. Local vision OCR (GLM-OCR 0.9B primary, Qianfan 4B fallback from a different model family) with per-page checkpoints, SHA-256 content idempotency, quality gates against empty pages and repetition loops, atomic writes, and a JSONL manifest. Output is RAG-ready Markdown with auto-populated YAML frontmatter.
  • Answers that cite or abstain. A LangGraph agent (analyze → retrieve → grade → rewrite → generate → verify) over hybrid Qdrant retrieval — BGE-M3 dense + full-text fused with RRF so exact RUCs and dates survive — plus a groundedness check on every answer and a golden-set eval harness gating any prompt, model or chunking change.
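
As a rough illustration of the fusion step in the last highlight, here is a minimal sketch of Reciprocal Rank Fusion (RRF) over two ranked lists of chunk ids. The function name `rrf_fuse`, the chunk ids and the constant `k = 60` (a commonly used default for RRF) are assumptions made for this example, not code or settings taken from the repos. The point it shows is why a full-text hit on an exact string such as a RUC can survive even when the dense retriever misses it entirely.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of ids with Reciprocal Rank Fusion.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k = 60 is an assumed default here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists: the dense retriever never returns the chunk
# that contains the exact RUC, but the full-text retriever ranks it first.
dense = ["chunk_a", "chunk_b", "chunk_c"]
full_text = ["chunk_ruc", "chunk_a", "chunk_b"]

fused = rrf_fuse([dense, full_text])
print(fused)
```

Chunks that both retrievers agree on rise to the top, while the exact-match chunk still lands in the fused list ahead of a chunk that only the dense retriever ranked low. Because RRF works on ranks rather than raw scores, the dense and full-text scores never need to be put on a common scale.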

Built with

Python · Polars · LangGraph · Qdrant · BGE-M3 · Ollama · FastAPI · React · Cloudflare Tunnel