Technical documentation

(01)

Overview

compete monitors a configurable list of companies and turns changes on their public web presence into structured records called signals. A signal has a type (pricing change, product launch, job posting, and so on), a significance score from 1 to 5, a summary, and a link back to the source page.
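As a rough illustration of that record shape, a signal could be modelled like this. The field names, the `SignalType` members, and the example URL are assumptions made for the sketch; the project's own models may be named and typed differently.

```python
from dataclasses import dataclass
from enum import Enum


class SignalType(str, Enum):
    # Hypothetical subset: the text lists three types and says "and so on".
    PRICING_CHANGE = "pricing_change"
    PRODUCT_LAUNCH = "product_launch"
    JOB_POSTING = "job_posting"


@dataclass(frozen=True)
class Signal:
    type: SignalType
    significance: int  # documented range is 1 to 5
    summary: str
    source_url: str    # link back to the page the signal came from

    def __post_init__(self) -> None:
        # Reject scores outside the documented 1-5 range.
        if not 1 <= self.significance <= 5:
            raise ValueError("significance must be between 1 and 5")


signal = Signal(
    type=SignalType.PRICING_CHANGE,
    significance=4,
    summary="Pro plan price changed on the pricing page.",
    source_url="https://example.com/pricing",
)
```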

The system is a batch pipeline, not a streaming one. Each run collects pages, detects which ones actually changed, sends only the changed content to an LLM for extraction, stores the results in a local DuckDB file, and rebuilds a set of dbt models that the dashboard reads through a FastAPI service.

Everything runs from files on disk. There is no managed database, no queue, and no cloud dependency. A weekly GitHub Actions cron is enough to keep it current, and the whole thing can run on a laptop.

(02)

Architecture

collect            extract                  transform           serve
┌────────────┐     ┌──────────────────┐     ┌─────────────┐     ┌───────────┐
│ httpx      │     │ change gate      │     │ DuckDB raw  │     │ FastAPI   │
│ feedparser │  →  │ LLM (instructor) │  →  │ dbt staging │  →  │ Next.js   │
│ Playwright │     │ + validation     │     │ dbt marts   │     │ dashboard │
│ ATS APIs   │     │ retry            │     │             │     │           │
└────────────┘     └──────────────────┘     └─────────────┘     └───────────┘
      ↓                     ↓
 Parquet landing       raw.llm_calls
 zone (partitioned)    (cost log)

The pipeline stages are decoupled by the database: collection writes raw pages, extraction reads pending pages and writes signals, dbt reads signals and builds marts, and the API is read-mostly, querying marts and raw tables. Any stage can be re-run independently, and a failed run can be retried without touching the others.

A single Typer CLI exposes each stage (compete collect, compete extract, compete report) plus compete run-all, which chains sync, collect, extract, dbt build, and report generation.

(03)

Collection

Each tracked URL declares a source type in competitors.yaml, and a registry dispatches it to the right collector. All four collectors implement the same interface and return a list of raw page records.

Collector	Used for	Implementation
`static`	Plain HTML pages (pricing, about, news indexes)	httpx for fetching, trafilatura for boilerplate removal and main-content extraction
`rss`	Blogs and changelogs that publish feeds	feedparser; each entry becomes its own page record
`dynamic`	Pages that require JavaScript to render	Playwright with headless Chromium, then the same trafilatura pass
`jobs`	Careers pages	Greenhouse, Lever, and Ashby public JSON APIs, with a JSON-LD JobPosting fallback for everything else
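The registry dispatch described above can be sketched as follows. This is a minimal illustration, not the project's code: `RawPage`, `COLLECTORS`, `register`, and `collect` are hypothetical names, and the two collectors are stubs that return canned records instead of fetching anything.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RawPage:
    url: str
    source_type: str
    content: str


# Hypothetical registry: source type name -> collector function.
Collector = Callable[[str], List[RawPage]]
COLLECTORS: Dict[str, Collector] = {}


def register(source_type: str) -> Callable[[Collector], Collector]:
    """Decorator that files a collector under its source type."""
    def wrap(fn: Collector) -> Collector:
        COLLECTORS[source_type] = fn
        return fn
    return wrap


@register("static")
def collect_static(url: str) -> List[RawPage]:
    # Stub: a real collector would fetch the page and strip boilerplate.
    return [RawPage(url, "static", "<main content>")]


@register("rss")
def collect_rss(url: str) -> List[RawPage]:
    # Stub: each feed entry becomes its own page record.
    return [RawPage(f"{url}#entry-{i}", "rss", f"entry {i}") for i in range(2)]


def collect(url: str, source_type: str) -> List[RawPage]:
    """Dispatch a tracked URL to the collector registered for its type."""
    try:
        collector = COLLECTORS[source_type]
    except KeyError:
        raise ValueError(f"unknown source type: {source_type!r}") from None
    return collector(url)


pages = collect("https://example.com/feed.xml", "rss")
```

Because every collector shares one signature and return type, adding a fifth source type in this sketch is a new decorated function and nothing else.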

The collectors respect robots.txt, including Crawl-delay, and apply per-host throttling with retries and exponential backoff. Pages are stored twice: in the DuckDB raw.raw_pages table for the pipeline, and as Parquet files partitioned by competitor and date, which serve as a replayable landing zone if the database ever needs to be rebuilt.
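The politeness rules can be sketched with the standard library alone. The user agent string, the default throttle, and the backoff parameters below are assumptions, and the robots.txt content is inlined so the example needs no network access; the project's actual throttling code may look quite different.

```python
from typing import List
from urllib.robotparser import RobotFileParser

USER_AGENT = "compete-bot"  # hypothetical user agent string

# Example robots.txt content, inlined for the sketch.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def min_interval(default: float = 1.0) -> float:
    """Seconds to wait between requests to this host."""
    delay = robots.crawl_delay(USER_AGENT)
    if delay is None:
        return default
    # Honour Crawl-delay when it is stricter than the default throttle.
    return max(default, float(delay))


def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0) -> List[float]:
    """Exponential backoff schedule: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


allowed = robots.can_fetch(USER_AGENT, "https://example.com/pricing")
blocked = robots.can_fetch(USER_AGENT, "https://example.com/private/plan")
```

A production schedule would usually add random jitter to each delay so that retries against the same host do not line up; that is left out here to keep the schedule easy to read.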

A full collection run across the five seeded companies fetches a few hundred pages. The jobs collector accounts for most of that volume because each posting is a separate record.

(04)

Extraction

Extraction is the only place the system calls an LLM, and the design treats the model as an unreliable component. The client is provider-agnostic: Gemini Flash is the default, with Groq and Ollama as alternatives and a deterministic mock provider used by the test suite and for offline development. Switching providers takes a single environment variable.

Responses are parsed into Pydantic models through the instructor library. If validation fails, the request is retried exactly once with the validation errors appended to the prompt. If it fails again, the page is marked failed and the run continues. There is no silent fallback that fabricates a signal.
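The retry-once policy can be shown without any LLM dependency. The sketch below swaps Pydantic and instructor for a hand-written `validate` function and a fake model, so every name in it (`validate`, `extract_signal`, `fake_model`) is hypothetical; only the control flow mirrors the description above.

```python
import json
from typing import Callable, List, Optional


def validate(raw: str) -> List[str]:
    """Return validation errors; an empty list means the payload is valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["payload must be a JSON object"]
    errors = []
    score = data.get("significance")
    if type(score) is not int or not 1 <= score <= 5:
        errors.append("significance must be an integer from 1 to 5")
    if not data.get("summary"):
        errors.append("summary must be a non-empty string")
    return errors


def extract_signal(prompt: str, call_model: Callable[[str], str]) -> Optional[dict]:
    """Try once, retry once with the errors appended, then give up."""
    raw = call_model(prompt)
    errors = validate(raw)
    if errors:
        feedback = "\n\nYour previous answer was invalid:\n- " + "\n- ".join(errors)
        raw = call_model(prompt + feedback)
        errors = validate(raw)
    # None means "mark the page failed"; no signal is ever fabricated.
    return None if errors else json.loads(raw)


# Fake model for demonstration: wrong on the first call, valid on the second.
replies = iter(['{"significance": 9, "summary": "New tier"}',
                '{"significance": 3, "summary": "New tier"}'])
prompts_seen: List[str] = []


def fake_model(prompt: str) -> str:
    prompts_seen.append(prompt)
    return next(replies)


result = extract_signal("Extract a signal from this page.", fake_model)
```

Returning `None` rather than a default record keeps the failure visible to the caller, which is what lets the run continue while the page is marked failed.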

Every call logs its token counts and estimated cost to raw.llm_calls, so the cost of a run is a query away rather than a guess. A full extraction pass on the seeded dataset stays comfortably inside Gemini's free tier.

(05)

Change detection

Most pages do not change between runs, and sending unchanged content to an LLM is wasted money. Before extraction, each page goes through a two-stage gate:

First, an exact content hash comparison against the previous snapshot. If the hash matches, the page is skipped. Second, for pages that did change, an embedding cosine similarity check filters out trivial edits (dates, footers, rotating links). Only pages below the similarity threshold proceed to the LLM.

The default embedding backend is a hashing vectorizer with no model download and no external calls. MiniLM and Gemini embeddings are available behind the same interface when better semantic accuracy is worth the dependency.
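A minimal version of the two-stage gate, assuming a bag-of-words hashing vectorizer, might look like this. The vector size, the 0.9 threshold, the choice of SHA-256 and MD5, and all function names are assumptions for illustration; the project's real vectorizer and threshold are not specified here.

```python
import hashlib
import math
import re
from typing import List

DIM = 256        # hypothetical vector size
THRESHOLD = 0.9  # hypothetical similarity threshold


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def hashed_vector(text: str, dim: int = DIM) -> List[float]:
    """Hashing vectorizer: each token increments one hashed bucket."""
    vec = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    return vec


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def needs_extraction(previous: str, current: str, threshold: float = THRESHOLD) -> bool:
    # Stage 1: identical content hashes mean nothing changed.
    if content_hash(previous) == content_hash(current):
        return False
    # Stage 2: the page changed, but skip it if the text is still nearly the same.
    return cosine(hashed_vector(previous), hashed_vector(current)) < threshold


base = ("Pro plan costs 49 dollars per seat per month and includes unlimited "
        "projects priority support and single sign on for every workspace member")
old = base + " Last updated 2024"
trivial = base + " Last updated 2025"  # only the date token differs
rewritten = "Enterprise tier now starts at 99 euros billed annually with audit logs"

skip_same = needs_extraction(old, old)
skip_trivial = needs_extraction(old, trivial)
send_rewrite = needs_extraction(old, rewritten)
```

The hash check is cheap and exact, so it runs first; the vector comparison only pays its cost on pages that are already known to differ.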

Idempotency is enforced with a uniqueness constraint on the URL and source hash pair, so re-running extraction on the same data produces zero new signals rather than duplicates. This was verified by running the pipeline twice over identical input.
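The effect of that constraint is easy to reproduce. In the sketch below, `sqlite3` stands in for DuckDB so the example runs on the standard library alone, and the table layout, column names, and `store_signals` helper are hypothetical.

```python
import sqlite3

# sqlite3 is a stand-in for DuckDB here; the schema is illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE signals (
        url         TEXT NOT NULL,
        source_hash TEXT NOT NULL,
        summary     TEXT NOT NULL,
        UNIQUE (url, source_hash)
    )
    """
)


def store_signals(rows):
    """Insert rows, skipping any (url, source_hash) pair already stored.

    Returns the number of rows actually inserted.
    """
    before = conn.execute("SELECT COUNT(*) FROM signals").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO signals (url, source_hash, summary) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM signals").fetchone()[0]
    return after - before


batch = [
    ("https://example.com/pricing", "abc123", "Pro plan price changed"),
    ("https://example.com/blog/launch", "def456", "New product announced"),
]
first_run = store_signals(batch)   # both rows are new
second_run = store_signals(batch)  # identical input, nothing is added
```

Pushing the rule into the schema means the guarantee holds no matter which code path writes signals, rather than depending on every caller checking first.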

(06)

Warehouse

Storage is a single DuckDB file. Raw tables hold pages, signals, detected changes, LLM call logs, and reports. dbt builds staging views on top of raw, then six marts that the API queries:

Mart	What it answers
`dim_competitors`	Who is tracked, with signal counts and metadata
`fct_changes`	Every detected change with a weighted significance score
`fct_hiring`	Open roles parsed from job postings, by location and team
`fct_pricing_history`	Plan prices over time, extracted heuristically from pricing pages
`agg_weekly_competitor`	Weekly activity rollups per competitor
`fct_signal_duplicates`	Near-duplicate signals found via embedding cosine similarity at or above 0.95

The dbt project carries schema tests (not null, accepted values, relationships) on both staging and marts; the full build runs about fifty checks. Choosing DuckDB over Postgres was deliberate: the data is small, the access pattern is analytical, a file is trivially portable, and DuckDB's vector functions (list_cosine_similarity) made the dedup mart a plain SQL model instead of a Python job.

(07)

API

A FastAPI service reads the warehouse and exposes it as JSON. The interesting parts are less the endpoints than the edge handling:

DuckDB allows one writer, so the API holds a read connection and serializes any write through a lock.

Endpoints that read marts degrade gracefully when the marts have not been built yet (fresh clone, no dbt run) instead of returning 500s.

List endpoints share a generic Page[T] envelope with limit and offset, and errors use one envelope shape everywhere.
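A generic pagination envelope of this kind can be sketched with standard-library typing. Only `limit` and `offset` come from the description above; the `items` and `total` fields, the default limit, and the `paginate` helper are assumptions, and the real service would define the envelope as a response model rather than a dataclass.

```python
from dataclasses import dataclass
from typing import Generic, List, Sequence, TypeVar

T = TypeVar("T")


@dataclass
class Page(Generic[T]):
    items: List[T]  # the slice of results for this request
    total: int      # size of the full result set, for building pagers
    limit: int
    offset: int


def paginate(rows: Sequence[T], limit: int = 50, offset: int = 0) -> Page[T]:
    """Wrap one slice of `rows` in the shared envelope."""
    if limit < 1 or offset < 0:
        raise ValueError("limit must be >= 1 and offset must be >= 0")
    return Page(
        items=list(rows[offset:offset + limit]),
        total=len(rows),
        limit=limit,
        offset=offset,
    )


page = paginate(["a", "b", "c", "d", "e"], limit=2, offset=4)
```

With one envelope type, the frontend's typed fetch wrappers can share a single pagination helper across every list view.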

Router	Endpoints
competitors	CRUD for tracked companies
signals, changes	Filterable, paginated lists (competitor, type, significance, date)
analytics	Pricing history, hiring, content cadence, stats overview
reports	List, read, and download weekly briefs as PDF
pipeline	Trigger a collection run as a background task

Interactive docs are served at /docs by FastAPI. The Next.js app proxies /api/* to the service through a rewrite, so the frontend never hardcodes the API origin.

(08)

Frontend

The dashboard is Next.js 14 (App Router) with TypeScript. UI primitives (button, card, dialog, select, tabs) are hand-built in the shadcn style on Radix, charts are Tremor, and server state goes through TanStack Query with typed fetch wrappers. The marketing site and this documentation page live in the same app and share one design system: warm paper and ink surfaces, a single lime accent, Space Grotesk for UI text, and a serif italic for editorial accents.

Pages are deliberately boring in structure: overview, per-competitor detail, a filterable changes table, reports with a PDF download, and settings. Every data view has explicit loading, empty, and error states, which is most of what separates a dashboard that feels finished from one that does not.

Animation is used twice on the marketing side (GSAP scroll reveals, Framer Motion in the hero) and once in the app (a short route transition). Decorative motion is disabled when the user requests reduced motion.

(09)

Reports and alerts

A weekly brief is generated from the marts: top changes by weighted significance, per-competitor activity, and pricing movements. The body of the report is deterministic. When a real LLM provider is configured, an executive summary is generated and prepended; without one, a template summary is used, so report generation never depends on an API key.

PDFs are rendered with fpdf2, which is pure Python. The first implementation used WeasyPrint, which produces nicer output but drags in GTK system dependencies on Windows; that tradeoff was not worth it for a portfolio project that should clone and run anywhere.

Alerts fire when a signal's weighted significance crosses a threshold, through a Slack webhook or SMTP email. Delivery is disabled by default and the alert path logs what it would have sent when no channel is configured, which makes the behavior testable without secrets.
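The dry-run behavior can be sketched as a function that reports what it did. The threshold value, the use of "greater than or equal" for crossing it, the message format, and the `maybe_alert` name are all assumptions; a real `send` callable would post to a Slack webhook or hand the message to SMTP.

```python
import logging
from typing import Callable, List, Optional

logger = logging.getLogger("alerts")  # hypothetical logger name
ALERT_THRESHOLD = 4.0                 # hypothetical threshold value


def maybe_alert(
    summary: str,
    weighted_significance: float,
    send: Optional[Callable[[str], None]] = None,
    threshold: float = ALERT_THRESHOLD,
) -> str:
    """Return "skipped", "dry-run", or "sent" so tests can assert on the outcome."""
    if weighted_significance < threshold:
        return "skipped"
    message = f"[significance {weighted_significance:.1f}] {summary}"
    if send is None:
        # No channel configured: log what would have gone out instead of sending.
        logger.info("alert not delivered (no channel): %s", message)
        return "dry-run"
    send(message)
    return "sent"


outbox: List[str] = []
status_quiet = maybe_alert("Footer date changed", 1.5)
status_dry = maybe_alert("Enterprise price raised", 4.6)
status_sent = maybe_alert("Enterprise price raised", 4.6, send=outbox.append)
```

Passing the channel in as a callable is what makes the path testable without secrets: a test supplies `list.append`, and production supplies the real sender.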

(10)

Operations

Local setup, from a fresh clone:

uv sync --extra dev --extra dbt --extra api   # Python env + deps
uv run compete init-db                        # create the DuckDB schema
uv run python scripts/seed_demo.py --build    # demo data + dbt build
uv run compete-api                            # FastAPI on :8000
cd web && npm install && npm run dev          # dashboard on :3000

Tooling is uv for Python environments, ruff and black for linting and formatting, pytest for the Python suite (collectors, extraction, change detection, API), and ESLint plus a strict TypeScript config on the frontend. A GitHub Actions workflow runs the pipeline weekly and uploads the warehouse file as an artifact.

Running cost is zero by design: GitHub Actions free tier for compute, a file for storage, and an LLM free tier for extraction. The deployment doc in the repo walks through the paid options, should the project ever need them.

(11)

Limitations

Known limitations, in rough order of how much they would matter in production:

Pricing extraction is heuristic. It works on conventional pricing pages and produces sparse history elsewhere; a production system would want per-site adapters or a vision model reading rendered pages.

The API has no authentication. It is built to run on localhost or behind a private proxy, and adding auth was out of scope for a single-user demo.

DuckDB's single-writer model is fine for a batch pipeline plus a read-mostly API, but it would not survive multiple concurrent writers. The Parquet landing zone is the escape hatch: the warehouse can be rebuilt from it into any engine.

Significance scoring is a weighted formula over type and LLM-assigned importance, tuned by eye against the seeded dataset. It has no feedback loop; signals you ignore do not teach it anything yet.

The demo dataset is curated. The collectors are real and run against live sites, but the dashboard you are looking at is seeded so that it demonstrates every feature without waiting a month for history to accumulate.