PHASE 003

Domain Classification

Steps 003a → 003d · Every 30 days · Output: a final verdict (V_CONFIRMED / NON_V_CONFIRMED / etc.) per candidate domain

The Spec

How Phase 003 is supposed to behave.

Duty: For every verified candidate domain, answer the question: “Is this actually a personal-injury law firm?” Use the cheapest signal that resolves the question. Escalate to a smarter (more expensive) model only when the cheap signal is ambiguous.

Schedule: Fires every 30 days as part of the intake cascade, immediately after Phase 002.

End state: Every CID record from Phase 001 has a final_verdict (one of V_CONFIRMED, V_LIKELY, UNCERTAIN, NON_V_LIKELY, NON_V_CONFIRMED, NO_DATA, DNS_DEAD) plus a reason and per-pass audit trail.

What Phase 003 does, plain English

Phase 002 hands over candidate domains tagged with how the site responded (live, 404, cloudflare, etc.) and raw HTML for the ones that responded. Phase 003’s job is to take that and decide which of those domains are actually personal-injury law firms — not paralegals, not real-estate offices that happened to come up in a Maps search, not LLLTs (Limited License Legal Technicians, a license Washington State issues to people who are not attorneys).

It does this in four stages of increasing expense:

  1. 003a — Match the Google Maps category fields (cheap, no AI) against a reference list of 35 known attorney categories.
  2. 003b — For domains DFS couldn’t reach (Cloudflare-blocked, 403, etc.), try again with a real headless browser + stealth. Recovers content from sites that block bots.
  3. 003c — Send the page content to Claude Haiku in 3 escalating passes. Most domains resolve at Pass 1. Only the ambiguous ones cost Pass 2 or Pass 3 (Sonnet).
  4. 003d — Join the domain-grain verdicts back onto the CID-grain table so every Google Business listing has a verdict.

The 4 sub-steps, in order

| Step | What it does | Provider | Reads | Writes |
|------|--------------|----------|-------|--------|
| 003a | Three-field category match against 35 attorney categories from vertical_config.yaml. Tags each CID in_vertical or not_in_vertical. Cleans text fields (title, snippet, address). | (none) | 08_domain_verified.json (002 output) | 09_categorized.json |
| 003b | Playwright headless browser + stealth fetch for domains DFS could not crawl. DNS pre-check first. httpx fallback if Playwright is also blocked. Validates content (real title / 1+ heading / 3+ paragraphs). | Playwright + httpx | tags_only/, events.jsonl (002 output) | overwrites tags_only/{domain}.json + 09b_playwright_results.json |
| 003c | 3-pass escalation classifier — Pass 1 Haiku (light signals), Pass 2 Haiku (+body), Pass 3 Sonnet (full). LLLT pre-classification short-circuit for the attorney vertical. | Anthropic Haiku + Sonnet | tags_only/, 09_categorized.json, 09b_playwright_results.json | 10_haiku_classified.json + enrichment_003c_classification (BQ) |
| 003d | Join domain-grain verdicts onto the CID-grain table. Domains not seen by 003c are marked NO_DATA or DNS_DEAD with a specific reason assembled from upstream signals. | (none) | 09_categorized.json, 10_haiku_classified.json, events.jsonl, 09b_playwright_results.json | 11_gold.json |

The 3-pass escalation classifier (003c)

This is the heart of Phase 003. Each pass uses progressively more signal and more expensive models:

| Pass | Model | Signals sent | Concurrency | Max tokens | Terminal if… |
|------|-------|--------------|-------------|------------|--------------|
| Pass 1 | Haiku | gmap_title, gmap_category, page_title, headings, nav_links | 20 | 60 | V_CONFIRMED only |
| Pass 2 | Haiku | + first 4,000 chars of <p> body | 20 | 80 | V_CONFIRMED or NON_V_CONFIRMED |
| Pass 3 | Sonnet | + full body, meta tags — everything | 5 | 120 | final arbiter — whatever Sonnet says is the answer |
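The control flow above can be sketched as a small loop. This is an illustration of the escalation rule, not the real step_003c code; `call_model` stands in for the actual Anthropic calls.

```python
# Terminal verdicts per pass, matching the table above.
TERMINAL = {
    1: {"V_CONFIRMED"},                     # Pass 1: only a confident yes stops here
    2: {"V_CONFIRMED", "NON_V_CONFIRMED"},  # Pass 2: confident yes or no
}

def classify_domain(domain, call_model):
    """Run passes 1..3, stopping at the first terminal verdict."""
    for pass_no in (1, 2, 3):
        verdict = call_model(domain, pass_no)
        if pass_no == 3 or verdict in TERMINAL[pass_no]:
            return verdict, pass_no  # Pass 3 (Sonnet) is always final

# Demo with a stub model that resolves at Pass 2:
stub = lambda d, p: {1: "UNCERTAIN", 2: "NON_V_CONFIRMED"}.get(p, "UNCERTAIN")
print(classify_domain("example.com", stub))  # → ('NON_V_CONFIRMED', 2)
```

Note how the shape of the rule makes the Finding 2 bug possible: anything that is not terminal, including an "ERROR" string, keeps escalating.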

Before any model call, LLLT pre-classification runs on attorney-vertical records: if the GMB title or page title contains the word “LLLT”, the domain is auto-classified NON_V_CONFIRMED with reason “Limited License Legal Technician, not an attorney.” LLLT in body copy only is ignored (law firms can employ LLLTs).
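A minimal sketch of that gate, assuming the real check_lllt() does a case-insensitive match on the two title fields (the exact matching rules are in step_003c_haiku_classify.py; this substring version is an approximation):

```python
def lllt_preclassify(record):
    """Terminal short-circuit if 'LLLT' appears in either title field."""
    titles = (record.get("gmap_title", ""), record.get("page_title", ""))
    if any("lllt" in t.lower() for t in titles):
        return ("NON_V_CONFIRMED",
                "Limited License Legal Technician, not an attorney")
    return None  # body-copy mentions are deliberately ignored

print(lllt_preclassify({"gmap_title": "Jane Doe, LLLT"})[0])  # NON_V_CONFIRMED
print(lllt_preclassify({"page_title": "Smith Law PLLC",
                        "body": "our team includes an LLLT"}))  # None
```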

Model IDs are resolved at runtime by querying https://api.anthropic.com/v1/models and picking the newest haiku and sonnet. Then each model gets verified with a 1-token ping. No hardcoded model strings. (This rule exists because two prior runs failed when hardcoded model IDs were retired by Anthropic.)
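A sketch of that resolution step. The endpoint URL and headers are real Anthropic API shapes; the "pick the newest by created_at" logic is an illustration of the rule, not the pipeline's exact code.

```python
import json
import urllib.request

API_URL = "https://api.anthropic.com/v1/models"

def fetch_models(api_key):
    """GET the model list; each entry carries an 'id' and a 'created_at'."""
    req = urllib.request.Request(API_URL, headers={
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",
    })
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["data"]

def newest(models, family):
    """Newest model whose id contains the family substring.
    ISO-8601 created_at strings sort correctly as plain strings."""
    matches = [m for m in models if family in m["id"]]
    return max(matches, key=lambda m: m["created_at"])["id"]

# Usage (network + ANTHROPIC_API_KEY_B_SERIES required):
#   models = fetch_models(os.environ["ANTHROPIC_API_KEY_B_SERIES"])
#   haiku_id, sonnet_id = newest(models, "haiku"), newest(models, "sonnet")
```

The 1-token verification ping would follow, one Messages call per resolved ID.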

How the data moves (and where it lives)

  1. Input · Phase 002 deliverable: 08_domain_verified.json + tags_only/ + events.jsonl. The CID list with crawl_class plus raw page tags per domain.
  2. 003a · category match (step_003a_category_match.py): tags each CID record in_vertical / not_in_vertical. Writes 09_categorized.json.
  3. 003b · playwright fetch (step_003b_playwright_fetch.py): for DFS-blocked domains, Playwright + stealth → httpx fallback. Overwrites tags_only/{domain}.json.
  4. 003c · 3-pass classifier (step_003c_haiku_classify.py): Pass 1 Haiku → Pass 2 Haiku → Pass 3 Sonnet. Writes 10_haiku_classified.json + BQ enrichment_003c_classification.
  5. 003d · join to CID grain (step_003d_join_gold.py): joins domain verdicts back onto the CID-grain table from 003a.
  6. Output · the deliverable: 11_gold.json. CID-grain, every record has a final_verdict (V_CONFIRMED / V_LIKELY / UNCERTAIN / NON_V_LIKELY / NON_V_CONFIRMED / NO_DATA / DNS_DEAD).

Hands off to Phase 004 — Specialties (out of scope for this page).

Where to look — file & table reference

| Thing | Path or table |
|-------|---------------|
| The 4 scripts | /mnt/workspace/amicus/pipeline/steps/003_domain_classification/step_003*.py |
| Reference attorney categories | pipeline/steps/verticals/attorney/config.yaml (35 categories) — fallback CSV dfs_law_categories_combined.csv |
| 003a output | output/<profile_id>/09_categorized.json |
| 003b manifest | output/<profile_id>/09b_playwright_results.json |
| 003c output | output/<profile_id>/10_haiku_classified.json |
| Phase deliverable | output/<profile_id>/11_gold.json (CID-grain, every record has final_verdict) |
| BQ table (domain-grain) | amicus_pipeline.enrichment_003c_classification |
| Per-step logs | pipeline/steps/000_log_files/step_003*_*.log |
| Anthropic key | ANTHROPIC_API_KEY_B_SERIES in .env |

Cost per fire

003a, 003b, and 003d cost nothing in API spend. 003c is the only step that calls a paid model. Cost varies based on how many domains escalate past Pass 1 — ambiguous markets cost more than clear-cut ones.

| Line item | Volume | Per unit | Subtotal |
|-----------|--------|----------|----------|
| 003a — category match (Python, no external API) | | | $0.00 |
| 003b — Playwright fetch (VM compute only, ~4 concurrent workers) | ~20-40 blocked | | $0.00 |
| 003c — Pass 1 Haiku (~$1/$5 per M tokens, ~600 in / 30 out per call) | ~3,004 | ~$0.001 | ~$3.00 |
| 003c — Pass 2 Haiku (escalations from Pass 1, larger context — assume ~50%) | ~1,500 | ~$0.002 | ~$3.00 |
| 003c — Pass 3 Sonnet ($3/$15 per M tokens, full context — assume ~15% of Pass 1) | ~450 | ~$0.020 | ~$9.00 |
| 003d — join (Python, no external API) | | | $0.00 |
| Total per 30-day fire | | | ~$15 |
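The table's arithmetic checks out under its own assumptions (~50% of Pass 1 escalates to Pass 2, ~15% reaches Pass 3):

```python
# Back-of-envelope check of the cost table, using its stated volumes
# and per-call unit costs.
pass1_calls = 3004
pass2_calls = round(pass1_calls * 0.50)   # ~1,502 escalations
pass3_calls = round(pass1_calls * 0.15)   # ~451 Sonnet calls
total = pass1_calls * 0.001 + pass2_calls * 0.002 + pass3_calls * 0.020
print(f"~${total:.0f} per 30-day fire")   # ~$15
```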

Schedule

Frequency: Every 30 days, once per intake cycle (the same cycle as Phases 001 + 002).
Trigger: Runs immediately after Phase 002 in the intake cascade; not a separate cron.
Execution mode: Sequential 003a → 003b → 003c → 003d. 003c is async-parallel within itself.
Concurrency: 003b runs 4 Playwright workers; 003c runs 20 concurrent Haiku calls and 5 concurrent Sonnet calls.
Model resolution: Runtime query to /v1/models — no hardcoded model IDs ever.
Output guarantee: Every CID record gets a verdict. Domains 003c never saw get NO_DATA or DNS_DEAD with a specific reason.
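The async-parallel behavior inside 003c amounts to a bounded fan-out. A sketch of the pattern, with caps matching the spec (20 Haiku, 5 Sonnet); the function names are illustrative, not the pipeline's actual code:

```python
import asyncio

async def classify_all(domains, call_model):
    """Bounded fan-out: per-tier semaphores cap in-flight model calls."""
    limits = {"haiku": asyncio.Semaphore(20), "sonnet": asyncio.Semaphore(5)}

    async def one(domain):
        async with limits["haiku"]:            # Pass 1/2 tier shown here
            return await call_model(domain, "haiku")

    # gather() preserves input order in its results
    return await asyncio.gather(*(one(d) for d in domains))

# Demo with a stub model call:
async def stub(domain, tier):
    await asyncio.sleep(0)
    return (domain, "V_CONFIRMED")

print(asyncio.run(classify_all(["a.com", "b.com"], stub)))
```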
What's Fucked

Phase 003 is not running the spec. Here’s exactly how.

Finding 1 — All four 003 steps are tagged intake. None auto-fire.

pipeline/steps/cadence.py tags 003a, 003b, 003c, and 003d as intake. The intake tier has runs_on: [] — never auto-fires. Same root-cause gap as the rest of intake.

| Step | Cadence tag | Auto-fires? |
|------|-------------|-------------|
| 003a | intake | never |
| 003b | intake | never |
| 003c | intake | never |
| 003d | intake | never |

Finding 2 — 003c is where the 2026-04-06 cascade burned real money.

The manual fire on 2026-04-06 made it as far as 003c before someone killed it. pipeline_runs shows failed_step=003c on multiple runs that day, and api_cost_log shows Anthropic spend that hour clustered on 003c retries.

Root cause: 3-pass classifier with no per-domain failure budget. If a Pass 1 call fails (network blip, rate limit), it returns "ERROR" as the verdict, which is not in {V_CONFIRMED}, so the record escalates to Pass 2. Pass 2 also fails, escalates to Pass 3 (Sonnet, 5x cost). Sonnet returns "ERROR", which becomes the final_verdict. A single transient failure costs Pass 1 + Pass 2 + Pass 3 instead of just Pass 1.

When that fails for many domains at once (e.g. an Anthropic-side hiccup), you escalate the entire batch to Sonnet and burn 5x the budget for zero useful classifications.

Finding 3 — 003d writes only to disk. The gold_domains BQ table is written by Phase 004b, not Phase 003.

The pipeline overview page’s older cards described 003d as “promote to gold_domains.” That language is misleading. The actual step_003d_join_gold.py writes only to 11_gold.json on disk. It has no BQ writer import.

The amicus_pipeline.gold_domains table does get populated — but by step_004b_join_gold.py at the very end of Phase 004, not by Phase 003. (See Phase 004 for the actual write path: gold_domains + gold_cids + gold_paid are split out and written in one place at the close of intake.)

Practical implication for Phase 003: 003 has no BQ deliverable on its own. 003c writes enrichment_003c_classification, 003d only writes 11_gold.json. If 004 never runs, the verdicts are stuck on the VM disk with no BQ-side reflection.

Finding 4 — enrichment_003c_classification BQ write is gated on PIPELINE_RUN_ID.

Same gotcha as Phase 002. step_003c_haiku_classify.py at line 922 only writes to BigQuery if PIPELINE_RUN_ID is set in the environment:

if run_id and all_records:
    ...
elif not run_id:
    print("BQ write skipped: no PIPELINE_RUN_ID (standalone mode)")

Manual SSH fires almost never set this variable. So every manual 003c run since the orchestrator broke has silently skipped its BQ write — the data lives only on the VM disk as 10_haiku_classified.json.

Finding 5 — Input dependencies are files, not BQ tables.

003a reads 08_domain_verified.json. 003b reads tags_only/ and events.jsonl. 003c reads 09_categorized.json + tags_only/ + (optionally) 09b_playwright_results.json. 003d reads four different JSON files.

That means Phase 003 cannot be re-run independently of Phase 002’s on-disk output. If the VM disk gets wiped, or the output dir gets cleared, the entire intake cascade must rerun from 001 — even if BQ tables are fully populated. There is no “rerun classification from BQ” path.

Finding 6 — LLLT pre-classification only catches LLLT in the title.

The check_lllt() function (line 363 of step_003c_haiku_classify.py) deliberately checks only gmap_title and page_title — not body copy. That decision is documented (law firms may employ LLLTs without being LLLT-only practices), but it means LLLT-only practices that don’t put LLLT in their business name slip through to Haiku, which then has to figure it out from context. Haiku usually gets it right, but it’s a known leak.

The bottom line

Where Phase 003 Stands Today

Phase 003 has not produced a fresh verdict set since whenever Phase 002 last produced fresh tags_only files. When it last ran manually (2026-04-06), it cascaded retry failures into Sonnet calls and burned the Anthropic budget. The BQ deliverable enrichment_003c_classification only populates when the orchestrator sets PIPELINE_RUN_ID, which has rarely happened recently. The pipeline-wide gold_domains table that older docs attribute to Phase 003 is actually written at the end of Phase 004 — so if 004 doesn’t run, none of Phase 003’s verdicts ever land in BQ.

The Fix

What we’ll do to make Phase 003 match the spec.

Six fixes. One (Fix 3) lands automatically with Phase 001 Fix 3. The dedicated 003-specific work is the failure-budget guard (Fix 1) and the gold_domains docs correction (Fix 2).

FIX 1 Add a per-domain failure budget to 003c. ~45 min

Before any future intake fire, the 3-pass classifier needs a budget guard. Specifically:

  • If call_model() returns verdict="ERROR", do not auto-escalate. Treat ERROR as a terminal state for that domain. Mark final_verdict ERROR with the underlying exception.
  • Cap Pass 3 (Sonnet) at 15% of total candidate count by default. If Pass 2 produces more escalations than that, stop, log the issue, and write the partial result. A run where 50% escalate to Sonnet means something is wrong upstream, not that Sonnet is needed.
  • Surface the Anthropic API cost in real time — print cumulative cost after every 50 calls so an operator can kill the run before it spirals.

This is the single highest-leverage fix — it prevents the 2026-04-06 cascade from repeating.
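A sketch of the guard under the assumptions in the bullets above: ERROR is terminal, and Sonnet escalations are capped at 15% of candidates. One simplification: instead of halting the run when the cap is hit, this version marks the overflow UNCERTAIN and keeps the partial result; the names are illustrative, not the real step_003c code.

```python
SONNET_CAP_FRACTION = 0.15  # Fix 1 default: cap Pass 3 at 15% of candidates

def run_with_budget(domains, call_model):
    """ERROR never escalates; Sonnet (Pass 3) escalations are capped."""
    verdicts, sonnet_used = {}, 0
    sonnet_cap = max(1, int(len(domains) * SONNET_CAP_FRACTION))
    for d in domains:
        v = call_model(d, 1)
        if v in {"ERROR", "V_CONFIRMED"}:
            verdicts[d] = v                  # ERROR is terminal at Pass 1
            continue
        v = call_model(d, 2)
        if v in {"ERROR", "V_CONFIRMED", "NON_V_CONFIRMED"}:
            verdicts[d] = v                  # ...and at Pass 2
            continue
        if sonnet_used >= sonnet_cap:
            verdicts[d] = "UNCERTAIN"        # budget exhausted: no Sonnet call
            continue
        sonnet_used += 1
        verdicts[d] = call_model(d, 3)
    return verdicts
```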

FIX 2 Update the overview docs — Phase 003 produces 11_gold.json, not gold_domains. ~5 min

gold_domains is Phase 004’s output, not Phase 003’s. Phase 003’s deliverable is 11_gold.json on disk (CID-grain, every record has a final_verdict).

The fix here is purely documentation:

  • The pipeline overview card for 003d should say “Join → CID grain” or “Write 11_gold.json” — not “Promote to gold_domains.”
  • Any orchestrator log message or comment that says “003 writes gold_domains” should be reworded to point at Phase 004 for the BQ-side gold tables.

No code logic change in Phase 003. The actual BQ gold_domains write already exists — it just lives at the close of Phase 004 (see Phase 004).

FIX 3 Inherit the Phase 001 intake cron. no work — lands with Phase 001 Fix 3

Same as Phase 002. Once Phase 001 Fix 3 installs the 0 10 1 * * intake crontab entry running orchestrator.py --cadence intake, all four 003 steps fire in topological order along with the rest of intake. No 003-specific cron line.

Verify the orchestrator's topological order is 001 → 002 → 003a → 003b → 003c → 003d → 004. If 003c fires before 003b, the Playwright recovery never happens and a chunk of domains gets classified without the recovered content.

FIX 4 Smoke-test 003c on a known set with PIPELINE_RUN_ID set. ~15 min + ~10 min model wait

Pick a small set of pre-resolved candidate domains (5 known PI firms + 5 known non-firms) from a recent 09_categorized.json. Run:

PIPELINE_RUN_ID=smoke_2026_05_16 \
PIPELINE_PROFILE_ID=atty_wa_seattle \
PIPELINE_VERTICAL=attorney \
python step_003c_haiku_classify.py

Verify:

  • All 5 known firms get V_CONFIRMED or V_LIKELY
  • All 5 known non-firms get NON_V_CONFIRMED or NON_V_LIKELY
  • 10 rows land in enrichment_003c_classification with run_id=smoke_2026_05_16
  • Total cost is < $0.10 (sanity bound — 10 domains should never cost more)

FIX 5 Tighten 003b’s blocked-class set. ~15 min

003b currently routes "unknown" and "auth_required" into the BLOCKED_CRAWL_CLASSES set, meaning it spawns Playwright workers for them. For genuinely dead sites that DFS classified as "unknown", this wastes browser time.

Action: review the manifest from a real run after Fix 4 lands. For each crawl_class in the BLOCKED set, measure the Playwright recovery rate:

  • If <10% recover, move that class to the DEAD set
  • If 10-30% recover, leave it
  • If >30% recover, it’s a high-value class — leave it

Should cut Playwright wall-time by 20-40% with no loss of classification quality.
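The measurement itself is a two-line aggregation. A sketch, assuming the 09b_playwright_results.json manifest holds per-domain records with a "crawl_class" and a boolean-ish "recovered" field (field names are guesses; check the real manifest first):

```python
from collections import Counter

def recovery_rates(manifest):
    """Per-crawl_class Playwright recovery rate from the 003b manifest."""
    attempts, wins = Counter(), Counter()
    for rec in manifest:
        attempts[rec["crawl_class"]] += 1
        wins[rec["crawl_class"]] += bool(rec.get("recovered"))
    return {cls: wins[cls] / attempts[cls] for cls in attempts}

def triage(rate):
    """Apply the <10% threshold from the bullets above."""
    return "move to DEAD set" if rate < 0.10 else "leave in BLOCKED set"
```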

FIX 6 Add a fail-loud verification at the end of 003d. ~20 min

After 11_gold.json is written, assert:

  • Every CID record has a non-empty final_verdict field (no blanks)
  • Count of V_CONFIRMED + V_LIKELY > 30% of total CID records (a real market is never <30% confirmed)
  • Count of ERROR verdicts is 0 (Fix 1 catches this earlier, but check again)

If any assertion fails, exit non-zero. Same pattern as Phase 002 Fix 4 and Phase 001 Fix 6.
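A sketch of the gate, assuming 11_gold.json is a list of CID records each carrying a "final_verdict" string (adjust to the real schema in step_003d_join_gold.py):

```python
from collections import Counter

def verify_gold(records):
    """Return failed-check messages; an empty list means the gate passes."""
    problems = []
    counts = Counter(r.get("final_verdict", "") for r in records)
    if counts[""]:
        problems.append(f"{counts['']} records with blank final_verdict")
    confirmed = counts["V_CONFIRMED"] + counts["V_LIKELY"]
    if not records or confirmed / len(records) <= 0.30:
        problems.append("V_CONFIRMED + V_LIKELY share is <= 30% of CIDs")
    if counts["ERROR"]:
        problems.append(f"{counts['ERROR']} ERROR verdicts slipped through")
    return problems

# In step_003d, after writing 11_gold.json:
#   problems = verify_gold(json.load(open(gold_path)))
#   if problems: print(*problems, sep="\n"); sys.exit(1)  # fail loud
```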

After all 6 fixes

The intake cron from Phase 001 Fix 3 fires on the 1st of every month. The orchestrator runs 001 → 002 → 003a → 003b → 003c → 003d in order. The Pass 1 / Pass 2 / Pass 3 cascade resolves clear cases cheaply, escalates only the ambiguous ones to Sonnet, and stops cold if too many domains escalate. Every CID gets a verdict in enrichment_003c_classification (domain-grain) and 11_gold.json (CID-grain). The gold_domains BQ table remains Phase 004's responsibility, and the docs now say so plainly (Fix 2).

Then we move on to Phase 004.