pipeline.amicusdata.dev — Phase 004: Specialties

PHASE 004

Specialties

Steps 004a → 004b · Every 30 days · Closes intake · Output: gold_domains, gold_cids, gold_paid BQ tables

The Spec

How Phase 004 is supposed to behave.

Phase 004 — Specialties. Design intent. Reference, not reality.

Duty: For every confirmed firm from Phase 003, identify which kinds of law they practice (personal injury, family law, criminal defense, etc.). Then join everything — verdicts, specialties, page content, GMaps signals — into the final gold layer that every weekly/monthly/biweekly enrichment phase reads from.

Schedule: Fires every 30 days as the final step of intake. After 004 completes, intake is done for that cycle.

End state: Three BigQuery tables exist with fresh rows for this intake run: gold_domains (one row per firm), gold_cids (one row per Google Business listing), gold_paid (one row per paid ad seen on the SERP).

What Phase 004 does, plain English

Phase 003 told us which domains are law firms. Phase 004 answers two follow-up questions:

What do they practice? A firm that says “personal injury” on its homepage gets tagged personal_injury_lawyer as primary specialty, and any other practice areas it advertises become secondaries. The classifier is constrained to a fixed list of 24 attorney specialties from the vertical config — it cannot invent its own categories.
How do we publish all this? The final gold layer combines every signal we have — the Maps SERP row (CID-grain), the verified domain status, the Haiku verdict, the specialties, the search keyword we’d use to find them on Google — into three BigQuery tables that downstream phases read.

Phase 004 is also where false positives from Phase 003 get caught. If 003 confirmed a domain as a law firm but the Google Maps category says “Certified Public Accountant”, 004b overrides the verdict to NON_V_OVERRIDE. The classifier got fooled by lawyer-adjacent content; the GMaps category is the trusted human-verified signal.

The 2 sub-steps, in order

Step	What it does	Provider	Reads	Writes
`004a`	Single-pass Haiku specialty classifier on every V_CONFIRMED domain. Returns `primary_specialty`, `secondary_specialties`, `declared_practice_areas`, `firm_summary`. Remaps invented categories to the closest valid one.	Anthropic Haiku	`tags_only/`, `11_gold.json`	`c1_specialties.json` + BQ `enrichment_004a_specialties`
`004b`	Joins every enrichment onto the CID-grain table. Applies `NON_V_OVERRIDE` post-filter. Splits into 3 BQ tables: `gold_domains`, `gold_cids`, `gold_paid`. Maps each primary specialty to a search keyword for downstream Maps SERP fetches.	—	`09_categorized.json`, `10_haiku_classified.json`, `c1_specialties.json`, `events.jsonl`, `09b_playwright_results.json`	`13_gold.json` + BQ `gold_domains` + `gold_cids` + `gold_paid`

The 24 attorney specialties (004a’s output space)

004a’s system prompt is built dynamically from the vertical config (pipeline/profiles/_configs/attorney.yaml). If the config cannot be loaded, 004a falls back to a hardcoded list. For the attorney vertical, that fallback contains the 24 categories below, and Haiku must pick primary_specialty from this set:

general_practice_attorney · family_law_attorney · criminal_law_attorney · tax_attorney · personal_injury_lawyer · divorce_attorney · real_estate_attorney · estate_planning_attorney · civil_law_attorney · trial_attorney · employment_attorney · labor_relations_attorney · administrative_attorney · immigration_attorney · bankruptcy_attorney · social_security_attorney · insurance_attorney · patent_attorney · elder_law_attorney · business_attorney · medical_lawyer · estate_litigation_attorney · probate_attorney · environmental_attorney

Category remap dict: when Haiku invents a plausible-but-out-of-set category (e.g. wrongful_death_attorney, commercial_litigation_attorney, workers_compensation_attorney), 004a remaps to the closest valid one (e.g. personal_injury_lawyer, business_attorney, labor_relations_attorney). 14 such remaps are hardcoded.
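A minimal sketch of the remap step described above. Only the three remap pairs named on this page are included (the real dict reportedly has 14), and pairing the last two in the order they are listed is an assumption. The helper name `normalize_category`, the lowercasing, and returning `None` for an unmapped category are all hypothetical choices, not confirmed behavior of step_004a_specialties.py.

```python
from typing import Optional

# The 24 fallback categories listed on this page.
VALID_CATEGORIES = frozenset({
    "general_practice_attorney", "family_law_attorney", "criminal_law_attorney",
    "tax_attorney", "personal_injury_lawyer", "divorce_attorney",
    "real_estate_attorney", "estate_planning_attorney", "civil_law_attorney",
    "trial_attorney", "employment_attorney", "labor_relations_attorney",
    "administrative_attorney", "immigration_attorney", "bankruptcy_attorney",
    "social_security_attorney", "insurance_attorney", "patent_attorney",
    "elder_law_attorney", "business_attorney", "medical_lawyer",
    "estate_litigation_attorney", "probate_attorney", "environmental_attorney",
})

# Illustrative subset: only the pairs this page mentions.
CATEGORY_REMAP = {
    "wrongful_death_attorney": "personal_injury_lawyer",
    "commercial_litigation_attorney": "business_attorney",      # pairing assumed
    "workers_compensation_attorney": "labor_relations_attorney",  # pairing assumed
}


def normalize_category(raw: str) -> Optional[str]:
    """Return a valid category, its remap target, or None if neither applies."""
    key = raw.strip().lower()
    if key in VALID_CATEGORIES:
        return key
    # Hypothetical choice: surface unmapped inventions as None so they can be logged.
    return CATEGORY_REMAP.get(key)
```

Returning `None` rather than silently defaulting makes unmapped inventions visible, which is the measurement Finding 5 asks for.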

The post-filter (004b’s defense against 003 false positives)

If 003c said V_CONFIRMED but the Google Maps category is in NON_VERTICAL_CATEGORIES, 004b overrides the verdict to NON_V_OVERRIDE. The 5 hardcoded categories are:

certified public accountant · accountant · doctor · driving school · educational consultant

A CPA office whose website mentions “tax law” sometimes fools 003c. The GMaps category is human-curated and reliable — if it says “Certified Public Accountant,” the firm is a CPA, not a tax attorney.
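The override rule can be sketched as a small pure function. This is an illustration under assumptions, not the code in step_004b_join_gold.py: the record field names `final_verdict` and `category` are borrowed from the gold_domains query later on this page, and the case-insensitive comparison is a guess at how GMaps category strings are normalized.

```python
# The five categories listed above, lowercased for comparison (normalization assumed).
NON_VERTICAL_CATEGORIES = frozenset({
    "certified public accountant",
    "accountant",
    "doctor",
    "driving school",
    "educational consultant",
})


def apply_post_filter(record: dict) -> dict:
    """Return a copy of the record with NON_V_OVERRIDE applied when the rule matches."""
    verdict = record.get("final_verdict")
    category = (record.get("category") or "").strip().lower()
    if verdict == "V_CONFIRMED" and category in NON_VERTICAL_CATEGORIES:
        # 003 confirmed the domain, but the GMaps category says it is not a law firm.
        return {**record, "final_verdict": "NON_V_OVERRIDE"}
    return record
```

Only V_CONFIRMED rows are touched, so other verdicts pass through unchanged even if their category is on the list.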

The search_keyword map (input for Phase 008/010)

Every gold record gets a search_keyword field derived from its primary_specialty. Downstream phases (e.g. 008a Maps SERP grid, 010 keyword tracking) use this string verbatim when querying DataForSEO. The 21-entry map looks like:

primary_specialty	search_keyword used downstream
`personal_injury`	personal injury lawyer
`family_law`	divorce lawyer
`criminal_defense`	criminal defense lawyer
`business_law`	business lawyer
`estate_planning`	estate planning lawyer
… (16 more)
fallback when no specialty	lawyer

Mismatch warning: the keys in the search_keyword map (e.g. personal_injury) do not match the 24-category fallback list above (which uses personal_injury_lawyer). At runtime the category list comes from the vertical config (pipeline/profiles/_configs/attorney.yaml); whether that file uses the short or the long form is still unverified. See forensic Finding 3 below.
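To make the mismatch concrete, here is a hedged sketch of the lookup. It contains only the five map entries shown in the table (the other 16 are elided on this page and are not guessed here), and the function name `search_keyword_for` is hypothetical.

```python
# Only the five entries shown above; the real map reportedly has 21.
SPECIALTY_TO_SEARCH_KEYWORD = {
    "personal_injury": "personal injury lawyer",
    "family_law": "divorce lawyer",
    "criminal_defense": "criminal defense lawyer",
    "business_law": "business lawyer",
    "estate_planning": "estate planning lawyer",
}
FALLBACK_KEYWORD = "lawyer"


def search_keyword_for(primary_specialty):
    """Map a primary specialty to its search keyword, defaulting to the generic fallback."""
    if not primary_specialty:
        return FALLBACK_KEYWORD
    return SPECIALTY_TO_SEARCH_KEYWORD.get(primary_specialty, FALLBACK_KEYWORD)
```

With these keys, `search_keyword_for("personal_injury")` resolves to the specific keyword, while the long-form `"personal_injury_lawyer"` misses the map and falls back to `"lawyer"`.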

How the data moves (and where it lives)

Input · Phase 003 deliverable

11_gold.json

CID-grain, with final_verdict
per Phase 003

Input · Phase 002 cache

tags_only/{domain}.json

Page HTML skeleton for
V_CONFIRMED domains

↓

004a · single-pass Haiku

step_004a_specialties.py

For each V_CONFIRMED domain:
primary + secondary + declared + summary
Writes c1_specialties.json
+ BQ enrichment_004a_specialties

↓

004b · join + post-filter + split

step_004b_join_gold.py

Apply NON_V_OVERRIDE post-filter
Null specialties on dead sites
Compute search_keyword
Split CID-grain into 3 BQ tables

↓

Output · gold_domains

amicus_pipeline.gold_domains

one row per unique domain,
with verdict + specialty + cid_count

Output · gold_cids

amicus_pipeline.gold_cids

one row per Google Business listing
(may be multiple per domain)

Output · gold_paid

amicus_pipeline.gold_paid

paid ad rows from the SERPs
(usually small)

↓

Hands off to

Phases 005 → 010 — enrichment tiers

Every monthly/weekly/biweekly enrichment
reads from gold_domains / gold_cids
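The 004b split in the flow above can be pictured as a roll-up from CID-grain records to one row per domain. The sketch below is an assumption-heavy illustration: the field names (`domain`, `cid`, `final_verdict`, `primary_specialty`) and the rule of taking the first record's verdict for a domain are hypothetical, and gold_paid is left out because this page does not say how paid rows are identified.

```python
def split_gold(cid_records):
    """Return (gold_domains, gold_cids) built from CID-grain records."""
    domains = {}
    for rec in cid_records:
        domain = rec.get("domain")
        if not domain:
            continue  # a listing without a website cannot roll up to a domain row
        row = domains.setdefault(domain, {
            "domain": domain,
            "final_verdict": rec.get("final_verdict"),        # first record wins (assumed)
            "primary_specialty": rec.get("primary_specialty"),
            "cid_count": 0,
        })
        row["cid_count"] += 1
    # gold_cids keeps every listing, so a domain may appear more than once there.
    return list(domains.values()), list(cid_records)
```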

Where to look — file & table reference

Thing	Path or table
The 2 scripts	`/mnt/workspace/amicus/pipeline/steps/004_specialties/step_004*.py`
Specialty list (live)	`pipeline/profiles/_configs/attorney.yaml` via `shared/vertical_config.py`
Specialty list (fallback)	hardcoded in `step_004a_specialties.py:76`
Category remap dict	`step_004a_specialties.py:_CATEGORY_REMAP` (14 entries)
NON_VERTICAL_CATEGORIES	`step_004b_join_gold.py:123` (5 entries)
search_keyword maps	`step_004b_join_gold.py:SPECIALTY_TO_SEARCH_KEYWORD` + `TOWING_SPECIALTY_TO_SEARCH_KEYWORD`
004a output	`output/<profile_id>/c1_specialties.json`
Phase deliverable (BQ)	`amicus_pipeline.gold_domains`, `amicus_pipeline.gold_cids`, `amicus_pipeline.gold_paid`
Phase deliverable (disk)	`output/<profile_id>/13_gold.json`
BQ enrichment table (004a)	`amicus_pipeline.enrichment_004a_specialties`
Per-step logs	`pipeline/steps/000_log_files/step_004_.log`

Cost per fire

004a is the only step that calls a paid API. Single-pass Haiku per V_CONFIRMED domain. 004b is pure Python + BigQuery writes (no per-call cost).

Line item	Volume	Per unit	Subtotal
004a — single-pass Haiku (`~$1/$5` per M tokens, ~2000 in / 100 out per call)	~2,247 V_CONFIRMED	~$0.003	~$6.75
004b — join + 3 BQ table writes (no external API)	—	—	$0.00
Total per 30-day fire			~$7

V_CONFIRMED count = 2,247 unique domains for atty_wa_seattle from enrichment_004a_specialties in run atty_production_001 (queried 2026-05-16). That’s a real specialty-classified row count, not an estimate. If a future Phase 003 run produces more or fewer V_CONFIRMED firms, 004a cost scales linearly.
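The arithmetic behind the table can be checked with a few lines. The prices and token counts are the approximate figures quoted in the table, not measured billing data. Note that the unrounded per-call cost is about $0.0025, which gives roughly $5.62 per fire; the table's ~$6.75 follows from multiplying by the rounded ~$0.003. Either way the total stays within the ~$7 ceiling.

```python
# Approximate figures from the cost table above.
INPUT_PRICE_PER_MTOK = 1.00    # USD per million input tokens
OUTPUT_PRICE_PER_MTOK = 5.00   # USD per million output tokens
INPUT_TOKENS_PER_CALL = 2000
OUTPUT_TOKENS_PER_CALL = 100
CONFIRMED_DOMAINS = 2247


def cost_per_call(in_tokens, out_tokens, in_price, out_price):
    """USD cost of one call given token counts and per-million-token prices."""
    return in_tokens / 1_000_000 * in_price + out_tokens / 1_000_000 * out_price


per_call = cost_per_call(
    INPUT_TOKENS_PER_CALL, OUTPUT_TOKENS_PER_CALL,
    INPUT_PRICE_PER_MTOK, OUTPUT_PRICE_PER_MTOK,
)
total_unrounded = CONFIRMED_DOMAINS * per_call   # about 5.62 USD
total_rounded_rate = CONFIRMED_DOMAINS * 0.003   # about 6.74 USD, the table's basis
```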

Schedule

Frequency

Every 30 days. Once. Final step of intake.

Trigger

Runs immediately after Phase 003 in the intake cascade. Not a separate cron.

Execution mode

Sequential 004a → 004b. 004a is async-parallel (20 concurrent Haiku calls).

Concurrency

004a: 20 Haiku concurrent. 004b: serial Python.

Model resolution

Runtime query to /v1/models — no hardcoded model IDs.

Retry behavior

004a: per-domain 2-attempt retry. If both fail, record gets error_message but the run continues. --backfill flag re-processes only error rows.
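The concurrency and retry rules above can be sketched with asyncio. This is a minimal illustration, assuming a caller-supplied async `classify(domain)` coroutine that returns a dict. The names `classify_with_retry` and `run_all` are hypothetical, and the real step may structure its retries differently.

```python
import asyncio

MAX_CONCURRENCY = 20  # concurrent Haiku calls, per the schedule above
MAX_ATTEMPTS = 2      # per-domain attempts before recording an error


async def classify_with_retry(domain, classify, sem):
    """Try a domain up to MAX_ATTEMPTS times; never raise, so the run continues."""
    async with sem:
        last_error = None
        for _ in range(MAX_ATTEMPTS):
            try:
                result = await classify(domain)
                return {"domain": domain, **result, "error_message": None}
            except Exception as exc:  # broad on purpose: one bad domain must not stop the run
                last_error = exc
        return {"domain": domain, "error_message": str(last_error)}


async def run_all(domains, classify):
    """Classify all domains with at most MAX_CONCURRENCY calls in flight."""
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    tasks = (classify_with_retry(d, classify, sem) for d in domains)
    return await asyncio.gather(*tasks)
```

A backfill pass in this shape would simply call `run_all` again on the domains whose records carry a non-null `error_message`.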

Output guarantee

Three BQ tables populated for this run_id — gold_domains, gold_cids, gold_paid. Intake is done.

What's Fucked

Phase 004 is not running the spec. Here’s exactly how.

Forensic findings, 2026-05-16. Same intake-tier gap as 001/002/003, plus 004-specific drift.

Finding 1 — Both 004 steps are tagged `intake`. Neither auto-fires.

pipeline/steps/cadence.py tags 004a and 004b as intake. The intake tier has runs_on: [] — never auto-fires. Same gap as the rest of intake.

Step	Cadence tag	Auto-fires?
`004a`	intake	never
`004b`	intake	never

Finding 2 — All three gold tables silently skip BQ when `PIPELINE_RUN_ID` isn’t set.

This is the third Phase in a row with the same gotcha. step_004b_join_gold.py at line 400 only writes gold_domains + gold_cids + gold_paid if PIPELINE_RUN_ID is in the environment:

    if run_id and cid_records:
        ...  # write the three gold tables
    elif not run_id:
        logger.info("BQ write skipped: no PIPELINE_RUN_ID (standalone mode)")

Every manual SSH fire of 004b that ran without PIPELINE_RUN_ID in the environment has silently skipped writing the gold tables. This is the highest-impact instance of the bug, because gold_domains, gold_cids, and gold_paid are the canonical inputs for every weekly/monthly/biweekly enrichment step in Phases 005-010.

Finding 3 — Specialty-key mismatch between 004a output and 004b’s search_keyword map.

004a’s fallback VALID_CATEGORIES set (line 76 of step_004a_specialties.py) uses long form like personal_injury_lawyer, family_law_attorney. 004b’s SPECIALTY_TO_SEARCH_KEYWORD map (line 147 of step_004b_join_gold.py) uses short form: personal_injury, family_law. These don’t match.

What actually happens at runtime depends on which list is authoritative. Two cases:

If attorney.yaml uses short form (personal_injury): the search_keyword map matches, and the hardcoded fallback list in 004a is dead code, because the fallback only kicks in when the config fails to load. The bug only surfaces if the vertical config goes missing.
If attorney.yaml uses long form (personal_injury_lawyer): every domain ends up with search_keyword="lawyer" (the fallback default), losing all specialty-specific keyword routing. Downstream Phase 008/010 then fetches SERPs for generic “lawyer” for every firm regardless of practice area.

Unverified which case is real. Resolving this is Fix 2.

Finding 4 — The NON_V_OVERRIDE post-filter only catches 5 categories.

004b’s NON_VERTICAL_CATEGORIES set (line 123) is just 5 hardcoded GMaps categories: CPA, accountant, doctor, driving school, educational consultant. Any other lawyer-adjacent business that fooled 003c (e.g. mediator, paralegal services, legal aid clinic, court reporter) gets through into gold_domains as a confirmed firm.

This list is too short relative to the false-positive surface. Phase 003c is good but not perfect; the post-filter is the last line of defense and it has 5 entries.

Finding 5 — The category remap dict needs runtime measurement to confirm it’s catching real cases.

004a’s _CATEGORY_REMAP has 14 hardcoded mappings for categories Haiku tends to invent (wrongful_death_attorney → personal_injury_lawyer, etc.). Whether Haiku still invents these in current model versions is unverified. Some entries may be stale (mapping a category Haiku no longer produces) and some real invented categories may slip through unmapped.

The script does log every remap, so a single run’s log file would tell us: which remaps fire, which never fire, which invented categories show up that aren’t in the dict at all. That measurement has not been done.

Finding 6 — Without 004 completing, every downstream phase reads stale gold.

Phases 005-010 (weekly, monthly, biweekly) all read from gold_domains / gold_cids. If 004b never runs, those tables stop getting fresh rows for new run_ids. Downstream phases keep operating on the most recent successful gold-table run — which is whenever 004b last succeeded with PIPELINE_RUN_ID set.

That means every Monday/Thursday biweekly fire, every Monday weekly fire, every monthly fire is enriching the same fixed firm list. New firms that opened since the last successful 004b fire are invisible. Firms that closed are still being enriched as if they were live.

The bottom line

Where Phase 004 Stands Today

The three gold tables (gold_domains, gold_cids, gold_paid) have not been written by any scheduled fire because intake never auto-runs, and manual fires almost always skip the BQ write due to missing PIPELINE_RUN_ID. This is the load-bearing failure of the entire pipeline. Every weekly/monthly/biweekly enrichment downstream of Phase 004 is enriching whichever firm list 004b last successfully published — and that’s the canonical “why is the pipeline broken” answer.

The Fix

What we’ll do to make Phase 004 match the spec.

Concrete remediation. The single highest-leverage fix in the whole pipeline lives here.

Six fixes. Fix 1 is the load-bearing one — once 004b actually writes gold tables on a schedule, the entire downstream pipeline starts working again.

FIX 1 Make the gold-table BQ write unconditional (or fail loud). ~20 min

The current behavior — silently skipping the BQ write when PIPELINE_RUN_ID is missing — is the wrong default. Two acceptable shapes:

Option A: if PIPELINE_RUN_ID is missing, generate one (e.g. standalone_2026_05_16_HHMM) and write anyway. Manual fires populate BQ.
Option B: if PIPELINE_RUN_ID is missing, exit non-zero with a clear error. No silent skip ever.

Recommended: Option B. Phase 004’s deliverable IS the BQ tables; a run that doesn’t produce them is a failed run, not a successful one. Apply the same fix to 002, 003c, and 004a’s enrichment writes for consistency.
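Option B could look roughly like the guard below. It is a sketch, not a patch: the function name `require_run_id` and the error wording are hypothetical, and where the call belongs in step_004b_join_gold.py is left open.

```python
import os
import sys


def require_run_id(env=None):
    """Return PIPELINE_RUN_ID, or exit non-zero if it is missing or blank."""
    env = os.environ if env is None else env
    run_id = (env.get("PIPELINE_RUN_ID") or "").strip()
    if not run_id:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit("004b aborted: PIPELINE_RUN_ID is not set; refusing to skip the gold-table write")
    return run_id
```

Calling the guard before any join work means a misconfigured manual fire fails in seconds instead of finishing "successfully" with no BQ rows.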

FIX 2 Resolve the specialty-key mismatch in 004b’s search_keyword map. ~30 min

Concretely:

Open pipeline/profiles/_configs/attorney.yaml. Determine whether it uses personal_injury or personal_injury_lawyer as specialty keys.
Make the fallback list in step_004a_specialties.py:76 match the config exactly. Either both short form or both long form.
Make step_004b_join_gold.py:SPECIALTY_TO_SEARCH_KEYWORD keys match.
After a real run, query: SELECT primary_specialty, search_keyword, COUNT(*) FROM gold_domains GROUP BY 1, 2 ORDER BY 3 DESC — if every row has search_keyword='lawyer', the map is missing keys.

This fix is invisible until Fix 1 lands and we see real data flowing.
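Before real data flows, the key alignment can also be checked statically. The helper below is a hypothetical sketch: it takes whatever category list is authoritative plus the keyword map and reports categories that would fall through to the generic fallback.

```python
def missing_keyword_keys(valid_categories, keyword_map):
    """Return the categories that have no entry in the search_keyword map, sorted."""
    return sorted(set(valid_categories) - set(keyword_map))
```

For example, `missing_keyword_keys(["personal_injury_lawyer", "tax_attorney"], {"personal_injury": "personal injury lawyer"})` reports both categories as missing, which is the long-form failure case from Finding 3. An empty list means every category routes to a specific keyword.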

FIX 3 Inherit the Phase 001 intake cron. no work — lands with Phase 001 Fix 3

Once Phase 001 Fix 3 installs the 0 10 1 * * intake crontab entry, 004a and 004b fire automatically at the end of every intake cascade. The orchestrator’s topological order ends with 004a → 004b.

No 004-specific cron line. Verify the orchestrator topology: 003d must complete before 004a, and 004a must complete before 004b. If the orchestrator parallelizes 004a and 004b, 004b reads an empty c1_specialties.json and produces gold rows with no specialties.
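The ordering requirement can be expressed as a dependency graph and verified with the standard library's `graphlib` (Python 3.9+). The `DEPENDENCIES` dict below is a hypothetical stand-in for however the orchestrator actually declares its topology. It encodes only the two constraints named above.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: step -> steps that must finish first.
DEPENDENCIES = {
    "004a": {"003d"},
    "004b": {"004a"},
}


def execution_order(deps):
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(deps).static_order())


def runs_before(order, first, second):
    """True if `first` is scheduled strictly before `second`."""
    return order.index(first) < order.index(second)
```

A check like `runs_before(order, "004a", "004b")` only proves the order is valid. It does not prove the orchestrator waits for 004a to finish, so the parallelization risk described above still needs to be confirmed in the orchestrator itself.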

FIX 4 Expand the NON_V_OVERRIDE post-filter list. ~30 min

After Fix 1 lands and we have fresh gold_domains rows, query for false positives:

    SELECT category, primary_specialty, COUNT(*)
    FROM gold_domains
    WHERE final_verdict = 'V_CONFIRMED'
    GROUP BY 1, 2
    ORDER BY 3 DESC

Look for GMaps categories that aren’t actually law firms but slipped through (e.g. “legal services”, “mediator”, “paralegal”, “legal aid”, “notary public”, “court reporter”). For each, decide:

Definitively not an attorney → add to NON_VERTICAL_CATEGORIES in step_004b_join_gold.py.
Sometimes an attorney → leave it. Let the verdict stand.

Target: grow the 5-entry list to ~15-25 entries based on observed false positives.

FIX 5 Measure the category-remap dict against real Haiku output. ~20 min

After a real 004a run completes, grep the log:

grep "remapped primary" 000_log_files/step_004a_*.log

Three buckets to identify:

Remaps that fire: keep them.
Remaps that never fire (no log hit in 12 months of runs): remove from _CATEGORY_REMAP as stale.
Invented categories appearing in logs but not in the dict: add a remap entry.

Same goal as Fix 4 — keep the dictionaries current with what models actually produce.
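The first two buckets can be computed from the log lines that the grep returns. The log line format is an assumption: this page only confirms that the phrase "remapped primary" appears, so the `source -> target` shape matched by the regex below is hypothetical and should be adjusted to the real log output. The third bucket (invented categories with no dict entry) cannot be derived from remap lines and is not covered here.

```python
import re
from collections import Counter

# ASSUMED format: "... remapped primary <invented_category> -> <valid_category>"
REMAP_LINE = re.compile(r"remapped primary\s+(\w+)\s*->\s*(\w+)")


def bucket_remaps(log_lines, remap_dict):
    """Return (fired_counts, never_fired_keys) for the given remap dict."""
    fired = Counter()
    for line in log_lines:
        match = REMAP_LINE.search(line)
        if match:
            fired[match.group(1)] += 1
    never_fired = sorted(set(remap_dict) - set(fired))
    return fired, never_fired
```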

FIX 6 Add fail-loud verification at the end of 004b. ~25 min

After all three gold tables are written, assert:

gold_domains row count for this run_id is > some minimum (50 for a real market)
Count of final_verdict='V_CONFIRMED' rows is > 30% of total (a real market is never <30% confirmed)
search_keyword distribution has at least 3 distinct values (if everything is “lawyer”, Fix 2 didn’t land)
Count of primary_specialty IS NULL rows among V_CONFIRMED is < 10% (most confirmed firms should have a specialty)

If any assertion fails, exit non-zero with a clear error. Same pattern as Phases 001-003 fail-loud fixes.
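The four assertions can be sketched against in-memory rows before wiring them to BigQuery. The thresholds are the ones listed above. The function name `verify_gold` and the row field names are assumptions, and a real version would query gold_domains for the current run_id instead of taking a list.

```python
def verify_gold(rows, min_rows=50):
    """Return a list of failure messages; an empty list means all checks passed."""
    failures = []
    total = len(rows)
    if total <= min_rows:
        failures.append(f"gold_domains has {total} rows, expected more than {min_rows}")

    confirmed = [r for r in rows if r.get("final_verdict") == "V_CONFIRMED"]
    confirmed_share = len(confirmed) / total if total else 0.0
    if confirmed_share <= 0.30:
        failures.append(f"V_CONFIRMED share is {confirmed_share:.0%}, expected more than 30%")

    distinct_keywords = {r.get("search_keyword") for r in rows}
    if len(distinct_keywords) < 3:
        failures.append(f"search_keyword has {len(distinct_keywords)} distinct values, expected at least 3")

    if confirmed:
        null_count = sum(1 for r in confirmed if r.get("primary_specialty") is None)
        null_share = null_count / len(confirmed)
        if null_share >= 0.10:
            failures.append(f"{null_share:.0%} of V_CONFIRMED rows lack a primary_specialty, expected under 10%")
    return failures
```

The caller would exit non-zero when the returned list is non-empty, printing every failure at once so a single run surfaces all broken invariants.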

After all 6 fixes

The intake cron from Phase 001 Fix 3 fires on the 1st of every month. Orchestrator runs 001 → 002 → 003 → 004 in order. 004a classifies every V_CONFIRMED domain into one of 24 specialties. 004b applies post-filters, computes search keywords, and writes fresh rows to gold_domains, gold_cids, gold_paid. Every Monday/Thursday biweekly fire and every Monday weekly fire now reads from a current firm list. Intake is whole.

Then we move on to Phase 005 (OnPage crawl) — the first enrichment phase, and where the cost numbers start getting serious.

Generated 2026-05-16 from /mnt/workspace/amicus/sites/pipeline/ on amicus-dev VM.
