The Spec
How Phase 004 is supposed to behave.
Phase 004 — Specialties. Design intent. Reference, not reality.
Duty: For every confirmed firm from Phase 003, identify which kinds of law they practice (personal injury, family law, criminal defense, etc.). Then join everything — verdicts, specialties, page content, GMaps signals — into the final gold layer that every weekly/monthly/biweekly enrichment phase reads from.
Schedule: Fires every 30 days as the final step of intake. After 004 completes, intake is done for that cycle.
End state: Three BigQuery tables exist with fresh rows for this intake run: gold_domains (one row per firm), gold_cids (one row per Google Business listing), gold_paid (one row per paid ad seen on the SERP).
What Phase 004 does, plain English
Phase 003 told us which domains are PI law firms. Phase 004 answers two follow-up questions:
- What do they practice? A firm that says “personal injury” on its homepage gets tagged personal_injury_lawyer as primary specialty, and any other practice areas it advertises become secondaries. The classifier is constrained to a fixed list of 24 attorney specialties from the vertical config; it cannot invent its own categories.
- How do we publish all this? The final gold layer combines every signal we have — the Maps SERP row (CID-grain), the verified domain status, the Haiku verdict, the specialties, the search keyword we’d use to find them on Google — into three BigQuery tables that downstream phases read.
Phase 004 is also where false positives from Phase 003 get caught. If 003 confirmed a domain as a law firm but the Google Maps category says “Certified Public Accountant”, 004b overrides the verdict to NON_V_OVERRIDE. The classifier got fooled by lawyer-adjacent content; the GMaps category is the trusted human-verified signal.
The 2 sub-steps, in order
| Step | What it does | Provider | Reads | Writes |
| 004a | Single-pass Haiku specialty classifier on every V_CONFIRMED domain. Returns primary_specialty, secondary_specialties, declared_practice_areas, firm_summary. Remaps invented categories to the closest valid one. | Anthropic Haiku | tags_only/, 11_gold.json | c1_specialties.json + BQ enrichment_004a_specialties |
| 004b | Joins every enrichment onto the CID-grain table. Applies NON_V_OVERRIDE post-filter. Splits into 3 BQ tables: gold_domains, gold_cids, gold_paid. Maps each primary specialty to a search keyword for downstream Maps SERP fetches. | none | 09_categorized.json, 10_haiku_classified.json, c1_specialties.json, events.jsonl, 09b_playwright_results.json | 13_gold.json + BQ gold_domains + gold_cids + gold_paid |
The 24 attorney specialties (004a’s output space)
004a’s system prompt is built dynamically from the vertical config (pipeline/profiles/_configs/attorney.yaml). For the attorney vertical, the fallback (hardcoded) list contains these 24 categories — Haiku must pick primary_specialty from this set:
general_practice_attorney · family_law_attorney · criminal_law_attorney · tax_attorney · personal_injury_lawyer · divorce_attorney · real_estate_attorney · estate_planning_attorney · civil_law_attorney · trial_attorney · employment_attorney · labor_relations_attorney · administrative_attorney · immigration_attorney · bankruptcy_attorney · social_security_attorney · insurance_attorney · patent_attorney · elder_law_attorney · business_attorney · medical_lawyer · estate_litigation_attorney · probate_attorney · environmental_attorney
Category remap dict: when Haiku invents a plausible-but-out-of-set category (e.g. wrongful_death_attorney, commercial_litigation_attorney, workers_compensation_attorney), 004a remaps to the closest valid one (e.g. personal_injury_lawyer, business_attorney, labor_relations_attorney). 14 such remaps are hardcoded.
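A minimal sketch of the remap step, assuming it is a plain dict lookup applied to the model's answer. Only the three example pairs named above are shown, and their pairing is assumed from the order of the examples. The helper name normalize_primary, the trimmed VALID_CATEGORIES set, and the None return for unmapped values are illustrative, not the contents of step_004a_specialties.py.

```python
from typing import Optional

# Trimmed subset of the 24-category fallback list, for illustration only.
VALID_CATEGORIES = {
    "personal_injury_lawyer",
    "business_attorney",
    "labor_relations_attorney",
    "family_law_attorney",
}

# Three of the 14 remaps. Pairings are assumed from the order of the examples.
CATEGORY_REMAP = {
    "wrongful_death_attorney": "personal_injury_lawyer",
    "commercial_litigation_attorney": "business_attorney",
    "workers_compensation_attorney": "labor_relations_attorney",
}


def normalize_primary(raw: str) -> Optional[str]:
    """Return a valid category, a remapped one, or None if nothing fits."""
    key = raw.strip().lower()
    if key in VALID_CATEGORIES:
        return key
    # Out-of-set answer: fall back to the remap dict (None when unmapped).
    return CATEGORY_REMAP.get(key)
```

Returning None for an unmapped category keeps the "invented but not in the dict" case visible to the caller instead of silently coercing it.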
The post-filter (004b’s defense against 003 false positives)
If 003c said V_CONFIRMED but the Google Maps category is in NON_VERTICAL_CATEGORIES, 004b overrides the verdict to NON_V_OVERRIDE. The 5 hardcoded categories are:
certified public accountant · accountant · doctor · driving school · educational consultant
A CPA office whose website mentions “tax law” sometimes fools 003c. The GMaps category is human-curated and reliable — if it says “Certified Public Accountant,” the firm is a CPA, not a tax attorney.
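The override rule can be sketched as below. The five category strings come from the list above; the function name apply_post_filter and the case-insensitive comparison are assumptions, since the real matching logic in step_004b_join_gold.py is not shown here.

```python
from typing import Optional

NON_VERTICAL_CATEGORIES = {
    "certified public accountant",
    "accountant",
    "doctor",
    "driving school",
    "educational consultant",
}


def apply_post_filter(verdict: str, gmaps_category: Optional[str]) -> str:
    """Downgrade V_CONFIRMED when the GMaps category is a known non-vertical."""
    if verdict != "V_CONFIRMED" or not gmaps_category:
        return verdict
    # Assumption: categories are compared case-insensitively.
    if gmaps_category.strip().lower() in NON_VERTICAL_CATEGORIES:
        return "NON_V_OVERRIDE"
    return verdict
```

Only V_CONFIRMED rows are touched, so a listing with no GMaps category keeps whatever verdict Phase 003 gave it.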
The search_keyword map (input for Phase 008/010)
Every gold record gets a search_keyword field derived from its primary_specialty. Downstream phases (e.g. 008a Maps SERP grid, 010 keyword tracking) use this string verbatim when querying DataForSEO. The 21-entry map looks like:
| primary_specialty | search_keyword used downstream |
| personal_injury | personal injury lawyer |
| family_law | divorce lawyer |
| criminal_defense | criminal defense lawyer |
| business_law | business lawyer |
| estate_planning | estate planning lawyer |
| … (16 more) | |
| fallback when no specialty | lawyer |
Mismatch warning: the keys in the search_keyword map (e.g. personal_injury) do not match the 24-category fallback list above (which uses personal_injury_lawyer). Whether the live category list loaded from the vertical config uses the short form or the long form has not been verified. See forensic Finding 3 below.
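To make the mismatch concrete, here is a sketch of the lookup, assuming it is a dict get with "lawyer" as the default. The five entries are the ones shown in the table; the other 16 are omitted, and the function name search_keyword_for is hypothetical.

```python
from typing import Optional

# The five entries shown in the table above; the remaining 16 are omitted.
SPECIALTY_TO_SEARCH_KEYWORD = {
    "personal_injury": "personal injury lawyer",
    "family_law": "divorce lawyer",
    "criminal_defense": "criminal defense lawyer",
    "business_law": "business lawyer",
    "estate_planning": "estate planning lawyer",
}
DEFAULT_KEYWORD = "lawyer"


def search_keyword_for(primary_specialty: Optional[str]) -> str:
    """Map a specialty key to its search keyword, defaulting to 'lawyer'."""
    if not primary_specialty:
        return DEFAULT_KEYWORD
    return SPECIALTY_TO_SEARCH_KEYWORD.get(primary_specialty, DEFAULT_KEYWORD)
```

Under this assumption, search_keyword_for("personal_injury") hits the map, while the long-form key personal_injury_lawyer misses it and quietly returns the generic default.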
How the data moves (and where it lives)
↓
004a · single-pass Haiku
step_004a_specialties.py
For each V_CONFIRMED domain:
primary + secondary + declared + summary
Writes c1_specialties.json
+ BQ enrichment_004a_specialties
↓
004b · join + post-filter + split
step_004b_join_gold.py
Apply NON_V_OVERRIDE post-filter
Null specialties on dead sites
Compute search_keyword
Split CID-grain into 3 BQ tables
↓
Output · gold_domains
amicus_pipeline.gold_domains
one row per unique domain,
with verdict + specialty + cid_count
Output · gold_cids
amicus_pipeline.gold_cids
one row per Google Business listing
(may be multiple per domain)
Output · gold_paid
amicus_pipeline.gold_paid
paid ad rows from the SERPs
(usually small)
↓
Hands off to
Phases 005 → 010 — enrichment tiers
Every monthly/weekly/biweekly enrichment
reads from gold_domains / gold_cids
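The CID-grain to domain-grain collapse could look roughly like this. The column names domain and cid are assumptions (final_verdict, primary_specialty and cid_count appear elsewhere in this document), and the sketch assumes verdict and specialty are identical across all listings of one domain. How paid rows are separated into gold_paid is not covered.

```python
from collections import defaultdict


def collapse_to_domains(cid_rows):
    """Build one domain-grain record per domain from CID-grain rows."""
    by_domain = defaultdict(list)
    for row in cid_rows:
        by_domain[row["domain"]].append(row)

    domains = []
    for domain, rows in by_domain.items():
        first = rows[0]  # assumption: domain-level fields agree across rows
        domains.append({
            "domain": domain,
            "final_verdict": first["final_verdict"],
            "primary_specialty": first["primary_specialty"],
            # Count distinct listings, not raw rows.
            "cid_count": len({r["cid"] for r in rows}),
        })
    return domains
```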
Where to look — file & table reference
| Thing | Path or table |
| The 2 scripts | /mnt/workspace/amicus/pipeline/steps/004_specialties/step_004*.py |
| Specialty list (live) | pipeline/profiles/_configs/attorney.yaml via shared/vertical_config.py |
| Specialty list (fallback) | hardcoded in step_004a_specialties.py:76 |
| Category remap dict | step_004a_specialties.py:_CATEGORY_REMAP (14 entries) |
| NON_VERTICAL_CATEGORIES | step_004b_join_gold.py:123 (5 entries) |
| search_keyword maps | step_004b_join_gold.py:SPECIALTY_TO_SEARCH_KEYWORD + TOWING_SPECIALTY_TO_SEARCH_KEYWORD |
| 004a output | output/<profile_id>/c1_specialties.json |
| Phase deliverable (BQ) | amicus_pipeline.gold_domains, amicus_pipeline.gold_cids, amicus_pipeline.gold_paid |
| Phase deliverable (disk) | output/<profile_id>/13_gold.json |
| BQ enrichment table (004a) | amicus_pipeline.enrichment_004a_specialties |
| Per-step logs | pipeline/steps/000_log_files/step_004*_*.log |
Cost per fire
004a is the only step that calls a paid API. Single-pass Haiku per V_CONFIRMED domain. 004b is pure Python + BigQuery writes (no per-call cost).
| Line item | Volume | Per unit | Subtotal |
| 004a: single-pass Haiku (~$1/$5 per M tokens, ~2000 in / 100 out per call) | ~2,247 V_CONFIRMED | ~$0.003 | ~$6.75 |
| 004b: join + 3 BQ table writes (no external API) | n/a | n/a | $0.00 |
| Total per 30-day fire | | | ~$7 |
V_CONFIRMED count = 2,247 unique domains for atty_wa_seattle from enrichment_004a_specialties in run atty_production_001 (queried 2026-05-16). That’s a real specialty-classified row count, not an estimate. If a future Phase 003 run produces more or fewer V_CONFIRMED firms, 004a cost scales linearly.
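The arithmetic behind the table, worked from the stated assumptions (~$1 per million input tokens, ~$5 per million output tokens, ~2000 in and ~100 out per call). Unrounded, a call costs about $0.0025 and the run about $5.62; the table's ~$6.75 comes from rounding the per-call figure up to $0.003 before multiplying. Variable names are illustrative.

```python
INPUT_PRICE_PER_M = 1.00    # USD per million input tokens (approximate)
OUTPUT_PRICE_PER_M = 5.00   # USD per million output tokens (approximate)
INPUT_TOKENS = 2000         # approximate tokens in per call
OUTPUT_TOKENS = 100         # approximate tokens out per call
V_CONFIRMED = 2247          # domains classified in run atty_production_001


def cost_per_call(tokens_in: int, tokens_out: int) -> float:
    """USD cost of one call at the per-million-token prices above."""
    return (tokens_in / 1e6) * INPUT_PRICE_PER_M + (tokens_out / 1e6) * OUTPUT_PRICE_PER_M


per_call = cost_per_call(INPUT_TOKENS, OUTPUT_TOKENS)
total_unrounded = per_call * V_CONFIRMED
total_table = round(per_call, 3) * V_CONFIRMED  # per-call rounded to $0.003 first
```

Either way the fire stays in single-digit dollars, and the cost scales linearly with the V_CONFIRMED count.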
Schedule
Frequency
Every 30 days. Once. Final step of intake.
Trigger
Runs immediately after Phase 003 in the intake cascade. Not a separate cron.
Execution mode
Sequential 004a → 004b. 004a is async-parallel (20 concurrent Haiku calls).
Concurrency
004a: 20 Haiku concurrent. 004b: serial Python.
Model resolution
Runtime query to /v1/models — no hardcoded model IDs.
Retry behavior
004a: per-domain 2-attempt retry. If both fail, record gets error_message but the run continues. --backfill flag re-processes only error rows.
Output guarantee
Three BQ tables populated for this run_id — gold_domains, gold_cids, gold_paid. Intake is done.
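The concurrency and retry behavior described above can be sketched with asyncio. This is an illustration under assumptions, not the code in step_004a_specialties.py: classify is a hypothetical stand-in coroutine for the Haiku call, and the record shape is invented apart from error_message.

```python
import asyncio

MAX_CONCURRENT = 20  # concurrent Haiku calls
MAX_ATTEMPTS = 2     # per-domain attempts before giving up


async def classify_with_retry(domain, classify, semaphore):
    """Try a domain up to MAX_ATTEMPTS times; never raise to the caller."""
    last_error = None
    for _ in range(MAX_ATTEMPTS):
        try:
            async with semaphore:  # cap the number of in-flight calls
                result = await classify(domain)
            return {"domain": domain, **result, "error_message": None}
        except Exception as exc:
            last_error = str(exc)
    # Both attempts failed: record the error so the run can continue.
    return {"domain": domain, "error_message": last_error}


async def classify_all(domains, classify):
    """Classify every domain, preserving input order in the results."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    tasks = [classify_with_retry(d, classify, semaphore) for d in domains]
    return await asyncio.gather(*tasks)
```

Rows that come back with a non-null error_message are the ones a backfill pass would pick up.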
What's Fucked
Phase 004 is not running the spec. Here’s exactly how.
Forensic findings, 2026-05-16. Same intake-tier gap as 001/002/003, plus 004-specific drift.
Finding 1 — Both 004 steps are tagged intake. Neither auto-fires.
pipeline/steps/cadence.py tags 004a and 004b as intake. The intake tier has runs_on: [] — never auto-fires. Same gap as the rest of intake.
| Step | Cadence tag | Auto-fires? |
004a | intake | never |
004b | intake | never |
Finding 2 — All three gold tables silently skip BQ when PIPELINE_RUN_ID isn’t set.
This is the third Phase in a row with the same gotcha. step_004b_join_gold.py at line 400 only writes gold_domains + gold_cids + gold_paid if PIPELINE_RUN_ID is in the environment:
    if run_id and cid_records:
        ...  # write three gold tables
    elif not run_id:
        logger.info("BQ write skipped: no PIPELINE_RUN_ID (standalone mode)")
Every manual SSH fire of 004b since intake broke has silently skipped writing the gold tables. This is the highest-impact instance of the bug — gold_domains, gold_cids, gold_paid are the canonical inputs for every weekly/monthly/biweekly enrichment step in Phases 005-010.
Finding 3 — Specialty-key mismatch between 004a output and 004b’s search_keyword map.
004a’s fallback VALID_CATEGORIES set (line 76 of step_004a_specialties.py) uses long form like personal_injury_lawyer, family_law_attorney. 004b’s SPECIALTY_TO_SEARCH_KEYWORD map (line 147 of step_004b_join_gold.py) uses short form: personal_injury, family_law. These don’t match.
What actually happens at runtime depends on which list is authoritative. Two cases:
- If vertical_config.yaml uses short form (personal_injury): the search_keyword map matches, and the fallback hardcoded list in 004a is dead code that never executes (the vertical config loads successfully, and the fallback only kicks in when the config is missing). The bug only surfaces if the vertical config goes missing.
- If vertical_config.yaml uses long form (personal_injury_lawyer): every domain ends up with search_keyword="lawyer" (the fallback default), losing all specialty-specific keyword routing. Downstream Phase 008/010 then fetches SERPs for generic “lawyer” for every firm regardless of practice area.
Unverified which case is real. Resolving this is Fix 2.
Finding 4 — The NON_V_OVERRIDE post-filter only catches 5 categories.
004b’s NON_VERTICAL_CATEGORIES set (line 123) is just 5 hardcoded GMaps categories: CPA, accountant, doctor, driving school, educational consultant. Any other lawyer-adjacent business that fooled 003c (e.g. mediator, paralegal services, legal aid clinic, court reporter) gets through into gold_domains as a confirmed firm.
This list is too short relative to the false-positive surface. Phase 003c is good but not perfect; the post-filter is the last line of defense and it has 5 entries.
Finding 5 — The category remap dict needs runtime measurement to confirm it’s catching real cases.
004a’s _CATEGORY_REMAP has 14 hardcoded mappings for categories Haiku tends to invent (wrongful_death_attorney → personal_injury_lawyer, etc.). Whether Haiku still invents these in current model versions is unverified. Some entries may be stale (mapping a category Haiku no longer produces) and some real invented categories may slip through unmapped.
The script does log every remap, so a single run’s log file would tell us: which remaps fire, which never fire, which invented categories show up that aren’t in the dict at all. That measurement has not been done.
Finding 6 — Without 004 completing, every downstream phase reads stale gold.
Phases 005-010 (weekly, monthly, biweekly) all read from gold_domains / gold_cids. If 004b never runs, those tables stop getting fresh rows for new run_ids. Downstream phases keep operating on the most recent successful gold-table run — which is whenever 004b last succeeded with PIPELINE_RUN_ID set.
That means every Monday/Thursday biweekly fire, every Monday weekly fire, every monthly fire is enriching the same fixed firm list. New firms that opened since the last successful 004b fire are invisible. Firms that closed are still being enriched as if they were live.
The bottom line
Where Phase 004 Stands Today
The three gold tables (gold_domains, gold_cids, gold_paid) have not been written by any scheduled fire because intake never auto-runs, and manual fires almost always skip the BQ write due to missing PIPELINE_RUN_ID. This is the load-bearing failure of the entire pipeline. Every weekly/monthly/biweekly enrichment downstream of Phase 004 is enriching whichever firm list 004b last successfully published — and that’s the canonical “why is the pipeline broken” answer.
The Fix
What we’ll do to make Phase 004 match the spec.
Concrete remediation. The single highest-leverage fix in the whole pipeline lives here.
Six fixes. Fix 1 is the load-bearing one — once 004b actually writes gold tables on a schedule, the entire downstream pipeline starts working again.
Fix 1: Stop silently skipping the BQ write when PIPELINE_RUN_ID is missing.
The current behavior (silently skipping the BQ write when PIPELINE_RUN_ID is missing) is the wrong default. Two acceptable shapes:
- Option A: if PIPELINE_RUN_ID is missing, generate one (e.g. standalone_2026_05_16_HHMM) and write anyway. Manual fires populate BQ.
- Option B: if PIPELINE_RUN_ID is missing, exit non-zero with a clear error. No silent skip ever.
Recommended: Option B. Phase 004’s deliverable IS the BQ tables; a run that doesn’t produce them is a failed run, not a successful one. Apply the same fix to 002, 003c, and 004a’s enrichment writes for consistency.
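A minimal sketch of Option B. The helper name require_run_id and the error wording are hypothetical; the env parameter exists only so the check can be exercised without touching the real environment.

```python
import os
import sys


def require_run_id(env=None) -> str:
    """Return PIPELINE_RUN_ID or terminate the process with a non-zero exit."""
    env = os.environ if env is None else env
    run_id = env.get("PIPELINE_RUN_ID", "").strip()
    if not run_id:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit("PIPELINE_RUN_ID is not set; refusing to skip the BigQuery write")
    return run_id
```

Calling this at the top of the step, before any work is done, means a manual SSH fire without the variable fails in the first second instead of finishing "successfully" with no gold rows.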
Fix 2: Resolve the specialty-key mismatch between 004a and 004b.
Concretely:
- Open pipeline/profiles/_configs/attorney.yaml. Determine whether it uses personal_injury or personal_injury_lawyer as specialty keys.
- Make the fallback list in step_004a_specialties.py:76 match the config exactly. Either both short form or both long form.
- Make step_004b_join_gold.py:SPECIALTY_TO_SEARCH_KEYWORD keys match.
- After a real run, query: SELECT primary_specialty, search_keyword, COUNT(*) FROM gold_domains GROUP BY 1, 2 ORDER BY 3 DESC. If every row has search_keyword='lawyer', the map is missing keys.
This fix is invisible until Fix 1 lands and we see real data flowing.
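Until real data flows, the key alignment can also be checked statically. A sketch, assuming both the category list and the keyword map can be imported or loaded as plain Python collections; the function name unmapped_specialties is hypothetical.

```python
def unmapped_specialties(valid_categories, keyword_map):
    """Return category keys that have no entry in the search_keyword map."""
    return sorted(set(valid_categories) - set(keyword_map))
```

For example, passing the long-form fallback names against a short-form map returns every category, which is exactly the "everything becomes lawyer" case. An empty list means every specialty routes to its own keyword.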
Fix 3: Let the intake cron fire 004a and 004b.
Once Phase 001 Fix 3 installs the 0 10 1 * * intake crontab entry, 004a and 004b fire automatically at the end of every intake cascade. The orchestrator’s topological order ends with 004a → 004b.
No 004-specific cron line. Verify the orchestrator topology: 003d must complete before 004a, and 004a must complete before 004b. If the orchestrator parallelizes 004a and 004b, 004b reads an empty c1_specialties.json and produces gold rows with no specialties.
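One way to check the ordering requirement is to express the dependencies as a graph and inspect the resulting order. The DEPENDS_ON table below is a hypothetical representation; the real orchestrator may encode its topology differently. Only the step IDs come from this document.

```python
from graphlib import TopologicalSorter

# step -> set of steps that must finish first (hypothetical dependency table)
DEPENDS_ON = {
    "004a": {"003d"},
    "004b": {"004a"},
}

# static_order() yields each step only after all of its predecessors.
order = list(TopologicalSorter(DEPENDS_ON).static_order())


def runs_before(first: str, second: str) -> bool:
    """True when `first` is scheduled strictly before `second`."""
    return order.index(first) < order.index(second)
```

If 004b were declared without a dependency on 004a, nothing would force it to wait, which is the parallelization failure described above.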
Fix 4: Grow NON_VERTICAL_CATEGORIES from observed false positives.
After Fix 1 lands and we have fresh gold_domains rows, query for false positives:
SELECT category, primary_specialty, COUNT(*)
FROM gold_domains
WHERE final_verdict = 'V_CONFIRMED'
GROUP BY 1, 2 ORDER BY 3 DESC
Look for GMaps categories that aren’t actually law firms but slipped through (e.g. “legal services”, “mediator”, “paralegal”, “legal aid”, “notary public”, “court reporter”). For each, decide:
- Definitively not an attorney → add to NON_VERTICAL_CATEGORIES in step_004b_join_gold.py.
- Sometimes an attorney → leave it. Let the verdict stand.
Target: grow the 5-entry list to ~15-25 entries based on observed false positives.
Fix 5: Measure which category remaps actually fire.
After a real 004a run completes, grep the log:
grep "remapped primary" 000_log_files/step_004a_*.log
Three buckets to identify:
- Remaps that fire: keep them.
- Remaps that never fire (no log hit in 12 months of runs): remove from _CATEGORY_REMAP as stale.
- Invented categories appearing in logs but not in the dict: add a remap entry.
Same goal as Fix 4 — keep the dictionaries current with what models actually produce.
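The three buckets can be computed once the invented categories have been pulled out of the logs. The sketch below deliberately skips the parsing step, because the exact log line format is not shown here; it assumes you already have a list of out-of-set categories the model produced. The function name bucket_remaps is hypothetical.

```python
from collections import Counter


def bucket_remaps(remap_keys, observed_invented):
    """Split remap entries into firing and stale, and list unmapped inventions."""
    counts = Counter(observed_invented)
    remap_keys = set(remap_keys)
    return {
        # Dict entries that matched at least one observed category.
        "firing": sorted(k for k in remap_keys if counts[k] > 0),
        # Dict entries that never matched: candidates for removal.
        "stale": sorted(k for k in remap_keys if counts[k] == 0),
        # Observed categories with no dict entry: candidates to add.
        "unmapped": sorted(c for c in counts if c not in remap_keys),
    }
```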
Fix 6: Assert on the gold tables after every write.
After all three gold tables are written, assert:
- gold_domains row count for this run_id is > some minimum (50 for a real market)
- Count of final_verdict='V_CONFIRMED' rows is > 30% of total (a real market is never <30% confirmed)
- search_keyword distribution has at least 3 distinct values (if everything is “lawyer”, Fix 2 didn’t land)
- Count of primary_specialty IS NULL rows among V_CONFIRMED is < 10% (most confirmed firms should have a specialty)
If any assertion fails, exit non-zero with a clear error. Same pattern as Phases 001-003 fail-loud fixes.
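A sketch of those four checks over in-memory rows. The thresholds are the ones listed above; the function name check_gold_domains, the dict-shaped rows, and the message wording are assumptions. In the real step the rows would come from a query on gold_domains filtered by run_id.

```python
def check_gold_domains(rows, min_rows=50):
    """Return a list of failure messages; an empty list means all checks pass."""
    failures = []
    total = len(rows)
    if total <= min_rows:
        failures.append(f"row count {total} is not above minimum {min_rows}")

    confirmed = [r for r in rows if r.get("final_verdict") == "V_CONFIRMED"]
    if total and len(confirmed) / total <= 0.30:
        failures.append("V_CONFIRMED share is not above 30%")

    keywords = {r.get("search_keyword") for r in rows}
    if len(keywords) < 3:
        failures.append("search_keyword has fewer than 3 distinct values")

    null_specialty = [r for r in confirmed if r.get("primary_specialty") is None]
    if confirmed and len(null_specialty) / len(confirmed) >= 0.10:
        failures.append("10% or more of V_CONFIRMED rows have no primary_specialty")
    return failures
```

A caller would exit non-zero when the returned list is non-empty, printing every message so one run surfaces all failing checks at once.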
After all 6 fixes
The intake cron from Phase 001 Fix 3 fires on the 1st of every month. Orchestrator runs 001 → 002 → 003 → 004 in order. 004a classifies every V_CONFIRMED domain into one of 24 specialties. 004b applies post-filters, computes search keywords, and writes fresh rows to gold_domains, gold_cids, gold_paid. Every Monday/Thursday biweekly fire and every Monday weekly fire now reads from a current firm list. Intake is whole.
Then we move on to Phase 005 (OnPage crawl) — the first enrichment phase, and where the cost numbers start getting serious.