The Spec
How Phase 002 is supposed to behave.
Phase 002 — Domain Verification. Design intent. Reference, not reality.
Duty: Take the candidate-domain list from Phase 001’s silver table, hit each domain through DataForSEO’s OnPage API, and classify each one into one of 22 categories — live, 404, cloudflare challenge, wix disconnected, etc. Conditionally extract HTML, microdata, and structural tags depending on the classification.
Schedule: Fires every 30 days as part of the intake cascade, immediately after Phase 001.
End state: Every candidate domain has a crawl_class, a task_id, a duration, and (for live sites) raw HTML + microdata + redirect chains on disk — ready for Phase 003 to decide which ones are actually PI law firms.
What Phase 002 does, plain English
Phase 002 is the reality check on Phase 001’s candidate list. Phase 001 hands over a clean list of domains it scraped from Google Maps — but Google Maps lies. The listing may point at a domain that:
- Doesn’t resolve in DNS
- Resolves but returns 404 or 500
- Is sitting behind a Cloudflare challenge that bots can’t pass
- Is a Wix or Squarespace placeholder for an expired account
- Redirects forever in a loop
- Or, ideally, is a live website with content
Phase 002 figures out which one of those each candidate is, in a single classifier pass, with zero AI calls. The classification uses DFS’s extended_crawl_status field plus regex on the raw HTML <title> and body. Cheap, deterministic, fast.
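The two-layer decision described above can be sketched roughly as follows. This is a minimal illustration, not the real `classify_crawl()`: the category names come from the table in this doc, but the specific regex patterns and the helper name `classify_crawl_sketch` are assumptions.

```python
def classify_crawl_sketch(extended_crawl_status: str, title: str, body: str) -> str:
    # Layer 1: trust DFS's extended_crawl_status when it maps cleanly.
    layer1 = {
        "ok": "live",
        "unreachable": "site_down",
        "redirect_loop": "redirect_loop",
    }
    if extended_crawl_status in layer1:
        return layer1[extended_crawl_status]

    # Layer 1b: sub-classify by matching the page <title> and body.
    # First match wins; patterns here are illustrative stand-ins.
    t, b = title.lower(), body.lower()
    if "attention required" in t and "cloudflare" in b:
        return "cloudflare_block"
    if t.startswith("403"):
        return "403_forbidden"
    if "404 not found" in t:
        return "404_not_found"
    if "connect your domain" in b and "wix" in b:
        return "wix_disconnected"
    if "website expired" in b:
        return "squarespace_expired"
    return "unknown"
```

Because both layers are table lookups and string matches, the whole pass is deterministic and costs zero AI calls, exactly as the spec requires.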
The one sub-step
| Step | What it does | Provider | Reads | Writes |
|---|---|---|---|---|
| 002a | Submit each candidate domain to DFS OnPage. Poll for finish. Classify into one of 22 categories. Conditionally extract HTML/microdata/redirects. | DataForSEO OnPage | silver (Phase 001 output) | 6 JSON files per domain on disk + enrichment_002a_verification in BQ |
The 22 crawl_class categories
The classifier is defined in classify_crawl() in step_002a_domain_verification.py. Layer 1 reads DFS’s extended_crawl_status directly; Layer 1b sub-classifies anything that didn’t cleanly resolve by regex-matching the page title and body. First match wins.
| Category | Meaning | Extraction |
|---|---|---|
| live | Site responded, no errors. Real website. | full |
| site_down | DNS or network unreachable. | none |
| redirect_loop | Too many redirects. | partial |
| cloudflare_challenge | Bot-protection JS challenge. | partial |
| cloudflare_access_denied | Cloudflare blocked the request. | partial |
| cloudflare_dns_error | Cloudflare returned a DNS-resolution error. | partial |
| cloudflare_block | Cloudflare “Attention Required” interstitial. | partial |
| cloudflare_server_error | Cloudflare 520/522/CNAME error. | partial |
| 403_forbidden | Title starts with “403”. | full |
| 403_unauthorized | Body mentions “403 unauthorized” / “access blocked”. | full |
| 404_not_found | Title contains “404 not found”. | full |
| auth_required | HTTP 401 Authorization Required. | full |
| access_denied | Title contains “access denied”, non-Cloudflare. | full |
| server_error | 5xx status + error title. | full |
| temp_server_error | 5xx status + real title that doesn’t scream error. | full |
| database_error | WordPress / DB error pages. | full |
| site_not_found | Title says “site not found” / “site inactive”. | full |
| wix_disconnected | Wix “Connect Your Domain” placeholder. | full |
| squarespace_expired | Squarespace “Website Expired” page. | full |
| vercel_block | Vercel Security Checkpoint. | full |
| fastly_error | Fastly error page. | full |
| unknown | Catch-all when nothing else matched. | full |
15 categories trigger full extraction (pages + raw_html + microdata + summary + redirect_chains + tags_only). 6 categories trigger partial extraction (skip microdata + tags_only — Cloudflare states give us nothing useful there). site_down triggers no extraction at all — we only save the summary we already have.
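The tier logic above is simple enough to state as a lookup. A minimal sketch, assuming the directory names listed later in this doc; the function name `extraction_plan` is illustrative:

```python
# Full tier = all 6 output dirs; partial drops microdata + tags_only;
# site_down keeps only the summary we already have.
FULL = {"pages", "raw_html", "microdata", "summary", "redirect_chains", "tags_only"}
PARTIAL = FULL - {"microdata", "tags_only"}
PARTIAL_CLASSES = {
    "redirect_loop", "cloudflare_challenge", "cloudflare_access_denied",
    "cloudflare_dns_error", "cloudflare_block", "cloudflare_server_error",
}

def extraction_plan(crawl_class: str) -> set[str]:
    if crawl_class == "site_down":
        return {"summary"}
    if crawl_class in PARTIAL_CLASSES:
        return PARTIAL
    return FULL  # the remaining 15 categories
```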
How the data moves (and where it lives)

Code · runs on the VM
step_002a_domain_verification.py at /mnt/workspace/amicus/pipeline/steps/002_domain_verification/

↓ async POST in batches of 100 → poll every 10s → classify + extract

Per-domain audit · VM disk
output/<date>/ with 6 dirs per domain: pages/ · raw_html/ · microdata/ · tags_only/ · summary/ · redirect_chains/

Run-level audit · VM disk
manifest.jsonl · task_status_v2.csv · submissions.jsonl · events.jsonl — one row per domain with state, task_id, crawl_class, duration_ms

↓

Output · the deliverable
amicus_pipeline.enrichment_002a_verification — one row per (run_id, domain) with crawl_class, task_id, onpage_status, raw_json

↓

Hands off to Phase 003 — Domain Classification (out of scope for this page)
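The submit-then-poll shape of the flow above can be sketched as below. This is a structural sketch only: the `submit`/`poll` callables stand in for the real DataForSEO OnPage HTTP calls, and the defaults mirror the doc's numbers (100 domains per POST, 10 s poll interval).

```python
import asyncio
from itertools import islice

def batches(domains, size=100):
    # Yield the candidate list in POST-sized chunks.
    it = iter(domains)
    while chunk := list(islice(it, size)):
        yield chunk

async def run_batches(domains, submit, poll, poll_interval=10):
    # submit(batch) -> list of task_ids; poll(task_id) -> status or None
    tasks = []
    for batch in batches(domains):
        tasks.extend(await submit(batch))

    done = {}
    while len(done) < len(tasks):
        for tid in tasks:
            if tid not in done:
                status = await poll(tid)
                if status is not None:
                    done[tid] = status  # task reached a terminal state
        if len(done) < len(tasks):
            await asyncio.sleep(poll_interval)
    return done
```

The real script layers timeouts, STUCK detection, and extraction on top of this loop; the sketch only shows the batch/poll skeleton.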
Where to look — file & table reference
| Thing | Path or table |
|---|---|
| The script | /mnt/workspace/amicus/pipeline/steps/002_domain_verification/step_002a_domain_verification.py |
| Per-domain raw responses | pipeline/steps/002_domain_verification/output/<YYYY-MM-DD>/{pages,raw_html,microdata,tags_only,summary,redirect_chains}/ |
| Per-run manifest | output/<date>/manifest.jsonl |
| Per-run task status | output/<date>/task_status_v2.csv |
| Crash-recovery log | output/<date>/submissions.jsonl (used by --resume-from) |
| Streaming events | output/<date>/events.jsonl |
| BQ deliverable | amicus_pipeline.enrichment_002a_verification |
| Per-step log | pipeline/steps/000_log_files/step_002a_*.log |
| Per-step accounting | amicus_logs.step_logs (run_id, status, duration) |
Cost per fire
One DFS task per candidate domain. The OnPage API charges Tier 1 rate per task, regardless of how many endpoints we hit on that task afterward (those are free reads of an already-paid task).
| Line item | Tasks | Per task | Subtotal |
|---|---|---|---|
| 002a — DFS OnPage task_post for atty_wa_seattle silver (3,004 unique domains) | ~3,004 | $0.000125 | ~$0.38 |
| Poll + extract endpoints (/summary · /pages · /raw_html · /microdata · /redirect_chains) | free | — | $0.00 |
| Total per 30-day fire | ~3,004 | | ~$0.38 |
Phase 002 is still cheap relative to Phase 005 (the OnPage deep crawl, ~$70+ per fire). It’s a single task per unique silver domain at $0.000125 each. Silver count queried 2026-05-16 from amicus_pipeline.silver for atty_wa_seattle.
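A quick check of the arithmetic behind the ~$0.38 figure, using the task count and Tier 1 rate stated above:

```python
# One Tier 1 task per unique silver domain; extract endpoints are free.
tasks = 3_004
per_task_usd = 0.000125
total_usd = tasks * per_task_usd  # 0.3755, which rounds to ~$0.38 per fire
```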
Schedule
Frequency
Every 30 days, one fire per intake cycle. Same cadence as Phase 001.
Trigger
Runs immediately after Phase 001 in the intake cascade. Not a separate cron.
Execution mode
Single async script. Batches POSTs, polls in parallel, extracts in parallel.
Concurrency
1,600 DFS RPM, 150 concurrent connections, 100 domains per POST batch.
Per-task timeout
31 min hard cap per domain. 30 min stall → STUCK. 2 hr global cap.
Crash recovery
--resume-from submissions.jsonl re-attaches to in-flight tasks.
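The re-attach step can be sketched as a small JSONL reader. The record shape (`{"domain": ..., "task_id": ...}`) is an assumption for illustration; the real submissions.jsonl may carry more fields.

```python
import json
from pathlib import Path

def load_inflight(path: Path) -> dict[str, str]:
    # Map domain -> task_id so a restarted run can poll existing tasks
    # instead of re-submitting. Last submission per domain wins.
    inflight = {}
    for line in path.read_text().splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        inflight[rec["domain"]] = rec["task_id"]
    return inflight
```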
Output guarantee
Every candidate domain ends in a terminal state. DONE, EXTRACT_FAILED, FAILED, TIMED_OUT, STUCK, or NO_PAGES_RETURNED. No PENDING after the run.
What's Fucked
Phase 002 is not running to the spec. Here’s exactly how.
Forensic findings, 2026-05-16. Same root cause as Phase 001, propagated downstream.
Finding 1 — 002a is tagged intake, which never auto-fires.
In pipeline/steps/cadence.py, 002a is tagged as intake. The intake tier has runs_on: [] — meaning never auto-fires. Same gap as the rest of Phase 001. There is no biweekly/weekly/monthly cron line that picks up 002a.
| Step | Cadence tag | Auto-fires? |
|---|---|---|
| 002a | intake | never |
Finding 2 — Stale upstream input: silver hasn’t been refreshed since mid-April.
Even if 002a fired today, it would be classifying a stale candidate list. Phase 001’s silver table has not had fresh rows written since some date in mid-April 2026 (see Phase 001 forensic). Every law firm that opened, closed, or rebranded after that date is missing from the input that 002a would read.
Result: even a successful Phase 002 fire today is verifying a list of domains that no longer represents the current market.
Finding 3 — enrichment_002a_verification has no rows from any scheduled fire.
The write_bq_enrichment() method in step_002a_domain_verification.py only writes to BigQuery when PIPELINE_RUN_ID is set in the environment (see lines 1407–1409 of the script):
```python
if not run_id:
    self.log.info("BQ write skipped: no PIPELINE_RUN_ID (standalone mode)")
    return
```
PIPELINE_RUN_ID is set by the orchestrator. If 002a is fired manually by SSH-ing in and running the script directly — which has been the only way it has fired since Phase 001 broke — BQ writes silently skip. So even the manual fires have not populated the BQ deliverable.
Finding 4 — 002a was never reached on the 2026-04-06 cascade.
That run died at 001d (geofence) and 001s (bronze→silver) before the orchestrator ever invoked 002a. amicus_logs.pipeline_runs for 2026-04-06 shows failed_step=001d and failed_step=001s on every run row — 002a is not listed as a failed step because it was never invoked.
The script may or may not work end-to-end against current DFS API behavior. That is unverified. The last time it was confirmed to fire cleanly is unclear from the audit trail.
Finding 5 — The single-domain-loop trap is structurally avoided… for now.
The script POSTs in batches of 100 (line 67: POST_BATCH_SIZE = 100) and the BQ writer accumulates rows in memory then appends with a single append_rows() call at the end of the run (line 1454). This is correct. Phase 002 does not have the “one BQ load_table_from_json per row” bug that destroyed the 2026-05 cost ledger.
However: the BQ write path has only been exercised in dev. Without scheduled fires it has not been hit in production. If the schema for enrichment_002a_verification has drifted from the row dict that write_bq_enrichment() builds at line 1424, the first scheduled fire will fail at the append step.
The bottom line
Where Phase 002 Stands Today
No candidate domain has been verified on a schedule. The BQ deliverable table is unpopulated by automated runs. Manual ad-hoc fires (when they happen) write JSON files to disk on the VM but skip the BQ write because PIPELINE_RUN_ID isn’t set. Phase 003 has no fresh classification input. Phase 002 is a working car with a working engine and no key in the ignition.
The Fix
What we’ll do to make Phase 002 match the spec.
Concrete remediation. Mostly inherits from Phase 001’s fix.
Five fixes. Most of Phase 002’s fix is downstream of Phase 001’s fix — once intake actually fires, 002a runs along with it. The dedicated 002-specific work is small.
Fix 1 — Schema parity check.
Read pipeline/steps/bq/schemas/enrichment.py for enrichment_002a_verification. Compare every field name and type to the row dict assembled at lines 1424–1446 of step_002a_domain_verification.py:
- run_id, run_type, vertical, state, profile_id, org_id
- domain, cid (always None — intentional?), ingestion_timestamp
- crawl_class, task_id, onpage_status, playwright_status (always None at this stage)
- status_code, pages_crawled, tags_html_chars (all None — not populated in 002a)
- error_message, raw_json

If a field is missing on either side, fix the dict or the schema. Verify against the live table: bq show --schema --format=prettyjson amicus_pipeline.enrichment_002a_verification.
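The comparison itself is mechanical enough to script. A sketch, assuming the field list above and the `bq show --schema` JSON output format (an array of `{"name": ..., "type": ...}` objects); the helper name `diff_schema` is illustrative:

```python
import json

# Field names from the row dict described in this doc; a stand-in for
# the dict write_bq_enrichment() actually builds.
ROW_FIELDS = {
    "run_id", "run_type", "vertical", "state", "profile_id", "org_id",
    "domain", "cid", "ingestion_timestamp",
    "crawl_class", "task_id", "onpage_status", "playwright_status",
    "status_code", "pages_crawled", "tags_html_chars",
    "error_message", "raw_json",
}

def diff_schema(bq_schema_json: str) -> tuple[set, set]:
    # bq_schema_json: stdout of
    #   bq show --schema --format=prettyjson amicus_pipeline.enrichment_002a_verification
    table_fields = {f["name"] for f in json.loads(bq_schema_json)}
    missing_in_table = ROW_FIELDS - table_fields  # dict has it, table doesn't
    missing_in_row = table_fields - ROW_FIELDS    # table has it, dict doesn't
    return missing_in_table, missing_in_row
```

Either set being non-empty means the first scheduled fire would fail (or silently drop data) at the append step, so both should be empty before Fix 2.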
Fix 2 — One supervised test fire.
Pick 5 known-live attorney domains from a recent silver snapshot (or hardcode 5 obvious ones). Run:

```shell
PIPELINE_RUN_ID=test_2026_05_16 \
PIPELINE_PROFILE_ID=atty_wa_seattle \
PIPELINE_VERTICAL=attorney \
python step_002a_domain_verification.py /path/to/silver.json --test 5 --no-confirm
```
Verify:
- All 5 domains end in DONE or another terminal state — no PENDING
- 6 output dirs exist with one JSON each per domain
- 5 rows landed in enrichment_002a_verification with run_id=test_2026_05_16
- At least 3 of 5 classified as live (sanity check — if zero, something is wrong)
Fix 3 — Ride Phase 001’s intake cron.
Once Phase 001 Fix 3 installs the 0 10 1 * * intake crontab entry that runs orchestrator.py --cadence intake, 002a fires automatically — it’s already tagged intake in cadence.py and the orchestrator’s topological order is 001→002→003→004. No 002-specific cron work is needed; don’t add a separate 002 cron. The intake cascade is one cron firing the whole intake tier in order.
Fix 4 — Post-run assertions.
Same pattern as Phase 001 Fix 6. After write_bq_enrichment() completes, add a final assertion:
- Count of rows written must equal count of input domains
- Fraction of crawl_class = 'live' must be > 0 (a real market never has 100% dead domains; if it does, something is wrong upstream)
- No domain may end in PENDING — every row has a terminal state

If any check fails, exit non-zero with a specific error. This prevents a future silent zero-write run from being marked “completed”.
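A minimal sketch of that gate, assuming rows are dicts carrying "crawl_class" and "state" keys (the real row shape may differ); the terminal-state list comes from the Schedule section of this doc:

```python
TERMINAL = {"DONE", "EXTRACT_FAILED", "FAILED", "TIMED_OUT", "STUCK",
            "NO_PAGES_RETURNED"}

def assert_run_sane(rows: list[dict], input_domain_count: int) -> None:
    # Each failure exits non-zero with a specific message, so a silent
    # zero-write run can never be marked "completed".
    if len(rows) != input_domain_count:
        raise SystemExit(f"rows written {len(rows)} != input domains {input_domain_count}")
    if not any(r["crawl_class"] == "live" for r in rows):
        raise SystemExit("0% live — upstream input is probably broken")
    pending = [r for r in rows if r["state"] not in TERMINAL]
    if pending:
        raise SystemExit(f"{len(pending)} domains left in a non-terminal state")
```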
Fix 5 — Output-directory retention.
Each fire writes one JSON per (domain, endpoint). At ~3,004 candidate domains × 6 endpoints, that is ~18,000 small files per fire on the VM disk; after 12 monthly fires, ~216,000 files in the output directory. The 50 GB data disk still has headroom (each JSON is small), but the output/<date>/ tree grows unbounded.
Two options:
- Leave it. Disk is cheap, the files are tiny, and we’d rather have the audit trail than not.
- Tar + GCS. After each successful fire, tarball the dated dir and upload it with gsutil cp to gs://amicus-pipeline-archives/002a/<date>.tar.gz, then delete the local copy. Adds one step to the script’s tail.

Default: leave it until the disk hits 80% full, then revisit.
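The 80%-full trigger is a one-liner worth wiring into the script's tail. A sketch, using the stdlib; the function name `disk_needs_cleanup` is illustrative:

```python
import shutil

def disk_needs_cleanup(path: str = "/mnt/workspace", threshold: float = 0.80) -> bool:
    # True once the filesystem holding `path` crosses the threshold,
    # at which point the Tar + GCS option above becomes worth doing.
    usage = shutil.disk_usage(path)
    return usage.used / usage.total >= threshold
```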
After all 5 fixes
The intake cron from Phase 001 Fix 3 fires on the 1st of every month. Orchestrator runs 001 to completion, then runs 002a against the fresh silver output. Every candidate domain ends up in enrichment_002a_verification with a real crawl_class. Phase 003 picks up from there.
Then we move on to Phase 003.