PHASE 002

Domain Verification

Step 002a · Every 30 days · Output: 22-category crawl_class per candidate domain

The Spec

How Phase 002 is supposed to behave.

Duty: Take the candidate-domain list from Phase 001’s silver table, hit each domain through DataForSEO’s OnPage API, and classify each one into one of 22 categories — live, 404, cloudflare challenge, wix disconnected, etc. Conditionally extract HTML, microdata, and structural tags depending on the classification.

Schedule: Fires every 30 days as part of the intake cascade, immediately after Phase 001.

End state: Every candidate domain has a crawl_class, a task_id, a duration, and (for live sites) raw HTML + microdata + redirect chains on disk — ready for Phase 003 to decide which ones are actually PI law firms.

What Phase 002 does, plain English

Phase 002 is the reality check on Phase 001’s candidate list. Phase 001 hands over a clean list of domains it scraped from Google Maps — but Google Maps lies. The listing may point at a domain that:

  • Doesn’t resolve in DNS
  • Resolves but returns 404 or 500
  • Is sitting behind a Cloudflare challenge that bots can’t pass
  • Is a Wix or Squarespace placeholder for an expired account
  • Redirects forever in a loop
  • Or, ideally, is a live website with content

Phase 002 figures out which one of those each candidate is, in a single classifier pass, with zero AI calls. The classification uses DFS’s extended_crawl_status field plus regex on the raw HTML <title> and body. Cheap, deterministic, fast.

The one sub-step

| Step | What it does | Provider | Reads | Writes |
| --- | --- | --- | --- | --- |
| 002a | Submit each candidate domain to DFS OnPage. Poll for finish. Classify into one of 22 categories. Conditionally extract HTML/microdata/redirects. | DataForSEO OnPage | silver (Phase 001 output) | 6 JSON files per domain on disk + enrichment_002a_verification in BQ |

The 22 crawl_class categories

The classifier is defined in classify_crawl() in step_002a_domain_verification.py. Layer 1 reads DFS’s extended_crawl_status directly; Layer 1b sub-classifies anything that didn’t cleanly resolve by regex-matching the page title and body. First match wins.
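A minimal sketch of that two-layer shape. The status strings and regex patterns here are illustrative assumptions, not the script's real rule list — the actual map and patterns live in classify_crawl() in step_002a_domain_verification.py:

```python
import re

# Layer 1: statuses DFS reports directly (illustrative subset, not the real map).
DFS_STATUS_MAP = {
    "finished": "live",
    "site_unreachable": "site_down",
    "too_many_redirects": "redirect_loop",
}

# Layer 1b: first-match-wins regex rules over <title> and body
# (illustrative subset of the 22 categories).
TITLE_BODY_RULES = [
    (re.compile(r"attention required.*cloudflare", re.I), "cloudflare_block"),
    (re.compile(r"^403", re.I), "403_forbidden"),
    (re.compile(r"404 not found", re.I), "404_not_found"),
    (re.compile(r"access denied", re.I), "access_denied"),
    (re.compile(r"connect your domain", re.I), "wix_disconnected"),
    (re.compile(r"website expired", re.I), "squarespace_expired"),
]

def classify_crawl(extended_crawl_status: str, title: str, body: str) -> str:
    # Layer 1: trust DFS's own status when it maps cleanly.
    if extended_crawl_status in DFS_STATUS_MAP:
        return DFS_STATUS_MAP[extended_crawl_status]
    # Layer 1b: sub-classify by title/body regex; first match wins.
    haystack = f"{title}\n{body}"
    for pattern, crawl_class in TITLE_BODY_RULES:
        if pattern.search(haystack):
            return crawl_class
    return "unknown"
```

Cheap and deterministic: no network, no AI, just a dict lookup and an ordered regex scan.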

| Category | Meaning | Extraction |
| --- | --- | --- |
| live | Site responded, no errors. Real website. | full |
| site_down | DNS or network unreachable. | none |
| redirect_loop | Too many redirects. | partial |
| cloudflare_challenge | Bot-protection JS challenge. | partial |
| cloudflare_access_denied | Cloudflare blocked the request. | partial |
| cloudflare_dns_error | Cloudflare returned a DNS-resolution error. | partial |
| cloudflare_block | Cloudflare “Attention Required” interstitial. | partial |
| cloudflare_server_error | Cloudflare 520/522/CNAME error. | partial |
| 403_forbidden | Title starts with “403”. | full |
| 403_unauthorized | Body mentions “403 unauthorized” / “access blocked”. | full |
| 404_not_found | Title contains “404 not found”. | full |
| auth_required | HTTP 401 Authorization Required. | full |
| access_denied | Title contains “access denied”, non-Cloudflare. | full |
| server_error | 5xx status + error title. | full |
| temp_server_error | 5xx status + real title that doesn’t scream error. | full |
| database_error | WordPress / DB error pages. | full |
| site_not_found | Title says “site not found” / “site inactive”. | full |
| wix_disconnected | Wix “Connect Your Domain” placeholder. | full |
| squarespace_expired | Squarespace “Website Expired” page. | full |
| vercel_block | Vercel Security Checkpoint. | full |
| fastly_error | Fastly error page. | full |
| unknown | Catch-all when nothing else matched. | full |

15 categories trigger full extraction (pages + raw_html + microdata + summary + redirect_chains + tags_only). 6 categories trigger partial extraction (skip microdata + tags_only — Cloudflare states give us nothing useful there). site_down triggers no extraction at all — we only save the summary we already have.
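That tiering can be sketched as a small lookup. The tier contents follow the table and the paragraph above; the function name is hypothetical:

```python
# Artifact dirs written per extraction tier (per the table above).
# tags_only is a derived artifact, the rest map to DFS endpoints.
EXTRACTION_ARTIFACTS = {
    "full":    ["summary", "pages", "raw_html", "microdata",
                "redirect_chains", "tags_only"],
    "partial": ["summary", "pages", "raw_html",
                "redirect_chains"],          # skip microdata + tags_only
    "none":    ["summary"],                  # site_down: keep what we already have
}

PARTIAL_CLASSES = {
    "redirect_loop", "cloudflare_challenge", "cloudflare_access_denied",
    "cloudflare_dns_error", "cloudflare_block", "cloudflare_server_error",
}

def extraction_tier(crawl_class: str) -> str:
    if crawl_class == "site_down":
        return "none"
    if crawl_class in PARTIAL_CLASSES:
        return "partial"
    return "full"  # the remaining 15 classes, live through unknown
```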

How the data moves (and where it lives)

1. Input (Phase 001 deliverable): amicus_pipeline.silver, candidate domains, one row each (or 07_domains_unique.json upstream).
2. Input (live API): DataForSEO OnPage, /task_post → /summary → /pages → /raw_html → /microdata → /redirect_chains.
3. Code (runs on the VM): step_002a_domain_verification.py in /mnt/workspace/amicus/pipeline/steps/002_domain_verification/. Async POST in batches of 100 → poll every 10s → classify + extract.
4. Per-domain audit (VM disk): output/<date>/, 6 dirs per domain: pages/ · raw_html/ · microdata/ · tags_only/ · summary/ · redirect_chains/.
5. Run-level audit (VM disk): manifest.jsonl · task_status_v2.csv · submissions.jsonl · events.jsonl. One row per domain with state, task_id, crawl_class, duration_ms.
6. Output (the deliverable): amicus_pipeline.enrichment_002a_verification, one row per (run_id, domain) with crawl_class, task_id, onpage_status, raw_json.
7. Hands off to Phase 003, Domain Classification (out of scope for this page).

Where to look — file & table reference

| Thing | Path or table |
| --- | --- |
| The script | /mnt/workspace/amicus/pipeline/steps/002_domain_verification/step_002a_domain_verification.py |
| Per-domain raw responses | pipeline/steps/002_domain_verification/output/<YYYY-MM-DD>/{pages,raw_html,microdata,tags_only,summary,redirect_chains}/ |
| Per-run manifest | output/<date>/manifest.jsonl |
| Per-run task status | output/<date>/task_status_v2.csv |
| Crash-recovery log | output/<date>/submissions.jsonl (used by --resume-from) |
| Streaming events | output/<date>/events.jsonl |
| BQ deliverable | amicus_pipeline.enrichment_002a_verification |
| Per-step log | pipeline/steps/000_log_files/step_002a_*.log |
| Per-step accounting | amicus_logs.step_logs (run_id, status, duration) |

Cost per fire

One DFS task per candidate domain. The OnPage API charges Tier 1 rate per task, regardless of how many endpoints we hit on that task afterward (those are free reads of an already-paid task).

| Line item | Tasks | Per task | Subtotal |
| --- | --- | --- | --- |
| 002a — DFS OnPage task_post for atty_wa_seattle silver (3,004 unique domains) | ~3,004 | $0.000125 | ~$0.38 |
| Poll + extract endpoints (/summary · /pages · /raw_html · /microdata · /redirect_chains) | — | free | $0.00 |
| Total per 30-day fire | ~3,004 | — | ~$0.38 |
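The subtotal is straight multiplication, assuming the Tier 1 rate above:

```python
# One paid task_post per domain; the follow-up reads are free.
DOMAINS = 3004
PER_TASK = 0.000125  # $ per OnPage task at Tier 1

total = DOMAINS * PER_TASK
print(f"${total:.2f}")  # → $0.38 (exactly $0.3755 before rounding)
```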

Schedule

Frequency: Every 30 days — once per intake cycle, same as Phase 001.
Trigger: Runs immediately after Phase 001 in the intake cascade. Not a separate cron.
Execution mode: Single async script. Batches POSTs, polls in parallel, extracts in parallel.
Concurrency: 1,600 DFS RPM, 150 concurrent connections, 100 domains per POST batch.
Per-task timeout: 31 min hard cap per domain; 30 min with no progress → STUCK; 2 hr global cap.
Crash recovery: --resume-from submissions.jsonl re-attaches to in-flight tasks.
Output guarantee: Every candidate domain ends in a terminal state — DONE, EXTRACT_FAILED, FAILED, TIMED_OUT, STUCK, or NO_PAGES_RETURNED. No PENDING after the run.

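The batching and stall/timeout rules can be sketched like this. This is a simplified shape, not the script's actual code — fetch_status is a stand-in for the DFS /summary poll, injected so the loop is testable without the live API:

```python
import asyncio
import time

POST_BATCH_SIZE = 100   # domains per /task_post call (line 67 of the script)
POLL_INTERVAL_S = 10    # poll cadence
STALL_S = 30 * 60       # no status change for 30 min → STUCK
HARD_CAP_S = 31 * 60    # per-domain hard timeout → TIMED_OUT

def batched(items, size=POST_BATCH_SIZE):
    """Yield POST-sized chunks of the candidate-domain list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

async def poll_until_terminal(task_id, fetch_status, clock=time.monotonic):
    """Poll one DFS task until it reaches a terminal state."""
    start = last_progress = clock()
    last_status = None
    while True:
        status = await fetch_status(task_id)
        if status == "finished":
            return "DONE"        # extraction failures are decided downstream
        if status != last_status:  # any status change counts as progress
            last_status, last_progress = status, clock()
        now = clock()
        if now - start > HARD_CAP_S:
            return "TIMED_OUT"
        if now - last_progress > STALL_S:
            return "STUCK"
        await asyncio.sleep(POLL_INTERVAL_S)
```

With 150 concurrent connections, each in-flight task runs this loop under a semaphore; the terminal string is what lands in task_status_v2.csv.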
What's Fucked

Phase 002 is not running the spec. Here’s exactly how.

Finding 1 — 002a is tagged intake, which never auto-fires.

In pipeline/steps/cadence.py, 002a is tagged as intake. The intake tier has runs_on: [] — meaning never auto-fires. Same gap as the rest of Phase 001. There is no biweekly/weekly/monthly cron line that picks up 002a.

| Step | Cadence tag | Auto-fires? |
| --- | --- | --- |
| 002a | intake | never |

Finding 2 — Stale upstream input: silver hasn’t been refreshed since mid-April.

Even if 002a fired today, it would be classifying a stale candidate list. Phase 001’s silver table has not had fresh rows written since some date in mid-April 2026 (see Phase 001 forensic). Every law firm that opened, closed, or rebranded after that date is missing from the input that 002a would read.

Result: even a successful Phase 002 fire today is verifying a list of domains that no longer represents the current market.

Finding 3 — enrichment_002a_verification has no rows from any scheduled fire.

The write_bq_enrichment() method in step_002a_domain_verification.py only writes to BigQuery when PIPELINE_RUN_ID is set in the environment (see lines 1407–1409 of the script):

if not run_id:
    self.log.info("BQ write skipped: no PIPELINE_RUN_ID (standalone mode)")
    return

PIPELINE_RUN_ID is set by the orchestrator. If 002a is fired manually by SSH-ing in and running the script directly — which has been the only way it has fired since Phase 001 broke — BQ writes silently skip. So even the manual fires have not populated the BQ deliverable.

Finding 4 — 002a was never reached on the 2026-04-06 cascade.

That run died at 001d (geofence) and 001s (bronze→silver) before the orchestrator ever invoked 002a. amicus_logs.pipeline_runs for 2026-04-06 shows failed_step=001d and failed_step=001s on every run row — 002a is not listed as a failed step because it was never invoked.

The script may or may not work end-to-end against current DFS API behavior. That is unverified. The last time it was confirmed to fire cleanly is unclear from the audit trail.

Finding 5 — The single-domain-loop trap is structurally avoided… for now.

The script POSTs in batches of 100 (line 67: POST_BATCH_SIZE = 100) and the BQ writer accumulates rows in memory then appends with a single append_rows() call at the end of the run (line 1454). This is correct. Phase 002 does not have the “one BQ load_table_from_json per row” bug that destroyed the 2026-05 cost ledger.

However: the BQ write path has only been exercised in dev. Without scheduled fires it has not been hit in production. If the schema for enrichment_002a_verification has drifted from the row dict that write_bq_enrichment() builds at line 1424, the first scheduled fire will fail at the append step.

The bottom line

Where Phase 002 Stands Today

No candidate domain has been verified on a schedule. The BQ deliverable table is unpopulated by automated runs. Manual ad-hoc fires (when they happen) write JSON files to disk on the VM but skip the BQ write because PIPELINE_RUN_ID isn’t set. Phase 003 has no fresh classification input. Phase 002 is a car with a working engine and no key in the ignition.

The Fix

What we’ll do to make Phase 002 match the spec.

Five fixes. Most of Phase 002’s fix is downstream of Phase 001’s fix — once intake actually fires, 002a runs along with it. The dedicated 002-specific work is small.

FIX 1 Verify enrichment_002a_verification schema matches the row dict. ~10 min

Read pipeline/steps/bq/schemas/enrichment.py for enrichment_002a_verification. Compare every field name and type to the row dict assembled at lines 1424–1446 of step_002a_domain_verification.py:

  • run_id, run_type, vertical, state, profile_id, org_id
  • domain, cid (always None — intentional?), ingestion_timestamp
  • crawl_class, task_id, onpage_status, playwright_status (always None at this stage)
  • status_code, pages_crawled, tags_html_chars (all None — not populated in 002a)
  • error_message, raw_json

If a field is missing on either side, fix the dict or the schema. Verify on the table itself: bq show --schema --format=prettyjson amicus_pipeline.enrichment_002a_verification.
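A throwaway check along those lines — the field list is FIX 1's, the helper itself is hypothetical:

```python
# Row-dict keys built by write_bq_enrichment() (per FIX 1's list).
ROW_DICT_KEYS = {
    "run_id", "run_type", "vertical", "state", "profile_id", "org_id",
    "domain", "cid", "ingestion_timestamp",
    "crawl_class", "task_id", "onpage_status", "playwright_status",
    "status_code", "pages_crawled", "tags_html_chars",
    "error_message", "raw_json",
}

def schema_drift(schema_field_names: set) -> dict:
    """Diff the live BQ schema's field names against the row-dict keys."""
    return {
        "in_dict_not_in_schema": sorted(ROW_DICT_KEYS - schema_field_names),
        "in_schema_not_in_dict": sorted(schema_field_names - ROW_DICT_KEYS),
    }
```

Feed it the names out of `bq show --schema --format=prettyjson`; both lists must come back empty before the first scheduled fire.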

FIX 2 Smoke-test 002a against current DFS behavior on a 5-domain sample. ~10 min (+ ~5 min DFS wait)

Pick 5 known-live attorney domains from a recent silver snapshot (or hardcode 5 obvious ones). Run:

PIPELINE_RUN_ID=test_2026_05_16 \
PIPELINE_PROFILE_ID=atty_wa_seattle \
PIPELINE_VERTICAL=attorney \
python step_002a_domain_verification.py /path/to/silver.json --test 5 --no-confirm

Verify:

  • All 5 domains end in DONE or another terminal state — no PENDING
  • 6 output dirs exist with one JSON each per domain
  • 5 rows landed in enrichment_002a_verification with run_id=test_2026_05_16
  • At least 3 of 5 classified as live (sanity check — if zero, something is wrong)

FIX 3 Inherit the Phase 001 intake cron. no work — lands with Phase 001 Fix 3

Once Phase 001 Fix 3 installs the 0 10 1 * * intake crontab entry that runs orchestrator.py --cadence intake, 002a fires automatically — it’s already tagged intake in cadence.py and the orchestrator’s topological order is 001→002→003→004.

No 002-specific cron work needed. Don’t add a separate 002 cron. The intake cascade is one cron firing the whole intake tier in order.

FIX 4 Add a fail-loud verification at the end of 002a. ~20 min

Same pattern as Phase 001 Fix 6. After write_bq_enrichment() completes, add a final assertion:

  • Count of rows written must equal count of input domains
  • Fraction of crawl_class = 'live' must be > 0 (a real market never has 100% dead domains; if it does, something is wrong upstream)
  • No domain may end in PENDING — every row has a terminal state

If any check fails, exit non-zero with a specific error. Prevents a future silent zero-write run from being marked “completed”.
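A sketch of what those three checks could look like. The function and row shape are hypothetical; the terminal-state list is the spec's:

```python
TERMINAL = {"DONE", "EXTRACT_FAILED", "FAILED", "TIMED_OUT",
            "STUCK", "NO_PAGES_RETURNED"}

def verify_run(input_domains, rows):
    """Fail loud: raise (non-zero exit) if the run's output is incomplete."""
    # Check 1: one row per input domain.
    if len(rows) != len(input_domains):
        raise SystemExit(
            f"FATAL: wrote {len(rows)} rows for {len(input_domains)} domains")
    # Check 2: no domain left in a non-terminal state.
    pending = [r["domain"] for r in rows if r["state"] not in TERMINAL]
    if pending:
        raise SystemExit(f"FATAL: non-terminal states for {pending[:5]}")
    # Check 3: a real market is never 100% dead.
    live = sum(1 for r in rows if r["crawl_class"] == "live")
    if live == 0:
        raise SystemExit("FATAL: 0 live domains — upstream input is suspect")
```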

FIX 5 Decide whether 002a’s 6 output dirs need to stop accumulating. ~10 min decision · ~30 min if act

Each fire writes one JSON per (domain, endpoint). At ~3,004 candidate domains × 6 endpoints = ~18,000 small files per fire on the VM disk. After 12 monthly fires that’s ~216,000 files in the output directory. The 50 GB data disk still has headroom (small JSON each), but the output/<date>/ tree grows unbounded.

Two options:

  • Leave it. Disk is cheap, the files are tiny, and we’d rather have the audit trail than not.
  • Tar + GCS. After each successful fire, tarball the dated dir and copy it with gsutil to gs://amicus-pipeline-archives/002a/<date>.tar.gz. Delete the local copy. Adds one step to the script’s tail.

Default: leave it until the disk hits 80% full. Then revisit.
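If we take the tar-and-ship option, the script's tail could look roughly like this. The bucket path is from above; gsutil is assumed to be on PATH; the upload step is injectable so the tar/cleanup logic is testable offline:

```python
import shutil
import subprocess
import tarfile
from pathlib import Path

def _gsutil_upload(tar_path: Path, dest: str):
    # Assumes gsutil is installed and authenticated on the VM.
    subprocess.run(["gsutil", "cp", str(tar_path), dest], check=True)

def archive_run(output_dir: Path,
                bucket: str = "gs://amicus-pipeline-archives/002a",
                upload=_gsutil_upload) -> Path:
    """Tarball one dated output dir, ship it to GCS, drop the local tree."""
    tar_path = output_dir.parent / f"{output_dir.name}.tar.gz"
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(output_dir, arcname=output_dir.name)
    upload(tar_path, f"{bucket}/{tar_path.name}")
    shutil.rmtree(output_dir)  # local tree is now redundant
    return tar_path
```

Run it only after the fail-loud checks pass, so a bad fire never gets archived over a good one.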

After all 5 fixes

The intake cron from Phase 001 Fix 3 fires on the 1st of every month. Orchestrator runs 001 to completion, then runs 002a against the fresh silver output. Every candidate domain ends up in enrichment_002a_verification with a real crawl_class. Phase 003 picks up from there.

Then we move on to Phase 003.