PHASE 002

Domain Verification

Step 002a · Every 30 days · Output: 22-category crawl_class per candidate domain

The Spec

How Phase 002 is supposed to behave.

Duty: Take the candidate-domain list from Phase 001’s silver table, hit each domain through DataForSEO’s OnPage API, and classify each one into one of 22 categories — live, 404, cloudflare challenge, wix disconnected, etc. Conditionally extract HTML, microdata, and structural tags depending on the classification.

Schedule: Fires every 30 days as part of the intake cascade, immediately after Phase 001.

End state: Every candidate domain has a crawl_class, a task_id, a duration, and (for live sites) raw HTML + microdata + redirect chains on disk — ready for Phase 003 to decide which ones are actually PI law firms.

What Phase 002 does, plain English

Phase 002 is the reality check on Phase 001’s candidate list. Phase 001 hands over a clean list of domains it scraped from Google Maps — but Google Maps lies. The listing may point at a domain that:

  • Doesn’t resolve in DNS
  • Resolves but returns 404 or 500
  • Is sitting behind a Cloudflare challenge that bots can’t pass
  • Is a Wix or Squarespace placeholder for an expired account
  • Redirects forever in a loop
  • Or, ideally, is a live website with content

Phase 002 figures out which one of those each candidate is, in a single classifier pass, with zero AI calls. The classification uses DFS’s extended_crawl_status field plus regex on the raw HTML <title> and body. Cheap, deterministic, fast.

The one sub-step

| Step | What it does | Provider | Reads | Writes |
| --- | --- | --- | --- | --- |
| 002a | Submit each candidate domain to DFS OnPage. Poll for finish. Classify into one of 22 categories. Conditionally extract HTML/microdata/redirects. | DataForSEO OnPage | silver (Phase 001 output) | 6 JSON files per domain on disk + enrichment_002a_verification in BQ |

The 22 crawl_class categories

The classifier is defined in classify_crawl() in step_002a_domain_verification.py. Layer 1 reads DFS’s extended_crawl_status directly; Layer 1b sub-classifies anything that didn’t cleanly resolve by regex-matching the page title and body. First match wins.
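A minimal sketch of that two-layer shape. The status strings and regex patterns here are illustrative assumptions, not the script's real rule list — the actual map and patterns live in classify_crawl() in step_002a_domain_verification.py:

```python
import re

# Layer 1: statuses DFS reports directly (illustrative subset, not the real map).
DFS_STATUS_MAP = {
    "finished": "live",
    "site_unreachable": "site_down",
    "too_many_redirects": "redirect_loop",
}

# Layer 1b: first-match-wins regex rules over <title> and body
# (illustrative subset of the 22 categories).
TITLE_BODY_RULES = [
    (re.compile(r"attention required.*cloudflare", re.I), "cloudflare_block"),
    (re.compile(r"^403", re.I), "403_forbidden"),
    (re.compile(r"404 not found", re.I), "404_not_found"),
    (re.compile(r"access denied", re.I), "access_denied"),
    (re.compile(r"connect your domain", re.I), "wix_disconnected"),
    (re.compile(r"website expired", re.I), "squarespace_expired"),
]

def classify_crawl(extended_crawl_status: str, title: str, body: str) -> str:
    # Layer 1: trust DFS's own status when it maps cleanly.
    if extended_crawl_status in DFS_STATUS_MAP:
        return DFS_STATUS_MAP[extended_crawl_status]
    # Layer 1b: sub-classify by title/body regex; first match wins.
    haystack = f"{title}\n{body}"
    for pattern, crawl_class in TITLE_BODY_RULES:
        if pattern.search(haystack):
            return crawl_class
    return "unknown"
```

Cheap and deterministic: no network, no AI, just a dict lookup and an ordered regex scan.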

| Category | Meaning | Extraction |
| --- | --- | --- |
| live | Site responded, no errors. Real website. | full |
| site_down | DNS or network unreachable. | none |
| redirect_loop | Too many redirects. | partial |
| cloudflare_challenge | Bot-protection JS challenge. | partial |
| cloudflare_access_denied | Cloudflare blocked the request. | partial |
| cloudflare_dns_error | Cloudflare returned a DNS-resolution error. | partial |
| cloudflare_block | Cloudflare “Attention Required” interstitial. | partial |
| cloudflare_server_error | Cloudflare 520/522/CNAME error. | partial |
| 403_forbidden | Title starts with “403”. | full |
| 403_unauthorized | Body mentions “403 unauthorized” / “access blocked”. | full |
| 404_not_found | Title contains “404 not found”. | full |
| auth_required | HTTP 401 Authorization Required. | full |
| access_denied | Title contains “access denied”, non-Cloudflare. | full |
| server_error | 5xx status + error title. | full |
| temp_server_error | 5xx status + real title that doesn’t scream error. | full |
| database_error | WordPress / DB error pages. | full |
| site_not_found | Title says “site not found” / “site inactive”. | full |
| wix_disconnected | Wix “Connect Your Domain” placeholder. | full |
| squarespace_expired | Squarespace “Website Expired” page. | full |
| vercel_block | Vercel Security Checkpoint. | full |
| fastly_error | Fastly error page. | full |
| unknown | Catch-all when nothing else matched. | full |

15 categories trigger full extraction (pages + raw_html + microdata + summary + redirect_chains + tags_only). 6 categories trigger partial extraction (skip microdata + tags_only — Cloudflare states give us nothing useful there). site_down triggers no extraction at all — we only save the summary we already have.
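That tiering can be sketched as a small lookup. The tier contents follow the table and the paragraph above; the function name is hypothetical:

```python
# Artifact dirs written per extraction tier (per the table above).
# tags_only is a derived artifact, the rest map to DFS endpoints.
EXTRACTION_ARTIFACTS = {
    "full":    ["summary", "pages", "raw_html", "microdata",
                "redirect_chains", "tags_only"],
    "partial": ["summary", "pages", "raw_html",
                "redirect_chains"],          # skip microdata + tags_only
    "none":    ["summary"],                  # site_down: keep what we already have
}

PARTIAL_CLASSES = {
    "redirect_loop", "cloudflare_challenge", "cloudflare_access_denied",
    "cloudflare_dns_error", "cloudflare_block", "cloudflare_server_error",
}

def extraction_tier(crawl_class: str) -> str:
    if crawl_class == "site_down":
        return "none"
    if crawl_class in PARTIAL_CLASSES:
        return "partial"
    return "full"  # the remaining 15 classes, live through unknown
```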

How the data moves (and where it lives)

1. Input (Phase 001 deliverable): amicus_pipeline.silver, candidate domains, one row each (or 07_domains_unique.json upstream).
2. Input (live API): DataForSEO OnPage, /task_post → /summary → /pages → /raw_html → /microdata → /redirect_chains.
3. Code (runs on the VM): step_002a_domain_verification.py in /mnt/workspace/amicus/pipeline/steps/002_domain_verification/. Async POST in batches of 100 → poll every 10s → classify + extract.
4. Per-domain audit (VM disk): output/<date>/, 6 dirs per domain: pages/ · raw_html/ · microdata/ · tags_only/ · summary/ · redirect_chains/.
5. Run-level audit (VM disk): manifest.jsonl · task_status_v2.csv · submissions.jsonl · events.jsonl. One row per domain with state, task_id, crawl_class, duration_ms.
6. Output (the deliverable): amicus_pipeline.enrichment_002a_verification, one row per (run_id, domain) with crawl_class, task_id, onpage_status, raw_json.
7. Hands off to Phase 003, Domain Classification (out of scope for this page).

Where to look — file & table reference

| Thing | Path or table |
| --- | --- |
| The script | /mnt/workspace/amicus/pipeline/steps/002_domain_verification/step_002a_domain_verification.py |
| Per-domain raw responses | pipeline/steps/002_domain_verification/output/<YYYY-MM-DD>/{pages,raw_html,microdata,tags_only,summary,redirect_chains}/ |
| Per-run manifest | output/<date>/manifest.jsonl |
| Per-run task status | output/<date>/task_status_v2.csv |
| Crash-recovery log | output/<date>/submissions.jsonl (used by --resume-from) |
| Streaming events | output/<date>/events.jsonl |
| BQ deliverable | amicus_pipeline.enrichment_002a_verification |
| Per-step log | pipeline/steps/000_log_files/step_002a_*.log |
| Per-step accounting | amicus_logs.step_logs (run_id, status, duration) |

Cost per fire

One DFS task per candidate domain. The OnPage API charges Tier 1 rate per task, regardless of how many endpoints we hit on that task afterward (those are free reads of an already-paid task).

| Line item | Tasks | Per task | Subtotal |
| --- | --- | --- | --- |
| 002a — DFS OnPage task_post for atty_wa_seattle silver (3,004 unique domains) | ~3,004 | $0.000125 | ~$0.38 |
| Poll + extract endpoints (/summary · /pages · /raw_html · /microdata · /redirect_chains) | — | free | $0.00 |
| Total per 30-day fire | ~3,004 | — | ~$0.38 |
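The subtotal is straight multiplication, assuming the Tier 1 rate above:

```python
# One paid task_post per domain; the follow-up reads are free.
DOMAINS = 3004
PER_TASK = 0.000125  # $ per OnPage task at Tier 1

total = DOMAINS * PER_TASK
print(f"${total:.2f}")  # → $0.38 (exactly $0.3755 before rounding)
```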

Schedule

Frequency: Every 30 days — once per intake cycle, same as Phase 001.
Trigger: Runs immediately after Phase 001 in the intake cascade. Not a separate cron.
Execution mode: Single async script. Batches POSTs, polls in parallel, extracts in parallel.
Concurrency: 1,600 DFS RPM, 150 concurrent connections, 100 domains per POST batch.
Per-task timeout: 31 min hard cap per domain; 30 min with no progress → STUCK; 2 hr global cap.
Crash recovery: --resume-from submissions.jsonl re-attaches to in-flight tasks.
Output guarantee: Every candidate domain ends in a terminal state — DONE, EXTRACT_FAILED, FAILED, TIMED_OUT, STUCK, or NO_PAGES_RETURNED. No PENDING after the run.

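The batching and stall/timeout rules can be sketched like this. This is a simplified shape, not the script's actual code — fetch_status is a stand-in for the DFS /summary poll, injected so the loop is testable without the live API:

```python
import asyncio
import time

POST_BATCH_SIZE = 100   # domains per /task_post call (line 67 of the script)
POLL_INTERVAL_S = 10    # poll cadence
STALL_S = 30 * 60       # no status change for 30 min → STUCK
HARD_CAP_S = 31 * 60    # per-domain hard timeout → TIMED_OUT

def batched(items, size=POST_BATCH_SIZE):
    """Yield POST-sized chunks of the candidate-domain list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

async def poll_until_terminal(task_id, fetch_status, clock=time.monotonic):
    """Poll one DFS task until it reaches a terminal state."""
    start = last_progress = clock()
    last_status = None
    while True:
        status = await fetch_status(task_id)
        if status == "finished":
            return "DONE"        # extraction failures are decided downstream
        if status != last_status:  # any status change counts as progress
            last_status, last_progress = status, clock()
        now = clock()
        if now - start > HARD_CAP_S:
            return "TIMED_OUT"
        if now - last_progress > STALL_S:
            return "STUCK"
        await asyncio.sleep(POLL_INTERVAL_S)
```

With 150 concurrent connections, each in-flight task runs this loop under a semaphore; the terminal string is what lands in task_status_v2.csv.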
What's Fucked

Phase 002 is not running the spec. Here’s exactly how.

Finding 1 — 002a is tagged intake, which never auto-fires.

In pipeline/steps/cadence.py, 002a is tagged as intake. The intake tier has runs_on: [] — meaning never auto-fires. Same gap as the rest of Phase 001. There is no biweekly/weekly/monthly cron line that picks up 002a.

| Step | Cadence tag | Auto-fires? |
| --- | --- | --- |
| 002a | intake | never |

Finding 2 — Stale upstream input: silver hasn’t been refreshed since mid-April.

Even if 002a fired today, it would be classifying a stale candidate list. Phase 001’s silver table has not had fresh rows written since some date in mid-April 2026 (see Phase 001 forensic). Every law firm that opened, closed, or rebranded after that date is missing from the input that 002a would read.

Result: even a successful Phase 002 fire today is verifying a list of domains that no longer represents the current market.

Finding 3 — enrichment_002a_verification has no rows from any scheduled fire.

The write_bq_enrichment() method in step_002a_domain_verification.py only writes to BigQuery when PIPELINE_RUN_ID is set in the environment (see lines 1407–1409 of the script):

if not run_id:
    self.log.info("BQ write skipped: no PIPELINE_RUN_ID (standalone mode)")
    return

PIPELINE_RUN_ID is set by the orchestrator. If 002a is fired manually by SSH-ing in and running the script directly — which has been the only way it has fired since Phase 001 broke — BQ writes silently skip. So even the manual fires have not populated the BQ deliverable.

Finding 4 — 002a was never reached on the 2026-04-06 cascade.

That run died at 001d (geofence) and 001s (bronze→silver) before the orchestrator ever invoked 002a. amicus_logs.pipeline_runs for 2026-04-06 shows failed_step=001d and failed_step=001s on every run row — 002a is not listed as a failed step because it was never invoked.

The script may or may not work end-to-end against current DFS API behavior. That is unverified. The last time it was confirmed to fire cleanly is unclear from the audit trail.

Finding 5 — The single-domain-loop trap is structurally avoided… for now.

The script POSTs in batches of 100 (line 67: POST_BATCH_SIZE = 100) and the BQ writer accumulates rows in memory then appends with a single append_rows() call at the end of the run (line 1454). This is correct. Phase 002 does not have the “one BQ load_table_from_json per row” bug that destroyed the 2026-05 cost ledger.

However: the BQ write path has only been exercised in dev. Without scheduled fires it has not been hit in production. If the schema for enrichment_002a_verification has drifted from the row dict that write_bq_enrichment() builds at line 1424, the first scheduled fire will fail at the append step.

The bottom line

Where Phase 002 Stands Today

No candidate domain has been verified on a schedule. The BQ deliverable table is unpopulated by automated runs. Manual ad-hoc fires (when they happen) write JSON files to disk on the VM but skip the BQ write because PIPELINE_RUN_ID isn’t set. Phase 003 has no fresh classification input. Phase 002 is a car with a working engine and no key in the ignition.

The Fix

What we’ll do to make Phase 002 match the spec.

Five fixes. Most of Phase 002’s fix is downstream of Phase 001’s fix — once intake actually fires, 002a runs along with it. The dedicated 002-specific work is small.

FIX 1 Verify enrichment_002a_verification schema matches the row dict. ~10 min

Read pipeline/steps/bq/schemas/enrichment.py for enrichment_002a_verification. Compare every field name and type to the row dict assembled at lines 1424–1446 of step_002a_domain_verification.py:

  • run_id, run_type, vertical, state, profile_id, org_id
  • domain, cid (always None — intentional?), ingestion_timestamp
  • crawl_class, task_id, onpage_status, playwright_status (always None at this stage)
  • status_code, pages_crawled, tags_html_chars (all None — not populated in 002a)
  • error_message, raw_json

If a field is missing on either side, fix the dict or the schema. Verify on the table itself: bq show --schema --format=prettyjson amicus_pipeline.enrichment_002a_verification.
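A throwaway check along those lines — the field list is FIX 1's, the helper itself is hypothetical:

```python
# Row-dict keys built by write_bq_enrichment() (per FIX 1's list).
ROW_DICT_KEYS = {
    "run_id", "run_type", "vertical", "state", "profile_id", "org_id",
    "domain", "cid", "ingestion_timestamp",
    "crawl_class", "task_id", "onpage_status", "playwright_status",
    "status_code", "pages_crawled", "tags_html_chars",
    "error_message", "raw_json",
}

def schema_drift(schema_field_names: set) -> dict:
    """Diff the live BQ schema's field names against the row-dict keys."""
    return {
        "in_dict_not_in_schema": sorted(ROW_DICT_KEYS - schema_field_names),
        "in_schema_not_in_dict": sorted(schema_field_names - ROW_DICT_KEYS),
    }
```

Feed it the names out of `bq show --schema --format=prettyjson`; both lists must come back empty before the first scheduled fire.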

FIX 2 Smoke-test 002a against current DFS behavior on a 5-domain sample. ~10 min (+ ~5 min DFS wait)

Pick 5 known-live attorney domains from a recent silver snapshot (or hardcode 5 obvious ones). Run:

PIPELINE_RUN_ID=test_2026_05_16 \
PIPELINE_PROFILE_ID=atty_wa_seattle \
PIPELINE_VERTICAL=attorney \
python step_002a_domain_verification.py /path/to/silver.json --test 5 --no-confirm

Verify:

  • All 5 domains end in DONE or another terminal state — no PENDING
  • 6 output dirs exist with one JSON each per domain
  • 5 rows landed in enrichment_002a_verification with run_id=test_2026_05_16
  • At least 3 of 5 classified as live (sanity check — if zero, something is wrong)

FIX 3 Inherit the Phase 001 intake cron. no work — lands with Phase 001 Fix 3

Once Phase 001 Fix 3 installs the 0 10 1 * * intake crontab entry that runs orchestrator.py --cadence intake, 002a fires automatically — it’s already tagged intake in cadence.py and the orchestrator’s topological order is 001→002→003→004.

No 002-specific cron work needed. Don’t add a separate 002 cron. The intake cascade is one cron firing the whole intake tier in order.

FIX 4 Add a fail-loud verification at the end of 002a. ~20 min

Same pattern as Phase 001 Fix 6. After write_bq_enrichment() completes, add a final assertion:

  • Count of rows written must equal count of input domains
  • Fraction of crawl_class = 'live' must be > 0 (a real market never has 100% dead domains; if it does, something is wrong upstream)
  • No domain may end in PENDING — every row has a terminal state

If any check fails, exit non-zero with a specific error. Prevents a future silent zero-write run from being marked “completed”.
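A sketch of what those three checks could look like. The function and row shape are hypothetical; the terminal-state list is the spec's:

```python
TERMINAL = {"DONE", "EXTRACT_FAILED", "FAILED", "TIMED_OUT",
            "STUCK", "NO_PAGES_RETURNED"}

def verify_run(input_domains, rows):
    """Fail loud: raise (non-zero exit) if the run's output is incomplete."""
    # Check 1: one row per input domain.
    if len(rows) != len(input_domains):
        raise SystemExit(
            f"FATAL: wrote {len(rows)} rows for {len(input_domains)} domains")
    # Check 2: no domain left in a non-terminal state.
    pending = [r["domain"] for r in rows if r["state"] not in TERMINAL]
    if pending:
        raise SystemExit(f"FATAL: non-terminal states for {pending[:5]}")
    # Check 3: a real market is never 100% dead.
    live = sum(1 for r in rows if r["crawl_class"] == "live")
    if live == 0:
        raise SystemExit("FATAL: 0 live domains — upstream input is suspect")
```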

FIX 5 Decide whether 002a’s 6 output dirs need to stop accumulating. ~10 min decision · ~30 min if act

Each fire writes one JSON per (domain, endpoint). At ~3,004 candidate domains × 6 endpoints = ~18,000 small files per fire on the VM disk. After 12 monthly fires that’s ~216,000 files in the output directory. The 50 GB data disk still has headroom (small JSON each), but the output/<date>/ tree grows unbounded.

Two options:

  • Leave it. Disk is cheap, the files are tiny, and we’d rather have the audit trail than not.
  • Tar + GCS. After each successful fire, tarball the dated dir and copy it with gsutil to gs://amicus-pipeline-archives/002a/<date>.tar.gz. Delete the local copy. Adds one step to the script’s tail.

Default: leave it until the disk hits 80% full. Then revisit.
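If we take the tar-and-ship option, the script's tail could look roughly like this. The bucket path is from above; gsutil is assumed to be on PATH; the upload step is injectable so the tar/cleanup logic is testable offline:

```python
import shutil
import subprocess
import tarfile
from pathlib import Path

def _gsutil_upload(tar_path: Path, dest: str):
    # Assumes gsutil is installed and authenticated on the VM.
    subprocess.run(["gsutil", "cp", str(tar_path), dest], check=True)

def archive_run(output_dir: Path,
                bucket: str = "gs://amicus-pipeline-archives/002a",
                upload=_gsutil_upload) -> Path:
    """Tarball one dated output dir, ship it to GCS, drop the local tree."""
    tar_path = output_dir.parent / f"{output_dir.name}.tar.gz"
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(output_dir, arcname=output_dir.name)
    upload(tar_path, f"{bucket}/{tar_path.name}")
    shutil.rmtree(output_dir)  # local tree is now redundant
    return tar_path
```

Run it only after the fail-loud checks pass, so a bad fire never gets archived over a good one.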

After all 5 fixes

The intake cron from Phase 001 Fix 3 fires on the 1st of every month. Orchestrator runs 001 to completion, then runs 002a against the fresh silver output. Every candidate domain ends up in enrichment_002a_verification with a real crawl_class. Phase 003 picks up from there.

Then we move on to Phase 003.