PHASE 005

OnPage

Steps 005a → 005j · Monthly tier · First paid enrichment · Output: bronze + silver tables for every firm’s website

The Spec

How Phase 005 is supposed to behave.

Duty: For every confirmed firm in gold_domains, deeply crawl their entire website (up to 2,000 pages each), audit it through three different lenses (DFS OnPage, DFS Lighthouse, Google PageSpeed Insights), and count the attorneys listed on the team page. Land everything in BigQuery so downstream analytics can query SEO-grade signal per firm.

Schedule: Monthly tier — designed to fire once per 30 days.

End state: ~2,247 V_CONFIRMED firms × ~155 pages avg = ~348,000 page-rows in bronze tables; one row per firm in 8 enrichment tables; one attorney-count estimate per firm.

What Phase 005 does, plain English

Phases 001-004 produced a clean list of confirmed PI law firms with primary specialties. Phase 005 is where we actually look at their websites. For each firm we want to know:

  • How many pages does the site have? How fast does each page load? Which pages return 404s or redirect chains?
  • What does Google Lighthouse score it for performance, accessibility, SEO, and best practices?
  • What do real Chrome users see — Core Web Vitals from Google’s field-data dataset?
  • How many attorneys actually work there, based on their “Our Team” / “Attorneys” page?

This is the first enrichment phase in the pipeline (Phases 001-004 are intake, Phases 005-010 are enrichment). It’s also the first phase that costs real money — the OnPage deep crawl alone burns ~$2-3 per fire.

The 10 sub-steps

Step | What it does | Provider | Output table
005a | DFS OnPage deep crawl — up to 2,000 pages per domain, 10 endpoints per crawl (summary, pages, links, dup_tags, dup_content, redirects, non_indexable, waterfall, microdata, raw_html). The big paid step. | DataForSEO OnPage | bronze_005_onpage, bronze_005_perpage, bronze_005_raw_html, enrichment_005a_deep_crawl
005b | Parse /summary bronze into per-domain stats (total pages, status code distribution, indexing issues). | n/a | enrichment_005b_onpage_summary
005c | Parse /pages bronze into per-URL detail (title, meta description, H1, word count). | n/a | enrichment_005c_pages
005d | Parse /links bronze into internal & external link graph. | n/a | enrichment_005d_links
005e | Parse /duplicate_tags bronze — pages sharing <title> or <meta description>. | n/a | enrichment_005e_duplicate_tags
005f | Parse /redirect_chains bronze — multi-hop redirects and redirect loops. | n/a | enrichment_005f_redirect_chains
005g | Parse /non_indexable bronze — pages blocked from Google (noindex, robots disallow). | n/a | enrichment_005g_non_indexable
005h | DFS Lighthouse audit per homepage — Performance, Accessibility, SEO, Best Practices scores + Core Web Vitals. | DataForSEO Lighthouse | enrichment_005h_lighthouse
005i | Google PageSpeed Insights — real-user Core Web Vitals (LCP / CLS / INP) from Chrome field data. | Google PSI | enrichment_005i_pagespeed
005j | Haiku attorney-count estimator — reads the firm’s “Our Team” page from 005a’s raw_html, classifies into a bucket: solo · 2 · 3_to_5 · 6_to_10 · 11_plus · unknown. | Anthropic Haiku | enrichment_005j_attorney_count

005a is the load-bearing step

005a (the deep crawl) is the source of truth for every step b through g. Those steps are parsers — they read 005a’s bronze tables and decompose them into silver-layer tables. If 005a doesn’t run, the rest of the phase has nothing to parse.

005a fires up to 10 distinct DFS endpoint classes per domain:

Endpoint | Grain | Lands in
/summary | per-domain | bronze_005_onpage
/pages | per-domain (paginated) | bronze_005_onpage
/links | per-domain (paginated) | bronze_005_onpage
/duplicate_tags | per-domain | bronze_005_onpage
/redirect_chains | per-domain | bronze_005_onpage
/non_indexable | per-domain | bronze_005_onpage
/waterfall | per-URL (every crawled page) | bronze_005_perpage
/duplicate_content | per-URL | bronze_005_perpage
/microdata | per-URL | bronze_005_perpage
/raw_html | per-URL (every crawled page) | bronze_005_perpage + legacy homepage-only bronze_005_raw_html
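
A minimal sketch of the POST → poll → retrieve loop 005a runs per domain, using the public DataForSEO v3 OnPage endpoints and HTTP Basic auth with DFS_USERNAME / DFS_PASSWORD. The endpoint paths and response fields follow the DFS docs; everything else (function names, retrieving only /pages) is illustrative and not lifted from step_005a_onpage_crawl.py:

# Hedged sketch of 005a's per-domain flow: submit a crawl task, poll the
# summary until the crawl finishes, then pull one retrieval endpoint.
# The real monolith layers pacing, retries, and bronze writes on top.
import os
import time
import requests

AUTH = (os.environ["DFS_USERNAME"], os.environ["DFS_PASSWORD"])
BASE = "https://api.dataforseo.com/v3/on_page"

def crawl_domain(domain: str, max_pages: int = 2000) -> dict:
    # 1. task_post — the only retrieval-adjacent endpoint that allows task
    #    stacking; the pipeline still submits a single task per POST.
    post = requests.post(f"{BASE}/task_post", auth=AUTH,
                         json=[{"target": domain, "max_crawl_pages": max_pages}])
    task_id = post.json()["tasks"][0]["id"]

    # 2. Poll /summary until crawl_progress reports "finished".
    while True:
        summary = requests.get(f"{BASE}/summary/{task_id}", auth=AUTH).json()
        result = summary["tasks"][0]["result"][0]
        if result.get("crawl_progress") == "finished":
            break
        time.sleep(30)  # spec: 30-second poll interval

    # 3. Retrieve one endpoint class (here /pages); the monolith repeats this
    #    for links, duplicate_tags, redirect_chains, non_indexable, etc.
    pages = requests.post(f"{BASE}/pages", auth=AUTH,
                          json=[{"id": task_id, "limit": 1000}])
    return {"summary": result, "pages": pages.json()}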

DFS OnPage call volume & rate limits

The OnPage retrieval endpoints (everything except task_post) do not support task stacking. From DataForSEO docs:

“All other endpoints of OnPage API do not recommend sending several tasks in one POST call as it may result in system overload and 4xx or 5xx errors.”

Concretely, per the validated April 7 baseline run:

Constraint | Value
Account-wide rate cap | 2,000 calls/min
Pipeline pacing (85%) | 1,700 RPM — strict 35.3 ms between any two acquires
Max simultaneous in-flight | 30 requests
Tasks per HTTP POST | 1 (only task_post, instant_pages, page_screenshot allow stacking)
Max crawl pages per domain | 2,000
Per-task timeout | 2 hours hard cap
Global run timeout | 8 hours hard cap
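
A minimal sketch of that pacing policy, assuming an asyncio worker pool: one lock enforces the 35.3 ms inter-acquire spacing for the 1,700 RPM budget, and a 30-slot semaphore caps in-flight requests. Class and function names here are illustrative, not the monolith's:

# Illustrative pacing: 1,700 requests/min means one acquire every ~35.3 ms,
# with never more than 30 requests in flight at once.
import asyncio
import time

RPM = 1700
MIN_INTERVAL = 60.0 / RPM          # ~0.0353 s between any two acquires
MAX_IN_FLIGHT = 30

class RateGate:
    def __init__(self):
        self._lock = asyncio.Lock()
        self._sem = asyncio.Semaphore(MAX_IN_FLIGHT)
        self._last = 0.0

    async def __aenter__(self):
        await self._sem.acquire()               # cap concurrent requests
        async with self._lock:                  # serialize the pacing check
            wait = self._last + MIN_INTERVAL - time.monotonic()
            if wait > 0:
                await asyncio.sleep(wait)
            self._last = time.monotonic()
        return self

    async def __aexit__(self, *exc):
        self._sem.release()

async def fetch_endpoint(gate: RateGate, call):
    async with gate:
        return await call()                     # the actual DFS HTTP call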

How the data moves

Input · Phase 004 deliverable
gold_domains
Filter: final_verdict in {V_CONFIRMED, V_LIKELY}
+ is_domain_primary = TRUE

005a · DFS OnPage deep crawl
step_005a_onpage_crawl.py
For each firm: POST task → poll → retrieve 10 endpoints
Up to 2,000 pages per domain
Writes bronze_005_onpage + bronze_005_perpage

005b → 005g · per-endpoint parsers
step_005[b-g]_*.py
Read bronze, write enrichment_005{b,c,d,e,f,g}_* silver tables
(currently empty — see forensic)

005h · DFS Lighthouse
step_005h_lighthouse_dfs.py
Per-homepage Lighthouse audit
Performance / A11y / SEO / Best Practices scores

005i · Google PSI
step_005i_pagespeed_google.py
CrUX field data — real-user Web Vitals
(free, Google API)

005j · attorney count
step_005j_attorney_count.py
Reads raw_html for “Our Team” page
Haiku classifies into bucket: solo, 2, 3-5, 6-10, 11+, unknown

Outputs · BigQuery
bronze_005_onpage · bronze_005_perpage · bronze_005_raw_html
+ enrichment_005a → enrichment_005j
Every firm has bronze payloads for 10 endpoints,
plus silver enrichment rows per parser
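
For 005j specifically, the classification call might look like the sketch below, assuming the Anthropic Python SDK. The prompt wording, model string, and bucket parsing are illustrative assumptions, not code from step_005j_attorney_count.py:

# Illustrative 005j call: feed the "Our Team" page text to Haiku and ask
# for one of the six buckets the spec defines.
import os
import anthropic

BUCKETS = {"solo", "2", "3_to_5", "6_to_10", "11_plus", "unknown"}
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY_B_SERIES"])

def classify_attorney_count(team_page_text: str) -> str:
    msg = client.messages.create(
        model="claude-3-haiku-20240307",   # assumed Haiku model string
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": (
                "Count the attorneys listed on this law-firm team page and "
                "answer with exactly one of: solo, 2, 3_to_5, 6_to_10, "
                "11_plus, unknown.\n\n" + team_page_text[:30000]
            ),
        }],
    )
    answer = msg.content[0].text.strip().lower()
    return answer if answer in BUCKETS else "unknown"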

Where to look — file & table reference

Thing | Path or table
The scripts | /mnt/workspace/amicus/pipeline/steps/005_onpage/step_005*.py
OnPage rewrite (partial) | step1_summary.py · step2_pages.py (shipped 2026-05-06) — meant to replace the 005a monolith
Old monolith (still in use) | step_005a_onpage_crawl.py (1,967 lines)
Rewrite plan / status | pipeline/steps/005_onpage/README.md
April 7 baseline run report | pipeline/steps/utilities/documentation/first_production_run_report.md
Bronze tables | amicus_pipeline.bronze_005_onpage · bronze_005_perpage · bronze_005_raw_html
Enrichment tables | amicus_pipeline.enrichment_005a_deep_crawl · enrichment_005b..g_* · enrichment_005h_lighthouse · enrichment_005i_pagespeed · enrichment_005j_attorney_count
API keys | DFS_USERNAME · DFS_PASSWORD · PAGESPEED_API_KEY · ANTHROPIC_API_KEY_B_SERIES in .env

Cost per fire

First phase that costs real money. 005a is dominant; 005h is secondary; 005j is small; 005b-g + 005i are free.

Line item | Volume | Per unit | Subtotal
005a — DFS OnPage Tier 1 ($0.000125/page, ~155 pages/domain avg, 2,247 V_CONFIRMED firms) | ~348,000 pages | $0.000125 | ~$43
005b → 005g — silver parsers (Python only, no external API) | n/a | n/a | $0.00
005h — DFS Lighthouse audits (one per homepage) | ~2,247 | ~$0.005 | ~$11
005i — Google PSI (free tier, 25,000 queries/day) | ~2,247 | $0 | $0.00
005j — Haiku attorney count (~8K input + ~100 output tokens per call) | ~2,247 | ~$0.009 | ~$20
Total per monthly fire | | | ~$74
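
The subtotals are plain arithmetic over the stated assumptions; a quick sanity check:

# Back-of-envelope check of the per-fire cost using the spec's assumptions.
firms = 2247                           # V_CONFIRMED firms
pages = firms * 155                    # ~348,285 crawled pages
onpage = pages * 0.000125              # ~$43.5  DFS OnPage Tier 1
lighthouse = firms * 0.005             # ~$11.2  one audit per homepage
haiku = firms * 0.009                  # ~$20.2  attorney-count calls
print(round(onpage + lighthouse + haiku, 2))
# ~75 before rounding; the table's rounded subtotals sum to ~$74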

Schedule

Frequency
Monthly tier — designed to fire once per 30 days.
Trigger
Orchestrator cron with --cadence monthly.
Execution mode
005a runs first (it’s upstream of b-g). 005h, 005i, 005j can parallelize with each other.
Concurrency
005a: 1,700 RPM strict, 30 in-flight, 6 extraction workers, 4-round retry. 005j: 40 Haiku concurrent.
Crash recovery
--resume-from submissions.jsonl re-attaches to in-flight tasks. --tier2 escalates tier-1-failed tasks to tier-2 pricing.
Output guarantee
Every input domain ends in a terminal state. Bronze tables get one row per (domain, endpoint).
What's Fucked

Phase 005 is not running the spec. Here’s exactly how.

Finding 1 — The “monthly” cron expression isn’t monthly.

The single orchestrator crontab entry on the VM is 30 10 * * 1,4 — that fires every Monday and Thursday at 10:30 UTC, NOT once per month. It runs the orchestrator with the monthly/weekly/biweekly tiers together.

Consequence: if 005 is tagged monthly in cadence.py, it fires on every Mon + Thu. That’s ~8-9 fires per month, not 1. At ~$74 per fire (2,247 V_CONFIRMED firms × the cost spec above), the monthly tier alone is bleeding ~$590-665/month instead of ~$74.

(See Monthly cron schedule for the broader cron-vs-spec finding.)

Finding 2 — Input is stale: gold_domains hasn’t been refreshed since Phase 004 last produced rows.

Phase 005 reads gold_domains filtered to final_verdict in {V_CONFIRMED, V_LIKELY}. Phase 004 hasn’t written fresh gold rows since whenever PIPELINE_RUN_ID was last set in a Phase 004 run (see Phase 004 forensic Finding 2). Every Mon + Thu fire of 005 is therefore operating on a frozen firm list, paying ~$3 to re-crawl the same domains.

Finding 3 — 005a is mid-rewrite. Old monolith + half-done replacement coexist.

Per 005_onpage/README.md: as of 2026-05-06 the team is replacing the 1,967-line step_005a_onpage_crawl.py with a four-step decomposition (step1_summary.py → step4_perpage.py). Status:

File | Purpose | Status
step1_summary.py | Phase 2 — summary retrieval | shipped 2026-05-06
step2_pages.py | Phase 3 — /pages retrieval | shipped 2026-05-06
step3_independent.py | Phase 4 — links / dup_tags / redirects / non_indexable | TODO
step4_perpage.py | Phase 5 — raw_html per URL | TODO
run_all.py | Sequential orchestrator | TODO
step_005a_onpage_crawl.py | OLD monolith | kept until new flow validates end-to-end

Two execution paths exist on the VM. Until the rewrite finishes, the monolith is what fires — with all the shared-state bugs the rewrite was meant to fix.

Finding 4 — The silver parsers (005b-g) have never written rows.

Per the April 7 baseline run summary in README.md:

“Empty tables (parsers wired but never wrote): enrichment_005b through enrichment_005g. The bronze JSON payloads ARE preserved; the silver-layer parsing has never successfully run end-to-end.”

The bronze data exists. The parsers exist. They’ve never connected. Today, step_005b_onpage_summary.py contains:

print("step_005b_onpage_summary: endpoint retrieval currently handled by step_005a_onpage_crawl.py")
print("This script will be the standalone entry point after BQ-first refactor.")
sys.exit(0)

Yes, that’s the entire body of 005b. It’s a stub. The other parsers (005c-g) are the same: none of them actually parses bronze into its enrichment_005{c-g} table in production.

Finding 5 — The fire-and-forget shared-state bugs the rewrite was meant to fix are still in the monolith.

Direct quote from README.md about the monolith:

“Every fix this past month touched shared state, exposed a new bug, repeat. By 2026-05-06 we’d burned 17 days fighting six interlocking bugs (token-bucket miscounts, semaphore stacking, OOM accumulator, --resume CLI mismatch, shared output dirs, /pages-with-non-2xx URLs).”

Many were patched in place. None were structurally eliminated — the architecture that causes them (one file doing seven jobs) is still what runs every Mon + Thu.

Finding 6 — Per-task and global timeouts are generous.

Defaults baked into step_005a_onpage_crawl.py:

Bound | Value | What happens at limit
Per-task timeout | 2 hours | marked TIMED_OUT
Per-task stall (no page-count progress) | 60 min | marked STUCK
Global run timeout | 8 hours | remaining tasks marked timed_out
Poll interval | 30 sec | n/a

An 8-hour global run window opens every Mon + Thu. If 005 ever starts a real intake-driven run while the cron is also firing the monthly tier, two 8-hour windows could overlap, doubling the DFS load and tripling cost. Today this doesn’t happen because intake never fires — but if Phase 001 Fix 3 lands and the cron-overlap issue isn’t addressed, this becomes live.

The bottom line

Where Phase 005 Stands Today

005a (the monolith) fires every Mon + Thu against a stale gold list: 8-9 fires a month at ~$74 each (2,247 V_CONFIRMED firms), or ~$590-665/month for data no downstream silver layer consumes. The bronze tables get fresh rows (re-crawls of the same firms), but enrichment_005a_deep_crawl currently shows 0 rows for the latest atty production runs in BQ — the enrichment write path appears broken. 005b-g enrichment tables stay empty (the parsers are stubs). 005h, 005i, and 005j run alongside 005a. The rewrite is half-done and the monolith is still authoritative.

The Fix

What we’ll do to make Phase 005 match the spec.

Seven fixes. The first two are pure cost-recovery and don’t require any new code. The rest finish the rewrite and land the silver parsers.

FIX 1 Split the cron — monthly fires once per month, not every Mon + Thu. ~20 min

The single 30 10 * * 1,4 entry conflates three cadences. Split into:

  • 30 10 * * 1,4 ... --cadence biweekly — Mon + Thu (009 bulk endpoints, etc.)
  • 30 10 * * 1 ... --cadence weekly — Mondays only (008c/d, 010b, etc.)
  • 30 10 1 * * ... --cadence monthly — 1st of month only (005, 006, 008a/b, 009a/b, 010a/c/h)

After this fix, 005 fires once per month. Saves ~$520-590/month immediately (7-8 redundant fires × ~$74 each; the one legitimate monthly fire still happens).

FIX 2 Wait for Phase 004 Fix 1 — fresh gold before re-crawling. no work — gated on Phase 004 Fix 1

Until gold_domains gets fresh rows from a working Phase 004 cycle, re-crawling is just paying to re-crawl the same firms. Hold the next 005 fire until Phase 004 Fix 1 has landed and a verified intake cycle has produced new gold.

Verify before firing: SELECT MAX(ingestion_timestamp) FROM amicus_pipeline.gold_domains — should be within the last 30 days.

FIX 3 Finish the rewrite — ship step3_independent.py and step4_perpage.py. ~1 day each

The four-step rewrite stalled after step 2. Per the README, step 3 should retrieve links, duplicate_tags, redirect_chains, non_indexable; step 4 should retrieve raw_html (and per the open question in README, decide between homepage-only and per-URL).

Each step:

  • Reads input from previous step’s JSONL output
  • Makes ONE DFS endpoint class of calls (no shared state)
  • Writes one JSONL output + bronze BQ rows
  • Times itself, reports stats, fails loud
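
A sketch of that contract with hypothetical names (run_step, fetch_one): one endpoint class per file, JSONL in and out, no shared state. The real steps also append bronze BQ rows, omitted here:

# Hypothetical skeleton for a rewrite step: read the previous step's JSONL,
# make exactly one class of DFS calls, write one JSONL output, fail loud.
import json
import sys
import time

def run_step(in_path: str, out_path: str, fetch_one) -> None:
    start = time.time()
    written = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            task = json.loads(line)
            payload = fetch_one(task)      # ONE endpoint class, no shared state
            fout.write(json.dumps({"task": task, "payload": payload}) + "\n")
            written += 1
    print(f"{out_path}: {written} rows in {time.time() - start:.1f}s")
    if written == 0:
        sys.exit(f"fail-loud: no rows written to {out_path}")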

Then ship run_all.py orchestrator. Then retire the monolith.

FIX 4 Wire up the 005b-g silver parsers to actually parse bronze. ~2 hr each, 6 of them

Today all six (005b-g) are stubs that exit 0. For each one:

  • Read the corresponding bronze rows for this run_id from bronze_005_onpage
  • Parse the payload_json field into typed columns
  • Write to enrichment_005{b,c,d,e,f,g}_* using shared.bq_enrichment_writer.write_enrichment_rows (same pattern as 005a/005j)

None of these are paid-API steps. They’re Python transforms over bronze JSON. The data is already there. The parsers just need to exist.
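
As an illustration, a 005b parser could be as small as the sketch below. The bronze column names (run_id, domain, endpoint, payload_json), the payload keys, and the direct insert_rows_json call are assumptions; the production version should go through shared.bq_enrichment_writer.write_enrichment_rows as noted above:

# Sketch of 005b: bronze /summary JSON -> enrichment_005b_onpage_summary rows.
# Column names, payload keys, and schema are assumed for illustration only.
import json
import sys
from google.cloud import bigquery

def parse_005b(run_id: str) -> int:
    client = bigquery.Client()
    query = """
        SELECT domain, payload_json
        FROM `amicus_pipeline.bronze_005_onpage`
        WHERE run_id = @run_id AND endpoint = '/summary'
    """
    job = client.query(query, job_config=bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("run_id", "STRING", run_id)]))

    rows = []
    for r in job:
        p = json.loads(r.payload_json)
        rows.append({
            "run_id": run_id,
            "domain": r.domain,
            "total_pages": p.get("pages_crawled"),                    # assumed payload keys
            "pages_with_broken_links": p.get("page_metrics", {}).get("broken_links"),
        })
    if not rows:
        sys.exit(f"fail-loud: no bronze /summary rows for run {run_id}")
    errors = client.insert_rows_json("amicus_pipeline.enrichment_005b_onpage_summary", rows)
    if errors:
        sys.exit(f"fail-loud: BQ insert errors: {errors}")
    return len(rows)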

FIX 5 Cap 005a’s max_crawl_pages dynamically. ~30 min

Today every firm gets max_crawl_pages=2000. For a 50-page solo practice site, that’s overkill. For a 5,000-page mega-firm, 2,000 is correctly capped.

Action: use the previous month’s 005b summary row (when Fix 4 lands) to pick a per-firm cap: min(2000, max(50, previous_month_total_pages * 1.2)). New firms with no prior data default to 200.

Should cut 005a cost by 30-50% on a mature firm list.
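
A small helper expressing that cap, assuming the previous month's total-pages count is available from 005b once Fix 4 lands (the function name is illustrative):

# Per-firm crawl cap from last month's 005b summary; new firms default to 200.
def crawl_page_cap(previous_month_total_pages: int | None) -> int:
    if previous_month_total_pages is None:
        return 200                      # no prior 005b data for this firm
    return int(min(2000, max(50, previous_month_total_pages * 1.2)))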

FIX 6 Add a per-fire cost ceiling. ~40 min

005a already writes a cost ledger. Action: after each batch of POSTs, compute cumulative dollars spent. If cumulative spend exceeds a per-fire ceiling (e.g. $5 for a market the size of atty_wa_seattle), stop submitting and finish the in-flight tasks only.

Prevents a single bad run from blowing the monthly DFS budget. Pairs with the existing tier-1 / tier-2 cost cap (sc=40203).
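
A minimal sketch of the ceiling check, assuming the ledger can be read as a list of per-task dollar amounts (the names and the $5 default are illustrative):

# Hypothetical per-fire ceiling check: after each batch of POSTs, sum the
# ledger; once over the ceiling, stop submitting and drain in-flight tasks.
def should_stop_submitting(cost_ledger: list[float], ceiling_usd: float = 5.0) -> bool:
    spent = sum(cost_ledger)
    if spent >= ceiling_usd:
        print(f"cost ceiling hit: ${spent:.2f} >= ${ceiling_usd:.2f}; "
              "stopping task_post, finishing in-flight tasks only")
        return True
    return False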

FIX 7 Add fail-loud verification at the end of every 005 sub-step. ~20 min per step

After each enrichment table write, assert:

  • Row count for this run_id > 0 (no successful run with 0 rows written)
  • For 005a: at least 70% of input domains end in DONE state (not TIMED_OUT, STUCK, FAILED)
  • For 005h Lighthouse: at least 80% of homepages returned a Performance score > 0
  • For 005j: at least 60% of firms got a non-unknown attorney bucket

If any assertion fails, exit non-zero with the specific shortfall. Same pattern as Phases 001-004 fail-loud fixes.
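
As an example, the 005a check could look like the sketch below, assuming the step can report its row count and terminal-state counts; the 70% threshold comes from the list above, everything else is illustrative:

# Illustrative fail-loud check for 005a: non-zero rows and >= 70% DONE.
import sys

def verify_005a(run_id: str, rows_written: int, state_counts: dict[str, int]) -> None:
    if rows_written == 0:
        sys.exit(f"005a FAIL: 0 rows written for run {run_id}")
    total = sum(state_counts.values())
    done_ratio = state_counts.get("DONE", 0) / total if total else 0.0
    if done_ratio < 0.70:
        sys.exit(f"005a FAIL: only {done_ratio:.0%} of domains reached DONE "
                 f"(TIMED_OUT={state_counts.get('TIMED_OUT', 0)}, "
                 f"STUCK={state_counts.get('STUCK', 0)}, "
                 f"FAILED={state_counts.get('FAILED', 0)})")
    print(f"005a OK: {rows_written} rows, {done_ratio:.0%} DONE")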

After all 7 fixes

The monthly cron fires Phase 005 once per month against a fresh gold list. 005a runs as the new four-step decomposition, no shared state, easier to debug. 005b-g parsers actually write silver-layer enrichment rows. 005h, 005i, 005j run alongside without contention. Per-firm max_crawl_pages is tuned to actual site size. A run that’s about to blow the budget stops itself. Bad runs fail loud instead of silently producing empty tables.

Then we move on to Phase 006 (Domain Intel — WHOIS + tech stack).