PHASE 005

OnPage

Steps 005a → 005j · Monthly tier · First paid enrichment · Output: bronze + silver tables for every firm’s website

The Spec

How Phase 005 is supposed to behave.

Duty: For every confirmed firm in gold_domains, deeply crawl their entire website (up to 2,000 pages each), audit it through three different lenses (DFS OnPage, DFS Lighthouse, Google PageSpeed Insights), and count the attorneys listed on the team page. Land everything in BigQuery so downstream analytics can query SEO-grade signal per firm.

Schedule: Monthly tier — designed to fire once per 30 days.

End state: ~2,247 V_CONFIRMED firms × ~155 pages avg = ~348,000 page-rows in bronze tables; one row per firm in 8 enrichment tables; one attorney-count estimate per firm.

What Phase 005 does, plain English

Phases 001-004 produced a clean list of confirmed PI law firms with primary specialties. Phase 005 is where we actually look at their websites. For each firm we want to know:

  • How many pages does the site have? How fast does each page load? Which pages return 404s or redirect chains?
  • What does Google Lighthouse score it for performance, accessibility, SEO, and best practices?
  • What do real Chrome users see — Core Web Vitals from Google’s field-data dataset?
  • How many attorneys actually work there, based on their “Our Team” / “Attorneys” page?

This is the first enrichment phase in the pipeline (Phases 001-004 are intake, Phases 005-010 are enrichment). It’s also the first phase that costs real money — the OnPage deep crawl alone burns ~$2-3 per fire.

The 10 sub-steps

Step | What it does | Provider | Output table
005a | DFS OnPage deep crawl — up to 2,000 pages per domain, 10 endpoints per crawl (summary, pages, links, dup_tags, dup_content, redirects, non_indexable, waterfall, microdata, raw_html). The big paid step. | DataForSEO OnPage | bronze_005_onpage, bronze_005_perpage, bronze_005_raw_html, enrichment_005a_deep_crawl
005b | Parse /summary bronze into per-domain stats (total pages, status code distribution, indexing issues). | n/a | enrichment_005b_onpage_summary
005c | Parse /pages bronze into per-URL detail (title, meta description, H1, word count). | n/a | enrichment_005c_pages
005d | Parse /links bronze into internal & external link graph. | n/a | enrichment_005d_links
005e | Parse /duplicate_tags bronze — pages sharing <title> or <meta description>. | n/a | enrichment_005e_duplicate_tags
005f | Parse /redirect_chains bronze — multi-hop redirects and redirect loops. | n/a | enrichment_005f_redirect_chains
005g | Parse /non_indexable bronze — pages blocked from Google (noindex, robots disallow). | n/a | enrichment_005g_non_indexable
005h | DFS Lighthouse audit per homepage — Performance, Accessibility, SEO, Best Practices scores + Core Web Vitals. | DataForSEO Lighthouse | enrichment_005h_lighthouse
005i | Google PageSpeed Insights — real-user Core Web Vitals (LCP / CLS / INP) from Chrome field data. | Google PSI | enrichment_005i_pagespeed
005j | Haiku attorney-count estimator — reads the firm’s “Our Team” page from 005a’s raw_html, classifies into a bucket: solo · 2 · 3_to_5 · 6_to_10 · 11_plus · unknown. | Anthropic Haiku | enrichment_005j_attorney_count

005a is the load-bearing step

005a (the deep crawl) is the source of truth for every step b through g. Those steps are parsers — they read 005a’s bronze tables and decompose them into silver-layer tables. If 005a doesn’t run, the rest of the phase has nothing to parse.

005a fires up to 10 distinct DFS endpoint classes per domain:

Endpoint | Grain | Lands in
/summary | per-domain | bronze_005_onpage
/pages | per-domain (paginated) | bronze_005_onpage
/links | per-domain (paginated) | bronze_005_onpage
/duplicate_tags | per-domain | bronze_005_onpage
/redirect_chains | per-domain | bronze_005_onpage
/non_indexable | per-domain | bronze_005_onpage
/waterfall | per-URL (every crawled page) | bronze_005_perpage
/duplicate_content | per-URL | bronze_005_perpage
/microdata | per-URL | bronze_005_perpage
/raw_html | per-URL (every crawled page) | bronze_005_perpage + legacy homepage-only bronze_005_raw_html
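
A minimal sketch of the POST → poll → retrieve loop 005a runs per domain, using the public DataForSEO v3 OnPage endpoints and HTTP Basic auth with DFS_USERNAME / DFS_PASSWORD. The endpoint paths and response fields follow the DFS docs; everything else (function names, retrieving only /pages) is illustrative and not lifted from step_005a_onpage_crawl.py:

# Hedged sketch of 005a's per-domain flow: submit a crawl task, poll the
# summary until the crawl finishes, then pull one retrieval endpoint.
# The real monolith layers pacing, retries, and bronze writes on top.
import os
import time
import requests

AUTH = (os.environ["DFS_USERNAME"], os.environ["DFS_PASSWORD"])
BASE = "https://api.dataforseo.com/v3/on_page"

def crawl_domain(domain: str, max_pages: int = 2000) -> dict:
    # 1. task_post — the only retrieval-adjacent endpoint that allows task
    #    stacking; the pipeline still submits a single task per POST.
    post = requests.post(f"{BASE}/task_post", auth=AUTH,
                         json=[{"target": domain, "max_crawl_pages": max_pages}])
    task_id = post.json()["tasks"][0]["id"]

    # 2. Poll /summary until crawl_progress reports "finished".
    while True:
        summary = requests.get(f"{BASE}/summary/{task_id}", auth=AUTH).json()
        result = summary["tasks"][0]["result"][0]
        if result.get("crawl_progress") == "finished":
            break
        time.sleep(30)  # spec: 30-second poll interval

    # 3. Retrieve one endpoint class (here /pages); the monolith repeats this
    #    for links, duplicate_tags, redirect_chains, non_indexable, etc.
    pages = requests.post(f"{BASE}/pages", auth=AUTH,
                          json=[{"id": task_id, "limit": 1000}])
    return {"summary": result, "pages": pages.json()}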

DFS OnPage call volume & rate limits

The OnPage retrieval endpoints (everything except task_post) do not support task stacking. From DataForSEO docs:

“All other endpoints of OnPage API do not recommend sending several tasks in one POST call as it may result in system overload and 4xx or 5xx errors.”

Concretely, per the validated April 7 baseline run:

Constraint | Value
Account-wide rate cap | 2,000 calls/min
Pipeline pacing (85%) | 1,700 RPM — strict 35.3 ms between any two acquires
Max simultaneous in-flight | 30 requests
Tasks per HTTP POST | 1 (only task_post, instant_pages, page_screenshot allow stacking)
Max crawl pages per domain | 2,000
Per-task timeout | 2 hours hard cap
Global run timeout | 8 hours hard cap
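
A minimal sketch of that pacing policy, assuming an asyncio worker pool: one lock enforces the 35.3 ms inter-acquire spacing for the 1,700 RPM budget, and a 30-slot semaphore caps in-flight requests. Class and function names here are illustrative, not the monolith's:

# Illustrative pacing: 1,700 requests/min means one acquire every ~35.3 ms,
# with never more than 30 requests in flight at once.
import asyncio
import time

RPM = 1700
MIN_INTERVAL = 60.0 / RPM          # ~0.0353 s between any two acquires
MAX_IN_FLIGHT = 30

class RateGate:
    def __init__(self):
        self._lock = asyncio.Lock()
        self._sem = asyncio.Semaphore(MAX_IN_FLIGHT)
        self._last = 0.0

    async def __aenter__(self):
        await self._sem.acquire()               # cap concurrent requests
        async with self._lock:                  # serialize the pacing check
            wait = self._last + MIN_INTERVAL - time.monotonic()
            if wait > 0:
                await asyncio.sleep(wait)
            self._last = time.monotonic()
        return self

    async def __aexit__(self, *exc):
        self._sem.release()

async def fetch_endpoint(gate: RateGate, call):
    async with gate:
        return await call()                     # the actual DFS HTTP call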

How the data moves

Input · Phase 004 deliverable
gold_domains
Filter: final_verdict in {V_CONFIRMED, V_LIKELY}
+ is_domain_primary = TRUE

005a · DFS OnPage deep crawl
step_005a_onpage_crawl.py
For each firm: POST task → poll → retrieve 10 endpoints
Up to 2,000 pages per domain
Writes bronze_005_onpage + bronze_005_perpage

005b → 005g · per-endpoint parsers
step_005[b-g]_*.py
Read bronze, write enrichment_005{b,c,d,e,f,g}_* silver tables
(currently empty — see forensic)

005h · DFS Lighthouse
step_005h_lighthouse_dfs.py
Per-homepage Lighthouse audit
Performance / A11y / SEO / Best Practices scores

005i · Google PSI
step_005i_pagespeed_google.py
CrUX field data — real-user Web Vitals
(free, Google API)

005j · attorney count
step_005j_attorney_count.py
Reads raw_html for “Our Team” page
Haiku classifies into bucket: solo, 2, 3-5, 6-10, 11+, unknown

Outputs · BigQuery
bronze_005_onpage · bronze_005_perpage · bronze_005_raw_html
+ enrichment_005a → enrichment_005j
Every firm has bronze payloads for 10 endpoints,
plus silver enrichment rows per parser
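
For 005j specifically, the classification call might look like the sketch below, assuming the Anthropic Python SDK. The prompt wording, model string, and bucket parsing are illustrative assumptions, not code from step_005j_attorney_count.py:

# Illustrative 005j call: feed the "Our Team" page text to Haiku and ask
# for one of the six buckets the spec defines.
import os
import anthropic

BUCKETS = {"solo", "2", "3_to_5", "6_to_10", "11_plus", "unknown"}
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY_B_SERIES"])

def classify_attorney_count(team_page_text: str) -> str:
    msg = client.messages.create(
        model="claude-3-haiku-20240307",   # assumed Haiku model string
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": (
                "Count the attorneys listed on this law-firm team page and "
                "answer with exactly one of: solo, 2, 3_to_5, 6_to_10, "
                "11_plus, unknown.\n\n" + team_page_text[:30000]
            ),
        }],
    )
    answer = msg.content[0].text.strip().lower()
    return answer if answer in BUCKETS else "unknown"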

Where to look — file & table reference

Thing | Path or table
The scripts | /mnt/workspace/amicus/pipeline/steps/005_onpage/step_005*.py
OnPage rewrite (partial) | step1_summary.py · step2_pages.py (shipped 2026-05-06) — meant to replace the 005a monolith
Old monolith (still in use) | step_005a_onpage_crawl.py (1,967 lines)
Rewrite plan / status | pipeline/steps/005_onpage/README.md
April 7 baseline run report | pipeline/steps/utilities/documentation/first_production_run_report.md
Bronze tables | amicus_pipeline.bronze_005_onpage · bronze_005_perpage · bronze_005_raw_html
Enrichment tables | amicus_pipeline.enrichment_005a_deep_crawl · enrichment_005b..g_* · enrichment_005h_lighthouse · enrichment_005i_pagespeed · enrichment_005j_attorney_count
API keys | DFS_USERNAME · DFS_PASSWORD · PAGESPEED_API_KEY · ANTHROPIC_API_KEY_B_SERIES in .env

Cost per fire

First phase that costs real money. 005a is dominant; 005h is secondary; 005j is small; 005b-g + 005i are free.

Line item | Volume | Per unit | Subtotal
005a — DFS OnPage Tier 1 ($0.000125/page, ~155 pages/domain avg, 2,247 V_CONFIRMED firms) | ~348,000 pages | $0.000125 | ~$43
005b → 005g — silver parsers (Python only, no external API) | n/a | n/a | $0.00
005h — DFS Lighthouse audits (one per homepage) | ~2,247 | ~$0.005 | ~$11
005i — Google PSI (free tier, 25,000 queries/day) | ~2,247 | $0 | $0.00
005j — Haiku attorney count (~8K input + ~100 output tokens per call) | ~2,247 | ~$0.009 | ~$20
Total per monthly fire | | | ~$74
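
The subtotals are plain arithmetic over the stated assumptions; a quick sanity check:

# Back-of-envelope check of the per-fire cost using the spec's assumptions.
firms = 2247                           # V_CONFIRMED firms
pages = firms * 155                    # ~348,285 crawled pages
onpage = pages * 0.000125              # ~$43.5  DFS OnPage Tier 1
lighthouse = firms * 0.005             # ~$11.2  one audit per homepage
haiku = firms * 0.009                  # ~$20.2  attorney-count calls
print(round(onpage + lighthouse + haiku, 2))
# ~75 before rounding; the table's rounded subtotals sum to ~$74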

Schedule

Frequency
Monthly tier — designed to fire once per 30 days.
Trigger
Orchestrator cron with --cadence monthly.
Execution mode
005a runs first (it’s upstream of b-g). 005h, 005i, 005j can parallelize with each other.
Concurrency
005a: 1,700 RPM strict, 30 in-flight, 6 extraction workers, 4-round retry. 005j: 40 Haiku concurrent.
Crash recovery
--resume-from submissions.jsonl re-attaches to in-flight tasks. --tier2 escalates tier-1-failed tasks to tier-2 pricing.
Output guarantee
Every input domain ends in a terminal state. Bronze tables get one row per (domain, endpoint).
What's Fucked

Phase 005 is not running the spec. Here’s exactly how.

Finding 1 — The “monthly” cron expression isn’t monthly.

The single orchestrator crontab entry on the VM is 30 10 * * 1,4 — that fires every Monday and Thursday at 10:30 UTC, NOT once per month. It runs the orchestrator with the monthly/weekly/biweekly tiers together.

Consequence: if 005 is tagged monthly in cadence.py, it fires on every Mon + Thu. That’s ~8-9 fires per month, not 1. At ~$74 per fire (2,247 V_CONFIRMED firms × the cost spec above), the monthly tier alone is bleeding ~$590-665/month instead of ~$74.

(See Monthly cron schedule for the broader cron-vs-spec finding.)

Finding 2 — Input is stale: gold_domains hasn’t been refreshed since Phase 004 last produced rows.

Phase 005 reads gold_domains filtered to final_verdict in {V_CONFIRMED, V_LIKELY}. Phase 004 hasn’t written fresh gold rows since whenever PIPELINE_RUN_ID was last set in a Phase 004 run (see Phase 004 forensic Finding 2). Every Mon + Thu fire of 005 is therefore operating on a frozen firm list, paying ~$3 to re-crawl the same domains.

Finding 3 — 005a is mid-rewrite. Old monolith + half-done replacement coexist.

Per 005_onpage/README.md: as of 2026-05-06 the team is replacing the 1,967-line step_005a_onpage_crawl.py with a four-step decomposition (step1_summary.py → step4_perpage.py). Status:

File | Purpose | Status
step1_summary.py | Phase 2 — summary retrieval | shipped 2026-05-06
step2_pages.py | Phase 3 — /pages retrieval | shipped 2026-05-06
step3_independent.py | Phase 4 — links / dup_tags / redirects / non_indexable | TODO
step4_perpage.py | Phase 5 — raw_html per URL | TODO
run_all.py | Sequential orchestrator | TODO
step_005a_onpage_crawl.py | OLD monolith | kept until new flow validates end-to-end

Two execution paths exist on the VM. Until the rewrite finishes, the monolith is what fires — with all the shared-state bugs the rewrite was meant to fix.

Finding 4 — The silver parsers (005b-g) have never written rows.

Per the April 7 baseline run summary in README.md:

“Empty tables (parsers wired but never wrote): enrichment_005b through enrichment_005g. The bronze JSON payloads ARE preserved; the silver-layer parsing has never successfully run end-to-end.”

The bronze data exists. The parsers exist. They’ve never connected. Today, step_005b_onpage_summary.py contains:

print("step_005b_onpage_summary: endpoint retrieval currently handled by step_005a_onpage_crawl.py")
print("This script will be the standalone entry point after BQ-first refactor.")
sys.exit(0)

Yes, that’s the entire body of 005b. It’s a stub. The other parsers (005c-g) are the same: none of them actually parses bronze into its enrichment_005{c-g} table in production.

Finding 5 — The fire-and-forget shared-state bugs the rewrite was meant to fix are still in the monolith.

Direct quote from README.md about the monolith:

“Every fix this past month touched shared state, exposed a new bug, repeat. By 2026-05-06 we’d burned 17 days fighting six interlocking bugs (token-bucket miscounts, semaphore stacking, OOM accumulator, --resume CLI mismatch, shared output dirs, /pages-with-non-2xx URLs).”

Many were patched in place. None were structurally eliminated — the architecture that causes them (one file doing seven jobs) is still what runs every Mon + Thu.

Finding 6 — Per-task and global timeouts are generous.

Defaults baked into step_005a_onpage_crawl.py:

Bound | Value | What happens at limit
Per-task timeout | 2 hours | marked TIMED_OUT
Per-task stall (no page-count progress) | 60 min | marked STUCK
Global run timeout | 8 hours | remaining tasks marked timed_out
Poll interval | 30 sec | n/a

An 8-hour global run window opens every Mon + Thu. If 005 ever starts a real intake-driven run while the cron is also firing the monthly tier, two 8-hour windows could overlap, doubling the DFS load and tripling cost. Today this doesn’t happen because intake never fires — but if Phase 001 Fix 3 lands and the cron-overlap issue isn’t addressed, this becomes live.

The bottom line

Where Phase 005 Stands Today

005a (the monolith) fires every Mon + Thu against a stale gold list: 8-9 fires a month at ~$74 each (2,247 V_CONFIRMED firms), or ~$590-665/month for data no downstream silver layer consumes. The bronze tables get fresh rows (re-crawls of the same firms), but enrichment_005a_deep_crawl currently shows 0 rows for the latest atty production runs in BQ — the enrichment write path appears broken. 005b-g enrichment tables stay empty (the parsers are stubs). 005h, 005i, and 005j run alongside 005a. The rewrite is half-done and the monolith is still authoritative.

The Fix

What we’ll do to make Phase 005 match the spec.

Seven fixes. The first two are pure cost-recovery and don’t require any new code. The rest finish the rewrite and land the silver parsers.

FIX 1 Split the cron — monthly fires once per month, not every Mon + Thu. ~20 min

The single 30 10 * * 1,4 entry conflates three cadences. Split into:

  • 30 10 * * 1,4 ... --cadence biweekly — Mon + Thu (009 bulk endpoints, etc.)
  • 30 10 * * 1 ... --cadence weekly — Mondays only (008c/d, 010b, etc.)
  • 30 10 1 * * ... --cadence monthly — 1st of month only (005, 006, 008a/b, 009a/b, 010a/c/h)

After this fix, 005 fires once per month. Saves ~$520-590/month immediately (7-8 redundant fires × ~$74 each; the one legitimate monthly fire still happens).

FIX 2 Wait for Phase 004 Fix 1 — fresh gold before re-crawling. no work — gated on Phase 004 Fix 1

Until gold_domains gets fresh rows from a working Phase 004 cycle, re-crawling is just paying to re-crawl the same firms. Hold the next 005 fire until Phase 004 Fix 1 has landed and a verified intake cycle has produced new gold.

Verify before firing: SELECT MAX(ingestion_timestamp) FROM amicus_pipeline.gold_domains — should be within the last 30 days.

FIX 3 Finish the rewrite — ship step3_independent.py and step4_perpage.py. ~1 day each

The four-step rewrite stalled after step 2. Per the README, step 3 should retrieve links, duplicate_tags, redirect_chains, non_indexable; step 4 should retrieve raw_html (and per the open question in README, decide between homepage-only and per-URL).

Each step:

  • Reads input from previous step’s JSONL output
  • Makes ONE DFS endpoint class of calls (no shared state)
  • Writes one JSONL output + bronze BQ rows
  • Times itself, reports stats, fails loud
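
A sketch of that contract with hypothetical names (run_step, fetch_one): one endpoint class per file, JSONL in and out, no shared state. The real steps also append bronze BQ rows, omitted here:

# Hypothetical skeleton for a rewrite step: read the previous step's JSONL,
# make exactly one class of DFS calls, write one JSONL output, fail loud.
import json
import sys
import time

def run_step(in_path: str, out_path: str, fetch_one) -> None:
    start = time.time()
    written = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            task = json.loads(line)
            payload = fetch_one(task)      # ONE endpoint class, no shared state
            fout.write(json.dumps({"task": task, "payload": payload}) + "\n")
            written += 1
    print(f"{out_path}: {written} rows in {time.time() - start:.1f}s")
    if written == 0:
        sys.exit(f"fail-loud: no rows written to {out_path}")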

Then ship run_all.py orchestrator. Then retire the monolith.

FIX 4 Wire up the 005b-g silver parsers to actually parse bronze. ~2 hr each, 6 of them

Today all six (005b-g) are stubs that exit 0. For each one:

  • Read the corresponding bronze rows for this run_id from bronze_005_onpage
  • Parse the payload_json field into typed columns
  • Write to enrichment_005{b,c,d,e,f,g}_* using shared.bq_enrichment_writer.write_enrichment_rows (same pattern as 005a/005j)

None of these are paid-API steps. They’re Python transforms over bronze JSON. The data is already there. The parsers just need to exist.
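
As an illustration, a 005b parser could be as small as the sketch below. The bronze column names (run_id, domain, endpoint, payload_json), the payload keys, and the direct insert_rows_json call are assumptions; the production version should go through shared.bq_enrichment_writer.write_enrichment_rows as noted above:

# Sketch of 005b: bronze /summary JSON -> enrichment_005b_onpage_summary rows.
# Column names, payload keys, and schema are assumed for illustration only.
import json
import sys
from google.cloud import bigquery

def parse_005b(run_id: str) -> int:
    client = bigquery.Client()
    query = """
        SELECT domain, payload_json
        FROM `amicus_pipeline.bronze_005_onpage`
        WHERE run_id = @run_id AND endpoint = '/summary'
    """
    job = client.query(query, job_config=bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("run_id", "STRING", run_id)]))

    rows = []
    for r in job:
        p = json.loads(r.payload_json)
        rows.append({
            "run_id": run_id,
            "domain": r.domain,
            "total_pages": p.get("pages_crawled"),                    # assumed payload keys
            "pages_with_broken_links": p.get("page_metrics", {}).get("broken_links"),
        })
    if not rows:
        sys.exit(f"fail-loud: no bronze /summary rows for run {run_id}")
    errors = client.insert_rows_json("amicus_pipeline.enrichment_005b_onpage_summary", rows)
    if errors:
        sys.exit(f"fail-loud: BQ insert errors: {errors}")
    return len(rows)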

FIX 5 Cap 005a’s max_crawl_pages dynamically. ~30 min

Today every firm gets max_crawl_pages=2000. For a 50-page solo practice site, that’s overkill. For a 5,000-page mega-firm, 2,000 is correctly capped.

Action: use the previous month’s 005b summary row (when Fix 4 lands) to pick a per-firm cap: min(2000, max(50, previous_month_total_pages * 1.2)). New firms with no prior data default to 200.

Should cut 005a cost by 30-50% on a mature firm list.
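
A small helper expressing that cap, assuming the previous month's total-pages count is available from 005b once Fix 4 lands (the function name is illustrative):

# Per-firm crawl cap from last month's 005b summary; new firms default to 200.
def crawl_page_cap(previous_month_total_pages: int | None) -> int:
    if previous_month_total_pages is None:
        return 200                      # no prior 005b data for this firm
    return int(min(2000, max(50, previous_month_total_pages * 1.2)))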

FIX 6 Add a per-fire cost ceiling. ~40 min

005a already writes a cost ledger. Action: after each batch of POSTs, compute cumulative dollars spent. If cumulative spend exceeds a per-fire ceiling (e.g. $5 for a market the size of atty_wa_seattle), stop submitting and finish the in-flight tasks only.

Prevents a single bad run from blowing the monthly DFS budget. Pairs with the existing tier-1 / tier-2 cost cap (sc=40203).
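
A minimal sketch of the ceiling check, assuming the ledger can be read as a list of per-task dollar amounts (the names and the $5 default are illustrative):

# Hypothetical per-fire ceiling check: after each batch of POSTs, sum the
# ledger; once over the ceiling, stop submitting and drain in-flight tasks.
def should_stop_submitting(cost_ledger: list[float], ceiling_usd: float = 5.0) -> bool:
    spent = sum(cost_ledger)
    if spent >= ceiling_usd:
        print(f"cost ceiling hit: ${spent:.2f} >= ${ceiling_usd:.2f}; "
              "stopping task_post, finishing in-flight tasks only")
        return True
    return False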

FIX 7 Add fail-loud verification at the end of every 005 sub-step. ~20 min per step

After each enrichment table write, assert:

  • Row count for this run_id > 0 (no successful run with 0 rows written)
  • For 005a: at least 70% of input domains end in DONE state (not TIMED_OUT, STUCK, FAILED)
  • For 005h Lighthouse: at least 80% of homepages returned a Performance score > 0
  • For 005j: at least 60% of firms got a non-unknown attorney bucket

If any assertion fails, exit non-zero with the specific shortfall. Same pattern as Phases 001-004 fail-loud fixes.
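
As an example, the 005a check could look like the sketch below, assuming the step can report its row count and terminal-state counts; the 70% threshold comes from the list above, everything else is illustrative:

# Illustrative fail-loud check for 005a: non-zero rows and >= 70% DONE.
import sys

def verify_005a(run_id: str, rows_written: int, state_counts: dict[str, int]) -> None:
    if rows_written == 0:
        sys.exit(f"005a FAIL: 0 rows written for run {run_id}")
    total = sum(state_counts.values())
    done_ratio = state_counts.get("DONE", 0) / total if total else 0.0
    if done_ratio < 0.70:
        sys.exit(f"005a FAIL: only {done_ratio:.0%} of domains reached DONE "
                 f"(TIMED_OUT={state_counts.get('TIMED_OUT', 0)}, "
                 f"STUCK={state_counts.get('STUCK', 0)}, "
                 f"FAILED={state_counts.get('FAILED', 0)})")
    print(f"005a OK: {rows_written} rows, {done_ratio:.0%} DONE")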

After all 7 fixes

The monthly cron fires Phase 005 once per month against a fresh gold list. 005a runs as the new four-step decomposition, no shared state, easier to debug. 005b-g parsers actually write silver-layer enrichment rows. 005h, 005i, 005j run alongside without contention. Per-firm max_crawl_pages is tuned to actual site size. A run that’s about to blow the budget stops itself. Bad runs fail loud instead of silently producing empty tables.

Then we move on to Phase 006 (Domain Intel — WHOIS + tech stack).