pipeline.amicusdata.dev — Phase 008: SERP Analysis

PHASE 008

SERP Analysis

Steps 008a → 008e · Mixed monthly + weekly · Output: who outranks each firm on Maps & Search

The Spec

How Phase 008 is supposed to behave.

Phase 008 — SERP Analysis. Design intent. Reference, not reality.

Duty: For every confirmed firm, find out who else shows up when a real person searches Google for the firm’s practice area in the firm’s neighborhood. Five different SERP cuts: 20-point Maps grid around each firm (008a), Google organic (008b), Google Local Finder (008c), autocomplete suggestions (008d), and a uniform geofence-wide carpet scan (008e).

Schedule: Mixed — 008a + 008b are monthly (expensive deep cuts). 008c + 008d are weekly (cheap fresh-data cuts). 008e is a separate carpet-scan job (geofence-wide, not per-firm).

End state: Five BigQuery enrichment tables populated with fresh rows per firm per run. Downstream heatmap and competitor-set analytics read from these.

What Phase 008 does, plain English

So far the pipeline has only looked at each firm in isolation — its own website (Phase 005), its own domain (Phase 006), its own GBP listing (Phase 007). Phase 008 zooms out and asks: when someone in Seattle searches “personal injury lawyer,” who actually shows up?

That’s a different question than “is this firm well-built” — it’s a competitor-set question. A firm might have great Lighthouse scores and rich GBP reviews but still get buried in search results because three bigger firms always rank above them. Phase 008 finds out who’s above them, in which neighborhoods, for which queries.

Five sub-steps cover five different SERP “flavors”:

008a — Maps SERP from a 20-point grid around each firm. Drives local-dominator heatmaps.
008b — Google organic SERP (the “10 blue links”) at the firm’s lat/lon for its search keyword.
008c — Google Local Finder (the “more places” view) at depth 20.
008d — Google Autocomplete suggestions for each unique search_keyword in the gold set.
008e — Geofence carpet scan — uniform 1.3-mile grid across the entire market with 8 fixed practice-area keywords. Catches firms outside the per-CID grids.

The 5 sub-steps

Step	What it does	Tier	Runs per	Cost	Output
`008a`	Maps SERP 20-point circle grid (1 center + 6 inner hex + 13 outer ring, 4mi outer radius). Filters to law-related GBP categories.	monthly	per CID × 20 grid points	$0.0006/task	`enrichment_008a_mapserp_competitors`
`008b`	Google Organic SERP at firm’s lat/lon for its search_keyword, depth=20 (two SERP pages).	monthly	per CID	$0.0012/task	`enrichment_008b_serp_organic`
`008c`	Google Local Finder SERP — expanded “more places” view, depth=20.	weekly	per CID	$0.0006/task	`enrichment_008c_serp_local_finder`
`008d`	Google Autocomplete suggestions for each unique search_keyword in the gold set. Geographically pinned via location_code.	weekly	per unique keyword (NOT per CID)	$0.0006/task	`enrichment_008d_serp_autocomplete`
`008e`	Uniform 1.3-mile grid across geofence polygon × 8 fixed practice-area keywords. Reads a KMZ file, not gold JSON.	(separate)	per grid point × keyword	$0.0006/task	`enrichment_008e_*` (TBD)

008a vs 008e — two very different grid strategies

Both 008a and 008e use the Maps SERP endpoint, but with completely different scan patterns:

Aspect	008a circle grid	008e carpet scan
Anchor	Each ATTORNEY_CONFIRMED CID’s lat/lon	The market’s geofence polygon (KMZ file)
Pattern	20 points in a circle: 1 center + 6 inner hex (2mi radius) + 13 outer ring (4mi)	Uniform ~1.3mi square-grid across the polygon
Keyword	The firm’s own `search_keyword` (from gold)	8 fixed practice-area keywords
Volume for atty_wa_seattle	2,772 CIDs × 20 = 55,440 tasks	~1,551 points × 8 = 12,408 tasks
Cost per fire	~$33.26	~$7.44
Use case	Local-dominator heatmap centered on each firm	Find firms missing from gold; uniform market view
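The 008a pattern can be sketched as follows. This is a minimal illustration, not the code in step_008a_mapserp_competitors.py: the function name circle_grid, the starting bearing (due north), the even angular spacing, and the flat-earth offset approximation are all assumptions. The authoritative layout is whatever 008a_grid_spec.md says.

```python
import math

MILES_PER_DEG_LAT = 69.0  # approximation; adequate for a 4-mile radius


def circle_grid(lat, lon, inner_mi=2.0, outer_mi=4.0, inner_n=6, outer_n=13):
    """Return [(lat, lon), ...]: 1 center + inner_n inner points + outer_n outer points."""
    points = [(lat, lon)]
    # A degree of longitude shrinks with latitude, so scale it by cos(lat).
    miles_per_deg_lon = MILES_PER_DEG_LAT * math.cos(math.radians(lat))
    for radius_mi, count in ((inner_mi, inner_n), (outer_mi, outer_n)):
        for i in range(count):
            bearing = 2 * math.pi * i / count  # 0 = due north, increasing clockwise
            dlat = radius_mi * math.cos(bearing) / MILES_PER_DEG_LAT
            dlon = radius_mi * math.sin(bearing) / miles_per_deg_lon
            points.append((lat + dlat, lon + dlon))
    return points


grid = circle_grid(47.6062, -122.3321)  # illustrative Seattle coordinates
print(len(grid), "grid points")
```

Each returned point would become one Maps SERP task for the firm's search_keyword, which is where the 20-tasks-per-CID multiplier in the volume row comes from.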

How the data moves

Input · Phase 004 deliverable

gold_cids

ATTORNEY_CONFIRMED CIDs
with search_keyword + lat/lon

Input · 008e only

geofence KMZ

market polygon
(not from gold)

↓

008a · circle grid (monthly)

step_008a_mapserp_competitors.py

20 points × ~2,772 CIDs = 55,440 calls
Filters competitors to LAW_CATEGORIES set

008b · organic SERP (monthly)

step_008b_serp_organic.py

One call per CID, depth=20
Advanced result types (~50 SERP features)

008c · Local Finder (weekly)

step_008c_serp_local_finder.py

One call per CID, depth=20
Expanded local-pack listings

008d · autocomplete (weekly)

step_008d_serp_autocomplete.py

One call per unique search_keyword
Tiny — ~20 calls for atty_wa_seattle

008e · carpet scan

step_008e_mapserp_grid_drape.py

~1,551 points × 8 keywords
Uniform geofence sweep

↓

Output · 5 enrichment tables

enrichment_008{a,b,c,d,e}_*

Heatmaps + competitor sets per firm
+ market-wide carpet view

Where to look — file & table reference

Thing	Path or table
The 5 scripts	`/mnt/workspace/amicus/pipeline/steps/008_serp_analysis/step_008[abcde]_*.py`
008a grid spec	`pipeline/steps/008_serp_analysis/008a_grid_spec.md`
008e geofence KMZ files	`pipeline/steps/008_serp_analysis/*.kmz` (per-market polygons)
Heatmap viewer	`pipeline/steps/008_serp_analysis/grid_drape_viewer.html`
BQ output tables	`enrichment_008a_mapserp_competitors` · `enrichment_008b_serp_organic` · `enrichment_008c_serp_local_finder` · `enrichment_008d_serp_autocomplete` · `enrichment_008e_*`
Async client	`pipeline/steps/shared/dfs_async_client.py`
Per-step logs	`pipeline/steps/000_log_files/step_008_.log`

Cost per fire

Anchored on 2,772 ATTORNEY_CONFIRMED CIDs (queried 2026-05-16). Two distinct fire types: monthly fires (008a + 008b) are big, weekly fires (008c + 008d) are cheap. 008e is a separate ad-hoc job.

Line item	Volume	Per unit	Subtotal
008a — Maps SERP circle grid (20 pts × 2,772 CIDs)	~55,440 tasks	$0.0006	~$33.26
008b — Organic SERP (depth=20)	~2,772 tasks	$0.0012	~$3.33
Monthly fire (008a + 008b)			~$37
008c — Local Finder (depth=20)	~2,772 tasks	$0.0006	~$1.66
008d — Autocomplete (per unique keyword)	~20 tasks	$0.0006	~$0.01
Weekly fire (008c + 008d)			~$1.67
008e — Carpet scan (1,551 points × 8 keywords)	~12,408 tasks	$0.0006	~$7.44
Monthly target total (assuming 008e fires monthly)			~$50

Volume anchors: 2,772 ATTORNEY_CONFIRMED CIDs (queried from enrichment_007a_gbp_info); 008e carpet-scan grid point count from the script header (~1,551 grid points × 8 keywords); cost-per-task constants from each script’s COST_PER_TASK.
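The table's arithmetic can be reproduced with a few lines. This is a sketch that hard-codes the volumes and per-task prices quoted above rather than reading each script's COST_PER_TASK. The figure of four weekly fires per month is an assumption (the source does not say how the ~$50 total was composed), chosen because it lands close to that total.

```python
CID_COUNT = 2_772        # ATTORNEY_CONFIRMED CIDs, per the anchor above
GRID_POINTS_008A = 20
CARPET_POINTS = 1_551    # approximate, from the 008e script header
CARPET_KEYWORDS = 8
UNIQUE_KEYWORDS = 20     # approximate

# (task count, cost per task in USD) for each sub-step
STEPS = {
    "008a": (CID_COUNT * GRID_POINTS_008A, 0.0006),
    "008b": (CID_COUNT, 0.0012),
    "008c": (CID_COUNT, 0.0006),
    "008d": (UNIQUE_KEYWORDS, 0.0006),
    "008e": (CARPET_POINTS * CARPET_KEYWORDS, 0.0006),
}

subtotal = {step: tasks * unit for step, (tasks, unit) in STEPS.items()}
monthly_fire = subtotal["008a"] + subtotal["008b"]
weekly_fire = subtotal["008c"] + subtotal["008d"]

WEEKLY_FIRES_PER_MONTH = 4  # assumption, not stated in the source
monthly_total = monthly_fire + subtotal["008e"] + WEEKLY_FIRES_PER_MONTH * weekly_fire

for step, usd in subtotal.items():
    print(f"{step}: {STEPS[step][0]:>6,} tasks  ${usd:.2f}")
print(f"monthly fire ${monthly_fire:.2f} | weekly fire ${weekly_fire:.2f} | month ${monthly_total:.2f}")
```

Summed before rounding, the weekly fire comes to about $1.68; the table's ~$1.67 adds the already-rounded subtotals. The difference is immaterial at this scale.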

Schedule

Monthly (008a + 008b)

Fires once per 30 days. Deep cuts: full circle grid + organic SERP per firm.

Weekly (008c + 008d)

Fires every Monday. Cheap fresh-data cuts — alongside Phase 007.

Concurrency

DFSAsyncClient, 50 in-flight, 2,000 RPM cap (008 doesn’t throttle below cap like 005/007).

Crash recovery

--resume flag for all 4 of 008a-008d.

008e cadence

Not yet on a recurring schedule. Currently a manual / one-off carpet scan.

Output guarantee

Every input CID gets a row in 008a/b/c. 008d gets one row per unique keyword. 008e is per (grid_point × keyword).

What's Fucked

Phase 008 is half working, half broken. Here’s the split.

Forensic findings, 2026-05-16. Three sub-steps (008a, 008c, 008d) have real-world fire history; the other two have problems: 008b’s table is empty, and no populated 008e table has been found.

Finding 1 — 008b (organic SERP) has ZERO rows. Ever.

Direct query of enrichment_008b_serp_organic for atty_wa_seattle as of 2026-05-16 returns zero rows for any run_id. Not a partial coverage problem — complete absence.

The script exists at step_008b_serp_organic.py, the BQ table exists, the schema appears intact. But no run has ever successfully landed organic-SERP data. The most likely explanation is the recurring PIPELINE_RUN_ID gotcha (BQ write silently skips when env var is unset), but it could also be a permissions error, a schema mismatch, or the orchestrator never invoking 008b. Unverified which.

Monthly cost spec says this step should cost ~$3.33 per fire. If it’s being invoked but the BQ write is silently failing, that’s ~$3.33/month being burned for no output.

Finding 2 — 008a (circle grid) has fired ONCE successfully, on 2026-04-21. That run’s grid data is in 008a’s table.

Direct query of enrichment_008a_mapserp_competitors for atty_wa_seattle:

Run ID	Date	CIDs covered
`d196bd63-7a96-4fca-965b-b8dfd06954a6`	2026-04-21	2,741
`backfill_008a_missing_20260411T022655`	2026-04-11	1,428
`backfill_008a_20260411T010930`	2026-04-11	974

One real cron-driven run on 2026-04-21. Two manual backfills earlier in April. Nothing since. 008a is supposed to be monthly, so we’d expect the next fire around 2026-05-21 (a Thursday). Whether the cron will actually trigger it is unverified given the broader cron problems.

Finding 3 — 008c + 008d are firing weekly, alongside Phase 007. (Good news.)

Same run IDs as Phase 007. Same dates: 2026-04-21, 2026-04-28, 2026-05-11. (Same missing 2026-05-04.) Coverage:

Table	2026-04-21	2026-04-28	2026-05-11
`enrichment_008c_serp_local_finder`	2,772 rows	2,772 rows	2,772 rows
`enrichment_008d_serp_autocomplete`	1 row	20 rows	20 rows

008c is solid. 008d had a config change between 2026-04-21 and 2026-04-28 that took the run from 1 row to 20 rows — presumably the search_keyword set expanded from one default to 20 attorney specialties. Whether 1 was a bug or the original spec is unknown.

Finding 4 — 008e (carpet scan) is in the codebase but has no BQ table populated.

The script step_008e_mapserp_grid_drape.py exists, the geofence KMZ files exist (polaris_grid_20pt*.kmz, carpet_scan_v3.kmz, etc.), the HTML viewer exists (grid_drape_viewer.html). But no enrichment_008e_* table appears to have been populated under atty_wa_seattle — or the table name differs from the convention and we don’t know what to query.

008e is also not in the pipeline/steps overview tier system as far as we’ve verified — it appears to be a separate manual job rather than a cron-tier step. The script’s usage comments imply it’s intended to run on-demand against a specific KMZ.

Finding 5 — 008c’s enrichment table has no usable cid values, so it can’t be joined per-CID as-is.

When we queried enrichment_008c_serp_local_finder with COUNT(DISTINCT cid), it returned 0. A missing column would have made the query error out, so the cid column exists but is NULL on every row. The 2,772 rows match the CID count rather than the ~20 unique keywords, so the grain is probably still one row per firm, keyed on something else (run_id + search_keyword + lat/lon?). This is a schema fact worth knowing for anyone joining 008c output to the CID-grain tables.

Finding 6 — If 008a fires on the broken cron, it’s a $33-per-fire wrecking ball.

The cron expression 30 10 * * 1,4 fires Mon + Thu. If 008a’s “monthly” cadence gets picked up by every cron fire (like Phase 005 may be doing), 008a alone would cost ~$33 × 8-9 fires/month = $264-$297/month. The empirical data (only ONE cron-driven fire on record, on 2026-04-21) suggests this isn’t actually happening: either the cron isn’t triggering 008a or it’s failing silently. Investigate before fixing other phases, because firing 008a on the broken cron would explode the bill.
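The 8-9 fires/month figure can be checked by counting Mondays and Thursdays in a given month. The sketch below uses only the standard library; the function name cron_fires is hypothetical, and the per-fire cost is the 008a figure from the cost table (55,440 tasks × $0.0006).

```python
import calendar

COST_PER_008A_FIRE = 55_440 * 0.0006  # about $33.26, from the cost table


def cron_fires(year, month, weekdays=(calendar.MONDAY, calendar.THURSDAY)):
    """Count the days in a month on which a day-of-week `1,4` cron would fire."""
    days_in_month = calendar.monthrange(year, month)[1]
    return sum(
        1
        for day in range(1, days_in_month + 1)
        if calendar.weekday(year, month, day) in weekdays
    )


for month in (5, 6):
    fires = cron_fires(2026, month)
    cost = fires * COST_PER_008A_FIRE
    print(f"2026-{month:02d}: {fires} fires, ${cost:.2f} if 008a ran on every one")
```

May 2026 has 8 such days and June 2026 has 9, which brackets the worst-case monthly bill.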

The bottom line

Where Phase 008 Stands Today

008c + 008d fire weekly with Phase 007 and look healthy. 008a fired ONCE on 2026-04-21 and hasn’t since — either by design (monthly cadence not yet triggered again) or because something’s broken. 008b has never written a single row. 008e is a manual carpet-scan job, not on any schedule. The split is unusual: half of Phase 008 is the model of how the pipeline should work; the other half hasn’t worked at all.

The Fix

What we’ll do to make Phase 008 match the spec.

Concrete remediation. 008b first — it’s the only sub-step that’s completely broken.

Six fixes. 008b is the urgent one. Everything else is investigation, tuning, or guardrails.

FIX 1 Find out why 008b has zero rows. Make it write. ~30 min diagnosis + ~30 min fix

One manual run, then three possible outcomes to check in order:

SSH to VM, run 008b manually with PIPELINE_RUN_ID=manual_test_2026_05_16 set, watch the BQ write step in logs.
If the script reports “BQ write skipped: no PIPELINE_RUN_ID” — same root cause as Phase 002/003/004/005/006 (silent-skip on missing env var). Fix is identical: fail loud or auto-generate a run_id.
If the script reports “BQ write failed: schema mismatch” or similar — reconcile the script’s row dict against bq/schemas/enrichment.py.
If the script never runs at all (orchestrator doesn’t invoke it) — check cadence.py tags and orchestrator topological order.

This is the highest-priority fix in all of Phase 008. Without 008b, the “who outranks this firm on Google search” question has no answer at all.
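The “fail loud or auto-generate a run_id” option could look roughly like this. It is a sketch under assumptions: the helper name resolve_run_id, the allow_autogen switch, and the auto_008b_ prefix are hypothetical and do not come from the 008b script. Only the PIPELINE_RUN_ID variable name is from the source.

```python
import os
import uuid
from datetime import datetime, timezone


def resolve_run_id(env=os.environ, allow_autogen=False):
    """Return a usable run_id, or raise instead of silently skipping the BQ write."""
    run_id = env.get("PIPELINE_RUN_ID", "").strip()
    if run_id:
        return run_id
    if allow_autogen:
        # Hypothetical naming scheme: timestamp plus a short random suffix.
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
        return f"auto_008b_{stamp}_{uuid.uuid4().hex[:8]}"
    raise RuntimeError("PIPELINE_RUN_ID is unset; refusing to run without a run_id")


print(resolve_run_id({"PIPELINE_RUN_ID": "manual_test_2026_05_16"}))
```

Raising at startup means a misconfigured fire fails before any paid API tasks are submitted, whereas auto-generating keeps the data but makes the run harder to tie back to an orchestrator fire. Which trade-off fits is a decision for whoever lands the fix.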

FIX 2 Confirm 008a is supposed to fire monthly. Then either fire it or fix why it didn’t. ~20 min

008a last fired 2026-04-21. If it’s truly monthly, the next fire should land around 2026-05-21 (5 days from now as of 2026-05-16). Action:

Confirm cadence.py tags 008a as monthly.
Confirm the orchestrator’s monthly tier runs on the cron.
Wait until 2026-05-21 + 1 day; query enrichment_008a_mapserp_competitors for any new run_id; if none, 008a is broken and we have a 30-day data gap to chase.

If 008a HAS fired multiple times under the broken Mon+Thu cron (Finding 6), the bill is the problem — not the data. In that case Phase 005 Fix 1 cron split is the right fix.
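The date arithmetic behind “2026-05-21 + 1 day” is simple enough to encode so nobody has to recompute it by hand. This is a small sketch; the function names and the one-day grace period as a parameter are assumptions layered on the 30-day cadence stated in the schedule.

```python
from datetime import date, timedelta


def expected_fire(last_fire, cadence_days=30):
    """Date on which the next monthly fire should land."""
    return last_fire + timedelta(days=cadence_days)


def check_is_due(last_fire, today, cadence_days=30, grace_days=1):
    """True once the expected fire date plus the grace period has arrived."""
    return today >= expected_fire(last_fire, cadence_days) + timedelta(days=grace_days)


last = date(2026, 4, 21)
print(expected_fire(last))                   # the date the next 008a fire is expected
print(check_is_due(last, date(2026, 5, 16)))  # whether it is time to query for a new run_id
```

Once check_is_due returns True, the absence of any new run_id in enrichment_008a_mapserp_competitors is the signal that 008a is broken rather than merely waiting.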

FIX 3 Inherit Phase 005 Fix 1 cron split. no work — lands with Phase 005 Fix 1

Same as 005/006/007. Once the cron splits into monthly / weekly / biweekly entries:

008a + 008b fire on the 1st of each month only
008c + 008d continue weekly, only on Mondays
The Mon+Thu confusion is gone

No 008-specific cron line.

FIX 4 Decide what 008e should be: scheduled, on-demand, or retired. ~30 min decision

008e (carpet scan) is currently a manual job. Three real options:

Schedule it monthly alongside 008a + 008b. Adds ~$7.44/month. Useful for finding firms missing from gold.
Keep it on-demand. Operator runs it whenever they want a market-wide carpet view. Document the trigger.
Retire it. If the per-CID circle grids in 008a are giving us enough competitor coverage, 008e is redundant.

Decision blocks Fix 5 (the carpet-scan-specific BQ table needs to exist if 008e is staying).

FIX 5 Land 008e’s BQ schema (if Fix 4 keeps 008e alive). ~45 min

Define enrichment_008e_mapserp_carpet (or similar) schema in bq/schemas/enrichment.py. Grain: one row per (run_id, market_id, grid_lat, grid_lon, keyword, competitor_domain). Then wire 008e’s output writer to land into BQ on every run.

Without this fix, every carpet scan today lands data only on the VM disk — no BQ-side analytics.
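One possible shape for that schema, written as plain dicts in BigQuery's JSON schema style (name / type / mode). The field names come from the grain listed above; the types, the REQUIRED modes, and the helper functions are assumptions, and the actual conventions in bq/schemas/enrichment.py may differ.

```python
# Hypothetical schema for a table such as enrichment_008e_mapserp_carpet.
# Field names follow the grain named in Fix 5; types and modes are assumptions.
ENRICHMENT_008E_SCHEMA = [
    {"name": "run_id", "type": "STRING", "mode": "REQUIRED"},
    {"name": "market_id", "type": "STRING", "mode": "REQUIRED"},
    {"name": "grid_lat", "type": "FLOAT64", "mode": "REQUIRED"},
    {"name": "grid_lon", "type": "FLOAT64", "mode": "REQUIRED"},
    {"name": "keyword", "type": "STRING", "mode": "REQUIRED"},
    {"name": "competitor_domain", "type": "STRING", "mode": "REQUIRED"},
]

GRAIN = tuple(field["name"] for field in ENRICHMENT_008E_SCHEMA)


def grain_key(row):
    """Tuple that should be unique per row if the stated grain holds."""
    return tuple(row[name] for name in GRAIN)


def duplicate_keys(rows):
    """Return grain keys that appear more than once in a batch of row dicts."""
    seen, dupes = set(), set()
    for row in rows:
        key = grain_key(row)
        (dupes if key in seen else seen).add(key)
    return dupes
```

Running duplicate_keys over a batch before the write is a cheap way to confirm that the one-row-per-grain claim actually holds for real carpet-scan output.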

FIX 6 Add per-fire row-count assertions for all 5 sub-steps. ~30 min

Same pattern as Phases 002-007 fail-loud fixes. After each 008 sub-step writes its enrichment table, assert:

008a: 20 × input_CID_count grid-point rows (each CID gets 20 grid scans)
008b: ≥ 90% of input CIDs got a row (up to 10% slack for genuine SERP failures)
008c: ≥ 95% of input CIDs got a row
008d: row count = unique search_keyword count from input (~20 for atty)
008e: row count ~ (grid_points × keywords) — allow 10% slack for no-results points

If any assertion fails, exit non-zero with the specific shortfall.
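The five thresholds above can be sketched as one shared check. The function names and the keyword-argument interface are hypothetical; the thresholds are the ones listed in this fix, and 008e's “~ with 10% slack” is interpreted here as “at least 90% of grid_points × keywords”, which is an assumption.

```python
import sys


def ceil_pct(count, pct):
    """Smallest integer >= pct% of count, using integer math only."""
    return -(-count * pct // 100)


def coverage_problem(step, actual, *, input_cids=0, unique_keywords=0,
                     grid_points=0, keywords=0):
    """Return None when the step met its target, else a message with the shortfall.

    `actual` means grid-point rows for 008a, covered CIDs for 008b/008c,
    and plain row counts for 008d/008e.
    """
    if step == "008a":
        target, exact = 20 * input_cids, True
    elif step == "008b":
        target, exact = ceil_pct(input_cids, 90), False
    elif step == "008c":
        target, exact = ceil_pct(input_cids, 95), False
    elif step == "008d":
        target, exact = unique_keywords, True
    elif step == "008e":
        target, exact = ceil_pct(grid_points * keywords, 90), False
    else:
        raise ValueError(f"unknown sub-step: {step}")

    ok = (actual == target) if exact else (actual >= target)
    if ok:
        return None
    relation = "exactly" if exact else "at least"
    return f"{step}: expected {relation} {target}, got {actual} (off by {target - actual})"


def assert_coverage(step, actual, **inputs):
    """Exit non-zero with the specific shortfall."""
    problem = coverage_problem(step, actual, **inputs)
    if problem:
        sys.exit(problem)  # a string argument is printed to stderr; exit status is 1
```

With 008b's current zero rows, coverage_problem("008b", 0, input_cids=2772) reports a shortfall of 2,495 CIDs, which is exactly the kind of failure that has so far gone unnoticed.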

After all 6 fixes

The cron split from Phase 005 Fix 1 lands. 008a + 008b fire once per month on the 1st against fresh gold. 008b actually writes rows for the first time (fixing the “competitor organic-SERP” gap). 008c + 008d continue working weekly, alongside Phase 007. 008e is either scheduled monthly or kept on-demand — whichever Fix 4 chose. Every sub-step asserts coverage at the end. Phase 008 joins Phase 007 as a fully working enrichment phase.

Then we move on to Phase 009 (Backlinks — the first phase that uses the biweekly tier).

Generated 2026-05-16 from /mnt/workspace/amicus/site_pipeline_amicusdata/ on amicus-dev VM.