The Spec
How Phase 008 is supposed to behave.
Phase 008 — SERP Analysis. Design intent. Reference, not reality.
Duty: For every confirmed firm, find out who else shows up when a real person searches Google for the firm’s practice area in the firm’s neighborhood. Five different SERP cuts: 20-point Maps grid around each firm (008a), Google organic (008b), Google Local Finder (008c), autocomplete suggestions (008d), and a uniform geofence-wide carpet scan (008e).
Schedule: Mixed — 008a + 008b are monthly (expensive deep cuts). 008c + 008d are weekly (cheap fresh-data cuts). 008e is a separate carpet-scan job (geofence-wide, not per-firm).
End state: Five BigQuery enrichment tables populated with fresh rows per firm per run. Downstream heatmap and competitor-set analytics read from these.
What Phase 008 does, plain English
So far the pipeline has only looked at each firm in isolation — its own website (Phase 005), its own domain (Phase 006), its own GBP listing (Phase 007). Phase 008 zooms out and asks: when someone in Seattle searches “personal injury lawyer,” who actually shows up?
That’s a different question than “is this firm well-built” — it’s a competitor-set question. A firm might have great Lighthouse scores and rich GBP reviews but still get buried in search results because three bigger firms always rank above them. Phase 008 finds out who’s above them, in which neighborhoods, for which queries.
Five sub-steps cover five different SERP “flavors”:
- 008a — Maps SERP from a 20-point grid around each firm. Drives local-dominator heatmaps.
- 008b — Google organic SERP (the “10 blue links”) at the firm’s lat/lon for its search keyword.
- 008c — Google Local Finder (the “more places” view) at depth 20.
- 008d — Google Autocomplete suggestions for each unique search_keyword in the gold set.
- 008e — Geofence carpet scan — uniform 1.3-mile grid across the entire market with 8 fixed practice-area keywords. Catches firms outside the per-CID grids.
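A minimal sketch of how a uniform carpet grid like 008e's could be laid over a geofence polygon. It assumes the KMZ has already been parsed into a list of (lat, lon) vertices; `carpet_grid`, `point_in_polygon`, `carpet_tasks`, and the 69-miles-per-degree constant are illustrative choices, not names or values taken from step_008e_mapserp_grid_drape.py.

```python
import math

MILES_PER_DEG_LAT = 69.0  # rough average; an assumption of this sketch


def point_in_polygon(lat, lon, polygon):
    """Ray-casting test; polygon is a list of (lat, lon) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (lat1 > lat) != (lat2 > lat):
            cross_lon = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < cross_lon:
                inside = not inside
    return inside


def carpet_grid(polygon, spacing_miles=1.3):
    """Return (lat, lon) points on a square grid that fall inside polygon."""
    lats = [p[0] for p in polygon]
    lons = [p[1] for p in polygon]
    mid_lat = (min(lats) + max(lats)) / 2
    dlat = spacing_miles / MILES_PER_DEG_LAT
    # Longitude degrees shrink with latitude, so widen the step accordingly.
    dlon = spacing_miles / (MILES_PER_DEG_LAT * math.cos(math.radians(mid_lat)))
    rows = int((max(lats) - min(lats)) / dlat) + 1
    cols = int((max(lons) - min(lons)) / dlon) + 1
    points = []
    for r in range(rows):
        for c in range(cols):
            lat = min(lats) + r * dlat
            lon = min(lons) + c * dlon
            if point_in_polygon(lat, lon, polygon):
                points.append((lat, lon))
    return points


def carpet_tasks(polygon, keywords):
    """One task per (grid point, keyword) pair."""
    return [(lat, lon, kw) for lat, lon in carpet_grid(polygon) for kw in keywords]
```

The task count is simply grid points × keywords, which is where the ~1,551 × 8 figure for atty_wa_seattle comes from.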
The 5 sub-steps
| Step | What it does | Tier | Grain | Cost | Output |
| 008a | Maps SERP 20-point circle grid (1 center + 6 inner hex + 13 outer ring, 4mi outer radius). Filters to law-related GBP categories. | monthly | per CID × 20 grid points | $0.0006/task | enrichment_008a_mapserp_competitors |
| 008b | Google Organic SERP at firm’s lat/lon for its search_keyword, depth=20 (two SERP pages). | monthly | per CID | $0.0012/task | enrichment_008b_serp_organic |
| 008c | Google Local Finder SERP — expanded “more places” view, depth=20. | weekly | per CID | $0.0006/task | enrichment_008c_serp_local_finder |
| 008d | Google Autocomplete suggestions for each unique search_keyword in the gold set. Geographically pinned via location_code. | weekly | per unique keyword (NOT per CID) | $0.0006/task | enrichment_008d_serp_autocomplete |
| 008e | Uniform 1.3-mile grid across geofence polygon × 8 fixed practice-area keywords. Reads a KMZ file, not gold JSON. | (separate) | per grid point × keyword | $0.0006/task | enrichment_008e_* (TBD) |
008a vs 008e — two very different grid strategies
Both 008a and 008e use the Maps SERP endpoint, but with completely different scan patterns:
| Aspect | 008a circle grid | 008e carpet scan |
| Anchor | Each ATTORNEY_CONFIRMED CID’s lat/lon | The market’s geofence polygon (KMZ file) |
| Pattern | 20 points in a circle: 1 center + 6 inner hex (2mi radius) + 13 outer ring (4mi) | Uniform ~1.3mi square-grid across the polygon |
| Keyword | The firm’s own search_keyword (from gold) | 8 fixed practice-area keywords |
| Volume for atty_wa_seattle | 2,772 CIDs × 20 = 55,440 tasks | ~1,551 points × 8 = 12,408 tasks |
| Cost per fire | ~$33.26 | ~$7.44 |
| Use case | Local-dominator heatmap centered on each firm | Find firms missing from gold; uniform market view |
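The 008a pattern (1 center + 6 inner points at 2mi + 13 outer points at 4mi) can be sketched as below. The even angular spacing and the starting bearing are assumptions; the authoritative layout lives in 008a_grid_spec.md, and `circle_grid` is a hypothetical helper name.

```python
import math

MILES_PER_DEG_LAT = 69.0  # rough average; an assumption of this sketch


def circle_grid(lat, lon, inner_miles=2.0, outer_miles=4.0):
    """Return 20 (lat, lon) points: center, 6-point inner ring, 13-point outer ring."""
    points = [(lat, lon)]
    for radius, count in ((inner_miles, 6), (outer_miles, 13)):
        for k in range(count):
            # Assumed: points evenly spaced around the ring, first one due north.
            bearing = 2 * math.pi * k / count
            dlat = radius * math.cos(bearing) / MILES_PER_DEG_LAT
            dlon = radius * math.sin(bearing) / (
                MILES_PER_DEG_LAT * math.cos(math.radians(lat))
            )
            points.append((lat + dlat, lon + dlon))
    return points


# Task volume for one fire: every confirmed CID gets its own 20-point grid.
confirmed_cids = 2772
tasks_per_fire = confirmed_cids * len(circle_grid(47.6062, -122.3321))
```

With 2,772 CIDs this yields the 55,440 tasks quoted in the table.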
How the data moves
↓
- 008a · circle grid (monthly): step_008a_mapserp_competitors.py. 20 points × ~2,772 CIDs = 55,440 calls. Filters competitors to the LAW_CATEGORIES set.
- 008b · organic SERP (monthly): step_008b_serp_organic.py. One call per CID, depth=20. Advanced result types (~50 SERP features).
- 008c · Local Finder (weekly): step_008c_serp_local_finder.py. One call per CID, depth=20. Expanded local-pack listings.
- 008d · autocomplete (weekly): step_008d_serp_autocomplete.py. One call per unique search_keyword. Tiny: ~20 calls for atty_wa_seattle.
- 008e · carpet scan: step_008e_mapserp_grid_drape.py. ~1,551 points × 8 keywords. Uniform geofence sweep.
↓
Output · 5 enrichment tables: enrichment_008{a,b,c,d,e}_*. Heatmaps + competitor sets per firm, plus the market-wide carpet view.
Where to look — file & table reference
| Thing | Path or table |
| The 5 scripts | /mnt/workspace/amicus/pipeline/steps/008_serp_analysis/step_008[abcde]_*.py |
| 008a grid spec | pipeline/steps/008_serp_analysis/008a_grid_spec.md |
| 008e geofence KMZ files | pipeline/steps/008_serp_analysis/*.kmz (per-market polygons) |
| Heatmap viewer | pipeline/steps/008_serp_analysis/grid_drape_viewer.html |
| BQ output tables | enrichment_008a_mapserp_competitors · enrichment_008b_serp_organic · enrichment_008c_serp_local_finder · enrichment_008d_serp_autocomplete · enrichment_008e_* |
| Async client | pipeline/steps/shared/dfs_async_client.py |
| Per-step logs | pipeline/steps/000_log_files/step_008*_*.log |
Cost per fire
Anchored on 2,772 ATTORNEY_CONFIRMED CIDs (queried 2026-05-16). Two distinct fire types: monthly fires (008a + 008b) are big, weekly fires (008c + 008d) are cheap. 008e is a separate ad-hoc job.
| Line item | Volume | Per unit | Subtotal |
| 008a — Maps SERP circle grid (20 pts × 2,772 CIDs) | ~55,440 tasks | $0.0006 | ~$33.26 |
| 008b — Organic SERP (depth=20) | ~2,772 tasks | $0.0012 | ~$3.33 |
| Monthly fire (008a + 008b) | | | ~$37 |
| 008c — Local Finder (depth=20) | ~2,772 tasks | $0.0006 | ~$1.66 |
| 008d — Autocomplete (per unique keyword) | ~20 tasks | $0.0006 | ~$0.01 |
| Weekly fire (008c + 008d) | | | ~$1.67 |
| 008e — Carpet scan (1,551 points × 8 keywords) | ~12,408 tasks | $0.0006 | ~$7.44 |
| Monthly target total (assuming 008e fires monthly) | | | ~$50 |
Volume anchors: 2,772 ATTORNEY_CONFIRMED CIDs (queried from enrichment_007a_gbp_info); 008e carpet-scan grid point count from the script header (~1,551 grid points × 8 keywords); cost-per-task constants from each script’s COST_PER_TASK.
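The subtotals can be reproduced from the volume anchors and per-task prices with a few lines of arithmetic. This is a sketch: the dictionary layout and function names are illustrative, and the "four weekly fires per month" used for the monthly total is an assumption, since the document does not say how the ~$50 figure was rounded.

```python
from decimal import Decimal

CONFIRMED_CIDS = 2772
UNIQUE_KEYWORDS = 20
CARPET_POINTS = 1551
CARPET_KEYWORDS = 8

# step -> (task volume, cost per task in USD)
STEPS = {
    "008a": (CONFIRMED_CIDS * 20, Decimal("0.0006")),
    "008b": (CONFIRMED_CIDS, Decimal("0.0012")),
    "008c": (CONFIRMED_CIDS, Decimal("0.0006")),
    "008d": (UNIQUE_KEYWORDS, Decimal("0.0006")),
    "008e": (CARPET_POINTS * CARPET_KEYWORDS, Decimal("0.0006")),
}


def step_cost(step):
    """Cost of one fire of a single sub-step."""
    tasks, per_task = STEPS[step]
    return tasks * per_task


def fire_cost(steps):
    """Cost of one fire of a group of sub-steps."""
    return sum((step_cost(s) for s in steps), Decimal("0"))


monthly_fire = fire_cost(["008a", "008b"])
weekly_fire = fire_cost(["008c", "008d"])
carpet_fire = fire_cost(["008e"])

# Assumption: four weekly fires in the month, plus one monthly and one carpet fire.
monthly_total = monthly_fire + 4 * weekly_fire + carpet_fire
```

Under that assumption the monthly total comes out just under $51, consistent with the ~$50 target in the table.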
Schedule
Monthly (008a + 008b)
Fires once per 30 days. Deep cuts: full circle grid + organic SERP per firm.
Weekly (008c + 008d)
Fires every Monday. Cheap fresh-data cuts — alongside Phase 007.
Concurrency
DFSAsyncClient, 50 in-flight, 2,000 RPM cap (unlike 005/007, 008 doesn’t throttle below the cap).
Crash recovery
--resume flag for all 4 of 008a-008d.
008e cadence
Not yet on a recurring schedule. Currently a manual / one-off carpet scan.
Output guarantee
Every input CID gets a row in 008a/b/c. 008d gets one row per unique keyword. 008e is per (grid_point × keyword).
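For the concurrency setting above, a minimal sketch of capping in-flight requests at 50 with an asyncio semaphore. `run_bounded` and the `worker` coroutine are hypothetical stand-ins; the real DFSAsyncClient interface is not shown in this document, and the 2,000 RPM cap is not modelled here.

```python
import asyncio


async def run_bounded(items, worker, max_in_flight=50):
    """Run worker(item) for every item, never more than max_in_flight at once."""
    sem = asyncio.Semaphore(max_in_flight)

    async def guarded(item):
        # The semaphore blocks here once max_in_flight workers are active.
        async with sem:
            return await worker(item)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_serp_call(cid):
    """Placeholder for one SERP task; a real worker would call the API client."""
    await asyncio.sleep(0)
    return {"cid": cid, "status": "ok"}


if __name__ == "__main__":
    results = asyncio.run(run_bounded(range(200), fake_serp_call))
    print(len(results), "tasks completed")
```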
What's Fucked
Phase 008 is half working, half broken. Here’s the split.
Forensic findings, 2026-05-16. Three sub-steps (008a, 008c, 008d) have real-world fire history; the other two have problems: 008b’s table is empty, and 008e has no populated table we can find.
Finding 1 — 008b (organic SERP) has ZERO rows. Ever.
Direct query of enrichment_008b_serp_organic for atty_wa_seattle as of 2026-05-16 returns zero rows for any run_id. Not a partial coverage problem — complete absence.
The script exists at step_008b_serp_organic.py, the BQ table exists, the schema appears intact. But no run has ever successfully landed organic-SERP data. The most likely explanation is the recurring PIPELINE_RUN_ID gotcha (BQ write silently skips when env var is unset), but it could also be a permissions error, a schema mismatch, or the orchestrator never invoking 008b. Unverified which.
Monthly cost spec says this step should cost ~$3.33 per fire. If it’s being invoked but the BQ write is silently failing, that’s ~$3.33/month being burned for no output.
Finding 2 — 008a (circle grid) has fired ONCE successfully, on 2026-04-21.
Direct query of enrichment_008a_mapserp_competitors for atty_wa_seattle:
| Run ID | Date | CIDs covered |
| d196bd63-7a96-4fca-965b-b8dfd06954a6 | 2026-04-21 | 2,741 |
| backfill_008a_missing_20260411T022655 | 2026-04-11 | 1,428 |
| backfill_008a_20260411T010930 | 2026-04-11 | 974 |
One real cron-driven run on 2026-04-21. Two manual backfills earlier in April. Nothing since. 008a is supposed to be monthly, so we’d expect a fire around 2026-05-21 (a Thursday). Whether the cron will actually trigger it is unverified given the broader cron problems.
Finding 3 — 008c + 008d are firing weekly, alongside Phase 007. (Good news.)
Same run IDs as Phase 007. Same dates: 2026-04-21, 2026-04-28, 2026-05-11. (Same missing 2026-05-04.) Coverage:
| Table | 2026-04-21 | 2026-04-28 | 2026-05-11 |
| enrichment_008c_serp_local_finder | 2,772 rows | 2,772 rows | 2,772 rows |
| enrichment_008d_serp_autocomplete | 1 row | 20 rows | 20 rows |
008c is solid. 008d had a config change between 2026-04-21 and 2026-04-28 that took the run from 1 row to 20 rows — presumably the search_keyword set expanded from one default to 20 attorney specialties. Whether 1 was a bug or the original spec is unknown.
Finding 4 — 008e (carpet scan) is in the codebase but has no BQ table populated.
The script step_008e_mapserp_grid_drape.py exists, the geofence KMZ files exist (polaris_grid_20pt*.kmz, carpet_scan_v3.kmz, etc.), the HTML viewer exists (grid_drape_viewer.html). But no enrichment_008e_* table appears to have been populated under atty_wa_seattle — or the table name differs from the convention and we don’t know what to query.
008e is also not in the pipeline/steps overview tier system as far as we’ve verified — it appears to be a separate manual job rather than a cron-tier step. The script’s usage comments imply it’s intended to run on-demand against a specific KMZ.
Finding 5 — 008c rows are not keyed on CID in its enrichment table.
When we queried enrichment_008c_serp_local_finder with COUNT(DISTINCT cid), it returned 0, which means cid is NULL on every row (a missing column would have failed the query rather than returned 0). The 2,772 rows match the confirmed-CID count, so they are presumably one per firm but keyed on something else (run_id + search_keyword + lat/lon?). Anyone joining 008c output to the CID-grain tables needs to know this.
Finding 6 — If 008a fires on the broken cron, it’s a $33-per-fire wrecking ball.
The cron expression 30 10 * * 1,4 fires Mon + Thu. If 008a’s “monthly” cadence gets picked up by every cron fire (like Phase 005 may be doing), 008a alone would cost ~$33 × 8-9 fires/month = $264-$300/month. The empirical data (only ONE successful cron fire, on 2026-04-21) suggests this isn’t actually happening: either the cron isn’t triggering 008a or it’s failing silently. Investigate before fixing other phases, because firing 008a on the broken cron would explode the bill.
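The 8-9 fires/month figure can be checked by counting Mondays and Thursdays in a given month. A small sketch; the function names are illustrative, and the per-fire cost is the unrounded 55,440 × $0.0006.

```python
import calendar

# Cron day-of-week "1,4" means Monday and Thursday.
CRON_WEEKDAYS = (calendar.MONDAY, calendar.THURSDAY)
COST_008A_PER_FIRE = 55440 * 0.0006  # USD, one full circle-grid fire


def fires_in_month(year, month, weekdays=CRON_WEEKDAYS):
    """Count the days in the month on which the Mon+Thu cron would fire."""
    days_in_month = calendar.monthrange(year, month)[1]
    return sum(
        1
        for day in range(1, days_in_month + 1)
        if calendar.weekday(year, month, day) in weekdays
    )


def runaway_008a_cost(year, month):
    """Worst case: 008a runs on every cron fire instead of once a month."""
    return fires_in_month(year, month) * COST_008A_PER_FIRE
```

In 2026, May has 8 Mon/Thu days and June has 9, so the worst case lands between roughly $266 and $300 per month.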
The bottom line
Where Phase 008 Stands Today
008c + 008d fire weekly with Phase 007 and look healthy. 008a fired ONCE on 2026-04-21 and hasn’t since — either by design (monthly cadence not yet triggered again) or because something’s broken. 008b has never written a single row. 008e is a manual carpet-scan job, not on any schedule. The split is unusual: half of Phase 008 is the model of how the pipeline should work; the other half hasn’t worked at all.
The Fix
What we’ll do to make Phase 008 match the spec.
Concrete remediation. 008b first — it’s the only sub-step that’s completely broken.
Six fixes. 008b is the urgent one. Everything else is investigation, tuning, or guardrails.
Fix 1 — 008b: find out why nothing lands in BigQuery
One manual run, then three possible outcomes:
- SSH to the VM and run 008b manually with PIPELINE_RUN_ID=manual_test_2026_05_16 set. Watch the BQ write step in the logs.
- If the script reports “BQ write skipped: no PIPELINE_RUN_ID”, it is the same root cause as Phase 002/003/004/005/006 (silent skip on a missing env var). The fix is identical: fail loud or auto-generate a run_id.
- If the script reports “BQ write failed: schema mismatch” or similar, reconcile the script’s row dict against bq/schemas/enrichment.py.
- If the script never runs at all (the orchestrator doesn’t invoke it), check the cadence.py tags and the orchestrator’s topological order.
This is the highest-priority fix in all of Phase 008. Without 008b, the “who outranks this firm on Google search” question has no answer at all.
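The "fail loud or auto-generate a run_id" option can be sketched as follows. `resolve_run_id`, `MissingRunIdError`, and the `auto_` prefix are hypothetical; the actual scripts' handling of PIPELINE_RUN_ID is not shown in this document.

```python
import os
import uuid


class MissingRunIdError(RuntimeError):
    """Raised instead of silently skipping the BQ write."""


def resolve_run_id(env=None, allow_autogenerate=False):
    """Return a usable run_id, or fail loudly if none is available."""
    env = os.environ if env is None else env
    run_id = (env.get("PIPELINE_RUN_ID") or "").strip()
    if run_id:
        return run_id
    if allow_autogenerate:
        # Assumed naming scheme; any unique, recognizable id would do.
        return f"auto_{uuid.uuid4()}"
    raise MissingRunIdError(
        "PIPELINE_RUN_ID is unset; refusing to skip the BQ write silently"
    )
```

Calling this once at script start means a missing env var stops the run before any paid API tasks are issued, rather than after.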
Fix 2 — 008a: confirm the monthly fire actually lands
008a last fired 2026-04-21. If it’s truly monthly, the next fire should land around 2026-05-21 (5 days from now as of 2026-05-16). Action:
- Confirm cadence.py tags 008a as monthly.
- Confirm the orchestrator’s monthly tier runs on the cron.
- Wait until 2026-05-21 + 1 day, then query enrichment_008a_mapserp_competitors for any new run_id. If there is none, 008a is broken and we have a 30-day data gap to chase.
If 008a HAS fired multiple times under the broken Mon+Thu cron (Finding 6), the bill is the problem — not the data. In that case Phase 005 Fix 1 cron split is the right fix.
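The "wait until 2026-05-21 + 1 day" rule amounts to a simple date comparison. A sketch, assuming a 30-day cadence and a 1-day grace period; `monthly_fire_overdue` is a hypothetical helper, and the last-fire date would come from the most recent run in enrichment_008a_mapserp_competitors.

```python
from datetime import date, timedelta


def monthly_fire_overdue(last_fire, today, cadence_days=30, grace_days=1):
    """True once today is past the expected fire date plus the grace period."""
    deadline = last_fire + timedelta(days=cadence_days + grace_days)
    return today > deadline


last_008a_fire = date(2026, 4, 21)

# As of the forensic date the fire is not yet due.
not_due_yet = monthly_fire_overdue(last_008a_fire, date(2026, 5, 16))
# If nothing new has landed by 2026-05-23, treat 008a as broken.
overdue = monthly_fire_overdue(last_008a_fire, date(2026, 5, 23))
```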
Fix 3 — Cron split (shared with Phases 005/006/007)
Same as 005/006/007. Once the cron splits into monthly / weekly / biweekly entries:
- 008a + 008b fire on the 1st of each month only
- 008c + 008d continue weekly, only on Mondays
- The Mon+Thu confusion is gone
No 008-specific cron line.
Fix 4 — 008e: decide its cadence
008e (carpet scan) is currently a manual job. Three real options:
- Schedule it monthly alongside 008a + 008b. Adds ~$7.44/month. Useful for finding firms missing from gold.
- Keep it on-demand. Operator runs it whenever they want a market-wide carpet view. Document the trigger.
- Retire it. If the per-CID circle grids in 008a are giving us enough competitor coverage, 008e is redundant.
Decision blocks Fix 5 (the carpet-scan-specific BQ table needs to exist if 008e is staying).
Fix 5 — 008e: define the BQ table and wire the writer
Define the enrichment_008e_mapserp_carpet (or similar) schema in bq/schemas/enrichment.py. Grain: one row per (run_id, market_id, grid_lat, grid_lon, keyword, competitor_domain). Then wire 008e’s output writer to land rows in BQ on every run.
Without this fix, every carpet scan today lands data only on the VM disk — no BQ-side analytics.
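One possible shape for that schema, written as plain Python data with a row check. Everything here is an assumption for illustration: the column types, the modes, the nullable competitor_domain (for grid points that return no results), and the `validate_row` helper. The real layout of bq/schemas/enrichment.py is not shown in this document.

```python
# Hypothetical schema for the carpet-scan table, at the grain named above.
CARPET_SCHEMA = [
    {"name": "run_id", "type": "STRING", "mode": "REQUIRED"},
    {"name": "market_id", "type": "STRING", "mode": "REQUIRED"},
    {"name": "grid_lat", "type": "FLOAT64", "mode": "REQUIRED"},
    {"name": "grid_lon", "type": "FLOAT64", "mode": "REQUIRED"},
    {"name": "keyword", "type": "STRING", "mode": "REQUIRED"},
    # Assumed nullable so that no-results grid points can still land a row.
    {"name": "competitor_domain", "type": "STRING", "mode": "NULLABLE"},
]

PY_TYPES = {"STRING": str, "FLOAT64": float}


def validate_row(row, schema=CARPET_SCHEMA):
    """Return a list of problems; an empty list means the row fits the schema."""
    problems = []
    for col in schema:
        value = row.get(col["name"])
        if value is None:
            if col["mode"] == "REQUIRED":
                problems.append(f"{col['name']}: missing required value")
            continue
        if not isinstance(value, PY_TYPES[col["type"]]):
            problems.append(f"{col['name']}: expected {col['type']}")
    for extra in sorted(set(row) - {c["name"] for c in schema}):
        problems.append(f"{extra}: not in schema")
    return problems
```

Running a check like this before the write would surface the kind of schema mismatch suspected for 008b instead of letting it fail downstream.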
Fix 6 — Coverage assertions after every sub-step
Same pattern as the Phases 002-007 fail-loud fixes. After each 008 sub-step writes its enrichment table, assert:
- 008a: 20 × input_CID_count grid-point rows (each CID gets 20 grid scans)
- 008b: ≥ 90% of input CIDs got a row (up to 10% slack for genuine SERP failures)
- 008c: ≥ 95% of input CIDs got a row
- 008d: row count = unique search_keyword count from input (~20 for atty)
- 008e: row count ≈ grid_points × keywords, allowing 10% slack for no-results points
If any assertion fails, exit non-zero with the specific shortfall.
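The assertions above can be sketched as one function that returns the specific shortfalls. The function name, the shape of the `counts` dict, and the treatment of the 008a figure as a minimum are assumptions of this sketch; how each count is obtained from BQ is left out.

```python
import sys


def pct_floor(total, percent):
    """Smallest integer count that is at least `percent` of total."""
    return (total * percent + 99) // 100


def coverage_shortfalls(counts, input_cids, unique_keywords, grid_points, keywords=8):
    """Compare observed counts per sub-step against the minimum expected."""
    minimums = {
        "008a": 20 * input_cids,                        # 20 grid scans per CID
        "008b": pct_floor(input_cids, 90),              # 10% slack
        "008c": pct_floor(input_cids, 95),              # 5% slack
        "008d": unique_keywords,                        # exact match expected
        "008e": pct_floor(grid_points * keywords, 90),  # 10% slack
    }
    problems = []
    for step, minimum in minimums.items():
        got = counts.get(step, 0)
        if got < minimum:
            problems.append(f"{step}: got {got}, expected at least {minimum}")
    return problems


if __name__ == "__main__":
    # Illustrative numbers: 008b reflects the zero-row state from Finding 1.
    observed = {"008a": 55440, "008b": 0, "008c": 2772, "008d": 20, "008e": 12408}
    failures = coverage_shortfalls(observed, 2772, 20, 1551)
    if failures:
        sys.exit("Coverage assertion failed: " + "; ".join(failures))
```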
After all 6 fixes
The cron split from Phase 005 Fix 1 lands. 008a + 008b fire once per month on the 1st against fresh gold. 008b actually writes rows for the first time (fixing the “competitor organic-SERP” gap). 008c + 008d continue working weekly, alongside Phase 007. 008e is either scheduled monthly or kept on-demand — whichever Fix 4 chose. Every sub-step asserts coverage at the end. Phase 008 joins Phase 007 as a fully working enrichment phase.
Then we move on to Phase 009 (Backlinks — the first phase that uses the biweekly tier).