The Spec
How Phase 001 is supposed to behave.
Phase 001 — Data Acquisition. Design intent. Reference, not reality.
Duty: Take a market config (e.g. atty_wa_seattle), fetch raw Google Maps SERPs from DataForSEO, clean and dedupe them, and write a candidate-domain list to amicus_pipeline.silver.
Schedule: Fires every 30 days, end-to-end, all 9 sub-steps in a single chain.
End state: Fresh silver-layer candidate domains ready for Phase 002 to verify.
What Phase 001 does, plain English
Phase 001 is the data acquisition phase. It does not classify firms (that’s Phase 003), it does not verify domains resolve (that’s Phase 002), and it does not tag specialties (that’s Phase 004). All Phase 001 does is:
- Hit Google Maps with ~17 keywords across ~18 geographic coordinates for the target market
- Collect the raw SERP responses into BigQuery’s bronze table
- Split, dedupe, geofence, filter, and normalize that raw data through seven sequential cleaning steps (001b → 001h)
- Promote the cleaned result to amicus_pipeline.silver — the candidate-domain list
That silver list is the input that Phase 002 picks up. Phase 001’s job is done the moment silver is fresh.
The 9 sub-steps, in order
| Step | What it does | Provider | Reads | Writes |
| 001a | Google Maps SERP fetch — ~17 keywords × ~18 coordinates per profile | DataForSEO | market config (GCS) | raw JSON → disk; rows → bronze |
| 001b | Split raw SERP rows into “paid ads” vs “organic” | — | bronze | bronze |
| 001c | Dedupe Google Business listings by CID (Google’s unique business ID) | — | bronze | bronze |
| 001d | Apply geofence — drop results outside the target geographic boundary | — | bronze | bronze |
| 001e | Keep only results that have a website URL | — | bronze | bronze |
| 001f | Normalize domain strings — lowercase, strip www., strip trailing slashes | — | bronze | bronze |
| 001g | Drop directory/aggregator domains (avvo, findlaw, justia — listings of firms, not firms) | — | bronze | bronze |
| 001h | Final unique dedupe so each domain appears once | — | bronze | bronze |
| 001s | Promote cleaned bronze → silver. This is the handoff to Phase 002. | — | bronze | silver |
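Steps 001f–001h are mechanical enough to sketch. A minimal illustration of the intent — not the real step code; the function names and the aggregator drop-list here are assumptions:

```python
# Sketch of the 001f -> 001h cleaning intent. Illustrative only; the
# real steps run as BigQuery transforms, and this drop-list is assumed.
from urllib.parse import urlparse

DIRECTORY_DOMAINS = {"avvo.com", "findlaw.com", "justia.com"}  # 001g drop-list (assumed)

def normalize_domain(url: str) -> str:
    """001f: lowercase, strip www., strip trailing slashes."""
    host = urlparse(url if "//" in url else f"https://{url}").netloc or url
    host = host.lower().rstrip("/")
    return host.removeprefix("www.")

def clean(urls: list[str]) -> list[str]:
    """001f-001h: normalize, drop directory domains, final unique dedupe."""
    seen: set[str] = set()
    out: list[str] = []
    for u in urls:
        d = normalize_domain(u)
        if d in DIRECTORY_DOMAINS or d in seen:  # 001g drop / 001h dedupe
            continue
        seen.add(d)
        out.append(d)
    return out
```

So `clean(["https://www.avvo.com/x", "https://Smith.law/", "smith.law"])` keeps only one `smith.law` row — the avvo listing is dropped and the duplicate collapses.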
How the data moves (and where it lives)
Phase 001 reads two inputs, runs nine scripts on the VM, mutates one BigQuery table seven times, and writes one final output table:
Code · runs on the VM
  Steps 001a → 001s (9 Python scripts, in order)
  /mnt/workspace/amicus/pipeline/steps/001_data_acquisition/
        ↓
Intermediate · BigQuery
  amicus_pipeline.bronze
  001a writes raw SERP rows here; 001b → 001h filter & dedupe in place.

Audit trail · VM disk
  000_raw_serp_json/ — raw JSON dumps written by 001a, for replay / debug if BQ goes weird.
        ↓
001s promotes cleaned bronze → silver
        ↓
Output · the deliverable
  amicus_pipeline.silver — the cleaned candidate-domain list (one row per unique candidate domain)
        ↓
Hands off to Phase 002 — Domain Verification (out of scope for this page)
Where to look — file & table reference
| Thing | Path or table |
| The 9 scripts | /mnt/workspace/amicus/pipeline/steps/001_data_acquisition/step_001*.py |
| Market config (input) | gs://amicus_app_configs/<market_id>.yaml |
| Raw JSON dumps (audit trail) | pipeline/steps/001_data_acquisition/000_raw_serp_json/ |
| Per-step structured output | pipeline/steps/001_data_acquisition/output/<step_id>/<YYYY-MM-DD>/ |
| Per-step log files | pipeline/steps/000_log_files/step_001*_*.log |
| Bronze table (raw + intermediate) | amicus_pipeline.bronze |
| Silver table (the deliverable) | amicus_pipeline.silver |
| Run history (one row per run) | amicus_logs.pipeline_runs |
| Per-step accounting | amicus_logs.step_logs |
| Per-API-call cost ledger | amicus_logs.api_cost_log |
Cost per fire
Only 001a costs money — it’s the only step that calls a paid external API. Everything else is a BigQuery transform on data already paid for.
| Line item | Calls | Per call | Subtotal |
| 001a — DFS Maps SERP for atty_wa_seattle (17 kw × 18 coords) | 306 | $0.002 | $0.61 |
| 001a — DFS Maps SERP for tow_wa_seattle (17 kw × 18 coords) | 306 | $0.002 | $0.61 |
| 001b → 001s — BigQuery transforms (no external API) | — | — | $0.00 |
| Total per 30-day fire | 612 | — | ~$1.22 |
Schedule
Frequency
Every 30 days. Once. No more, no less.
Trigger
Cron on the 1st of each month. Not manual.
Execution mode
Atomic chain — all 9 sub-steps (001a → 001s) fire in order, in a single run.
Partial runs
Not allowed. If any sub-step fails, the whole run fails loud.
Skipping steps
Not allowed. Every sub-step must run every time.
Other cadences
None. No daily, no weekly, no biweekly fires of any Phase 001 step.
Output guarantee
Silver is fresh, or the run fails loud. Never “completed with zero rows.”
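The execution-mode contract above — atomic chain, no partial runs, fail loud — can be sketched as a runner. This is a hypothetical illustration of the semantics, not the actual orchestrator.py:

```python
# Hypothetical fail-loud chain runner illustrating the Phase 001
# contract. The real orchestrator.py may be structured differently.
STEPS = ["001a", "001b", "001c", "001d", "001e",
         "001f", "001g", "001h", "001s"]

def run_chain(run_step) -> None:
    """Fire all 9 sub-steps in order. Any non-zero exit aborts the whole run."""
    for step in STEPS:
        rc = run_step(step)
        if rc != 0:
            # No partial runs, no skipped steps, never a quiet
            # "completed with zero rows" -- the run dies here, loudly.
            raise SystemExit(f"Phase 001 run FAILED at {step} (exit {rc})")
```

Note the deliberate absence of any per-step skip or continue-on-error path: the spec forbids both.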
What's Fucked
Phase 001 is not running the spec. Here’s exactly how.
Forensic findings, 2026-05-15. Every claim has a query or a file path.
Finding 1 — Only 001a auto-fires. 001b through 001s never auto-fire.
pipeline/steps/cadence.py tags 001a as biweekly but every other step in Phase 001 as intake. The intake tier has runs_on: [] — meaning never auto-fires.
| Step | Cadence tag | Auto-fires? |
| 001a | biweekly | Every Mon + Thu × 2 profiles = 4 runs/week |
| 001b | intake | never |
| 001c | intake | never |
| 001d | intake | never |
| 001e | intake | never |
| 001f | intake | never |
| 001g | intake | never |
| 001h | intake | never |
| 001s | intake | never |
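Why intake never fires falls straight out of the tier config. A minimal sketch of the mechanism — the structure is inferred from the runs_on: [] finding above, not copied from cadence.py:

```python
# Sketch of how an empty runs_on list makes a tier un-schedulable.
# Tier shape is an assumption inferred from the finding; cadence.py
# itself may differ.
TIERS = {
    "biweekly": {"runs_on": ["mon", "thu"]},  # only 001a carries this tag
    "intake":   {"runs_on": []},              # 001b -> 001s: nothing to match
}

def fires_on(tier: str, weekday: str) -> bool:
    """A tier fires only when the orchestrator's weekday is in its runs_on list."""
    return weekday in TIERS[tier]["runs_on"]
```

Whatever weekday the cron passes in, `fires_on("intake", ...)` is False — an empty list can never match, so the tag is a scheduling dead end.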
Finding 2 — There is no 30-day intake cron. None.
The spec says Phase 001 fires every 30 days. The reality: no cron entry exists that runs intake on a calendar. crontab -l on the VM shows exactly one orchestrator entry — 30 10 * * 1,4 — which fires only the biweekly/weekly/monthly tiers, never intake.
Phase 001 only completes when someone manually SSHs into the VM and fires it by hand. That has happened roughly once per month, and when it happens, it usually cascades and partially fails.
Finding 3 — 001a’s biweekly fires write orphan data.
Twice a week, 001a fires ~306 DFS Maps SERP calls per profile, dumps raw JSON to 000_raw_serp_json/, and (allegedly) appends rows to the bronze table. Nothing downstream processes that data because 001b through 001s never auto-fire. The fresh SERPs accumulate as orphan data sitting in bronze.
Every Monday and Thursday for weeks, we have been paying DataForSEO to refresh data that no downstream step reads.
Finding 4 — step_logs.rows_written = 0 on every recent 001a fire.
| Date | Profile | Step | Status | rows_written |
| 2026-05-01 | atty | 001a | completed | 0 |
| 2026-05-07 | atty | 001a | completed | 0 |
| 2026-05-11 | atty | 001a | completed | 0 |
| 2026-05-14 | atty | 001a | completed | 0 |
| 2026-05-14 | tow | 001a | completed | 0 |
Either 001a’s BigQuery insert is silently failing (its docstring claims it writes bronze rows, but the accounting says zero) or the step_logs.rows_written field is broken (we already know step_logs.api_cost is broken — logs $0 even when real money was spent). One of those two is true. Not yet confirmed which.
Finding 5 — The cron didn’t exist until 2026-05-13.
stat /var/spool/cron/crontabs/localaccount reports the crontab was created on 2026-05-13 20:29 UTC. Before that date, no scheduled fires of Phase 001 existed at all — every prior fire was a manual SSH-and-run-by-hand. The 2026-05-04 attempted monthly intake had no cron to fall back on; it was someone firing scripts in sequence by hand and the cascade was what we saw on disk in 55 step log files.
Finding 6 — When intake IS fired manually, it cascades.
On 2026-04-06, somebody tried to fire intake by hand for atty_wa_seattle + tow_wa_seattle. pipeline_runs for that day shows 24 separate run rows, all with status=failed. The failed_step column clusters on:
- 001d (geofence filter) — crashed on input format
- 001s (bronze → silver) — crashed before promoting any rows
- 002a — never reached on the runs that died earlier
- 003c (Haiku classify) — cascaded into expensive retries, burned $52 in Anthropic spend
The cascade burned $505 across atty + tow ($439 DFS + $44 + $11 + $9 + $8 × 5 Haiku retries) before the operator gave up. No fresh silver was written.
The bottom line
Where Phase 001 Stands Today
The silver layer has not been refreshed since some date in mid-April 2026. Every law firm that opened, closed, or rebranded in atty_wa_seattle or tow_wa_seattle since then is invisible to every downstream phase. Meanwhile, ~$10/month is bleeding on orphan 001a fires that go nowhere.
The Fix
What we’ll do to make Phase 001 match the spec.
Concrete remediation. Ordered cheapest-first.
Six fixes. Do them in this order. Each is independent enough that you can stop after any of them and the pipeline is in a better state than before.
Fix 1 — Confirm whether 001a actually writes bronze rows.
Query amicus_pipeline.bronze for rows inserted on 2026-05-11 and 2026-05-14 with the known cron-fire run_ids (213f9e1b, f480597a, etc.). Two possible outcomes:
- Rows present: 001a is writing fine; step_logs.rows_written is the bug. Separate fix — tracked but not blocking Phase 001.
- Rows absent: 001a’s BQ-append is silently failing in production. Read step_001a_serp_fetch.py’s write path, find the bug, fix it.
This must be done before any other fix because every other fix assumes 001a writes correctly.
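The rows-present / rows-absent branch above is a single comparison. A hedged sketch — the SQL and the column names (run_id) are illustrative assumptions, not the confirmed bronze schema:

```python
# Illustrative diagnosis for the bronze-vs-step_logs discrepancy.
# CHECK_SQL is a sketch; run_id as a bronze column is an assumption.
CHECK_SQL = """
SELECT run_id, COUNT(*) AS bronze_rows
FROM `amicus_pipeline.bronze`
WHERE run_id IN ('213f9e1b', 'f480597a')
GROUP BY run_id
"""

def diagnose(bronze_rows: int) -> str:
    """Map the count returned by CHECK_SQL to one of the two outcomes."""
    if bronze_rows > 0:
        return "rows present: step_logs.rows_written is the bug"
    return "rows absent: 001a's BQ append is silently failing"
```

Either branch is actionable; what is not acceptable is leaving the ambiguity open while other fixes assume 001a writes.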
Fix 2 — Settle 001a’s biweekly tag.
The biweekly tag on 001a is either intentional or a leftover. Two paths:
- Path A — biweekly is intentional (e.g. for a time-series ranking-tracker that needs SERP snapshots more often than monthly): find the downstream consumer (a view, dashboard, or step that reads biweekly snapshots from bronze). If found, document the dependency and leave biweekly in place. If not found, kill the biweekly fire — it’s bleeding money for no consumer.
- Path B — biweekly is leftover: re-tag 001a as intake in cadence.py. Phase 001 then becomes a clean 9-step intake-only chain.
Either path settles the “why is 001a alone running while the rest of Phase 001 isn’t” absurdity.
Fix 3 — Add the 30-day intake cron.
Add a crontab entry on the VM that fires the full Phase 001 chain on a 30-day cadence. Concretely:
0 10 1 * * cd /mnt/workspace/amicus && /mnt/workspace/venv/bin/python pipeline/steps/orchestrator.py --cadence intake >> /mnt/workspace/amicus/000_log_files/intake_cron.log 2>&1
Fires 10:00 UTC on the 1st of every month, runs orchestrator.py with the intake tier. Note: the existing CRON_TZ=America/Los_Angeles in the user crontab is being ignored by the cron daemon on this Ubuntu — use UTC in the expression and pick a UTC hour that maps to a sensible Pacific time. Verify after 30 days that it fires.
Fix 4 — Smoke-test the full chain end-to-end.
The 2026-04-06 cascade showed 001d and 001s crashing. Subsequent fix commits exist in git (e.g. fix(001s,003a): bronze_to_silver filters by profile_id not run_id) but no end-to-end smoke test has confirmed the chain runs cleanly. Action:
- Pick the smallest profile (tow_wa_seattle, ~194 gold domains — cheaper)
- Fire orchestrator.py --cadence intake --profile tow_wa_seattle manually
- Watch each step’s exit code; for any failure, fix the underlying bug, then re-fire that step only
- Confirm the silver table has rows for that profile with the current date
Only after this smoke passes should the Fix 3 cron be allowed to fire unsupervised.
Fix 5 — Make cadence.py match the cron.
After Fixes 2 and 3 land, cadence.py needs to reflect reality. Two clean shapes:
- Option A: drop 001a’s biweekly tag and tag it as intake like the rest of Phase 001. Intake fires once per month via the new cron in Fix 3.
- Option B: introduce a new tier monthly_intake that nests intake. Tag 001a–s as monthly_intake. Update resolve_active_tiers() so the 30-day cron picks this tier up.
Either way: no tag in cadence.py should describe a schedule that no cron honors.
Fix 6 — Guard 001s against silent zero-row promotions.
Right now 001s can promote zero rows and still mark itself completed. That has happened. Action: after the bronze→silver promotion in step_001s_bronze_to_silver.py, query the silver table and verify:
- Row count for this profile_id exceeds a minimum threshold (e.g. 50)
- MAX(updated_at) is within the last 5 minutes
If either check fails, exit non-zero with a clear error message. This prevents a future silent zero-write run from being marked “completed” while leaving silver stale.
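The two checks could look like this — a sketch under the assumptions above (50-row minimum, 5-minute freshness window); the function name is hypothetical, while updated_at and profile_id come from the text:

```python
# Sketch of the post-promotion guard for 001s. Threshold and window
# are the assumed values from the text; wire-up into the real step
# (and the query that feeds row_count / max_updated_at) is left out.
from datetime import datetime, timedelta, timezone

MIN_ROWS = 50                          # minimum plausible candidate count (assumed)
MAX_STALENESS = timedelta(minutes=5)   # promotion must have just happened

def verify_silver(row_count: int, max_updated_at: datetime) -> None:
    """Fail loud if the bronze->silver promotion wrote nothing fresh."""
    if row_count < MIN_ROWS:
        raise SystemExit(
            f"silver row count {row_count} < {MIN_ROWS}: refusing to mark completed")
    if datetime.now(timezone.utc) - max_updated_at > MAX_STALENESS:
        raise SystemExit(
            f"silver MAX(updated_at)={max_updated_at} is stale: promotion wrote nothing")
```

Because the guard raises instead of logging, a zero-write run surfaces as a failed step in step_logs rather than a misleading “completed”.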
After all 6 fixes
Phase 001 fires on the 1st of every month at a known UTC hour. All 9 sub-steps run in sequence. Silver is fresh within an hour of the cron firing. step_logs and api_cost_log reflect what actually happened. Any failure exits loud and visible. The biweekly orphan stops bleeding money.
Then we move on to Phase 002.