PHASE 001

Data Acquisition

Steps 001a → 001s · Every 30 days · Output: silver layer candidate domains

The Spec

How Phase 001 is supposed to behave.

Duty: Take a market config (e.g. atty_wa_seattle), fetch raw Google Maps SERPs from DataForSEO, clean and dedupe them, and write a candidate-domain list to amicus_pipeline.silver.

Schedule: Fires every 30 days, end-to-end, all 9 sub-steps in a single chain.

End state: Fresh silver-layer candidate domains ready for Phase 002 to verify.

What Phase 001 does, plain English

Phase 001 is the data acquisition phase. It does not classify firms (that’s Phase 003), it does not verify domains resolve (that’s Phase 002), and it does not tag specialties (that’s Phase 004). All Phase 001 does is:

  1. Hit Google Maps with ~17 keywords across ~18 geographic coordinates for the target market
  2. Collect the raw SERP responses into BigQuery’s bronze table
  3. Split, dedupe, geofence, filter, and normalize that raw data through seven sequential cleaning steps (001b → 001h)
  4. Promote the cleaned result to amicus_pipeline.silver — the candidate-domain list

That silver list is the input that Phase 002 picks up. Phase 001’s job is done the moment silver is fresh.
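
For a concrete sense of step 1's volume, here is an illustrative sketch of the keyword × coordinate fan-out. The keyword and coordinate values below are made up, the real lists come from the market YAML, and the task dict keys are placeholders rather than the actual DataForSEO request fields.

```python
# Illustrative sketch of the 001a fan-out, not the real step code.
from itertools import product

keywords = ["personal injury lawyer", "car accident attorney"]  # ~17 in the real config
coordinates = [(47.6062, -122.3321), (47.6205, -122.3493)]      # ~18 grid points

tasks = [
    {"keyword": kw, "coordinate": f"{lat},{lng}"}
    for kw, (lat, lng) in product(keywords, coordinates)
]

# With the real config this is ~17 × ~18 ≈ 306 SERP calls per profile.
print(len(tasks))
```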

The 9 sub-steps, in order

| Step | What it does | Provider | Reads | Writes |
|------|--------------|----------|-------|--------|
| 001a | Google Maps SERP fetch — ~17 keywords × ~18 coordinates per profile | DataForSEO | market config (GCS) | raw JSON → disk; rows → bronze |
| 001b | Split raw SERP rows into “paid ads” vs “organic” | — | bronze | bronze |
| 001c | Dedupe Google Business listings by CID (Google’s unique business ID) | — | bronze | bronze |
| 001d | Apply geofence — drop results outside the target geographic boundary | — | bronze | bronze |
| 001e | Keep only results that have a website URL | — | bronze | bronze |
| 001f | Normalize domain strings — lowercase, strip www., strip trailing slashes | — | bronze | bronze |
| 001g | Drop directory/aggregator domains (avvo, findlaw, justia — listings of firms, not firms) | — | bronze | bronze |
| 001h | Final unique dedupe so each domain appears once | — | bronze | bronze |
| 001s | Promote cleaned bronze → silver. This is the handoff to Phase 002. | — | bronze | silver |
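
For a concrete sense of what the cleaning steps do, here is an illustrative sketch of the 001f-style normalization and 001g-style directory filter. It is not the actual step code, and the directory list shows only the three examples named above.

```python
# Illustrative only, not the actual step code.
from urllib.parse import urlparse

DIRECTORY_DOMAINS = {"avvo.com", "findlaw.com", "justia.com"}

def normalize_domain(website_url: str) -> str:
    """001f-style normalization: lowercase, strip www., strip trailing slashes."""
    url = website_url.strip().lower()
    host = urlparse(url).netloc or url
    return host.removeprefix("www.").rstrip("/")

def is_directory(domain: str) -> bool:
    """001g-style filter: drop aggregators that list firms rather than being firms."""
    return domain in DIRECTORY_DOMAINS

assert normalize_domain("https://WWW.Example-Law.com/") == "example-law.com"
assert is_directory("avvo.com")
```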

How the data moves (and where it lives)

Phase 001 reads two inputs, runs nine scripts on the VM, mutates one BigQuery table seven times, and writes one final output table:

  • Input · Config: market YAML at gs://amicus_app_configs/<market>.yaml
  • Input · Live API: DataForSEO Maps SERP, ~306 calls per profile (~$1.22 per fire across both profiles; see Cost per fire)
  • Code · runs on the VM: steps 001a → 001s (9 Python scripts, in order) under /mnt/workspace/amicus/pipeline/steps/001_data_acquisition/
  • Intermediate · BigQuery: amicus_pipeline.bronze. 001a writes raw SERP rows here; 001b → 001h filter & dedupe in place.
  • Audit trail · VM disk: 000_raw_serp_json/, raw JSON dumps written by 001a, kept for replay / debug if BQ goes weird.
  • Promotion: 001s promotes cleaned bronze → silver.
  • Output · the deliverable: amicus_pipeline.silver, the cleaned candidate-domain list (one row per unique candidate domain).
  • Hands off to: Phase 002 — Domain Verification (out of scope for this page).

Where to look — file & table reference

| Thing | Path or table |
|-------|---------------|
| The 9 scripts | /mnt/workspace/amicus/pipeline/steps/001_data_acquisition/step_001*.py |
| Market config (input) | gs://amicus_app_configs/<market_id>.yaml |
| Raw JSON dumps (audit trail) | pipeline/steps/001_data_acquisition/000_raw_serp_json/ |
| Per-step structured output | pipeline/steps/001_data_acquisition/output/<step_id>/<YYYY-MM-DD>/ |
| Per-step log files | pipeline/steps/000_log_files/step_001*_*.log |
| Bronze table (raw + intermediate) | amicus_pipeline.bronze |
| Silver table (the deliverable) | amicus_pipeline.silver |
| Run history (one row per run) | amicus_logs.pipeline_runs |
| Per-step accounting | amicus_logs.step_logs |
| Per-API-call cost ledger | amicus_logs.api_cost_log |

Cost per fire

Only 001a costs money — it’s the only step that calls a paid external API. Everything else is a BigQuery transform on data already paid for.

| Line item | Calls | Per call | Subtotal |
|-----------|-------|----------|----------|
| 001a — DFS Maps SERP for atty_wa_seattle (17 kw × 18 coords) | 306 | $0.002 | $0.61 |
| 001a — DFS Maps SERP for tow_wa_seattle (17 kw × 18 coords) | 306 | $0.002 | $0.61 |
| 001b → 001s — BigQuery transforms (no external API) | — | — | $0.00 |
| Total per 30-day fire | 612 | | ~$1.22 |
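
Arithmetic check: 17 keywords × 18 coordinates = 306 calls per profile; 306 × $0.002 = $0.612, which rounds to the $0.61 subtotal; two profiles make 612 calls ≈ $1.22 per fire.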

Schedule

Frequency · Every 30 days. Once. No more, no less.
Trigger · Cron on the 1st of each month. Not manual.
Execution mode · Atomic chain — all 9 sub-steps (001a → 001s) fire in order, in a single run.
Partial runs · Not allowed. If any sub-step fails, the whole run fails loud.
Skipping steps · Not allowed. Every sub-step must run every time.
Other cadences · None. No daily, no weekly, no biweekly fires of any Phase 001 step.
Output guarantee · Silver is fresh, or the run fails loud. Never “completed with zero rows.”

What's Fucked

Phase 001 is not running the spec. Here’s exactly how.

Finding 1 — Only 001a auto-fires. 001b through 001s never auto-fire.

pipeline/steps/cadence.py tags 001a as biweekly but every other step in Phase 001 as intake. The intake tier has runs_on: [] — meaning never auto-fires.

| Step | Cadence tag | Auto-fires? |
|------|-------------|-------------|
| 001a | biweekly | Every Mon + Thu × 2 profiles = 4 runs/week |
| 001b | intake | never |
| 001c | intake | never |
| 001d | intake | never |
| 001e | intake | never |
| 001f | intake | never |
| 001g | intake | never |
| 001h | intake | never |
| 001s | intake | never |
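
For orientation, here is the assumed shape of those tags. The real cadence.py may be structured differently; what Finding 1 establishes is only the tagging itself (001a = biweekly, everything else = intake with runs_on: []).

```python
# Assumed shape only; TIERS / STEP_CADENCE are placeholder names.
TIERS = {
    "biweekly": {"runs_on": ["mon", "thu"]},  # picked up by the 30 10 * * 1,4 cron
    "intake":   {"runs_on": []},              # empty list: no cron ever selects it
}

STEP_CADENCE = {"001a": "biweekly"}
STEP_CADENCE.update(
    {step: "intake" for step in
     ["001b", "001c", "001d", "001e", "001f", "001g", "001h", "001s"]}
)
```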

Finding 2 — There is no 30-day intake cron. None.

The spec says Phase 001 fires every 30 days. The reality: no cron entry exists that runs intake on a calendar. crontab -l on the VM shows exactly one orchestrator entry — 30 10 * * 1,4, i.e. 10:30 every Monday and Thursday — and it fires only the biweekly/weekly/monthly tiers, never intake.

Phase 001 only completes when someone manually SSHs into the VM and fires it by hand. That has happened roughly once per month, and when it happens, it usually cascades and partially fails.

Finding 3 — 001a’s biweekly fires write orphan data.

Twice a week, 001a fires ~306 DFS Maps SERP calls per profile, dumps raw JSON to 000_raw_serp_json/, and (allegedly) appends rows to the bronze table. Nothing downstream processes that data because 001b through 001s never auto-fire. The fresh SERPs accumulate as orphan data sitting in bronze.

Every Monday and Thursday for weeks, we have been paying DataForSEO to refresh data that no downstream step reads.

Finding 4 — step_logs.rows_written = 0 on every recent 001a fire.

| Date | Profile | Step | Status | rows_written |
|------|---------|------|--------|--------------|
| 2026-05-01 | atty | 001a | completed | 0 |
| 2026-05-07 | atty | 001a | completed | 0 |
| 2026-05-11 | atty | 001a | completed | 0 |
| 2026-05-14 | atty | 001a | completed | 0 |
| 2026-05-14 | tow | 001a | completed | 0 |

Either 001a’s BigQuery insert is silently failing (its docstring claims it writes bronze rows, but the accounting says zero) or the step_logs.rows_written field is broken (we already know step_logs.api_cost is broken — logs $0 even when real money was spent). One of those two is true. Not yet confirmed which.

Finding 5 — The cron didn’t exist until 2026-05-13.

stat /var/spool/cron/crontabs/localaccount reports the crontab was created on 2026-05-13 20:29 UTC. Before that date, no scheduled fires of Phase 001 existed at all — every prior fire was a manual SSH-and-run-by-hand. The 2026-05-04 attempted monthly intake had no cron to fall back on; it was someone firing the scripts in sequence by hand, and the resulting cascade is what we saw on disk across 55 step log files.

Finding 6 — When intake IS fired manually, it cascades.

On 2026-04-06, somebody tried to fire intake by hand for atty_wa_seattle + tow_wa_seattle. pipeline_runs for that day shows 24 separate run rows, all with status=failed. The failed_step column clusters on:

  • 001d (geofence filter) — crashed on input format
  • 001s (bronze → silver) — crashed before promoting any rows
  • 002a — never reached on the runs that died earlier
  • 003c (Haiku classify) — cascaded into expensive retries, burned $52 in Anthropic spend

The cascade burned $505 across atty + tow ($439 DFS + $44 + $11 + $9 + $8 × 5 Haiku retries) before the operator gave up. No fresh silver was written.

The bottom line

Where Phase 001 Stands Today

The silver layer has not been refreshed since some date in mid-April 2026. Every law firm that opened, closed, or rebranded in atty_wa_seattle or tow_wa_seattle since then is invisible to every downstream phase. Meanwhile, ~$10/month is bleeding on orphan 001a fires that go nowhere.

The Fix

What we’ll do to make Phase 001 match the spec.

Six fixes. Do them in this order. Each is independent enough that you can stop after any of them and the pipeline is in a better state than before.

FIX 1 Confirm 001a’s BigQuery write actually happens. ~5 min

Query amicus_pipeline.bronze for rows inserted on 2026-05-11 and 2026-05-14 with the known cron-fire run_ids (213f9e1b, f480597a, etc.). Two possible outcomes:

  • Rows present: 001a is writing fine; step_logs.rows_written is the bug. Separate fix — tracked but not blocking Phase 001.
  • Rows absent: 001a’s BQ-append is silently failing in production. Read step_001a_serp_fetch.py’s write path, find the bug, fix it.

This must be done before any other fix because every other fix assumes 001a writes correctly.
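
A minimal sketch of that check, assuming bronze carries a run_id column (the run_ids quoted above suggest it does) and using the google-cloud-bigquery client; adjust column names to the real schema.

```python
# Sketch of the Fix 1 check; run_id in bronze is an assumption.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT run_id, COUNT(*) AS rows_in_bronze
    FROM `amicus_pipeline.bronze`
    WHERE run_id IN UNNEST(@run_ids)
    GROUP BY run_id
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ArrayQueryParameter("run_ids", "STRING", ["213f9e1b", "f480597a"]),
    ]
)
for row in client.query(query, job_config=job_config):
    print(row.run_id, row.rows_in_bronze)

# Rows come back: 001a writes fine and step_logs.rows_written is the bug.
# Nothing comes back: 001a's BigQuery append is silently failing.
```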

FIX 2 Decide what 001a’s biweekly fire is FOR. Then act on the decision. ~15 min

The biweekly tag on 001a is either intentional or a leftover. Two paths:

  • Path A — biweekly is intentional (e.g. for a time-series ranking-tracker that needs SERP snapshots more often than monthly): find the downstream consumer (a view, dashboard, or step that reads biweekly snapshots from bronze). If found, document the dependency and leave biweekly in place. If not found, kill the biweekly fire — it’s bleeding money for no consumer.
  • Path B — biweekly is leftover: re-tag 001a as intake in cadence.py. Phase 001 then becomes a clean 9-step intake-only chain.

Either path settles the “why is 001a alone running while the rest of Phase 001 isn’t” absurdity.

FIX 3 Install the 30-day intake cron. ~30 min

Add a crontab entry on the VM that fires the full Phase 001 chain on a 30-day cadence. Concretely:

0 10 1 * * cd /mnt/workspace/amicus && /mnt/workspace/venv/bin/python pipeline/steps/orchestrator.py --cadence intake >> /mnt/workspace/amicus/000_log_files/intake_cron.log 2>&1

Fires 10:00 UTC on the 1st of every month, runs orchestrator.py with the intake tier. Note: the existing CRON_TZ=America/Los_Angeles in the user crontab is being ignored by the cron daemon on this Ubuntu — use UTC in the expression and pick a UTC hour that maps to a sensible Pacific time. Verify after 30 days that it fires.

FIX 4 Smoke-test 001b through 001s end-to-end on a small profile. ~1–2 hr

The 2026-04-06 cascade showed 001d and 001s crashing. Subsequent fix commits exist in git (e.g. fix(001s,003a): bronze_to_silver filters by profile_id not run_id) but no end-to-end smoke test has confirmed the chain runs cleanly. Action:

  • Pick the smallest profile (tow_wa_seattle with ~194 gold domains, cheaper)
  • Fire orchestrator.py --cadence intake --profile tow_wa_seattle manually
  • Watch each step’s exit code; for any failure, fix the underlying bug, re-fire that step only
  • Confirm silver table has rows for that profile with current date

Only after this smoke test passes should the Fix 3 cron be allowed to fire unsupervised.
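
A minimal wrapper for the manual fire, using only the command and paths quoted on this page; how orchestrator.py reports per-step status is not assumed here, so treat this as a sketch that simply fails loud if the chain exits non-zero.

```python
# Smoke-fire wrapper for the Fix 4 manual run.
import subprocess
import sys

cmd = [
    "/mnt/workspace/venv/bin/python",
    "pipeline/steps/orchestrator.py",
    "--cadence", "intake",
    "--profile", "tow_wa_seattle",
]
result = subprocess.run(cmd, cwd="/mnt/workspace/amicus")
if result.returncode != 0:
    sys.exit(f"intake smoke test failed with exit code {result.returncode}")
print("intake chain exited 0; now confirm silver has fresh tow_wa_seattle rows")
```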

FIX 5 Reconcile cadence.py with how Phase 001 actually fires. ~10 min

After Fixes 2 and 3 land, cadence.py needs to reflect reality. Two clean shapes:

  • Option A: drop 001a’s biweekly tag, tag it as intake like the rest of Phase 001. Intake fires once per month via the new cron in Fix 3.
  • Option B: introduce a new tier monthly_intake that nests intake. Tag 001a–s as monthly_intake. Update resolve_active_tiers() so the 30-day cron picks this tier up.

Either way: no tag in cadence.py should describe a schedule that no cron honors.

FIX 6 Add a fail-loud verification at the end of 001s. ~20 min

Right now 001s can complete with zero rows promoted and mark itself completed. That has happened. Action: after the bronze→silver promotion in step_001s_bronze_to_silver.py, query the silver table and verify:

  • Row count for this profile_id is > some minimum threshold (e.g. 50)
  • MAX(updated_at) is within the last 5 minutes

If either check fails, exit non-zero with a clear error message. This prevents a future silent zero-write run from being marked “completed” while leaving silver stale.
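
A minimal sketch of that check, assuming silver carries profile_id and updated_at columns (the two bullets above imply both); the function name and thresholds are placeholders.

```python
# Fail-loud check for the end of step_001s_bronze_to_silver.py (sketch).
import sys
from google.cloud import bigquery

MIN_ROWS = 50
MAX_AGE_MINUTES = 5

def verify_silver(client: bigquery.Client, profile_id: str) -> None:
    query = """
        SELECT
          COUNT(*) AS row_count,
          TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(updated_at), MINUTE) AS age_min
        FROM `amicus_pipeline.silver`
        WHERE profile_id = @profile_id
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("profile_id", "STRING", profile_id),
        ]
    )
    row = next(iter(client.query(query, job_config=job_config).result()))
    if row.row_count < MIN_ROWS:
        sys.exit(f"001s FAIL: only {row.row_count} silver rows for {profile_id}, "
                 f"expected at least {MIN_ROWS}")
    if row.age_min is None or row.age_min > MAX_AGE_MINUTES:
        sys.exit(f"001s FAIL: silver for {profile_id} was not refreshed in the "
                 f"last {MAX_AGE_MINUTES} minutes")
```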

After all 6 fixes

Phase 001 fires on the 1st of every month at a known UTC hour. All 9 sub-steps run in sequence. Silver is fresh within an hour of the cron firing. step_logs and api_cost_log reflect what actually happened. Any failure exits loud and visible. The biweekly orphan stops bleeding money.

Then we move on to Phase 002.