PHASE 001

Data Acquisition

Steps 001a → 001s · Every 30 days · Output: silver layer candidate domains

The Spec

How Phase 001 is supposed to behave.

Duty: Take a market config (e.g. atty_wa_seattle), fetch raw Google Maps SERPs from DataForSEO, clean and dedupe them, and write a candidate-domain list to amicus_pipeline.silver.

Schedule: Fires every 30 days, end-to-end, all 9 sub-steps in a single chain.

End state: Fresh silver-layer candidate domains ready for Phase 002 to verify.

What Phase 001 does, plain English

Phase 001 is the data acquisition phase. It does not classify firms (that’s Phase 003), it does not verify domains resolve (that’s Phase 002), and it does not tag specialties (that’s Phase 004). All Phase 001 does is:

  1. Hit Google Maps with ~17 keywords across ~18 geographic coordinates for the target market
  2. Collect the raw SERP responses into BigQuery’s bronze table
  3. Split, dedupe, geofence, filter, and normalize that raw data through seven sequential cleaning steps (001b → 001h)
  4. Promote the cleaned result to amicus_pipeline.silver — the candidate-domain list

That silver list is the input that Phase 002 picks up. Phase 001’s job is done the moment silver is fresh.
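
For a concrete sense of step 1's volume, here is an illustrative sketch of the keyword × coordinate fan-out. The keyword and coordinate values below are made up, the real lists come from the market YAML, and the task dict keys are placeholders rather than the actual DataForSEO request fields.

```python
# Illustrative sketch of the 001a fan-out, not the real step code.
from itertools import product

keywords = ["personal injury lawyer", "car accident attorney"]  # ~17 in the real config
coordinates = [(47.6062, -122.3321), (47.6205, -122.3493)]      # ~18 grid points

tasks = [
    {"keyword": kw, "coordinate": f"{lat},{lng}"}
    for kw, (lat, lng) in product(keywords, coordinates)
]

# With the real config this is ~17 × ~18 ≈ 306 SERP calls per profile.
print(len(tasks))
```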

The 9 sub-steps, in order

| Step | What it does | Provider | Reads | Writes |
|------|--------------|----------|-------|--------|
| 001a | Google Maps SERP fetch — ~17 keywords × ~18 coordinates per profile | DataForSEO | market config (GCS) | raw JSON → disk; rows → bronze |
| 001b | Split raw SERP rows into “paid ads” vs “organic” | — | bronze | bronze |
| 001c | Dedupe Google Business listings by CID (Google’s unique business ID) | — | bronze | bronze |
| 001d | Apply geofence — drop results outside the target geographic boundary | — | bronze | bronze |
| 001e | Keep only results that have a website URL | — | bronze | bronze |
| 001f | Normalize domain strings — lowercase, strip www., strip trailing slashes | — | bronze | bronze |
| 001g | Drop directory/aggregator domains (avvo, findlaw, justia — listings of firms, not firms) | — | bronze | bronze |
| 001h | Final unique dedupe so each domain appears once | — | bronze | bronze |
| 001s | Promote cleaned bronze → silver. This is the handoff to Phase 002. | — | bronze | silver |
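
For a concrete sense of what the cleaning steps do, here is an illustrative sketch of the 001f-style normalization and 001g-style directory filter. It is not the actual step code, and the directory list shows only the three examples named above.

```python
# Illustrative only, not the actual step code.
from urllib.parse import urlparse

DIRECTORY_DOMAINS = {"avvo.com", "findlaw.com", "justia.com"}

def normalize_domain(website_url: str) -> str:
    """001f-style normalization: lowercase, strip www., strip trailing slashes."""
    url = website_url.strip().lower()
    host = urlparse(url).netloc or url
    return host.removeprefix("www.").rstrip("/")

def is_directory(domain: str) -> bool:
    """001g-style filter: drop aggregators that list firms rather than being firms."""
    return domain in DIRECTORY_DOMAINS

assert normalize_domain("https://WWW.Example-Law.com/") == "example-law.com"
assert is_directory("avvo.com")
```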

How the data moves (and where it lives)

Phase 001 reads two inputs, runs nine scripts on the VM, mutates one BigQuery table seven times, and writes one final output table:

  • Input · Config: market YAML at gs://amicus_app_configs/<market>.yaml
  • Input · Live API: DataForSEO Maps SERP, ~306 calls per profile (~$1.22 per fire across both profiles; see Cost per fire)
  • Code · runs on the VM: steps 001a → 001s (9 Python scripts, in order) under /mnt/workspace/amicus/pipeline/steps/001_data_acquisition/
  • Intermediate · BigQuery: amicus_pipeline.bronze. 001a writes raw SERP rows here; 001b → 001h filter & dedupe in place.
  • Audit trail · VM disk: 000_raw_serp_json/, raw JSON dumps written by 001a, kept for replay / debug if BQ goes weird.
  • Promotion: 001s promotes cleaned bronze → silver.
  • Output · the deliverable: amicus_pipeline.silver, the cleaned candidate-domain list (one row per unique candidate domain).
  • Hands off to: Phase 002 — Domain Verification (out of scope for this page).

Where to look — file & table reference

| Thing | Path or table |
|-------|---------------|
| The 9 scripts | /mnt/workspace/amicus/pipeline/steps/001_data_acquisition/step_001*.py |
| Market config (input) | gs://amicus_app_configs/<market_id>.yaml |
| Raw JSON dumps (audit trail) | pipeline/steps/001_data_acquisition/000_raw_serp_json/ |
| Per-step structured output | pipeline/steps/001_data_acquisition/output/<step_id>/<YYYY-MM-DD>/ |
| Per-step log files | pipeline/steps/000_log_files/step_001*_*.log |
| Bronze table (raw + intermediate) | amicus_pipeline.bronze |
| Silver table (the deliverable) | amicus_pipeline.silver |
| Run history (one row per run) | amicus_logs.pipeline_runs |
| Per-step accounting | amicus_logs.step_logs |
| Per-API-call cost ledger | amicus_logs.api_cost_log |

Cost per fire

Only 001a costs money — it’s the only step that calls a paid external API. Everything else is a BigQuery transform on data already paid for.

| Line item | Calls | Per call | Subtotal |
|-----------|-------|----------|----------|
| 001a — DFS Maps SERP for atty_wa_seattle (17 kw × 18 coords) | 306 | $0.002 | $0.61 |
| 001a — DFS Maps SERP for tow_wa_seattle (17 kw × 18 coords) | 306 | $0.002 | $0.61 |
| 001b → 001s — BigQuery transforms (no external API) | — | — | $0.00 |
| Total per 30-day fire | 612 | | ~$1.22 |
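
Arithmetic check: 17 keywords × 18 coordinates = 306 calls per profile; 306 × $0.002 = $0.612, which rounds to the $0.61 subtotal; two profiles make 612 calls ≈ $1.22 per fire.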

Schedule

Frequency · Every 30 days. Once. No more, no less.
Trigger · Cron on the 1st of each month. Not manual.
Execution mode · Atomic chain — all 9 sub-steps (001a → 001s) fire in order, in a single run.
Partial runs · Not allowed. If any sub-step fails, the whole run fails loud.
Skipping steps · Not allowed. Every sub-step must run every time.
Other cadences · None. No daily, no weekly, no biweekly fires of any Phase 001 step.
Output guarantee · Silver is fresh, or the run fails loud. Never “completed with zero rows.”

What's Fucked

Phase 001 is not running the spec. Here’s exactly how.

Finding 1 — Only 001a auto-fires. 001b through 001s never auto-fire.

pipeline/steps/cadence.py tags 001a as biweekly but every other step in Phase 001 as intake. The intake tier has runs_on: [] — meaning never auto-fires.

| Step | Cadence tag | Auto-fires? |
|------|-------------|-------------|
| 001a | biweekly | Every Mon + Thu × 2 profiles = 4 runs/week |
| 001b | intake | never |
| 001c | intake | never |
| 001d | intake | never |
| 001e | intake | never |
| 001f | intake | never |
| 001g | intake | never |
| 001h | intake | never |
| 001s | intake | never |
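
For orientation, here is the assumed shape of those tags. The real cadence.py may be structured differently; what Finding 1 establishes is only the tagging itself (001a = biweekly, everything else = intake with runs_on: []).

```python
# Assumed shape only; TIERS / STEP_CADENCE are placeholder names.
TIERS = {
    "biweekly": {"runs_on": ["mon", "thu"]},  # picked up by the 30 10 * * 1,4 cron
    "intake":   {"runs_on": []},              # empty list: no cron ever selects it
}

STEP_CADENCE = {"001a": "biweekly"}
STEP_CADENCE.update(
    {step: "intake" for step in
     ["001b", "001c", "001d", "001e", "001f", "001g", "001h", "001s"]}
)
```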

Finding 2 — There is no 30-day intake cron. None.

The spec says Phase 001 fires every 30 days. The reality: no cron entry exists that runs intake on a calendar. crontab -l on the VM shows exactly one orchestrator entry — 30 10 * * 1,4, i.e. 10:30 every Monday and Thursday — and it fires only the biweekly/weekly/monthly tiers, never intake.

Phase 001 only completes when someone manually SSHs into the VM and fires it by hand. That has happened roughly once per month, and when it happens, it usually cascades and partially fails.

Finding 3 — 001a’s biweekly fires write orphan data.

Twice a week, 001a fires ~306 DFS Maps SERP calls per profile, dumps raw JSON to 000_raw_serp_json/, and (allegedly) appends rows to the bronze table. Nothing downstream processes that data because 001b through 001s never auto-fire. The fresh SERPs accumulate as orphan data sitting in bronze.

Every Monday and Thursday for weeks, we have been paying DataForSEO to refresh data that no downstream step reads.

Finding 4 — step_logs.rows_written = 0 on every recent 001a fire.

| Date | Profile | Step | Status | rows_written |
|------|---------|------|--------|--------------|
| 2026-05-01 | atty | 001a | completed | 0 |
| 2026-05-07 | atty | 001a | completed | 0 |
| 2026-05-11 | atty | 001a | completed | 0 |
| 2026-05-14 | atty | 001a | completed | 0 |
| 2026-05-14 | tow | 001a | completed | 0 |

Either 001a’s BigQuery insert is silently failing (its docstring claims it writes bronze rows, but the accounting says zero) or the step_logs.rows_written field is broken (we already know step_logs.api_cost is broken — logs $0 even when real money was spent). One of those two is true. Not yet confirmed which.

Finding 5 — The cron didn’t exist until 2026-05-13.

stat /var/spool/cron/crontabs/localaccount reports the crontab was created on 2026-05-13 20:29 UTC. Before that date, no scheduled fires of Phase 001 existed at all — every prior fire was a manual SSH-and-run-by-hand. The 2026-05-04 attempted monthly intake had no cron to fall back on; it was someone firing the scripts in sequence by hand, and the resulting cascade is what we saw on disk across 55 step log files.

Finding 6 — When intake IS fired manually, it cascades.

On 2026-04-06, somebody tried to fire intake by hand for atty_wa_seattle + tow_wa_seattle. pipeline_runs for that day shows 24 separate run rows, all with status=failed. The failed_step column clusters on:

  • 001d (geofence filter) — crashed on input format
  • 001s (bronze → silver) — crashed before promoting any rows
  • 002a — never reached on the runs that died earlier
  • 003c (Haiku classify) — cascaded into expensive retries, burned $52 in Anthropic spend

The cascade burned $505 across atty + tow ($439 DFS + $44 + $11 + $9 + $8 × 5 Haiku retries) before the operator gave up. No fresh silver was written.

The bottom line

Where Phase 001 Stands Today

The silver layer has not been refreshed since some date in mid-April 2026. Every law firm that opened, closed, or rebranded in atty_wa_seattle or tow_wa_seattle since then is invisible to every downstream phase. Meanwhile, ~$10/month is bleeding on orphan 001a fires that go nowhere.

The Fix

What we’ll do to make Phase 001 match the spec.

Six fixes. Do them in this order. Each is independent enough that you can stop after any of them and the pipeline is in a better state than before.

FIX 1 Confirm 001a’s BigQuery write actually happens. ~5 min

Query amicus_pipeline.bronze for rows inserted on 2026-05-11 and 2026-05-14 with the known cron-fire run_ids (213f9e1b, f480597a, etc.). Two possible outcomes:

  • Rows present: 001a is writing fine; step_logs.rows_written is the bug. Separate fix — tracked but not blocking Phase 001.
  • Rows absent: 001a’s BQ-append is silently failing in production. Read step_001a_serp_fetch.py’s write path, find the bug, fix it.

This must be done before any other fix because every other fix assumes 001a writes correctly.
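
A minimal sketch of that check, assuming bronze carries a run_id column (the run_ids quoted above suggest it does) and using the google-cloud-bigquery client; adjust column names to the real schema.

```python
# Sketch of the Fix 1 check; run_id in bronze is an assumption.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT run_id, COUNT(*) AS rows_in_bronze
    FROM `amicus_pipeline.bronze`
    WHERE run_id IN UNNEST(@run_ids)
    GROUP BY run_id
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ArrayQueryParameter("run_ids", "STRING", ["213f9e1b", "f480597a"]),
    ]
)
for row in client.query(query, job_config=job_config):
    print(row.run_id, row.rows_in_bronze)

# Rows come back: 001a writes fine and step_logs.rows_written is the bug.
# Nothing comes back: 001a's BigQuery append is silently failing.
```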

FIX 2 Decide what 001a’s biweekly fire is FOR. Then act on the decision. ~15 min

The biweekly tag on 001a is either intentional or a leftover. Two paths:

  • Path A — biweekly is intentional (e.g. for a time-series ranking-tracker that needs SERP snapshots more often than monthly): find the downstream consumer (a view, dashboard, or step that reads biweekly snapshots from bronze). If found, document the dependency and leave biweekly in place. If not found, kill the biweekly fire — it’s bleeding money for no consumer.
  • Path B — biweekly is leftover: re-tag 001a as intake in cadence.py. Phase 001 then becomes a clean 9-step intake-only chain.

Either path settles the “why is 001a alone running while the rest of Phase 001 isn’t” absurdity.

FIX 3 Install the 30-day intake cron. ~30 min

Add a crontab entry on the VM that fires the full Phase 001 chain on a 30-day cadence. Concretely:

0 10 1 * * cd /mnt/workspace/amicus && /mnt/workspace/venv/bin/python pipeline/steps/orchestrator.py --cadence intake >> /mnt/workspace/amicus/000_log_files/intake_cron.log 2>&1

Fires 10:00 UTC on the 1st of every month, runs orchestrator.py with the intake tier. Note: the existing CRON_TZ=America/Los_Angeles in the user crontab is being ignored by the cron daemon on this Ubuntu — use UTC in the expression and pick a UTC hour that maps to a sensible Pacific time. Verify after 30 days that it fires.

FIX 4 Smoke-test 001b through 001s end-to-end on a small profile. ~1–2 hr

The 2026-04-06 cascade showed 001d and 001s crashing. Subsequent fix commits exist in git (e.g. fix(001s,003a): bronze_to_silver filters by profile_id not run_id) but no end-to-end smoke test has confirmed the chain runs cleanly. Action:

  • Pick the smallest profile (tow_wa_seattle with ~194 gold domains, cheaper)
  • Fire orchestrator.py --cadence intake --profile tow_wa_seattle manually
  • Watch each step’s exit code; for any failure, fix the underlying bug, re-fire that step only
  • Confirm silver table has rows for that profile with current date

Only after this smoke test passes should the Fix 3 cron be allowed to fire unsupervised.
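
A minimal wrapper for the manual fire, using only the command and paths quoted on this page; how orchestrator.py reports per-step status is not assumed here, so treat this as a sketch that simply fails loud if the chain exits non-zero.

```python
# Smoke-fire wrapper for the Fix 4 manual run.
import subprocess
import sys

cmd = [
    "/mnt/workspace/venv/bin/python",
    "pipeline/steps/orchestrator.py",
    "--cadence", "intake",
    "--profile", "tow_wa_seattle",
]
result = subprocess.run(cmd, cwd="/mnt/workspace/amicus")
if result.returncode != 0:
    sys.exit(f"intake smoke test failed with exit code {result.returncode}")
print("intake chain exited 0; now confirm silver has fresh tow_wa_seattle rows")
```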

FIX 5 Reconcile cadence.py with how Phase 001 actually fires. ~10 min

After Fixes 2 and 3 land, cadence.py needs to reflect reality. Two clean shapes:

  • Option A: drop 001a’s biweekly tag, tag it as intake like the rest of Phase 001. Intake fires once per month via the new cron in Fix 3.
  • Option B: introduce a new tier monthly_intake that nests intake. Tag 001a–s as monthly_intake. Update resolve_active_tiers() so the 30-day cron picks this tier up.

Either way: no tag in cadence.py should describe a schedule that no cron honors.

FIX 6 Add a fail-loud verification at the end of 001s. ~20 min

Right now 001s can complete with zero rows promoted and mark itself completed. That has happened. Action: after the bronze→silver promotion in step_001s_bronze_to_silver.py, query the silver table and verify:

  • Row count for this profile_id is > some minimum threshold (e.g. 50)
  • MAX(updated_at) is within the last 5 minutes

If either check fails, exit non-zero with a clear error message. This prevents a future silent zero-write run from being marked “completed” while leaving silver stale.
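
A minimal sketch of that check, assuming silver carries profile_id and updated_at columns (the two bullets above imply both); the function name and thresholds are placeholders.

```python
# Fail-loud check for the end of step_001s_bronze_to_silver.py (sketch).
import sys
from google.cloud import bigquery

MIN_ROWS = 50
MAX_AGE_MINUTES = 5

def verify_silver(client: bigquery.Client, profile_id: str) -> None:
    query = """
        SELECT
          COUNT(*) AS row_count,
          TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(updated_at), MINUTE) AS age_min
        FROM `amicus_pipeline.silver`
        WHERE profile_id = @profile_id
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("profile_id", "STRING", profile_id),
        ]
    )
    row = next(iter(client.query(query, job_config=job_config).result()))
    if row.row_count < MIN_ROWS:
        sys.exit(f"001s FAIL: only {row.row_count} silver rows for {profile_id}, "
                 f"expected at least {MIN_ROWS}")
    if row.age_min is None or row.age_min > MAX_AGE_MINUTES:
        sys.exit(f"001s FAIL: silver for {profile_id} was not refreshed in the "
                 f"last {MAX_AGE_MINUTES} minutes")
```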

After all 6 fixes

Phase 001 fires on the 1st of every month at a known UTC hour. All 9 sub-steps run in sequence. Silver is fresh within an hour of the cron firing. step_logs and api_cost_log reflect what actually happened. Any failure exits loud and visible. The biweekly orphan stops bleeding money.

Then we move on to Phase 002.