The Spec
How Phase 007 is supposed to behave.
Phase 007 — Google Business Profile. Design intent. Reference, not reality.
Duty: For every Google Business listing (CID) belonging to a confirmed firm, pull three things from Google’s GBP system: the listing metadata (007a), every Google review (007b), and every Google Post / update the firm has published (007c). Reviews and posts change weekly, so the cadence is weekly — the only weekly-tier phase in the pipeline.
Schedule: Weekly tier — designed to fire every Monday.
End state: Three BigQuery enrichment tables populated with fresh rows for each CID: enrichment_007a_gbp_info (address, phone, hours, categories, rating), enrichment_007b_gbp_reviews (every review with text/rating/author/date/owner-reply), enrichment_007c_gbp_updates (every Google Post).
What Phase 007 does, plain English
Phases 005 and 006 looked at each firm’s website. Phase 007 looks at each firm’s Google Maps listing — which is a completely different system run by Google. Reviews show up there. Posts show up there. Address changes show up there. None of that is on the website.
Phase 007 is also the only weekly-tier phase. The reason is review velocity: a single new bad review can drop a firm’s star rating noticeably. We want to catch it within the week, not a month later.
Phase 007 operates per CID, not per domain. One firm may have multiple offices = multiple CIDs. Phase 007 fetches GBP data for every CID, even if 3 of them belong to the same firm.
The 3 sub-steps
| Step | What it does | Provider | Reads | Writes |
| --- | --- | --- | --- | --- |
| 007a | GBP Info — basic listing fields: address, phone, hours, primary category, rating, review count, photo count, claimed status. | DataForSEO Business Data | gold_cids (ATTORNEY_CONFIRMED filter; all CIDs, not deduped to primary) | enrichment_007a_gbp_info |
| 007b | GBP Reviews — every Google review. Initial scan: depth=4490 (all). Rescan mode: depth=50 (newest only, 3-day threshold) for cheap weekly updates. | DataForSEO Business Data | gold_cids | enrichment_007b_gbp_reviews |
| 007c | GBP Updates — every Google Post / business update. Same initial-vs-rescan pattern as 007b. | DataForSEO Business Data | gold_cids | enrichment_007c_gbp_updates |
The async task-poll pattern (all three steps)
GBP endpoints are asynchronous at DataForSEO. You POST a list of CIDs, get back task_ids, then poll tasks_ready until each finishes, then GET the final result. Concretely:
| Aspect | Value |
| --- | --- |
| POST endpoint | /v3/business_data/google/{my_business_info,reviews,my_business_updates}/task_post |
| Ready endpoint | /tasks_ready — lists task_ids that have completed |
| GET endpoint | /task_get/{task_id} — pull the finished result |
| Rate limit pacing | 1,600 RPM (80% of the 2,000 cap) |
| Concurrency | 50 in-flight POSTs / GETs |
| Poll interval | 30 seconds |
| Max wait per task | 1 hour (3,600 sec) |
| Resume | --resume flag re-attaches to in-flight tasks via stored task_ids |
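The lifecycle above can be sketched as a generic loop. This is a simplified illustration, not the real dfs_async_client.py: the injected `post_fn` / `ready_fn` / `get_fn` callables are hypothetical stand-ins for the HTTP calls, and the 50-way concurrency and RPM pacing are omitted.

```python
import time

POLL_INTERVAL_SEC = 30      # poll interval from the spec
MAX_WAIT_SEC = 3600         # max wait per task: 1 hour

def run_task_lifecycle(post_fn, ready_fn, get_fn, cids,
                       poll_interval=POLL_INTERVAL_SEC,
                       max_wait=MAX_WAIT_SEC, sleep=time.sleep):
    """POST one task per CID, poll tasks_ready, GET finished results."""
    pending = {post_fn(cid): cid for cid in cids}   # task_id -> cid
    results, waited = {}, 0
    while pending and waited <= max_wait:
        for task_id in ready_fn():                  # completed task_ids
            if task_id in pending:
                results[pending.pop(task_id)] = get_fn(task_id)
        if pending:
            sleep(poll_interval)
            waited += poll_interval
    return results, list(pending.values())          # finished, timed out
```

Anything still pending after an hour comes back in the second return value, which is what `--resume` would pick up on the next invocation.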
The rescan optimization (007b + 007c)
Reviews and posts accumulate over time. Initial fetch grabs everything (depth=4490); subsequent weekly fetches only need the NEW ones. The --rescan flag does this:
| Mode | Depth | Threshold | What it pulls | Cost shape |
| --- | --- | --- | --- | --- |
| Initial scan | 4,490 (max) | — | Every review/update ever | $0.0015/task + $0.00075 per 10 reviews returned |
| Rescan | 50 | 3 days (matches Mon+Thu cadence) | Only reviews/updates newer than threshold | $0.0015/task + small per-10 fee (~1-2 reviews per firm) |
This is why Phase 007 can be weekly without exploding in cost — rescan mode is ~10× cheaper than initial scan.
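A toy model of the depth/threshold switch. The depth values and 3-day window come from the table above; whether the real steps filter by date client-side after the fetch or via an API parameter is not documented here, so this sketch assumes client-side filtering.

```python
from datetime import date, timedelta

INITIAL_DEPTH = 4490          # max: fetch every review/update ever
RESCAN_DEPTH = 50             # newest only, for weekly top-ups
RESCAN_THRESHOLD_DAYS = 3     # matches the Mon+Thu cadence

def scan_depth(rescan: bool) -> int:
    """Depth to request per CID for this run."""
    return RESCAN_DEPTH if rescan else INITIAL_DEPTH

def keep_review(review_date: date, today: date, rescan: bool) -> bool:
    """In rescan mode, keep only reviews newer than the threshold."""
    if not rescan:
        return True
    return (today - review_date) <= timedelta(days=RESCAN_THRESHOLD_DAYS)
```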
How the data moves
gold_cids (ATTORNEY_CONFIRMED CIDs)
        ↓
007a · GBP info
    step_007a_gbp_info.py
    Async POST → poll → GET
    address, phone, hours, rating
007b · GBP reviews
    step_007b_gbp_reviews.py
    Initial: depth=4490 (all reviews)
    Rescan: depth=50, 3-day threshold
    One row per CID, reviews nested
007c · GBP updates
    step_007c_gbp_updates.py
    Same pattern as 007b
    Google Posts & business updates
        ↓
Output · listing metadata
    enrichment_007a_gbp_info (one row per CID)
Output · reviews
    enrichment_007b_gbp_reviews (one row per CID per run, reviews nested in JSON)
Output · posts
    enrichment_007c_gbp_updates (one row per CID per run, updates nested in JSON)
        ↓
Hands off to Phase 008 — SERP Analysis (out of scope for this page)
Where to look — file & table reference
| Thing | Path or table |
| --- | --- |
| The 3 scripts | /mnt/workspace/amicus/pipeline/steps/007_gbp/step_007[abc]_*.py |
| Output tables | enrichment_007a_gbp_info · enrichment_007b_gbp_reviews · enrichment_007c_gbp_updates |
| DFS location lookup | pipeline/steps/utilities/DFS_Lookup_Tables_Lists_of_categories_and_locations/dfs_business_data_google_locations.json |
| Async client | pipeline/steps/shared/dfs_async_client.py |
| Per-step logs | pipeline/steps/000_log_files/step_007*_*.log |
| API keys | DATAFORSEO_USERNAME · DATAFORSEO_PASSWORD in .env |
Cost per fire
Anchored on 2,772 ATTORNEY_CONFIRMED CIDs (queried 2026-05-16 from enrichment_007a_gbp_info — this is the actual production fire size for the past 3 weekly runs).
| Line item | Volume | Per unit | Subtotal |
| --- | --- | --- | --- |
| 007a — GBP info ($0.0015 per task) | ~2,772 CIDs | $0.0015 | ~$4.16 |
| 007b — GBP reviews task fee ($0.0015 per task) | ~2,772 CIDs | $0.0015 | ~$4.16 |
| 007b — reviews returned at rescan depth=50 (~1-2 new reviews/firm avg, $0.00075 per 10) | ~2,772 firms × ~2 new | $0.000075/review | ~$0.42 |
| 007c — GBP updates ($0.00225 per task) | ~2,772 CIDs | $0.00225 | ~$6.24 |
| Total per weekly fire (rescan mode) | | | ~$15 |
Initial-scan cost is much higher — depth=4490 with ~50 reviews/firm avg = ~138,600 reviews fetched at $0.000075 each = ~$10 just for reviews, plus task fees. Initial-scan fire ≈ $25. Once the first run lands, every subsequent weekly fire uses rescan and stays around $15.
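The arithmetic above can be checked with a back-of-envelope cost model. Pure arithmetic on the rates quoted in this section; no API calls, and the 2 / 50 reviews-per-firm averages are the estimates from the table, not measured values.

```python
CIDS = 2772                    # ATTORNEY_CONFIRMED fire size
TASK_FEE_INFO = 0.0015         # 007a, per task
TASK_FEE_REVIEWS = 0.0015      # 007b, per task
TASK_FEE_UPDATES = 0.00225     # 007c, per task
PER_REVIEW = 0.00075 / 10      # $0.00075 per 10 reviews returned

def fire_cost(reviews_per_cid: float) -> float:
    """Total dollars for one fire, given avg reviews returned per CID."""
    task_fees = CIDS * (TASK_FEE_INFO + TASK_FEE_REVIEWS + TASK_FEE_UPDATES)
    review_fees = CIDS * reviews_per_cid * PER_REVIEW
    return task_fees + review_fees

rescan_cost = fire_cost(2)     # ~2 new reviews/firm at depth=50
initial_cost = fire_cost(50)   # ~50 reviews/firm at depth=4490
```

`fire_cost(2)` lands at roughly $15 and `fire_cost(50)` at roughly $25, matching the totals above.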
Schedule
Frequency
Weekly tier — designed to fire every Monday.
Trigger
Orchestrator cron with --cadence weekly.
Execution mode
All three steps run async-parallel. Each manages its own POST + poll + GET lifecycle.
Concurrency
50 in-flight requests, 1,600 RPM strict pacing.
Rescan threshold
3 days — matches the original Mon+Thu schedule. Cuts review/update fetch cost ~10×.
Crash recovery
--resume reads stored task_ids and re-attaches to in-flight tasks.
Output guarantee
Every ATTORNEY_CONFIRMED CID gets a row in all three tables for each run_id.
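The --resume crash-recovery behavior above amounts to persisting the task_id → CID map at POST time and reloading it on restart. A minimal sketch, assuming a JSON state file; the real storage location and format used by dfs_async_client.py are not documented here.

```python
import json
import pathlib

def save_tasks(task_map: dict, path: pathlib.Path) -> None:
    """Persist the in-flight task_id -> cid map at POST time."""
    path.write_text(json.dumps(task_map))

def load_tasks(path: pathlib.Path) -> dict:
    """On --resume, reload the map; {} means a fresh start."""
    return json.loads(path.read_text()) if path.exists() else {}
```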
What's Fucked
Phase 007 is mostly working — here’s what’s drifting.
Forensic findings, 2026-05-16. This is the first phase that’s actually firing on schedule. Findings are about edges, not wholesale failure.
Finding 1 — 007 IS firing on the cron successfully (good news, finally).
Direct BQ query of enrichment_007a_gbp_info for atty_wa_seattle as of 2026-05-16:
| Run ID | Date | CIDs |
| --- | --- | --- |
| 213f9e1b-af9c-43f6-9b20-64fb0a008664 | 2026-05-11 | 2,772 |
| dcb7cfab-8737-46f1-a722-b77176e63585 | 2026-04-28 | 2,772 |
| 610f9aa7-7b2c-4c0e-9160-aa86b19d8b2f | 2026-04-21 | 2,772 |
| backfill_20260420 | 2026-04-20 | 2,968 |
007a, 007b, 007c all show the same 4 runs with identical CID coverage. This is the first phase I’ve documented where the cron actually produces fresh data weekly. Phases 005 and 006 are stuck on their April backfills; Phase 007 has three cron-driven fires after its backfill.
Finding 2 — The 2026-05-04 Monday fire is missing.
Look at the timestamps: 2026-04-21 → 2026-04-28 → 2026-05-11. That’s 7 days → 13 days. The expected 2026-05-04 fire didn’t happen.
Possible causes (unverified):
- VM was down or rebooted during the May 4 cron window
- The orchestrator hit an error and didn’t retry
- DFS was experiencing API issues that morning
- Operator action (manual pause, maintenance, holiday week)
amicus_logs.pipeline_runs filtered to profile_id='atty_wa_seattle' and the date range 2026-05-03 to 2026-05-05 would show whether a run was attempted at all. Not yet checked.
Finding 3 — CID count drift: 2,968 backfill → 2,772 ongoing.
The 2026-04-20 backfill processed 2,968 CIDs. Every cron fire since has processed 2,772 — 196 CIDs disappeared between the backfill and the first cron fire on 2026-04-21.
Most likely explanation: gold_cids was rebuilt between those dates (e.g. by a Phase 003+004 re-run that produced fewer ATTORNEY_CONFIRMED CIDs). The cron now operates on whatever the current gold_cids table holds.
Confirmation query (not yet run): SELECT COUNT(*) FROM gold_cids WHERE profile_id='atty_wa_seattle' AND final_verdict='ATTORNEY_CONFIRMED' — today that returns 2,742, which is close to 2,772 but not identical, suggesting one more small rebuild happened recently.
Finding 4 — Reviews are stored as one BQ row per CID per run, with reviews nested in JSON.
enrichment_007b_gbp_reviews.review_rows for the latest run = 2,772 = unique_cids. That means the BQ table has one row per CID, not one row per review. The actual review text is presumably nested inside a JSON field on that row.
Implication for downstream analytics:
- Counting reviews requires unnesting the JSON field with JSON_EXTRACT_ARRAY(...) or similar.
- Comparing across runs (“new reviews this week”) requires diffing two JSON blobs, which is awkward.
- A flatter schema (one row per (cid, review_id)) would make analytics easier but would explode row count (2,772 firms × avg 50 reviews = 138K rows per run).
This is a design choice, not a bug — just call it out so a future analyst knows where to look.
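For that future analyst, the nested shape can be worked with in plain Python. Illustrative only: the actual column and field names in enrichment_007b_gbp_reviews are not confirmed; `reviews_json` and `review_id` here are assumptions about the nested layout.

```python
import json

def count_reviews(row: dict) -> int:
    """Number of reviews nested in one CID's row."""
    return len(json.loads(row.get("reviews_json") or "[]"))

def new_review_ids(prev_row: dict, curr_row: dict) -> set:
    """'New reviews this week' = diff of two runs' nested blobs."""
    prev = {r["review_id"] for r in json.loads(prev_row.get("reviews_json") or "[]")}
    curr = {r["review_id"] for r in json.loads(curr_row.get("reviews_json") or "[]")}
    return curr - prev
```

This is exactly the awkward two-blob diff called out above; a flat schema would make it a simple anti-join.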
Finding 5 — Operating on a stale gold list.
Same root cause as every other downstream phase. gold_cids hasn’t been refreshed since Phase 004 last produced rows (see Phase 004 forensic Finding 2). Firms that opened or closed since mid-April are invisible to the weekly GBP scrape. New CIDs added to a firm aren’t tracked.
Finding 6 — If a firm has 10 CIDs (multi-location), Phase 007 fetches 10 times.
Phase 007 explicitly does not dedupe by domain. It processes every CID in the gold table. A large firm with 10 offices = 10 separate GBP API calls. That’s correct behavior (each office has its own listing, reviews, posts) but it does mean cost scales with CIDs not unique firms.
For atty_wa_seattle today: 2,742 CIDs / 2,223 unique domains = 1.23 CIDs per firm on average. Not a big amplifier — but for a firm with 5+ offices in the dataset, Phase 007 is the only phase that bills 5x. (Phases 005 and 006 dedupe to one call per primary domain.)
The bottom line
Where Phase 007 Stands Today
Phase 007 is the first working phase in the pipeline. Weekly fires on 2026-04-21, 2026-04-28, and 2026-05-11 produced complete BQ rows in all three tables for ~2,772 CIDs each. The expected 2026-05-04 fire is missing — unverified cause. The bigger contextual issue is the input: it’s scraping the same stale firm list every week because gold_cids hasn’t been refreshed.
The Fix
What we’ll do to keep Phase 007 working.
Concrete remediation. Mostly small — this phase is working, just needs guardrails and edge-case handling.
Five fixes. None are urgent. Fix 1 is the only one chasing data that is actually missing.
Fix 1 — Diagnose the missing 2026-05-04 fire. The Monday 2026-05-04 fire is missing from BQ for all three 007 tables. Three checks:
- amicus_logs.pipeline_runs for profile_id='atty_wa_seattle', date 2026-05-03 to 2026-05-05 — was a run attempted?
- VM’s journalctl for cron at that time — did the cron job even start?
- amicus_logs.api_cost_log for DFS business_data spend on 2026-05-04 — did API calls happen but BQ writes fail?
This is investigation, not code. The right next move depends on the answer. If it was a one-off (DFS outage, VM reboot), no further action. If a systemic gap exists, prioritize Fix 2.
Fix 2 — Alert on missed weekly fires. If a weekly fire is missed (no enrichment_007a_gbp_info rows for the past 8 days for a profile that previously had weekly data), send an alert. Concretely:
- A daily cron that runs at 12:00 UTC checks: for each active profile, when was the last 007a fire?
- If the last fire is > 8 days old (one-week grace), log to pipeline_alerts and ping the operator
- If > 14 days, escalate
Reuses the same alert path that Phase 005 / 006 should use once those fixes land.
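The classification logic for that daily check is small enough to sketch in full. The thresholds come from the bullets above; the function and status names are placeholders, and the query that produces `last_fire` (MAX date from enrichment_007a_gbp_info per profile) is assumed, not shown.

```python
from datetime import date

ALERT_AFTER_DAYS = 8       # one-week grace
ESCALATE_AFTER_DAYS = 14

def fire_status(last_fire: date, today: date) -> str:
    """Classify a profile's last 007a fire for the daily alert cron."""
    age = (today - last_fire).days
    if age > ESCALATE_AFTER_DAYS:
        return "escalate"
    if age > ALERT_AFTER_DAYS:
        return "alert"
    return "ok"
```

Against the run history above, the 2026-05-11 fire would read "ok" on 2026-05-16, but a profile stuck on 2026-04-28 would already be in "escalate".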
Fix 3 — Confirm 007 fires Mondays only. Today’s cron 30 10 * * 1,4 fires on Mondays AND Thursdays. If 007 is on the weekly tier (Mondays only), it should NOT fire on Thursdays. The empirical data suggests this is already working — we see ~1 fire/week, not ~2. But it’s worth confirming once the cron split lands.
If Phase 005 Fix 1 splits the cron into explicit weekly / biweekly / monthly entries, 007 unambiguously fires Mondays only.
Fix 4 — Refuse to run on a stale gold list. Today 007 happily processes a 4-week-old gold_cids. Add a startup check: if MAX(ingestion_timestamp) FROM gold_cids WHERE profile_id = $P is > 35 days old, log a loud warning. If > 60 days, refuse to run.
Prevents the silent “we’ve been scraping the same dead firms for 6 months” failure mode.
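The startup guard described above, as a sketch. The 35/60-day thresholds are from the fix text; the function name is a placeholder, and the query feeding it the age of gold_cids is assumed.

```python
WARN_AFTER_DAYS = 35
REFUSE_AFTER_DAYS = 60

def check_gold_freshness(gold_age_days: int) -> str:
    """Warn on a stale gold_cids, refuse to run on a dead one."""
    if gold_age_days > REFUSE_AFTER_DAYS:
        raise RuntimeError(
            f"gold_cids is {gold_age_days} days old; refusing to run")
    if gold_age_days > WARN_AFTER_DAYS:
        return "warn"   # caller logs the loud warning
    return "ok"
```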
Fix 5 — Flatten reviews into a silver table. Today enrichment_007b_gbp_reviews has 2,772 rows for 2,772 CIDs — one row per CID, reviews nested. Most analytics that operate on review text would benefit from one row per review. Tradeoffs:
- Keep nested: cheaper storage (~$1/yr), harder to query, easier append semantics.
- Flatten: 2,772 firms × ~50 reviews = 138K rows per run × 52 weeks ≈ 7M rows/yr. Storage cost ~$3/yr. Analytics-friendly. Requires schema for review_id, author, rating, text, date, owner_reply, response_date.
Recommended: flatten. A separate silver_007b_reviews table built from the bronze nested JSON each run. Keep the bronze nested table for replay; the silver table is what analytics queries.
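The bronze-to-silver transform is a straightforward unnest. A sketch under the same assumptions as Finding 4: the `cid` / `reviews_json` column names and the nested field names are guesses at the real schema, and the seven silver fields are the ones proposed in the tradeoff bullet.

```python
import json

SILVER_FIELDS = ["review_id", "author", "rating", "text",
                 "date", "owner_reply", "response_date"]

def flatten_reviews(bronze_row: dict) -> list:
    """One silver row per (cid, review_id) from one bronze nested row."""
    cid = bronze_row["cid"]
    silver_rows = []
    for review in json.loads(bronze_row.get("reviews_json") or "[]"):
        row = {"cid": cid}
        # missing fields come through as None rather than raising
        row.update({f: review.get(f) for f in SILVER_FIELDS})
        silver_rows.append(row)
    return silver_rows
```

Run over all 2,772 bronze rows each week, this produces the ~138K silver rows per run estimated above.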
After all 5 fixes
Phase 007 keeps firing every Monday, only on Mondays. Missed fires get alerted within 24 hours. A stale gold list refuses to run instead of silently scraping dead firms. Reviews land in a flat analytics-friendly silver table alongside the nested bronze. This phase becomes the model for what the other phases should look like once their cron and BQ-write gotchas are fixed.
Then we move on to Phase 008 (SERP Analysis — the first phase that uses BOTH monthly AND weekly tiers).