The Spec
How Phase 007 is supposed to behave.
Phase 007 — Google Business Profile. Design intent. Reference, not reality.
Duty: For every Google Business listing (CID) belonging to a confirmed firm, pull three things from Google’s GBP system: the listing metadata (007a), every Google review (007b), and every Google Post / update the firm has published (007c). Reviews and posts change weekly, so the cadence is weekly — the only weekly-tier phase in the pipeline.
Schedule: Weekly tier — designed to fire every Monday.
End state: Three BigQuery enrichment tables populated with fresh rows for each CID: enrichment_007a_gbp_info (address, phone, hours, categories, rating), enrichment_007b_gbp_reviews (every review with text/rating/author/date/owner-reply), enrichment_007c_gbp_updates (every Google Post).
What Phase 007 does, plain English
Phases 005 and 006 looked at each firm’s website. Phase 007 looks at each firm’s Google Maps listing — which is a completely different system run by Google. Reviews show up there. Posts show up there. Address changes show up there. None of that is on the website.
Phase 007 is also the only weekly-tier phase. The reason is review velocity: a single new bad review can drop a firm’s star rating noticeably. We want to catch it within the week, not a month later.
Phase 007 operates per CID, not per domain. One firm may have multiple offices = multiple CIDs. Phase 007 fetches GBP data for every CID, even if 3 of them belong to the same firm.
The 3 sub-steps
| Step | What it does | Provider | Reads | Writes |
| --- | --- | --- | --- | --- |
| 007a | GBP Info — basic listing fields: address, phone, hours, primary category, rating, review count, photo count, claimed status. | DataForSEO Business Data | gold_cids (ATTORNEY_CONFIRMED filter; all CIDs, not deduped to primary) | enrichment_007a_gbp_info |
| 007b | GBP Reviews — every Google review. Initial scan: depth=4490 (all). Rescan mode: depth=50 (newest only, 3-day threshold) for cheap weekly updates. | DataForSEO Business Data | gold_cids | enrichment_007b_gbp_reviews |
| 007c | GBP Updates — every Google Post / business update. Same initial-vs-rescan pattern as 007b. | DataForSEO Business Data | gold_cids | enrichment_007c_gbp_updates |
The async task-poll pattern (all three steps)
GBP endpoints are asynchronous at DataForSEO. You POST a list of CIDs, get back task_ids, then poll tasks_ready until each finishes, then GET the final result. Concretely:
| Aspect | Value |
| --- | --- |
| POST endpoint | /v3/business_data/google/{my_business_info,reviews,my_business_updates}/task_post |
| Ready endpoint | /tasks_ready — lists task_ids that have completed |
| GET endpoint | /task_get/{task_id} — pull the finished result |
| Rate limit pacing | 1,600 RPM (80% of the 2,000 cap) |
| Concurrency | 50 in-flight POSTs / GETs |
| Poll interval | 30 seconds |
| Max wait per task | 1 hour (3,600 sec) |
| Resume | --resume flag re-attaches to in-flight tasks via stored task_ids |
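The lifecycle above can be sketched as a generic loop. This is a simplified illustration, not the real dfs_async_client.py: the injected `post_fn` / `ready_fn` / `get_fn` callables are hypothetical stand-ins for the HTTP calls, and the 50-way concurrency and RPM pacing are omitted.

```python
import time

POLL_INTERVAL_SEC = 30      # poll interval from the spec
MAX_WAIT_SEC = 3600         # max wait per task: 1 hour

def run_task_lifecycle(post_fn, ready_fn, get_fn, cids,
                       poll_interval=POLL_INTERVAL_SEC,
                       max_wait=MAX_WAIT_SEC, sleep=time.sleep):
    """POST one task per CID, poll tasks_ready, GET finished results."""
    pending = {post_fn(cid): cid for cid in cids}   # task_id -> cid
    results, waited = {}, 0
    while pending and waited <= max_wait:
        for task_id in ready_fn():                  # completed task_ids
            if task_id in pending:
                results[pending.pop(task_id)] = get_fn(task_id)
        if pending:
            sleep(poll_interval)
            waited += poll_interval
    return results, list(pending.values())          # finished, timed out
```

Anything still pending after an hour comes back in the second return value, which is what `--resume` would pick up on the next invocation.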
The rescan optimization (007b + 007c)
Reviews and posts accumulate over time. Initial fetch grabs everything (depth=4490); subsequent weekly fetches only need the NEW ones. The --rescan flag does this:
| Mode | Depth | Threshold | What it pulls | Cost shape |
| --- | --- | --- | --- | --- |
| Initial scan | 4,490 (max) | — | Every review/update ever | $0.0015/task + $0.00075 per 10 reviews returned |
| Rescan | 50 | 3 days (matches Mon+Thu cadence) | Only reviews/updates newer than threshold | $0.0015/task + small per-10 fee (~1-2 reviews per firm) |
This is why Phase 007 can be weekly without exploding in cost — rescan mode is ~10× cheaper than initial scan.
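A toy model of the depth/threshold switch. The depth values and 3-day window come from the table above; whether the real steps filter by date client-side after the fetch or via an API parameter is not documented here, so this sketch assumes client-side filtering.

```python
from datetime import date, timedelta

INITIAL_DEPTH = 4490          # max: fetch every review/update ever
RESCAN_DEPTH = 50             # newest only, for weekly top-ups
RESCAN_THRESHOLD_DAYS = 3     # matches the Mon+Thu cadence

def scan_depth(rescan: bool) -> int:
    """Depth to request per CID for this run."""
    return RESCAN_DEPTH if rescan else INITIAL_DEPTH

def keep_review(review_date: date, today: date, rescan: bool) -> bool:
    """In rescan mode, keep only reviews newer than the threshold."""
    if not rescan:
        return True
    return (today - review_date) <= timedelta(days=RESCAN_THRESHOLD_DAYS)
```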
How the data moves
gold_cids (ATTORNEY_CONFIRMED CIDs)
        ↓
007a · GBP info
    step_007a_gbp_info.py
    Async POST → poll → GET
    address, phone, hours, rating
007b · GBP reviews
    step_007b_gbp_reviews.py
    Initial: depth=4490 (all reviews)
    Rescan: depth=50, 3-day threshold
    One row per CID, reviews nested
007c · GBP updates
    step_007c_gbp_updates.py
    Same pattern as 007b
    Google Posts & business updates
        ↓
Output · listing metadata
    enrichment_007a_gbp_info (one row per CID)
Output · reviews
    enrichment_007b_gbp_reviews (one row per CID per run, reviews nested in JSON)
Output · posts
    enrichment_007c_gbp_updates (one row per CID per run, updates nested in JSON)
        ↓
Hands off to Phase 008 — SERP Analysis (out of scope for this page)
Where to look — file & table reference
| Thing | Path or table |
| --- | --- |
| The 3 scripts | /mnt/workspace/amicus/pipeline/steps/007_gbp/step_007[abc]_*.py |
| Output tables | enrichment_007a_gbp_info · enrichment_007b_gbp_reviews · enrichment_007c_gbp_updates |
| DFS location lookup | pipeline/steps/utilities/DFS_Lookup_Tables_Lists_of_categories_and_locations/dfs_business_data_google_locations.json |
| Async client | pipeline/steps/shared/dfs_async_client.py |
| Per-step logs | pipeline/steps/000_log_files/step_007*_*.log |
| API keys | DATAFORSEO_USERNAME · DATAFORSEO_PASSWORD in .env |
Cost per fire
Anchored on 2,772 ATTORNEY_CONFIRMED CIDs (queried 2026-05-16 from enrichment_007a_gbp_info — this is the actual production fire size for the past 3 weekly runs).
| Line item | Volume | Per unit | Subtotal |
| --- | --- | --- | --- |
| 007a — GBP info ($0.0015 per task) | ~2,772 CIDs | $0.0015 | ~$4.16 |
| 007b — GBP reviews task fee ($0.0015 per task) | ~2,772 CIDs | $0.0015 | ~$4.16 |
| 007b — reviews returned at rescan depth=50 (~1-2 new reviews/firm avg, $0.00075 per 10) | ~2,772 firms × ~2 new | $0.000075/review | ~$0.42 |
| 007c — GBP updates ($0.00225 per task) | ~2,772 CIDs | $0.00225 | ~$6.24 |
| Total per weekly fire (rescan mode) | | | ~$15 |
Initial-scan cost is much higher — depth=4490 with ~50 reviews/firm avg = ~138,600 reviews fetched at $0.000075 each = ~$10 just for reviews, plus task fees. Initial-scan fire ≈ $25. Once the first run lands, every subsequent weekly fire uses rescan and stays around $15.
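The arithmetic above can be checked with a back-of-envelope cost model. Pure arithmetic on the rates quoted in this section; no API calls, and the 2 / 50 reviews-per-firm averages are the estimates from the table, not measured values.

```python
CIDS = 2772                    # ATTORNEY_CONFIRMED fire size
TASK_FEE_INFO = 0.0015         # 007a, per task
TASK_FEE_REVIEWS = 0.0015      # 007b, per task
TASK_FEE_UPDATES = 0.00225     # 007c, per task
PER_REVIEW = 0.00075 / 10      # $0.00075 per 10 reviews returned

def fire_cost(reviews_per_cid: float) -> float:
    """Total dollars for one fire, given avg reviews returned per CID."""
    task_fees = CIDS * (TASK_FEE_INFO + TASK_FEE_REVIEWS + TASK_FEE_UPDATES)
    review_fees = CIDS * reviews_per_cid * PER_REVIEW
    return task_fees + review_fees

rescan_cost = fire_cost(2)     # ~2 new reviews/firm at depth=50
initial_cost = fire_cost(50)   # ~50 reviews/firm at depth=4490
```

`fire_cost(2)` lands at roughly $15 and `fire_cost(50)` at roughly $25, matching the totals above.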
Schedule
Frequency
Weekly tier — designed to fire every Monday.
Trigger
Orchestrator cron with --cadence weekly.
Execution mode
All three steps run async-parallel. Each manages its own POST + poll + GET lifecycle.
Concurrency
50 in-flight requests, 1,600 RPM strict pacing.
Rescan threshold
3 days — matches the original Mon+Thu schedule. Cuts review/update fetch cost ~10×.
Crash recovery
--resume reads stored task_ids and re-attaches to in-flight tasks.
Output guarantee
Every ATTORNEY_CONFIRMED CID gets a row in all three tables for each run_id.
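The --resume crash-recovery behavior above amounts to persisting the task_id → CID map at POST time and reloading it on restart. A minimal sketch, assuming a JSON state file; the real storage location and format used by dfs_async_client.py are not documented here.

```python
import json
import pathlib

def save_tasks(task_map: dict, path: pathlib.Path) -> None:
    """Persist the in-flight task_id -> cid map at POST time."""
    path.write_text(json.dumps(task_map))

def load_tasks(path: pathlib.Path) -> dict:
    """On --resume, reload the map; {} means a fresh start."""
    return json.loads(path.read_text()) if path.exists() else {}
```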
What's Fucked
Phase 007 is mostly working — here’s what’s drifting.
Forensic findings, 2026-05-16. This is the first phase that’s actually firing on schedule. Findings are about edges, not wholesale failure.
Finding 1 — 007 IS firing on the cron successfully (good news, finally).
Direct BQ query of enrichment_007a_gbp_info for atty_wa_seattle as of 2026-05-16:
| Run ID | Date | CIDs |
| --- | --- | --- |
| 213f9e1b-af9c-43f6-9b20-64fb0a008664 | 2026-05-11 | 2,772 |
| dcb7cfab-8737-46f1-a722-b77176e63585 | 2026-04-28 | 2,772 |
| 610f9aa7-7b2c-4c0e-9160-aa86b19d8b2f | 2026-04-21 | 2,772 |
| backfill_20260420 | 2026-04-20 | 2,968 |
007a, 007b, 007c all show the same 4 runs with identical CID coverage. This is the first phase I’ve documented where the cron actually produces fresh data weekly. Phases 005 and 006 are stuck on their April backfills; Phase 007 has three cron-driven fires after its backfill.
Finding 2 — The 2026-05-04 Monday fire is missing.
Look at the timestamps: 2026-04-21 → 2026-04-28 → 2026-05-11. That’s 7 days → 13 days. The expected 2026-05-04 fire didn’t happen.
Possible causes (unverified):
- VM was down or rebooted during the May 4 cron window
- The orchestrator hit an error and didn’t retry
- DFS was experiencing API issues that morning
- Operator action (manual pause, maintenance, holiday week)
amicus_logs.pipeline_runs filtered to profile_id='atty_wa_seattle' and the date range 2026-05-03 to 2026-05-05 would show whether a run was attempted at all. Not yet checked.
Finding 3 — CID count drift: 2,968 backfill → 2,772 ongoing.
The 2026-04-20 backfill processed 2,968 CIDs. Every cron fire since has processed 2,772 — 196 CIDs disappeared between the backfill and the first cron fire on 2026-04-21.
Most likely explanation: gold_cids was rebuilt between those dates (e.g. by a Phase 003+004 re-run that produced fewer ATTORNEY_CONFIRMED CIDs). The cron now operates on whatever the current gold_cids table holds.
Confirmation query (not yet run): SELECT COUNT(*) FROM gold_cids WHERE profile_id='atty_wa_seattle' AND final_verdict='ATTORNEY_CONFIRMED' — today that returns 2,742, which is close to 2,772 but not identical, suggesting one more small rebuild happened recently.
Finding 4 — Reviews are stored as one BQ row per CID per run, with reviews nested in JSON.
enrichment_007b_gbp_reviews.review_rows for the latest run = 2,772 = unique_cids. That means the BQ table has one row per CID, not one row per review. The actual review text is presumably nested inside a JSON field on that row.
Implication for downstream analytics:
- Counting reviews requires unnesting the JSON field with JSON_EXTRACT_ARRAY(...) or similar.
- Comparing across runs (“new reviews this week”) requires diffing two JSON blobs, which is awkward.
- A flatter schema (one row per (cid, review_id)) would make analytics easier but would explode row count (2,772 firms × avg 50 reviews = 138K rows per run).
This is a design choice, not a bug — just call it out so a future analyst knows where to look.
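For that future analyst, the nested shape can be worked with in plain Python. Illustrative only: the actual column and field names in enrichment_007b_gbp_reviews are not confirmed; `reviews_json` and `review_id` here are assumptions about the nested layout.

```python
import json

def count_reviews(row: dict) -> int:
    """Number of reviews nested in one CID's row."""
    return len(json.loads(row.get("reviews_json") or "[]"))

def new_review_ids(prev_row: dict, curr_row: dict) -> set:
    """'New reviews this week' = diff of two runs' nested blobs."""
    prev = {r["review_id"] for r in json.loads(prev_row.get("reviews_json") or "[]")}
    curr = {r["review_id"] for r in json.loads(curr_row.get("reviews_json") or "[]")}
    return curr - prev
```

This is exactly the awkward two-blob diff called out above; a flat schema would make it a simple anti-join.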
Finding 5 — Operating on a stale gold list.
Same root cause as every other downstream phase. gold_cids hasn’t been refreshed since Phase 004 last produced rows (see Phase 004 forensic Finding 2). Firms that opened or closed since mid-April are invisible to the weekly GBP scrape. New CIDs added to a firm aren’t tracked.
Finding 6 — If a firm has 10 CIDs (multi-location), Phase 007 fetches 10 times.
Phase 007 explicitly does not dedupe by domain. It processes every CID in the gold table. A large firm with 10 offices = 10 separate GBP API calls. That’s correct behavior (each office has its own listing, reviews, posts) but it does mean cost scales with CIDs not unique firms.
For atty_wa_seattle today: 2,742 CIDs / 2,223 unique domains = 1.23 CIDs per firm on average. Not a big amplifier — but for a firm with 5+ offices in the dataset, Phase 007 is the only phase that bills 5x. (Phases 005 and 006 dedupe to one call per primary domain.)
The bottom line
Where Phase 007 Stands Today
Phase 007 is the first working phase in the pipeline. Weekly fires on 2026-04-21, 2026-04-28, and 2026-05-11 produced complete BQ rows in all three tables for ~2,772 CIDs each. The expected 2026-05-04 fire is missing — unverified cause. The bigger contextual issue is the input: it’s scraping the same stale firm list every week because gold_cids hasn’t been refreshed.
The Fix
What we’ll do to keep Phase 007 working.
Concrete remediation. Mostly small — this phase is working, just needs guardrails and edge-case handling.
Five fixes. None are urgent. Fix 1 is the only one chasing data that is actually missing.
Fix 1 — Diagnose the missing 2026-05-04 fire. The Monday 2026-05-04 fire is missing from BQ for all three 007 tables. Three checks:
- amicus_logs.pipeline_runs for profile_id='atty_wa_seattle', date 2026-05-03 to 2026-05-05 — was a run attempted?
- VM’s journalctl for cron at that time — did the cron job even start?
- amicus_logs.api_cost_log for DFS business_data spend on 2026-05-04 — did API calls happen but BQ writes fail?
This is investigation, not code. The right next move depends on the answer. If it was a one-off (DFS outage, VM reboot), no further action. If a systemic gap exists, prioritize Fix 2.
Fix 2 — Alert on missed weekly fires. If a weekly fire is missed (no enrichment_007a_gbp_info rows for the past 8 days for a profile that previously had weekly data), send an alert. Concretely:
- A daily cron that runs at 12:00 UTC checks: for each active profile, when was the last 007a fire?
- If the last fire is > 8 days old (one-week grace), log to pipeline_alerts and ping the operator
- If > 14 days, escalate
Reuses the same alert path that Phase 005 / 006 should use once those fixes land.
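The classification logic for that daily check is small enough to sketch in full. The thresholds come from the bullets above; the function and status names are placeholders, and the query that produces `last_fire` (MAX date from enrichment_007a_gbp_info per profile) is assumed, not shown.

```python
from datetime import date

ALERT_AFTER_DAYS = 8       # one-week grace
ESCALATE_AFTER_DAYS = 14

def fire_status(last_fire: date, today: date) -> str:
    """Classify a profile's last 007a fire for the daily alert cron."""
    age = (today - last_fire).days
    if age > ESCALATE_AFTER_DAYS:
        return "escalate"
    if age > ALERT_AFTER_DAYS:
        return "alert"
    return "ok"
```

Against the run history above, the 2026-05-11 fire would read "ok" on 2026-05-16, but a profile stuck on 2026-04-28 would already be in "escalate".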
Fix 3 — Confirm 007 fires Mondays only. Today’s cron 30 10 * * 1,4 fires on Mondays AND Thursdays. If 007 is on the weekly tier (Mondays only), it should NOT fire on Thursdays. The empirical data suggests this is already working — we see ~1 fire/week, not ~2. But it’s worth confirming once the cron split lands.
If Phase 005 Fix 1 splits the cron into explicit weekly / biweekly / monthly entries, 007 unambiguously fires Mondays only.
Fix 4 — Refuse to run on a stale gold list. Today 007 happily processes a 4-week-old gold_cids. Add a startup check: if MAX(ingestion_timestamp) FROM gold_cids WHERE profile_id = $P is > 35 days old, log a loud warning. If > 60 days, refuse to run.
Prevents the silent “we’ve been scraping the same dead firms for 6 months” failure mode.
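The startup guard described above, as a sketch. The 35/60-day thresholds are from the fix text; the function name is a placeholder, and the query feeding it the age of gold_cids is assumed.

```python
WARN_AFTER_DAYS = 35
REFUSE_AFTER_DAYS = 60

def check_gold_freshness(gold_age_days: int) -> str:
    """Warn on a stale gold_cids, refuse to run on a dead one."""
    if gold_age_days > REFUSE_AFTER_DAYS:
        raise RuntimeError(
            f"gold_cids is {gold_age_days} days old; refusing to run")
    if gold_age_days > WARN_AFTER_DAYS:
        return "warn"   # caller logs the loud warning
    return "ok"
```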
Fix 5 — Flatten reviews into a silver table. Today enrichment_007b_gbp_reviews has 2,772 rows for 2,772 CIDs — one row per CID, reviews nested. Most analytics that operate on review text would benefit from one row per review. Tradeoffs:
- Keep nested: cheaper storage (~$1/yr), harder to query, easier append semantics.
- Flatten: 2,772 firms × ~50 reviews = 138K rows per run × 52 weeks ≈ 7M rows/yr. Storage cost ~$3/yr. Analytics-friendly. Requires schema for review_id, author, rating, text, date, owner_reply, response_date.
Recommended: flatten. A separate silver_007b_reviews table built from the bronze nested JSON each run. Keep the bronze nested table for replay; the silver table is what analytics queries.
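The bronze-to-silver transform is a straightforward unnest. A sketch under the same assumptions as Finding 4: the `cid` / `reviews_json` column names and the nested field names are guesses at the real schema, and the seven silver fields are the ones proposed in the tradeoff bullet.

```python
import json

SILVER_FIELDS = ["review_id", "author", "rating", "text",
                 "date", "owner_reply", "response_date"]

def flatten_reviews(bronze_row: dict) -> list:
    """One silver row per (cid, review_id) from one bronze nested row."""
    cid = bronze_row["cid"]
    silver_rows = []
    for review in json.loads(bronze_row.get("reviews_json") or "[]"):
        row = {"cid": cid}
        # missing fields come through as None rather than raising
        row.update({f: review.get(f) for f in SILVER_FIELDS})
        silver_rows.append(row)
    return silver_rows
```

Run over all 2,772 bronze rows each week, this produces the ~138K silver rows per run estimated above.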
After all 5 fixes
Phase 007 keeps firing every Monday, only on Mondays. Missed fires get alerted within 24 hours. A stale gold list refuses to run instead of silently scraping dead firms. Reviews land in a flat analytics-friendly silver table alongside the nested bronze. This phase becomes the model for what the other phases should look like once their cron and BQ-write gotchas are fixed.
Then we move on to Phase 008 (SERP Analysis — the first phase that uses BOTH monthly AND weekly tiers).