pipeline.amicusdata.dev — Phase 010: Keywords

PHASE 010

Keywords

Steps 010a → 010i · Mixed monthly + weekly + biweekly · Output: keyword universe per firm + per market

The Spec

How Phase 010 is supposed to behave.

Phase 010 — Keywords. Design intent. Reference, not reality.

Duty: Map out every firm’s keyword universe. What do they currently rank for (010a)? What organic traffic does that send (010b)? What’s their overall ranking footprint (010c)? Then characterize the keyword market itself: how hard is each keyword to rank for (010d), what related keywords matter (010e/f/g), what topical categories does each domain own (010h), and what’s the headline metric for each tracked keyword (010i).

Schedule: Mixed across all three enrichment tiers. Monthly: 010a, 010c, 010h (deep, per-domain). Weekly: 010b (bulk traffic). Biweekly: 010d, 010e, 010f, 010g, 010i (per-keyword Labs endpoints).

End state: Nine BigQuery enrichment tables — the most fragmented phase in the pipeline. Each table answers one slice of “where does this firm sit in the keyword economy.”

What Phase 010 does, plain English

Phase 005 / 006 / 007 / 008 / 009 all measured a firm’s infrastructure: their website, their domain age, their reviews, their backlinks, who outranks them in a SERP. Phase 010 measures the keyword economy the firm operates in.

The 9 sub-steps split into three buckets by what they take as input:

Per-domain (010a, 010c, 010h): one DFS Labs call per firm domain. Expensive at scale. Monthly cadence.
Per-keyword (010d, 010e, 010f, 010g, 010i): input is the unique search keyword set (~20 across all attorney specialties). 010e/f make one DFS Labs call per keyword; 010d/g/i fit the whole set into one chunked call. Cheap. Biweekly cadence.
Bulk-per-domain (010b): one DFS Labs call per chunk of 1,000 domains. Cheap. Weekly cadence.

The 9 sub-steps

Step	What it does	Tier	Request unit	Endpoint
`010a`	Ranked keywords — every keyword the firm currently ranks for, with position, search volume, traffic estimate, CPC.	monthly	per domain	`/dataforseo_labs/google/ranked_keywords/live`
`010b`	Bulk traffic estimation — total organic traffic per domain. One number per firm.	weekly	chunked, 1,000/POST	`/dataforseo_labs/google/bulk_traffic_estimation/live`
`010c`	Domain rank overview — high-level organic visibility / keyword footprint per firm.	monthly	per domain	`/dataforseo_labs/google/domain_rank_overview/live`
`010d`	Bulk keyword difficulty — how hard is each tracked keyword to rank for.	biweekly	chunked, 1,000/POST	`/dataforseo_labs/google/bulk_keyword_difficulty/live`
`010e`	Related keywords — semantically related terms per tracked keyword.	biweekly	per keyword	`/dataforseo_labs/google/related_keywords/live`
`010f`	Keyword suggestions — autocomplete-style expansions per tracked keyword.	biweekly	per keyword	`/dataforseo_labs/google/keyword_suggestions/live`
`010g`	Keyword ideas — long-tail keyword ideas adjacent to each tracked term.	biweekly	chunked, 200/POST	`/dataforseo_labs/google/keyword_ideas/live`
`010h`	Categories for domain — what topical categories does each firm cover (auto law, family law, etc.).	monthly	per domain	`/dataforseo_labs/google/categories_for_domain/live`
`010i`	Keyword overview — headline metrics (search volume, CPC, competition) per tracked keyword.	biweekly	chunked, 700/POST	`/dataforseo_labs/google/keyword_overview/live`
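The table above can be carried in code as a small registry, which makes each sub-step's tier and chunking queryable. This is a minimal sketch, not the pipeline's actual layout: the names `Step`, `STEPS`, and `steps_for_tier` are hypothetical, while the endpoints, tiers, and chunk sizes are copied from the table.

```python
from dataclasses import dataclass
from typing import List, Optional

LABS_PREFIX = "/dataforseo_labs/google"


@dataclass(frozen=True)
class Step:
    step_id: str
    endpoint: str               # endpoint name; the full path is built in .path
    tier: str                   # "monthly", "weekly", or "biweekly"
    unit: str                   # "domain" or "keyword"
    chunk_size: Optional[int]   # None means one POST per input item

    @property
    def path(self) -> str:
        return f"{LABS_PREFIX}/{self.endpoint}/live"


# Values copied from the sub-step table; the structure itself is an assumption.
STEPS: List[Step] = [
    Step("010a", "ranked_keywords", "monthly", "domain", None),
    Step("010b", "bulk_traffic_estimation", "weekly", "domain", 1000),
    Step("010c", "domain_rank_overview", "monthly", "domain", None),
    Step("010d", "bulk_keyword_difficulty", "biweekly", "keyword", 1000),
    Step("010e", "related_keywords", "biweekly", "keyword", None),
    Step("010f", "keyword_suggestions", "biweekly", "keyword", None),
    Step("010g", "keyword_ideas", "biweekly", "keyword", 200),
    Step("010h", "categories_for_domain", "monthly", "domain", None),
    Step("010i", "keyword_overview", "biweekly", "keyword", 700),
]


def steps_for_tier(tier: str) -> List[str]:
    """Return the step ids scheduled on the given tier, in table order."""
    return [s.step_id for s in STEPS if s.tier == tier]
```

For example, `steps_for_tier("monthly")` returns the three per-domain steps, which are also the only ones with `unit == "domain"` and no chunking.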

Per-domain vs per-keyword — the cost split

The same cost shape applies to every Labs endpoint: $0.01 per POST + $0.0001 per item returned. The number of POSTs depends on whether the endpoint is per-domain or per-keyword and whether it supports chunking.

Endpoint shape	Steps	POSTs for atty_wa_seattle
Per-domain (1 call each)	010a, 010c, 010h	~2,245 POSTs × $0.01 = ~$22.45 base, each
Per-keyword (1 call each)	010e, 010f	~20 POSTs × $0.01 = $0.20 base, each
Chunked per-domain (1,000/POST)	010b	3 POSTs × $0.01 = $0.03 base
Chunked per-keyword (varies)	010d (1,000) · 010g (200) · 010i (700)	1 POST each for 20 keywords = $0.01 base each

This is why monthly cadence is the right slot for 010a + 010c + 010h: they’re the only 3 sub-steps whose POST count grows one-for-one with firm count. 010b also takes domains as input but needs only one POST per 1,000 of them, and the remaining 5 scale with keyword count (~20), which is tiny.
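The POST counts in the table above follow from one rule: an unchunked endpoint needs one POST per input, a chunked endpoint needs one POST per chunk, rounded up. A short sketch of that arithmetic, where `post_count` and `base_cost` are illustrative names and only the $0.01 per-POST rate and the input sizes come from this page:

```python
import math
from typing import Optional

COST_PER_REQUEST = 0.01  # USD per POST, as listed in the cost constants


def post_count(n_inputs: int, chunk_size: Optional[int] = None) -> int:
    """POSTs needed: one per input when unchunked, else one per chunk."""
    if chunk_size is None:
        return n_inputs
    return math.ceil(n_inputs / chunk_size)


def base_cost(n_inputs: int, chunk_size: Optional[int] = None) -> float:
    """Base cost in USD, excluding the per-item charge."""
    return round(post_count(n_inputs, chunk_size) * COST_PER_REQUEST, 2)


FIRMS, KEYWORDS = 2245, 20

per_domain = (post_count(FIRMS), base_cost(FIRMS))                 # 010a, 010c, 010h
bulk_domain = (post_count(FIRMS, 1000), base_cost(FIRMS, 1000))    # 010b
per_keyword = (post_count(KEYWORDS), base_cost(KEYWORDS))          # 010e, 010f
chunked_kw = (post_count(KEYWORDS, 200), base_cost(KEYWORDS, 200)) # 010g
```

With 2,245 firms the per-domain steps need 2,245 POSTs each while 010b needs 3, which is the whole case for putting only 010a, 010c, and 010h on the monthly tier.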

How the data moves

Input · Phase 004 deliverable

gold_domains

For per-domain steps:
~2,245 unique V_CONFIRMED firms

Input · search keyword set

distinct search_keyword

~20 unique keywords from Phase 004’s specialty map

↓

Monthly — per-domain

010a · 010c · 010h

ranked_keywords / domain_rank_overview / categories_for_domain
2,245 POSTs each

Weekly — bulk-per-domain

010b

bulk_traffic_estimation
3 POSTs total (chunks of 1,000)

Biweekly — per-keyword

010d · 010e · 010f · 010g · 010i

5 endpoints × ~20 keywords
1 POST per fire for chunked 010d/g/i, ~20 POSTs per fire for 010e/f

↓

Output · 9 enrichment tables

enrichment_010{a..i}_*

Per-domain tables: 2,245 rows each
Per-keyword tables: 20 rows each (~100 for 010g)

Where to look — file & table reference

Thing	Path or table
The 9 scripts	`/mnt/workspace/amicus/pipeline/steps/010_keywords/step_010[a-i]_*.py`
BQ output tables	`enrichment_010a_ranked_keywords` · `enrichment_010b_bulk_traffic` · `enrichment_010c_domain_rank_overview` · `enrichment_010d_bulk_keyword_difficulty` · `enrichment_010e_related_keywords` · `enrichment_010f_keyword_suggestions` · `enrichment_010g_keyword_ideas` · `enrichment_010h_categories_for_domain` · `enrichment_010i_keyword_overview`
Per-step logs	`pipeline/steps/000_log_files/step_010_.log`
Cost constants	`COST_PER_REQUEST = 0.01` + `COST_PER_ITEM = 0.0001` on every 010 script

Cost per fire

Three fire types, each with different math.

Line item	Volume	Per unit	Subtotal
010a — ranked_keywords ($0.01 + ~$0.0001 per ranked keyword)	~2,245 firms × ~100 kw avg	~$0.02	~$45
010c — domain_rank_overview ($0.01 base + small per-item)	~2,245 firms	~$0.011	~$25
010h — categories_for_domain ($0.01 base + small per-item)	~2,245 firms	~$0.011	~$25
Monthly fire (010a + 010c + 010h)			~$95
010b — bulk_traffic (3 POSTs × $0.01 + per-row)	~3 POSTs / 2,245 rows		~$0.25
Weekly fire (010b)			~$0.25
010d, 010e, 010f, 010g, 010i (5 steps × ~20 keywords each)	~43 POSTs base (20 each for 010e/f, 1 each for 010d/g/i)		~$0.60
Biweekly fire (010d-g, 010i)			~$0.60

Anchored on 2,245 V_CONFIRMED firms (queried 2026-05-16 from enrichment_010b_bulk_traffic — the actual production fire size for the past 3 weekly runs). 20 unique keywords for biweekly tier comes from the 20 attorney specialties Phase 004 produces.
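The monthly and weekly subtotals can be reproduced from the two cost constants. The sketch below is a back-of-envelope check, not the scripts' own accounting: `labs_cost` is an illustrative name, and the figure of 10 items per firm for 010c and 010h is an assumption chosen only because it reproduces the ~$0.011 per-unit cost in the table.

```python
COST_PER_REQUEST = 0.01   # USD per POST
COST_PER_ITEM = 0.0001    # USD per item returned


def labs_cost(posts: int, items: int) -> float:
    """Total Labs cost in USD for a number of POSTs and returned items."""
    return posts * COST_PER_REQUEST + items * COST_PER_ITEM


FIRMS = 2245

# 010a: one POST per firm, ~100 ranked keywords per firm (the table's average).
cost_010a = labs_cost(FIRMS, FIRMS * 100)

# 010c / 010h: one POST per firm plus a "small per-item" charge.
# ASSUMPTION: 10 items per firm, which yields ~$0.011 per firm.
cost_010c = labs_cost(FIRMS, FIRMS * 10)
cost_010h = labs_cost(FIRMS, FIRMS * 10)

monthly_fire = cost_010a + cost_010c + cost_010h

# 010b: 3 chunked POSTs, one returned row per firm.
weekly_fire = labs_cost(3, FIRMS)

print(f"monthly ~ ${monthly_fire:.2f}, weekly ~ ${weekly_fire:.2f}")
```

Under these assumptions the monthly fire comes to a little over $94; the table rounds each line item up before summing, which gives ~$95.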

Schedule

Monthly (010a + 010c + 010h)

Fires once per 30 days. The expensive per-domain Labs cuts.

Weekly (010b)

Fires every Monday. Bulk traffic estimate — one number per firm.

Biweekly (010d, e, f, g, i)

Fires Mon + Thu. Per-keyword Labs endpoints (~20 keywords) — tiny payload.

Execution mode

All 9 steps are synchronous — dfs_request_with_retry per call, no async pool.
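The real `dfs_request_with_retry` helper is not shown on this page. Purely as an illustration of the synchronous, one-call-at-a-time pattern described here, a retry wrapper might look like the sketch below. The name `call_with_retry`, its parameters, and the exponential backoff policy are all assumptions, not the pipeline's implementation.

```python
import time
from typing import Any, Callable


def call_with_retry(
    send: Callable[[Any], Any],
    payload: Any,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call send(payload) synchronously, retrying on any exception.

    Waits base_delay, then 2x, 4x, ... between attempts (hypothetical policy).
    Raises RuntimeError once max_attempts have all failed.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return send(payload)
        except Exception as exc:  # a real helper would likely catch narrower errors
            last_error = exc
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

Because each call blocks until it returns or exhausts its retries, a per-domain step over ~2,245 firms runs its POSTs strictly one after another, which is acceptable on a monthly cadence.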

Chunk sizes

010b/d: 1,000 · 010g: 200 · 010i: 700 · 010a/c/h: 1 (per domain).
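To make the chunk sizes concrete, here is a minimal sketch of splitting an input list into per-POST chunks. The helper name `chunked` and the placeholder domain names are illustrative; only the chunk sizes and the 2,245-domain input size come from this page.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Placeholder inputs standing in for the gold domain list and keyword set.
domains = [f"firm{i}.example" for i in range(2245)]
keywords = [f"keyword {i}" for i in range(20)]

chunks_010b = [len(c) for c in chunked(domains, 1000)]   # three POSTs
chunks_010g = [len(c) for c in chunked(keywords, 200)]   # one POST
chunks_010i = [len(c) for c in chunked(keywords, 700)]   # one POST
```

For 010b this produces two full chunks and a final chunk of 245, matching the 3 POSTs per weekly fire.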

Output guarantee

Per-domain steps: one row per input domain. Per-keyword steps: one row per input keyword (~20), except 010g, which writes ~5 rows per keyword (~100).

What's Fucked

Phase 010 is split clean down the middle: weekly+biweekly work, monthly doesn’t.

Forensic findings, 2026-05-16. Anchored on direct BQ queries of all 9 enrichment tables.

Finding 1 — All three monthly steps are broken or stalled.

Step	Tier	Last fire	Status
`010a_ranked_keywords`	monthly	2026-04-07	1 run ever (`atty_production_001`). 39 days stale.
`010c_domain_rank_overview`	monthly	never	ZERO ROWS. Never run.
`010h_categories_for_domain`	monthly	2026-04-14	2 runs (Apr 7 + Apr 14). 32 days stale.

Same pattern as Phase 005a, 006a, 006b, 008a, 008b, 009a, 009b: the monthly tier has not fired since mid-April. 010c is the worst case — never run at all, same situation as 008b. Combined cost gap: ~$95/month of expected spend that’s either not happening, or happening but silently no-op’ing on the BQ write.
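The staleness figures in the table are plain date differences against the query date. A small sketch that reproduces them, with `days_stale` as an illustrative helper name and the dates taken from the table:

```python
from datetime import date
from typing import Dict, Optional


def days_stale(last_fire: Optional[date], as_of: date) -> Optional[int]:
    """Days since the last fire, or None if the step has never fired."""
    if last_fire is None:
        return None
    return (as_of - last_fire).days


AS_OF = date(2026, 5, 16)  # date of the forensic queries

LAST_FIRE: Dict[str, Optional[date]] = {
    "010a": date(2026, 4, 7),
    "010c": None,              # zero rows, never run
    "010h": date(2026, 4, 14),
}

staleness = {step: days_stale(d, AS_OF) for step, d in LAST_FIRE.items()}
print(staleness)
```

Keeping "never fired" as `None` rather than a large number avoids folding 010c into the same bucket as steps that are merely late.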

Finding 2 — 010b (weekly) is firing correctly, alongside Phase 007.

Same 3 weekly run_ids as Phase 007:

Run ID (short)	Date	Domains
`5ab4a945`	2026-04-21	2,245
`dcb7cfab`	2026-04-28	2,245
`213f9e1b`	2026-05-11	2,245

Consistent coverage. Missing 2026-05-04 fire (same gap as Phase 007). Same drift pattern as the rest of the weekly tier.
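A skipped weekly fire shows up as a gap of more than 7 days between consecutive run dates. The sketch below flags such gaps from the three dates in the table; `cadence_gaps` is an illustrative name and the 7-day threshold is an assumption based on the weekly cadence.

```python
from datetime import date
from typing import List, Sequence, Tuple


def cadence_gaps(
    fire_dates: Sequence[date], max_gap_days: int = 7
) -> List[Tuple[date, date, int]]:
    """Return (previous, next, gap_in_days) for every gap above the threshold."""
    ordered = sorted(fire_dates)
    return [
        (prev, nxt, (nxt - prev).days)
        for prev, nxt in zip(ordered, ordered[1:])
        if (nxt - prev).days > max_gap_days
    ]


weekly_fires = [date(2026, 4, 21), date(2026, 4, 28), date(2026, 5, 11)]
gaps = cadence_gaps(weekly_fires)
```

Here the only flagged gap is the 13 days between the 2026-04-28 and 2026-05-11 runs, which is where the missing early-May fire sits.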

Finding 3 — All five biweekly steps (010d, e, f, g, i) are firing reliably.

Each of the 5 biweekly tables shows 5 consecutive fires over a 16-day span (2026-04-28 to 2026-05-14):

Date	Row counts per table
2026-04-28	010d:20 · 010e:20 · 010f:20 · 010g:100 · 010i:20
2026-05-01	010d:20 · 010e:20 · 010f:20 · 010g:100 · 010i:20
2026-05-07	010d:20 · 010e:20 · 010f:20 · 010g:100 · 010i:20
2026-05-11	010d:20 · 010e:20 · 010f:20 · 010g:100 · 010i:20
2026-05-14	010d:20 · 010e:20 · 010f:20 · 010g:100 · 010i:20

That’s 25 successful BQ writes in a row across 5 tables × 5 runs. Same pattern as Phase 009 biweekly — identical run_ids, identical cadence. Like 009c-h, this is the model of how the pipeline should behave.

Finding 4 — The 20 / 100 row counts are per-keyword, not per-domain.

When we queried COUNT(DISTINCT domain) on the biweekly tables, every one returned 0. COUNT(DISTINCT ...) ignores NULLs, so the domain column is present but NULL in every row (a missing column would have raised a query error instead of returning 0). The 20 rows are keyed by keyword instead. 010g_keyword_ideas returns 100 rows because each keyword is expanded into ~5 ideas.

Implication: joining 010d-g/i output back to per-domain tables requires going through the search_keyword field, not domain. Easy join but worth knowing.
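As a sketch of that join, the function below attaches per-keyword metrics to per-domain rows through `search_keyword`. Everything except the `domain` and `search_keyword` field names is hypothetical: the `keyword_difficulty` column, the sample values, and the assumption that the per-domain side carries a `search_keyword` for each firm are for illustration only.

```python
from typing import Dict, List


def join_on_keyword(domain_rows: List[Dict], keyword_rows: List[Dict]) -> List[Dict]:
    """Inner-join per-domain rows to per-keyword rows on search_keyword."""
    by_keyword = {row["search_keyword"]: row for row in keyword_rows}
    joined = []
    for row in domain_rows:
        match = by_keyword.get(row["search_keyword"])
        if match is not None:
            joined.append({**row, **match})
    return joined


# Illustrative rows only; not real pipeline data.
keyword_rows = [
    {"search_keyword": "personal injury lawyer", "keyword_difficulty": 60},
    {"search_keyword": "family lawyer", "keyword_difficulty": 45},
]
domain_rows = [
    {"domain": "firm-a.example", "search_keyword": "family lawyer"},
    {"domain": "firm-b.example", "search_keyword": "personal injury lawyer"},
    {"domain": "firm-c.example", "search_keyword": "tax lawyer"},
]

joined = join_on_keyword(domain_rows, keyword_rows)
```

Rows whose keyword has no per-keyword match (the third firm here) drop out, as in a SQL inner join; a left join would keep them with empty metrics.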

Finding 5 — 010a baseline used a different run_id than other monthly runs.

The single 010a fire used atty_production_001 (the April 7 baseline). 010h’s second fire on 2026-04-14 used e6e2a8b5-119b-43eb-a3c6-d5ac61256505. These don’t share run_ids with each other, with 009a-b’s backfill_20260420, or with the weekly/biweekly cron run_ids.

That suggests the monthly tier was being fired manually with ad-hoc run_ids during the April attempts, not by a clean cron-triggered “run_id per market per cycle” scheme. Once the cron split lands (Phase 005 Fix 1), every monthly fire should use a single coherent run_id.

Finding 6 — 010h has the only run-over-run row count change in monthly tier.

010h on 2026-04-07 wrote 1,667 rows. 010h on 2026-04-14 wrote 2,245 rows. Either:

The April 14 rerun used a fuller gold list (Phase 003+004 added ~578 V_CONFIRMED firms between Apr 7 and Apr 14), OR
The April 7 run had partial coverage and the April 14 run backfilled the rest

2,245 matches the consistent count seen in 009c-h biweekly fires from 2026-04-28 onwards, so the gold list size has been stable since then.

The bottom line

Where Phase 010 Stands Today

The cheap stuff works perfectly. 010b weekly fires alongside Phase 007 (3 successful runs). 010d, e, f, g, i biweekly fire alongside Phase 009 (5 successful runs each, 25 BQ writes in a row). The expensive monthly stuff hasn’t fired in 32-39 days, and 010c has never fired at all. Combined ~$95/month of expected monthly spend is not landing. Same root cause family as Phase 005 / 006 / 008b / 009a: monthly tier on the broken Mon+Thu cron either isn’t triggering or is silently no-op’ing the BQ write.

The Fix

What we’ll do to make Phase 010 match the spec.

Concrete remediation. 010c is the urgent one; everything else inherits Phase 005’s cron split.

Six fixes. 010c is the only sub-step in Phase 010 with zero rows ever, same priority level as 008b.

FIX 1 Get 010c to write its first row. ~30 min diagnosis + ~30 min fix

Same diagnostic playbook as Phase 008 Fix 1. SSH to VM, run step_010c_domain_rank_overview.py manually with PIPELINE_RUN_ID=manual_test_2026_05_16 set, watch the BQ write step.

Three possibilities, same as 008b:

Silent skip on missing PIPELINE_RUN_ID — apply Phase 002’s fail-loud fix
Schema mismatch — reconcile against bq/schemas/enrichment.py
Never invoked by orchestrator — check cadence.py tags
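For the first possibility, a fail-loud guard means the script refuses to start without a run ID instead of skipping the write quietly. Phase 002's actual fix is not shown on this page, so the following is only a sketch of the idea; `require_run_id` and `MissingRunIdError` are hypothetical names.

```python
import os
from typing import Mapping, Optional


class MissingRunIdError(RuntimeError):
    """Raised when PIPELINE_RUN_ID is absent or blank."""


def require_run_id(env: Optional[Mapping[str, str]] = None) -> str:
    """Return PIPELINE_RUN_ID, or raise instead of continuing silently."""
    env = os.environ if env is None else env
    run_id = (env.get("PIPELINE_RUN_ID") or "").strip()
    if not run_id:
        raise MissingRunIdError(
            "PIPELINE_RUN_ID is not set; refusing to run without it"
        )
    return run_id
```

Called at the top of a step script, an unset variable then produces a traceback and a non-zero exit rather than a run that completes with zero rows written.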

If 010c is supposed to be monthly, it’s ~$25/month of expected output that’s simply absent today.

FIX 2 Investigate why 010a + 010h haven’t fired in 32-39 days. ~30 min

Identical diagnostic to Phase 009 Fix 1:

Check cadence.py — are 010a and 010h tagged monthly?
Check amicus_logs.step_logs for any 010a/010h rows since April 14
Check amicus_logs.api_cost_log for DFS Labs spend on the ranked_keywords and categories_for_domain endpoints in May

If the monthly tier simply isn’t in the cron’s active set, Phase 005 Fix 1 (cron split) fixes 010a, 010c, 010h plus every other dormant monthly step in one shot.

FIX 3 Inherit Phase 005 Fix 1 cron split. no work — lands with Phase 005 Fix 1

Once the cron splits into explicit monthly / weekly / biweekly entries:

010a + 010c + 010h fire once on the 1st of each month (~$95/month)
010b fires Mondays only (~$0.25/week = ~$1/month)
010d, e, f, g, i fire Mon + Thu (~$0.60/fire × 8-9 = ~$5/month)

No 010-specific cron line. The pattern is the same as 005/006/008/009.
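The target schedule can be expressed as a function from calendar date to the tiers due that day. This is a sketch of the intended cadence only, not the cron configuration from Phase 005 Fix 1; `tiers_due` is an illustrative name.

```python
from datetime import date
from typing import List

MONDAY, THURSDAY = 0, 3  # date.weekday() values


def tiers_due(day: date) -> List[str]:
    """Return which enrichment tiers should fire on the given date."""
    tiers = []
    if day.day == 1:
        tiers.append("monthly")    # 010a, 010c, 010h
    if day.weekday() == MONDAY:
        tiers.append("weekly")     # 010b
    if day.weekday() in (MONDAY, THURSDAY):
        tiers.append("biweekly")   # 010d, 010e, 010f, 010g, 010i
    return tiers


first_of_june = tiers_due(date(2026, 6, 1))   # a Monday that is also the 1st
a_thursday = tiers_due(date(2026, 5, 14))
a_saturday = tiers_due(date(2026, 5, 16))
```

A date such as 2026-06-01 is both the 1st and a Monday, so all three tiers fire together; the split entries should tolerate that overlap.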

FIX 4 Add per-fire row-count assertions for all 9 sub-steps. ~35 min

Reuse the shared assert_bulk_coverage() helper from Phase 009 Fix 4. For each of the 9 010 steps:

Per-domain (010a, 010c, 010h): row count = input domain count (no slack)
Bulk-per-domain (010b): row count = input domain count
Per-keyword (010d, e, f, i): row count = unique search_keyword count (~20)
Per-keyword expansion (010g): row count ~ 5× input keyword count (~100)

If any assertion fails, exit non-zero with the specific shortfall. Catches future silent partial-coverage runs.
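The signature of `assert_bulk_coverage()` is not shown on this page, so the sketch below uses stand-in functions to illustrate the rules listed above: exact counts for most steps, a tolerance for 010g, and a non-zero exit code on any mismatch. The names `check_coverage` and `coverage_exit_code` and the 10% tolerance are assumptions.

```python
import sys
from typing import Iterable, Optional, Tuple


def check_coverage(
    step: str, actual_rows: int, expected_rows: int, tolerance: float = 0.0
) -> Optional[str]:
    """Return an error message if the row count is off, else None."""
    allowed = expected_rows * tolerance
    if abs(actual_rows - expected_rows) > allowed:
        return (
            f"{step}: expected {expected_rows} rows "
            f"(tolerance {tolerance:.0%}), got {actual_rows}"
        )
    return None


def coverage_exit_code(checks: Iterable[Tuple]) -> int:
    """Print every failure to stderr; return 1 if any check failed, else 0."""
    failures = [msg for msg in (check_coverage(*c) for c in checks) if msg]
    for msg in failures:
        print(msg, file=sys.stderr)
    return 1 if failures else 0


# Example: 010h's April 7 run (1,667 rows against 2,245 domains) would fail,
# while a 010g fire of 98 rows passes under a 10% tolerance.
example_code = coverage_exit_code([
    ("010h", 1667, 2245),
    ("010g", 98, 100, 0.10),
    ("010d", 20, 20),
])
```

A step script would finish with `sys.exit(coverage_exit_code(...))` so the orchestrator sees the failure.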

FIX 5 Document the keyword set drift. ~15 min

Today each biweekly table holds 20 rows per fire, one per tracked keyword (100 for 010g). If Phase 004’s 24-category specialty map ever grows or shrinks, those tables grow or shrink to match. There’s no record of which keywords were tracked on a given date.

Action: on every 010 fire, log the input keyword set to step_logs.params_json. A future analyst querying “why does 2026-08’s data have 22 rows when 2026-04’s had 20” can trace the answer to a specific Phase 004 config change instead of guessing.
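One way to do this is to serialise the keyword set deterministically, so that two fires can be diffed later. The payload shape and the function names below are hypothetical; only the target field `step_logs.params_json` comes from this page.

```python
import json
from typing import Dict, Iterable, List


def keyword_params_json(keywords: Iterable[str]) -> str:
    """Serialise the input keyword set as stable, sorted JSON."""
    unique = sorted(set(keywords))
    return json.dumps(
        {"keyword_count": len(unique), "keywords": unique}, sort_keys=True
    )


def keyword_drift(old_json: str, new_json: str) -> Dict[str, List[str]]:
    """Compare two logged keyword sets and report what changed."""
    old = set(json.loads(old_json)["keywords"])
    new = set(json.loads(new_json)["keywords"])
    return {"added": sorted(new - old), "removed": sorted(old - new)}


# Illustrative keyword sets, not the real tracked list.
april = keyword_params_json(["family lawyer", "dui lawyer"])
august = keyword_params_json(["family lawyer", "dui lawyer", "tax lawyer"])
drift = keyword_drift(april, august)
```

Sorting before serialising means identical keyword sets always produce identical JSON, so a simple string comparison is enough to detect that nothing changed.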

FIX 6 Decide whether 010g’s 5× expansion (20 → 100 rows) is the right limit. ~20 min

010g_keyword_ideas writes 100 rows per fire for 20 input keywords — an ~5× expansion. That’s the DFS endpoint default. Two questions:

Is 5 ideas per keyword the right ceiling? More gives broader signal, fewer gives focus.
Should the per-keyword expansion be configurable per profile?

Low priority — the current value works, just call it out so it’s an explicit decision rather than an accident of the DFS default.

After all 6 fixes

The cron split from Phase 005 Fix 1 lands. 010a + 010c + 010h fire on the 1st of each month producing ~$95 of fresh per-domain Labs data. 010b continues weekly, 010d-g+i continue biweekly — both already working. 010c finally writes its first row. Every sub-step asserts coverage. The 9-table fragmentation is intentional and traceable.

Phase 010 is the last phase in the pipeline. After all 10 phases are in good shape, the next forensic question is no longer per-phase — it’s about cross-phase analytics tables (silver layers, scoring models) which live on top of all this enrichment data. That’s out of scope for this site.

Generated 2026-05-16 from /mnt/workspace/amicus/site_pipeline_amicusdata/ on amicus-dev VM. Last phase page in the set.