The Spec
How Phase 010 is supposed to behave.
Phase 010 — Keywords. Design intent. Reference, not reality.
Duty: Map out every firm’s keyword universe. What do they currently rank for (010a)? What organic traffic does that send (010b)? What’s their overall ranking footprint (010c)? Then characterize the keyword market itself: how hard is each keyword to rank for (010d), what related keywords matter (010e/f/g), what topical categories does each domain own (010h), and what’s the headline metric for each tracked keyword (010i).
Schedule: Mixed across all three enrichment tiers. Monthly: 010a, 010c, 010h (deep, per-domain). Weekly: 010b (bulk traffic). Biweekly: 010d, 010e, 010f, 010g, 010i (per-keyword Labs endpoints).
End state: Nine BigQuery enrichment tables — the most fragmented phase in the pipeline. Each table answers one slice of “where does this firm sit in the keyword economy.”
What Phase 010 does, plain English
Phase 005 / 006 / 007 / 008 / 009 all measured a firm’s infrastructure: their website, their domain age, their reviews, their backlinks, who outranks them in a SERP. Phase 010 measures the keyword economy the firm operates in.
The 9 sub-steps split into three buckets by what they take as input:
- Per-domain (010a, 010c, 010h): one DFS Labs call per firm domain. Expensive at scale. Monthly cadence.
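- Per-keyword (010d, 010e, 010f, 010g, 010i): keyed on the ~20 unique search keywords across all attorney specialties. 010e and 010f make one DFS Labs call per keyword; 010d, 010g, and 010i send the whole keyword set in a single chunked POST. Cheap. Biweekly cadence.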
- Bulk-per-domain (010b): one DFS Labs call per chunk of 1,000 domains. Cheap. Weekly cadence.
The 9 sub-steps
| Step | What it does | Tier | Input unit | Endpoint |
| 010a | Ranked keywords: every keyword the firm currently ranks for, with position, search volume, traffic estimate, CPC. | monthly | per domain | /dataforseo_labs/google/ranked_keywords/live |
| 010b | Bulk traffic estimation: total organic traffic per domain. One number per firm. | weekly | chunked, 1,000/POST | /dataforseo_labs/google/bulk_traffic_estimation/live |
| 010c | Domain rank overview: high-level organic visibility / keyword footprint per firm. | monthly | per domain | /dataforseo_labs/google/domain_rank_overview/live |
| 010d | Bulk keyword difficulty: how hard each tracked keyword is to rank for. | biweekly | chunked, 1,000/POST | /dataforseo_labs/google/bulk_keyword_difficulty/live |
| 010e | Related keywords: semantically related terms per tracked keyword. | biweekly | per keyword | /dataforseo_labs/google/related_keywords/live |
| 010f | Keyword suggestions: autocomplete-style expansions per tracked keyword. | biweekly | per keyword | /dataforseo_labs/google/keyword_suggestions/live |
| 010g | Keyword ideas: long-tail keyword ideas adjacent to each tracked term. | biweekly | chunked, 200/POST | /dataforseo_labs/google/keyword_ideas/live |
| 010h | Categories for domain: which topical categories each firm covers (auto law, family law, etc.). | monthly | per domain | /dataforseo_labs/google/categories_for_domain/live |
| 010i | Keyword overview: headline metrics (search volume, CPC, competition) per tracked keyword. | biweekly | chunked, 700/POST | /dataforseo_labs/google/keyword_overview/live |
Per-domain vs per-keyword — the cost split
The same cost shape applies to every Labs endpoint: $0.01 per POST + $0.0001 per item returned. The number of POSTs depends on whether the endpoint is per-domain or per-keyword and whether it supports chunking.
| Endpoint shape | Steps | POSTs for atty_wa_seattle |
| Per-domain (1 call each) | 010a, 010c, 010h | ~2,245 POSTs × $0.01 = ~$22.45 base, each |
| Per-keyword (1 call each) | 010e, 010f | ~20 POSTs × $0.01 = $0.20 base, each |
| Chunked per-domain (1,000/POST) | 010b | 3 POSTs × $0.01 = $0.03 base |
| Chunked per-keyword (varies) | 010d (1,000) · 010g (200) · 010i (700) | 1 POST each for 20 keywords = $0.01 base each |
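This is why monthly cadence is the right slot for 010a + 010c + 010h: they’re the only 3 sub-steps whose POST count grows one-for-one with firm count. 010b also depends on firm count, but at 1 POST per 1,000 domains it stays at 3 POSTs. The other 5 scale with keyword count (~20), which is tiny.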
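The POST counts and base costs in the table above can be reproduced in a few lines. This is a minimal sketch that assumes only the two cost constants named on this page; the function names `post_count` and `fire_cost` are hypothetical and are not taken from the 010 scripts.

```python
import math

COST_PER_REQUEST = 0.01   # $ per POST, as stated for every Labs endpoint
COST_PER_ITEM = 0.0001    # $ per item returned

def post_count(n_inputs: int, chunk_size: int = 1) -> int:
    """POSTs needed when each POST carries up to chunk_size inputs."""
    return math.ceil(n_inputs / chunk_size)

def fire_cost(n_posts: int, n_items: int = 0) -> float:
    """Labs cost shape: flat fee per POST plus a fee per returned item."""
    return n_posts * COST_PER_REQUEST + n_items * COST_PER_ITEM

FIRMS, KEYWORDS = 2245, 20

per_domain_posts = post_count(FIRMS)            # 010a, 010c, 010h
bulk_traffic_posts = post_count(FIRMS, 1000)    # 010b
per_keyword_posts = post_count(KEYWORDS)        # 010e, 010f
chunked_keyword_posts = post_count(KEYWORDS, 200)  # 010g (010d and 010i also fit in 1 POST)

for label, posts in [("per-domain", per_domain_posts),
                     ("bulk traffic", bulk_traffic_posts),
                     ("per-keyword", per_keyword_posts),
                     ("chunked per-keyword", chunked_keyword_posts)]:
    print(f"{label}: {posts} POSTs, ${fire_cost(posts):.2f} base")
```

Passing `n_items` to `fire_cost` adds the per-item component, which is what separates the base figures here from the per-fire totals later on the page.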
How the data moves
- Monthly, per-domain (010a · 010c · 010h): ranked_keywords / domain_rank_overview / categories_for_domain. 2,245 POSTs each.
- Weekly, bulk-per-domain (010b): bulk_traffic_estimation. 3 POSTs total (chunks of 1,000).
- Biweekly, per-keyword (010d · 010e · 010f · 010g · 010i): 5 endpoints × ~20 keywords. 1 POST per fire for the chunked endpoints (010d, 010g, 010i); ~20 POSTs per fire for 010e and 010f.
↓
Output · 9 enrichment tables
enrichment_010{a..i}_*
- Per-domain tables: 2,245 rows each
- Per-keyword tables: 20 rows each (~100 for 010g)
Where to look — file & table reference
| Thing | Path or table |
| The 9 scripts | /mnt/workspace/amicus/pipeline/steps/010_keywords/step_010[a-i]_*.py |
| BQ output tables | enrichment_010a_ranked_keywords · enrichment_010b_bulk_traffic · enrichment_010c_domain_rank_overview · enrichment_010d_bulk_keyword_difficulty · enrichment_010e_related_keywords · enrichment_010f_keyword_suggestions · enrichment_010g_keyword_ideas · enrichment_010h_categories_for_domain · enrichment_010i_keyword_overview |
| Per-step logs | pipeline/steps/000_log_files/step_010*_*.log |
| Cost constants | COST_PER_REQUEST = 0.01 and COST_PER_ITEM = 0.0001, defined in every 010 script |
Cost per fire
Three fire types, each with different math.
| Line item | Volume | Per unit | Subtotal |
| 010a: ranked_keywords ($0.01 + ~$0.0001 per ranked keyword) | ~2,245 firms × ~100 kw avg | ~$0.02 | ~$45 |
| 010c: domain_rank_overview ($0.01 base + small per-item) | ~2,245 firms | ~$0.011 | ~$25 |
| 010h: categories_for_domain ($0.01 base + small per-item) | ~2,245 firms | ~$0.011 | ~$25 |
| Monthly fire (010a + 010c + 010h) | | | ~$95 |
| 010b: bulk_traffic (3 POSTs × $0.01 + per-row) | 3 POSTs / 2,245 rows | | ~$0.25 |
| Weekly fire (010b) | | | ~$0.25 |
| 010d, 010e, 010f, 010g, 010i (5 steps × ~20 keywords each) | ~43 POSTs (20 each for 010e and 010f, 1 each for 010d, 010g, 010i) + per-item | | ~$0.60 |
| Biweekly fire (010d-g, 010i) | | | ~$0.60 |
Anchored on 2,245 V_CONFIRMED firms (queried 2026-05-16 from enrichment_010b_bulk_traffic, the actual production fire size for the past 3 weekly runs). The 20 unique keywords for the biweekly tier come from the 20 attorney specialties Phase 004 produces.
Schedule
Monthly (010a + 010c + 010h)
Fires once per 30 days. The expensive per-domain Labs cuts.
Weekly (010b)
Fires every Monday. Bulk traffic estimate — one number per firm.
Biweekly (010d, e, f, g, i)
Fires Mon + Thu. Per-keyword Labs endpoints (~20 keywords) — tiny payload.
Execution mode
All 9 steps are synchronous — dfs_request_with_retry per call, no async pool.
Chunk sizes
010b/d: 1,000 · 010g: 200 · 010i: 700 · 010a/c/h: 1 (per domain).
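To make the chunk sizes concrete, here is a minimal sketch of how a domain list would be split into POST payloads. The `CHUNK_SIZES` mapping mirrors the figures above; the `chunked` helper and the placeholder domain names are illustrative assumptions, not code from the 010 scripts.

```python
from typing import Iterator, Sequence

# Chunk sizes as listed above; per-domain steps send one domain per POST.
CHUNK_SIZES = {
    "010b": 1000, "010d": 1000, "010g": 200, "010i": 700,
    "010a": 1, "010c": 1, "010h": 1,
}

def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Placeholder domains standing in for the 2,245 V_CONFIRMED firms.
domains = [f"firm{i}.example" for i in range(2245)]
batches = list(chunked(domains, CHUNK_SIZES["010b"]))
print([len(batch) for batch in batches])  # [1000, 1000, 245]
```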
Output guarantee
Per-domain steps: one row per input domain. Per-keyword steps: one row per input keyword (~20), except 010g, which expands each keyword into ~5 ideas (~100 rows).
What's Fucked
Phase 010 is split clean down the middle: weekly+biweekly work, monthly doesn’t.
Forensic findings, 2026-05-16. Anchored on direct BQ queries of all 9 enrichment tables.
Finding 1 — All three monthly steps are broken or stalled.
| Step | Tier | Last fire | Status |
| 010a_ranked_keywords | monthly | 2026-04-07 | 1 run ever (atty_production_001). 39 days stale. |
| 010c_domain_rank_overview | monthly | never | ZERO ROWS. Never run. |
| 010h_categories_for_domain | monthly | 2026-04-14 | 2 runs (Apr 7 + Apr 14). 32 days stale. |
Same pattern as Phase 005a, 006a, 006b, 008a, 008b, 009a, 009b: the monthly tier has not fired since mid-April. 010c is the worst case — never run at all, same situation as 008b. Combined cost gap: ~$95/month of expected spend that’s either not happening, or happening but silently no-op’ing on the BQ write.
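The staleness figures in the table above are plain date arithmetic against the 2026-05-16 query date. A small sketch, assuming a 30-day threshold taken from the monthly schedule; the `staleness` helper is hypothetical.

```python
from datetime import date
from typing import Optional

AS_OF = date(2026, 5, 16)   # date of the forensic queries
MAX_AGE_DAYS = 30           # monthly tier is specified to fire once per 30 days

# Last fire per monthly step, from the table above (None = never run).
LAST_FIRE = {
    "010a": date(2026, 4, 7),
    "010c": None,
    "010h": date(2026, 4, 14),
}

def staleness(last_fire: Optional[date], as_of: date = AS_OF) -> Optional[int]:
    """Days since the last fire, or None when the step has never run."""
    return None if last_fire is None else (as_of - last_fire).days

for step, last in LAST_FIRE.items():
    age = staleness(last)
    if age is None:
        print(f"{step}: never run")
    else:
        flag = "STALE" if age > MAX_AGE_DAYS else "ok"
        print(f"{step}: {age} days since last fire ({flag})")
```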
Finding 2 — 010b (weekly) is firing correctly, alongside Phase 007.
Same 3 weekly run_ids as Phase 007:
| Run ID (short) | Date | Domains |
| 5ab4a945 | 2026-04-21 | 2,245 |
| dcb7cfab | 2026-04-28 | 2,245 |
| 213f9e1b | 2026-05-11 | 2,245 |
Consistent coverage. Missing 2026-05-04 fire (same gap as Phase 007). Same drift pattern as the rest of the weekly tier.
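A missing weekly fire shows up as a gap of more than 7 days between consecutive run dates. The sketch below applies that check to the three dates in the table; `find_gaps` is a hypothetical helper, not part of the pipeline.

```python
from datetime import date
from typing import List, Tuple

def find_gaps(fire_dates: List[date], max_gap_days: int = 7) -> List[Tuple[date, date, int]]:
    """Return (previous, next, days_between) for every gap wider than max_gap_days."""
    ordered = sorted(fire_dates)
    return [
        (prev, nxt, (nxt - prev).days)
        for prev, nxt in zip(ordered, ordered[1:])
        if (nxt - prev).days > max_gap_days
    ]

weekly_fires = [date(2026, 4, 21), date(2026, 4, 28), date(2026, 5, 11)]
gaps = find_gaps(weekly_fires)
for prev, nxt, days in gaps:
    print(f"gap of {days} days between {prev} and {nxt}")
```

The same check with `max_gap_days` set to the tier's cadence would work for the biweekly and monthly tables.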
Finding 3 — All five biweekly steps (010d, e, f, g, i) are firing reliably.
Each of the 5 biweekly tables shows 5 consecutive fires across the past 16 days:
| Date | Row counts per table |
| 2026-04-28 | 010d:20 · 010e:20 · 010f:20 · 010g:100 · 010i:20 |
| 2026-05-01 | 010d:20 · 010e:20 · 010f:20 · 010g:100 · 010i:20 |
| 2026-05-07 | 010d:20 · 010e:20 · 010f:20 · 010g:100 · 010i:20 |
| 2026-05-11 | 010d:20 · 010e:20 · 010f:20 · 010g:100 · 010i:20 |
| 2026-05-14 | 010d:20 · 010e:20 · 010f:20 · 010g:100 · 010i:20 |
That’s 25 successful BQ writes in a row across 5 tables × 5 runs. Same pattern as Phase 009 biweekly — identical run_ids, identical cadence. Like 009c-h, this is the model of how the pipeline should behave.
Finding 4 — The 20 / 100 row counts are per-keyword, not per-domain.
When we queried COUNT(DISTINCT domain) on the biweekly tables, every one returned 0, meaning the domain column holds only NULLs (a missing column would have failed the query rather than returned 0). The 20 rows are keyed by keyword instead. 010g_keyword_ideas returns 100 rows because each keyword is expanded into ~5 ideas.
Implication: joining 010d-g/i output back to per-domain tables requires going through the search_keyword field, not domain. Easy join but worth knowing.
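The join described above can be illustrated in plain Python. This sketch assumes the per-domain rows also carry a `search_keyword` value to join on; the `keyword_difficulty` column name and all sample rows are made up for illustration and do not come from the real tables.

```python
from typing import Dict, List

# Illustrative rows only; the real tables carry more columns.
keyword_difficulty_rows = [
    {"search_keyword": "divorce lawyer seattle", "keyword_difficulty": 38},
    {"search_keyword": "dui attorney seattle", "keyword_difficulty": 45},
]
per_domain_rows = [
    {"domain": "firm-a.example", "search_keyword": "divorce lawyer seattle"},
    {"domain": "firm-b.example", "search_keyword": "dui attorney seattle"},
    {"domain": "firm-c.example", "search_keyword": "estate planning seattle"},
]

def left_join_on_keyword(domain_rows: List[Dict], keyword_rows: List[Dict]) -> List[Dict]:
    """Attach keyword-level metrics to domain rows, keyed on search_keyword."""
    by_keyword = {row["search_keyword"]: row for row in keyword_rows}
    joined = []
    for row in domain_rows:
        match = by_keyword.get(row["search_keyword"], {})
        # Unmatched keywords keep the domain row and get None, like a SQL LEFT JOIN.
        joined.append({**row, "keyword_difficulty": match.get("keyword_difficulty")})
    return joined

joined = left_join_on_keyword(per_domain_rows, keyword_difficulty_rows)
for row in joined:
    print(row)
```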
Finding 5 — 010a baseline used a different run_id than other monthly runs.
The single 010a fire used atty_production_001 (the April 7 baseline). 010h’s second fire on 2026-04-14 used e6e2a8b5-119b-43eb-a3c6-d5ac61256505. These don’t share run_ids with each other, with 009a-b’s backfill_20260420, or with the weekly/biweekly cron run_ids.
That suggests the monthly tier was being fired manually with ad-hoc run_ids during the April attempts, not by a clean cron-triggered “run_id per market per cycle” scheme. Once the cron split lands (Phase 005 Fix 1), every monthly fire should use a single coherent run_id.
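One way to get a "run_id per market per cycle" is to derive the id deterministically from the market, the tier, and the cycle start date. The format below is purely an assumption for illustration; the actual scheme chosen in Phase 005 Fix 1 is not described on this page.

```python
from datetime import date

def make_run_id(market: str, tier: str, cycle_start: date) -> str:
    """Build a deterministic run_id (hypothetical format: market_tier_YYYYMMDD)."""
    return f"{market}_{tier}_{cycle_start:%Y%m%d}"

# Every monthly step fired in the same cycle would share this id.
run_id = make_run_id("atty_wa_seattle", "monthly", date(2026, 6, 1))
print(run_id)  # atty_wa_seattle_monthly_20260601
```

With a derived id, 010a, 010c, and 010h rows from the same cycle can be grouped by run_id without consulting logs.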
Finding 6 — 010h has the only run-over-run row count change in monthly tier.
010h on 2026-04-07 wrote 1,667 rows. 010h on 2026-04-14 wrote 2,245 rows. Either:
- The April 14 rerun used a fuller gold list (Phase 003+004 added ~578 V_CONFIRMED firms between Apr 7 and Apr 14), OR
- The April 7 run had partial coverage and the April 14 run backfilled the rest
2,245 matches the consistent count seen in 009c-h biweekly fires from 2026-04-28 onwards, so the gold list size has been stable since then.
The bottom line
Where Phase 010 Stands Today
The cheap stuff works perfectly. 010b weekly fires alongside Phase 007 (3 successful runs). 010d, e, f, g, i biweekly fire alongside Phase 009 (5 successful runs each, 25 BQ writes in a row). The expensive monthly stuff hasn’t fired in 32-39 days, and 010c has never fired at all. Combined ~$95/month of expected monthly spend is not landing. Same root cause family as Phase 005 / 006 / 008b / 009a: monthly tier on the broken Mon+Thu cron either isn’t triggering or is silently no-op’ing the BQ write.
The Fix
What we’ll do to make Phase 010 match the spec.
Concrete remediation. 010c is the urgent one; everything else inherits Phase 005’s cron split.
Six fixes. 010c is the only sub-step in Phase 010 with zero rows ever, same priority level as 008b.
Same diagnostic playbook as Phase 008 Fix 1. SSH to VM, run step_010c_domain_rank_overview.py manually with PIPELINE_RUN_ID=manual_test_2026_05_16 set, watch the BQ write step.
Three possibilities, same as 008b:
- Silent skip on missing PIPELINE_RUN_ID: apply Phase 002’s fail-loud fix
- Schema mismatch: reconcile against bq/schemas/enrichment.py
- Never invoked by orchestrator: check cadence.py tags
If 010c is supposed to be monthly, it’s ~$25/month of expected output that’s simply absent today.
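For the first possibility, a fail-loud guard could look like the sketch below. The details of Phase 002's actual fix are not shown on this page, so `require_run_id` and its error message are assumptions; the only name taken from the source is the PIPELINE_RUN_ID environment variable.

```python
import os
import sys
from typing import Mapping, Optional

def require_run_id(env: Optional[Mapping[str, str]] = None) -> str:
    """Return PIPELINE_RUN_ID, or exit non-zero instead of silently skipping."""
    env = os.environ if env is None else env
    run_id = env.get("PIPELINE_RUN_ID", "").strip()
    if not run_id:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit("PIPELINE_RUN_ID is not set; refusing to run without a run_id")
    return run_id

# Explicit mapping used here so the example runs without touching the real environment.
run_id = require_run_id({"PIPELINE_RUN_ID": "manual_test_2026_05_16"})
print(f"running 010c with run_id={run_id}")
```

In a step script the call would be `require_run_id()` with no argument, placed before any API request so a misconfigured fire costs nothing.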
Identical diagnostic to Phase 009 Fix 1:
- Check cadence.py: are 010a and 010h tagged monthly?
- Check amicus_logs.step_logs for any 010a/010h rows since April 14
- Check amicus_logs.api_cost_log for DFS Labs spend on the ranked_keywords and categories_for_domain endpoints in May
If the monthly tier simply isn’t in the cron’s active set, Phase 005 Fix 1 (cron split) fixes 010a, 010c, 010h plus every other dormant monthly step in one shot.
Once the cron splits into explicit monthly / weekly / biweekly entries:
- 010a + 010c + 010h fire once on the 1st of each month (~$95/month)
- 010b fires Mondays only (~$0.25/week = ~$1/month)
- 010d, e, f, g, i fire Mon + Thu (~$0.60/fire × 8-9 = ~$5/month)
No 010-specific cron line. The pattern is the same as 005/006/008/009.
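The monthly roll-up implied by the three bullets above is simple to check. A sketch using the approximate per-fire figures from this page; the fires-per-month ranges (4 weekly, 8 to 9 biweekly) follow the bullets, and the function name is hypothetical.

```python
from typing import Dict, Tuple

# Approximate per-fire cost in USD, from the cost table on this page.
PER_FIRE_USD = {"monthly": 95.00, "weekly": 0.25, "biweekly": 0.60}
# (low, high) number of fires per calendar month for each tier.
FIRES_PER_MONTH = {"monthly": (1, 1), "weekly": (4, 4), "biweekly": (8, 9)}

def monthly_spend_range(per_fire: Dict[str, float],
                        fires: Dict[str, Tuple[int, int]]) -> Tuple[float, float]:
    """Sum per-fire cost times fire count for the low and high fire counts."""
    low = sum(per_fire[tier] * fires[tier][0] for tier in per_fire)
    high = sum(per_fire[tier] * fires[tier][1] for tier in per_fire)
    return round(low, 2), round(high, 2)

low, high = monthly_spend_range(PER_FIRE_USD, FIRES_PER_MONTH)
print(f"Phase 010 expected spend: ${low:.2f} to ${high:.2f} per month")
```

The monthly tier accounts for roughly 94% of the total, which is why its absence dominates the cost gap.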
Reuse the shared assert_bulk_coverage() helper from Phase 009 Fix 4. For each of the 9 010 steps:
- Per-domain (010a, 010c, 010h): row count = input domain count (no slack)
- Bulk-per-domain (010b): row count = input domain count
- Per-keyword (010d, e, f, i): row count = unique search_keyword count (~20)
- Per-keyword expansion (010g): row count ~ 5× input keyword count (~100)
If any assertion fails, exit non-zero with the specific shortfall. Catches future silent partial-coverage runs.
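The real assert_bulk_coverage() helper lives in Phase 009 Fix 4 and its signature is not shown here, so the version below is a guess at its shape: the parameters, the tolerance mechanism, and the 20% slack for 010g are all assumptions made for illustration.

```python
import sys

def assert_bulk_coverage(step: str, rows_written: int, expected_rows: int,
                         tolerance: float = 0.0) -> None:
    """Exit non-zero when rows_written is outside expected_rows +/- tolerance."""
    slack = expected_rows * tolerance
    shortfall = expected_rows - rows_written
    if abs(shortfall) > slack:
        sys.exit(
            f"{step}: coverage check failed, expected {expected_rows} rows "
            f"(tolerance {tolerance:.0%}), wrote {rows_written} "
            f"(shortfall {shortfall})"
        )

# Exact-match steps: no slack.
assert_bulk_coverage("010b", rows_written=2245, expected_rows=2245)
assert_bulk_coverage("010d", rows_written=20, expected_rows=20)
# Expansion step: roughly 5 ideas per keyword, so allow some slack (assumed 20%).
assert_bulk_coverage("010g", rows_written=100, expected_rows=5 * 20, tolerance=0.2)
print("coverage checks passed")
```

Run against the 2026-04-07 010h fire (1,667 rows against 2,245 expected), this check would have exited with a shortfall of 578.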
Today each biweekly table holds 20 rows per run, one per tracked keyword (100 for 010g). If Phase 004’s 24-category specialty map ever grows or shrinks, those tables grow or shrink to match. There’s no record of which keywords were tracked on a given date.
Action: on every 010 fire, log the input keyword set to step_logs.params_json. A future analyst querying “why does 2026-08’s data have 22 rows when 2026-04’s had 20” can trace the answer to a specific Phase 004 config change instead of guessing.
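A sketch of what the logged payload could contain. Only the step_logs.params_json destination comes from this page; the field names, the short hash, and the `build_params_json` helper are assumptions.

```python
import hashlib
import json
from typing import Iterable

def build_params_json(step: str, keywords: Iterable[str]) -> str:
    """Serialize the tracked keyword set so it can be stored in params_json."""
    unique = sorted(set(keywords))
    # A short digest makes it easy to spot when the set changed between fires.
    digest = hashlib.sha256("\n".join(unique).encode("utf-8")).hexdigest()[:12]
    payload = {
        "step": step,
        "keyword_count": len(unique),
        "keyword_set_hash": digest,
        "keywords": unique,
    }
    return json.dumps(payload, sort_keys=True)

params = build_params_json("010d", ["dui attorney seattle", "divorce lawyer seattle"])
print(params)
```

Sorting and de-duplicating before hashing means two fires with the same keywords in a different order produce the same hash, so a hash change always reflects a real change in the tracked set.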
010g_keyword_ideas writes 100 rows per fire for 20 input keywords — an ~5× expansion. That’s the DFS endpoint default. Two questions:
- Is 5 ideas per keyword the right ceiling? More gives broader signal, fewer gives focus.
- Should the per-keyword expansion be configurable per profile?
Low priority — the current value works, just call it out so it’s an explicit decision rather than an accident of the DFS default.
After all 6 fixes
The cron split from Phase 005 Fix 1 lands. 010a + 010c + 010h fire on the 1st of each month producing ~$95 of fresh per-domain Labs data. 010b continues weekly, 010d-g+i continue biweekly — both already working. 010c finally writes its first row. Every sub-step asserts coverage. The 9-table fragmentation is intentional and traceable.
Phase 010 is the last phase in the pipeline. After all 10 phases are in good shape, the next forensic question is no longer per-phase — it’s about cross-phase analytics tables (silver layers, scoring models) which live on top of all this enrichment data. That’s out of scope for this site.