The Spec
How Phase 009 is supposed to behave.
Phase 009 — Backlinks. Design intent. Reference, not reality.
Duty: For every confirmed firm, characterize the firm’s backlink profile — how many sites link at them, how authoritative those sites are, whether the profile looks spammy, and how it’s changing over time. Two cadences: deep per-domain calls (009a-b) run monthly; cheap bulk-aggregate calls (009c-h) run twice a week to track movement.
Schedule: Mixed — 009a + 009b are monthly. 009c, d, e, f, g, h are biweekly (Mon + Thu).
End state: Eight BigQuery enrichment tables populated. The biweekly bulks (c-h) give a refreshed snapshot of authority metrics for every firm twice a week; the monthly deep cuts (a-b) capture the full per-link detail.
What Phase 009 does, plain English
Backlinks are the single most predictive Google-ranking signal we have. A firm with 500 backlinks from authoritative sites (law journals, bar association directories, news outlets) ranks above a firm with 20 backlinks from low-quality directories, all else equal.
Phase 009 splits the question of “how strong is this firm’s backlink profile” into eight different DFS API calls, each answering a different sub-question:
- 009a + 009b — deep cuts. Full per-domain summaries and individual backlinks. Expensive ($0.02 per domain, plus per-row fees). Monthly cadence.
- 009c through 009h — bulk endpoints. Each one batches up to 1,000 domains per POST. Each returns one metric per firm: rank score, total backlinks count, spam score, referring domains count, new/lost referring domains, total indexed pages. Two-plus orders of magnitude cheaper than the deep cuts (≈$0.13 vs ≈$45 for the full firm set).
The bulk endpoints are what makes biweekly cadence affordable. A monthly snapshot of full backlink detail (009a + 009b) is paired with twice-weekly bulk-aggregate updates (009c-h) so we can detect “this firm just gained 50 referring domains” within 72 hours.
The 8 sub-steps
| Step | What it does | Tier | Endpoint | Batch |
| --- | --- | --- | --- | --- |
| 009a | Backlinks summary — per domain: total backlinks, total referring domains, domain rank, anchor-text breakdown. | monthly | /backlinks/summary/live | 1 domain/call |
| 009b | Backlinks live — the full list of individual live backlinks pointing at the domain. Limit 200 per domain, ordered by rank desc. | monthly | /backlinks/backlinks/live | 1 domain/call |
| 009c | Bulk Domain Rank — DFS authority score (0–1000 scale) per domain. | biweekly | /backlinks/bulk_ranks/live | up to 1,000 domains/call |
| 009d | Bulk Total Backlinks — total count of backlinks per domain. | biweekly | /backlinks/bulk_backlinks/live | up to 1,000 domains/call |
| 009e | Bulk Spam Score — DFS-assigned spam score per domain. Flags low-quality backlink profiles. | biweekly | /backlinks/bulk_spam_score/live | up to 1,000 domains/call |
| 009f | Bulk Referring Domains — count of unique referring domains per firm. | biweekly | /backlinks/bulk_referring_domains/live | up to 1,000 domains/call |
| 009g | Bulk New/Lost Referring Domains — referring domains gained or lost in the last period. | biweekly | /backlinks/bulk_new_lost_referring_domains/live | up to 1,000 domains/call |
| 009h | Bulk Pages Summary — total pages of the firm’s site indexed by Google. | biweekly | /backlinks/bulk_pages_summary/live | up to 100 domains/call (smaller batch) |
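The POST counts above fall out of a ceiling division over the batch cap. A minimal sketch (the `chunk_domains` helper and the placeholder domain names are illustrative, not the pipeline's actual code):

```python
from math import ceil

def chunk_domains(domains, chunk_size):
    """Split a domain list into POST-sized batches (bulk endpoints cap the payload)."""
    return [domains[i:i + chunk_size] for i in range(0, len(domains), chunk_size)]

domains = [f"firm{i}.example" for i in range(2245)]  # biweekly-fire population

# 009c-g use chunk=1000 -> 3 POSTs; 009h uses chunk=100 -> 23 POSTs
assert len(chunk_domains(domains, 1000)) == ceil(2245 / 1000) == 3
assert len(chunk_domains(domains, 100)) == ceil(2245 / 100) == 23
```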
Deep cuts (009a/b) vs Bulk (009c-h) — the cost difference is the point
| Aspect | Deep cut (009a/b) | Bulk (009c-h) |
| --- | --- | --- |
| Cost shape | $0.02 per domain (one POST per firm) | $0.02 per POST + $0.00003 per row (one POST per 1,000 firms) |
| For 2,245 firms | 2,245 POSTs × $0.02 = $44.90 | 3 POSTs × $0.02 = $0.06 + 2,245 rows × $0.00003 ≈ $0.07 → ≈ $0.13 per bulk step |
| Detail returned | Full per-domain breakdown | One metric value per domain |
| Use case | Quarterly “deep audit” for a firm | Trend tracking, change-detection alerts |
| Affordable frequency | Monthly | Twice a week |
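The cost shapes in the table can be sketched as two functions (fee constants are the ones the table states; the function names are illustrative):

```python
from math import ceil

POST_FEE = 0.02     # $ per POST (deep and bulk alike, per the table above)
ROW_FEE = 0.00003   # $ per returned row on bulk endpoints

def deep_cost(n_firms):
    """Deep cut: one POST per firm, no row fee."""
    return n_firms * POST_FEE

def bulk_cost(n_firms, chunk=1000):
    """Bulk: one POST per chunk of domains, plus a per-row fee."""
    return ceil(n_firms / chunk) * POST_FEE + n_firms * ROW_FEE

assert round(deep_cost(2245), 2) == 44.90   # deep cut across the firm set
assert round(bulk_cost(2245), 2) == 0.13    # same firms via one bulk step
```

The same function with `chunk=100` reproduces 009h's per-fire cost (≈$0.53), which is why its smaller batch cap shows up as a 4× cost bump rather than a different formula.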
How the data moves
Confirmed firm domains (gold_domains)
↓
Monthly deep cuts:
- 009a · summary — step_009a_backlinks_summary.py — 1 POST per firm, ~$0.02 each
- 009b · live — step_009b_backlinks_live.py — 1 POST per firm, top 200 backlinks each
Biweekly bulks:
- 009c · ranks — step_009c_bulk_ranks.py — 3 POSTs total, ~$0.13 per fire
- 009d · backlinks count — step_009d_bulk_backlinks.py — 3 POSTs total
- 009e · spam score — step_009e_bulk_spam_score.py — 3 POSTs total
- 009f · referring domains — step_009f_bulk_referring_domains.py — 3 POSTs total
- 009g · new/lost — step_009g_bulk_new_lost_referring.py — 3 POSTs total
- 009h · pages (smaller batch) — step_009h_bulk_pages_summary.py — 23 POSTs (chunk=100, not 1,000)
↓
Output · 8 enrichment tables (enrichment_009{a..h}_*)
- Monthly: a + b carry full per-link detail
- Biweekly: c-h get fresh snapshots of 6 metrics
Where to look — file & table reference
| Thing | Path or table |
| --- | --- |
| The 8 scripts | /mnt/workspace/amicus/pipeline/steps/009_backlinks/step_009[a-h]_*.py |
| BQ output tables | enrichment_009a_backlinks_summary · enrichment_009b_backlinks_live · enrichment_009c_bulk_ranks · enrichment_009d_bulk_backlinks · enrichment_009e_bulk_spam_score · enrichment_009f_bulk_referring_domains · enrichment_009g_bulk_new_lost_referring · enrichment_009h_bulk_pages_summary |
| Per-step logs | pipeline/steps/000_log_files/step_009*_*.log |
| Shared DFS utilities | pipeline/steps/shared/dfs_common.py |
Cost per fire
Anchored on real BQ row counts queried 2026-05-16:
- Monthly fire: 2,423 domains (the size of the one backfill on 2026-04-20 that populated 009a + 009b)
- Biweekly fire: 2,245 domains (the consistent size of the past 5 cron-driven 009c-h fires)
| Line item | Volume | Per unit | Subtotal |
| --- | --- | --- | --- |
| 009a — Backlinks summary ($0.02 per domain) | ~2,423 domains | $0.02 | ~$48.46 |
| 009b — Backlinks live ($0.02 + $0.00003 × ~200 backlinks) | ~2,423 domains | ~$0.026 | ~$63.00 |
| Monthly fire (009a + 009b) | | | ~$111 |
| 009c – 009g (5 bulk steps, chunk=1000, 3 POSTs each) | 5 × ~$0.13 | | ~$0.65 |
| 009h (chunk=100, 23 POSTs per fire) | ~23 POSTs | $0.02 | ~$0.53 |
| Biweekly fire (009c-h) | | | ~$1.18 |
Biweekly fires are absurdly cheap relative to monthly — that’s the whole design intent. Twice a week × ~$1.18 = ~$10/month for change-tracking on 6 backlink metrics across 2,245 firms.
Schedule
Monthly (009a + 009b)
Fires once per 30 days. Full per-link detail.
Biweekly (009c-h)
Fires every Monday + Thursday. 6 bulk-aggregate metrics, all 2,245 firms.
Execution mode
All 8 steps run sequentially within their tier. 009a-b are expensive serial loops. 009c-h are chunked POSTs (mostly 3 calls each).
Crash recovery
009a + 009b support --resume (per-domain submission log). 009c-h are idempotent enough that re-running is safe.
Output guarantee
Every input domain ends up with a row in each enrichment table for that run_id.
What's Fucked
Phase 009 is mostly the model. Four findings.
Forensic findings, 2026-05-16. 009c-h are the best-behaved sub-steps in the entire pipeline. The monthly cuts have a different story.
Finding 1 — 009c-h are firing biweekly successfully. The cleanest BQ history in the pipeline.
Direct query of all 6 bulk-tier tables (009c, d, e, f, g, h) for atty_wa_seattle:
| Run ID (short) | Date | Rows per table |
| --- | --- | --- |
| dcb7cfab | 2026-04-28 | 2,245 |
| 1dee02fa | 2026-05-01 | 2,245 |
| fc7b0f91 | 2026-05-07 | 2,245 |
| 213f9e1b | 2026-05-11 | 2,245 |
| f480597a | 2026-05-14 | 2,245 |
Five cron fires across 16 days, identical 2,245-row coverage in every single one of the 6 bulk tables. That’s 30 successful BQ writes in a row across 6 tables × 5 runs. No silent skips, no schema drift, no missing rows.
Cadence: roughly Mon + Thu, matching the biweekly tier intent. (Same missing 2026-05-04 fire as Phase 007/008c.)
Finding 2 — 009a + 009b each have ONE row in BQ. Ever. From the 2026-04-20 backfill.
Direct query of enrichment_009a_backlinks_summary and enrichment_009b_backlinks_live for atty_wa_seattle:
| Table | Only run_id | Date | Domains |
| --- | --- | --- | --- |
| enrichment_009a_backlinks_summary | backfill_20260420 | 2026-04-20 | 2,423 |
| enrichment_009b_backlinks_live | backfill_20260420 | 2026-04-20 | 2,423 |
Same single-backfill pattern as Phases 005a, 006a, 006b, and 008a. The monthly tier has not fired since 2026-04-20 — 26 days ago. If 009a-b are tagged monthly in cadence.py but actually riding the broken Mon+Thu cron expression, they should have fired roughly 8-9 times by now at ~$111 per fire — ~$900-1,000 burned — and we’d see fresh rows. We don’t. So either the monthly cron isn’t triggering, or 009a-b are silently no-op’ing.
Finding 3 — Domain count drift: 2,423 (backfill) vs 2,245 (cron fires).
Backfill processed 2,423 unique primary domains. Subsequent biweekly fires consistently process 2,245. That’s a 178-domain difference (~7% drop).
Most likely cause: gold_domains got rebuilt between 2026-04-20 and 2026-04-28, producing a smaller V_CONFIRMED slice. The same drift appeared in Phase 007 (2,968 backfill vs 2,772 cron fires) and is consistent with one Phase 003 / 004 rerun happening in that window.
Finding 4 — The biweekly fire pattern doesn’t cleanly match Mon+Thu.
Look at the dates: 04-28 (Mon), 05-01 (Thu), 05-07 (Wed?), 05-11 (Mon), 05-14 (Thu). The 2026-05-07 fire was on a Wednesday, not the expected Thursday.
The timestamp on 05-07 is 22:42 UTC — which is the morning of 2026-05-08 in some timezones but evening 05-07 in UTC. Either a clock-skew artifact in the timestamp, or the cron actually ran a day off that week. Worth checking against amicus_logs.pipeline_runs to see whether the orchestrator’s expected schedule for that date was Wed or Thu.
The bottom line
Where Phase 009 Stands Today
009c-h (the biweekly bulks) are the cleanest, most reliable sub-steps in the whole pipeline. Five successful fires in 16 days, identical 2,245-row coverage every time, all 6 tables populated correctly. 009a + 009b (the monthly deep cuts) have not fired since the 2026-04-20 backfill — if they were supposed to fire monthly on the Mon+Thu cron, that’s a 26-day data gap with ~$900-1000 of expected spend that didn’t happen. Investigate which: was the monthly tier turned off, are 009a-b silently no-op’ing, or is the deep cut simply not yet due?
The Fix
What we’ll do to make Phase 009 match the spec.
Concrete remediation. Most of Phase 009 already works — just confirm the monthly path.
Five fixes. The first is the only one chasing actually-missing data; the rest are guardrails or future-proofing.
Fix 1 — Find out why the monthly tier is silent. Three checks, in order:
1. cadence.py: confirm 009a + 009b are tagged monthly and that the monthly tier is in the orchestrator’s active cron set.
2. amicus_logs.step_logs: any 009a/009b rows since 2026-04-20? If completed rows exist but no enrichment rows landed, that’s the PIPELINE_RUN_ID silent-skip gotcha (apply Phase 002’s fix).
3. amicus_logs.api_cost_log: any DFS spend on the backlinks/summary and backlinks/backlinks endpoints since 2026-04-20? If yes, money is being burned for no output. If no, the script never ran.
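Checks 2 and 3 are one query each. A sketch of the SQL — the column names (step, started_at, endpoint, cost_usd, logged_at) are guesses from the table names referenced above; the real schemas live in amicus_logs and may differ:

```python
# Diagnostic queries for the silent monthly tier (column names are assumptions).
CHECKS = {
    "step_logs": """
        SELECT step, COUNT(*) AS runs
        FROM amicus_logs.step_logs
        WHERE step IN ('009a', '009b')
          AND started_at >= '2026-04-20'
        GROUP BY step
    """,
    "api_cost_log": """
        SELECT endpoint, SUM(cost_usd) AS spend
        FROM amicus_logs.api_cost_log
        WHERE logged_at >= '2026-04-20'
          AND (endpoint LIKE '%backlinks/summary%'
               OR endpoint LIKE '%backlinks/backlinks%')
        GROUP BY endpoint
    """,
}

for name, sql in CHECKS.items():
    print(f"-- {name}:{sql}")
```

Zero rows from both queries means the scripts never ran; step_logs rows without enrichment rows means the silent-skip gotcha; api_cost_log spend without output means burn with no landing.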
If the answer is “monthly cadence isn’t in the cron at all,” then this is the bigger problem 005/006 also has — the cron expression doesn’t respect the tier system. Phase 005 Fix 1 (cron split) is the canonical remediation.
Fix 2 — Split the cron by tier. Once the cron splits into explicit monthly / weekly / biweekly entries:
- 009a + 009b fire once on the 1st of each month
- 009c-h continue firing Mon + Thu (no change — they already work)
The 009c-h cadence is already correct under today’s cron; Fix 2 specifically rescues 009a-b.
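The split might look like the fragment below — the entry-point path, flag, and fire times are placeholders (the real expressions live in the orchestrator), but the cron fields themselves encode exactly the two cadences above:

```cron
# Monthly deep cuts (009a + 009b): 1st of each month, 06:00
0 6 1 * * python3 /mnt/workspace/amicus/pipeline/run_phase.py 009 --tier monthly
# Biweekly bulks (009c-h): every Monday and Thursday, 06:00
0 6 * * 1,4 python3 /mnt/workspace/amicus/pipeline/run_phase.py 009 --tier biweekly
```

The key property: `1` in the day-of-month field for the monthly entry, `1,4` in the day-of-week field for Mon+Thu — two independent entries, so neither tier can mask the other.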
Fix 3 — Annotate the 2026-05-07 anomaly. The biweekly fires follow a clean Mon+Thu cadence except for one fire at 2026-05-07 22:42 UTC, which sits on the Wednesday-evening / Thursday-morning border. Two possible causes:
- Cron skipped the Monday 05-04 fire and recovered Wed 05-07 — suggests retry logic exists somewhere (where?).
- UTC-vs-PT confusion — the timestamp is in UTC; if the cron is Pacific Time, 22:42 UTC = 15:42 PT which is Wednesday afternoon, not Thursday.
Confirm timezone, decide if anything needs fixing. Almost certainly nothing — this is annotation, not a bug.
Fix 4 — Add fail-loud row-count assertions to the bulks. The 6 bulk steps are working. They’re also the simplest place to add fail-loud guarantees, because the input count is known up front:
- Read the input domain count from the gold_domains filter.
- After each step’s BQ write, assert that COUNT(*) FROM enrichment_009{X} WHERE run_id = $RUN equals the input count (no slack — bulk endpoints return one row per input).
- If the counts mismatch, exit non-zero.
The pattern is so consistent across 009c-g that this can be a shared helper in shared/dfs_common.py: assert_bulk_coverage(table_name, run_id, expected_count).
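A sketch of that helper. The signature matches the one named above, with one addition: a `query_scalar` callable is injected so the sketch stays testable without a live BigQuery client — the real helper would presumably use the shared DFS/BQ client instead:

```python
def assert_bulk_coverage(table_name, run_id, expected_count, query_scalar):
    """Fail loud if a bulk step wrote fewer (or more) rows than it was given.

    query_scalar: callable that executes a SQL string and returns one value.
    Injected here for testability; internals of the real helper are a guess.
    """
    sql = (
        f"SELECT COUNT(*) FROM {table_name} "
        f"WHERE run_id = '{run_id}'"
    )
    actual = query_scalar(sql)
    if actual != expected_count:
        # Non-zero exit so the orchestrator sees the failure instead of a
        # silently short table.
        raise SystemExit(
            f"{table_name}: expected {expected_count} rows for run "
            f"{run_id}, got {actual}"
        )

# Matching coverage passes silently; a short write exits non-zero.
assert_bulk_coverage("enrichment_009c_bulk_ranks", "f480597a", 2245,
                     lambda sql: 2245)
```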
Fix 5 — Log gold_domains drift at startup. The 7% drop from 2,423 (backfill) to 2,245 (cron fires) suggests gold_domains shrank after a Phase 003 rerun. Going forward, the opposite is more likely: new firms get added to V_CONFIRMED, and Phase 009 should pick them up automatically. Add a startup log line to each 009 step:
Input count: 2,245 firms. New since last run: 12. Removed since last run: 0.
Helps operators notice gold_domains drift across runs without having to query BQ. Uses the previous run’s enrichment_009X as the diff baseline.
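The drift line is just a set difference between this run's input and the previous run's coverage. A minimal sketch (function name and sample domains are illustrative):

```python
def domain_drift(current, previous):
    """Diff this run's input domains against the previous run's enrichment rows."""
    cur, prev = set(current), set(previous)
    return {"input": len(cur),
            "new": len(cur - prev),       # in gold_domains now, absent last run
            "removed": len(prev - cur)}   # covered last run, gone now

drift = domain_drift(["a.com", "b.com", "c.com"],   # this run's gold_domains slice
                     ["a.com", "b.com", "d.com"])   # previous run's coverage
print(f"Input count: {drift['input']} firms. "
      f"New since last run: {drift['new']}. "
      f"Removed since last run: {drift['removed']}.")
# -> Input count: 3 firms. New since last run: 1. Removed since last run: 1.
```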
After all 5 fixes
The biweekly bulks (009c-h) continue firing twice a week with row-count assertions catching any future silent failures. The monthly cuts (009a + 009b) actually fire on the 1st of each month, producing fresh per-link detail for $111 per fire. New gold domains added by Phase 003+004 reruns get visibly logged on the next fire. Phase 009 stays the best-behaved phase in the pipeline.
Then we move on to Phase 010 (Keywords — the final phase, mixed monthly + weekly + biweekly).