pipeline.amicusdata.dev — Phase 006: Domain Intel

PHASE 006

Domain Intel

Steps 006a → 006b · Monthly tier · Output: registration metadata + tech stack per firm

The Spec

How Phase 006 is supposed to behave.

Phase 006 — Domain Intel. Design intent. Reference, not reality.

Duty: For every confirmed firm, look up who owns the domain and when it was registered (006a WHOIS) and what tech stack the site is built on (006b Technologies). Both signals matter for SEO modeling: old domains rank better, and tech stack tells us whether the site is a WordPress template or a custom build.

Schedule: Monthly tier — designed to fire once per 30 days.

End state: Two BigQuery enrichment tables populated with fresh rows: enrichment_006a_whois (registration date, registrar, expiry) and enrichment_006b_technologies (detected CMS, JS frameworks, analytics tags, ad pixels, contact info).

What Phase 006 does, plain English

Domain age and tech stack are two of the cheapest, most durable signals about a law firm’s online sophistication. Phase 006 captures both in two DFS API calls:

006a WHOIS — When was the domain first registered? Who registered it? When does it expire? Old domains are a Google trust signal. A firm with a 20-year-old domain ranks higher than one with a 6-month-old domain, all else equal.
006b Technologies — What CMS does the site use (WordPress / Wix / Squarespace / custom)? What JavaScript frameworks (React, jQuery)? What analytics or ad pixels (Google Analytics, Meta Pixel, etc.)? Plus contact info and social-media links scraped from the homepage.

Both API calls are one-shot per firm — no crawling, no polling. Each costs pennies. Combined Phase 006 cost is dominated by 006b (which forces one domain per POST call, no batching).

The 2 sub-steps

Step	What it does	Provider	Reads	Writes
`006a`	WHOIS Overview lookup — registration date, registrar, expiry, status. Batches up to 1,000 domains per API call via the `in` filter.	DataForSEO Domain Analytics	`gold_domains` (V_CONFIRMED filter, primary domains)	`enrichment_006a_whois`
`006b`	Domain Technologies lookup — detected CMS, JS frameworks, analytics tags, ad pixels, contact info, social-media URLs. One domain per POST call (DFS endpoint limitation).	DataForSEO Domain Analytics	`gold_domains` (V_CONFIRMED filter, primary domains)	`enrichment_006b_technologies`
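A minimal sketch of the 006a batching, for orientation only. The `chunk_list` below is a stand-in for the shared helper of the same name (its real signature is not shown on this page), and the request body shape (a one-element task array carrying `filters` and `limit`) is an assumption to check against the DFS docs and `step_006a_whois.py`.

```python
from typing import Iterator

WHOIS_ENDPOINT = "/v3/domain_analytics/whois/overview/live"
MAX_DOMAINS_PER_CALL = 1_000  # batch ceiling quoted in the table above


def chunk_list(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items (stand-in helper)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_whois_payload(domains: list[str]) -> list[dict]:
    """Build one POST body covering a whole batch of domains.

    Assumption: the endpoint takes a task array, and `limit` must be raised
    so every domain matched by the `in` filter comes back in one response.
    """
    return [{"filters": ["domain", "in", domains], "limit": len(domains)}]


def plan_whois_calls(domains: list[str]) -> list[list[dict]]:
    """Return one payload per API call for the full domain list."""
    return [build_whois_payload(batch)
            for batch in chunk_list(domains, MAX_DOMAINS_PER_CALL)]
```

For 2,247 domains this plans three payloads: two full batches of 1,000 and one of 247.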

006a vs 006b call volume — very different shapes

The two sub-steps look similar but have radically different API call patterns:

Aspect	006a WHOIS	006b Technologies
Endpoint	`/v3/domain_analytics/whois/overview/live`	`/v3/domain_analytics/technologies/domain_technologies/live`
Batching	up to 1,000 domains per POST via `filters: ["domain", "in", [...]]`	1 domain per POST — no batching, by DFS design
API calls for 2,247 firms	3 calls	2,247 calls
Cost model	$0.10 per task + $0.001 per record returned	$0.01 per domain
Total cost for 2,247 firms	~$2.55	~$22.47
Rate limit pacing	3 calls is trivial	1,600 RPM (80% of 2,000 cap) — ~85 sec to complete
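The numbers in this table can be reproduced from the cost constants and the firm count. A small sketch, assuming the constants quoted on this page are the only inputs (the function name and dict keys are illustrative, not taken from the scripts):

```python
import math

WHOIS_BATCH_SIZE = 1_000     # 006a: max domains per POST
WHOIS_COST_PER_TASK = 0.10   # 006a: per POST
WHOIS_COST_PER_ITEM = 0.001  # 006a: per record returned
TECH_COST_PER_TASK = 0.01    # 006b: per domain, one POST each
TECH_RPM = 1_600             # 006b: pacing, 80% of the 2,000 RPM cap


def phase_006_estimate(firms: int) -> dict:
    """Estimate calls, cost (USD) and 006b wall time for one fire."""
    whois_calls = math.ceil(firms / WHOIS_BATCH_SIZE)
    whois_cost = whois_calls * WHOIS_COST_PER_TASK + firms * WHOIS_COST_PER_ITEM
    tech_calls = firms
    tech_cost = tech_calls * TECH_COST_PER_TASK
    return {
        "whois_calls": whois_calls,
        "whois_cost": whois_cost,
        "tech_calls": tech_calls,
        "tech_cost": tech_cost,
        "tech_seconds": tech_calls / TECH_RPM * 60,
        "total_cost": whois_cost + tech_cost,
    }


if __name__ == "__main__":
    for key, value in phase_006_estimate(2_247).items():
        print(f"{key}: {value:.2f}")
```

The same function at 12,000 firms gives the scale-target figures without redoing the arithmetic by hand.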

How the data moves

Input · Phase 004 deliverable

gold_domains

Filter: final_verdict in {V_CONFIRMED, V_LIKELY}
+ is_domain_primary = TRUE

↓

006a · WHOIS batch lookup

step_006a_whois.py

Chunk domains into batches of 1,000
3 DFS calls for ~2,247 firms
Parse: registration_date, registrar, expiry

006b · per-domain tech lookup

step_006b_technologies.py

2,247 DFS calls, 1 domain each
1,600 RPM pacing → ~85 sec wall time
Parse: CMS, JS, analytics, pixels, contacts

↓

Output · WHOIS

amicus_pipeline.enrichment_006a_whois

One row per firm: domain, created_date,
registrar, expiry_date, name_servers

Output · Tech stack

amicus_pipeline.enrichment_006b_technologies

One row per firm: detected_technologies,
contact_emails, social_urls, phone_numbers

↓

Hands off to

Phase 007 — Google Business Profile

(out of scope for this page)

Where to look — file & table reference

Thing	Path or table
The 2 scripts	`/mnt/workspace/amicus/pipeline/steps/006_domain_intel/step_006[ab]_*.py`
006a output	`amicus_pipeline.enrichment_006a_whois`
006b output	`amicus_pipeline.enrichment_006b_technologies`
Per-step logs	`pipeline/steps/000_log_files/step_006_.log`
API keys	`DATAFORSEO_USERNAME` · `DATAFORSEO_PASSWORD` in `.env`
Shared DFS utilities	`pipeline/steps/shared/dfs_common.py` — `load_gold_records`, `chunk_list`, `dfs_request_with_retry`

Cost per fire

Phase 006 is mid-priced — not as cheap as 002 but nothing like Phase 005’s deep crawl. 006b dominates because it can’t batch.

Line item	Volume	Per unit	Subtotal
006a — WHOIS task POSTs (batched up to 1,000/call)	~3 calls	$0.10	~$0.30
006a — WHOIS records returned ($0.001 per record)	~2,247	$0.001	~$2.25
006b — Technologies per-domain ($0.01 per domain)	~2,247	$0.01	~$22.47
Total per monthly fire			~$25

V_CONFIRMED count = 2,247 unique domains for atty_wa_seattle in run atty_production_001, queried 2026-05-16 from enrichment_004a_specialties. Cost constants from step_006a_whois.py (COST_PER_TASK = 0.10, COST_PER_ITEM = 0.001) and step_006b_technologies.py (COST_PER_TASK = 0.01).

Schedule

Frequency

Monthly tier — designed to fire once per 30 days.

Trigger

Orchestrator cron with --cadence monthly.

Execution mode

006a and 006b are independent and can parallelize. Both read the same gold list.

Concurrency

006a: serial 3-call batch loop. 006b: 1,600 RPM strict pacing.

Wall time

006a finishes in seconds. 006b takes ~85 sec for 2,247 domains at 1,600 RPM.

Output guarantee

Every V_CONFIRMED domain gets a WHOIS row and a technologies row. Domains that DFS has no record of still land as rows, with status_code != 20000 and an error_message.
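The 1,600 RPM strict pacing for 006b could look something like the sketch below. It is not the code in `step_006b_technologies.py`; the `paced` helper and its injectable `clock` and `sleep` parameters are hypothetical, added so the pacing can be checked without real waiting.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")

TARGET_RPM = 1_600  # 80% of the 2,000 RPM cap


def paced(
    items: Iterable[T],
    rpm: int = TARGET_RPM,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Yield items so that consecutive yields start at least 60/rpm sec apart."""
    interval = 60.0 / rpm
    last_start = None
    for item in items:
        if last_start is not None:
            # Wait out whatever is left of the interval since the last start.
            wait = last_start + interval - clock()
            if wait > 0:
                sleep(wait)
        last_start = clock()
        yield item
```

Usage would be `for domain in paced(domains): post_one(domain)`, where `post_one` is whatever issues the single-domain POST. Spacing request starts (rather than sleeping a fixed amount after each call) means slow responses do not stretch the run beyond the ~85 sec floor more than they have to.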

What's Fucked

Phase 006 is not running the spec. Here’s exactly how.

Forensic findings, 2026-05-16. Same monthly-cron-isn’t-monthly gap as Phase 005, plus 006-specific coverage gaps.

Finding 1 — Only ONE run has ever populated 006 tables. It was a manual backfill in April.

Direct query of BQ as of 2026-05-16:

Table	Unique domains (atty_wa_seattle)	Run ID	Last write
`enrichment_006a_whois`	1,409	`backfill_20260420`	2026-04-20
`enrichment_006b_technologies`	1,460	`backfill_20260420`	2026-04-20

One run, 26 days ago. Not a scheduled fire — a manual backfill_20260420. No subsequent monthly cron has produced a fresh row in either table.

Finding 2 — The one run had partial coverage: 63-65% of V_CONFIRMED firms.

Phase 004 has classified 2,247 firms as V_CONFIRMED. The 006a backfill landed rows for 1,409 of them; 006b landed rows for 1,460. So:

006a coverage: 1,409 / 2,247 = 63%. 838 firms have NO WHOIS data.
006b coverage: 1,460 / 2,247 = 65%. 787 firms have NO tech-stack data.

Possible explanations (not yet confirmed):

DFS has no WHOIS data for some domains (privacy services, recently transferred, GDPR-redacted) — some legitimate misses.
The backfill ran against a smaller V_CONFIRMED slice (e.g. an older Phase 003 output) before Phase 004 was rerun and added more firms.
Per-domain failures in the original run that were never retried.

Either way: 37% of confirmed firms have no WHOIS row and 35% have no tech-stack row. Any downstream analytics or scoring model that uses domain age or CMS is operating on a table that is only 63-65% populated.

Finding 3 — If 006 fires on the broken cron, it’s bleeding ~$200-225/month.

Like Phase 005, Phase 006 is tagged monthly in cadence.py but the cron expression 30 10 * * 1,4 fires every Monday + Thursday. That’s ~8-9 fires per month at ~$25 per fire = ~$200-225/month.
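The 8-9 fires figure follows from counting Mondays and Thursdays in a calendar month. A quick sketch using only the standard library; the ~$25 per fire comes from the cost table, and the function names are illustrative:

```python
import calendar

# Cron day-of-week field "1,4" means Monday and Thursday.
CRON_WEEKDAYS = {calendar.MONDAY, calendar.THURSDAY}
COST_PER_FIRE_USD = 25.0  # approximate per-fire cost from the cost table


def fires_in_month(year: int, month: int) -> int:
    """Count the days in a month on which the Mon + Thu cron would fire."""
    _, days_in_month = calendar.monthrange(year, month)
    return sum(
        1
        for day in range(1, days_in_month + 1)
        if calendar.weekday(year, month, day) in CRON_WEEKDAYS
    )


def monthly_bleed_usd(year: int, month: int) -> float:
    """Worst-case spend if every cron fire actually hits the DFS API."""
    return fires_in_month(year, month) * COST_PER_FIRE_USD
```

May 2026 has 8 such days and June 2026 has 9, which brackets the ~$200-225 range.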

The empirical data above (only 1 run since April 20) suggests the cron isn’t actually firing 006 successfully — otherwise we’d see fresh rows from every Mon + Thu in BQ. Whether that’s because:

006 isn’t in the active cadence the orchestrator picks up, OR
The orchestrator invokes 006 but the BQ write silently skips (recurring PIPELINE_RUN_ID gotcha), OR
006 errors out and the orchestrator logs but doesn’t retry

— is unverified. The bleed estimate above assumes the worst case (the API calls are happening, just nothing’s landing). If 006 is silently no-op’ing the cost is $0 but the data gap remains.

Finding 4 — 006b can’t be batched, and that’s a real cost driver as the market grows.

At 2,247 firms, 006b costs ~$22.47. At 12,000 firms (the documented scale target), 006b would cost ~$120 per fire. If the “monthly” cron stays broken and fires 8-9× per month, the bill at scale is ~$1,000/month for tech-stack alone. The DFS endpoint design forces one-domain-per-POST so there’s no Amicus-side optimization that fixes this.

Finding 5 — No retry path for partial coverage.

If a future Phase 003 run grows V_CONFIRMED by, say, 500 new firms, there’s no “backfill only the new ones” flag on the 006 scripts, as far as we’ve verified. A normal fire would re-WHOIS and re-Tech every firm, including the ones we already have data for, paying the full cost again. The shared load_gold_records helper doesn’t appear to dedupe against the existing enrichment table.

That’s acceptable for monthly cadence (you want fresh data anyway — tech stacks change), but if we ever go more frequent, “diff and only fetch new” is the obvious optimization.

The bottom line

Where Phase 006 Stands Today

One manual backfill on 2026-04-20 produced partial coverage (63-65% of V_CONFIRMED firms). No scheduled fire has refreshed the tables since. 838 firms have no WHOIS data; 787 have no tech-stack data. Either the cron isn’t firing 006, or it’s firing and silently failing the BQ write — the empirical evidence (zero fresh rows in 4 weeks) doesn’t distinguish between those.

The Fix

What we’ll do to make Phase 006 match the spec.

Concrete remediation. Cheap fixes first.

Six fixes. The first is a cheap diagnostic: figure out why 006 hasn’t fired since April before throwing new code at it. The second is a cheap backfill of the rows that are missing today.

FIX 1 Diagnose why 006 hasn’t fired since 2026-04-20. ~30 min

Three checks, in order:

cadence.py: confirm 006a and 006b are tagged monthly and the monthly tier is in the orchestrator’s active set for the 30 10 * * 1,4 cron.
amicus_logs.step_logs: query for any 006a/006b rows since 2026-04-20. If rows exist with status=completed but no enrichment rows landed, the BQ-write gotcha is the cause.
amicus_logs.api_cost_log: query for DFS spend on domain_analytics/whois and domain_analytics/technologies endpoints since 2026-04-20. If spend exists but enrichment tables didn’t grow, money was burned for no output.

This is purely investigation. The right next fix depends on what it finds.
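One way to read the three checks together is as a small decision table. The sketch below assumes the enrichment tables are already confirmed stale (zero fresh rows since 2026-04-20) and takes the counts from the two log queries as plain numbers; the function name and the cause labels are made up for illustration.

```python
def diagnose_006(
    step_rows_total: int,
    step_rows_completed: int,
    dfs_spend_usd: float,
) -> tuple[str, bool]:
    """Map the Fix 1 query results to a likely cause.

    Returns (cause, money_burned). Inputs are counts and spend since
    2026-04-20 for steps 006a/006b.
    """
    # Any DFS spend with no fresh rows means money went out for no output.
    money_burned = dfs_spend_usd > 0
    if step_rows_total == 0:
        # The orchestrator never logged 006: cadence or cron selection issue.
        cause = "not_invoked"
    elif step_rows_completed > 0:
        # Steps claim success but nothing landed: the BQ-write gotcha.
        cause = "bq_write_skipped"
    else:
        # Steps were invoked but never completed and were not retried.
        cause = "errored_no_retry"
    return cause, money_burned
```

For example, `diagnose_006(16, 16, 400.0)` points at the silent BQ write skip with money burned, while `diagnose_006(0, 0, 0.0)` points at 006 never being picked up by the cron at all.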

FIX 2 Backfill the 35-37% coverage gap. ~20 min run + ~30 min code if needed

For the 838 firms missing WHOIS and 787 missing technologies: identify the gap with a query, then re-run only the missing slice.

SELECT g.domain
FROM gold_domains g
LEFT JOIN enrichment_006a_whois w USING (domain, profile_id)
WHERE g.profile_id = 'atty_wa_seattle'
  AND g.final_verdict IN ('V_CONFIRMED', 'V_LIKELY')
  AND g.is_domain_primary = TRUE
  AND w.domain IS NULL

Cost: 006a is one batched task ($0.10) plus 838 × $0.001 = ~$0.84 in records, so ~$0.94. 006b is 787 × $0.01 = ~$7.87. Total backfill cost ~$9. Cheaper than a full fire.

If the scripts don’t support a domain-list input flag today, add one. Don’t hack around it by re-running the full pipeline.

FIX 3 Inherit Phase 005 Fix 1 — once monthly cron is actually monthly, 006 fires correctly. no work — lands with Phase 005 Fix 1

Phase 005 Fix 1 splits the cron into monthly, weekly, biweekly entries. After that, 006 fires once per month on the 1st — same cadence as the rest of the monthly tier.

No 006-specific cron line. Don’t add one.

FIX 4 Fail-loud on BQ write skip. ~15 min

Same fix pattern as Phase 002/003/004/005 — if PIPELINE_RUN_ID is missing, exit non-zero with an explicit error instead of silently no-op’ing the enrichment write. Apply to both step_006a_whois.py and step_006b_technologies.py.
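A minimal sketch of the fail-loud guard, assuming PIPELINE_RUN_ID reaches the scripts as an environment variable (this page does not say how it is passed). The `require_run_id` name is hypothetical.

```python
import os
import sys
from typing import Mapping


def require_run_id(env: Mapping[str, str] = os.environ) -> str:
    """Return PIPELINE_RUN_ID, or exit non-zero if it is missing or blank."""
    run_id = env.get("PIPELINE_RUN_ID", "").strip()
    if not run_id:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(
            "FATAL: PIPELINE_RUN_ID is not set. Refusing to run, because the "
            "enrichment write would be skipped."
        )
    return run_id
```

Calling this at the top of `main()`, before any DFS request, means a misconfigured fire costs nothing and shows up as a failed step rather than a “completed” one.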

FIX 5 Add a per-fire coverage assertion. ~20 min

After each script completes, assert:

006a: enrichment_006a_whois row count for this run_id ≥ 90% of input domain count. (10% slack for genuine DFS-no-record cases.)
006b: enrichment_006b_technologies row count for this run_id ≥ 95% of input domain count. (5% slack — tech-stack detection is more reliable than WHOIS.)

If the assertion fails, exit non-zero with the specific shortfall. Prevents a future silent partial-coverage run from being marked “completed.”
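The assertion could be as small as the sketch below. Thresholds are the ones stated above; `CoverageError`, `required_rows` and `assert_coverage` are illustrative names, and the row count is assumed to come from a count query against the enrichment table for the current run_id.

```python
MIN_COVERAGE_PCT = {"006a": 90, "006b": 95}


class CoverageError(RuntimeError):
    """Raised when a step wrote fewer rows than its coverage floor."""


def required_rows(step: str, input_count: int) -> int:
    """Smallest row count that meets the step's threshold (integer ceiling)."""
    return (MIN_COVERAGE_PCT[step] * input_count + 99) // 100


def assert_coverage(step: str, input_count: int, rows_written: int) -> None:
    """Raise CoverageError with the specific shortfall if coverage is too low."""
    needed = required_rows(step, input_count)
    if rows_written < needed:
        raise CoverageError(
            f"{step}: wrote {rows_written} of {input_count} input domains; "
            f"need at least {needed} ({MIN_COVERAGE_PCT[step]}%), "
            f"short by {needed - rows_written}"
        )
```

Run against the April backfill numbers, `assert_coverage("006a", 2247, 1409)` would have failed with a shortfall of 614 rows instead of passing silently. The script's entry point would catch `CoverageError`, log it, and exit non-zero.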

FIX 6 Add a --missing-only flag to both scripts. ~45 min

For incremental backfills (Fix 2 use case, future V_CONFIRMED growth), add a flag that queries the corresponding enrichment table first and only POSTs for domains missing from it.

python step_006a_whois.py --missing-only --input /path/to/gold.json

Avoids paying twice for data we already have. Useful when Phase 003 + 004 reruns add new V_CONFIRMED firms mid-month.
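A sketch of how the proposed flag might be wired in. Neither `--missing-only` nor this `select_domains` helper exists in the scripts today; the set of already-enriched domains is assumed to come from a query against the corresponding enrichment table.

```python
import argparse
from typing import Iterable, Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the CLI, including the proposed --missing-only flag."""
    parser = argparse.ArgumentParser(description="Phase 006 step (sketch)")
    parser.add_argument("--input", required=True, help="path to the gold list")
    parser.add_argument(
        "--missing-only",
        action="store_true",
        help="only fetch domains with no row in the enrichment table",
    )
    return parser.parse_args(argv)


def select_domains(
    gold_domains: Iterable[str],
    already_enriched: Iterable[str],
    missing_only: bool,
) -> list[str]:
    """Return the domains to POST, preserving the gold list order."""
    if not missing_only:
        return list(gold_domains)
    # Compare case-insensitively so "Example.com" and "example.com" match.
    seen = {domain.lower() for domain in already_enriched}
    return [domain for domain in gold_domains if domain.lower() not in seen]
```

Without the flag the behavior is unchanged (every gold domain is fetched), so the monthly fire keeps refreshing all rows and only ad hoc backfills take the cheaper path.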

After all 6 fixes

The monthly cron from Phase 005 Fix 1 fires Phase 006 once per month against fresh gold. Coverage hits > 90% on first try. The 35-37% historical gap is closed via Fix 2 backfill. Future Phase 003 V_CONFIRMED additions get backfilled via --missing-only instead of re-firing the whole phase. Bad runs fail loud instead of silently producing partial tables.

Then we move on to Phase 007 (Google Business Profile — weekly tier, where the cron cadence actually matters).

Generated 2026-05-16 from /mnt/workspace/amicus/site_pipeline_amicusdata/ on amicus-dev VM.