Pipeline overview
10 phases · 52 sub-steps · everything that runs, in one view.
Data Acquisition
Fetch raw Maps SERPs for ~17 keywords × ~18 coordinates per market. The seed data.
001bSeparate paid-ad rows from organic-result rows in the raw bronze table.
001cDrop duplicate Google Business listings by Customer ID. One row per real listing.
001dDrop any result whose physical address falls outside the target geographic boundary.
001eDrop listings that don’t have a website URL — there’s nothing to enrich.
001fLowercase domain strings, strip www., strip trailing slashes — canonical form.
Drop directory sites (avvo, findlaw, justia, etc.) — listings of firms, not firms.
001hOne last dedup pass so every candidate domain appears exactly once.
001sPromote the cleaned bronze rows to the silver table. The Phase 001 deliverable.
Domain Verification
Domain Classification
gold_domains.Three-field match of GMaps category data against 35 attorney categories. Tags each CID in_vertical or not_in_vertical.
Recover content from sites DFS couldn’t crawl. Headless browser + stealth, httpx fallback.
003cPass 1 Haiku → Pass 2 Haiku (+body) → Pass 3 Sonnet. Escalates only when ambiguous.
003dJoin domain verdicts back onto CID-grain table. Unseen domains tagged NO_DATA or DNS_DEAD with a reason.
Specialties
Single-pass Haiku. For each V_CONFIRMED firm: pick primary + secondary specialties from 24 categories, plus a one-line firm summary.
004bApply NON_V_OVERRIDE post-filter. Compute search_keyword from primary_specialty. Split CID-grain into 3 BQ gold tables.
OnPage
Crawl every page of every firm’s site: HTML, headers, status codes, structure. The heavy expensive step.
005bSummarize each crawl: page count, average load time, error counts. Derived from 005a.
005cPer-URL detail row: title, meta description, H1, word count.
005dInternal & external link graph extracted from the crawl.
005eFind pages that share <title> or <meta description> — SEO duplicate-content bugs.
Find multi-hop redirects and redirect loops.
005gFind pages blocked from Google (noindex, robots.txt disallow).
Google Lighthouse scores: performance, accessibility, SEO, best practices.
005iReal-user Core Web Vitals (LCP / CLS / INP) from Google’s field-data dataset.
005jSend each firm’s “Our Team” / “Attorneys” page to Haiku. Count the attorneys.
Domain Intel
Google Business Profile
SERP Analysis
20-point Maps SERP grid around each firm: 1 center + 6 inner hex + 13 outer ring. Source for local-dominator heatmaps.
008bWho outranks each firm on Google organic for their target keywords.
008cGoogle’s expanded local pack — beyond the visible 3-pack.
008dGoogle autocomplete suggestions when typing the firm’s name or category.
008eUniform 1.3-mile grid across the entire geofence × 8 fixed practice-area keywords. Catches firms missing from gold.
Backlinks
Per firm: total backlinks, total referring domains, domain rank.
The full list of live URLs pointing at each firm.
DFS authority score per firm. Fast bulk endpoint.
Total backlinks count per firm. Fast bulk endpoint.
DFS spam score per firm. Flags low-quality backlink profiles.
Count of unique referring domains per firm.
Referring domains gained or lost in the last period — who started or stopped linking.
Total pages of each firm’s site indexed by Google.
Keywords
Every keyword each firm ranks for: position, volume, traffic estimate, CPC.
Estimated organic traffic per firm. Fast bulk endpoint.
High-level organic visibility score per firm.
Difficulty scores for the tracked keywords.
Semantically related keywords per tracked term.
Autocomplete-style expansions per tracked term.
Long-tail keyword ideas adjacent to the tracked set.
Topical categories each firm covers (auto law / family law / personal injury / etc.).
Overall keyword performance summary per firm.