Pipeline overview
10 phases · 52 sub-steps · everything that runs, in one view.
Data Acquisition
Fetch raw Maps SERPs for ~17 keywords × ~18 coordinates per market. The seed data.
001bSeparate paid-ad rows from organic-result rows in the raw bronze table.
001cDrop duplicate Google Business listings by Customer ID. One row per real listing.
001dDrop any result whose physical address falls outside the target geographic boundary.
001eDrop listings that don’t have a website URL — there’s nothing to enrich.
001fLowercase domain strings, strip www., strip trailing slashes — canonical form.
Drop directory sites (avvo, findlaw, justia, etc.) — listings of firms, not firms.
001hOne last dedup pass so every candidate domain appears exactly once.
001sPromote the cleaned bronze rows to the silver table. The Phase 001 deliverable.
Domain Verification
Domain Classification
gold_domains.Three-field match of GMaps category data against 35 attorney categories. Tags each CID in_vertical or not_in_vertical.
Recover content from sites DFS couldn’t crawl. Headless browser + stealth, httpx fallback.
003cPass 1 Haiku → Pass 2 Haiku (+body) → Pass 3 Sonnet. Escalates only when ambiguous.
003dJoin domain verdicts back onto CID-grain table. Unseen domains tagged NO_DATA or DNS_DEAD with a reason.
Specialties
Single-pass Haiku. For each V_CONFIRMED firm: pick primary + secondary specialties from 24 categories, plus a one-line firm summary.
004bApply NON_V_OVERRIDE post-filter. Compute search_keyword from primary_specialty. Split CID-grain into 3 BQ gold tables.
OnPage
Crawl every page of every firm’s site: HTML, headers, status codes, structure. The heavy expensive step.
005bSummarize each crawl: page count, average load time, error counts. Derived from 005a.
005cPer-URL detail row: title, meta description, H1, word count.
005dInternal & external link graph extracted from the crawl.
005eFind pages that share <title> or <meta description> — SEO duplicate-content bugs.
Find multi-hop redirects and redirect loops.
005gFind pages blocked from Google (noindex, robots.txt disallow).
Google Lighthouse scores: performance, accessibility, SEO, best practices.
005iReal-user Core Web Vitals (LCP / CLS / INP) from Google’s field-data dataset.
005jSend each firm’s “Our Team” / “Attorneys” page to Haiku. Count the attorneys.
Domain Intel
Google Business Profile
SERP Analysis
20-point Maps SERP grid around each firm: 1 center + 6 inner hex + 13 outer ring. Source for local-dominator heatmaps.
008bWho outranks each firm on Google organic for their target keywords.
008cGoogle’s expanded local pack — beyond the visible 3-pack.
008dGoogle autocomplete suggestions when typing the firm’s name or category.
008eUniform 1.3-mile grid across the entire geofence × 8 fixed practice-area keywords. Catches firms missing from gold.
Backlinks
Per firm: total backlinks, total referring domains, domain rank.
009bThe full list of live URLs pointing at each firm.
009cDFS authority score per firm. Fast bulk endpoint.
009dTotal backlinks count per firm. Fast bulk endpoint.
009eDFS spam score per firm. Flags low-quality backlink profiles.
009fCount of unique referring domains per firm.
009gReferring domains gained or lost in the last period — who started or stopped linking.
009hTotal pages of each firm’s site indexed by Google.
Keywords
Every keyword each firm ranks for: position, volume, traffic estimate, CPC.
010bEstimated organic traffic per firm. Fast bulk endpoint.
010cHigh-level organic visibility score per firm.
010dDifficulty scores for the tracked keywords.
010eSemantically related keywords per tracked term.
010fAutocomplete-style expansions per tracked term.
010gLong-tail keyword ideas adjacent to the tracked set.
010hTopical categories each firm covers (auto law / family law / personal injury / etc.).
010iOverall keyword performance summary per firm.