Pipeline overview
10 phases · 52 sub-steps · everything that runs, in one view.
Data Acquisition
Fetch raw Maps SERPs for ~17 keywords × ~18 coordinates per market. The seed data.
001b Separate paid-ad rows from organic-result rows in the raw bronze table.
001c Drop duplicate Google Business listings by Customer ID. One row per real listing.
001d Drop any result whose physical address falls outside the target geographic boundary.
001e Drop listings that don’t have a website URL; there’s nothing to enrich.
001f Lowercase domain strings, strip www., strip trailing slashes: the canonical form.
Drop directory sites (avvo, findlaw, justia, etc.) — listings of firms, not firms.
001h One last dedup pass so every candidate domain appears exactly once.
001s Promote the cleaned bronze rows to the silver table. The Phase 001 deliverable.
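The website-URL, normalization, directory-filter, and dedup steps above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the function names and the `DIRECTORY_DOMAINS` blocklist are hypothetical (the source names only avvo, findlaw, and justia), and reducing each URL to its bare hostname is an assumption about what "canonical form" means here.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist; the source names only avvo, findlaw and justia.
DIRECTORY_DOMAINS = {"avvo.com", "findlaw.com", "justia.com"}


def normalize_domain(url: str) -> str:
    """Lowercase the host, strip a leading 'www.', drop path and trailing slash."""
    url = url.strip().lower()
    if "://" not in url:
        url = "http://" + url  # urlsplit needs a scheme to locate the host
    host = urlsplit(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host


def is_directory(domain: str) -> bool:
    """True if the domain is a blocklisted directory or one of its subdomains."""
    return any(domain == d or domain.endswith("." + d) for d in DIRECTORY_DOMAINS)


def candidate_domains(urls):
    """Return unique, normalized, non-directory domains in first-seen order."""
    seen = set()
    out = []
    for url in urls:
        if not url:
            continue  # no website URL, nothing to enrich
        domain = normalize_domain(url)
        if not domain or is_directory(domain) or domain in seen:
            continue
        seen.add(domain)
        out.append(domain)
    return out
```

Normalizing before deduplicating matters: `https://www.smith-law.com/` and `http://smith-law.com` only collapse to one candidate once both are in canonical form.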
Domain Verification
Hit each candidate domain over HTTP. Confirm DNS resolves and the server returns content.
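A rough sketch of such a check, using only the Python standard library. The function names, the result dictionary, the user-agent string, and the "non-empty body" rule are all assumptions made for illustration; the real step may use a different client, follow HTTPS first, or apply stricter content checks.

```python
import http.client
import socket
import urllib.request


def dns_resolves(domain: str) -> bool:
    """True if the system resolver returns at least one address."""
    try:
        socket.getaddrinfo(domain, 80)
        return True
    except (socket.gaierror, UnicodeError):
        return False


def fetch_status_and_size(domain: str, timeout: float = 10.0):
    """Return (HTTP status, bytes read from the first 4 KiB of the body)."""
    req = urllib.request.Request(
        f"http://{domain}/",
        headers={"User-Agent": "pipeline-check/0.1"},  # hypothetical UA string
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, len(resp.read(4096))


def verify_domain(domain, resolve=dns_resolves, fetch=fetch_status_and_size):
    """Classify a domain as verified or not, with a short reason code."""
    if not resolve(domain):
        return {"domain": domain, "ok": False, "reason": "dns"}
    try:
        status, size = fetch(domain)
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError (4xx/5xx) and timeouts are all OSError subclasses.
        return {"domain": domain, "ok": False, "reason": "http_error"}
    if size == 0:
        return {"domain": domain, "ok": False, "reason": "empty_body"}
    return {"domain": domain, "ok": True, "reason": f"http_{status}"}
```

The resolver and fetcher are passed in as parameters so the classification logic can be exercised without touching the network.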
Domain Classification
Cheap string-match: does the domain contain known PI-attorney keywords? Skip the AI on obvious matches.
Load each homepage in a real browser. Necessary for JS-heavy sites that don’t render via plain HTTP.
Ask Claude Haiku: “is this actually a PI law firm?” Yes/no per domain.
Insert the Haiku-confirmed PI firms into the authoritative gold_domains table.
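The cheap string-match pre-filter might look like the sketch below. The keyword list is hypothetical (the source does not list the actual terms), and the Haiku call itself is deliberately left out: only domains in the second bucket would go on to the browser render and the model check.

```python
# Hypothetical keyword list; the real pipeline's terms are not given in the source.
PI_KEYWORDS = ("injury", "accident", "crash", "wreck")


def is_obvious_pi_domain(domain: str) -> bool:
    """True if the domain name contains any known PI-attorney keyword."""
    name = domain.lower()
    return any(keyword in name for keyword in PI_KEYWORDS)


def split_candidates(domains):
    """Partition domains into (obvious matches, domains needing AI review)."""
    obvious, needs_review = [], []
    for domain in domains:
        if is_obvious_pi_domain(domain):
            obvious.append(domain)
        else:
            needs_review.append(domain)
    return obvious, needs_review
```

The point of the split is cost: every domain caught by the substring check is one less browser render and one less model call.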
Specialties
Scrape each firm’s site for the practice areas they advertise (auto accidents, slip-and-fall, med-mal, etc.).
Write the detected specialty tags back into the gold_domains row for each firm.
OnPage
Crawl every page of every firm’s site: HTML, headers, status codes, structure. The heavy expensive step.
Summarize each crawl: page count, average load time, error counts. Derived from 005a.
Per-URL detail row: title, meta description, H1, word count.
Internal & external link graph extracted from the crawl.
Find pages that share <title> or <meta description> — SEO duplicate-content bugs.
Find multi-hop redirects and redirect loops.
Find pages blocked from Google (noindex, robots.txt disallow).
Google Lighthouse scores: performance, accessibility, SEO, best practices.
Real-user Core Web Vitals (LCP / CLS / INP) from Google’s field-data dataset.
Send each firm’s “Our Team” / “Attorneys” page to Haiku. Count the attorneys.
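As an illustration of the redirect check in this phase, the sketch below walks a redirect map and separates multi-hop chains from loops. It assumes the crawl has already been reduced to a plain `source -> target` dictionary; that shape, the function names, and the `max_hops` cap are assumptions, not the pipeline's actual schema.

```python
def trace_redirect(start, redirects, max_hops=10):
    """Follow `start` through the redirect map.

    Returns (chain, is_loop), where chain lists every URL visited in order.
    """
    chain = [start]
    seen = {start}
    current = start
    while current in redirects and len(chain) <= max_hops:
        current = redirects[current]
        chain.append(current)
        if current in seen:
            return chain, True  # came back to a URL already visited
        seen.add(current)
    return chain, False


def find_redirect_problems(redirects):
    """Return (multi-hop chains, loops) found in a source -> target map."""
    chains, loops = [], []
    for start in redirects:
        chain, is_loop = trace_redirect(start, redirects)
        if is_loop:
            loops.append(chain)
        elif len(chain) > 2:  # more than one hop before the final URL
            chains.append(chain)
    return chains, loops
```

A loop is reported once per entry point, so a two-URL cycle shows up twice; a real report would probably deduplicate those.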
Domain Intel
Domain registration date, registrar, expiry. Older domains tend to rank better; age is treated as a trust signal.
Detect WordPress / React / jQuery / GA / Facebook Pixel / etc. per domain.
Google Business Profile
Basic fields: address, phone, hours, primary category, rating, review count, photo count.
Pull every Google review: text, rating, date, author, owner reply.
Pull every Google Post / update the firm has published.
SERP Analysis
Maps SERP from a grid of GPS points around each firm. Source for local-dominator heatmaps.
Who outranks each firm on Google organic for their target keywords.
Google’s expanded local pack — beyond the visible 3-pack.
Google autocomplete suggestions when typing the firm’s name or category.
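The grid of GPS points used for the Maps SERP step can be generated with some simple geometry. The sketch below is one way to do it, assuming a square grid centered on the firm and a flat-earth approximation that is fine at city scale but breaks down near the poles; the grid size, spacing, and function name are hypothetical.

```python
import math

KM_PER_DEG_LAT = 111.32  # approximate length of one degree of latitude


def gps_grid(lat, lon, size=5, spacing_km=1.0):
    """Return size*size (lat, lon) points centered on (lat, lon).

    Points are listed row by row, north to south, and west to east
    within each row.
    """
    km_per_deg_lon = KM_PER_DEG_LAT * math.cos(math.radians(lat))
    half = (size - 1) / 2
    points = []
    for row in range(size):
        for col in range(size):
            dlat = (half - row) * spacing_km / KM_PER_DEG_LAT
            dlon = (col - half) * spacing_km / km_per_deg_lon
            points.append((round(lat + dlat, 6), round(lon + dlon, 6)))
    return points
```

Each point then becomes the search origin for one Maps query, and the firm's rank at each point is what a heatmap would color.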
Backlinks
Per firm: total backlinks, total referring domains, domain rank.
The full list of live URLs pointing at each firm.
DFS authority score per firm. Fast bulk endpoint.
Total backlinks count per firm. Fast bulk endpoint.
DFS spam score per firm. Flags low-quality backlink profiles.
Count of unique referring domains per firm.
Referring domains gained or lost in the last period — who started or stopped linking.
Total pages of each firm’s site indexed by Google.
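The gained-or-lost referring domains step boils down to a set difference between two snapshots. A minimal sketch, assuming each snapshot is just a collection of referring domain names (the function name and return shape are hypothetical, and the real data presumably comes from the backlink provider's API):

```python
def referring_domain_delta(previous, current):
    """Compare two snapshots of referring domains for one firm."""
    prev, curr = set(previous), set(current)
    return {
        "gained": sorted(curr - prev),  # started linking this period
        "lost": sorted(prev - curr),    # stopped linking this period
    }
```

For example, comparing `["a.com", "b.com"]` against `["b.com", "c.com"]` reports `c.com` as gained and `a.com` as lost.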
Keywords
Every keyword each firm ranks for: position, volume, traffic estimate, CPC.
Estimated organic traffic per firm. Fast bulk endpoint.
High-level organic visibility score per firm.
Difficulty scores for the tracked keywords.
Semantically related keywords per tracked term.
Autocomplete-style expansions per tracked term.
Long-tail keyword ideas adjacent to the tracked set.
Topical categories each firm covers (auto law / family law / personal injury / etc.).
Overall keyword performance summary per firm.