How Lookalike Domain Detection Works: CT Logs, WHOIS Deltas, and Keyword Scoring


Lookalike domain detection is not a hard problem in the abstract. It's a hard problem at scale, under time pressure, against adversaries who actively study detection logic. A growing financial services brand can accumulate dozens of lookalike domains per month, some registered speculatively by squatters and some staged days before a targeted phishing campaign. At that volume, the gap between "we have a monitoring system" and "we actually catch attacks before they land" comes down to where you pull your signal, how fast you process it, and how precisely you score it.

This post covers the three data layers that together give early and reliable lookalike detection: CT logs, WHOIS delta feeds, and keyword scoring. Not as independent tools, but as a pipeline — and with an honest account of where each layer breaks down.

Layer 1: Certificate Transparency Log Scraping

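The fastest lookalike signal available publicly is the Certificate Transparency (CT) log feed, as defined in RFC 6962. Browser CT policies effectively require publicly trusted TLS certificates to be logged, so a CA typically submits a precertificate to two or more CT logs as part of issuance. Each log returns a signed promise of inclusion immediately and must merge the entry within its maximum merge delay (commonly 24 hours); in practice, entries usually become readable much sooner. The entry carries the certificate's subject and Subject Alternative Name list, so there is no waiting for WHOIS propagation and no DNS query needed. The domain is typically visible to anyone watching the log feed within minutes of cert issuance.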

Brandefense maintains per-log index cursors against all major CT logs — Google Argon, Cloudflare Nimbus, DigiCert Yeti, and Sectigo Mammoth — batching requests at the standard GET /ct/v1/get-entries endpoint in 1,000-entry increments. Each batch is decompressed, the leaf certificate parsed (projecting only CN and SAN fields to avoid storing raw DER), and the resulting domain tokens passed downstream for keyword matching.
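A minimal sketch of the cursor-and-batch step, assuming the RFC 6962 wire format for `leaf_input`. The function names `batch_ranges` and `parse_leaf_input` are illustrative, not part of any library, and this is not a description of Brandefense's actual code. It only recovers the raw certificate bytes; extracting CN and SAN from them would need an X.509 parser, which is left out here.

```python
import base64
import struct

def batch_ranges(cursor, tree_size, batch_size=1000):
    """Yield inclusive (start, end) pairs for get-entries?start=..&end=.. requests."""
    start = cursor
    while start < tree_size:
        end = min(start + batch_size, tree_size) - 1
        yield start, end
        start = end + 1

def parse_leaf_input(leaf_input_b64):
    """Return (timestamp_ms, entry_type, cert_bytes) from a base64 MerkleTreeLeaf."""
    raw = base64.b64decode(leaf_input_b64)
    # version (1 byte), leaf_type (1 byte), timestamp (uint64), entry_type (uint16)
    version, leaf_type, timestamp_ms, entry_type = struct.unpack_from(">BBQH", raw, 0)
    if version != 0 or leaf_type != 0:
        raise ValueError("unsupported MerkleTreeLeaf")
    offset = 12
    if entry_type == 1:
        # precert entries carry a 32-byte issuer_key_hash before the TBSCertificate
        offset += 32
    elif entry_type != 0:
        raise ValueError("unknown entry type")
    length = int.from_bytes(raw[offset:offset + 3], "big")  # 24-bit length prefix
    offset += 3
    cert_bytes = raw[offset:offset + length]
    if len(cert_bytes) != length:
        raise ValueError("truncated entry")
    # entry_type 0: DER certificate; entry_type 1: DER TBSCertificate
    return timestamp_ms, entry_type, cert_bytes
```

Logs are allowed to return fewer entries than requested, so a real worker would advance its cursor by the number of entries actually received rather than by the batch size.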

The critical advantage: CT log monitoring surfaces infrastructure pre-staging. If an adversary registers five domain permutations of your brand in a single afternoon and obtains Let's Encrypt certificates for all five, CT logs will show that cluster before any of those domains is serving a phishing page. (Domain validation means the names already have working DNS, but nothing needs to be hosted on them yet.) Seeing acmebanklogin[.]net, acmebank-secure[.]com, login-acmebank[.]org, and two similar variants appearing within two hours in the same log batch is a pattern that warrants immediate escalation regardless of whether any of them serves content yet.
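The burst pattern described above can be sketched as a sliding time window over brand-matching names. The two-hour window comes from the example; the minimum cluster size of three and the function name `find_burst_domains` are assumptions chosen for illustration.

```python
from collections import deque

def find_burst_domains(events, window_seconds=2 * 3600, min_cluster=3):
    """events: (timestamp_seconds, domain) pairs for brand-matching names, sorted by time.

    Returns the set of domains that appeared as part of a cluster of at least
    min_cluster names within window_seconds of each other.
    """
    window = deque()
    flagged = set()
    for ts, domain in events:
        window.append((ts, domain))
        # drop sightings that are older than the window relative to the newest one
        while ts - window[0][0] > window_seconds:
            window.popleft()
        if len(window) >= min_cluster:
            flagged.update(d for _, d in window)
    return flagged

events = [
    (0, "acmebanklogin.net"),
    (1800, "acmebank-secure.com"),
    (3600, "login-acmebank.org"),
    (100000, "acmebank-help.net"),
]
burst = find_burst_domains(events)
```

Here the first three names fall inside one window and are flagged together, while the isolated fourth name is left to ordinary per-domain scoring.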

Layer 2: WHOIS Delta Feeds

WHOIS delta feeds, used here to mean daily new-registration data derived from the zone files that gTLD registries publish under ICANN policy, are the second layer. Unlike CT logs, they don't require TLS certificate issuance. A domain registered and immediately parked (no HTTPS, no active content) won't generate a CT entry, but once it has been delegated to name servers it will appear in the zone file diff within 24-48 hours for most gTLDs.

ICANN's Centralized Zone Data Service (CZDS) provides access to zone files for participating gTLDs. Processing zone file diffs requires handling significant volume — the .com zone alone adds tens of thousands of new registrations per day — but the extraction is straightforward: compare yesterday's zone file to today's, extract new entries, run brand keyword matching against the new registrations.
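The compare-and-extract step can be sketched as a set difference over delegated names. This assumes zone file lines shaped like `owner ttl class type rdata` and treats any name with an NS record as registered; the function names are hypothetical.

```python
def delegated_names(zone_lines, tld):
    """Collect names under the TLD that own at least one NS record."""
    names = set()
    suffix = "." + tld.lower() + "."
    for line in zone_lines:
        fields = line.split()
        # assumed shape: "example.com. 172800 in ns ns1.host.net."
        if len(fields) < 5 or fields[3].lower() != "ns":
            continue
        owner = fields[0].lower()
        if owner.endswith(suffix):
            names.add(owner.rstrip("."))
    return names

def new_registrations(yesterday_lines, today_lines, tld="com"):
    """Names delegated today that were not delegated yesterday."""
    return delegated_names(today_lines, tld) - delegated_names(yesterday_lines, tld)

yesterday = ["acmebank.com.\t172800\tin\tns\tns1.acmebank.com."]
today = yesterday + [
    "acmebank-secure.com.\t172800\tin\tns\tns1.parking.example.",
    "ns1.acmebank.com.\t172800\tin\ta\t192.0.2.1",
]
added = new_registrations(yesterday, today)
```

Holding two full snapshots of a large zone in memory as Python sets is unlikely to be practical; at that size a streaming comparison of sorted files is the more realistic shape, with the same logic.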

WHOIS delta monitoring is the right layer for catching parked typosquats: domains registered months in advance of a campaign, or registered by serial typosquatters who harvest domains speculatively and later sell or activate them. These don't get TLS certificates immediately, so CT logs miss them. A brand protection team that relies only on CT monitoring will develop a blind spot for this class of threat.

Practical consideration: WHOIS privacy and gTLD coverage

GDPR and equivalent privacy regulations prompted widespread adoption of WHOIS privacy services after 2018. Registrant contact data is now routinely masked even for newly registered domains. This doesn't affect domain name detection — you can still see that brandname-support[.]com was registered — but it eliminates the registrant-based correlation that previously allowed clustering of domains registered by the same threat actor. Takedown workflows must now rely primarily on the domain name itself, the registrar's abuse contact, and where available, passive DNS evidence of shared hosting infrastructure.

Keyword Scoring: Separating Signal from Noise

Both CT logs and WHOIS delta feeds produce raw domain name strings. Turning those strings into an actionable threat signal requires a scoring model, not a simple keyword match. The problem with naive brand keyword matching is precision: for any brand name that appears as a common English word or that shares tokens with legitimate infrastructure, a straight substring search produces false-positive rates that render the feed unusable.

A regional payments brand with the brand term nova would produce thousands of hits per day on CT logs alone. The keyword scoring approach layers multiple signals:

| Signal | Weight contribution | Rationale |
|---|---|---|
| Brand token as leftmost domain label | +High | Brand name at position 0 strongly indicates impersonation intent |
| Brand token + "login", "secure", "support", "verify", "account" suffix | +High | Canonical phishing page labeling patterns |
| TLD in high-abuse set (.top, .xyz, .icu, .online, .shop) | +Medium | Statistically over-represented in phishing domain registrations vs legitimate use |
| Levenshtein distance ≤ 2 from exact brand domain | +High | Covers single-character substitutions, transpositions, insertions |
| IDN/Punycode with visual lookalike characters | +Critical | Homoglyph attacks (see below) |
| Domain matches known legitimate subsidiary or partner | -High (suppress) | Allowlist hit |
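One way the layered signals might combine is sketched below. The numeric weights are placeholders standing in for High, Medium, and Critical, "leftmost label" is interpreted as the first label starting with the brand token, and the IDN check only looks for an `xn--` prefix rather than confirming lookalike characters. All of these are assumptions for illustration, not the production model.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Placeholder weights (assumed): High = 3, Medium = 2, Critical = 5
WEIGHTS = {"brand_leftmost": 3, "phish_term": 3, "risky_tld": 2, "near_match": 3, "idn": 5}
PHISH_TERMS = ("login", "secure", "support", "verify", "account")
RISKY_TLDS = {"top", "xyz", "icu", "online", "shop"}

def score_domain(domain, brand_domain, allowlist=frozenset()):
    """Return (score, reasons) for one candidate domain."""
    domain = domain.lower().rstrip(".")
    if domain in allowlist:
        return 0, ["allowlisted"]  # suppress known subsidiaries and partners
    brand = brand_domain.split(".")[0]
    labels = domain.split(".")
    name = ".".join(labels[:-1] or labels)  # everything except the TLD
    score, reasons = 0, []

    def hit(signal):
        nonlocal score
        score += WEIGHTS[signal]
        reasons.append(signal)

    if labels[0].startswith(brand):
        hit("brand_leftmost")
    if brand in name and any(term in name for term in PHISH_TERMS):
        hit("phish_term")
    if labels[-1] in RISKY_TLDS:
        hit("risky_tld")
    if 0 < levenshtein(domain, brand_domain) <= 2:
        hit("near_match")
    if any(label.startswith("xn--") for label in labels):
        hit("idn")
    return score, reasons
```

With the brand term nova, a name such as innovation.com contains the token but triggers none of the signals, which is the kind of hit a plain substring search would have raised.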

Homoglyph and Punycode Attacks: Why Visual Similarity Matters

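Levenshtein distance alone doesn't catch the most sophisticated lookalike technique: internationalized domain name (IDN) homoglyphs. The Punycode encoding scheme (RFC 3492) allows domains containing Unicode characters to be represented in the DNS as ASCII labels prefixed with xn--. Browsers render such labels in their human-readable Unicode form in the address bar when the name passes their IDN display policy, and many other surfaces (mail clients, chat apps, link previews) apply looser rules.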

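Consider the domain xn--pypa1-0jc.com. Decoded, it reads pɑypa1.com: the second character is U+0251, the Latin Small Letter Alpha, which is very close to a Latin 'a' in many fonts, and the last character of the label is the digit 1 standing in for 'l'. Wherever the Unicode form is displayed, the name passes for paypal.com at a glance. Levenshtein distance computed on the xn-- form sees a string that looks nothing like paypal.com. Computed on the decoded form it does score a near-match, but it treats the alpha as one more ordinary substitution and has no notion that the two characters are visually interchangeable, so nothing flags the domain as an IDN attack. A brand protection system that only does byte-level string matching against known brand names will miss the entire IDN homoglyph threat class.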

Effective homoglyph detection requires rendering the domain's Unicode form and then running visual similarity scoring against the brand's canonical domain — comparing the rendered pixel output rather than the byte string. This is computationally more expensive than string matching, which is why it's typically applied as a second-pass filter on candidates that have already passed a preliminary score threshold, rather than on every CT log entry in raw form.
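Pixel-level comparison needs a font renderer, so it is not shown here. As a cheaper stand-in that could serve as the preliminary filter, the sketch below decodes Punycode labels with Python's built-in `punycode` codec and folds lookalike characters to a common "skeleton" before comparing. The `CONFUSABLES` table is a tiny hand-picked sample for illustration only (Unicode publishes a full confusables data file with UTS #39), and the function names are hypothetical.

```python
# Tiny illustrative sample; a real table would be far larger.
CONFUSABLES = {
    "\u0251": "a",  # Latin small letter alpha
    "\u0430": "a",  # Cyrillic small letter a
    "\u0435": "e",  # Cyrillic small letter ie
    "\u043e": "o",  # Cyrillic small letter o
    "\u0440": "p",  # Cyrillic small letter er
    "\u0131": "i",  # Latin small letter dotless i
    "1": "l",
    "0": "o",
}

def to_unicode(domain):
    """Decode any xn-- labels in a domain to their Unicode form."""
    out = []
    for label in domain.lower().split("."):
        if label.startswith("xn--"):
            label = label[4:].encode("ascii").decode("punycode")
        out.append(label)
    return ".".join(out)

def skeleton(text):
    """Fold each character to the ASCII character it can be mistaken for."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in text)

def is_homoglyph_of(domain, brand_domain):
    """True when the decoded domain differs from the brand but folds to the same skeleton."""
    decoded = to_unicode(domain)
    return decoded != brand_domain and skeleton(decoded) == skeleton(brand_domain)

# Build an example ACE name from its Unicode form rather than hard-coding it.
ace = "xn--" + "p\u0251ypa1".encode("punycode").decode("ascii") + ".com"
suspicious = is_homoglyph_of(ace, "paypal.com")
```

A skeleton comparison only knows the pairs in its table, which is why rendered-image similarity remains the more thorough second pass.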

Punycode lookalikes are not rare or niche. In monitoring done across active brand protection campaigns, IDN lookalikes targeting financial services and e-commerce brands in the Gulf region and Latin America represent a meaningful fraction, often 10-20%, of total active phishing infrastructure at any given time. The attack surface is large enough that skipping visual similarity scoring leaves a significant detection gap.

DGA Detection and the Long Tail

Domain Generation Algorithms (DGAs) are primarily a malware C2 technique: the malware computes a schedule of pseudo-random domain names, the operator registers a few of them in advance, and when one is sinkholed the malware moves on to the next. Pure DGA domains don't usually impersonate brands by name (their character sequences are too random to be meaningful). However, a related technique, seeded generation in which the brand name is the seed for algorithmically produced variations, does produce brand-adjacent lookalike domains at scale.

The tell is statistical: DGA-seeded domains produce a cluster of syntactically similar strings that share character n-gram patterns with the target brand but don't appear on any human-curated wordlist of expected variations. Detecting them requires n-gram modeling of the domain string rather than dictionary-based permutation matching. This matters more for brands targeted by highly automated threat actors running persistent campaigns — not the typical use case for an early-stage brand protection deployment, but relevant context for enterprise environments that see consistent attacker attention.
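As a rough illustration of the n-gram idea, the sketch below uses Jaccard similarity over character trigrams, which is a deliberately simple stand-in for a trained n-gram model. The threshold of 0.3 and the function names are assumptions chosen for the example.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams, with ^ and $ marking the label boundaries."""
    padded = f"^{text}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_similarity(label, brand, n=3):
    """Jaccard similarity between the n-gram sets of two labels (0.0 to 1.0)."""
    a, b = char_ngrams(label, n), char_ngrams(brand, n)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def brand_adjacent(labels, brand, known_variations=frozenset(), threshold=0.3):
    """Labels that share n-gram structure with the brand but are not on the curated wordlist."""
    return [
        label for label in labels
        if label not in known_variations and ngram_similarity(label, brand) >= threshold
    ]

candidates = ["acmebankq7x", "zzqxv", "acmebank-login"]
unexpected = brand_adjacent(candidates, "acmebank", known_variations={"acmebank-login"})
```

The wordlist exclusion is the point of the exercise: names a dictionary-based permutation engine already expects are handled elsewhere, and what remains is the cluster that only the statistical view surfaces.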

Putting It Together: What the Pipeline Actually Looks Like

A domain monitoring pipeline that handles all the above at production scale runs something like this: CT log workers and zone file parsers feed a normalized stream of (domain_name, first_seen_timestamp, source) records into a matching queue. A scoring engine applies keyword token matching, Levenshtein distance scoring, TLD risk weighting, and allowlist suppression synchronously. Candidates above a preliminary threshold go to a second-pass pipeline that performs DNS resolution (is the domain live?), HTTP screenshot capture (what does the page look like?), and perceptual hash comparison against the brand's visual identity. The output is a ranked alert list with evidence attached — domain name, registrar, detection source, current resolution status, and screenshot hash similarity score.
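The normalized record and the merge of the two feeds might look like the sketch below. The field names follow the (domain_name, first_seen_timestamp, source) tuple above; the `Sighting` class, the source labels "ct" and "zone", and the helper names are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sighting:
    domain_name: str
    first_seen_timestamp: float
    source: str  # assumed labels: "ct" or "zone"

def normalize(domain):
    """Lowercase, drop the trailing root dot, and strip a wildcard prefix from CT SANs."""
    d = domain.strip().lower().rstrip(".")
    if d.startswith("*."):
        d = d[2:]
    return d

def merge_sightings(sightings):
    """Map each domain to (earliest timestamp, set of sources that reported it)."""
    merged = {}
    for s in sightings:
        key = normalize(s.domain_name)
        first, sources = merged.get(key, (s.first_seen_timestamp, set()))
        merged[key] = (min(first, s.first_seen_timestamp), sources | {s.source})
    return merged

merged = merge_sightings([
    Sighting("*.AcmeBank-Secure.com.", 200.0, "ct"),
    Sighting("acmebank-secure.com", 100.0, "zone"),
])
```

Keeping the set of sources alongside the earliest timestamp means the alert can later show which layer saw the domain first.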

The monitoring is only as useful as what happens after an alert fires. For a domain that scores above the action threshold, the next step is contacting the registrar's abuse team with a structured takedown request — a process with its own operational complexity, covered separately. What this pipeline gives you is the detection window: catching the domain while it's still new, before a phishing campaign has built a victim list, and with enough evidence to support an urgent abuse report. That window typically runs 2-6 hours for CT-sourced detections and 12-36 hours for WHOIS-delta sourced detections. The difference in those windows is why running both layers in parallel matters.
