How Teams Identify the Correct Company from Messy Inputs

Most enrichment pipelines assume the company match is already correct. Here is how teams resolve messy inputs, with a 3-step pipeline that reached 99.8% accuracy (499 of 500 records) in testing.

Published: May 17, 2026

Written by: Abhilash Chowdhary

Reviewed by: Nithish

Read time: 7 minutes

The initial company match that happens when you give a data provider a company name, domain or profile URL is arguably the most important step in a data enrichment process. Enrichment, scoring, and CRM routing all inherit the identity of the initial company match. When that match is wrong, every downstream step produces confident, well-formatted data about the wrong company.

One data team we spoke with had 20 million raw company records that reduced to just 7 million canonical entities, meaning roughly 65% of their rows were duplicates or variants of another record. Imagine the mess that team saw every time they opened their database.

Company entity resolution is the step where you take a messy input (a legal name, a misspelled entry, a subsidiary alias) and land on the one correct canonical company record. It determines whether everything downstream in an enrichment pipeline is right or wrong. This article covers the input types that cause matching failures, a pipeline that reached 99.8% accuracy on a 500-record test, and what happens when teams skip the resolution step or force a bad match.

The messy inputs teams actually start with

The inputs that feed company matching pipelines are rarely clean. They arrive from government databases, CRM imports, form fills, third-party lists, and manual entry, and each source introduces its own category of mess.

Wrong spellings and fat-fingered entries

Sales reps create records quickly, and "Salesforce" becomes "Salesforce LLC" in one record and "Salesforce Sync" in another. A data engineering team we spoke with described upstream sources where company names were "just fat-fingered and spelled incorrectly," with no validation layer catching the error at entry time. In any CRM or database that accepts manual input, this is the default state.

Duplicate account names

A common cause of duplicate account names is a lack of control over what sales reps add to the CRM. One RevOps team told us they had 15 different versions of the same company in their CRM, created by different SDRs over time. Each variant had its own account history, its own contact associations, and its own activity log. The team was constantly "clearing stuff up," while reps complained about records being moved out of their ownership during dedup exercises. Research from Databar suggests CRM duplication rates range from 10% to 30% for organizations without active quality management.

Subsidiary and brand aliases

A single parent company can appear under dozens of names depending on which subsidiary, brand, or regional entity filed the record. One team described the problem using General Motors, where GM, General Motors, General Motors Corporation, and Vauxhall (a UK brand that GM owned until 2017) can all belong under the same parent entity but appear as separate records in most databases.

Legal name vs DBA mismatches

Government filings, tax records, and regulatory databases use the legal entity name, while the rest of the world uses the trade name. The legal name can be very different from the name that customers, sales reps, or even the company's own employees use. Without a reliable mapping between legal name and DBA, any enrichment based on the legal name alone risks resolving to the wrong entity or returning no result at all.

Missing websites

Some companies, particularly small local businesses, franchisees, and recently formed entities, have no website. When the only input is a company name with no domain, matching drops to name-string-only resolution, which is the hardest matching scenario. One team described many of their target companies as "small mom-and-pop shops" where the usual domain-based enrichment path was not available.

What happens when the initial company match is wrong

A wrong company match produces correct-looking data for the wrong entity rather than an error, and everything downstream treats that data as ground truth.

Enrichment returns real data for a company you did not intend to match: If your pipeline resolves "Acme Services LLC" to the wrong Acme, the enrichment response will return that company's real headcount, real funding history, and real headquarters address. Nothing in the response flags it as a mismatch. The data is accurate, just attached to the wrong record.

Grouping breaks: When the same company exists under five variant names, each variant gets its own account, and different reps work the same company without knowing it. One team described SDRs "complaining about records going out of their name" during dedup exercises because what looked like separate accounts were actually duplicates.

Scoring uses the wrong firmographics: Lead scoring models that weight headcount, revenue, or funding stage will score the record based on whichever company the matcher resolved to. If the match is wrong, the score is meaningless, but it still routes the record through the pipeline.

CRM records become polluted at scale: Bad matches compound over time, and a 20-million-record database with even a 0.2% error rate has 40,000 records enriched with the wrong company's data. Those records feed segmentation, territory assignment, marketing campaigns, and reporting.

How teams recover the right company identity from weak inputs before enrichment

What failed: single-tool approaches

The teams we spoke with tried several approaches before finding one that worked consistently.

Fuzzy string matching catches typos and formatting differences, but it does not understand meaning. "General Motors" and "GM" have a high edit distance despite being the same entity.
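To make the edit-distance point concrete, here is a minimal sketch that computes plain Levenshtein distance for a typo pair and for the alias pair above. The function is a generic textbook implementation written for illustration, not the matcher any of these teams used, and the lowercased inputs are an assumption about preprocessing.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep when equal)
            ))
        prev = curr
    return prev[-1]


# A transposed-letter typo stays close to the original string.
typo_distance = levenshtein("salesforce", "salesfroce")

# An abbreviation of the same company is far away in character terms.
alias_distance = levenshtein("general motors", "gm")

print(typo_distance, alias_distance)  # 2 12
```

Any threshold loose enough to accept a distance of 12 on a 14-character name would also accept many unrelated companies, which is why string similarity alone cannot link aliases.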

LLM-powered search APIs were another common fallback. One data team used an LLM search tool to find company domains from name-only inputs, but described the accuracy as "not working well enough" for production use, leading to cascading errors when the wrong domain was resolved.

Four years ago, another team built a matching system using Python's FuzzyWuzzy library with a human-in-the-loop interface. It worked, but it required manual review for every ambiguous case, and the volume of ambiguous cases made it unsustainable as the dataset grew.

The 3-step pipeline: web search, company identify, LLM disambiguation

The approach that our solutions engineer tested and presented to a customer achieved 499 out of 500 correct matches (99.8% accuracy). It uses the following three steps chained together:

Step 1: Web search surfaces candidate URLs. Given a company name string, a web search API returns the top three candidate company pages (typically corporate homepages or professional profile URLs). This step works because web search engines already handle misspellings, abbreviations, and common aliases. The search query is the raw company name as-is, with no preprocessing required.

Step 2: Company identify pulls basic firmographics on each candidate. A company identification endpoint takes each candidate URL and returns basic firmographic data including company name, domain, headquarters, and employee count. This step is free (no credits consumed), so it scales to high-volume validation without cost pressure.

Step 3: LLM evaluates candidates and picks the best match. The original input (name, address, any other context from the source record) and the firmographic profiles of the top three candidates are passed to an LLM. The model compares the source context against each candidate and selects the best match. When no candidate is a strong match, the model returns a "no match" signal rather than forcing a bad pick.
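Step 3 can be sketched as a prompt builder plus a strict parser for the model's reply. This is an illustration under assumptions: `ask_llm` stands in for whatever LLM client you use, the `no_match` token and the prompt wording are hypothetical, and the candidate fields and IDs are made-up sample data rather than real API output.

```python
import json
from typing import Callable, Optional

NO_MATCH = "no_match"  # hypothetical sentinel the model is asked to return


def build_prompt(source: dict, candidates: list[dict]) -> str:
    """Combine the source record and candidate profiles into one instruction."""
    return (
        "Pick the candidate that is the same company as the source record.\n"
        f"Reply with only its company_id, or {NO_MATCH} if none is a strong match.\n\n"
        f"Source record:\n{json.dumps(source, indent=2)}\n\n"
        f"Candidates:\n{json.dumps(candidates, indent=2)}"
    )


def pick_match(
    source: dict,
    candidates: list[dict],
    ask_llm: Callable[[str], str],
) -> Optional[str]:
    """Return the chosen company_id as a string, or None when there is no match."""
    answer = ask_llm(build_prompt(source, candidates)).strip()
    valid_ids = {str(c["company_id"]) for c in candidates}
    # Any reply outside the candidate set, including NO_MATCH, counts as no match.
    return answer if answer in valid_ids else None


# Made-up sample data for illustration only.
source = {"name": "Marshall McLennan", "city": "New York", "industry": "insurance"}
candidates = [
    {"company_id": 101, "name": "Example Insurance Brokers", "hq": "New York"},
    {"company_id": 102, "name": "Example Audio Equipment", "hq": "London"},
]

# A stub stands in for the real model so the sketch runs offline.
print(pick_match(source, candidates, lambda prompt: "101"))
```

Validating the reply against the candidate IDs means a hallucinated or malformed answer falls through to "no match" instead of becoming a wrong match.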

MCP path (zero-code option)

A Claude Code agent with Crustdata's MCP server configured can run this entire workflow in natural language. Describe the company you are trying to resolve, and the agent calls web search, then company identify, then evaluates the candidates and returns the best match with its reasoning.

Direct API path (for programmatic pipelines)

For teams building this into a production pipeline, the same flow works as a web search call, one identify call per candidate, and a final LLM call:

import requests
from urllib.parse import urlparse

AUTH = {"Authorization": "Token YOUR_API_TOKEN"}
BASE = "https://api.crustdata.com"
TIMEOUT = 30  # seconds (illustrative value); keeps one slow lookup from blocking the run

# Step 1: Web search for candidate company pages
# The query is the raw, possibly misspelled name plus any context you have
search_resp = requests.post(f"{BASE}/screener/web-search", headers=AUTH, json={
    "query": "Marshall McLennan insurance",
    "sources": ["web"]
}, timeout=TIMEOUT)
search_resp.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body
candidates = search_resp.json()["results"][:3]

# Step 2: Company identify on each candidate domain (free, no credits)
for candidate in candidates:
    domain = urlparse(candidate["url"]).netloc
    identify_resp = requests.post(f"{BASE}/screener/identify", headers=AUTH, json={
        "query_company_website": domain
    }, timeout=TIMEOUT)
    identify_resp.raise_for_status()
    candidate["firmographics"] = identify_resp.json()

# Step 3: Pass to LLM for final pick
# (use your preferred LLM client here)
# Input: original company name + context + candidate firmographics
# Output: best match company_id, or "no_match"

Web search handles misspellings and aliases, the identify step provides structured data for comparison, and the LLM handles the judgment call that neither of the first two steps can make alone.

When web search, company identify, and LLM disambiguation each help

Not every record needs the full three-step pipeline. The right approach depends on what you start with.

You have a domain or a professional profile URL: Skip web search entirely. Call the Company Enrichment API directly with the domain. If the domain resolves to a single company, you are done. This is the fastest and cheapest path.

You have a company name with no domain: Run the full workflow. Web search is the only reliable way to go from a raw name string to candidate URLs, especially when the name is misspelled, abbreviated, or uses a legal entity name that differs from the trade name.

You have a company name and a partial domain (or an address, or an industry): Web search with the extra context improves candidate quality. When the source record carries that context, adding an industry term like "insurance" or a location like "Chicago" to the search query narrows the candidate set considerably.

You have multiple possible matches and need to pick one: The Company Identification API is free. Use it to pull firmographics on every candidate, then compare against whatever source context you have. The LLM step is only needed when the firmographics alone do not produce a clear winner.

You have a known good identifier (a specific company ID or canonical URL) but need to verify it is still accurate: Call company identify with the identifier. If the response matches your expectations, the record is confirmed. If it does not, the company may have been acquired, renamed, or merged, and you need to re-resolve.
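The cases above amount to a small routing function that runs before any API call. The sketch below is one way to express it, with assumptions: the path labels, the parameter names, and the choice to send records with no usable input to human review are all hypothetical, and the partial-domain case is simplified to "name plus extra context."

```python
from typing import Optional


def choose_path(
    name: Optional[str] = None,
    domain: Optional[str] = None,
    profile_url: Optional[str] = None,
    company_id: Optional[str] = None,
) -> str:
    """Pick the cheapest resolution path the available inputs allow."""
    if company_id:
        return "verify_with_identify"  # known identifier: confirm it is still accurate
    if domain or profile_url:
        return "enrich_directly"       # skip web search entirely
    if name:
        return "full_pipeline"         # web search, identify, LLM disambiguation
    return "human_review"              # assumption: nothing usable to search on


def build_search_query(
    name: str,
    industry: Optional[str] = None,
    city: Optional[str] = None,
) -> str:
    """Append whatever context the source record has to the raw company name."""
    parts = (name, industry, city)
    return " ".join(p.strip() for p in parts if p and p.strip())


print(choose_path(domain="example.com"))
print(build_search_query("Marshall McLennan", industry="insurance"))
```

Routing first keeps the web search and LLM steps for the records that actually need them.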

Where disambiguation is still needed even after you have a likely match

The workflow resolves most records cleanly, and the remaining 0.2% are the cases where disambiguation requires additional context or human judgment.

Parent vs subsidiary resolution: "Vauxhall" is a real company with real employees and real financials, and so is "General Motors." Whether you want the subsidiary or the parent depends on your use case. A sales team targeting the buying entity may want the subsidiary, while an investment team modeling the corporate structure wants the parent. The pipeline can return both, and the business logic decides which one to keep.

Multiple valid matches with the same name: Common company names ("Alpha," "Level," "Summit") match multiple real entities. One team described querying "AOL" and getting Netscape back as a result, with no confidence score to differentiate. When the identify endpoint returns multiple companies for the same name, the LLM disambiguation step becomes critical, and it works best when the source record includes any additional context (city, industry, employee range) to narrow the candidates.

Small companies with no web presence: When a company has no website, no professional profile, and no meaningful web footprint, web search returns irrelevant results. These records cannot be resolved programmatically and belong in a human review queue rather than being forced into the closest match.

Legal name collisions: Two legitimately different companies can share a legal name if they operate in different states or countries. "ABC Services LLC" registered in Delaware and "ABC Services LLC" registered in California are two different entities. Disambiguation requires the source record to include a geography or address signal.

When to queue, retry, or send the record for human review instead of forcing a match

A wrong match treated as correct is the worst possible outcome in entity resolution. A record sitting in a review queue causes no downstream damage, while a wrongly matched record poisons enrichment, scoring, and routing for that account until someone catches it.

24-hour retry queues for "not yet in database" responses: One team built retry logic into their pipeline: when the enrichment API returns a "not yet in database" response, the record enters a queue and the pipeline re-calls the API 24 hours later when new data may be available. This avoids force-matching to a partial result and avoids discarding a resolvable record.

Confidence thresholds that route to human review: When the LLM disambiguation step returns a match but with low confidence (the best candidate only partially matches the source context), the record should route to a review queue rather than auto-accepting. One law firm built this pattern four years ago, with automated matching handling the clear cases and a human-in-the-loop interface presenting the ambiguous ones for manual resolution.

Timeout handling for slow lookups: Some lookups hang on unresponsive endpoints or slow web search results. Rather than letting the entire pipeline block on a single record, set a timeout and route the record to retry. One team described setting explicit timeouts on problematic 404 responses to avoid API bottlenecks.

The decision framework: If the match confidence is high and the candidate firmographics align with the source context, auto-accept. If confidence is moderate, flag for review but provisionally accept. If confidence is low or no candidates matched, queue for retry or human review. Never force a match to clear a queue.
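The framework and the 24-hour retry rule can be written down as two small functions. This is a sketch under assumptions: the numeric thresholds, the outcome labels, and the idea that your disambiguation step yields a confidence between 0 and 1 are all hypothetical, and high confidence with misaligned firmographics is treated here as a flagged provisional accept because the framework does not cover that case.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

HIGH_CONFIDENCE = 0.9      # hypothetical threshold, tune against reviewed samples
MODERATE_CONFIDENCE = 0.6  # hypothetical threshold
RETRY_DELAY = timedelta(hours=24)


def route_match(confidence: Optional[float], firmographics_align: bool) -> str:
    """Map a match result to an action; None means no candidate matched."""
    if confidence is None or confidence < MODERATE_CONFIDENCE:
        return "retry_or_review"
    if confidence >= HIGH_CONFIDENCE and firmographics_align:
        return "auto_accept"
    # Moderate confidence, or high confidence with misaligned firmographics.
    return "provisional_accept_and_flag"


def next_retry_at(last_attempt: datetime) -> datetime:
    """Schedule the re-call for a 'not yet in database' response."""
    return last_attempt + RETRY_DELAY


print(route_match(0.95, True))
print(next_retry_at(datetime(2026, 5, 17, tzinfo=timezone.utc)))
```

Keeping the thresholds as named constants makes it easy to tighten them if reviewers find wrong matches among the auto-accepted records.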

Conclusion

Company identification is the step that determines whether enrichment, scoring, and CRM routing produce correct outputs. Every downstream system inherits the identity of the initial match. Getting it wrong produces clean-looking data for the wrong company rather than an error anyone would catch. This applies to any data pipeline where company identity matters, from programmatic deal sourcing and founder discovery to portfolio monitoring.

The 3-step pipeline (web search to surface candidates, company identify to pull firmographics, LLM to pick the best match) achieves 99.8% accuracy on tested batches. The records it cannot resolve belong in a review queue, not forced into the closest available match.

Sign up for Crustdata's free tier (100 credits included) to test the web search and company identification endpoints against your own messy data. For teams building this into a production enrichment pipeline, book a demo to walk through the architecture.
