Best Entity Resolution APIs for Sales, Recruiting, and Research Pipelines

Compare 7 entity resolution APIs and tools for matching messy company names, disambiguating locations, and resolving people identities in B2B data pipelines.

Published

May 24, 2026

Written by

Abhilash Chowdhary

Reviewed by

Chris Pisarski

Your enrichment pipeline only works if company identity is resolved first. When the data you pull in from CRM exports, government filings, partner feeds, or scraped sources arrives with inconsistent names, misspellings, or missing domains, every enrichment call is a guess. You burn credits on the wrong entity, get conflicting firmographic profiles back, and corrupt the data your product depends on.

One team processing 20 million company records from government tax filings described it directly: "We have a lot of upstream data from places that are government databases that have 100 variations on the same name, things spelled wrong." Their previous matching approach used an LLM-powered search API, but it returned wrong matches with no confidence scores to flag errors. Bad matches fed bad groupings, and the pipeline broke before enrichment even ran.

This guide evaluates entity resolution tools from the perspective of product engineers building internal sales, recruiting, or research systems. Most of the tools here are API-first, with two open-source Python libraries included for teams that run matching in their own environment. The evaluation criteria are written for builders, and the pipeline architectures are drawn from real workflows. The guide covers seven tools, five evaluation criteria, and a multi-step matching pipeline.

What Entity Resolution Means When You're Building a Product

Matching records to real-world entities

Entity resolution is the process of identifying when two or more records refer to the same real-world entity, whether that entity is a company, a person, or a location. Three broad matching approaches handle different types of input noise. Deterministic matching relies on exact key matches like shared domains or tax IDs. Probabilistic matching uses weighted similarity scores across multiple fields to estimate match likelihood, while ML-based matching trains a model on labeled pairs to learn what "same entity" looks like in your specific data.
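As a rough illustration of the first two approaches, the sketch below contrasts a deterministic key check with a weighted similarity score. The field names, the weights, and the use of Python's difflib as the similarity measure are assumptions chosen for the example, not a reference implementation of any tool in this guide.

```python
from difflib import SequenceMatcher

# Hypothetical field weights; a real system would estimate these from data.
WEIGHTS = {"name": 0.6, "city": 0.4}


def deterministic_match(a: dict, b: dict, keys=("domain", "tax_id")) -> bool:
    """Match only when both records carry the same non-empty key value."""
    return any(bool(a.get(k)) and a.get(k) == b.get(k) for k in keys)


def field_similarity(x: str, y: str) -> float:
    """Character-level similarity in [0, 1]; missing values score 0."""
    if not x or not y:
        return 0.0
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()


def probabilistic_score(a: dict, b: dict) -> float:
    """Weighted sum of per-field similarities."""
    return sum(
        weight * field_similarity(a.get(field, ""), b.get(field, ""))
        for field, weight in WEIGHTS.items()
    )


# Illustrative records: a filing with a typo and no domain, and a CRM row.
filing = {"name": "Acme Holdngs LLC", "city": "Austin", "domain": ""}
crm = {"name": "Acme Holdings LLC", "city": "Austin", "domain": "acme.example"}

print(deterministic_match(filing, crm))  # False: the filing has no domain
print(round(probabilistic_score(filing, crm), 2))
```

The deterministic check fails here because the filing has no shared key, while the weighted score still rates the pair as a likely match. This is why pipelines often try exact keys first and fall back to scoring.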

Why this matters before enrichment

For product engineers, entity resolution is a pre-enrichment pipeline step. If company identity is not resolved first, every downstream enrichment call wastes credits on the wrong entity or returns conflicting data that poisons your production database. One team building a recruiting platform put it directly: "The problem right now is the mapping part. What we search in the lookup and what we find in your data." Their raw data was fine. The matching layer between input and output was where records fell apart.

When entity resolution works, enrichment calls go to the right company, people searches return the right candidates, and downstream systems get clean, consistent records. When it does not, you spend weeks building post-processing logic that still misses edge cases.

Three Matching Problems Builders Run Into

Company name normalization from messy source data

Company names arrive in different formats depending on the source. Government filings use legal names ("Nielsen Holdings PLC"), CRM imports use trading names ("NielsenIQ"), scraped web data uses abbreviations ("Nielsen"), and partner feeds use whatever the data entry clerk typed. A single company can appear as dozens of distinct records across your pipeline.

One team processing insurance industry data started with 20 million raw company representations from government tax filings. After initial exact-match deduplication, they were down to roughly 7 million, but the remaining matches were unreliable. Their previous approach used an LLM-powered search API to find company websites from legal names, but the accuracy was poor: "When it matches a bad one [...] and we use that to do a bad grouping, that's where the issue is." Without confidence scores on the fuzzy matches, they had no way to flag uncertain results for review.
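The exact-match deduplication pass described above usually works better after light normalization. The following is a minimal sketch, assuming a hand-picked list of legal suffixes and a simple punctuation rule. Neither comes from the team's actual pipeline.

```python
import re
from collections import defaultdict

# Assumed list of legal-form suffixes; extend it for your own sources.
LEGAL_SUFFIXES = {"inc", "incorporated", "llc", "ltd", "limited", "plc", "corp", "corporation"}


def normalize_name(raw: str) -> str:
    """Lowercase, replace punctuation with spaces, and drop trailing legal suffixes."""
    text = re.sub(r"[^\w\s]", " ", raw.lower())
    tokens = text.split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def group_exact(names: list[str]) -> dict[str, list[str]]:
    """Group raw names that share the same normalized form."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_name(name)].append(name)
    return dict(groups)


raw_names = ["Acme, Inc.", "ACME Inc", "Nielsen Holdings PLC", "Nielsen"]
for key, members in group_exact(raw_names).items():
    print(key, "<-", members)
```

Note that "Nielsen Holdings PLC" and "Nielsen" stay in separate groups. Normalization removes formatting noise only; deciding whether those two are the same company is the job of the fuzzy or graph-based matching that follows.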

Location and attribute disambiguation

Matching gets harder when the same string maps to different real-world entities depending on context. One team building a recruiting matchmaking platform discovered this when querying for candidates in Atlanta: "I put country, United States, state, Georgia, city, Atlanta. As soon as I put city, Atlanta, there is another city called Atlanta in Michigan." The API returned results from both states with no disambiguation.

The same problem applies to professional attributes. Certifications appear as "CISSP", the full written form, or "CISSP R" depending on how the person listed them on their profile. Job titles appear as "CEO" or "Chief Executive Officer", and the two do not match in a keyword search. The team described the cumulative cost of handling these edge cases in post-processing: "You have no idea, some of the logics we have done on our side is like crazy."
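Much of that post-processing comes down to two ideas: compare locations as fully qualified keys rather than bare city strings, and map attribute variants to one canonical form before matching. The sketch below shows both, with a hypothetical alias table. It is not the logic the team described.

```python
# Hypothetical alias table; a real one would be much larger and data-driven.
TITLE_ALIASES = {
    "ceo": "chief executive officer",
    "cto": "chief technology officer",
}


def canonical_title(raw: str) -> str:
    """Lowercase, drop periods, collapse whitespace, then expand known aliases."""
    key = " ".join(raw.lower().replace(".", "").split())
    return TITLE_ALIASES.get(key, key)


def location_key(city: str, state: str, country: str) -> tuple[str, str, str]:
    """Compare locations as (country, state, city), never as a bare city string."""
    return (country.strip().lower(), state.strip().lower(), city.strip().lower())


print(canonical_title("C.E.O.") == canonical_title("Chief Executive Officer"))  # True

georgia = location_key("Atlanta", "Georgia", "United States")
michigan = location_key("Atlanta", "Michigan", "United States")
print(georgia == michigan)  # False: same city name, different state
```

The same alias approach extends to certification strings, as long as the variants you see in your data are added to the table.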

Identity resolution from partial signals

Person-level entity resolution adds another layer. When you only have an email address or a partial name, resolving to a canonical person record requires matching across multiple signals. One sales automation platform serving 60,000 end users measured their email-to-profile match rate at 67 to 70 percent. Their target was 90 to 95 percent, but 15 to 20 percent of profiles were missing from the public data sources they relied on.

This is entity resolution at the person level: taking a partial identifier (an email domain, a first name and company) and resolving it to a specific person with a verified professional profile. The accuracy gap between 70 percent and 95 percent is the difference between a usable product and one that requires constant manual intervention.

What to Look For in an Entity Resolution API

API availability and latency

Can you call it as a REST endpoint from your production code? Some entity resolution tools are desktop applications, batch-only services, or Python libraries you run in your own environment. Product engineers embedding resolution into a live workflow need a callable API with sub-second response times.

Confidence scoring

Does the API return a confidence score on matches? Without one, you cannot build a review queue for uncertain matches or set thresholds for automatic acceptance. One team described the downstream cost: their previous tool returned the wrong company match for AOL (returning Netscape) with no confidence signal to catch the error. Every match that passes unchecked propagates errors through the entire enrichment pipeline.

B2B entity support

General-purpose entity resolution tools handle record linkage across any schema you provide, but they require you to supply your own reference data. B2B-specific tools come with a canonical company or people graph already built. If your input is messy company names and you need to resolve them to real companies with firmographic data, a tool with a built-in company graph saves months of reference data construction.

Matching approach

Different approaches handle different types of input noise. Deterministic matching is fast and precise but fails on typos and abbreviations. Probabilistic matching, typically built on string-similarity measures such as Jaro-Winkler or Levenshtein distance, handles character-level errors but cannot match "SBUX" to "Starbucks" or "MCDNLDS" to "McDonald's". Semantic and LLM-based matching captures these conceptual equivalences but costs more per call.
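To make the limitation concrete, here is a small self-contained sketch of Levenshtein distance turned into a similarity score. The normalization by the longer string's length is one common convention, chosen here as an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free when characters match
            ))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1], normalized by the longer string."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(round(similarity("Starbuks", "Starbucks"), 2))  # typo: scores high
print(round(similarity("SBUX", "Starbucks"), 2))      # ticker: scores low
```

A one-character typo stays close to 1.0, while the ticker symbol scores well below any sensible acceptance threshold even though a human reads both as the same company. That gap is what semantic or graph-backed matching has to cover.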

BCG's data science team found that when working with 10 million company records across 45 data sources, "no single name-matching method can address all the nuances found in textual data. Most methods will accomplish 80 percent of a design solution." The remaining 20 percent requires combining multiple approaches or adding a human review step.

Pricing model

Per-API-call pricing fits on-demand resolution, while platform licenses are better for batch processing against a warehouse. Open-source libraries have no licensing cost but require your own infrastructure and reference data. Match the pricing model to your resolution pattern: if you resolve at ingest time (one record at a time), per-call pricing makes sense. If you resolve in bulk nightly, a batch-oriented tool or library is more cost-effective.

Best Entity Resolution APIs and Tools

Tool	Type	B2B Company Graph	Confidence Scoring	Pricing	User Rating (source)
Crustdata	REST API (3-endpoint pipeline)	Yes	Yes	Free identification, credit-based enrichment	4.4/5 (Product Hunt)
PeopleDataLabs	REST API	Yes	Yes	Free tier, from $98/mo	4.3/5 (G2, 17 reviews)
Tilores	GraphQL API	No (bring your own)	Configurable	Quote-based	No reviews
FullContact	REST API	Yes	Yes	From $500/mo (annual contract)	4.3/5 (G2, 100 reviews)
Senzing	Embeddable SDK + REST API	No (bring your own)	Yes	From $58,560/yr (10M records)	4.8/5 (G2, 17 reviews)
Splink	Python library (open-source)	No (bring your own)	Probability scores	Free (compute only)	Not listed
Dedupe	Python library (open-source)	No (bring your own)	Probability scores	Free (cloud from $9/1K rows)	Not listed

Crustdata

Crustdata's entity resolution capability comes from three APIs working together. The Company Identification API takes a messy company name, website, or company ID and resolves it to a canonical company record. It is free, consuming zero credits. The Web Search API finds canonical company URLs when you only have a legal name and no domain. The Company Enrichment API then returns 250+ firmographic datapoints for the resolved company.

The three-step pipeline (web search to find a canonical URL, then identify to resolve the company, then enrich for firmographic data) is the architecture one insurance-industry data team built after their previous LLM-based matching approach failed at scale. In testing, this pipeline matched 499 out of 500 companies correctly from messy government filing data. A full walkthrough with code is in the pipeline architecture section below.

Key features:

Company Identification endpoint is free (zero credits)
Confidence-ranked results with is_full_domain_match flag for exact matches
Web Search API for discovering canonical company URLs from names-only inputs
250+ enrichment datapoints available after resolution (firmographics, funding, headcount, web traffic, leadership)
Watcher API for ongoing company monitoring after initial resolution

Pros:

"The API is clean, fast, and well-documented. Integration was almost effortless," according to users on Product Hunt
The identification step being free means you can validate company matches at scale before committing credits to enrichment
Real-time webhook alerts for downstream triggers (job changes, funding rounds, headcount shifts)

Cons:

Users on Product Hunt note that "pricing will be an issue for startups and small businesses" at high enrichment volumes

Best for: Product engineers who need company name resolution as a step before enrichment in sales, recruiting, or research pipelines. The free identification endpoint makes it practical to resolve millions of records before paying for enrichment on confirmed matches. Sign up for the free tier (100 credits included) to test the three-step pipeline against your own data.

PeopleDataLabs Cleaner API

PeopleDataLabs offers a Cleaner API that takes a messy company name, website, or profile URL and returns a normalized canonical form. It resolves the input to PDL's canonical entity graph, which you can then use for enrichment through their separate Person or Company Enrichment APIs. The Cleaner API handles the normalization step. Enrichment is a separate call and a separate charge.

Key features:

REST API with a free tier (100 lookups per month, expanding to 10,000 on the Pro plan)
Normalizes company names to canonical entities including standardized names, domains, and industry classifications
Supports company, school, location, and job title cleaning endpoints
Elasticsearch-compatible query syntax for advanced search filters

Pros:

"The API is well documented and relatively easy to use. You can use Elasticsearch queries to do finely tuned searches," according to G2 reviewers
"Set it up one time and it continued to work with basically no interruption or maintenance," per another G2 review
Support team cited as responsive and helpful across multiple reviews

Cons:

"The throttle limit on how many queries can be sent per minute using people search is a little prohibitive," reports one G2 user
"The data seems to trust a single social platform a bit much. If a person is not on that platform, they will not show up in search results," notes another reviewer
"The problem of data recency is one area for constant improvement," per G2

Best for: Teams that need a lightweight normalization layer before their own enrichment pipeline. The free tier is practical for testing, and the Elasticsearch query syntax gives data engineers familiar patterns to work with. Pricing starts at $98 per month for the Pro plan. See how Crustdata compares on the Crustdata vs People Data Labs page.

Tilores

Tilores is an API-first entity resolution platform built for real-time identity graph construction. It was originally developed inside a European credit bureau and uses a GraphQL API with a Python SDK. Unlike enrichment APIs that come with a pre-built company graph, Tilores is data-agnostic. You bring your own records (companies, people, transactions, or anything with structured attributes) and define matching rules, and Tilores builds and maintains a unified entity graph in real time.

Key features:

GraphQL API for real-time entity resolution and search
Handles companies, people, and arbitrary entity types
Built-in deduplication and merge logic with configurable matching rules
Python SDK for pipeline integration
Designed for high-throughput ingestion with real-time graph updates

Pros:

Lightweight deployment with no infrastructure to manage beyond the API connection
Real-time graph construction means entities update as new records arrive, with no batch reprocessing
Data-agnostic design works for B2B company matching, person identity resolution, and non-standard entity types in the same system

Cons:

Pricing is not publicly listed (quote-based), making it harder to evaluate cost at scale without a sales conversation
Because it is data-agnostic, you need to supply your own reference data for company matching. There is no built-in company graph.
GraphQL API requires learning a different query pattern than the REST APIs most data engineers already work with

Best for: Teams building their own identity graph who want real-time entity resolution without managing the infrastructure for graph construction and maintenance. Most useful when your entity types go beyond standard companies and people, or when you need a single system that handles multiple entity types.

FullContact Resolve

FullContact offers an identity resolution API called Resolve that links partial identifiers (a name, a domain, a social handle, an email) to a canonical entity in their identity graph. It covers both companies and people, with the identity graph built from hundreds of data sources. FullContact focuses specifically on the resolution step: taking fragmented inputs and returning a unified identity.

Key features:

REST API with asynchronous webhook processing for higher match rates
Multi-signal identity resolution (name, domain, email, phone, social handles)
Covers both company and person entities
Identity graph built and maintained by FullContact from hundreds of sources
Integrates with major CRM and marketing platforms

Pros:

"The API is very easy to integrate, is well documented and very stable. Asynchronous processing via webhooks increases the matching ratio," per G2
"The data is very usable and reliable, plus it is up to date," reports a Capterra user
Cross-platform integration simplifies deployment for teams already using mainstream CRM or marketing tools

Cons:

"We see a 50% matching rate. There is room for improvements there," notes one G2 reviewer. Match rates vary significantly depending on the quality of input signals.
"Some data from the Enrich API have too many false positives, especially age, employment, and education," reports another G2 user
Pricing starts at $500 per month with no self-serve option. All access requires a sales call and annual contract, which adds friction for product engineers who want to test before committing.

Best for: Teams needing person-level identity resolution across multiple signal types (email, phone, social handles) with a pre-built identity graph. The asynchronous webhook processing is useful when match rates matter more than response latency. The minimum $500 per month commitment means this fits teams with established resolution volume rather than early-stage experimentation.

Senzing

Senzing is an embeddable entity resolution engine delivered as a Docker container with a REST API. It was originally built for government and financial entity resolution at scale, and uses an approach it calls "entity-centric learning" that improves matching accuracy as more records are ingested. Unlike SaaS APIs, Senzing runs in your own infrastructure, which means you control the data, the matching logic, and the latency.

Key features:

Embeddable SDK and REST API delivered as Docker images
No manual tuning required because the engine learns matching patterns from the data itself
Confidence scoring on all merges with a built-in review tool for manual splits and merges
Handles both company and person entities
Processes records in real-time or batch mode

Pros:

"I have had to work with teams of people and spend a great deal of time getting entity resolution tuned. With Senzing, it just happens," per G2 (4.8/5 rating, 17 reviews)
"Senzing is just so simple to use and scale," notes another G2 reviewer
Built-in confidence ratings and a review interface for handling uncertain matches reduce the need for custom review queue development

Cons:

"The hardest part is getting your data into a format that Senzing can read (JSONL)," reports a G2 user. Data preparation is a meaningful upfront investment.
"There is a small learning curve in the beginning when preparing the dataset for entity resolution," per another review
Pricing is annual and volume-based: 10 million records costs $58,560 per year, scaling to $234,600 per year at 100 million records (Senzing pricing page). This is infrastructure-scale pricing designed for teams processing millions of records annually.

Best for: Teams with their own infrastructure who want to embed entity resolution into their data pipeline and control the matching logic end to end. Senzing is most practical when you are processing millions of records and need the engine to learn and improve without manual rule tuning. The annual pricing model means it fits teams that have already validated their resolution workflow and need production-grade throughput.
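For budgeting, it can help to convert the two annual price points quoted above into a unit cost. This is plain arithmetic on those published figures; the per-1,000-records framing is just one convenient way to compare tiers.

```python
# Annual Senzing price points quoted above: label -> (records covered, USD per year).
TIERS = {
    "10M": (10_000_000, 58_560),
    "100M": (100_000_000, 234_600),
}


def cost_per_thousand(records: int, annual_usd: float) -> float:
    """Annual cost in USD for every 1,000 records covered by the license."""
    return annual_usd / records * 1_000


for label, (records, price) in TIERS.items():
    print(f"{label}: ${cost_per_thousand(records, price):.3f} per 1,000 records per year")
```

That works out to roughly $5.86 per 1,000 records at the 10 million tier and $2.35 at the 100 million tier, so the unit cost falls by more than half as volume grows tenfold.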

Splink

Splink is an open-source Python library for probabilistic record linkage built by the UK Ministry of Justice. It uses the Fellegi-Sunter model with expectation-maximization to estimate match probabilities across record pairs. It runs on DuckDB, Spark, or Athena, meaning you execute it in your own environment against your own data. It is not an API you call. You bring your own reference dataset, define comparison columns, and Splink estimates which record pairs are matches.
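The Fellegi-Sunter idea can be shown without Splink itself. In the sketch below, each field contributes a log2 Bayes factor (a "match weight") based on two probabilities: m, the chance the field agrees when the records truly match, and u, the chance it agrees when they do not. The m, u, and prior values are invented for illustration. Splink estimates them from the data with expectation-maximization, and this code does not use or reproduce Splink's API.

```python
import math


def field_weight(m: float, u: float, agrees: bool) -> float:
    """Log2 Bayes factor for one field.

    m = P(field agrees | records match), u = P(field agrees | records do not match).
    """
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_probability(prior: float, weights: list[float]) -> float:
    """Combine the prior odds with the per-field weights into a probability."""
    total = math.log2(prior / (1 - prior)) + sum(weights)
    odds = 2 ** total
    return odds / (1 + odds)


# Hypothetical parameters: field -> (m, u).
PARAMS = {"name": (0.9, 0.01), "city": (0.8, 0.1)}
PRIOR = 0.001  # assumed share of random record pairs that are true matches

both_agree = [field_weight(*PARAMS[f], agrees=True) for f in PARAMS]
name_only = [
    field_weight(*PARAMS["name"], agrees=True),
    field_weight(*PARAMS["city"], agrees=False),
]

print(round(match_probability(PRIOR, both_agree), 3))
print(round(match_probability(PRIOR, name_only), 3))
```

Agreement on a rare value (low u) adds a large positive weight, and disagreement subtracts one. With only a name column there is a single weight to add, which gives the model very little evidence to work with.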

Key features:

Open-source (MIT license) with no licensing cost
Runs on DuckDB (laptop-scale), Spark, or Athena (warehouse-scale)
Probabilistic linking with Fellegi-Sunter/EM algorithm
Visual diagnostics for match quality analysis (waterfall charts, cluster dashboards)
Used in production by NHS England, UK Ministry of Defence, and Department for Education

Pros:

"Relatively simple to use by default with DuckDB," per a GitHub discussion. Getting started does not require a distributed computing setup.
Free and open-source with no per-record or per-call costs. Compute is the only expense.
2,200+ GitHub stars and active maintenance with regular releases

Cons:

Splink's documentation explicitly states it is "not designed for linking a single column containing a 'bag of words', for example, a table with a single 'company name' column and no other details". If your input is company names only (no addresses, IDs, or other attributes), Splink is not the right tool.
Splink 4 introduced breaking syntax changes from Splink 3, requiring script migration for existing users
Users reported that small errors in configuration produced unhelpful error messages, a problem significant enough to prompt the Splink 4 redesign

Best for: Data engineers who want full control over matching logic and already have a multi-column reference dataset to match against. Splink is strongest when you have records with multiple attributes (name, address, date of birth, ID fragments) and need probabilistic linking at scale. If your input is only company names with no additional attributes, use an API-based tool instead.

Dedupe (Python library) and Dedupe.io

Dedupe is an open-source Python library that uses machine learning (active learning) for entity resolution. You label a small set of record pairs as matches or non-matches, and Dedupe trains a model that generalizes to your full dataset. Dedupe.io is a hosted web interface and API wrapper around the same library, adding a UI for labeling and a cloud processing option.

Key features:

Active learning that requires only a small number of labeled examples to train a matching model
Handles deduplication (within one dataset) and record linkage (across two datasets)
Automatic blocking rules that avoid exhaustive pairwise comparisons
4,500+ GitHub stars with broad adoption in the Python data community
Dedupe.io cloud service starts at $9 per 1,000 rows with a free tier

Pros:

The active learning approach means you do not need to write matching rules manually. The model learns from your labels.
Broad adoption in the Python data community means worked examples and community support are easy to find
A dedicated Python library focused on entity resolution, with both deduplication and record linkage modes

Cons:

Users report that multiprocessing fails in AWS Lambda and other serverless environments, which limits deployment options for production pipelines
The interactive labeling interface has usability issues that can slow down the training process
More batch-oriented than real-time. Dedupe is designed for offline matching against a dataset, not for resolving individual records at API call time.

Best for: Teams with labeled training data (or willingness to label) who want ML-based matching without building a model from scratch. Dedupe works well for periodic batch deduplication of CRM or warehouse data. For real-time, per-record resolution in a live product pipeline, an API-based tool is a better fit.
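The blocking idea listed under Dedupe's key features is worth seeing in miniature, because it is what makes batch matching tractable. The sketch below uses a hand-written blocking key (first three letters of the name plus postal code), which is an assumption for illustration. Dedupe learns its blocking rules from your labels instead, and this code does not use Dedupe's API.

```python
from collections import defaultdict
from itertools import combinations


def block_key(record: dict) -> tuple[str, str]:
    """Hypothetical blocking rule: name prefix plus postal code."""
    return (record["name"][:3].lower(), record["zip"])


def candidate_pairs(records: list[dict]) -> list[tuple[int, int]]:
    """Return index pairs that share a block; only these get compared in detail."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


records = [
    {"name": "Acme Holdings", "zip": "78701"},
    {"name": "ACME Inc", "zip": "78701"},
    {"name": "Acme Corp", "zip": "10001"},
    {"name": "Globex", "zip": "78701"},
]

exhaustive = len(records) * (len(records) - 1) // 2
print(candidate_pairs(records))  # [(0, 1)]
print(exhaustive)                # 6
```

Four records need six comparisons exhaustively but only one after blocking. The trade-off is recall: a true match that lands in a different block, such as a record with a mistyped postal code, is never compared.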

Building a Multi-Step Matching Pipeline

The three-step resolution pattern

When a single matching method tops out at 80 percent accuracy, the practical solution is to chain complementary methods. The pattern that worked for an insurance-industry data team processing government tax filings follows three steps:

Web search to find a canonical URL. Start with the messy company name and use a web search API to find the most likely company website. This handles cases where you have a legal name ("General Electric Company") but no domain.
Company identification to resolve the URL. Pass the discovered URL (or whatever identifier you have) to a company identification API that matches it against a canonical company graph. The API returns confidence-ranked candidates.
LLM disambiguation for edge cases. When the identification API returns multiple candidates with similar confidence scores, pass the candidates to an LLM with the original context (name, address, industry) and let it pick the best match.

This pipeline catches different error types at each stage. Web search handles the "no identifier" problem. The identification API takes over for normalization and variant matching, while LLM disambiguation resolves the ambiguous cases where both "NielsenIQ" and "Nielsen Holdings" are valid companies and context determines which one you mean.

Code example

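A Claude Code agent with Crustdata's MCP server configured can execute this pipeline conversationally. For teams building it into production code, here is the direct API path in Python. It covers web search and identification, then enriches the resolved company. The LLM disambiguation step is left out for brevity.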

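import requests

CRUSTDATA_API_KEY = "your_api_key"
HEADERS = {"Authorization": f"Bearer {CRUSTDATA_API_KEY}"}
TIMEOUT = 30  # seconds, so a stalled request fails instead of hanging

# Step 1: Web search to find canonical URL from a messy legal name
search_resp = requests.post(
    "https://api.crustdata.com/screener/websearch",
    headers=HEADERS,
    json={"query": "Acme Holdings LLC official website", "num_results": 3},
    timeout=TIMEOUT,
)
search_resp.raise_for_status()  # stop on HTTP errors instead of parsing an error body
results = search_resp.json()["results"]
if not results:
    raise SystemExit("No search results; route this record to manual review")
top_url = results[0]["url"]

# Step 2: Company Identify (free, zero credits)
identify_resp = requests.post(
    "https://api.crustdata.com/screener/identify",
    headers=HEADERS,
    json={"query_company_website": top_url, "count": 3},
    timeout=TIMEOUT,
)
identify_resp.raise_for_status()
candidates = identify_resp.json()
if not candidates:
    raise SystemExit("No identification candidates; retry later or review manually")
best_match = candidates[0]  # results are confidence-ranked, best candidate first
company_id = best_match["company_id"]

# Step 3: Enrich the resolved company (1 credit)
enrich_resp = requests.get(
    "https://api.crustdata.com/screener/company",
    headers=HEADERS,
    params={
        "company_id": company_id,
        "fields": "headcount,funding_and_investment,taxonomy",
    },
    timeout=TIMEOUT,
)
enrich_resp.raise_for_status()
company_data = enrich_resp.json()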

The identification step is free. You pay only for enrichment after the company is confirmed, which means you can run identification against millions of messy records without cost, then enrich only the confirmed matches.

Handling edge cases with a review queue

Build a confidence threshold into the pipeline. When the identification API returns a match with a high confidence score and is_full_domain_match is true, auto-accept it. When confidence is lower or multiple candidates score similarly, route the record to a review queue.

The cost of skipping this step compounds over time. One team's previous tool returned Netscape as a match for AOL with no confidence signal. That bad match propagated through their enrichment pipeline and corrupted downstream company groupings. A confidence-based review queue catches these before they spread.
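The routing rule can be sketched as a small pure function. The is_full_domain_match flag is named above, but the numeric "confidence" field, the 0.9 acceptance threshold, and the 0.1 margin between the top two candidates are assumptions for illustration. Check the actual response shape and tune the thresholds against your own labeled sample.

```python
def route_match(candidates: list[dict], accept_at: float = 0.9, margin: float = 0.1) -> str:
    """Return "accept" or "review" for one input record.

    Each candidate is assumed to carry a numeric "confidence" and an
    "is_full_domain_match" flag; the thresholds are illustrative defaults.
    """
    if not candidates:
        return "review"
    ranked = sorted(candidates, key=lambda c: c["confidence"], reverse=True)
    top = ranked[0]
    runner_up = ranked[1]["confidence"] if len(ranked) > 1 else 0.0
    clear_winner = top["confidence"] - runner_up >= margin
    if top["confidence"] >= accept_at and top.get("is_full_domain_match") and clear_winner:
        return "accept"
    return "review"


clean = [
    {"name": "AOL", "confidence": 0.97, "is_full_domain_match": True},
    {"name": "Netscape", "confidence": 0.40, "is_full_domain_match": False},
]
ambiguous = [
    {"name": "NielsenIQ", "confidence": 0.82, "is_full_domain_match": False},
    {"name": "Nielsen Holdings", "confidence": 0.80, "is_full_domain_match": False},
]

print(route_match(clean))      # accept
print(route_match(ambiguous))  # review
```

Records routed to "review" are also the natural input for an LLM disambiguation step, since they already come with a short candidate list and the original context.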

For teams processing large volumes, a 24-hour retry pattern also helps: if a company is not yet in the identification database, the API returns a pending status. Queue these records and retry the next day. The database updates continuously, so records that fail today may resolve tomorrow.
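A minimal in-memory sketch of that retry pattern follows. The literal status string "pending" and the record IDs are placeholders, and a production version would persist the queue in a database or job scheduler rather than a dict.

```python
from datetime import datetime, timedelta

RETRY_DELAY = timedelta(hours=24)


def handle_result(queue: dict, record_id: str, status: str, now: datetime) -> None:
    """Schedule pending records for a retry; clear anything that resolved."""
    if status == "pending":  # assumed status value, check the real response
        queue[record_id] = now + RETRY_DELAY
    else:
        queue.pop(record_id, None)


def due_for_retry(queue: dict, now: datetime) -> list[str]:
    """Record IDs whose retry time has arrived, in a stable order."""
    return sorted(rid for rid, due in queue.items() if due <= now)


queue: dict[str, datetime] = {}
start = datetime(2026, 5, 24, 9, 0)
handle_result(queue, "rec-1", "pending", start)
handle_result(queue, "rec-2", "resolved", start)

print(due_for_retry(queue, start + timedelta(hours=1)))   # []
print(due_for_retry(queue, start + timedelta(hours=24)))  # ['rec-1']
```

Capping the number of retries per record is a sensible addition, so that companies that never appear in the database eventually fall through to manual review.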

Conclusion

Entity resolution for builders is a pipeline step you add to your existing architecture. The right tool depends on where your data is and what your pipeline needs. If you need company name resolution before enrichment, an API with a canonical company graph (like Crustdata's free Company Identification API) handles that in a single call. If you need to build custom matching logic against your own reference data, open-source libraries like Splink or Dedupe give you full control. If you need to embed resolution into your own infrastructure at scale, Senzing runs entirely in your environment.

Start with the resolution step. Sign up for Crustdata's free tier (100 credits included) and test the Company Identify + Web Search + Enrichment pipeline against your own messy data. For teams building internal sales tools or recruiting platforms at scale, book a demo to walk through the architecture.

Abhilash writes about data-driven automation, enrichment systems, and API-powered intelligence for GTM, recruiting, and investment use cases. He writes for builders who care about accuracy, latency, and reliability with technical guidelines and tips.