Why Lookalike Candidate Search Is Still Broken and How to Build It Better

Lookalike candidate search fails because tools flatten profiles into keywords. Learn how to build one that works using composable filters, recency weighting, and adjacent talent pool discovery.

Published

Apr 25, 2026

Written by

Chris Pisarski

Reviewed by

Manmohit Grewal

Most AI sourcing tools offer some version of "find more like this." Paste a profile, and the tool returns similar candidates. In practice, the feature rarely works. Recruiters who rely on it report that only about a third of the returned profiles are worth opening; the rest is keyword-adjacent noise that wastes hours of review time.

The root cause is architectural. These tools flatten your input into keyword tokens before searching, which destroys the semantic context that made the original profile a good fit in the first place. The output is a search that matches on overlapping words rather than overlapping qualifications.

The failure has a name, a specific mechanism, and a solution. Here is how it works and what to build instead.

How Lookalike Candidate Search Is Supposed To Work

A recruiter has a strong candidate: someone they placed previously, a high performer on a client's team, or a profile they found manually. They paste that profile into a sourcing tool. The tool analyzes it, extracts the relevant signals (title, skills, tenure, company type, industry), and searches its database for people with similar backgrounds. Results come back ranked by a match score, typically displayed as a percentage.

Tools that offer this include LinkedIn Recruiter's "Similar Profiles" feature (which returns up to 25 results), Juicebox, SeekOut, and several others. The promise is the same across all of them: one good profile should be enough to surface more.

Where Lookalike Search Breaks Down

The keyword flattening problem

When a recruiter pastes a reference profile into a sourcing tool, the tool needs to convert that profile into a query it can run against its database. Most tools do this by extracting keywords from the profile and running a token-matching search. The tool pulls out individual words and phrases, then looks for other profiles that contain enough of those same tokens to cross a similarity threshold.

This is where the process breaks. A recruiter sourcing for a cloud infrastructure engineer with GCP experience might expect the tool to understand that this means distributed systems work, infrastructure-as-code, Kubernetes, and the specific skill set that comes with operating at scale on Google Cloud. Instead, the tool tokenizes the input. "Cloud" becomes one keyword. "GCP" becomes another. The tool then matches against any profile that contains enough of these tokens, regardless of whether the person actually does cloud infrastructure work.

One recruiting team we spoke with described running a lookalike search for cloud engineers and finding that the tool had split "GCP" into individual characters, matching profiles that had nothing to do with Google Cloud Platform. The search returned results, and the match scores looked reasonable, but the candidates were wrong.

This happens because these tools translate natural language into categorical filters without preserving the relationships between terms, so the tool sees a bag of keywords and returns profiles with overlapping bags regardless of whether the underlying roles are related.
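To make the mechanism concrete, here is a minimal sketch of bag-of-tokens matching. It is not any vendor's actual implementation; the tokenizer, the Jaccard score, and both candidate profiles are assumptions written for illustration. It shows how a profile from a different function can outscore a genuinely similar one simply by repeating the reference profile's words.

```python
import re

def tokenize(text: str) -> set[str]:
    """Flatten free text into a lowercase bag of word tokens (order and context are lost)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    """Share of tokens the two bags have in common: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

reference = "Cloud infrastructure engineer, GCP, Kubernetes, distributed systems"

# Hypothetical candidate profiles, invented for this example.
sales_rep = "Sales engineer selling cloud infrastructure: GCP, Kubernetes, distributed systems"
sre = "Site reliability engineer running Terraform and GKE clusters on Google Cloud"

ref_tokens = tokenize(reference)
score_sales = jaccard(ref_tokens, tokenize(sales_rep))  # 7 shared tokens out of 9
score_sre = jaccard(ref_tokens, tokenize(sre))          # 2 shared tokens out of 16

print(f"sales rep: {score_sales:.3f}")  # 0.778
print(f"SRE:       {score_sre:.3f}")    # 0.125
```

The salesperson reuses almost every word in the reference profile and scores far higher than the site reliability engineer, who does comparable work but describes it with different vocabulary.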

Match scores that do not reflect reality

The downstream effect of keyword flattening is inflated match scores. A profile that shares many of the same individual tokens as the reference profile will score highly, even if the person works in a completely different function.

Recruiting teams sourcing for specialized technical roles report that out of several hundred "matched" candidates from a single lookalike search, roughly a third are actually worth reviewing. The rest share surface-level keywords with the reference profile but differ in seniority level, specialization, or the type of company they work at. The match percentage looks high because of the keyword overlap, but the fit is low because keyword overlap is a poor proxy for role similarity.

This creates a paradox where the tool appears to be working (high match scores, many results) while actively wasting the recruiter's time (most results are irrelevant).

The recency problem

Keyword flattening is not the only failure mode. Lookalike tools also struggle with time. Candidate profiles contain historical data: someone who worked in computer vision research at a major tech company for two years starting in 2016 and then moved into product management will still show up in a computer vision lookalike search in 2026, because the keywords are on their profile. They held the title and listed the skills, and ten years of career movement away from that function does not change the token match.

Recruiting teams report seeing candidates who have been at a large company for eight or more years in a role unrelated to the search query, but who still appear as high-confidence matches because their historical profile contains the right keywords. The tool treats a 2016 job title the same as a 2026 job title.

Without recency weighting, lookalike search biases toward profiles with the most accumulated keywords, which tends to mean the longest careers and the most historical roles, regardless of whether the person currently does the relevant work.
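One way to picture the alternative is an exponential decay on how long ago a matching role ended. The sketch below is an assumption for illustration, not a description of any tool's ranking: the three-year half-life, the `(title, end_date)` role format, and both example careers are hypothetical.

```python
from datetime import date

def recency_weight(role_end: date, today: date, half_life_years: float = 3.0) -> float:
    """Exponential decay: a role that ended one half-life ago counts half as much."""
    years_ago = max((today - role_end).days, 0) / 365.25
    return 0.5 ** (years_ago / half_life_years)

def score(roles, keyword: str, today: date) -> float:
    """Best recency-weighted keyword match across roles.

    Each role is (title, end_date); end_date None means the role is current.
    """
    best = 0.0
    for title, end in roles:
        if keyword.lower() in title.lower():
            best = max(best, recency_weight(end or today, today))
    return best

today = date(2026, 4, 25)

# Hypothetical careers: one left computer vision in 2018, one works in it now.
former = [("Computer Vision Researcher", date(2018, 6, 30)), ("Product Manager", None)]
current = [("Computer Vision Engineer", None)]

former_score = score(former, "computer vision", today)
current_score = score(current, "computer vision", today)
print(f"former: {former_score:.2f}, current: {current_score:.2f}")
```

A pure token match would treat both careers as equal hits. With the decay applied, the current practitioner keeps a full score while the former researcher drops to a small fraction of it.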

Big-company bias

A related problem is that lookalike tools over-index on employees at large, well-known companies. These profiles tend to be the most complete (large companies often encourage employees to maintain detailed profiles), the most numerous (a company with 50,000 employees creates 50,000 searchable profiles), and the most keyword-rich (standardized titles and skill endorsements).

For recruiters sourcing specialized roles, this creates a distortion. A search for robotics engineers returns results dominated by employees at the largest names in the space, even when the recruiter's client is a 200-person hardware company that needs someone comfortable operating in a smaller, less structured environment. Specialists at mid-stage companies, academic labs, or adjacent industries are pushed down or excluded entirely.

These people do match on keywords, but keyword match is a poor proxy for candidate fit when the recruiter needs someone at a particular company stage, with a particular specialization, in an environment that matches their client's.

The Missing Capability: Adjacent Talent Pool Discovery

Beyond the keyword flattening problem, few lookalike tools provide the feature recruiters actually want next, which is suggesting where else to look.

When a recruiter exhausts the obvious talent pool for a specialized role, the next step is lateral thinking. A recruiter hiring robotics engineers might know that the direct talent pool (people with "robotics" in their title) is small and heavily competed over. What they need is a tool that suggests related pools, like defense contractors building autonomous systems, medical device companies working on surgical robotics, avionics teams at aerospace firms, or autonomous vehicle companies whose engineers have overlapping skill sets.

This kind of adjacent thinking is how experienced recruiters actually expand their searches. One team we spoke with described this as the most valuable capability they wished existed, where after exhausting the direct pool the tool would recommend adjacent industries and company types where transferable skills exist. Instead of returning more of the same (which is what existing lookalike search does), the tool would point in new directions.

Few sourcing tools do this programmatically. Recruiters rely on their own industry knowledge and manual research to identify adjacent pools. For generalist recruiters or those working outside their area of specialization, these adjacent pools are invisible.

What A Working Lookalike System Actually Needs

Fixing lookalike search requires changing the architecture rather than layering a better algorithm on top of the same design. Three structural requirements separate a system that works from one that matches keywords.

Composable filters instead of opaque matching. Rather than pasting a profile into a black box and hoping the tool interprets it correctly, the system should decompose what makes a candidate a fit into explicit, adjustable filters. Title, seniority, skills, company size, industry, geography, and years of experience should each be a discrete filter that the recruiter can see, adjust, and override. When a filter produces irrelevant results, the recruiter changes that specific filter instead of re-running the entire black-box search and hoping for different output.

Recency weighting. The system should distinguish between what someone is doing now and what they did five or ten years ago. Filtering by when someone held a relevant role, whether they recently changed jobs, and how long they have been in their current position prevents the historical-keyword problem. A profile with the right keywords from 2016 should not rank the same as one from 2026.

Company and industry expansion. The system should support searching across related company types and industries beyond the ones represented in the reference profile. If the reference candidate works in robotics, the search should be expandable to defense, medical devices, aerospace, and autonomous vehicles without requiring the recruiter to manually build separate searches for each adjacent sector.

How To Build Lookalike Search That Works

The following workflow shows how to construct a working "find more like this" system using structured APIs instead of opaque matching. This approach uses Crustdata's People Enrichment API to decompose a reference profile into structured data, then its People Discovery API to search across a billion-plus profiles using composable filters.

Each step uses the robotics engineer search from earlier as a running example, showing how the API solves the exact problems that keyword flattening creates.

Step 1: Decompose the reference profile into structured data

Start with the reference candidate's professional network profile URL. Use the People Enrichment API to pull a structured breakdown of their profile into discrete, queryable fields.

The response returns 90+ datapoints as structured JSON, including current title, employer, past employers with dates, skills, education, years of experience, and location. Each piece of the profile comes back as its own field rather than a blob of text.

This is the opposite of what keyword-flattening tools do. Instead of reducing the profile to a bag of tokens, enrichment preserves the structure so you can see that someone is a Senior Robotics Engineer (title) at a 200-person hardware company (current_company_employee_count) in Austin, Texas (location) who previously worked in defense systems (past_employers). Each of these becomes a filter you can use in the next step.

Step 2: Build composable search filters from the structured profile

Take the structured fields from the enrichment response and translate them into People Discovery API filters. This is where the GCP problem from earlier gets solved. Instead of letting a tool tokenize "cloud infrastructure engineer with GCP experience" into a keyword bag, you define exactly what each dimension means.

curl --request POST \
  --url 'https://api.crustdata.com/screener/person/search/' \
  --header "Authorization: Token $CRUSTDATA_API_TOKEN" \
  --header 'Content-Type: application/json' \
  --data '{
    "filters": [
      {
        "filter_type": "CURRENT_TITLE",
        "type": "in",
        "value": ["Robotics Engineer", "Controls Engineer", "Automation Engineer"]
      },
      {
        "filter_type": "SENIORITY_LEVEL",
        "type": "in",
        "value": ["Senior", "Lead", "Manager"]
      },
      {
        "filter_type": "REGION",
        "type": "in",
        "value": ["United States"]
      },
      {
        "filter_type": "INDUSTRY",
        "type": "in",
        "value": ["Robotics Engineering", "Automation Machinery Manufacturing"]
      }
    ],
    "page": 1
  }'

Each filter is explicit. "Robotics Engineer" is matched as a complete title, so there is no risk of the tool splitting it into "Robotics" and "Engineer". If "Automation Engineer" is bringing in too many manufacturing-automation results, remove it from the title array. If the geographic filter is too narrow, swap "United States" for specific metro regions using the autocomplete endpoint.

The recruiter controls what "similar" means for each search, and every filter is visible and adjustable.
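The translation from structured profile to search payload can also be scripted. The sketch below is one possible shape, not an official client: the filter names mirror the request above, while the `profile` keys (`title`, `seniority`, `region`, `industries`) and both helper functions are hypothetical and would need to be mapped onto the real enrichment response.

```python
import json

def build_filters(profile: dict, extra_titles: tuple = ()) -> dict:
    """Translate structured profile fields into an explicit, editable filter list."""
    titles = [profile["title"], *extra_titles]
    filters = [
        {"filter_type": "CURRENT_TITLE", "type": "in", "value": titles},
        {"filter_type": "SENIORITY_LEVEL", "type": "in", "value": [profile["seniority"]]},
        {"filter_type": "REGION", "type": "in", "value": [profile["region"]]},
        {"filter_type": "INDUSTRY", "type": "in", "value": list(profile["industries"])},
    ]
    return {"filters": filters, "page": 1}

def drop_value(payload: dict, filter_type: str, value: str) -> dict:
    """Return a copy of the payload with one value removed from one filter."""
    new_filters = []
    for f in payload["filters"]:
        if f["filter_type"] == filter_type:
            f = {**f, "value": [v for v in f["value"] if v != value]}
        new_filters.append(f)
    return {**payload, "filters": new_filters}

# Hypothetical structured profile (field names are assumptions).
profile = {
    "title": "Robotics Engineer",
    "seniority": "Senior",
    "region": "United States",
    "industries": ["Robotics Engineering", "Automation Machinery Manufacturing"],
}

payload = build_filters(profile, ("Controls Engineer", "Automation Engineer"))
# "Automation Engineer" is pulling in too many manufacturing results: remove only that value.
narrowed = drop_value(payload, "CURRENT_TITLE", "Automation Engineer")
print(json.dumps(narrowed, indent=2))
```

Because each dimension is a separate entry, adjusting one filter leaves the others untouched, which is the property the black-box approach lacks.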

Step 3: Add recency and activity signals

This step solves the recency problem described earlier, where someone with eight or more years in an unrelated role still surfaces as a high-confidence match. The CURRENT_TITLE filter from Step 2 already excludes people whose robotics experience is purely historical, because it only matches the title they hold right now. But the In-DB People Search API goes further by letting you filter on exact dates. You can combine current_employers.title with current_employers.start_date to find people who started a relevant role within a specific window, or use past_employers.title with past_employers.start_date to find people who held the title recently at a previous employer.

curl --request POST \
  --url 'https://api.crustdata.com/screener/persondb/search' \
  --header "Authorization: Token $CRUSTDATA_API_TOKEN" \
  --header 'Content-Type: application/json' \
  --data '{
    "filters": {
      "op": "and",
      "conditions": [
        {
          "filter_type": "current_employers.title",
          "type": "(.)",
          "value": "Robotics Engineer"
        },
        {
          "filter_type": "current_employers.start_date",
          "type": ">",
          "value": "2024-01-01"
        },
        {
          "filter_type": "region",
          "type": "(.)",
          "value": "United States"
        }
      ]
    },
    "limit": 25
  }'

This query returns people whose current title fuzzy-matches "Robotics Engineer" and who started that role after January 2024. Someone who held a robotics title in 2016 and is now a product manager will not appear because their current title does not match. Someone who has carried the same robotics title at the same company since 2015 will also not appear because their start date falls outside the window. What surfaces are people who recently moved into the role, meaning their experience is current and they may be open to conversations.

You can also query across past employers. Replacing current_employers with past_employers in the title and start date filters finds people who held a robotics engineering title at a previous company within the same time window. This catches candidates who recently moved into adjacent roles like systems engineering or technical program management but were doing hands-on robotics work within the last two years.
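Hard-coding the cutoff date means the window drifts as time passes. As a hedged sketch, the helper below computes a rolling window and emits the same request body shape as the query above for either the current or the past employer scope. The function name and its parameters are hypothetical; the field names and the `(.)` and `>` operators follow the request above.

```python
from datetime import date

def recent_role_query(title: str, region: str, today: date,
                      window_years: int = 2,
                      scope: str = "current_employers",
                      limit: int = 25) -> dict:
    """Build a search body for people who started a matching role inside the window."""
    if scope not in ("current_employers", "past_employers"):
        raise ValueError(f"unknown scope: {scope}")
    # Use the first day of the month so the cutoff is always a valid calendar date.
    cutoff = date(today.year - window_years, today.month, 1)
    return {
        "filters": {
            "op": "and",
            "conditions": [
                {"filter_type": f"{scope}.title", "type": "(.)", "value": title},
                {"filter_type": f"{scope}.start_date", "type": ">", "value": cutoff.isoformat()},
                {"filter_type": "region", "type": "(.)", "value": region},
            ],
        },
        "limit": limit,
    }

today = date(2026, 4, 25)
current_query = recent_role_query("Robotics Engineer", "United States", today)
past_query = recent_role_query("Robotics Engineer", "United States", today,
                               scope="past_employers")
print(current_query["filters"]["conditions"][1]["value"])  # 2024-04-01
```

Running both queries and merging the results covers people who hold the title now and people who held it recently before moving to an adjacent role.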

Step 4: Expand to adjacent talent pools

This is where the search goes from "find more of the same" to "find candidates the recruiter wouldn't have thought to search for." Modify the industry filter to include adjacent sectors where transferable skills exist.

curl --request POST \
  --url 'https://api.crustdata.com/screener/person/search/' \
  --header "Authorization: Token $CRUSTDATA_API_TOKEN" \
  --header 'Content-Type: application/json' \
  --data '{
    "filters": [
      {
        "filter_type": "CURRENT_TITLE",
        "type": "in",
        "value": ["Robotics Engineer", "Controls Engineer", "Systems Engineer", "Mechatronics Engineer"]
      },
      {
        "filter_type": "SENIORITY_LEVEL",
        "type": "in",
        "value": ["Senior", "Lead", "Manager"]
      },
      {
        "filter_type": "INDUSTRY",
        "type": "in",
        "value": ["Defense and Space Manufacturing", "Medical Equipment Manufacturing", "Aviation and Aerospace Component Manufacturing", "Motor Vehicle Manufacturing"]
      },
      {
        "filter_type": "REGION",
        "type": "in",
        "value": ["United States"]
      }
    ],
    "page": 1
  }'

The industry filter now targets defense contractors, medical device firms, aerospace companies, and automotive manufacturers where robotics-adjacent engineers work. The title array has also expanded to include "Systems Engineer" and "Mechatronics Engineer," titles that are common in these adjacent industries for people doing overlapping work.

This is the recruiter's domain knowledge encoded as search parameters rather than inferred by an AI model. The recruiter defines which industries are adjacent, and the API executes the expanded search. For recruiters working outside their specialty, the company data from the enrichment response (Step 1) includes industry classifications and competitor lists that can inform which adjacent sectors to target.
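One way to keep that domain knowledge reusable is a recruiter-maintained adjacency map that rewrites a direct-pool payload into an adjacent-pool one. This is a sketch under assumptions: the map structure and the `expand_to_adjacent` helper are hypothetical, and the industry and title strings are the ones used in the requests above.

```python
# Recruiter-maintained map: direct industry -> adjacent industries and the titles used there.
ADJACENT = {
    "Robotics Engineering": {
        "industries": [
            "Defense and Space Manufacturing",
            "Medical Equipment Manufacturing",
            "Aviation and Aerospace Component Manufacturing",
            "Motor Vehicle Manufacturing",
        ],
        "titles": ["Systems Engineer", "Mechatronics Engineer"],
    },
}

def expand_to_adjacent(payload: dict, adjacency: dict) -> dict:
    """Swap the industry filter for adjacent sectors and widen the title filter."""
    base = next((f["value"] for f in payload["filters"]
                 if f["filter_type"] == "INDUSTRY"), [])
    industries, titles = [], []
    for industry in base:
        entry = adjacency.get(industry, {})
        for name in entry.get("industries", []):
            if name not in industries and name not in base:
                industries.append(name)
        for title in entry.get("titles", []):
            if title not in titles:
                titles.append(title)
    if not industries:
        return payload  # nothing known to be adjacent; leave the search unchanged
    new_filters = []
    for f in payload["filters"]:
        if f["filter_type"] == "INDUSTRY":
            f = {**f, "value": industries}
        elif f["filter_type"] == "CURRENT_TITLE":
            f = {**f, "value": f["value"] + [t for t in titles if t not in f["value"]]}
        new_filters.append(f)
    return {**payload, "filters": new_filters}

direct = {
    "filters": [
        {"filter_type": "CURRENT_TITLE", "type": "in",
         "value": ["Robotics Engineer", "Controls Engineer"]},
        {"filter_type": "INDUSTRY", "type": "in", "value": ["Robotics Engineering"]},
    ],
    "page": 1,
}
adjacent = expand_to_adjacent(direct, ADJACENT)
print(adjacent["filters"][0]["value"])
```

The map is where a specialist's lateral thinking gets written down once, so a generalist running the same search later inherits it.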

Step 5: Monitor for new candidates continuously

Rather than re-running searches manually, set up a Watcher that pushes a webhook notification when a new candidate matches your criteria.

curl --request POST \
  --url 'https://api.crustdata.com/watcher/watch' \
  --header "Authorization: Token $CRUSTDATA_API_TOKEN" \
  --header 'Content-Type: application/json' \
  --data '{
    "entity_type": "person",
    "triggers": [
      {"event_type": "job_change", "filter": {"title_keyword": "Robotics Engineer OR Controls Engineer OR Mechatronics Engineer"}},
      {"event_type": "profile_update", "filter": {"skill_keyword": "robotics OR ROS OR motion planning"}}
    ],
    "webhook_url": "https://your-endpoint.com/webhook"
  }'

With these two triggers, you receive an alert when someone changes jobs into a relevant role or updates their profile with matching skills, with no polling required. Alerting on moves to a company in your target set would need an additional trigger, which the request above does not include. This turns a one-time search into an ongoing pipeline where the talent pool updates itself.
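On the receiving side, the webhook endpoint mostly needs to parse each alert and avoid queuing the same person twice. The sketch below is only illustrative: the payload shape, including the `person_id` and `event_type` fields, is an assumption and not taken from the Watcher documentation, so check the real schema before relying on it.

```python
import json
from typing import Optional

def handle_alert(raw_body: bytes, seen_ids: set) -> Optional[dict]:
    """Parse one webhook body; return the event if the person is new, else None."""
    event = json.loads(raw_body)
    person_id = event.get("person_id")  # hypothetical field name
    if person_id is None or person_id in seen_ids:
        return None
    seen_ids.add(person_id)
    return event

seen: set = set()
# Hypothetical alert body, invented for this example.
body = json.dumps({"person_id": "p-123", "event_type": "job_change"}).encode()

first = handle_alert(body, seen)   # new person: returned for review
second = handle_alert(body, seen)  # same person again: skipped
print(first, second)
```

In a real service, `seen` would live in a database so that deduplication survives restarts.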

Full API documentation covers each endpoint in detail, and the existing guide on building a candidate search engine walks through the technical implementation end-to-end.

What Changes When You Get This Right

Lookalike candidate search is the right concept. Finding people similar to a known strong candidate is exactly what recruiters need. Existing tools fail at it because they reduce a rich professional profile to a set of keywords and then measure overlap, losing everything that made the original profile a good match.

The fix is structural. A working system decomposes the reference profile into queryable fields, lets the recruiter define what "similar" means through composable filters, weights for recency so historical keywords do not dominate, and expands into adjacent industries when the direct pool runs thin. Adding continuous monitoring means the search keeps working after you walk away.

Recruiting teams that build on this architecture spend their time evaluating candidates who actually fit, instead of scrolling past hundreds of keyword matches that looked right on paper.

Crustdata's People Discovery and People Enrichment APIs provide the data layer for recruiting workflows, covering over a billion profiles with 60+ composable search filters and real-time enrichment. Book a demo to see it working on your own searches.

Chris writes about how modern teams use real-time data to make better decisions across sales, recruiting, and investment. His focus is on highlighting how live people and company insights help teams spot opportunities earlier, personalize outreach with context, and build stronger pipelines.