Fuzzy Matching Algorithms
What you need to know if you're using Fuzzy Matching Algorithms
in customer data applications
Fuzzy Matching &
Unifying Customer Data
Most people who work with data believe ‘phonetic’ algorithms like Soundex and ‘distance’ algorithms like Levenshtein are reliably accurate matching algorithms. But look deeper and you’ll see they’re not. They are far from it!
Nothing proves this fact more clearly than the published research: nearly every resource, from Accenture to Gartner, reports that average duplication rates in customer databases today are still around 5 percent, and in hospital MPI databases that number climbs to 9.4 percent. You have to ask yourself — why?
Customer data is the lifeblood of every business – and it’s critical. Every touchpoint, department, and person relies on the accuracy of the data these systems hold.
“Conventional” Fuzzy Matching
Conventional matching algorithms are specifically written and narrowly designed to solve specific patterns of difference in data.
Each algorithm generates measures for different data scenarios. It’s important to understand that the choice of which algorithm is best is not driven by the user – it’s the data that really determines the best algorithm. It’s up to the end user to figure out which algorithm fits their data.
This process of determining the best algorithm requires an approach of build, test, analyze, tweak and repeat. When data assumptions fail for one algorithm, pick another and try again until you pick a winner.
It’s also important to recognize that any one field of data may require multiple approaches to matching. For example, a distance algorithm would detect the similarity between Thompson and Thomson, but not detect similarity between a name like Lindsey and Linzy. Both types of data defects in that field of data would require testing of different phonetic and distance algorithms applied to that same field of data.
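The Thompson/Thomson versus Lindsey/Linzy gap is easy to demonstrate with a standard pure-Python Levenshtein implementation (a generic textbook version, not Syniti’s code):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute ca -> cb
        prev = curr
    return prev[-1]

print(levenshtein("thompson", "thomson"))  # 1 -> an easy distance match
print(levenshtein("lindsey", "linzy"))     # 3 -> too far apart for a typical distance threshold
```

A distance of 1 on an eight-letter name is a confident match; a distance of 3 on a five-to-seven-letter name is indistinguishable from two genuinely different names, which is why that defect needs a different technique.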
Other data issues require approaches that conventional matching algorithms don’t address at all. Nickname pairs like Chuck and Charles, or the relationship between city names like New York, NYC, and Brooklyn, require a completely different approach.
The Fact is: Data isn’t Perfect
#DataGymnastics and #RegExHell
When it comes to working with fuzzy matching algorithms to match and unify customer data, it isn’t exactly easy. As a matter of fact, it’s hard. That’s because there are many nuances to customer data, and as a result, fuzzy matching algorithms are only part of the matching equation.
Every instance of data inaccuracy starts with the point of entry, and in every instance, the contact record was created by a human – regardless of its source. Take a moment to think about that statement. Your data and the data you acquire comes from somewhere – and the genesis is a human, with fingers on a keyboard.
Conventional matching processes use a library of algorithms like Soundex, Metaphone and Levenshtein, and require significant data wrangling to extract, transform, standardize and normalize data prior to matching. The algorithms must then be folded into substring matchkeys to find potential fuzzy and phonetic matches.
It’s a long iterative process of trial and error, playing with various algorithms and matchcodes just to figure out how to get ‘adequate’ results. Customer data is unique – and the techniques required to match on it are unlike any other form of data matching.
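To see why matchkeys are so brittle, here is a generic illustration of the kind of substring key a conventional process builds (the fields and substring lengths are hypothetical, not Syniti’s or any vendor’s actual keys):

```python
def matchkey(first: str, last: str, zip_code: str) -> str:
    """A typical conventional matchkey: crude substrings glued together.
    Two records are only considered match candidates if keys collide exactly."""
    return (last[:4] + first[:1] + zip_code[:3]).upper()

# A spelling variant late in the surname still collides...
print(matchkey("Robert", "Thompson", "02134"))  # THOMR021
print(matchkey("Robert", "Thomson", "02134"))   # THOMR021
# ...but a variant inside the first four letters silently slips through.
print(matchkey("Ann", "Lindsey", "02134"))      # LINDA021
print(matchkey("Ann", "Linzy", "02134"))        # LINZA021
```

Every new defect pattern forces a new key design, which is exactly the build–test–tweak–repeat loop described above.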
The data tells the whole story —
and data doesn't lie.
What most people don’t realize is, the underlying algorithms and process every other Data Quality, Data Integration, and Data Analytics Platform relies on – are fundamentally flawed.
If you dig into the data science and research around conventional matching algorithms you’ll find that a staggering 66% of the matches reported by Soundex were incorrect, and 25% of true matches were completely missed! — and these numbers are not unique to Soundex.
Missed matches and
false matches galore!
Phonetic algorithms don’t understand the nuance of names. They have no tolerance for random typos or misspellings, misfielded names, or double names (e.g. hyphenated married name), and the algorithms must be applied to each name element in isolation (first, middle, last).
Phonetic algorithms only deal with sound, not pronunciation. They don’t understand that words like Milan and Mulan have different stressed syllables, and they certainly don’t recognize Chuck from Charles, or Elizabeth from Liz or Betty. You’re left to figure that out!
Even as an algorithm built for phonetics, it’s too often incapable of matching names that clearly sound the same, such as Lee and Leigh, Walker and Waker, or Thomson and Thompson. Take a name like ‘Matthew McConaughey’. The Soundex code for McConaughey is M252. With that same code, McConaughey would also match Magnus, Mocking, Mackenzie and, for that matter – Messemaeckers van de Graaff!
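You can verify the M252 collision yourself with a minimal implementation of standard American Soundex (pure Python, following the usual coding rules):

```python
def soundex(name: str) -> str:
    """Standard American Soundex: first letter + three digits, zero-padded."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    out, prev = [], codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":                 # h/w do not break a run of equal codes
            continue
        code = codes.get(ch, "")       # vowels and y map to "" and reset the run
        if code and code != prev:
            out.append(code)
        prev = code
    return (name[0].upper() + "".join(out) + "000")[:4]

print(soundex("McConaughey"), soundex("Magnus"), soundex("Mackenzie"))  # M252 M252 M252
print(soundex("Thompson"), soundex("Thomson"))                          # T512 T525
print(soundex("Lee"), soundex("Leigh"))                                 # L000 L200
```

Wildly different names collapse onto M252, while near-identical names like Thompson/Thomson and Lee/Leigh land in different buckets — false matches and missed matches from the same algorithm.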
And – contrary to popular belief – the Soundex phonetic algorithm was NEVER developed for the purposes of “matching” names – it indexes – it does NOT match.
"Doing the same thing over and over and expecting different results."
Meet the next-generation
approach to matching and
unifying contact data.
Why Syniti Intelligent Matching Engine?
Unlike competing applications or scripted SQL queries – the Syniti Matching Engine doesn’t require data standardization, correction or manipulation prior to matching. It doesn’t require two different data sources to be normalized into a common format or a target database. It even treats addresses as an object, so you can match on unstandardized addresses with different inputs, and even on poorly structured global address data.
The 360 Matching Engine matches entire records, and doesn’t rely on a single algorithm applied to a field, or extended match keys. The Matching Engine uses multiple sophisticated approaches specifically for the nuances of contact data.
The Engine intelligently grades and scores matches – using all available data to confidently determine which records are a true match and which records are NOT!
Your Matching Process
Simply Can't Handle This…
ProTip for Data Scientists, Analysts, and DBA’s
The Syniti Data Matching Engine can identify Individual level, household level, and business level matches all in one routine – without the need to create a new Matchcode or generate new match keys on the data!
While these rules have been designed and tested to be extremely effective – the matching logic AND the scoring rules are fully user configurable to give you the ability to identify duplicates using any data e.g. account number, Social Security Number, phone number, date of birth, blood type, eye color, shoe size – or anything else known about the customer.
Try that with your matching process!
Smarter — Match Logic
Because one methodology cannot be relied on exclusively to deal with all variations found in a typical database, Syniti Data Matching uses multiple algorithms and lexicons to ensure that all types of difference are detected, taking a 3-dimensional view of the data, never relying on any single item of data being correct or consistent!
The logic identifies and strips out noise words, such as ‘of’ & ‘the’. AND it creates relationships between words like ‘cars’ & ‘motors’ and ‘inc’ & ‘incorporated’. AND the tokens account for miskeyed data. AND the tokens account for the consonant sounds, AND the vowel sounds, AND the stressed syllables in the name.
The Syniti Matching Engine incorporates a robust, pre-tuned yet highly configurable scoring engine that blends probabilistic analysis with deterministic rules in a separate but tightly connected process to identify matching records.
The engine works through the data by taking two records at a time from the cluster and comparing them field by field, analyzing and grading points of similarity. The Matching Engine applies a weight to any item of data within the records being compared – to specify how much each field or field group contributes to the overall match score.
You can also instruct the Syniti Matching Engine to identify “Automatic False-Positives” or “Automatic Positive Matches” when selected data elements do or do not match. You can even weight individual attributes positively or negatively – creating matching and unmatching probabilities that will increase or decrease the composite score.
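The weighting-and-override idea can be sketched in a few lines (the weights, field names, and override triggers below are illustrative assumptions, not Syniti’s shipped configuration):

```python
# Hypothetical field weights; per-field similarity scores range 0.0-1.0.
WEIGHTS = {"name": 0.4, "address": 0.3, "phone": 0.2, "dob": 0.1}

def composite_score(similarities: dict[str, float],
                    auto_reject: bool = False,
                    auto_accept: bool = False) -> float:
    """Weighted blend of field similarities, with deterministic overrides."""
    if auto_reject:        # e.g. both records carry SSNs and they differ
        return 0.0
    if auto_accept:        # e.g. account numbers match exactly
        return 1.0
    return sum(WEIGHTS[f] * s for f, s in similarities.items())

score = composite_score({"name": 0.9, "address": 0.8, "phone": 1.0, "dob": 1.0})
print(round(score, 2))  # 0.9
```

Negative weighting fits the same shape: a field configured with a negative weight simply subtracts from the composite score when it disagrees.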