If electronic identity matching were as simple as a computer finding a given name on a list, there would be no need for this article. However, it is a bit more complicated than that. Names have a structure and typology that are influenced by many cultural and regional factors. We humans know these differences and usually recognise them without thinking twice. Machines have to be taught how to make these distinctions. That’s the more complicated bit.
Simple algorithms to match names have been in use for some time. These algorithms compute a similarity score between two names, compare it against a fixed threshold and trigger an alert, or not, accordingly. However, this basic method is by no means foolproof: it either misses genuine matches or generates multiple false positives, alerts that are triggered but are not relevant.
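To make the idea concrete, here is a minimal sketch of threshold-based matching in Python. It uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure; the `screen` function, the sample watchlist and the 0.8 threshold are all assumptions made for illustration, not a description of any particular product.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def screen(name: str, watchlist: list[str], threshold: float = 0.8) -> list[str]:
    """Return every watchlist entry whose similarity to `name` meets the threshold."""
    return [entry for entry in watchlist if similarity(name, entry) >= threshold]


# Hypothetical watchlist, for illustration only.
watchlist = ["Robert Smith", "Juan Garcia"]

for candidate in ["Robert Smith", "Bob Smith", "Roberta Smith"]:
    print(candidate, "->", screen(candidate, watchlist))
```

With these assumed settings, "Bob Smith" scores roughly 0.76 against "Robert Smith" and slips under the threshold, while "Roberta Smith", possibly a different person, triggers an alert. Lowering the threshold catches the first case but lets in more of the second, which is the trade-off described above.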
Considering the sheer volume of data that now needs to be sifted through, the complexity of some of that data and the speed with which the correct data needs to be identified, more sophisticated methods need to be used if organisations want the correct results and a minimal number of false positives.
The good news is that such technology already exists. These more advanced algorithms have dictionaries, linguistics and machine-learning techniques as a base.
Let’s explain a bit further. Dictionary-based matching mechanisms are used to make distinctions between different names. Firstly, there’s translation: words that have the same meaning but completely different spellings in different languages. For example, Germany is the name used in English, Deutschland in German and Allemagne in French.
These mechanisms also apply to multilingual name variants, such as the English John, the French Jean and Spanish Juan. They identify nicknames, such as Bob being a derivative of Robert. They also pick up on synonyms, such as pseudonyms and abbreviations, compound names such as the United States of America, and common name components such as titles. Last but not least, these mechanisms also recognise particles, such as de, von, van, bin and of.
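A dictionary-based mechanism of this kind can be sketched in a few lines of Python. The lookup tables below are tiny, hypothetical examples built from the names mentioned above; a real system would rely on much larger curated dictionaries, and the choice to simply drop titles and particles before comparing is an assumption of this sketch.

```python
# Hypothetical lookup tables, for illustration only.
VARIANTS = {
    "jean": "john", "juan": "john",                    # multilingual variants
    "bob": "robert",                                   # nickname
    "deutschland": "germany", "allemagne": "germany",  # translations
}
TITLES = {"mr", "mrs", "dr", "sir"}
PARTICLES = {"de", "von", "van", "bin", "of"}


def canonical_tokens(name: str) -> list[str]:
    """Lowercase the name, drop titles and particles, map variants to one form."""
    tokens = name.lower().replace(".", "").split()
    return [
        VARIANTS.get(token, token)
        for token in tokens
        if token not in TITLES and token not in PARTICLES
    ]


def same_name(a: str, b: str) -> bool:
    """Treat two names as equivalent when their canonical tokens are identical."""
    return canonical_tokens(a) == canonical_tokens(b)


print(same_name("Dr. Bob Smith", "Robert Smith"))
print(same_name("Jean de Florette", "John Florette"))
print(same_name("Juan Garcia", "Robert Garcia"))
```

Here the first two pairs are treated as the same name and the third is not. Ignoring particles entirely is a blunt choice; a more careful design might keep them and merely give them less weight.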
That’s not all, though. Sometimes even more sophisticated techniques are required to ensure machines can establish that two names actually relate to the same person.
This is when linguistics and machine-learning techniques come into play. Let’s look at transliteration, the ability to compare names in different alphabets: say, locating an Arabic name that has been written in the Cyrillic alphabet, the alphabet used in Eastern Europe and North and Central Asia. At FircoSoft, transliteration is based on a representation of each character in an alphabet by its Latin equivalent. The complexity of the mapping varies according to which language is under the spotlight; matching Chinese characters to Latin characters, for example, is more involved. Then there are language-specific matching rules, such as patronyms and declensions for Cyrillic names or vowellessness for Arabic names.
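The character-by-character idea can be illustrated with a short Python sketch. The mapping table below is a partial, simplified Cyrillic-to-Latin table chosen for this example; it is an assumption and not FircoSoft's actual mapping, and it ignores the language-specific rules mentioned above.

```python
# Partial, simplified Cyrillic-to-Latin table (an assumption for this sketch).
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ж": "zh", "з": "z", "и": "i", "й": "i", "к": "k", "л": "l",
    "м": "m", "н": "n", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "у": "u", "ф": "f", "х": "kh", "ц": "ts", "ч": "ch",
    "ш": "sh", "ы": "y", "э": "e", "ю": "yu", "я": "ya",
}


def transliterate(text: str) -> str:
    """Replace each Cyrillic character with its Latin equivalent.

    Characters that are not in the table are passed through unchanged.
    """
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text.lower())


print(transliterate("Владимир"))  # vladimir
print(transliterate("Мухаммед"))  # mukhammed
```

The second example is an Arabic name written in Cyrillic. Once it is in Latin characters it can be compared with list entries such as "Muhammad", although the spellings still differ, which is where the romanisation problem comes in.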
Transliteration is also used to overcome the lack of standardisation in romanisation, the conversion of a different writing system into the Latin alphabet. While dictionary-based mechanisms can recognise common variants, they might miss the less common ones. Internal transliteration mechanisms can help ensure no variants slip through the net.
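One simple way to see how differently romanised spellings might be brought together is to reduce each name to a rough consonant "skeleton". The Python sketch below is an illustrative technique of our own choosing, not the mechanism described in this article; the `skeleton` function and its three rules (treat "kh" as "h", drop vowels, collapse repeated letters) are assumptions.

```python
import re


def skeleton(name: str) -> str:
    """Reduce a romanised name to a rough consonant skeleton."""
    s = name.lower()
    s = s.replace("kh", "h")         # assumed digraph equivalence
    s = re.sub(r"[aeiou]", "", s)    # drop vowels
    s = re.sub(r"(.)\1+", r"\1", s)  # collapse repeated letters
    return s


for spelling in ["Mohammed", "Muhammad", "Mohamed", "Mukhammed"]:
    print(spelling, "->", skeleton(spelling))
```

All four spellings reduce to "mhmd", so a less common variant is no longer missed. The key is deliberately coarse, however: "Mahmoud" reduces to "mhmd" as well, so a key like this is better suited to shortlisting candidates for closer comparison than to deciding a match on its own.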
What’s next? In next week’s post we look at volume control and hit rates – how to get accurate results while not being overwhelmed by data.