Posted by Admin on Aug 18, 2020 12:00:00 AM
by Kemp Williams, PhD
An ESL teacher with a student named Abdul might assume that the student’s native language is Arabic. The teacher would almost certainly be wrong. Abdul might be Muslim, but there’s very little chance Arabic is his first language. Abdul is an Arabic name almost never used by Arabs.
Abdul is actually two words in Arabic: abd meaning ‘slave’ or ‘servant’, and ul, a form of the definite article ‘the’. In Arab countries, Abdul is always followed by one of the names of God found in the Quran, the Islamic scriptures, or the Sunnah, the sayings and teachings of Mohammed.
So even though Abdul Rahman is really three words in Arabic, it is a single name meaning ‘Servant of the Merciful’. Similarly, Abdul Aziz means ‘Servant of the Strong’, and Abdul Hakim is ‘Servant of the Wise’. There are more than 100 such combinations of Abdul and a name of God. Names like these, which embed a divine name, are called theophoric names.
It would make no sense for a native Arabic speaker to be called Abdul, since ‘servant of the’ is obviously missing something. As Islam spread eastward, however, toward non-Arab countries like Iran, Pakistan, and Indonesia, Arabic names went with it. In places where Arabic was not a native language, something linguists call a reanalysis took place. Abdul was interpreted as an independent given name. If you don’t speak Arabic, you don’t sense that Abdul is missing anything. So you’re much more likely to meet an Abdul from Pakistan than from Saudi Arabia.
Even in countries where Abdul is used as a single given name, however, it is not used with the same frequency. Abdul as a singleton is much more common in Afghanistan and Pakistan than in Iran, India, Malaysia, and Turkey, even though all of these countries have significant Muslim populations. The frequency of Abdul as a single given name actually depends more on literacy rates than on the size of the Muslim population. The literacy rate in Turkey, for example, is 96%, while in Afghanistan it is 43%. Both countries are nearly 100% Muslim, but studying Arabic is more feasible for Turks than for Afghans. And someone who studies Arabic is more likely to realize that Abdul is really not a very good Arabic name.
In a name-matching scenario, it is usually necessary to know something about how a name is parsed into its given name and surname components. For example, you don’t want Jim Henry to match Hank James even though Jim is a common nickname for James, and Hank is a common nickname for Henry. Knowing whether Abdul is part of a larger name phrase or is being used on its own would be useful for accurately matching it to other possible representations of the same name. This is especially true when the names have been transliterated from other scripts into the Roman alphabet. The same Arabic name might be transliterated as Abdul Rahman or as Abderrehman. Comparing Abdul to the first form might be reasonable when considering a full name match; comparing Abdul to the second form makes a lot less sense.
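The Jim Henry and Hank James case can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the nickname table, the "last token is the surname" parsing rule, and the function names are all hypothetical, and no real name-matching product is implied to work this way.

```python
from typing import Tuple

# Hypothetical nickname table; a real system would use a far larger lexicon.
NICKNAMES = {
    "jim": "james",
    "hank": "henry",
}


def canonical(token: str) -> str:
    """Lowercase a given-name token and expand it if it is a known nickname."""
    token = token.lower()
    return NICKNAMES.get(token, token)


def parse(full_name: str) -> Tuple[str, str]:
    """Naive parse (an assumption): the last token is the surname,
    everything before it is the given name."""
    parts = full_name.split()
    return " ".join(parts[:-1]), parts[-1]


def bag_match(a: str, b: str) -> bool:
    """Order-blind match that ignores which token plays which role."""
    return sorted(canonical(t) for t in a.split()) == sorted(
        canonical(t) for t in b.split()
    )


def parsed_match(a: str, b: str) -> bool:
    """Match given name against given name and surname against surname.
    Nickname expansion is applied to given names only."""
    given_a, surname_a = parse(a)
    given_b, surname_b = parse(b)
    given_ok = [canonical(t) for t in given_a.split()] == [
        canonical(t) for t in given_b.split()
    ]
    return given_ok and surname_a.lower() == surname_b.lower()
```

With these definitions, `bag_match("Jim Henry", "Hank James")` wrongly reports a match, while `parsed_match` rejects that pair and still accepts `"Jim Henry"` against `"James Henry"`. The difference comes entirely from knowing which component is which.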
With named entity extraction, on the other hand, knowing ahead of time how a personal name should be parsed is not necessary to successfully find the name in unstructured text. Rosoka’s data extraction engine instead allows each lexical item to carry multiple possible meanings, called semantic vectors. The semantic vectors are not immutable, but can change as the context surrounding the name is analyzed. So Abdul can easily be recognized with a semantic vector like <name_prefix> that is able to change during processing to become <given_name>. This semantic vector approach means Abdul can be extracted correctly whether it’s found in a document in Urdu from Pakistan or a document in Arabic from Egypt.
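The idea of a label that starts as one thing and changes with context can be sketched as a toy in Python. This is not Rosoka's implementation, whose internals are not described here. The two-pass design, the tiny `DIVINE_NAMES` lexicon, and the `divine_name` and `unknown` labels are assumptions made for illustration; only `name_prefix` and `given_name` come from the text above.

```python
# Hypothetical, tiny lexicon of divine-name elements in one romanization each.
DIVINE_NAMES = {"rahman", "aziz", "hakim"}


def label_tokens(tokens):
    """Assign an initial label to each token, then revise it from context."""
    # Pass 1: initial labels. "Abdul" starts out as a name prefix.
    labels = [
        "name_prefix" if tok.lower() == "abdul" else "unknown" for tok in tokens
    ]

    # Pass 2: look at the token that follows each prefix.
    for i in range(len(tokens)):
        if labels[i] != "name_prefix":
            continue
        following = tokens[i + 1].lower() if i + 1 < len(tokens) else None
        if following in DIVINE_NAMES:
            # The prefix is completed, as in "Abdul Rahman".
            labels[i + 1] = "divine_name"
        else:
            # Nothing completes the prefix, so treat Abdul as a standalone name.
            labels[i] = "given_name"

    return list(zip(tokens, labels))
```

Under these assumptions, `label_tokens(["Abdul", "Rahman"])` keeps Abdul as a `name_prefix`, while `label_tokens(["Abdul", "Khan"])` relabels it as a `given_name`. The same token is therefore handled in both the Arabic-style and the reanalyzed usage.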