Deduplication Method for Ukrainian Last Names, Medicinal Names, and Toponyms Based on Metaphone Phonetic Algorithm

Hu, Zhengbing and Buriachok, Volodymyr and Sokolov, V. Y. (2020) Deduplication Method for Ukrainian Last Names, Medicinal Names, and Toponyms Based on Metaphone Phonetic Algorithm. Advances in Intelligent Systems and Computing (1247). pp. 518-533. ISSN 2194-5357

Text: Hu_Z_Buriachok_V_Sokolov_V_AISC_1247.pdf (93kB)
Official URL: https://link.springer.com/chapter/10.1007%2F978-3-...

Abstract

This paper aims to optimize phonetic search for fuzzy matching tasks, such as deduplicating data in databases and registers, in order to reduce the number of errors in personal data entry (for instance, in last names). An analysis of the most common last names in Ukraine shows that the majority are of Ukrainian and Russian origin (which are also reduced to the phonetic rules of the Ukrainian language). The rules for pronouncing and writing last names in Ukrainian differ fundamentally from those behind the basic algorithms for English and differ considerably from those for Russian, so a phonetic algorithm should take into account the peculiarities of how Ukrainian last names are formed. Using a phonetic algorithm gives significant advantages in search and deduplication over already known approaches: Levenshtein, Damerau-Levenshtein, Hamming, Jaro, or Jaro-Winkler distance, Q-gram index, etc. The task of searching by last name was previously formalized for English, Russian, and some other languages, but for Ukrainian this is the first such attempt. The paper presents the results of an experiment on building phonetic indices, as well as the performance gains obtained when using the generated indices. A method for tailoring the search to other domains and to several related languages, for example searching for medicines, is presented separately. Search by place names in Ukrainian and Russian was also optimized separately. Because the names of cities and streets in Ukraine are changing rapidly, the latest relevant data was collected to obtain an up-to-date list of names. Among the existing phonetic search algorithms for the Cyrillic language group, Metaphone has performed best.
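
The general idea of deduplicating through a phonetic index can be illustrated with a minimal Python sketch. It is not the Metaphone rule set from the paper: the substitution table, the dropped characters, and the function names `phonetic_key`, `build_index`, and `duplicate_candidates` are all assumptions made up for this illustration. The sketch only shows how names that sound alike can be reduced to one key and grouped.

```python
from collections import defaultdict

# Illustrative substitutions only (assumed for this sketch, NOT the paper's rules):
# merge similar-sounding vowels and voiced/unvoiced consonant pairs.
SUBSTITUTIONS = str.maketrans({
    "о": "а", "я": "а",
    "е": "и", "є": "и", "і": "и", "ї": "и", "й": "и",
    "ю": "у",
    "б": "п", "в": "ф", "г": "х", "ґ": "к", "д": "т", "ж": "ш", "з": "с",
})
# Characters that are ignored when building the key (also an assumption).
DROPPED = {"ь", "'", "’", "-", " "}


def phonetic_key(name: str) -> str:
    """Return a crude phonetic key for a Cyrillic name using the toy rules above."""
    key = []
    for ch in name.lower().translate(SUBSTITUTIONS):
        if ch in DROPPED:
            continue
        # Collapse runs of the same symbol, e.g. "нн" -> "н".
        if key and key[-1] == ch:
            continue
        key.append(ch)
    return "".join(key)


def build_index(names):
    """Group names by phonetic key (key -> list of original spellings)."""
    index = defaultdict(list)
    for name in names:
        index[phonetic_key(name)].append(name)
    return index


def duplicate_candidates(names):
    """Return the groups that share a key; these are candidates for deduplication."""
    return [group for group in build_index(names).values() if len(group) > 1]
```

With such an index, a lookup for a misspelled name is a single dictionary access by key, whereas the distance-based approaches listed in the abstract generally compare the query against each stored entry. Calling `duplicate_candidates(["Шевченко", "Шевчєнко", "Бондаренко"])` groups the two spellings of the first name together under these toy rules.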

Item Type: Article
Additional Information: DOI: 10.1007/978-3-030-55506-1_47 EID: 2-s2.0-85089719737
Keywords: Deduplication; Drug; Fuzzy coincidence; International nonproprietary name; Medication; Medicine; Metaphone; Phonetic algorithm; Phonetic rule; Toponym; Ukrainian last name; Ukrainian surname
Subjects: This is the archival subject classification of Borys Grinchenko Kyiv University > Articles in scientometric databases > Scopus
Divisions: These are the archival divisions of Borys Grinchenko Kyiv University > Faculty of Information Technologies and Mathematics > Department of Information and Cyber Security named after Professor Volodymyr Buriachok
Depositing User: Volodymyr Sokolov
Date Deposited: 31 Aug 2020 06:56
Last Modified: 31 Aug 2020 06:56
URI: https://elibrary.kubg.edu.ua/id/eprint/31677
