Fuzzy Matching Strategy Based on Specific Fields for Modified Data Leak Validation: An Experimental Study of Many-to-One Matching

Fadlilah Izzatus Sabila
Catur Adi Nugroho
Fauziah Fauziah

Abstract

Personal data leakage is an increasingly serious problem, especially when leaked records have been partially modified to evade direct matching against the original source. This study develops a fuzzy matching approach that pairs each attribute with a suitable algorithm (field-algorithm pairing) and applies a relevance-based weighting scheme, supporting many-to-one matching between leaked records and the original database. Four algorithms are used: Levenshtein, Jaro-Winkler, Token Sort Ratio, and Cosine Similarity, each selected according to the semantic characteristics of the attribute. Experiments were conducted on 10,000 synthetic records under several modification scenarios: clean data, light modification, and heavy modification. Results showed high performance on clean and lightly modified data (F1-score 0.90–1.00) but a significant drop under heavy modification (F1-score 0.10–0.45). The approach offers a lightweight yet effective solution for the early stages of identity verification in data leak investigations and opens opportunities for further development through algorithm combinations and adaptive matching thresholds.
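
The field-algorithm pairing and weighted many-to-one matching described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state which algorithm is assigned to which field, what the weights are, or what threshold is used, so the field names, the pairing in FIELD_PLAN, the weights, and the 0.8 threshold below are all hypothetical. Token Sort Ratio is approximated here by a normalized Levenshtein similarity over sorted tokens, and Cosine Similarity is computed over character bigrams; the authors' implementation may differ on both points.

```python
from collections import Counter
from math import sqrt

def levenshtein_sim(a: str, b: str) -> float:
    """Normalized Levenshtein similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))

def jaro_winkler_sim(a: str, b: str, p: float = 0.1) -> float:
    """Jaro similarity with the Winkler common-prefix boost (prefix capped at 4)."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit, b_hit = [False] * len(a), [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ca:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) / 2
    jaro = (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return jaro + prefix * p * (1 - jaro)

def token_sort_sim(a: str, b: str) -> float:
    """Order-insensitive similarity: sort the tokens, then compare the strings."""
    return levenshtein_sim(" ".join(sorted(a.split())), " ".join(sorted(b.split())))

def cosine_sim(a: str, b: str, n: int = 2) -> float:
    """Cosine similarity between character n-gram count vectors."""
    if a == b:
        return 1.0
    va = Counter(a[i:i + n] for i in range(len(a) - n + 1))
    vb = Counter(b[i:i + n] for i in range(len(b) - n + 1))
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

# Hypothetical field -> (algorithm, weight) pairing; the weights sum to 1.
FIELD_PLAN = {
    "name": (token_sort_sim, 0.4),
    "email": (levenshtein_sim, 0.3),
    "phone": (jaro_winkler_sim, 0.1),
    "address": (cosine_sim, 0.2),
}

def record_score(leaked: dict, original: dict) -> float:
    """Weighted sum of per-field similarities on lower-cased, stripped values."""
    return sum(
        weight * sim(leaked[field].strip().lower(), original[field].strip().lower())
        for field, (sim, weight) in FIELD_PLAN.items()
    )

def match_many_to_one(leaked_rows, originals, threshold=0.8):
    """Map each leaked row to the id of its best original, or None below threshold."""
    result = []
    for row in leaked_rows:
        best = max(originals, key=lambda orig: record_score(row, orig))
        result.append(best["id"] if record_score(row, best) >= threshold else None)
    return result

# Made-up records for illustration only.
originals = [
    {"id": 1, "name": "Budi Santoso", "email": "budi.santoso@example.com",
     "phone": "081234567890", "address": "Jl. Merdeka 10 Jakarta"},
    {"id": 2, "name": "Siti Aminah", "email": "siti.aminah@example.com",
     "phone": "089876543210", "address": "Jl. Sudirman 5 Bandung"},
]
leaked = [
    {"name": "Santoso Budi", "email": "budi.santoso@example.com",
     "phone": "081234567890", "address": "Jl Merdeka 10 Jakarta"},
    {"name": "Budi Santosa", "email": "budi_santoso@example.com",
     "phone": "081234567891", "address": "Jl. Merdeka 10, Jakarta"},
    {"name": "Siti Aminah", "email": "siti.aminah@example.com",
     "phone": "089876543210", "address": "Jl. Sudirman 5 Bandung"},
    {"name": "B. S.", "email": "bs99@mail.test", "phone": "0000", "address": "unknown"},
]
matches = match_many_to_one(leaked, originals)
print(matches)
```

In this sketch several leaked rows may resolve to the same original id, which is the many-to-one case, while a heavily altered row that scores below the threshold is left unmatched. That behaviour is in line with the drop in F1-score the abstract reports for heavy modification.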

Article Details

How to Cite
Sabila, F. I., Nugroho, C. A., & Fauziah, F. (2026). Fuzzy Matching Strategy Based on Specific Fields for Modified Data Leak Validation: An Experimental Study of Many-to-One Matching. Jurnal Ilmiah Giga, 28(2), 68–76. https://doi.org/10.47313/jig.v28i2.4264
