Fuzzy Matching Strategy Based on Specific Fields for Modified Data Leak Validation: An Experimental Study of Many-to-One Matching
Abstract
Personal data leakage is an increasingly serious problem, especially when the leaked data has been partially modified to evade direct matching against the original source. This study develops a fuzzy matching approach that pairs each attribute with a suitable similarity algorithm (field-algorithm pairing) and weights each attribute by its relevance, in order to support many-to-one matching between leaked records and the original database. Four algorithms are used: Levenshtein, Jaro-Winkler, Token Sort Ratio, and Cosine Similarity, each selected according to the semantic characteristics of the attribute it is applied to. Experiments were conducted on 10,000 synthetic records under various modification scenarios, including clean data, light modification, and heavy modification. Results showed high performance on clean and lightly modified data (F1-score 0.90–1.00), but a significant drop under heavy modification (F1-score 0.10–0.45). The approach offers a lightweight yet effective solution for the early stages of identity verification in data leak investigations, and opens up opportunities for further development through combinations of algorithms and adaptive adjustment of matching thresholds.
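The field-algorithm pairing and relevance weighting described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the study: the field names (`name`, `email`, `address`), which algorithm is assigned to which field, the weights, the 0.8 threshold, and the use of character bigrams for cosine similarity. Jaro-Winkler is left out to keep the example short.

```python
from collections import Counter
from math import sqrt


def levenshtein_sim(a: str, b: str) -> float:
    """Normalized Levenshtein similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            # deletion, insertion, substitution (cost 0 if characters match)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))


def token_sort_sim(a: str, b: str) -> float:
    """Lowercase, sort whitespace tokens, then apply Levenshtein similarity."""
    def canon(s: str) -> str:
        return " ".join(sorted(s.lower().split()))
    return levenshtein_sim(canon(a), canon(b))


def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity over character-bigram counts (an assumed vectorization)."""
    def grams(s: str) -> Counter:
        return Counter(s[i:i + 2] for i in range(len(s) - 1))
    va, vb = grams(a.lower()), grams(b.lower())
    dot = sum(va[g] * vb[g] for g in va)
    denom = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / denom if denom else 0.0


# Hypothetical pairing: field -> (similarity function, relevance weight).
# Weights sum to 1 so the record score stays in [0, 1].
FIELD_CONFIG = {
    "name": (token_sort_sim, 0.4),
    "email": (levenshtein_sim, 0.4),
    "address": (cosine_sim, 0.2),
}


def record_score(leaked: dict, original: dict) -> float:
    """Weighted sum of per-field similarities."""
    return sum(
        weight * sim(leaked.get(field, ""), original.get(field, ""))
        for field, (sim, weight) in FIELD_CONFIG.items()
    )


def best_match(leaked: dict, originals: list, threshold: float = 0.8):
    """Return (index, score) of the best original, or (None, score) below threshold."""
    scores = [record_score(leaked, o) for o in originals]
    idx = max(range(len(scores)), key=scores.__getitem__)
    return (idx, scores[idx]) if scores[idx] >= threshold else (None, scores[idx])


originals = [
    {"name": "Budi Santoso", "email": "budi.santoso@example.com", "address": "Jl. Merdeka 10 Jakarta"},
    {"name": "Siti Aminah", "email": "siti.aminah@example.com", "address": "Jl. Sudirman 5 Bandung"},
]
# Two modified leaked variants of the same person (the many-to-one case).
leaked_records = [
    {"name": "Santoso Budi", "email": "budi.santos0@example.com", "address": "Jl Merdeka 10, Jakarta"},
    {"name": "budi santoso", "email": "budi.santoso@example.com", "address": "Jl. Merdeka 10 Jakarta"},
]
matches = [best_match(rec, originals) for rec in leaked_records]
```

In this sketch both leaked variants resolve to the same original record, which is the many-to-one situation the study targets. Each leaked record is compared against every original, so a real run over 10,000 records would likely need a blocking or indexing step that is not shown here.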
Article Details

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.