Wang et al. (2022) describe the frustration that arises when real-world performance does not match the expectations set by benchmark datasets. This difference is the “reality-ideality” gap, which is all too common in real-world applications of entity resolution.
Why does this happen? The authors posit that three main issues limit the generalizability of current benchmarks, specifically in the context of deep learning approaches to entity resolution:
1. 𝐓𝐡𝐞𝐫𝐞 𝐢𝐬 𝐥𝐞𝐚𝐤𝐚𝐠𝐞 𝐟𝐫𝐨𝐦 𝐭𝐡𝐞 𝐭𝐫𝐚𝐢𝐧𝐢𝐧𝐠 𝐬𝐞𝐭 𝐢𝐧𝐭𝐨 𝐭𝐡𝐞 𝐭𝐞𝐬𝐭 𝐬𝐞𝐭. In typical benchmark construction, record pairs are sampled at random, so the same entity cluster can appear in both the training and the test set. This biases results, especially for deep learning approaches that rely on learned record embeddings (see the sketch after this list).
2. 𝐑𝐞𝐚𝐥-𝐰𝐨𝐫𝐥𝐝 𝐝𝐚𝐭𝐚 𝐢𝐬 𝐦𝐮𝐜𝐡 𝐦𝐨𝐫𝐞 𝐢𝐦𝐛𝐚𝐥𝐚𝐧𝐜𝐞𝐝 𝐭𝐡𝐚𝐧 𝐛𝐞𝐧𝐜𝐡𝐦𝐚𝐫𝐤 𝐝𝐚𝐭𝐚𝐬𝐞𝐭𝐬 in terms of matching vs. non-matching record pairs. Non-matches vastly outnumber matches in real data, so there is far more opportunity for error than a benchmark dataset suggests.
3. Partly as a consequence of the two issues above, 𝐭𝐲𝐩𝐢𝐜𝐚𝐥 𝐛𝐞𝐧𝐜𝐡𝐦𝐚𝐫𝐤𝐬 𝐮𝐧𝐝𝐞𝐫𝐞𝐬𝐭𝐢𝐦𝐚𝐭𝐞 𝐭𝐡𝐞 𝐢𝐦𝐩𝐨𝐫𝐭𝐚𝐧𝐜𝐞 𝐨𝐟 𝐚𝐝𝐝𝐢𝐭𝐢𝐨𝐧𝐚𝐥 𝐟𝐞𝐚𝐭𝐮𝐫𝐞𝐬 𝐚𝐧𝐝 𝐦𝐮𝐥𝐭𝐢𝐦𝐨𝐝𝐚𝐥 𝐢𝐧𝐟𝐨𝐫𝐦𝐚𝐭𝐢𝐨𝐧. This leads to under-specified systems that do not perform as well as they could.
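To make the leakage issue concrete, here is a minimal sketch (not the authors' code) of a cluster-aware split: whole ground-truth entity clusters, rather than individual record pairs, are assigned to train or test, so no cluster contributes pairs to both sides. The pair structure, the field names `left_cluster`, `right_cluster`, and `label`, and the split ratio are illustrative assumptions.

```python
import random


def cluster_aware_split(pairs, test_fraction=0.2, seed=42):
    """Split labelled record pairs so that no ground-truth entity cluster
    contributes pairs to both the training and the test set.

    `pairs` is assumed to be a list of dicts like
    {"left_cluster": ..., "right_cluster": ..., "label": 0 or 1, ...};
    the field names are illustrative, not the paper's.
    """
    # Collect every cluster id that appears in the labelled pairs.
    cluster_ids = sorted({c for p in pairs
                          for c in (p["left_cluster"], p["right_cluster"])})

    # Randomly assign whole clusters (not individual pairs) to the test side.
    random.Random(seed).shuffle(cluster_ids)
    n_test = int(len(cluster_ids) * test_fraction)
    test_clusters = set(cluster_ids[:n_test])

    train, test = [], []
    for p in pairs:
        sides = {p["left_cluster"] in test_clusters,
                 p["right_cluster"] in test_clusters}
        if sides == {True}:
            test.append(p)      # both clusters on the test side
        elif sides == {False}:
            train.append(p)     # both clusters on the train side
        # Pairs that straddle the split are dropped rather than leaked.
    return train, test
```

A test set built this way also tends to reflect the second issue: because candidate pairs are dominated by non-matches, it will be far more imbalanced than a benchmark whose pairs were sampled to be roughly balanced.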
The paper goes on to define clear tasks for entity resolution systems and detail issues with current benchmarks:
“Our findings reveal that previous benchmarks biased the evaluation of the progress of current entity matching approaches, and there is still a long way to go to build effective entity matchers.”