Using machine learning to link electronic health records in cancer registries : on the tradeoff between linkage quality and manual effort

dc.contributor.author: Röchner, Philipp
dc.contributor.author: Rothlauf, Franz
dc.date.accessioned: 2025-07-28T06:01:34Z
dc.date.available: 2025-07-28T06:01:34Z
dc.date.issued: 2024
dc.description.abstract: Background: Cancer registries link large numbers of electronic health records reported by medical institutions to the already registered records of the matching individual and tumor. Records are linked automatically using deterministic and probabilistic approaches; machine learning is rarely used. Records that cannot be matched automatically with sufficient accuracy are typically processed manually. In practice, it is important to know how well record linkage approaches match real-world records and how much manual effort is required to achieve the desired linkage quality. We study the task of linking reported records to the matching registered tumor in cancer registries.
Methods: We compare the tradeoff between linkage quality and manual effort of five machine learning methods (logistic regression, random forest, gradient boosting, neural network, and a stacked method) to a deterministic baseline. The record linkage methods are compared in a two-class setting (no-match/match) and a three-class setting (no-match/undecided/match). The dataset, collected and linked by a cancer registry, consists of categorical variables and matches 145,755 reported records with 33,289 registered tumors.
Results: In the two-class setting, the gradient boosting, neural network, and stacked models have higher accuracy and F1 score (accuracy: 0.968 to 0.978, F1 score: 0.983 to 0.988) than the deterministic baseline (accuracy: 0.964, F1 score: 0.980) when the same records are manually processed (0.89% of all records). In the three-class setting, these three machine learning methods can automatically process all reported records and still have higher accuracy and F1 score than the deterministic baseline. The linkage quality of the machine learning methods studied, except for the neural network, increases as the number of manually processed records increases.
Conclusion: Machine learning methods can significantly improve linkage quality and reduce the manual effort required by medical coders to match tumor records in cancer registries compared to a deterministic baseline. Our results help cancer registries estimate how linkage quality increases as more records are manually processed. [en]
dc.identifier.doi: https://doi.org/10.25358/openscience-12880
dc.identifier.uri: https://openscience.ub.uni-mainz.de/handle/20.500.12030/12901
dc.language.iso: eng
dc.rights: CC-BY-4.0
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject.ddc: 004 Informatik [de]
dc.subject.ddc: 004 Data processing [en]
dc.subject.ddc: 610 Medizin [de]
dc.subject.ddc: 610 Medical sciences [en]
dc.title: Using machine learning to link electronic health records in cancer registries : on the tradeoff between linkage quality and manual effort [en]
dc.type: Journal article (Zeitschriftenaufsatz)
jgu.journal.issue: 185
jgu.journal.title: International journal of medical informatics
jgu.organisation.department: FB 03 Rechts- und Wirtschaftswissenschaften
jgu.organisation.name: Johannes Gutenberg-Universität Mainz
jgu.organisation.number: 2300
jgu.organisation.place: Mainz
jgu.organisation.ror: https://ror.org/023b0x485
jgu.pages.alternative: 105387
jgu.publisher.doi: 10.1016/j.ijmedinf.2024.105387
jgu.publisher.eissn: 1872-8243
jgu.publisher.year: 2024
jgu.rights.accessrights: openAccess
jgu.subject.ddccode: 004
jgu.subject.ddccode: 610
jgu.subject.dfg: Humanities and social sciences (Geistes- und Sozialwissenschaften)
jgu.type.contenttype: Scientific article
jgu.type.dinitype: Article [en_GB]
jgu.type.resource: Text
jgu.type.version: Published version
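
Illustration of the three-class setting

The abstract contrasts a two-class setting (no-match/match) with a three-class setting (no-match/undecided/match), where undecided records go to manual processing. The following is a minimal sketch of that idea, not the authors' implementation. It assumes that a model outputs a match probability per record pair, that two hypothetical thresholds `lower` and `upper` define the undecided band, and that manual processing always yields the correct label. The function names and the toy scores are illustrative and are not taken from the paper.

```python
from typing import List, Tuple

NO_MATCH, UNDECIDED, MATCH = "no-match", "undecided", "match"


def classify(score: float, lower: float, upper: float) -> str:
    """Map a match probability to one of three classes (hypothetical thresholds)."""
    if score >= upper:
        return MATCH
    if score <= lower:
        return NO_MATCH
    return UNDECIDED


def tradeoff(scores: List[float], truth: List[bool],
             lower: float, upper: float) -> Tuple[float, float]:
    """Return (manual_fraction, accuracy) for the given thresholds.

    Assumption: records labelled undecided are processed manually and
    manual processing is always correct.
    """
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    labels = [classify(s, lower, upper) for s in scores]
    manual = sum(1 for label in labels if label == UNDECIDED)
    correct = 0
    for label, is_match in zip(labels, truth):
        if label == UNDECIDED or (label == MATCH) == is_match:
            correct += 1
    n = len(scores)
    return manual / n, correct / n


# Toy values for illustration only; they are not data from the study.
scores = [0.95, 0.10, 0.60, 0.40, 0.80]
truth = [True, False, False, True, True]

two_class = tradeoff(scores, truth, lower=0.5, upper=0.5)    # no undecided band
three_class = tradeoff(scores, truth, lower=0.3, upper=0.7)  # band sent to manual review
```

Setting `lower` equal to `upper` recovers the two-class setting. Widening the band sends more records to manual processing, which under the stated assumption can only keep or raise accuracy; this is the tradeoff between linkage quality and manual effort that the abstract refers to.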

Files

Original bundle

Name: using_machine_learning_to_lin-20250728080134074609.pdf
Size: 1.03 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 5.1 KB
Format: Item-specific license agreed upon to submission