CARE 2.0 : reducing false-positive sequencing error corrections using machine learning
dc.contributor.author | Kallenborn, Felix | |
dc.contributor.author | Cascitti, Julian | |
dc.contributor.author | Schmidt, Bertil | |
dc.date.accessioned | 2022-10-31T08:39:13Z | |
dc.date.available | 2022-10-31T08:39:13Z | |
dc.date.issued | 2022 | |
dc.description.abstract | Background: Next-generation sequencing pipelines often perform error correction as a preprocessing step to obtain cleaned input data. State-of-the-art error correction programs reliably detect and correct the majority of sequencing errors. However, they also introduce new errors by making false-positive corrections. These correction mistakes can have a negative impact on downstream analysis, such as k-mer statistics, de-novo assembly, and variant calling. This motivates the need for more precise error correction tools. Results: We present CARE 2.0, a context-aware read error correction tool based on multiple sequence alignment targeting Illumina datasets. In addition to a number of newly introduced optimizations, its most significant change is the replacement of CARE 1.0’s hand-crafted correction conditions with a novel classifier based on random decision forests trained on Illumina data. This results in up to two orders of magnitude fewer false-positive corrections compared to other state-of-the-art error correction software. At the same time, CARE 2.0 achieves high numbers of true-positive corrections comparable to its competitors. On a simulated full human dataset with 914M reads, CARE 2.0 generates only 1.2M false positives (FPs) and 801.4M true positives (TPs) at a highly competitive runtime, while the best corrections achieved by other state-of-the-art tools contain at least 3.9M FPs and at most 814.5M TPs. Better de-novo assembly and improved k-mer analysis show the applicability of CARE 2.0 to real-world data. Conclusion: False-positive corrections can negatively influence downstream analysis. The precision of CARE 2.0 greatly reduces the number of those corrections compared to other state-of-the-art programs including BFC, Karect, Musket, Bcool, SGA, and Lighter. Thus, higher-quality datasets are produced, which improve k-mer analysis and de-novo assembly on real-world datasets. This demonstrates the applicability of machine learning techniques in the context of sequencing read error correction. CARE 2.0 is written in C++/CUDA for Linux systems and can be run on the CPU as well as on CUDA-enabled GPUs. It is available at https://github.com/fkallen/CARE. | en_GB |
dc.description.sponsorship | Funded by the Deutsche Forschungsgemeinschaft (DFG) - project number 491381577 | de |
dc.identifier.doi | http://doi.org/10.25358/openscience-8122 | |
dc.identifier.uri | https://openscience.ub.uni-mainz.de/handle/20.500.12030/8137 | |
dc.language.iso | eng | de |
dc.rights | CC-BY-4.0 | * |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | * |
dc.subject.ddc | 004 Computer science | de_DE |
dc.subject.ddc | 004 Data processing | en_GB |
dc.title | CARE 2.0 : reducing false-positive sequencing error corrections using machine learning | en_GB |
dc.type | Journal article | de |
jgu.journal.title | BMC bioinformatics | de |
jgu.journal.volume | 23 | de |
jgu.organisation.department | FB 08 Physics, Mathematics and Computer Science | de |
jgu.organisation.name | Johannes Gutenberg-Universität Mainz | |
jgu.organisation.number | 7940 | |
jgu.organisation.place | Mainz | |
jgu.organisation.ror | https://ror.org/023b0x485 | |
jgu.pages.alternative | 227 | de |
jgu.publisher.doi | 10.1186/s12859-022-04754-3 | de |
jgu.publisher.issn | 1471-2105 | de |
jgu.publisher.name | Springer Nature | de |
jgu.publisher.place | London | de |
jgu.publisher.year | 2022 | |
jgu.rights.accessrights | openAccess | |
jgu.subject.ddccode | 004 | de |
jgu.subject.dfg | Engineering sciences | de |
jgu.type.contenttype | Scientific article | de |
jgu.type.dinitype | Article | en_GB |
jgu.type.resource | Text | de |
jgu.type.version | Published version | de |