CARE 2.0 : reducing false-positive sequencing error corrections using machine learning

Kallenborn, Felix; Cascitti, Julian; Schmidt, Bertil

doi:http://doi.org/10.25358/openscience-8122

dc.contributor.author	Kallenborn, Felix
dc.contributor.author	Cascitti, Julian
dc.contributor.author	Schmidt, Bertil
dc.date.accessioned	2022-10-31T08:39:13Z
dc.date.available	2022-10-31T08:39:13Z
dc.date.issued	2022
dc.description.abstract	Background: Next-generation sequencing pipelines often perform error correction as a preprocessing step to obtain cleaned input data. State-of-the-art error correction programs reliably detect and correct the majority of sequencing errors. However, they also introduce new errors by making false-positive corrections. These correction mistakes can have a negative impact on downstream analysis, such as k-mer statistics, de-novo assembly, and variant calling. This motivates the need for more precise error correction tools. Results: We present CARE 2.0, a context-aware read error correction tool based on multiple sequence alignment targeting Illumina datasets. In addition to a number of newly introduced optimizations, its most significant change is the replacement of CARE 1.0’s hand-crafted correction conditions with a novel classifier based on random decision forests trained on Illumina data. This results in up to two orders of magnitude fewer false-positive corrections compared to other state-of-the-art error correction software. At the same time, CARE 2.0 achieves high numbers of true-positive corrections comparable to its competitors. On a simulated full human dataset with 914M reads, CARE 2.0 generates only 1.2M false positives (FPs) and 801.4M true positives (TPs) at a highly competitive runtime, while the best corrections achieved by other state-of-the-art tools contain at least 3.9M FPs and at most 814.5M TPs. Better de-novo assembly and improved k-mer analysis show the applicability of CARE 2.0 to real-world data. Conclusion: False-positive corrections can negatively influence downstream analysis. The precision of CARE 2.0 greatly reduces the number of those corrections compared to other state-of-the-art programs including BFC, Karect, Musket, Bcool, SGA, and Lighter. It thus produces higher-quality datasets that improve k-mer analysis and de-novo assembly on real-world data, demonstrating the applicability of machine learning techniques to sequencing read error correction. CARE 2.0 is written in C++/CUDA for Linux systems and can be run on the CPU as well as on CUDA-enabled GPUs. It is available at https://github.com/fkallen/CARE.	en_GB
dc.description.sponsorship	Funded by the Deutsche Forschungsgemeinschaft (DFG) - project number 491381577	de
dc.identifier.doi	http://doi.org/10.25358/openscience-8122
dc.identifier.uri	https://openscience.ub.uni-mainz.de/handle/20.500.12030/8137
dc.language.iso	eng	de
dc.rights	CC-BY-4.0	*
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/	*
dc.subject.ddc	004 Informatik	de_DE
dc.subject.ddc	004 Data processing	en_GB
dc.title	CARE 2.0 : reducing false-positive sequencing error corrections using machine learning	en_GB
dc.type	Journal article	de
jgu.apc.netprice	1533,99	de
jgu.apc.price	1825,45	de
jgu.apc.taxrate	19	de
jgu.apc.transformationcontract	Springer (DEAL)	de
jgu.dfg.year	2022
jgu.journal.title	BMC bioinformatics	de
jgu.journal.volume	23	de
jgu.nationalcurrency.eur	1533,99
jgu.organisation.department	FB 08 Physik, Mathematik u. Informatik	de
jgu.organisation.name	Johannes Gutenberg-Universität Mainz
jgu.organisation.number	7940
jgu.organisation.place	Mainz
jgu.organisation.ror	https://ror.org/023b0x485
jgu.pages.alternative	227	de
jgu.publisher.doi	10.1186/s12859-022-04754-3	de
jgu.publisher.issn	1471-2105	de
jgu.publisher.name	Springer Nature	de
jgu.publisher.place	London	de
jgu.publisher.year	2022
jgu.rights.accessrights	openAccess
jgu.subject.ddccode	004	de
jgu.subject.dfg	Engineering sciences	de
jgu.type.contenttype	Scientific article	de
jgu.type.dinitype	Article	en_GB
jgu.type.resource	Text	de
jgu.type.version	Published version	de
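
The counts quoted in the abstract for the simulated human dataset can be turned into a precision figure with a few lines of arithmetic. The sketch below is illustrative only: it assumes the standard definition precision = TP / (TP + FP), which may differ from the exact metric used in the paper, and the variable and function names are hypothetical. The competitor figure is a lower bound on FPs, so only a ratio of FP counts is derived from it.

```python
# Counts reported in the abstract for the simulated human dataset (in millions).
CARE_TP = 801.4      # true-positive corrections made by CARE 2.0
CARE_FP = 1.2        # false-positive corrections made by CARE 2.0
OTHER_MIN_FP = 3.9   # "at least 3.9M FPs" for the best corrections of other tools


def precision(tp: float, fp: float) -> float:
    """Fraction of applied corrections that were correct (assumed definition)."""
    return tp / (tp + fp)


care_precision = precision(CARE_TP, CARE_FP)

# Lower bound on how many times more FPs the other tools produce on this dataset.
fp_ratio = OTHER_MIN_FP / CARE_FP

print(f"CARE 2.0 precision: {care_precision:.4f}")
print(f"FP ratio (other tools / CARE 2.0), lower bound: {fp_ratio:.2f}")
```

The ratio covers this one dataset only; the "up to two orders of magnitude" claim in the abstract refers to the best case across the evaluated datasets and tools.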

Files

Original bundle

Name: care_20__reducing_falsepositi-20221020145914717.pdf
Size: 1.53 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 3.57 KB
Format: Item-specific license agreed upon to submission

Collections

DFG-491381577-G