LoRDEC: hybrid correction of long reads




Overview

In a nutshell, LoRDEC is a program that corrects errors in long sequencing reads using short reads; in other words, it implements a hybrid correction approach. It uses little memory, is very efficient and, most importantly, scales up to very large data sets. It can be applied to long reads obtained with either Pacific Biosciences SMRT (Single Molecule Real Time) sequencing or Oxford Nanopore MinION technology.

Why is the LoRDEC algorithm different?

  1. It is efficient and can process large read data sets, including those from eukaryotic or vertebrate species, on an ordinary computing server, and it even works on desktop or laptop computers.
  2. It adopts a novel graph-based approach: it builds a succinct De Bruijn Graph (DBG) representing the short reads, and seeks a corrective sequence for each erroneous region of a long read by traversing chosen paths in the graph.
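The graph mentioned in point 2 can be pictured with a toy version. The sketch below is not LoRDEC's code: it stores k-mers in a plain Python set instead of a succinct structure, it ignores reverse complements, and the names build_dbg, successors and min_abundance are hypothetical. It assumes that a k-mer is kept as a graph node only if it occurs often enough in the short reads, so that k-mers created by sequencing errors are filtered out.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurring in the short reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def build_dbg(reads, k, min_abundance=2):
    """Return the node set of a De Bruijn graph.

    Assumption: a k-mer is kept only if it is seen at least
    min_abundance times (hypothetical parameter name).
    """
    counts = count_kmers(reads, k)
    return {kmer for kmer, n in counts.items() if n >= min_abundance}

def successors(graph, kmer):
    """Edges are implicit: a successor overlaps the node by k-1 bases."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in graph]

# Toy short reads; the last one carries an error (ACGTTC instead of ACGTAC).
short_reads = ["ACGTAC", "CGTACG", "ACGTAC", "GTACGT", "ACGTTC"]
graph = build_dbg(short_reads, k=4, min_abundance=2)
next_nodes = successors(graph, "ACGT")
```

Because edges are derived from the k-mer set on demand, only the nodes need to be stored, which is one reason a De Bruijn graph can be kept compact.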

For the sake of efficiency, LoRDEC analyzes reads through their k-mers. If a k-mer is error-free, we call it solid (circles).
LoRDEC views a long read as a series of blocks of error-free k-mers (drawn as rectangles), separated by regions of erroneous k-mers (horizontal line). To bridge an erroneous region, LoRDEC finds a path in the De Bruijn graph of the short reads; such correction paths are shown as directed paths of circles outside the horizontal line.
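The view described above can be illustrated with a small sketch. It is a simplification under stated assumptions, not LoRDEC's implementation: the set of solid k-mers is given directly, only the first erroneous region flanked by solid k-mers on both sides is handled, and the first path found by a bounded breadth-first search is accepted, with no comparison between alternative paths. All function names and the max_len parameter are hypothetical.

```python
from collections import deque

def split_solid_weak(read, k, solid_kmers):
    """Flag each k-mer position of the read as solid (True) or not (False)."""
    return [read[i:i + k] in solid_kmers for i in range(len(read) - k + 1)]

def find_bridge(solid_kmers, source, target, max_len):
    """Breadth-first search for a path source -> target in the implicit graph.

    Returns the sequence spelled by the path (source plus one base per
    edge), or None if no path of at most max_len bases exists.
    """
    queue = deque([(source, source)])
    while queue:
        kmer, spelled = queue.popleft()
        if kmer == target and len(spelled) > len(source):
            return spelled
        if len(spelled) >= max_len:
            continue
        for base in "ACGT":
            nxt = kmer[1:] + base
            if nxt in solid_kmers:
                queue.append((nxt, spelled + base))
    return None

def correct_first_weak_region(read, k, solid_kmers, max_len):
    """Replace the first inner erroneous region by a path from the graph."""
    flags = split_solid_weak(read, k, solid_kmers)
    if False not in flags:
        return read                      # nothing to correct
    first_weak = flags.index(False)
    if first_weak == 0 or True not in flags[first_weak:]:
        return read                      # no solid k-mer on one side
    left = first_weak - 1                # last solid k-mer before the region
    right = flags.index(True, first_weak)  # first solid k-mer after it
    bridge = find_bridge(solid_kmers, read[left:left + k],
                         read[right:right + k], max_len)
    if bridge is None:
        return read
    return read[:left] + bridge + read[right + k:]

# Toy data: solid k-mers of the true sequence, and a long read with one error.
truth = "ATGCGTACCTGA"
k = 4
solid = {truth[i:i + k] for i in range(len(truth) - k + 1)}
long_read = "ATGCGAACCTGA"               # T replaced by A at position 5
flags = split_solid_weak(long_read, k, solid)
corrected = correct_first_weak_region(long_read, k, solid, max_len=12)
```

In this toy case, one substitution makes four consecutive k-mers non-solid, and the path between the two flanking solid k-mers spells the original sequence.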

More information

LoRDEC is bioinformatics software that was published and initially developed in collaboration with Leena Salmela of the University of Helsinki (Finland).
LoRDEC processes data coming from high-throughput sequencing machines of the second and third generations. These data are called sequencing reads, or simply reads. Technically speaking, it uses both short reads and long reads to correct errors in the long reads.

Third generation DNA sequencing technologies yield long but error-prone sequencing reads, while second generation technologies yield short reads with a low error rate. Hence the need to correct the long reads.

LoRDEC was accepted into the Elixir Service Delivery Plan in 2019; since then, it has been registered in the ELIXIR bio.tools database and documented following the EDAM ontology.

LoRDEC has been used in large genome and transcriptome projects by numerous genomics institutes around the world. Information about the first hundred such projects can be found at https://www.lirmm.fr/~rivals/lordec/.

High Performance Computing

We also provide scripts for running LoRDEC in parallel on High Performance Computing servers. If you want to correct long reads but lack short reads, you can look at LoRMA, which is a non-hybrid error correction tool for long reads.
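One common way to parallelise read correction on a cluster is to split the long reads into batches and process each batch as a separate job. The sketch below only illustrates that batching step; it does not describe the scripts provided with LoRDEC, whose actual strategy is not detailed here, and the function names and batch_size parameter are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def split_into_batches(records, batch_size):
    """Group records into lists of at most batch_size items."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Toy input: three long reads, one of them wrapped over two lines.
fasta_lines = [">read1", "ACGTAC", "GGTT", ">read2", "TTGACA", ">read3", "CCAT"]
batches = list(split_into_batches(read_fasta(fasta_lines), batch_size=2))
```

Each batch could then be written to its own file and submitted as an independent job, with the corrected outputs concatenated afterwards.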

Evaluation of the Parallel efficiency of LoRDEC

NB: Elixir is a European organisation that coordinates life science resources from across Europe, and the Institut Français de Bioinformatique is the French node of Elixir.

Funding

This work was supported by


dipwmsearch

Protein binding sites in DNA or RNA sequences are modeled by probabilistic motifs. A Position Weight Matrix (PWM) is a simple, powerful, and widely used representation of such motifs. Because PWMs assume that sequence positions are independent of each other (which is too restrictive for some binding or interaction sites), a generalisation of PWMs, termed di-nucleotidic…

Bioinformatics Biology Nucleic acid sites, features and motifs Protein sites, features and motifs Sequence analysis Sequence motif recognition Sequence similarity search Sequence motif FASTA
PhyML 3.0

Overview: new algorithms, methods and utilities PhyML is a software package that uses modern statistical approaches to build phylogenetic trees from the analysis of alignments of nucleotide or amino acid sequences. The main tool in this package builds phylogenies under the maximum likelihood criterion. It implements a large number of substitution models coupled to efficient…

Phylogenetics Phylogenomics Phylogenetic inference (AI methods) FASTA PHYLIP format
DExTER

Overview DExTER (Domain Exploration To Explain gene Regulation) is a bioinformatics tool designed to automatically identify genomic regions whose nucleotide composition correlates with gene expression levels. Unlike traditional approaches focusing on short transcription factor binding sites (6-12 bp), DExTER detects Long Regulatory Elements (LREs) that can span tens to hundreds of nucleotides. This makes it…

Gene expression Gene regulation Sequence analysis Expression correlation analysis Regression analysis Sequence analysis Sequence motif discovery Gene expression matrix Nucleotide code Sequence motif (nucleic acid) CSV FASTA TSV