SamTools
SamTools stats
Figure 1 : values of multiple samples explained by different metrics
source : nf-core RNAseq MultiQC
In Figure 1, the x-axis shows the count for each metric, expressed in millions of reads or millions of bases (nucleotides).
To understand these metrics, it is essential to understand the concept of a flag. Flags are annotations stored with each record in SAM or BAM files; they indicate the alignment status of the read. Each flag is a single integer in which every bit has its own meaning. For a deeper understanding, I recommend using a dedicated website where you can become more familiar with the different flags: https://broadinstitute.github.io/picard/explain-flags.html
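As a rough illustration of how a flag value is decoded, here is a minimal Python sketch. The bit values come from the SAM format specification; the names `SAM_FLAGS`, `explain_flag` and `is_mapped_and_paired` are hypothetical helpers written for this example and are not part of SamTools or MultiQC.

```python
# Bit values defined by the SAM format specification.
SAM_FLAGS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails platform/vendor quality checks",
    0x400: "read is PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def explain_flag(flag: int) -> list[str]:
    """Return the description of every bit that is set in a SAM flag."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def is_mapped_and_paired(flag: int) -> bool:
    """Paired read (0x1) where neither the read (0x4) nor its mate (0x8) is unmapped."""
    return bool(flag & 0x1) and not flag & 0x4 and not flag & 0x8


if __name__ == "__main__":
    # 99 = 0x1 + 0x2 + 0x20 + 0x40
    for description in explain_flag(99):
        print(description)
    print(is_mapped_and_paired(99))
```

A flag of 99, for instance, has the bits 0x1, 0x2, 0x20 and 0x40 set, so the read is paired, mapped in a proper pair, first in its pair, and its mate is on the reverse strand.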
Metrics:
Total sequences : this is the total number of reads for each sample.
Mapped & paired : these are paired reads for which both the read and its mate are mapped. These reads have the 0x1 flag (read paired), but neither the 0x4 (read unmapped) nor the 0x8 (mate unmapped) flag.
Mapped in proper pair : these are reads with both the 0x1 (read paired) and 0x2 (read mapped in proper pair) flags, meaning that the aligner considers the two mates to be correctly mapped relative to each other.
Duplicated : duplicated reads are reads that have the same sequence and align to the same position. For RNAseq data, they can arise during PCR amplification, but also from highly expressed genes, so the values can be very high. Duplicates have the 0x400 flag (PCR or optical duplicate).
QC Failed : these are the reads that did not pass quality control. They have the 0x200 flag (read failing platform/vendor quality checks). For RNAseq data, the quality test was performed by FastQC; for SAREK, it is performed directly during the BWA-mem alignment.
Reads MQ0 : these are the reads with a mapping quality (MAPQ) of 0. There are three possible origins for this value: the read is not part of the reference (because of contamination or a poorly characterized reference genome); the alignment program missed the true position (because its heuristic keeps the first acceptable position without testing every candidate); or the read cannot be placed reliably (because of the high proportion of repeated sequences in the reference genome and in the organism studied).
Mapped bases (CIGAR) : this is the number of mapped bases, counted from the CIGAR strings. A CIGAR string describes an alignment as a series of lengths, each followed by the type of operation it applies to. Bases are counted for alignment matches (M), insertions (I), sequence matches (=) and sequence mismatches (X), but not for deletions (D), skipped regions (N), soft clipping (S) or hard clipping (H).
Bases Trimmed : this is the number of bases trimmed during alignment. For SAREK, no bases are trimmed in the pipeline; BWA only performs soft or hard clipping. For RNAseq data, the STAR aligner performs soft clipping, whereas HISAT2 can trim the 3’ and 5’ ends. There should therefore be no trimmed bases when the STAR or BWA-mem aligners are used.
Duplicated bases : this is the total number of bases belonging to reads flagged as duplicates (0x400).
Different Chromosomes : this is the number of read pairs whose two mates are mapped to different chromosomes.
Other orientation : these are the read pairs that are neither inward nor outward (see below).
Inward pairs : inward pairs are read pairs in which the 3’ end of each read points towards its mate on the other strand.
Figure 2 : inward pairs.
source : Illumina Paired End Libraries - Inward and Outwardly Directed Reads
Outward pairs : outward pairs are read pairs in which the 3’ end of each read points away from its mate on the other strand.
Figure 3 : outward pairs.
source : Illumina Paired End Libraries - Inward and Outwardly Directed Reads
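Two of the metrics above can be made more concrete with a short Python sketch: counting mapped bases from a CIGAR string, and classifying a pair as inward, outward or other. This is a simplified illustration under stated assumptions, not the actual SamTools implementation. The function names `mapped_bases` and `pair_orientation` are hypothetical, both mates are assumed to be mapped to the same chromosome, and ties (both mates starting at the same position) are resolved arbitrarily in favour of the read being examined.

```python
import re

# One CIGAR element: a length followed by an operation letter.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations whose bases count towards "Mapped bases (CIGAR)".
COUNTED_OPERATIONS = {"M", "I", "=", "X"}


def mapped_bases(cigar: str) -> int:
    """Sum the lengths of M, I, = and X operations in a CIGAR string."""
    return sum(
        int(length)
        for length, operation in CIGAR_ELEMENT.findall(cigar)
        if operation in COUNTED_OPERATIONS
    )


def pair_orientation(pos: int, reverse: bool, mate_pos: int, mate_reverse: bool) -> str:
    """Classify a pair from the positions and strands of its two mates.

    Assumption: both mates are mapped to the same chromosome.
    """
    # Both mates on the same strand: neither inward nor outward.
    if reverse == mate_reverse:
        return "other"
    # Find the strand of the leftmost mate on the reference.
    leftmost_is_reverse = reverse if pos <= mate_pos else mate_reverse
    # Leftmost mate on the forward strand: the 3' ends point at each other.
    return "outward" if leftmost_is_reverse else "inward"


if __name__ == "__main__":
    print(mapped_bases("5S90M2I3M10D"))              # 90 + 2 + 3 = 95
    print(pair_orientation(100, False, 300, True))   # inward
    print(pair_orientation(100, True, 300, False))   # outward
```

In the first call, the 5 soft-clipped bases and the 10 deleted bases are ignored, which leaves 95 mapped bases.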