RSeQC
Read duplication
RSeQC's read_duplication.py measures duplication in two ways: it counts reads that share the same sequence, and it counts reads that align to the same location on the reference genome.
In the nf-core RNAseq pipeline, the reported measurement is the alignment-based one, so it depends on the reference genome.
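The difference between the two counting strategies can be illustrated with a minimal sketch. This is not the RSeQC implementation: the function name `occurrence_histogram` and the toy reads are hypothetical, and a "location" is simplified here to a (chromosome, start) pair.

```python
from collections import Counter


def occurrence_histogram(keys):
    """Map each occurrence count to the number of distinct keys seen that many times."""
    per_key = Counter(keys)                 # how often each key appears
    histogram = Counter(per_key.values())   # how many keys share each count
    return dict(sorted(histogram.items()))


# Toy reads (made-up values): (sequence, chromosome, start position)
reads = [
    ("ACGT", "chr1", 100),
    ("ACGT", "chr1", 100),
    ("ACGA", "chr1", 100),
    ("TTGC", "chr2", 500),
    ("TTGC", "chr3", 900),
]

# Sequence-based: reads are duplicates when their sequences are identical.
seq_hist = occurrence_histogram(seq for seq, _, _ in reads)

# Alignment-based: reads are duplicates when they map to the same location.
pos_hist = occurrence_histogram((chrom, pos) for _, chrom, pos in reads)

print("sequence-based: ", seq_hist)
print("alignment-based:", pos_hist)
```

In this toy example the two strategies disagree: three reads share the position chr1:100 although only two of them have the same sequence, which is why the choice of method affects the curve.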
Figure 1: Read duplication, based on alignment, for each sample.
Source: nf-core RNAseq MultiQC
The x-axis shows the occurrence count, that is, the number of times reads are found at the same aligned position.
The y-axis shows the number of reads on a log10 scale, which means that each step up the axis corresponds to a tenfold increase. Therefore, the visual distance between the largest and smallest values understates the real difference. For example, in the graph above, we observe 60,000,000 reads with one occurrence compared to 1,000 reads with 100 occurrences: the difference is enormous.
If little duplication is expected, the curve should be almost a right angle, with a very large value at occurrence 1 and values that drop sharply just after. In the graph shown, however, we observe many duplicates, especially for occurrences from 2 to 50. This is typical of RNAseq data, where duplicates are expected because certain genes are very highly expressed.
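To make the effect of the log10 axis concrete, the short calculation below uses the two values read from the graph (60,000,000 and 1,000). The variable names are illustrative only.

```python
import math

# Values read from the example graph
reads_at_occurrence_1 = 60_000_000
reads_at_occurrence_100 = 1_000

# Real difference between the two values
ratio = reads_at_occurrence_1 / reads_at_occurrence_100

# Distance between the two points as drawn on a log10 y-axis
gap_on_axis = math.log10(reads_at_occurrence_1) - math.log10(reads_at_occurrence_100)

print(f"ratio: {ratio:,.0f}x")                        # ratio: 60,000x
print(f"gap on the log10 axis: {gap_on_axis:.2f} units")  # gap on the log10 axis: 4.78 units
```

A 60,000-fold difference therefore appears as fewer than five axis units, which is why points that look close together on the plot can differ by several orders of magnitude.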