Exercice 5 : préprocessing et calling - GATK
Charger le module GATK.
Solutionsearch_module gatk module load bioinfo/GATK/4.4.0.0 module load devel/python/Python-3.11.1 devel/java/17.0.6
Marquer les duplicats (GATK MarkDuplicates).
SolutionBash
for bam in *.bam; do echo "gatk --java-options \"-Xms4000m -Xmx7g\" MarkDuplicates --INPUT $bam --METRICS_FILE ${bam}.metrics --TMP_DIR . --ASSUME_SORT_ORDER coordinate --CREATE_INDEX true --OUTPUT ${bam%.bam}.md.bam"; done > 4_md.jobs
Perl
\ls *.bam | perl -lne '$out=$_; $out=~s/\.bam//; print "gatk --java-options \"-Xms4000m -Xmx7g\" MarkDuplicates --INPUT $_ --METRICS_FILE $_.metrics --TMP_DIR . --ASSUME_SORT_ORDER coordinate --CREATE_INDEX true --OUTPUT $out.md.bam"' > 4_md.jobs
sarray -J 4_md -e LOGS/%x_%j.err -o LOGS/%x_%j.out --mem=10G 4_md.jobs
Combien de lecture sont dupliquées dans le run SRR7062654 ?
Solutiongrep ^LIBRARY -A1 SRR7062654.bam.metrics | datamash transpose
UNPAIRED_READ_DUPLICATES: 278 READ_PAIR_DUPLICATES: 27580
samtools view -f 1024 SRR7062654.md.bam | wc -l 55438 = 27580x2 + 278
Recalibrer les bases (BQSR), étape 1 GATK BaseRecalibrator.
SolutionBash
for bam in *.md.bam; do echo "gatk --java-options -Xmx7g BaseRecalibrator -I $bam -O ${bam%.bam}.recal.table --tmp-dir . -R Gallus_gallus.Gallus_gallus-5.0.dna.toplevel_chr25-26.fa --known-sites Gallus_gallus_incl_consequences_chr25-26.vcf --verbosity INFO"; done > 5_recal1.jobs
Perl
\ls *.md.bam | perl -lne '$out=$_; $out=~s/\.bam//; print "gatk --java-options -Xmx7g BaseRecalibrator -I $_ -O $out.recal.table --tmp-dir . -R Gallus_gallus.Gallus_gallus-5.0.dna.toplevel_chr25-26.fa --known-sites Gallus_gallus_incl_consequences_chr25-26.vcf --verbosity INFO"' > 5_recal1.jobs
sarray -J 5_recal1 -e LOGS/%x_%j.err -o LOGS/%x_%j.out --mem=10G 5_recal1.jobs
- Erreur1 : remarque il faut l'index (.fai) de la référence.
samtools faidx Gallus_gallus.Gallus_gallus-5.0.dna.toplevel_chr25-26.fa
- Erreur2 : remarque il faut le dictionaire (.dict) de la référence.
gatk --java-options "-Xms4000m -Xmx7g" CreateSequenceDictionary -R Gallus_gallus.Gallus_gallus-5.0.dna.toplevel_chr25-26.fa
- Erreur1 : remarque il faut l'index (.fai) de la référence.
Recalibrer les bases (BQSR), étape 2 GATK ApplyBQSR.
SolutionBash
for bam in *.md.bam; do echo "gatk --java-options -Xmx14g ApplyBQSR -R Gallus_gallus.Gallus_gallus-5.0.dna.toplevel_chr25-26.fa --input $bam --output ${bam%.bam}.recal.bam --bqsr-recal-file ${bam%.bam}.recal.table"; done > 6_recal2.jobs
Perl
\ls *.md.bam | perl -lne '$out=$_; $out=~s/\.bam//; print "gatk --java-options -Xmx14g ApplyBQSR -R Gallus_gallus.Gallus_gallus-5.0.dna.toplevel_chr25-26.fa --input $_ --output $out.recal.bam --bqsr-recal-file $out.recal.table"' > 6_recal2.jobs
sarray -J 6_recal2 -e LOGS/%x_%j.err -o LOGS/%x_%j.out --mem=15G 6_recal2.jobs
A l'aide de samtools stats comparer avant / après recalibration.
Solutionsamtools stat SRR7062654.bam > SRR7062654.bam.stat /usr/local/bioinfo/src/samtools/samtools-1.18/misc/plot-bamstats SRR7062654.bam.stat -p SRR7062654.bam.stat samtools stat SRR7062654.md.recal.bam > SRR7062654.md.recal.bam.stat /usr/local/bioinfo/src/samtools/samtools-1.18/misc/plot-bamstats SRR7062654.md.recal.bam.stat -p SRR7062654.md.recal.bam.stat mkdir ~/public_html/SRR7062654.bam.stat cp SRR7062654.bam.stat* ~/public_html/SRR7062654.bam.stat mkdir ~/public_html/SRR7062654.md.recal.bam.stat cp SRR7062654.md.recal.bam.stat* ~/public_html/SRR7062654.md.recal.bam.stat
Lancer le calling pour produire un G.VCF par échantillon (GATK HaplotypeCaller).
SolutionBash
for bam in *.md.recal.bam; do echo "gatk --java-options \"-Xmx10g -Xms1000m\" HaplotypeCaller -R Gallus_gallus.Gallus_gallus-5.0.dna.toplevel_chr25-26.fa -I $bam -O ${bam%.bam}.g.vcf -ERC GVCF"; done > 7_HaplotypeCaller.jobs
Perl
\ls *.md.recal.bam | perl -lne '$out=$_; $out=~s/\.bam//; print "gatk --java-options \"-Xmx10g -Xms1000m\" HaplotypeCaller -R Gallus_gallus.Gallus_gallus-5.0.dna.toplevel_chr25-26.fa -I $_ -O $out.g.vcf -ERC GVCF"' > 7_HaplotypeCaller.jobs
sarray -J 7_haplotypecaller -e LOGS/%x_%j.err -o LOGS/%x_%j.out --mem=15G 7_HaplotypeCaller.jobs
A partir des G.VCF individuels créer un G.VCF multisample.
SolutionBash
cmd="gatk --java-options -Xmx10g CombineGVCFs -R Gallus_gallus.Gallus_gallus-5.0.dna.toplevel_chr25-26.fa -O SRR.g.vcf" for vcf in *.g.vcf; do cmd+=" --variant $vcf"; done echo $cmd > 8_CombineGVCFs.jobs
💡 l'opérateur
+=
est un raccourci decmd=cmd+" --variant $vcf"
Perl
\ls *.g.vcf | perl -ne 'BEGIN { print "gatk --java-options -Xmx10g CombineGVCFs -R Gallus_gallus.Gallus_gallus-5.0.dna.toplevel_chr25-26.fa -O SRR.g.vcf"} chomp(); print " --variant $_"; END{print "\n"}' > 8_CombineGVCFs.jobs
sarray -J 8_CombineGVCFs -e LOGS/%x_%j.err -o LOGS/%x_%j.out --mem=15G 8_CombineGVCFs.jobs
Faire le Calling GATK GenotypeGVCFs.
Solutioncat > 9_GenotypeGVCFs.jobs gatk --java-options "-Xmx10g" GenotypeGVCFs -R Gallus_gallus.Gallus_gallus-5.0.dna.toplevel_chr25-26.fa -V SRR.g.vcf -O SRR.vcf.gz ^C sarray -J 9_GenotypeGVCFs -e LOGS/%x_%j.err -o LOGS/%x_%j.out --mem=15G 9_GenotypeGVCFs.jobs