Exercice 5 : préprocessing et calling - GATK


  1. Charger le module GATK.

    Solution
    search_module gatk
    module load bioinfo/GATK/4.4.0.0
    module load devel/python/Python-3.11.1 devel/java/17.0.6
    
  1. Marquer les duplicats (GATK MarkDuplicates).

    Solution

    Bash

    for bam in *.bam; do echo "gatk --java-options \"-Xms4000m -Xmx7g\" MarkDuplicates --INPUT $bam --METRICS_FILE ${bam}.metrics --TMP_DIR . --ASSUME_SORT_ORDER coordinate --CREATE_INDEX true --OUTPUT ${bam%.bam}.md.bam"; done > 4_md.jobs
    

    Perl

    \ls *.bam | perl -lne '$out=$_; $out=~s/\.bam//; print "gatk --java-options \"-Xms4000m -Xmx7g\" MarkDuplicates --INPUT $_ --METRICS_FILE $_.metrics --TMP_DIR . --ASSUME_SORT_ORDER coordinate --CREATE_INDEX true --OUTPUT $out.md.bam"' > 4_md.jobs
    
    sarray -J 4_md -e LOGS/%x_%j.err -o LOGS/%x_%j.out --mem=10G 4_md.jobs
    
  2. Combien de lecture sont dupliquées dans le run SRR7062654 ?

    Solution
    grep ^LIBRARY -A1 SRR7062654.bam.metrics | datamash transpose
    
    UNPAIRED_READ_DUPLICATES: 278
    READ_PAIR_DUPLICATES: 27580
    
    samtools view -f 1024 SRR7062654.md.bam | wc -l
    55438 = 27580x2 + 278
    
  3. Recalibrer les bases (BQSR), étape 1 GATK BaseRecalibrator.

    Solution

    Bash

    for bam in *.md.bam; do echo "gatk --java-options -Xmx7g BaseRecalibrator -I $bam -O ${bam%.bam}.recal.table --tmp-dir . -R Gallus_gallus.Gallus_gallus-5.0.dna.toplevel_chr25-26.fa --known-sites Gallus_gallus_incl_consequences_chr25-26.vcf --verbosity INFO"; done > 5_recal1.jobs
    

    Perl

    \ls *.md.bam | perl -lne '$out=$_; $out=~s/\.bam//; print "gatk --java-options -Xmx7g BaseRecalibrator -I $_ -O $out.recal.table --tmp-dir . -R Gallus_gallus.Gallus_gallus-5.0.dna.toplevel_chr25-26.fa --known-sites Gallus_gallus_incl_consequences_chr25-26.vcf --verbosity INFO"' > 5_recal1.jobs
    
    sarray -J 5_recal1 -e LOGS/%x_%j.err -o LOGS/%x_%j.out --mem=10G 5_recal1.jobs
    
    • Erreur1 : remarque il faut l'index (.fai) de la référence.
      samtools faidx Gallus_gallus.Gallus_gallus-5.0.dna.toplevel_chr25-26.fa
      
    • Erreur2 : remarque il faut le dictionaire (.dict) de la référence.
      gatk --java-options "-Xms4000m -Xmx7g" CreateSequenceDictionary -R Gallus_gallus.Gallus_gallus-5.0.dna.toplevel_chr25-26.fa
      
  4. Recalibrer les bases (BQSR), étape 2 GATK ApplyBQSR.

    Solution

    Bash

    for bam in *.md.bam; do echo "gatk --java-options -Xmx14g ApplyBQSR -R Gallus_gallus.Gallus_gallus-5.0.dna.toplevel_chr25-26.fa --input $bam --output ${bam%.bam}.recal.bam --bqsr-recal-file ${bam%.bam}.recal.table"; done > 6_recal2.jobs
    

    Perl

    \ls *.md.bam | perl -lne '$out=$_; $out=~s/\.bam//; print "gatk --java-options -Xmx14g ApplyBQSR -R Gallus_gallus.Gallus_gallus-5.0.dna.toplevel_chr25-26.fa --input $_ --output $out.recal.bam --bqsr-recal-file $out.recal.table"' > 6_recal2.jobs
    
    sarray -J 6_recal2 -e LOGS/%x_%j.err -o LOGS/%x_%j.out --mem=15G 6_recal2.jobs
    
  5. A l'aide de samtools stats comparer avant / après recalibration.

    Solution
    samtools stat SRR7062654.bam > SRR7062654.bam.stat
    /usr/local/bioinfo/src/samtools/samtools-1.18/misc/plot-bamstats SRR7062654.bam.stat -p SRR7062654.bam.stat
    samtools stat SRR7062654.md.recal.bam > SRR7062654.md.recal.bam.stat
    /usr/local/bioinfo/src/samtools/samtools-1.18/misc/plot-bamstats SRR7062654.md.recal.bam.stat -p SRR7062654.md.recal.bam.stat
    
    mkdir ~/public_html/SRR7062654.bam.stat
    cp SRR7062654.bam.stat* ~/public_html/SRR7062654.bam.stat
    mkdir ~/public_html/SRR7062654.md.recal.bam.stat
    cp SRR7062654.md.recal.bam.stat* ~/public_html/SRR7062654.md.recal.bam.stat
    
  6. Lancer le calling pour produire un G.VCF par échantillon (GATK HaplotypeCaller).

    Solution

    Bash

    for bam in *.md.recal.bam; do echo "gatk --java-options \"-Xmx10g -Xms1000m\" HaplotypeCaller -R Gallus_gallus.Gallus_gallus-5.0.dna.toplevel_chr25-26.fa -I $bam -O ${bam%.bam}.g.vcf -ERC GVCF"; done > 7_HaplotypeCaller.jobs
    

    Perl

    \ls *.md.recal.bam | perl -lne '$out=$_; $out=~s/\.bam//; print "gatk --java-options \"-Xmx10g -Xms1000m\" HaplotypeCaller -R Gallus_gallus.Gallus_gallus-5.0.dna.toplevel_chr25-26.fa -I $_ -O $out.g.vcf -ERC GVCF"' > 7_HaplotypeCaller.jobs
    
    sarray -J 7_haplotypecaller -e LOGS/%x_%j.err -o LOGS/%x_%j.out --mem=15G 7_HaplotypeCaller.jobs
    
  7. A partir des G.VCF individuels créer un G.VCF multisample.

    Solution

    Bash

    cmd="gatk --java-options -Xmx10g CombineGVCFs -R Gallus_gallus.Gallus_gallus-5.0.dna.toplevel_chr25-26.fa -O SRR.g.vcf"
    for vcf in *.g.vcf; do cmd+=" --variant $vcf"; done
    echo $cmd > 8_CombineGVCFs.jobs
    

    💡 l'opérateur += est un raccourci de cmd=cmd+" --variant $vcf"

    Perl

    \ls *.g.vcf | perl -ne 'BEGIN { print "gatk --java-options -Xmx10g CombineGVCFs -R Gallus_gallus.Gallus_gallus-5.0.dna.toplevel_chr25-26.fa -O SRR.g.vcf"} chomp(); print " --variant $_"; END{print "\n"}' > 8_CombineGVCFs.jobs
    
    sarray -J 8_CombineGVCFs -e LOGS/%x_%j.err -o LOGS/%x_%j.out --mem=15G 8_CombineGVCFs.jobs
    
  8. Faire le Calling GATK GenotypeGVCFs.

    Solution
    cat > 9_GenotypeGVCFs.jobs
    gatk --java-options "-Xmx10g" GenotypeGVCFs -R Gallus_gallus.Gallus_gallus-5.0.dna.toplevel_chr25-26.fa -V SRR.g.vcf -O SRR.vcf.gz
    ^C
    sarray -J 9_GenotypeGVCFs -e LOGS/%x_%j.err -o LOGS/%x_%j.out --mem=15G 9_GenotypeGVCFs.jobs
    

results matching ""

    No results matching ""