Short-read alignment and small-size variant calling


Nextflow and nf-core/sarek course.


Part 1: execute the test pipeline

  1. Log on to the genobioinfo server.

    ssh -X <username>@genobioinfo.toulouse.inrae.fr
    
  2. Create (first time only) the Nextflow directories and links with the script /usr/local/bioinfo/src/NextflowWorkflows/create_nfx_dirs.sh.

    sh /usr/local/bioinfo/src/NextflowWorkflows/create_nfx_dirs.sh
    
  3. Create a nextflow-sarek directory in your working directory, then create a LOGS directory inside it.

    Solution
    cd ~/work
    mkdir nextflow-sarek
    cd nextflow-sarek
    mkdir LOGS
    
  4. Search for the latest nf-core module version installed on genobioinfo and load it.

    Solution
    search_module nfcore
    module load bioinfo/NextflowWorkflows/nfcore-Nextflow-v23.04.3
    
  5. Move to a compute node (with 4 GB of memory) and execute nextflow. How many sub-commands are available?

    Solution

    15 sub-commands are available:

    srun --mem 4G --pty bash
    nextflow
    Usage: nextflow [options] COMMAND [arg...]                       
    
    Options:
     -C
        Use the specified configuration file(s) overriding any defaults
     -D
        Set JVM properties
     -bg
        Execute nextflow in background
     -c, -config
        Add the specified file to configuration set
     -config-ignore-includes
        Disable the parsing of config includes
     -d, -dockerize
        Launch nextflow via Docker (experimental)
     -h
        Print this help
     -log
        Set nextflow log file path
     -q, -quiet
        Do not print information messages
     -syslog
        Send logs to syslog server (eg. localhost:514)
     -trace
        Enable trace level logging for the specified package name - multiple packages can be provided separating them with a comma e.g. '-trace nextflow,io.seqera'
     -v, -version
        Print the program version
    
    Commands:
    clean         Clean up project cache and work directories
    clone         Clone a project into a folder
    config        Print a project configuration
    console       Launch Nextflow interactive console
    drop          Delete the local copy of a project
    help          Print the usage help for a command
    info          Print project and system runtime information
    kuberun       Execute a workflow in a Kubernetes cluster (experimental)
    list          List all downloaded projects
    log           Print executions log and runtime info
    pull          Download or update a project
    run           Execute a pipeline project
    secrets       Manage pipeline secrets (preview)
    self-update   Update nextflow runtime to the latest available version
    view          View project script file(s)
    
  6. Execute the following command to ensure that the sarek pipeline version 3.3.2 can run properly on our cluster:

    nextflow run nf-core/sarek -r 3.3.2 -profile genotoul,test --outdir test -work-dir work_test
    
    • Has the pipeline been successfully completed?
    • Does nextflow use singularity?
    • Which variant calling software was used?
    • In which directory did the job CREATE_INTERVALS_BED run?
    Solution
    nextflow run nf-core/sarek -r 3.3.2 -profile genotoul,test --outdir test -work-dir work_test
    
     nf-core/sarek v3.3.2-gf034b73
    ------------------------------------------------------
    Core Nextflow options
    revision                  : 3.3.2
    runName                   : stoic_snyder
    containerEngine           : singularity
    ...
    
    Main options
    split_fastq               : 0
    intervals                 : https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/genome.interval_list
    tools                     : strelka
    ...
    
    [28/646f54] process > NFCORE_SAREK:SAREK:PREPARE_GENOME:TABIX_KNOWN_INDELS (mills_and_1000G.indels.vcf)                             [100%] 1 of 1 ✔
    [-        ] process > NFCORE_SAREK:SAREK:PREPARE_GENOME:TABIX_PON                                                                   -
    [63/93aad8] process > NFCORE_SAREK:SAREK:PREPARE_INTERVALS:CREATE_INTERVALS_BED (genome.interval_list)                              [100%] 1 of 1 ✔
    [61/2910ad] process > NFCORE_SAREK:SAREK:PREPARE_INTERVALS:GATK4_INTERVALLISTTOBED (genome)                                         [100%] 1 of 1 ✔
    [1a/79797b] process > NFCORE_SAREK:SAREK:PREPARE_INTERVALS:TABIX_BGZIPTABIX_INTERVAL_SPLIT (chr22_1-40001)                          [100%] 1 of 1 ✔
    ...
    
    -[nf-core/sarek] Pipeline completed successfully-
    Completed at: 06-Nov-2023 15:09:16
    Duration    : 2m 37s
    CPU hours   : (a few seconds)
    ...
    
    • Has the pipeline been successfully completed? Yes, it has been successfully completed as stated at the end of the execution log.

    • Does nextflow use singularity? Yes, the genotoul profile defines singularity as the container engine, as indicated at the beginning of the execution log (containerEngine : singularity).

    • Which variant calling software was used? The caller is listed under tools in the Main options section of the log; here, strelka was used.

    • In which directory did the job CREATE_INTERVALS_BED run? Look at the line [63/93aad8] process > NFCORE_SAREK:SAREK:PREPARE_INTERVALS:CREATE_INTERVALS_BED. The hash prefix gives the work directory: since the run used -work-dir work_test, the job ran in the work_test/63/93aad855ad899aa4f7ebae39c89351 directory.

Part 2: configure the pipeline with your data

  1. Leave the compute node using the exit command and create a sample.csv file with one line per fastq pair, see example.

    Solution

    The sample.csv file should contain the following lines:

    cat > sample.csv
    patient,sample,lane,fastq_1,fastq_2
    chicken,SRR7062654,1,/save/user/formation/public_html/16_SGS-SNP/Data_Chr25-26/SRR7062654_R1.fastq.gz,/save/user/formation/public_html/16_SGS-SNP/Data_Chr25-26/SRR7062654_R2.fastq.gz
    chicken,SRR7062655,1,/save/user/formation/public_html/16_SGS-SNP/Data_Chr25-26/SRR7062655_R1.fastq.gz,/save/user/formation/public_html/16_SGS-SNP/Data_Chr25-26/SRR7062655_R2.fastq.gz
    ^D
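The interactive `cat` above can also be replaced by a quoted heredoc, which is easier to script and to check. The sketch below writes the same sample.csv and then verifies with awk that every row has the five comma-separated fields sarek expects:

```shell
# Non-interactive alternative to the interactive cat: a quoted heredoc.
cat > sample.csv <<'EOF'
patient,sample,lane,fastq_1,fastq_2
chicken,SRR7062654,1,/save/user/formation/public_html/16_SGS-SNP/Data_Chr25-26/SRR7062654_R1.fastq.gz,/save/user/formation/public_html/16_SGS-SNP/Data_Chr25-26/SRR7062654_R2.fastq.gz
chicken,SRR7062655,1,/save/user/formation/public_html/16_SGS-SNP/Data_Chr25-26/SRR7062655_R1.fastq.gz,/save/user/formation/public_html/16_SGS-SNP/Data_Chr25-26/SRR7062655_R2.fastq.gz
EOF

# Sanity check: every line must have exactly 5 fields; print offenders, if any.
awk -F',' 'NF != 5 {print "Bad line " NR ": " $0; bad=1} END {exit bad}' sample.csv \
  && echo "sample.csv looks well-formed"
```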
    
  2. Create a nextflow.config file containing the genome information.

    cat > nextflow.config
    params {
     genomes {
       'Gallus_gallus-5.0_25-26' {
         fasta        = "${params.genomes_base}/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel_chr25-26.fa"
         species      = 'Gallus_gallus'
         known_indels = "${params.genomes_base}/Gallus_gallus_incl_consequences_chr25-26.vcf.gz"
       }
     }
    }
    ^D
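Like the sample sheet, this file can be written non-interactively with a heredoc. The quoted delimiter ('EOF') matters: without the quotes, the shell would expand ${params.genomes_base} itself (to an empty string), instead of leaving it for Nextflow to resolve:

```shell
# Quoting 'EOF' prevents the shell from expanding ${params.genomes_base};
# the placeholder must reach Nextflow verbatim.
cat > nextflow.config <<'EOF'
params {
 genomes {
   'Gallus_gallus-5.0_25-26' {
     fasta        = "${params.genomes_base}/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel_chr25-26.fa"
     species      = 'Gallus_gallus'
     known_indels = "${params.genomes_base}/Gallus_gallus_incl_consequences_chr25-26.vcf.gz"
   }
 }
}
EOF
grep -c 'params.genomes_base' nextflow.config   # prints 2: the placeholder survived intact
```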
    

Part 3: run the pipeline with the chicken data

  1. Run sarek with the following options:

    -r 3.3.2
    -profile genotoul
    --genomes_base /save/user/formation/public_html/16_SGS-SNP/Data_Chr25-26/
    --genome Gallus_gallus-5.0_25-26
    --tools haplotypecaller,freebayes,deepvariant
    --joint_germline
    --input sample.csv
    --outdir results_chicken
    
    Solution

    Write a sarek_chicken.sh script as follows:

    cat > sarek_chicken.sh
    #!/bin/bash
    #SBATCH -J sarek_chicken
    #SBATCH -p workq
    #SBATCH --export=ALL
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=4G
    #SBATCH -e LOGS/%x_%j.err
    #SBATCH -o LOGS/%x_%j.out
    module purge
    module load bioinfo/NextflowWorkflows/nfcore-Nextflow-v23.04.3
    nextflow run nf-core/sarek \
     -r 3.3.2 \
     -profile genotoul \
     --genome 'Gallus_gallus-5.0_25-26' \
     --genomes_base /save/user/formation/public_html/16_SGS-SNP/Data_Chr25-26/ \
     --tools haplotypecaller,freebayes,deepvariant \
     --joint_germline \
     --input sample.csv \
     --outdir results_chicken
    ^D
    

    And run it using sbatch:

    sbatch sarek_chicken.sh
    
  2. Has the pipeline been successfully completed?

    Solution

    Yes, it has been successfully completed.

    Two ways to check:

    • read the file LOGS/sarek_chicken*.out written by slurm and check the end message,
    • use the nextflow log command and check the status of the last executed nextflow command.

    💡 The nextflow command can provide a great deal of information about previous pipeline executions. Type the command nextflow log -l to list all the fields that can be included in the output of this command. For example, the following command allows you to see the slurm job_id assigned to each of the jobs run by the last executed pipeline:

    nextflow log $(nextflow log -q | tail -1) -f name,status,workdir,native_id

    Notes

    If the pipeline failed, you can resume it from the last successful step using the nextflow -resume option.

    If you have to analyse new samples with the same reference genome, you can reuse indexes generated during a previous execution.

    • you can create symbolic links to the fasta and vcf.gz files into the directory results_chicken/reference to group all files in the same directory,
    • and edit your nextflow.config file to add all required indexes.

    Here is an example nextflow.config file:

    params {
     genomes {
       'Gallus_gallus-5.0' {
         fasta            = "${params.genomes_base}/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel.fa"
         fai              = "${params.genomes_base}/fai/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel.fa.fai"
         dict             = "${params.genomes_base}/dict/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel.dict"
         species          = 'Gallus_gallus'
         bwa              = "${params.genomes_base}/bwa/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel.fa.{amb,ann,bwt,pac,sa}"
         known_indels     = "${params.genomes_base}/Gallus_gallus_incl_consequences.vcf.gz"
         known_indels_tbi = "${params.genomes_base}/Gallus_gallus_incl_consequences.vcf.gz.tbi"
       }
     }
    }
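The symlink grouping described above can be sketched as follows. All file and directory names here are illustrative stand-ins for the indexes a previous sarek run actually produced; adapt the paths to your own results directory:

```shell
# Fabricate a previous run's index files (illustrative names only), then
# group them under one genomes_base directory via symbolic links, matching
# the fai/, dict/ and bwa/ sub-directories used in the config example above.
mkdir -p previous_run/reference genomes_base/fai genomes_base/dict genomes_base/bwa
touch previous_run/reference/genome.fa.fai \
      previous_run/reference/genome.dict \
      previous_run/reference/genome.fa.bwt

# Absolute targets keep the links valid wherever genomes_base is referenced from.
ln -sf "$PWD/previous_run/reference/genome.fa.fai" genomes_base/fai/
ln -sf "$PWD/previous_run/reference/genome.dict"   genomes_base/dict/
ln -sf "$PWD/previous_run/reference/genome.fa.bwt" genomes_base/bwa/
ls -l genomes_base/fai genomes_base/dict genomes_base/bwa
```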
    

    During the next execution, these indexes will be reused directly instead of being rebuilt.

  3. Now we need to make our HTML reports viewable.

    • Copy the directory results_chicken/multiqc into your public_html, and explore multiqc_report.html in your browser.
    • Copy the directory results_chicken/pipeline_info into your public_html and check execution_report_*.html, execution_timeline_*.html and software_versions.yml files.

    Explore the results directory results_chicken.

  4. Let's make a rough comparison of the output of the three callers used. Only HaplotypeCaller provides a merged vcf of the two samples (results_chicken/variant_calling/haplotypecaller/joint_variant_calling/joint_germline.vcf.gz). For FreeBayes and DeepVariant, we have to do the merging by hand. Merge the per-sample vcfs of these two callers using bcftools merge.

    Solution
    search_module bcftools
    module load bioinfo/Bcftools/1.17
    for caller in deepvariant freebayes; do
     echo "bcftools merge -O z results_chicken/variant_calling/$caller/SRR706265?/SRR706265?.$caller.vcf.gz -o $caller.vcf.gz"
    done > bcftools_merge.jobs
    sarray -J bcftools_merge -e LOGS/%x_%j.err -o LOGS/%x_%j.out bcftools_merge.jobs
    

    In the current directory, create a symbolic link to the HaplotypeCaller merged vcf with the name haplotypecaller.vcf.gz.

    Solution
    ln -s results_chicken/variant_calling/haplotypecaller/joint_variant_calling/joint_germline.vcf.gz haplotypecaller.vcf.gz
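Note that a symlink with a relative target resolves relative to the directory containing the link; since the link above is created in the current directory, the relative path works. A small self-contained illustration (with a placeholder file, not a real vcf):

```shell
# Placeholder target standing in for the real joint_germline.vcf.gz.
mkdir -p demo/joint_variant_calling
printf 'fake vcf\n' > demo/joint_variant_calling/joint_germline.vcf.gz

# Relative target: resolved from the link's own directory (here, the cwd).
ln -sf demo/joint_variant_calling/joint_germline.vcf.gz haplotypecaller_demo.vcf.gz
readlink haplotypecaller_demo.vcf.gz   # shows the relative target
cat haplotypecaller_demo.vcf.gz        # reading the link follows it to the target
```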
    

    For each merged vcf, extract chromosome/position columns to a tsv file and use jvenn to compare data.

    Solution
    for caller in deepvariant freebayes haplotypecaller; do
     zcat $caller.vcf.gz | grep -v ^# | cut -f 1,2 > $caller.tsv
    done
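jvenn gives a visual answer, but the same comparison can be done numerically on the command line with sort and comm. The sketch below uses two tiny fabricated TSVs in place of the real deepvariant.tsv and freebayes.tsv produced above:

```shell
# Two tiny fabricated chromosome/position TSVs (stand-ins for the real
# per-caller files), so the sketch is self-contained.
printf '25\t100\n25\t200\n26\t300\n' > caller_a.tsv
printf '25\t100\n26\t300\n26\t400\n' > caller_b.tsv

# comm needs sorted input; -12 keeps lines common to both files,
# -23 keeps lines only in the first, -13 lines only in the second.
sort caller_a.tsv > caller_a.sorted
sort caller_b.tsv > caller_b.sorted
echo "shared sites : $(comm -12 caller_a.sorted caller_b.sorted | wc -l)"
echo "a only       : $(comm -23 caller_a.sorted caller_b.sorted | wc -l)"
echo "b only       : $(comm -13 caller_a.sorted caller_b.sorted | wc -l)"
```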
    

    What are your thoughts on the results of this comparison?


Please fill in the following satisfaction survey: https://sondages.inrae.fr/index.php/84236?lang=fr
