Short-read alignment and small-size variant calling
Nextflow and nf-core/sarek course.
Part 1: execute the test pipeline
Log on to the genobioinfo server:

```shell
ssh -X <username>@genobioinfo.toulouse.inrae.fr
```

Create (first time only) the Nextflow directories and links with the `create_nfx_dirs.sh` script:

```shell
sh /usr/local/bioinfo/src/NextflowWorkflows/create_nfx_dirs.sh
```
Create a `nextflow-sarek` directory in your working directory. Create a `LOGS` directory inside it.

Solution:

```shell
cd ~/work
mkdir nextflow-sarek
cd nextflow-sarek
mkdir LOGS
```
Search for the latest nf-core module version installed on genobioinfo and load it.

Solution:

```shell
search_module nfcore
module load bioinfo/NextflowWorkflows/nfcore-Nextflow-v23.04.3
```
Move to a compute node (with 4 GB of memory) and execute `nextflow`. How many sub-commands are available?

Solution: 15 sub-commands are available:

```shell
srun --mem 4G --pty bash
nextflow
```

```
Usage: nextflow [options] COMMAND [arg...]

Options:
  -C                       Use the specified configuration file(s) overriding any defaults
  -D                       Set JVM properties
  -bg                      Execute nextflow in background
  -c, -config              Add the specified file to configuration set
  -config-ignore-includes  Disable the parsing of config includes
  -d, -dockerize           Launch nextflow via Docker (experimental)
  -h                       Print this help
  -log                     Set nextflow log file path
  -q, -quiet               Do not print information messages
  -syslog                  Send logs to syslog server (eg. localhost:514)
  -trace                   Enable trace level logging for the specified package name -
                           multiple packages can be provided separating them with a
                           comma e.g. '-trace nextflow,io.seqera'
  -v, -version             Print the program version

Commands:
  clean         Clean up project cache and work directories
  clone         Clone a project into a folder
  config        Print a project configuration
  console       Launch Nextflow interactive console
  drop          Delete the local copy of a project
  help          Print the usage help for a command
  info          Print project and system runtime information
  kuberun       Execute a workflow in a Kubernetes cluster (experimental)
  list          List all downloaded projects
  log           Print executions log and runtime info
  pull          Download or update a project
  run           Execute a pipeline project
  secrets       Manage pipeline secrets (preview)
  self-update   Update nextflow runtime to the latest available version
  view          View project script file(s)
```
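Rather than counting the sub-commands by eye, you can count them from the help text itself. A minimal sketch, assuming the `Commands:` section layout shown above; the `count_subcommands` helper name is ours:

```shell
# Count the sub-commands listed after the "Commands:" header of a help text.
# Assumes each command is a non-empty line from "Commands:" to the end of input.
count_subcommands() {
    awk '/^Commands:/ {in_cmds=1; next} in_cmds && NF {n++} END {print n+0}'
}
# usage: nextflow -h | count_subcommands
```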
Execute the following command to ensure that the sarek pipeline version 3.3.2 can run properly on our cluster:

```shell
nextflow run nf-core/sarek -r 3.3.2 -profile genotoul,test --outdir test -work-dir work_test
```
- Has the pipeline been successfully completed?
- Does nextflow use singularity?
- Which variant calling software was used?
- In which directory did the job `CREATE_INTERVALS_BED` run?

Solution:

```shell
nextflow run nf-core/sarek -r 3.3.2 -profile genotoul,test --outdir test -work-dir work_test
```

```
nf-core/sarek v3.3.2-gf034b73
------------------------------------------------------
Core Nextflow options
  revision        : 3.3.2
  runName         : stoic_snyder
  containerEngine : singularity
  ...
Main options
  split_fastq     : 0
  intervals       : https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/genome.interval_list
  tools           : strelka
  ...
[28/646f54] process > NFCORE_SAREK:SAREK:PREPARE_GENOME:TABIX_KNOWN_INDELS (mills_and_1000G.indels.vcf)    [100%] 1 of 1 ✔
[-        ] process > NFCORE_SAREK:SAREK:PREPARE_GENOME:TABIX_PON                                          -
[63/93aad8] process > NFCORE_SAREK:SAREK:PREPARE_INTERVALS:CREATE_INTERVALS_BED (genome.interval_list)     [100%] 1 of 1 ✔
[61/2910ad] process > NFCORE_SAREK:SAREK:PREPARE_INTERVALS:GATK4_INTERVALLISTTOBED (genome)                [100%] 1 of 1 ✔
[1a/79797b] process > NFCORE_SAREK:SAREK:PREPARE_INTERVALS:TABIX_BGZIPTABIX_INTERVAL_SPLIT (chr22_1-40001) [100%] 1 of 1 ✔
...
-[nf-core/sarek] Pipeline completed successfully-
Completed at: 06-Nov-2023 15:09:16
Duration    : 2m 37s
CPU hours   : (a few seconds)
...
```
Has the pipeline been successfully completed? Yes, it has, as stated at the end of the execution log.

Does nextflow use singularity? Yes, the genotoul profile defines singularity as the container engine; this is indicated at the beginning of the execution log.

Which variant calling software was used? The caller is indicated in the `tools` entry of the sarek main options: strelka.

In which directory did the job `CREATE_INTERVALS_BED` run? Look at the line `[63/93aad8] process > NFCORE_SAREK:SAREK:PREPARE_INTERVALS:CREATE_INTERVALS_BED`. The job ran in the `work/63/93aad855ad899aa4f7ebae39c89351` directory.
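The short hash in the progress line (here `63/93aad8`) is a prefix of the full work directory name, so it can be expanded with a simple glob. A minimal sketch; the `task_workdir` helper name is ours:

```shell
# Expand a short Nextflow task hash (e.g. 63/93aad8) to its full work directory.
# The second argument is the work directory root (defaults to "work").
task_workdir() {
    ls -d "${2:-work}/$1"* 2>/dev/null
}
# usage: task_workdir 63/93aad8            # looks under ./work
#        task_workdir 63/93aad8 work_test  # looks under ./work_test
```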
Part 2: configure the pipeline with your data
Leave the compute node using the `exit` command and create a `sample.csv` file with one line per fastq pair (see example).

Solution: The `sample.csv` file should contain the following lines:

```shell
cat > sample.csv
patient,sample,lane,fastq_1,fastq_2
chicken,SRR7062654,1,/save/user/formation/public_html/16_SGS-SNP/Data_Chr25-26/SRR7062654_R1.fastq.gz,/save/user/formation/public_html/16_SGS-SNP/Data_Chr25-26/SRR7062654_R2.fastq.gz
chicken,SRR7062655,1,/save/user/formation/public_html/16_SGS-SNP/Data_Chr25-26/SRR7062655_R1.fastq.gz,/save/user/formation/public_html/16_SGS-SNP/Data_Chr25-26/SRR7062655_R2.fastq.gz
^C
```
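Before launching the pipeline it can be worth sanity-checking the samplesheet. A minimal sketch: the `check_samplesheet` helper name is ours, and sarek itself performs a much stricter validation at startup:

```shell
# Minimal samplesheet check: every data row (after the header) must have
# exactly 5 comma-separated fields and both fastq columns must end in .fastq.gz.
check_samplesheet() {
    awk -F',' 'NR>1 && (NF!=5 || $4 !~ /\.fastq\.gz$/ || $5 !~ /\.fastq\.gz$/) {print "bad line " NR ": " $0; bad=1} END {exit bad+0}' "$1"
}
# usage: check_samplesheet sample.csv && echo "sample.csv looks OK"
```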
Create a `nextflow.config` file containing the genome information.

```shell
cat > nextflow.config
params {
    genomes {
        'Gallus_gallus-5.0_25-26' {
            fasta        = "${params.genomes_base}/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel_chr25-26.fa"
            species      = 'Gallus_gallus'
            known_indels = "${params.genomes_base}/Gallus_gallus_incl_consequences_chr25-26.vcf.gz"
        }
    }
}
^C
```
Part 3: run the pipeline with the chicken data
Run sarek with the following options:

```
-r 3.3.2 -profile genotoul --genomes_base /save/formation/public_html/16_SGS-SNP/Data_Chr25-26/ --genome Gallus_gallus-5.0_25-26 --tools haplotypecaller,freebayes,deepvariant --joint_germline --input sample.csv --outdir results_chicken
```
Solution: Write a `sarek_chicken.sh` script as follows:

```shell
cat > sarek_chicken.sh
#!/bin/bash
#SBATCH -J sarek_chicken
#SBATCH -p workq
#SBATCH --export=ALL
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH -e LOGS/%x_%j.err
#SBATCH -o LOGS/%x_%j.out

module purge
module load bioinfo/NextflowWorkflows/nfcore-Nextflow-v23.04.3

nextflow run nf-core/sarek \
    -r 3.3.2 \
    -profile genotoul \
    --genome 'Gallus_gallus-5.0_25-26' \
    --genomes_base /save/user/formation/public_html/16_SGS-SNP/Data_Chr25-26/ \
    --tools haplotypecaller,freebayes,deepvariant \
    --joint_germline \
    --input sample.csv \
    --outdir results_chicken
^C
```
And run it using `sbatch`:

```shell
sbatch sarek_chicken.sh
```
Has the pipeline been successfully completed?

Solution: Yes, it has been successfully completed. Two ways to check:

- read the `LOGS/sarek_chicken*.out` file written by slurm and check the end message,
- use the `nextflow log` command and check the status of the last executed nextflow command.
💡 The `nextflow log` command can provide a great deal of information about previous pipeline executions. Type `nextflow log -l` to list all the fields that can be included in the output of this command. For example, the following command allows you to see the slurm job id assigned to each of the jobs run by the last executed pipeline:

```shell
nextflow log $(nextflow log -q | tail -1) -f name,status,workdir,native_id
```
Notes:

If the pipeline failed, you can perform an error recovery using the nextflow `-resume` option.

If you have to analyse new samples with the same reference genome, you can reuse the indexes generated during a previous execution:

- create symbolic links to the `fasta` and `vcf.gz` files in the `results_chicken/reference` directory to group all files in the same place,
- edit your `nextflow.config` file to add all the required indexes.
Here is an example `nextflow.config` file:

```groovy
params {
    genomes {
        'Gallus_gallus-5.0' {
            fasta            = "${params.genomes_base}/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel.fa"
            fai              = "${params.genomes_base}/fai/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel.fa.fai"
            dict             = "${params.genomes_base}/dict/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel.dict"
            species          = 'Gallus_gallus'
            bwa              = "${params.genomes_base}/bwa/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel.fa.{amb,ann,bwt,pac,sa}"
            known_indels     = "${params.genomes_base}/Gallus_gallus_incl_consequences.vcf.gz"
            known_indels_tbi = "${params.genomes_base}/Gallus_gallus_incl_consequences.vcf.gz.tbi"
        }
    }
}
```

During the next execution, the retrieved indexes will be reused directly instead of being rebuilt.
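The symbolic-link step can be sketched with a small helper. A sketch only: the `link_into` name is ours and the file arguments are illustrative:

```shell
# Link a set of files into one directory (creating it if needed),
# resolving each source to an absolute path first so the links stay valid.
link_into() {
    dest=$1; shift
    mkdir -p "$dest"
    for f in "$@"; do
        ln -sf "$(readlink -f "$f")" "$dest/"
    done
}
# usage: link_into results_chicken/reference /path/to/genome.fa /path/to/known_indels.vcf.gz
```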
Now we need to make our HTML reports viewable.

- Copy the `results_chicken/multiqc` directory into your `public_html`, and explore `multiqc_report.html` in your browser.
- Copy the `results_chicken/pipeline_info` directory into your `public_html` and check the `execution_report_*.html`, `execution_timeline_*.html` and `software_versions.yml` files.
Explore the `results_chicken` results directory.
Let's make a rough comparison of the outputs of the three callers used. Only HaplotypeCaller provides us with a merged vcf of the two samples (`results_chicken/variant_calling/haplotypecaller/joint_variant_calling/joint_germline.vcf.gz`). For Freebayes and DeepVariant, we have to do the merging by hand. Merge the vcf of each sample for these two callers using `bcftools merge`.

Solution:

```shell
search_module bcftools
module load bioinfo/Bcftools/1.17
for caller in deepvariant freebayes; do
    echo "bcftools merge results_chicken/variant_calling/$caller/SRR706265?/SRR706265?.$caller.vcf.gz -o $caller.vcf.gz"
done > bcftools_merge.jobs
sarray -J bcftools_merge -e LOGS/%x_%j.err -o LOGS/%x_%j.out bcftools_merge.jobs
```
In the current directory, create a symbolic link to the HaplotypeCaller merged vcf with the name `haplotypecaller.vcf.gz`.

Solution:

```shell
ln -s results_chicken/variant_calling/haplotypecaller/joint_variant_calling/joint_germline.vcf.gz haplotypecaller.vcf.gz
```
For each merged vcf, extract the chromosome and position columns to a tsv file and use jvenn to compare the data.

Solution:

```shell
for caller in deepvariant freebayes haplotypecaller; do
    zcat $caller.vcf.gz | grep -v ^# | cut -f 1,2 > $caller.tsv
done
```
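If you prefer a command-line check before (or instead of) jvenn, the pairwise overlap between two callers' position lists can be counted with `comm`. A sketch: the `shared_positions` name is ours, and it assumes bash for process substitution:

```shell
# Count chromosome/position pairs present in both tsv files.
# comm requires sorted input, hence the sort -u on each file (bash required).
shared_positions() {
    comm -12 <(sort -u "$1") <(sort -u "$2") | wc -l
}
# usage: shared_positions haplotypecaller.tsv deepvariant.tsv
```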
What are your thoughts on the results of this comparison?
Please fill in the following satisfaction survey: https://sondages.inrae.fr/index.php/84236?lang=fr