Running Freyja on other pathogens - Measles

If you want to detect other pathogens using freyja, this is a tutorial on how to run the pipeline for Measles samples.

STEPS

1. Download the reference genome for your pathogen. It can be downloaded from NCBI database or use the references provided by Nextstrain . For the purpose of this tutorial, we will be using NC_001498.1. The reference genome file can also be found here.

2. Download the latest lineage barcodes from Freyja barcodes repository. For this analysis we will be using Measles barcodes which can also be found here.

Please note that there are two sets of barcodes available for Measles. Here, we use the whole genome barcodes but you may also use the N450 region barcodes. It is important, however, to use the right reference genome according to the barcode used. If you are using region N450 barcodes, please make sure to use the reference that includes N450 region only.

3. Prepare your primer file in a bed format. For this tutorial, we will be using artic-measles V1.0.0 panel also provided here. This file will be used to simulate measles amplicon sequencing reads and trim the primers in the downstream analysis.

4. Prepare your reads, by assessing quality of the reads and removing the sequencing adapters. We will be using simulated reads produced by MixAmp, an amplicon read simulator developed by our team.

For instance, we generated reads from two different measles samples using this code as following:

  • Pure sample: Reads simulated from a H1 lineage sample individually.

  • Mixed samples: Reads with mixing proportions H1:20% and D9:80%.

Pure sample read simulation:

mixamp simulate-proportions GCA_031128185.1.fna primer.bed --outdir measles-H1-100/

Mixed sample read simulation:

mixamp simulate-proportions GCA_031128185.1.fna,GCA_031129565.1.fna primer.bed --outdir measles-H1-20-D9-80/ --proportions 0.2,0.8

5. Align your reads to your reference genome using an aligner of your choice. Here, we use minimap2 with parameters set for aligning short reads to a reference genome.

minimap2 -ax sr reference.fasta reads.fastq > aligned.sam

From here on, the instruction will be the same for mixed and pure samples. Please change the file names accordingly.

  1. Convert sam format file to a bam file using samtools.

samtools view -bS aligned.sam > aligned.bam
  1. Sort your bam file.

samtools sort -o aligned_sorted.bam aligned.bam
  1. Index your bam file.

samtools index aligned_sorted.bam

9. Remove primers from the reads. The following command will remove reads with mean quality score -q less than 15 and minimum read length of 50 (after trimming). Flag -e keeps reads without primers in the output and the flag -x will make sure that reads that occur at the specified offset positions relative to primer positions will also be trimmed.

ivar trim -b primer.bed -p trimmed -i aligned_sorted.bam -q 15 -m 50 -e -x 3
  1. Sort and index the trimmed bam file.

samtools sort -o trimmed_sorted.bam trimmed.bam && samtools index trimmed_sorted.bam

11. Generate coverage depth for each single genomic location in the reference. This will be used in freyja pipeline downstream analysis.

samtools mpileup -aa -A -d 600000 -Q 20 -q 0 -B -f reference.fasta trimmed_sorted.bam | cut -f1-4 > depths.tsv
  1. Call variants using variant caller of your choice. We recommend using Lofreq or ivar and included both commands for your reference.

# call variants using lofreq
lofreq call -f reference.fasta -a 1 -b 1 -o variants.vcf trimmed_sorted.bam
# Call variants using ivar
freyja variants trimmed_sorted.bam --variants variants.tsv --depths depths.tsv --ref reference.fasta
  1. Run freyja demix to estimate lineage prevalence.

freyja demix variants.tsv depths.tsv --output freyja_demix.txt --barcodes barcodes.csv

The final demix outputs for the pure and mixed sample are as following:

Mixed sample output:

summarized      [('Other', 0.9999999968413253)]
lineages        MEASLES-D9 MEASLES-H1
abundances      0.79692605 0.20307394
resid   214.51679168207156
coverage        91.39927016484208

Pure sample output

summarized      [('Other', 0.999999999926792)]
lineages        MEASLES-H1
abundances      1.00000000
resid   53.868769540487826
coverage        89.52434881087203