Tuesday, December 26, 2017

Sampling the taxonomy database

I was a little frustrated that every time I wanted to try out my new Bio::DB::Taxonomy-based script, it would take a few minutes to run.... and then I would find the bug in my script, fix it, and run it again.  I couldn't find a small example database to run my script on, and so I created one.  This is the NCBI taxonomy, spliced for just Firmicutes (taxid: 1239)  Enjoy!

# Filter the nodes file.
# I used a recursive function printChildren to
# print the taxonomy lines.
perl -F'\t\|\t' -MData::Dumper -lane '
  BEGIN{
    sub printChildren{
      my $parent=shift;
      return if(!$child{$parent});
      for my $child(values($child{$parent})){
        print $nodes{$child};
        printChildren($child);
      }
    }
  }
  push(@{$child{$F[1]}},$F[0]);
  $nodes{$F[0]}.=$_;
  END{printChildren(1239);
}' < nodes.dmp > exampleNodes.dmp

# Backtrack and filter the names file.
perl -F'\t\|\t' -Mautodie -lane '
  BEGIN{
    open($fh, "names.dmp");
    while(<$fh>){
      my($nodeID)=split(/\t\|\t/);
      $names{$nodeID}.=$_;
    }
    close $fh;
    print STDERR "Indexed!  Searching and printing.";
  }
  chomp $names{$F[0]};
  print $names{$F[0]};
' < exampleNodes.dmp > exampleNames.dmp

Show that the taxonomy has been significantly filtered to one or two orders of magnitude.
$ wc -l *.dmp
   107700 exampleNames.dmp
    79466 exampleNodes.dmp
  2401017 names.dmp
  1614627 nodes.dmp
  4202810 total



Then, loading the database in BioPerl:
use Bio::DB::Taxonomy;
$db = Bio::DB::Taxonomy->new(-source=>"flatfile", -nodesfile=>"exampleNodes.dmp", -namesfile=>"exampleNames.dmp");



Friday, June 23, 2017

Custom Kraken database

Like many labs around the world, we use Kraken for contamination detection.  This isn't its intended purpose however because it is supposed to be a taxonomic classifier for metagenomics samples.

Despite that, it has been incredibly useful.  We just want to make it more useful.  Many weird species, especially those without genome assemblies on NCBI, show up really weird in Kraken.  So when look for contamination in a weird species, we get random hits from random species or even a majority of unclassified reads.  On the other hand, if we run Kraken on a standard Escherichia coli genome, we get near-perfect results.

Working with our subject matter experts (SMEs), I was able to piece together a preliminary set of completed genome assemblies where the risk of their being contaminated was virtually none.  Then, I tried making a custom Kraken database.  The instructions are straightforward... until they're not.  There were some unsaid things such as the need to erase the description on sequence headers.  Below is my workflow for creaking a totally new Kraken database and not simply adding onto a pre-existing one.

# First, make a directory of fasta files 
# whose extensions are .fasta. The specific
# extension doesn't matter, but it helps with
# the find command later.

# Display the assemblies directory with tree.
# This is a for-example, not necessarily what I have.
# Each subfolder should be formatted Genus_species.
$ tree genomeAssemblies/* 
Escherichia_coli
├──Escherichia_coli_44.fasta
└──Escherichia_coli_45.fasta
Salmonella_bongori
├── NCTC-12419.fasta
└── N268-08.fasta
Salmonella_enterica
└──
Salmonella_enterica_90.fasta

# Download the taxonomy and initialize the Kraken
# database.
$ kraken-build --download-taxonomy --db custom-kraken

# Alter the Shigella species so that they are subspecies
# to E. coli
cd custom-kraken/taxonomy
mv nodes.dmp nodes.bak
# Anything with Genus Shigella (taxid: 620), 
# change to parent E. coli (taxid: 562)
perl -Mautodie -e 'open($fh, "nodes.bak"); while(<$fh>){@F=split /\t\|\t/; if($F[1]==620){ $F[1]=562; $_=join("\t|\t", @F); } print; } close $fh;' > nodes.dmp
cd -

# Next format the fasta files using a custom script
# which can be found in my misc github repo
# https://github.com/lskatz/lskScripts/blob/master/formatFastaForKraken.pl 
# This script adds a kraken tag and removes the 
# fasta description. Modifications are in-place.
$ ls genomeAssemblies | xargs -P 12 -n 1 bash -c '
  genusSpecies=$(sed "s/_/ /g" <<< $0); 
  taxid=$(grep "|"$'\t'"$genusSpecies"$'\t'"|" custom-kraken/taxonomy/names.dmp | cut -f 1 | head -n 1)
  formatFastaForKraken.pl --taxid=$taxid $0/*.fasta
'

# This automation with find/xargs only works down to 
# the species level. For any deeper diving, you can 
# use the custom script to replace the taxid with 
# something more specific.
$ formatFastaForKraken.pl --taxid=41520 genomeAssemblies/Salmonella_enterica/Salmonella_enterica_90.fasta

# Add genome assemblies to the custom database
find . -name '*.fasta' | \
  xargs -P 12 -n 1 kraken-build --db custom-kraken --add-to-library

# Build the database
kraken-build --build --threads 12 custom-kraken