Tuesday, December 26, 2017

Sampling the taxonomy database

I was a little frustrated that every time I wanted to try out my new Bio::DB::Taxonomy-based script, it would take a few minutes to run, and then I would find a bug in my script, fix it, and run it again.  I couldn't find a small example database to test my script on, so I created one.  This is the NCBI taxonomy, subset to just Firmicutes (taxid: 1239).  Enjoy!

# Filter the nodes file.
# I used a recursive function printChildren to
# print the taxonomy lines of every descendant of 1239.
perl -F'\t\|\t' -lane '
  BEGIN{
    # Print the line of each child of $parent, then recurse into that child.
    sub printChildren{
      my $parent=shift;
      return if(!$child{$parent});
      for my $child(@{$child{$parent}}){
        print $nodes{$child};
        printChildren($child);
      }
    }
  }
  # $F[0] is the taxid and $F[1] is the parent taxid.
  push(@{$child{$F[1]}},$F[0]);
  $nodes{$F[0]}.=$_;
  END{printChildren(1239);
}' < nodes.dmp > exampleNodes.dmp

# Backtrack and filter the names file.
# Index names.dmp by taxid, then print the names of every taxid kept in exampleNodes.dmp.
perl -F'\t\|\t' -Mautodie -lane '
  BEGIN{
    open(my $fh, "<", "names.dmp");
    while(<$fh>){
      my($nodeID)=split(/\t\|\t/);
      $names{$nodeID}.=$_;
    }
    close $fh;
    print STDERR "Indexed!  Searching and printing.";
  }
  chomp $names{$F[0]};
  print $names{$F[0]};
' < exampleNodes.dmp > exampleNames.dmp
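
A quick way to confirm that the two filtered files agree is to compare their taxid columns.  This is only a sketch on toy data: the file names toyNodes.dmp and toyNames.dmp, the taxids, and the names are all made up, and on real data the same pipeline would point at exampleNodes.dmp and exampleNames.dmp instead.

```shell
tmpdir=$(mktemp -d)
# Toy nodes (taxid, parent, rank) and toy names (taxid, name, unique name, name class).
printf '%s\t|\t%s\t|\t%s\t|\n' 10 1239 class 11 10 order > "$tmpdir/toyNodes.dmp"
printf '%s\t|\t%s\t|\t\t|\t%s\t|\n' \
  10 'Toy class' 'scientific name' \
  10 'Toy synonym' synonym \
  11 'Toy order' 'scientific name' > "$tmpdir/toyNames.dmp"

# Unique taxids from each file, sorted the same way so comm can compare them.
cut -f1 "$tmpdir/toyNodes.dmp" | sort -u > "$tmpdir/nodeIds.txt"
cut -f1 "$tmpdir/toyNames.dmp" | sort -u > "$tmpdir/nameIds.txt"

# Taxids present in only one of the two files; empty when they agree.
mismatched=$(comm -3 "$tmpdir/nodeIds.txt" "$tmpdir/nameIds.txt")
echo "mismatched taxids: ${mismatched:-none}"
rm -rf "$tmpdir"
```

An empty result means every node has at least one name line and no name line refers to a taxid outside the subset.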

Show that the taxonomy has been filtered down roughly 20-fold, a bit more than one order of magnitude:
$ wc -l *.dmp
   107700 exampleNames.dmp
    79466 exampleNodes.dmp
  2401017 names.dmp
  1614627 nodes.dmp
  4202810 total



Then, loading the database in BioPerl:
use Bio::DB::Taxonomy;
my $db = Bio::DB::Taxonomy->new(-source=>"flatfile", -nodesfile=>"exampleNodes.dmp", -namesfile=>"exampleNames.dmp");