Thursday, October 10, 2019

Minimizer perl module

I looked all over CPAN but did not find a module that makes minimizers, so I went ahead and wrote one: Bio::Minimizer.

A k-mer is a substring of DNA with length k, and a minimizer is the lexicographically smallest substring of length l (an l-mer) within that k-mer.  Minimizers were described in a 2004 paper: https://academic.oup.com/bioinformatics/article/20/18/3363/202143

They seem simple, but they have some interesting properties.  The properties below are quoted verbatim from the 2004 paper; in the paper's notation, w is equivalent to l here.  Not all of the paper's properties are discussed in this post.  My comments follow each quote.
  1. If two strings have a significant exact match, then at least one of the minimizers chosen from one will also be chosen from the other.  In other words, if two strings of DNA share a long enough exact match, they are guaranteed to share at least one minimizer.
  2. If two strings have a substring of length w+k-1 in common, then they have a (w,k)-minimizer in common.  This property takes the previous one and makes the required match length precise.
  3. A new minimizer occurs about once every half-window width.  If you are thinking about breaking a genome down into a sliding window of k-mers, consider this: with minimizers, the genome is indexed nearly as well with far fewer entries.
  4. A minimizer represents a set of multiple k-mers.  So if something in genome A might have a match in genome B, one fast approach is to find all minimizers in both genomes, locate positions with identical minimizers (which can represent exact and even some inexact matches), and extend alignments from there.
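The definition above can be sketched in a few lines.  This is my own illustration in Python for brevity, not the Bio::Minimizer API; the function names are mine:

```python
def minimizer(kmer, l):
    """Return the lexicographically smallest l-mer within a k-mer."""
    return min(kmer[i:i+l] for i in range(len(kmer) - l + 1))

def minimizers(seq, k, l):
    """The minimizer of every k-mer along a sequence."""
    return [minimizer(seq[i:i+k], l) for i in range(len(seq) - k + 1)]

print(minimizer("ACGTACGT", 3))     # ACG
print(minimizers("GATTACA", 5, 3))  # ['ATT', 'ATT', 'ACA']
```

Note how consecutive k-mers often share the same minimizer (property 3 above): the three 5-mers of GATTACA produce only two distinct minimizers.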
Anyway, please have fun with the module and if it gets more useful over time, I will develop it more.

https://metacpan.org/pod/Bio::Minimizer
https://github.com/lskatz/bio--minimizer



Monday, September 30, 2019

Cluster detection using Mash

I have been racking my brain over what a command-line workflow for rapid cluster detection should look like.  That is, how can I rapidly figure out whether two random genomes on NCBI are related?

My workflow depends largely on the min-hash algorithm as implemented by Mash, and on the assemblies that NCBI generates behind the scenes.  At the end, you get one large mash file consisting of sketches for all genomes.

1) Write down the bioprojects to monitor.  I recorded all bioprojects listed on the PDP contributors page. https://www.ncbi.nlm.nih.gov/pathogens/contributors/

2) Save a two-column spreadsheet of SRR run accessions and SRS sample accessions for all genomes in these bioprojects.  I chose to do that with EDirect, using this kind of query and piping the results to SRR.tsv:

esearch -db bioproject -query $bioproject | \
      elink -target sra | \
      esummary | \
      xtract -pattern DocumentSummary -element Run@acc -element Sample@acc > SRR.tsv


3) Optionally, pull the biosample attributes (e.g., the strain name) so that you can get out of NCBI identifiers and back to names that have meaning to you.  I did that with this kind of code:

esearch -db sra -query $SRR | \
      elink -target biosample | \
      esummary | \
      xtract -pattern DocumentSummary -group Attributes -element Attribute@attribute_name Attribute

4a) Download the assemblies using this trick https://github.com/ncbi/SKESA/issues/12#issuecomment-431503915.  Here is my perl subroutine that mimics the shell code.

sub downloadSkesaAsm{
  # Assumes: use File::Copy qw(mv); and a logmsg() helper that prints to STDERR.
  my($SRR, $outfile) = @_;

  # Ask srapath where the realigned (assembly) object for this run lives.
  my $URL="";
  open(my $fh, "srapath -f names -r $SRR.realign | ") or die "ERROR: could not run srapath -f names -r $SRR.realign: $!";
  while(<$fh>){
    chomp;
    # The URL is the eighth pipe-delimited field of the srapath output.
    my $eighthField = (split(/\|/,$_))[7] || "";
    if($eighthField =~ m@(/traces/|^http)@ ){
      $URL = $eighthField;
    }
  }
  close $fh;
  if(!$URL){
    # print STDERR "ERROR: could not find URL for $SRR\n";
    return "";
  }

  system("dump-ref-fasta $URL > $outfile.tmp");
  if($?){
    # $? holds the child exit status; $! is not meaningful after system().
    logmsg "ERROR with dump-ref-fasta $URL ($SRR): exit code ".($? >> 8);
    return "";
  }

  mv("$outfile.tmp", $outfile);

  return $outfile;
}

4b) If the assembly does not exist, download the fastq file instead.  For consistency, and to get an assembly, run Skesa on it before moving on to the next step.

5) Mash sketch the assembly and optionally remove the assembly.

6) Mash paste on all the sketches to produce one mega-mash file.  Optionally, remove the old mash sketches.  Personally, I keep these sketches because it helps reduce redundancy on future runs.  However, there are better ways to keep records.

7) Cluster detection!  Run mash dist on the mega file vs itself and record matches under a certain threshold.
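Step 7 boils down to thresholding pairwise distances and grouping genomes by single linkage.  Here is a toy Python sketch of that clustering step; the function name, the threshold, and the tab-separated input lines (shaped like mash dist output: reference, query, distance, p-value, shared hashes) are all made up for illustration:

```python
from collections import defaultdict

def clusters_from_dist(lines, threshold=0.05):
    """Single-linkage clusters from mash-dist-style lines, via union-find."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)
    for line in lines:
        ref, query, dist = line.split("\t")[:3]
        find(ref); find(query)             # register both genomes
        if float(dist) <= threshold:
            union(ref, query)              # same cluster if under the cutoff
    groups = defaultdict(set)
    for genome in parent:
        groups[find(genome)].add(genome)
    return sorted(map(sorted, groups.values()))

# Hypothetical mash dist output
lines = [
    "A.msh\tB.msh\t0.01\t0\t900/1000",
    "A.msh\tC.msh\t0.20\t0\t100/1000",
    "B.msh\tC.msh\t0.22\t0\t90/1000",
]
print(clusters_from_dist(lines))  # [['A.msh', 'B.msh'], ['C.msh']]
```

With the default cutoff, A and B cluster together and C stands alone; loosening the threshold merges everything into one cluster.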



Wednesday, April 24, 2019

Fixing the R package manager

R's package manager is broken.  I present to you my solution for it.

$HOME is your home directory.

1) mkdir -pv $HOME/R/tmp

2) Edit $HOME/.Renviron and add these contents to ensure that your temporary directory is local, universally found in your environment, and not bound by any tempdir restrictions (e.g., noexec).  The tradeoff is that your home directory might not be the fastest drive (but that's not why you're using R anyway is it?).

TMP=$HOME/R/tmp
TMPDIR=$HOME/R/tmp
TEMP=$HOME/R/tmp


3) Edit $HOME/.Rprofile and add these contents so that each set of libraries is specific to the version of R.

major <- R.Version()$major
minor <- R.Version()$minor

majorDir <- file.path(Sys.getenv("HOME"), "R", major)
minorDir <- file.path(Sys.getenv("HOME"), "R", major, minor)
if(!file.exists(majorDir)){
  dir.create(majorDir)
}
if(!file.exists(minorDir)){
  dir.create(minorDir)
}


.libPaths(c(minorDir, .libPaths()))



Wednesday, March 6, 2019

Mash perl module

Hey y'all, I made a new perl module to read and write Mash files.  I made it a complete package, adding it to CPAN with documentation and adding unit testing with continuous integration on Travis-CI.  Most of the back end relies on the binary, but I re-implemented the mash distance formula in Perl.

Why?  I develop Mashtree, and parsing Mash files was becoming increasingly difficult; it became evident that I needed some object-oriented programming.  Because it might be helpful to the community, I made it a proper perl module.

One difficult part of the process was not being able to read the Cap'n Proto formatted mash files, because no Perl parser was available for that format.  Therefore I looked for a different way to serialize mash sketches.  Ultimately, because mash info displays JSON, I went with JSON and gzipped JSON.
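For reference, the Mash distance formula is D = -(1/k) * ln(2j / (1 + j)), where j is the Jaccard index estimated from the sketches.  Here is a sketch of it in Python; this is my own illustration, not the module's code, and the function name is mine:

```python
import math

def mash_distance(shared_hashes, sketch_size, k):
    """Mash distance: D = -(1/k) * ln(2j / (1 + j)),
    where j is the Jaccard index estimated from the sketches."""
    j = shared_hashes / sketch_size
    if j == 0:
        return 1.0  # nothing shared: report the maximum distance
    return -math.log(2.0 * j / (1.0 + j)) / k

# e.g., 950 of 1000 hashes shared with k=21: a small distance
print(mash_distance(950, 1000, 21))
```

Identical sketches (j = 1) give a distance of 0, and the distance grows as fewer hashes are shared.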

Here is a quick synopsis:

use strict;
use warnings;
use Mash;

# Sketch all fastq files into one mash file.
system("mash sketch *.fastq.gz > all.msh");
die if $?;

# Read the mash file.
my $msh = Mash->new("all.msh");

# All-vs-all distances
my $distHash = $msh->dist($msh);

# Or, as a one-liner:
# perl -MMash -e '$msh1 = Mash->new("1.msh"); $msh2 = Mash->new("2.msh"); $dist = $msh1->dist($msh2);'

I ran a preliminary all-vs-all test of 303 genomes.  To make the file, I used Mash v2.  To run all-vs-all distances, I ran mash dist vs $msh->dist().  In this test, the binary was 15x faster (1.9s vs 28s).  One more benchmark shows that the resulting json.gz files are approximately the same size as the original msh files.
Let me know what you think!  As of the time of this post, the version is 0.5.
https://metacpan.org/pod/Mash
https://github.com/lskatz/perl-mash
If you would like to contribute, my wish list includes
  • Speed optimization
  • Mash paste
  • Native reading of mash files
  • Native writing of mash files