Just a pipe in the pipeline: 2015

Wednesday, October 14, 2015

Krona snippet

I couldn't find any kind of conversion Kraken and Krona recently and so I wrote up a little pipeline. Full script is available at lskScripts: https://github.com/lskatz/lskScripts/blob/master/qsub/launch_kraken.sh

The main points of the script are shown below.

run $KRAKENDIR/kraken --fastq-input --paired --db=$KRAKEN_DEFAULT_DB --preload --gzip-compressed --quick --threads $NSLOTS --output $KRAKENOUT $READS

run kraken-translate --db $KRAKEN_DEFAULT_DB $KRAKENOUT | cut -f 2- | sort | uniq -c |\
  perl -lane '
              s/^ +//;   # remove leading spaces
              s/ +/\t/;  # change first set of spaces from uniq -c to a tab
              s/;/\t/g;  # change the semicolon-delimited taxonomy to tab-delimited
              print;
             ' |\
  sort -k1,1nr > $KRAKENTAXONOMY


run $KRONADIR/ktImportText -o $HTML $KRAKENTAXONOMY

Wednesday, April 15, 2015

The confusing vocabulary of public health and bioinformatics

In the very beginning of our Listeria project, we basically had an idea that we could use evolution to help in epidemiological investigations. The idea was out there in the world, and so I can't claim credit, but no one yet had the resources to sequence all Listeria monocytogenes genomes in a given area before HHS backed the project. I would love to write more about this project actually but I digress.

The point here is that we had a vocabulary problem. In the beginning in the CDC lab (I can't say what all was happening in the other agencies), we had some of the best epidemiologists and some of the best Listeria microbiology experts in the room, and also one bioinformatician (me). Thus, three different disciplines, one important project, and one huge vocabulary problem. I tried collecting all the terms that everyone had to learn in the room but that started turning into epi/micro/bioinformatics notes and became too huge.

Instead, I present here a list of confounding vocabulary terms that we all came into the room thinking we knew but instead had different meanings for different people. I also present my own suggested disambiguation for these terms.

term	Bioinformatics	Microbiology	Epidemiology	Suggested disambiguation
Cluster	A group on a dendrogram OR: a set of computer nodes that work together in an environment to facilitate parallelized analysis		A group of related illnesses	Use a prefix: phylo-cluster, epi-cluster, computer cluster
Traceback	The act of finding the last common ancestor for a group of taxa on a tree		The act of finding the common source of a group of cases	I feel like this is just a buzzword and can be replaced. Bioinformatics: state that you are finding the last common ancestor. Public health: state that you are finding the common source.
Related	When two genomes or genes are genetically similar	When two organisms are similar using some typing scheme (e.g., serogrouping, PFGE, etc)	When two cases of infected individuals are similar (e.g., by the food they ate, by their geography, etc)	If it's unclear, I prefer to use "phylo-related" or "epi-related" or something similar
Coding (via homologous)	Computer program	Where a gene gets turned into a functional unit	A way to categorize public health events such as outbreaks	Context. Use it in complete sentences!
Mapping	Assigning sequence reads to where they match on a reference genome		geographic information system	Again, context
(phonetic ambiguity) GUI or gooey	A graphical user interface	Something a little bit gross		Context is probably your friend here

Are there any other terms to bring up? Please list them here in the comments!

Sunday, April 5, 2015

Blog posts

A few blog posts that I've really liked for my niche of public health and bioinformatics. I think one purpose of this blog can be to start gathering these posts and publications. Maybe I should also start gathering tweets?

Wednesday, March 11, 2015

FASTQ must die! Long live SAM/BAM! (seconded!)

Here is a short rant but it becomes proactive.

For those of you who don't know me, I dislike having two fastq files per genome if they are paired end (PE). So I always shuffle them with my custom script run_assembly_shuffleReads.pl.

I was going through Lyve-SET, and putting in SNAP, when it broke with the following error message

The two FASTQ files are not the same size! Make sure they contain
the same reads in the same order, without comments.
DEVTEAM: Allowing run, but you need to run on only one thread because the file splitter
Doesn't understand how to deal with files that don't match byte-for-byte.
Loading index from directory... 0s. 5412686 bases, seed size 16
ConfDif MaxHits MaxDist MaxSeed ConfAd %Used %Unique %Multi %!Found %Error Reads/s
PairedFASTQReader: failed to read mate. The FASTQ files may not match properly.

I got so frustrated because the input files are indeed paired end (PE) and there is no way to disable checking for PE in SNAP as far as I know. Or if they are truly broken pairs, then please describe the actual problem set. Besides, I am already checking for PE before running SNAP oh why won't you let me disable your checking??

Anyway that got me thinking about how, even though I have a script to check for interleaved files, it is slightly buggy because it has to check for different formats of deflines: Casava1.7, Casava1.8, ..., fastq-dump -I, fastq-dump, ... , and also other random edge cases. There are too many random and even some nonstandard deflines. Why have all that? Why not have a standard reads format that can describe everything? And then I thought of the SAM format. SAM describes PE, standardizes quality scores, and so much more. Why not just convert all fastq files to unaligned SAM?

Looking online, I found this really forward-thinking rant by Peter Cock: http://blastedbio.blogspot.com/2011/10/fastq-must-die-long-live-sambam.html. He touches on a lot of things I've been thinking. That the deflines are not standardized. That the quality scores are not standardized (well, I guess they mostly are now). But then there are other small things he touches on--the free text area is too messy; the FLAG field in the SAM format can actually hold a value on whether a read passes Q/C; some or most modern tools can read SAM anyway.

So here is the proactive part of this post. And maybe I can try to make my posts in a ranty/proactivey format from now on (we'll see). How to convert a fastq file into a sam file!

Get Picard: http://broadinstitute.github.io/picard/
Convert! For example:

java -jar /apps/x86_64/picard/picard-tools-1.96/FastqToSam.jar F1=SRR1610028_1.fastq F2=SRR1610028_2.fastq SAMPLE_NAME=tmp OUTPUT=out.sam

Do samtools magic

samtools view -bS out.sam > out.bam
samtools sort out.bam out.sorted
samtools index out.sorted.bam

Ok so now out.sorted.bam contains perhaps the most perfect possible representation of your data: 1) standardized format; 2) a single file for a single run instead of forward/reverse files; 3) if you do it right, you can do one group within the sam file per run; 4) compressible and sortable according to the bam format; 5) standardized quality score offsets; and 6) probably more I'm not thinking about right now.

Thoughts?