PrePrint: SAM/BAM format v1.5 extensions for de novo assemblies

Here's a little back-story on my latest preprint (based on my email to samtools-devel), which went live on the biology preprint server bioRxiv at the end of last week:
SAM/BAM format v1.5 extensions for de novo assemblies.
Peter J. A. Cock, James K. Bonfield, Bastien Chevreux, Heng Li.
bioRxiv DOI: 10.1101/020024
The current version is a terse three pages (trying to meet an "application note" page limit), but nevertheless should clarify the intended usage of these parts of the SAM/BAM specification.


BLAST+ rejecting query files with zero sequences

This is another brief NCBI BLAST+ bug report blog post, about a regression in BLAST+ 2.2.29 which will be breaking existing pipelines around the world. The problem is a new "feature" which treats an empty query file as an error.
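One possible workaround, sketched here in Python, is for the pipeline to inspect the query file itself and skip the BLAST+ call when there is nothing to search. This is only a minimal sketch: the function names are hypothetical, and counting the lines that start with ">" is simply an easy way to count FASTA records.

```python
def count_fasta_records(filename):
    """Count FASTA records by counting the lines starting with '>'."""
    with open(filename) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def should_run_blast(query_filename):
    """Return True only if the query file holds at least one sequence."""
    return count_fasta_records(query_filename) > 0
```

When should_run_blast returns False, the calling script could write an empty output file itself rather than invoking BLAST+ at all.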


BLAST+ Christmas Wish List

Dear Santa,

Please could you ask the Elves at the NCBI to deliver the following BLAST+ feature requests for Christmas 2014?

Thank you,


P.S. Do they think I have been naughty or nice with my BLAST blog posts?


Column headers in BLAST+ tabular and CSV output

In the last couple of years, my preferred BLAST output format has switched from BLAST XML to plain tabular output. The main reason is that it is easier to parse, and it now gives easy access to more fields - BLAST+ 2.2.28 added descriptions and taxonomy output to the tabular and CSV output, and the cumulative effect is that BLAST XML has been lagging behind.

However, there is a simple change the NCBI could make to greatly improve the usability of the tabular or CSV output - label the columns with a header line! This is vital meta-data: No-one should be forced to guess-the-columns when presented with a data file. 
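To illustrate the guessing game, here is a minimal Python sketch of parsing the tabular output. Because the file carries no header line, the caller has to supply the column names and hope they match what was passed to -outfmt. The function name read_blast_tabular is hypothetical; the default list is the standard twelve columns produced by -outfmt 6 when no fields are specified.

```python
import csv

# Default column order for -outfmt 6 when no fields are specified.
DEFAULT_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def read_blast_tabular(handle, columns=DEFAULT_COLUMNS):
    """Yield one dict per hit, keyed by the caller-supplied column names."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and '#' comment lines
        if len(row) != len(columns):
            raise ValueError(
                "Expected %i columns, found %i" % (len(columns), len(row))
            )
        yield dict(zip(columns, row))
```

The column count check catches some mismatches, but not the case where the right number of fields was requested in a different order, which is exactly what a header line would solve.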


BLAST! No frequency ratios needed for composition-based statistics

While working on updating the NCBI BLAST+ wrapper for Galaxy for any changes in the new BLAST+ 2.2.30 release, I hit a cryptic error message from deltablast:

$ deltablast -query rhodopsin_proteins.fasta -subject four_human_proteins.fasta -evalue 1e-08 -outfmt "6 qseqid sseqid score" -rpsdb /data/blastdb/cdd_delta
BLAST engine error: /data/blastdb/cdd_delta contains no frequency ratios needed for composition-based statistics.
Please disable composition-based statistics when searching against /data/blastdb/ncbi/cdd/cdd_delta.

To cut a long story short, the fix is to download and unpack a newer cdd_delta.tar.gz, which now includes an extra file, cdd_delta.freq, containing the frequency ratio information that the newer deltablast tool requires.

The same applies to the rpsblast tool, although here you just get a warning rather than an error:

$ rpsblast -query four_human_proteins.fasta -db /data/blastdb/cdd_delta -evalue 1e-08 -outfmt "6 qseqid sseqid score"
Warning: /data/blastdb/cdd_delta contain(s) no freq ratios needed for composition-based statistics.
RPSBLAST will be run without composition-based statistics.
sp|Q9BS26|ERP44_HUMAN    gnl|CDD|222416    401
sp|P06213|INSR_HUMAN    gnl|CDD|238021    137
sp|P08100|OPSD_HUMAN    gnl|CDD|215646    411
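
A wrapper script could check for the frequency ratio file up front rather than waiting for the error or warning above. This is a minimal Python sketch, assuming the .freq file sits alongside the other database files and shares their prefix (for example /data/blastdb/cdd_delta.freq); the function name is hypothetical.

```python
import os


def has_frequency_ratios(db_prefix):
    """Check for a .freq file sharing the database prefix (an assumption)."""
    return os.path.isfile(db_prefix + ".freq")
```

For the database used above, this would be called as has_frequency_ratios("/data/blastdb/cdd_delta").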