2014-02-27

BLAST XML output needs more love from NCBI

For some time I had thought that the best option for computer parsing of BLAST+ output was BLAST XML. It had all the key bits of information, and XML is designed for automated parsing. However, with the extra fields added to the tabular and comma-separated output in BLAST+ 2.2.28, such as the long-overdue hit descriptions and the taxonomy fields, I think those two formats are now preferable. BLAST XML is now lagging behind!

BLAST tabular output


The greatly expanded set of columns available to the tabular (and comma-separated) output is the motivation behind adding a pick-your-own columns option to the Galaxy BLAST+ wrappers (which already use the new match description column by default):

Planned Galaxy interface for picking BLAST+ columns.
With 44 output fields to choose from, this is a bit overwhelming!

That screenshot shows the proposed column selection for the Galaxy BLAST+ wrappers (Update - now available on the Galaxy Tool Shed), which internally works via the -outfmt 6 command line switch. I think the new taxonomy fields in the BLAST output will be especially popular - for example I know the Blaxter group was able to use this new feature to simplify the code for Blobology (which maps assembly contigs to taxonomic groups).
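For anyone calling BLAST+ directly rather than via Galaxy, here is a minimal Python sketch of the same idea: build the -outfmt 6 argument from a list of column names, then read the resulting tab-separated lines back into dictionaries. The column names are the ones I understand `blastn -help` to list; the particular selection, the function names, and the query/database names are illustrative assumptions only, not what the Galaxy wrapper does internally.

```python
import shlex

# Column names as printed by `blastn -help` for tabular output.
# This particular selection is only an example (an assumption).
COLUMNS = ["qseqid", "sseqid", "pident", "evalue", "bitscore",
           "staxids", "sscinames", "salltitles"]


def outfmt_argument(columns):
    """Return the single string to pass after -outfmt for custom tabular output."""
    return "6 " + " ".join(columns)


def blast_command(query, database, columns):
    """Build an argument list for blastn; query and database are placeholders."""
    return ["blastn", "-query", query, "-db", database,
            "-outfmt", outfmt_argument(columns)]


def parse_tabular(lines, columns):
    """Yield one dict per hit line, keyed by the column names requested."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and any comment lines
        yield dict(zip(columns, line.split("\t")))


if __name__ == "__main__":
    # Hypothetical file and database names, shown only to print the command.
    print(shlex.join(blast_command("query.fasta", "nt", COLUMNS)))
```

Keeping the column list in one place means the parser and the command line cannot drift apart, which is the main thing that goes wrong with hand-written -outfmt strings.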

Notice these BLAST+ output fields explicitly handle multiple IDs/titles/species for a single match as used in the NCBI Non-Redundant (NR) database, where identical sequences from different organisms are collapsed into one sequence record (removing redundancy).
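As a rough sketch of what that looks like to a parser: as far as I can tell from the `-help` text, the multi-valued columns pack their values into one field, using ";" for sallseqid, staxids, sscinames and scomnames, and "<>" for salltitles. The helper below splits them back out. The separator table reflects my reading of the help text, and the sample record is invented for illustration.

```python
# Separators for multi-valued tabular columns, per my reading of the
# BLAST+ -help text (treat this table as an assumption and check your version).
SEPARATORS = {
    "sallseqid": ";",
    "staxids": ";",
    "sscinames": ";",
    "scomnames": ";",
    "salltitles": "<>",
}


def split_multi_valued(record):
    """Return a copy of a hit record with multi-valued fields turned into lists."""
    result = dict(record)
    for name, separator in SEPARATORS.items():
        if name in result:
            result[name] = result[name].split(separator)
    return result


if __name__ == "__main__":
    # Invented example of one collapsed NR-style hit covering two organisms.
    hit = {
        "qseqid": "query1",
        "sallseqid": "gi|111|ref|XP_000001.1|;gi|222|gb|AAA00002.1|",
        "staxids": "1111;2222",
        "salltitles": "protein A [Species one]<>protein B [Species two]",
    }
    print(split_multi_valued(hit))
```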

BLAST XML output


So well done to the NCBI for expanding the capabilities of BLAST+'s tabular output :)

However, BLAST's XML output needs some love to maintain parity and remain useful:
  • Include the taxonomy fields, defining them as optional in the XML DTD for backward compatibility.
  • Hide the internal identifiers like gnl|BL_ORD_ID|1, a bug fixed in the tabular output back in BLAST+ 2.2.23 (Feb 2010).
  • Properly handle secondary identifiers (aliases) as used in the NR database, rather than putting the primary identifier in <Hit_id> and hiding the rest only within <Hit_def> (see this post for details).
Now if only the NCBI ran BLAST+ as an open project, I would log some enhancement requests on their issue tracker. But they don't, so it's blog post time! ;)
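To illustrate the secondary identifier point in the list above, here is the kind of workaround a parser currently needs. It assumes the convention I have seen in NR output, where <Hit_def> holds the first title followed by each further "identifier title" pair introduced by " >". The XML fragment and its identifiers are made up for this sketch, and the function name is my own.

```python
import xml.etree.ElementTree as ET

# Invented <Hit> fragment in the style of the original BLAST XML format.
SAMPLE_HIT = """<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gi|111|ref|XP_000001.1|</Hit_id>
  <Hit_def>protein A [Species one] &gt;gi|222|gb|AAA00002.1| protein B [Species two]</Hit_def>
  <Hit_accession>XP_000001</Hit_accession>
</Hit>"""


def hit_identifiers(hit):
    """Return (identifier, title) pairs for a <Hit>, primary entry first.

    Assumes secondary entries are appended to <Hit_def> as
    ' >identifier title', which is a convention rather than a documented API.
    """
    parts = hit.findtext("Hit_def", default="").split(" >")
    entries = [(hit.findtext("Hit_id"), parts[0])]
    for extra in parts[1:]:
        identifier, _, title = extra.partition(" ")
        entries.append((identifier, title))
    return entries


if __name__ == "__main__":
    for identifier, title in hit_identifiers(ET.fromstring(SAMPLE_HIT)):
        print(identifier, title, sep="\t")
```

Having to split free text like this is exactly the sort of thing a structured format should make unnecessary.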

Update (17 March 2014)


Apparently the NCBI team are planning some updates to the BLAST XML output, which I heard about via Sean Davis on Twitter:

Sean Davis (@seandavis12):
Proposed BLAST XML changes with embedded link for comment: ftp://ftp.ncbi.nlm.nih.gov/blast/documents/NEWXML/ProposedBLASTXMLChanges.pdf

The PDF talks about including the taxonomy information and sorting out multiple identifiers (listed above), plus other issues like the current abuse of the <Iteration> tag originally just for PSI-BLAST. It doesn't address the BL_ORD_ID issue yet, but they are asking for feedback...

Update (18 March 2014)


The NCBI have now posted this on their official Twitter account, and to the blast-announce list:

NCBI Staff (@NCBI):
The BLAST dev team needs your help! Suggestions, comments, etc. needed on proposed XML changes http://www.ncbi.nlm.nih.gov/news/03-17-2014-blast-xml-feedback/ #bioinformatics

Update (5 May 2015)


The NCBI have now released details of the new BLAST XML format (PDF) to the blast-announce list.

Update (June 2015)


The NCBI have released BLAST+ 2.2.31 which offers this new BLAST XML output.

3 comments:

  1. Great that the new XML2 format includes taxonomy info, especially taxid... But when can we expect a command-line executable with this feature? (thanks for your helpful blog!)

    1. Your guess is as good as mine.

      I'm more worried about the loss of a single XML output file for multiple-query searches (an XInclude file plus one XML file per query is going to be a pain for output to stdout etc).

    2. FYI, looks like you can now generate the new XML2 format in a BLAST executable: http://www.ncbi.nlm.nih.gov/books/NBK131777/
