Parsing .2bit DNA files

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All,
Last night on #perl IRC (freenode), someone asked for help parsing .2bit files. The usual responses followed:

Have you searched CPAN?
Have you checked BioPerl?
Can't you just use unpack?

The response was that CPAN hadn't turned up anything, they weren't interested in the monolith that is BioPerl, and they were not that comfortable with unpack. I glanced at the spec and said that I would be happy to provide a full working solution as soon as my daughter went to bed. This turned out to be more challenging than I expected due to lack of clarity in the spec.

The issue is in the section covering sequence records. One field indicates the number of "N blocks" in the sequence, the next field is the "starting positions of the N blocks" and the next field is the "lengths of each N block". Right, but how is a single 32-bit integer supposed to describe multiple items? The solution is that, assuming the number of N blocks is X, the next X fields are the starting positions and the X fields after that are the lengths. If X is 0, both of those arrays are omitted entirely.
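
To make that layout concrete, here is a minimal sketch that decodes one such table (count, then X starts, then X lengths) from an in-memory buffer. The helper name `parse_block_table` is made up for illustration, and the `V` template assumes little-endian fields; a real reader would pick the byte order from the file signature.

```perl
use strict;
use warnings;

# Hypothetical helper: decode one block table (count, starts, lengths)
# from the front of $buf. 'V' assumes little-endian 32-bit fields.
sub parse_block_table {
    my ($buf) = @_;

    my ($count) = unpack('V', $buf);

    # Skip the 4-byte count, then read $count starts and $count lengths.
    my @fields  = $count ? unpack("x4 V$count V$count", $buf) : ();
    my @starts  = @fields[0 .. $count - 1];
    my @lengths = @fields[$count .. 2 * $count - 1];

    # Bytes consumed: the count field plus 2 * $count further fields.
    my $used = 4 * (1 + 2 * $count);
    return (\@starts, \@lengths, $used);
}

# Two N blocks: 3 bases at position 10 and 5 bases at position 100.
my $sample = pack('V V2 V2', 2, 10, 100, 3, 5);
my ($starts, $lengths, $used) = parse_block_table($sample);
print "start $starts->[$_], length $lengths->[$_]\n" for 0 .. $#$starts;
```

With a count of 0 the helper returns two empty lists and reports that only the 4-byte count field was consumed.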

Fortunately, the requester was able to provide some sample files and I put together the following code:

#!/usr/bin/perl
use constant            ONE_BYTE => 1;
use constant           FOUR_BYTE => 4;
use constant       BITS_PER_BYTE => 8;
use constant BASES_PER_FOUR_BYTE => 16;
use strict;
use warnings;

my $file = $ARGV[0] or die "Usage: $0 <input>";
open(my $fh, '<', $file) or die "Unable to open '$file' for reading: $!";

my $header = parse_header($fh);
my $count  = $header->{CNT};

my %toc;
populate_toc($fh, $count, \%toc);

for my $name (keys %toc) {
    my $offset = $toc{$name};
    my $dna = fetch_record($fh, $offset);
    #print length($dna), "\n";
    print "$dna\n";
}

sub parse_header {
    my ($fh) = @_;

    # Read header
    my $raw = '';
    sysread($fh, $raw, FOUR_BYTE * 4);

    # Parse header
    my ($sig, $ver, $cnt, $reserved) = unpack('l4', $raw);

    # TODO: validate (signature, reverse byte order, version)
    return {SIG => $sig, VER => $ver, CNT => $cnt, RSV => $reserved};
}

sub populate_toc {
    my ($fh, $count, $toc) = @_;

    my ($raw, $size, $name) = ('', '', '');
    for (1 .. $count) {

        # Read size of record name
        sysread($fh, $raw, ONE_BYTE);
        $size = unpack('C', $raw);

        # Read name of record
        sysread($fh, $name, $size);

        # Read and store offset
        sysread($fh, $raw, FOUR_BYTE);
        $toc->{$name} = unpack('l', $raw);
    }
}

sub fetch_record {
    my ($fh, $offset) = @_;

    my ($raw, $dna, $size, $cnt) = ('', '', '', '');
    my (@start, @len, %nblock, %mblock);

    # Seek to the record location
    sysseek($fh, $offset, 0);

    # Establish the conversion table
    my %conv = ('00' => 'T', '01' => 'C', '10' => 'A', '11' => 'G');

    # Fetch the DNA size
    sysread($fh, $raw, FOUR_BYTE);
    $size = unpack('l', $raw);

    # Handle the n block and m blocks
    for my $block (\%nblock, \%mblock) {

        # Fetch the block count (N blocks on the first pass, M blocks on the second)
        sysread($fh, $raw, FOUR_BYTE);
        $cnt = unpack('l', $raw);

        if ($cnt) {
            sysread($fh, $raw, FOUR_BYTE * $cnt);
            @start = unpack("l$cnt", $raw);
            sysread($fh, $raw, FOUR_BYTE * $cnt);
            @len = unpack("l$cnt", $raw);
            @{$block}{@start} = @len;
        }
    }

    # throw away reserved field
    sysread($fh, $raw, FOUR_BYTE);

    # Fetch DNA - TODO:  read in configurable size chunks
    my $bytes = ((int($size / BASES_PER_FOUR_BYTE)) + 1) * FOUR_BYTE;
    sysread($fh, $raw, $bytes);
    $dna = join '', 
           map $conv{$_},
           unpack('(A2)*', unpack("B" . $bytes * BITS_PER_BYTE , $raw));

    # Fix N blocks
    substr($dna, $_, $nblock{$_}, 'N' x $nblock{$_}) for keys %nblock;

    # Fix M blocks
    substr($dna, $_, $mblock{$_}, lc(substr($dna, $_, $mblock{$_}))) for keys %mblock;

    return substr($dna, 0, $size);
}
__END__
See http://genome.ucsc.edu/FAQ/FAQformat#format7
See also human genome as .2bit
ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/hg18.2bit

TODO
1.  Documentation, error handling, and tests
2.  Profiling and benchmarking
3.  Handle big/little endian
4.  Dynamically determine if record count justifies SQLite versus in-memory hash
5.  Allow the reading of DNA sequence in configurable size chunks
6.  Add an API to treat the sequence like an iterator

This leaves me with the following questions:

Is there an existing non-bundled module that does this?
Given that speed is a requirement, how would you improve it?
Is this worth polishing to put on CPAN?
If the goal of .2bit files is maximal compression, do you have any idea why N blocks aren't inserted rather than replaced?

Update: Added a few todo items in case this does make it to CPAN

Cheers - L~R


Replies are listed 'Best First'.
Re: Parsing .2bit DNA files by pc88mxer (Vicar) on Mar 05, 2008 at 21:10 UTC
I'll propose a theory on why the N's are not inserted. Basically, it simplifies I/O. If these files are large, they may find it more efficient to read the entire file into memory with one (or a few) I/O operations than to read it in small segments. Even if you have a scatter/gather I/O capability, generally you are limited to block (or at least byte) boundaries. Reading the whole file in and inserting bits in the middle afterwards is obviously going to be too inefficient for large files.

Also, it simplifies random access of the base stream directly from the file. You don't have to keep track of where you really are based on all the previous insertions.

The idea that `.2bit` files are more compressed is probably relative to the earlier `.nib` format, which uses 4 bits per base. Clearly the frequency of N and M bases is such that the `2bit` format is much more compact.

I think that a virtual API to a `2bit` file would be useful. So you could do something like:

`my $twobit = new TwoBitFile('some/path.2bit');`
`# returns 1000 bases starting from base 40,000,000:`
`my $bases = $twobit->range(40_000_000, 1000);`

and minimal disk I/O is performed.

Update: Being able to treat the 2bit file as a virtual string with the ability to run regular expressions on it would also be really cool. Is that something we can do in Perl6?
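
The arithmetic behind such a `range` call can be sketched as follows. This is only an illustration under stated assumptions: `packed_span` and `range` are hypothetical names, the packed DNA is held in a string rather than read with seek and read, and N blocks and M blocks are ignored.

```perl
use strict;
use warnings;

my %CONV = ('00' => 'T', '01' => 'C', '10' => 'A', '11' => 'G');

# Hypothetical helper: which packed bytes cover $len bases from $start?
# Returns (first byte, number of bytes, bases to skip in the first byte).
sub packed_span {
    my ($start, $len) = @_;
    my $first = int($start / 4);
    my $last  = int(($start + $len - 1) / 4);
    return ($first, $last - $first + 1, $start % 4);
}

# Decode a base range from a packed string (stand-in for seek + read).
sub range {
    my ($packed, $start, $len) = @_;
    my ($first, $nbytes, $skip) = packed_span($start, $len);
    my $bits = unpack 'B*', substr($packed, $first, $nbytes);
    my $dna  = join '', map { $CONV{$_} } unpack '(A2)*', $bits;
    return substr($dna, $skip, $len);
}

my $packed = pack 'B*', '0001101100011011';   # TCAGTCAG
print range($packed, 2, 4), "\n";             # bases 2..5
```

For the example in the post, 1000 bases starting at base 40,000,000 need only 250 bytes of packed DNA, which is what keeps the disk I/O minimal.
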
Re^2: Parsing .2bit DNA files by Limbic~Region (Chancellor) on Mar 05, 2008 at 23:23 UTC
pc88mxer,

With regards to the N blocks, your hypothesis sounds valid.

With regards to the virtual API, I was actually thinking of turning it into an iterator. Calling the iterator without any parameters would fetch the next X bytes where X is some default. You could then allow for named parameters to fetch forward by Y, rewind, select a range, etc. It should have gone into the TODO but wasn't really necessary.

With regards to Perl 6 and applying regular expressions to streams: the answer is yes, unless things have changed since I last looked.

Cheers - L~R
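
One way such an iterator could look, as a rough sketch only: `make_iterator` and its named parameters (`size`, `chunk`, `fetch`, `rewind`, `start`, `len`) are invented here for illustration, and a plain string stands in for the decoded sequence.

```perl
use strict;
use warnings;

# Hypothetical iterator factory built around a fetch callback.
sub make_iterator {
    my (%opt) = @_;
    my $fetch = $opt{fetch};            # code ref: ($start, $len) -> string
    my $total = $opt{size};             # total number of bases
    my $chunk = $opt{chunk} || 1000;    # default bases per call
    my $pos   = 0;

    return sub {
        my (%arg) = @_;
        $pos = 0           if $arg{rewind};
        $pos = $arg{start} if defined $arg{start};
        my $len = $arg{len} || $chunk;
        $len = $total - $pos if $pos + $len > $total;
        return if $len <= 0;            # exhausted
        my $dna = $fetch->($pos, $len);
        $pos += $len;
        return $dna;
    };
}

my $seq  = 'TCAGTCAGNN';
my $next = make_iterator(
    size  => length($seq),
    chunk => 4,
    fetch => sub { substr $seq, $_[0], $_[1] },
);
while (defined(my $dna = $next->())) { print "$dna\n" }
```

Calling `$next->(rewind => 1)` starts over, and `$next->(start => 6, len => 10)` selects a range that is clipped to the end of the sequence.
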
Re: Parsing .2bit DNA files by blokhead (Monsignor) on Mar 06, 2008 at 05:49 UTC
One thing that strikes me as odd/inefficient is that you are explicitly converting to strings of ASCII '0' and '1' to then convert to A, C, G, T. It seems like it would be more direct to convert a byte (4 DNA bases) at a time. Here is a cute way to do that, at the expense of having a lookup table for all 256 values:

`my @CONV = glob( "{T,C,A,G}" x 4 );`
`my $dna = join "", @CONV[ unpack "C*", $raw ];`

On my system, this gives the same output as yours. I don't know if it's better, but it is shorter, and it can conveniently use an array instead of a hash. You could also experiment with different tradeoffs on lookup table sizes:

`## takes 16 bits (= 8 bases = unsigned short) at a time`
`my @CONV = glob( "{T,C,A,G}" x 8 );`
`my $dna = join "", @CONV[ unpack "S*", $raw ];`

For some reason, I had byte-order issues doing this. Of course, you must also be careful that $raw is padded to a multiple of 16 bits!

Another cute trick I can think of is that you can do some bit-twiddling to implement the M-blocks (apparently lowercasing a range of characters). In ASCII, you can toggle the case of an alphabetic character by bitwise-XOR'ing it with the space character. So I think you can rewrite:

`substr($dna, $_, $mblock{$_}, lc(substr($dna, $_, $mblock{$_})))`

as

`substr($dna, $_, $mblock{$_}) ^= (" " x $mblock{$_});`

Alternatively, you could use %mblock to generate a long mask of chr(0)'s and chr(32)'s that you can XOR with the entire $dna. Again, probably not a big deal but certainly higher cute-value.

Of course you could always fix M,N blocks on-the-fly, as you are unpacking them from $raw, but that would require some more work. Since I'm typing one-handed these days and it takes me forever, I think I will pass on playing with some code that does that! ;)

blokhead
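
The whole-sequence mask mentioned at the end might be sketched like this. The helper name `apply_mask_blocks` is hypothetical, and the sketch assumes every block lies inside the sequence and that the input is still all uppercase (XOR toggles case, it does not force lowercase).

```perl
use strict;
use warnings;

# Hypothetical helper: lowercase the ranges in %$mblock (start => length)
# by XOR-ing one mask string against the whole sequence.
sub apply_mask_blocks {
    my ($dna, $mblock) = @_;

    # chr(0) leaves a character alone; chr(32) flips its ASCII case bit.
    my $mask = "\0" x length $dna;
    for my $start (keys %$mblock) {
        substr($mask, $start, $mblock->{$start}) = ' ' x $mblock->{$start};
    }
    return $dna ^ $mask;    # string XOR, since both operands are strings
}

print apply_mask_blocks('TCAGTCAGTCAG', { 2 => 3, 8 => 2 }), "\n";
```

Because the mask is filled by assignment rather than by XOR, overlapping blocks in `%$mblock` would not flip a base back to uppercase.
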
Re^2: Parsing .2bit DNA files by bart (Canon) on Mar 06, 2008 at 11:48 UTC
Your idea of looking up the meanings of the sequences byte per byte is brilliant. I do have some doubts about using glob for it... but it even appears to do the right thing on ActivePerl on Windows. Still, I'm wondering if this is not just pure luck.

A reliable way to do it would be to generate a list of integers, in this case from 0 to 255, and convert each to a string using base 4. Admittedly, I don't know how to best do it in Perl... As a second step, I'd convert the digits '0' .. '3' to the letters, for example with

`tr/0123/TCAG/`

Anyway, you say "at the expense of having a lookup table for all 256 values". WTF? What expense is that? A few k of memory? Seriously, if the proper way to generate the array of meanings is too expensive, I'd just generate it once at startup, and store it in memory.

"You could also experiment with different tradeoffs on lookup table sizes": yes, but in that case, the lookup table gets much bigger: 64k entries of 8 letters each, that is 512k of text plus overhead of the array. Ouch. I don't think it will be much faster, so I don't think it's worth it.

"For some reason, I had byte-order issues doing this." Of course you have. You used a machine dependent byte ordering. You should either use 'n' or 'v' as the basic unpack template (probably 'n', for Big Endian), which luckily appears to produce unsigned integers, too.

I have some doubts about using `unpack "C*", $raw` to convert the byte sequence into numbers. Ouch. That sequence can be millions or even billions of bytes long, and that is a very long list. I think it's better to convert the $raw string either in short sequences of, say, a few k each (the compromise is in loop count vs. memory usage per loop),

`my $dna = '';`
`use constant CHUNKSIZE => 2048;`
`for (my $offset = 0; $offset < length($raw); $offset += CHUNKSIZE) {`
`    $dna .= join '', @CONV[ unpack 'C*', substr $raw, $offset, CHUNKSIZE ];`
`}`

or maybe even byte per byte with `s///`:

`s/(.)/$CONV[ord $1]/sge`

but I doubt this will be the fastest way. It'll be as memory cheap as possible, that is true.

Finally: don't forget to cut off the junk at the end of the sequence, making the length the same as the number of entries that were expected according to the record header.
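
For the "convert each integer to base 4" step, one possible sketch skips the string conversion and pulls the four 2-bit digits out of each byte with shifts and masks. It assumes the same T/C/A/G = 0/1/2/3 mapping and first-base-in-high-bits layout as the original post; `decode` is a hypothetical wrapper that combines the table with the chunked loop and the final trim.

```perl
use strict;
use warnings;

# Build the 256-entry table arithmetically: each byte holds four 2-bit
# digits, most significant pair first.
my @BASE = qw(T C A G);
my @CONV = map {
    my $byte = $_;
    join '', map { $BASE[ ($byte >> $_) & 3 ] } 6, 4, 2, 0;
} 0 .. 255;

# Decode in fixed-size chunks to avoid one huge temporary list.
use constant CHUNKSIZE => 2048;

sub decode {
    my ($raw, $size) = @_;
    my $dna = '';
    for (my $offset = 0; $offset < length($raw); $offset += CHUNKSIZE) {
        $dna .= join '', @CONV[ unpack 'C*', substr $raw, $offset, CHUNKSIZE ];
    }
    return substr $dna, 0, $size;   # drop padding bases at the end
}

print decode(pack('B*', '00011011' . '11000000'), 5), "\n";
```

The table is built once at startup, so its cost is paid a single time regardless of how many records are decoded.
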
Re: Parsing .2bit DNA files by BrowserUk (Patriarch) on Mar 06, 2008 at 04:37 UTC
The first thing I noticed is that you are detecting the byte order for the header, but then ignoring it and using the platform specific 'l' template from then on. That's wrong in two ways: Using the 'l' template, which is for signed 32-bit numbers, could cause problems. You should probably at least use 'L'. But I think you ought to be using 'N' or 'V'. My interpretation of the spec you linked is that the entire file may need byte order reversal? If you've already successfully parsed a 2bit file, then you may have just lucked out on using the same platform/byte ordering as it was created on.

My instinct would be to check the sig and save a template char of 'N' or 'V' as appropriate and use that for all future unpacking. That has a knock-on advantage also. See later.

The slowest bit of the process seems likely to be

`$dna = join '', map $conv{$_}, unpack('(A2)*', unpack("B" . $bytes * BITS_PER_BYTE, $raw));`

You should be able to save a bit of time by building a larger lookup table:

`my %bitMap = ('00' => 'T', '01' => 'C', '10' => 'A', '11' => 'G');`
`my @byteMap = map{ join '', map $bitMap{ $_ }, unpack '(A2)4', unpack 'B8', chr } 0 .. 255;`

Now you can convert each byte (4 bases) in the packed DNA to its ASCII with a single array lookup rather than 4 hash lookups:

`## Omit the braces add a comma for a negligible further improvement`
`my $DNA = join '', map{ $byteMap[ $_ ] } unpack 'C*', $raw;`

Also, build your lookups at compile time, not over and over at runtime as now.

Of course, you can take that idea a little further and do two bytes at a time:

`my %bitMap = ('00' => 'T', '01' => 'C', '10' => 'A', '11' => 'G');`
`my @byteMap = map{ join '', map $bitMap{ $_ }, unpack '(A2)4', unpack 'B8', chr } 0 .. 255;`
`my @wordMap = map { $byteMap[ $_ >> 8 ] . $byteMap[ $_ & 0xff ] } 0 .. 65535;`
`...`
`my $DNA = join '', map{ $wordMap[ $_ ] } unpack 'n*', $raw;`

Which ought to be close to an order of magnitude faster than your current method: 1 array lookup versus 8 hash lookups, and one-eighth of the loop overhead.

Of course, the 'n' template is byte-order specific. But if, when you determine the byte order from reading the signature, you store a template of 'N' or 'V' for your 32-bit field processing, then you can just lc that template to obtain the unsigned short template.

Also, how sure are you of your conversion table? Are you certain that you should be using 'B' and not 'b'? I might have tested some of this, but it would take me the best part of two days to download the "sample" .2bit file you linked and a quick google didn't locate any others.

Final thought. If you built an ordered array of offsets, as well as the named index, when processing the toc, you could provide access by position as well as access by name. (Your ordered hash module might work for this also :)

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."
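
A small self-check of the two-bytes-at-a-time idea, offered as a sketch rather than a benchmark. It builds both tables and confirms that decoding with a 16-bit big-endian template gives the same string as decoding byte by byte. One assumption worth flagging: if the packed DNA is a plain byte stream (as the "first base in the most significant bits" wording suggests), then `n` would be the template to use for it whatever byte order the 32-bit header fields turn out to have.

```perl
use strict;
use warnings;

my %bitMap = ('00' => 'T', '01' => 'C', '10' => 'A', '11' => 'G');

# One entry per byte value: four bases each.
my @byteMap = map {
    my $bits = unpack 'B8', chr $_;
    join '', map { $bitMap{$_} } unpack '(A2)4', $bits;
} 0 .. 255;

# One entry per 16-bit value: high byte first, low byte masked with & 0xff.
my @wordMap = map { $byteMap[ $_ >> 8 ] . $byteMap[ $_ & 0xff ] } 0 .. 65535;

# $raw must hold an even number of bytes for the 'n*' version.
my $raw = pack 'B*', '0001101111100100' x 2;

my $by_byte = join '', @byteMap[ unpack 'C*', $raw ];
my $by_word = join '', @wordMap[ unpack 'n*', $raw ];

print "$by_byte\n";
print $by_byte eq $by_word ? "match\n" : "mismatch\n";
```

The word table costs 65,536 eight-character strings, so whether it pays off is something to measure on real data rather than assume.
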
Re^2: Parsing .2bit DNA files by Khisanth (Novice) on Mar 07, 2008 at 04:07 UTC
ftp://hgdownload.cse.ucsc.edu/gbdb/ce4/ce4.2bit is much smaller, 24.6MB
Re: Parsing .2bit DNA files by bobf (Monsignor) on Mar 06, 2008 at 04:08 UTC
Nice example++. Thanks for posting the code.

The 2bit format is used by the BLAT program for aligning DNA sequences. The BLAT suite of programs contains two utilities that might be of interest:

faToTwoBit - convert Fasta format sequence files to a dense randomly-accessable .2bit format
twoBitToFa - convert from the .2bit format back to fasta

The `twoBitToFa` documentation is as follows:

`twoBitToFa - Convert all or part of .2bit file to fasta`
`usage:`
`   twoBitToFa input.2bit output.fa`
`options:`
`   -seq=name - restrict this to just one sequence`
`   -start=X - start at given position in sequence (zero-based)`
`   -end=X - end at given position in sequence (non-inclusive)`

Once the sequence is converted back to fasta format it can be easily manipulated using a variety of tools (Perl, BioPerl, sequence utility programs, etc).

So yes, a utility does exist that meets at least some of your listed requirements. OTOH, another tool in the toolbox, especially if it is pure Perl and/or a snazzy wrapper around a command-line utility, is always welcome. :-)
Re: Parsing .2bit DNA files by bart (Canon) on Mar 06, 2008 at 21:39 UTC
`# Parse header`
`my ($sig, $ver, $cnt, $reserved) = unpack('l4', $raw);`
`# TODO: validate (signature, reverse byte order, version)`
`return {SIG => $sig, VER => $ver, CNT => $cnt, RSV => $reserved};`

Man, I can't believe you lightly step over what seems to be the most fun part of the whole spec: the fact that these files can be made in the Endianness you like:

"All fields are 32 bits unless noted. If the signature value is not as given, the reader program should byte-swap the signature and check if the swapped version matches. If so, all multiple-byte entities in the file will have to be byte-swapped. This enables these binary files to be used unchanged on different architectures."

I can hardly believe you use "`l`" to unpack 32-bit integers. I agree with BrowserUk here: you should be using "`N`" or "`V`", actually, try both, and return the one that works.

`for my $template ("N", "V") {`
`    # Parse header`
`    my ($sig, $ver, $cnt, $reserved) = unpack($template.'4', $raw);`
`    if($sig==0x1A412743) {`
`        return {unpack => $template, VER => $ver, CNT => $cnt};`
`    }`
`}`
`# no match: not a .2bit header`
`return undef;`

In the rest of your code, always use that $template (or `$header->{unpack}`) instead of that '`l`'. Now, that wasn't so hard, was it?

For the rest... See blokhead's node, and my reply with my remarks. I agree with his approach as it's probably as fast as you can get in pure Perl. I'd stick to decoding whole bytes in one go.

Be careful about memory problems, though: these strings can be very long, and your own approach is even worse, as it uses strings double that size for decoding, and you're copying it around some more, taking up even more temporary space. Ouch. And using more memory usually means (much) slower, if you're not simply running out of memory or having to close some other programs to be able to run yours. None of them good things.
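
To illustrate "always use that $template" beyond the header, here is a sketch that carries the detected template through to the index entries. `detect_template` and `parse_index` are hypothetical names, and the file contents are fabricated in memory in both byte orders purely to exercise the code.

```perl
use strict;
use warnings;

use constant SIGNATURE => 0x1A412743;

# Return 'N' or 'V' depending on which byte order yields the signature.
sub detect_template {
    my ($raw) = @_;
    for my $template ('N', 'V') {
        return $template if unpack($template, $raw) == SIGNATURE;
    }
    return;    # not a .2bit header
}

# Parse header plus index from an in-memory buffer (hypothetical helper).
sub parse_index {
    my ($buf) = @_;
    my $t = detect_template($buf) or die "Not a .2bit file\n";
    my (undef, $ver, $cnt) = unpack("${t}3", $buf);

    my ($pos, %toc) = (16);    # index starts after the 16-byte header
    for (1 .. $cnt) {
        # C/a = length-prefixed name, followed by a 32-bit offset.
        my ($name, $offset) = unpack("x$pos C/a $t", $buf);
        $toc{$name} = $offset;
        $pos += 1 + length($name) + 4;
    }
    return ($t, \%toc);
}

for my $t ('N', 'V') {
    my $buf = pack("${t}4 C/a $t", SIGNATURE, 0, 1, 0, 'chr1', 25);
    my ($found, $toc) = parse_index($buf);
    print "$found: chr1 at offset $toc->{chr1}\n";
}
```

Using unsigned `N`/`V` throughout also avoids the negative offsets that a signed `l` could produce for files larger than 2 GB.
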
Re: Parsing .2bit DNA files by ikegami (Patriarch) on Mar 07, 2008 at 06:41 UTC
Is there a reason for using `sysread` instead of `read`, which is buffered? (Switching requires also switching `sysseek` to `seek`.)
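
A minimal sketch of what the buffered variant might look like, with a short-read check added. `read_exact` is a hypothetical helper, and an in-memory filehandle stands in for a real .2bit file; `binmode` is included because binary files need it on platforms such as Windows.

```perl
use strict;
use warnings;

# Hypothetical helper: buffered read of exactly $len bytes, or die.
sub read_exact {
    my ($fh, $len) = @_;
    my $got = read($fh, my $buf, $len);
    die "Read failed: $!\n" unless defined $got;
    die "Short read: wanted $len bytes, got $got\n" unless $got == $len;
    return $buf;
}

# An in-memory file stands in for a real .2bit file here.
my $data = pack('V2', 7, 42);
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";
binmode $fh;

seek($fh, 4, 0) or die "Seek failed: $!";    # buffered counterpart of sysseek
my $value = unpack 'V', read_exact($fh, 4);
print "$value\n";
```

Mixing the two families on one handle (`read` with `sysseek`, or `sysread` with `seek`) is what the parenthetical warns against, so the switch has to be made everywhere at once.
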
Re: Parsing .2bit DNA files by Anonymous Monk on Oct 03, 2011 at 09:21 UTC
I'm sorry to tell you that if you compare the output generated from your code and the one from the original twobittofa program, they don't match. So something's wrong...
Re^2: Parsing .2bit DNA files by Limbic~Region (Chancellor) on Oct 03, 2011 at 11:50 UTC
Anonymous Monk,

No, nothing is wrong, but thank you for your response. The twobittofa program, which I was not attempting to duplicate, converts a .2bit file into a fasta file as I understand it. The intent of my project was not to duplicate that program. If that's what you need, then that's what you should use.

Cheers - L~R

