Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Parsing .2bit DNA files

by Limbic~Region (Chancellor)
on Mar 05, 2008 at 19:13 UTC ( [id://672251]=perlquestion: print w/replies, xml ) Need Help??

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All,
Last night on #perl IRC (freenode), someone asked for help parsing .2bit files. The usual responses followed:

The response was that CPAN hadn't turned up anything, they weren't interested in the monolith that is BioPerl, and they were not that comfortable with unpack. I glanced at the spec and said that I would be happy to provide a full working solution as soon as my daughter went to bed. This turned out to be more challenging than I expected due to lack of clarity in the spec.

The issue is in the section covering sequence records. One field indicates the number of "N blocks" in the sequence, the next field is the "starting positions of the N blocks" and the next field is the "lengths of each N block". Right, and how is a single 32bit integer supposed to describe multiple items. The solution is that, assuming the number of N blocks is X, the next X fields will be starting positions and the X fields after that are the lengths. If X is 0, the next fields are omitted entirely.

Fortunately, the requester was able to provide some sample files and I put together the following code:

#!/usr/bin/perl use constant ONE_BYTE => 1; use constant FOUR_BYTE => 4; use constant BITS_PER_BYTE => 8; use constant BASES_PER_FOUR_BYTE => 16; use strict; use warnings; my $file = $ARGV[0] or die "Usage: $0 <input>"; open(my $fh, '<', $file) or die "Unable to open '$file' for reading: $ +!"; my $header = parse_header($fh); my $count = $header->{CNT}; my %toc; populate_toc($fh, $count, \%toc); for my $name (keys %toc) { my $offset = $toc{$name}; my $dna = fetch_record($fh, $offset); #print length($dna), "\n"; print "$dna\n"; } sub parse_header { my ($fh) = @_; # Read header my $raw = ''; sysread($fh, $raw, FOUR_BYTE * 4); # Parse header my ($sig, $ver, $cnt, $reserved) = unpack('l4', $raw); # TODO: validate (signature, reverse byte order, version) return {SIG => $sig, VER => $ver, CNT => $cnt, RSV => $reserved}; } sub populate_toc { my ($fh, $count, $toc) = @_; my ($raw, $size, $name) = ('', '', ''); for (1 .. $count) { # Read size of record name sysread($fh, $raw, ONE_BYTE); $size = unpack('C', $raw); # Read name of reacord sysread($fh, $name, $size); # Read and store offset sysread($fh, $raw, FOUR_BYTE); $toc->{$name} = unpack('l', $raw); } } sub fetch_record { my ($fh, $offset) = @_; my ($raw, $dna, $size, $cnt) = ('', '', '', ''); my (@start, @len, %nblock, %mblock); # Seek to the record location sysseek($fh, $offset, 0); # Establish the conversion table my %conv = ('00' => 'T', '01' => 'C', '10' => 'A', '11' => 'G'); # Fetch the DNA size sysread($fh, $raw, FOUR_BYTE); $size = unpack('l', $raw); # Handle the n block and m blocks for my $block (\%nblock, \%mblock) { # Fetch the n block count sysread($fh, $raw, FOUR_BYTE); $cnt = unpack('l', $raw); if ($cnt) { sysread($fh, $raw, FOUR_BYTE * $cnt); @start = unpack("l$cnt", $raw); sysread($fh, $raw, FOUR_BYTE * $cnt); @len = unpack("l$cnt", $raw); @{$block}{@start} = @len; } } # throw away reserved field sysread($fh, $raw, FOUR_BYTE); # Fetch DNA - TODO: read in configurable size chunks my $bytes = ((int($size / BASES_PER_FOUR_BYTE)) + 1) * FOUR_BYTE; sysread($fh, $raw, $bytes); $dna = join '', map $conv{$_}, unpack('(A2)*', unpack("B" . $bytes * BITS_PER_BYTE , $raw) +); # Fix N blocks substr($dna, $_, $nblock{$_}, 'N' x $nblock{$_}) for keys %nblock; # Fix M blocks substr($dna, $_, $mblock{$_}, lc(substr($dna, $_, $mblock{$_}))) f +or keys %mblock; return substr($dna, 0, $size); } __END__ See http://genome.ucsc.edu/FAQ/FAQformat#format7 See also human genome as .2bit ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/hg18.2bit TODO 1. Documentation, error handling, and tests 2. Profiling and benchmarking 3. Handle big/little endian 4. Dynamically determine if record count justifies SQLite versus in-m +emory hash 5. Allow the reading of DNA sequence in configurable size chunks 6. Add an API to treat the sequence like an iterator

This leaves me with the following questions:

  • Is there an existing non-bundled module that does this?
  • Given that speed is a requirement, how would you improve it?
  • Is this worth polishing to put on CPAN?
  • If the goal of .2bit files is maximal compression, do you have any idea why N blocks aren't inserted rather than replaced?

Update: Added a few todo items in case this does make it to CPAN

Cheers - L~R

Replies are listed 'Best First'.
Re: Parsing .2bit DNA files
by pc88mxer (Vicar) on Mar 05, 2008 at 21:10 UTC
    I'll propose a theory on why the N's are not inserted. Basically, it simplifies I/O. If these files are large, they may find it more efficient to read the entire file into memory with one (or a few) I/O operations than to read it in small segments. Even if you have a scatter/gather I/O capability, generally you are limited to block (or at least byte) boundaries. Reading the whole file in and inserting bits in the middle afterwards is obviously going to be too inefficient for large files.

    Also, it simplifies random access of the base stream directly from the file. You don't have to keep track of where you really are based on all the previous insertions.

    The idea that .2bit files are more compressed is probably in relationship to the earlier .nib format which uses 4 bits per base. Clearly the frequency of N and M bases is such that the 2bit format is much more compact.

    I think that a virtual api to a 2bit file would be useful. So you could do something like:

    my $twobit = new TwoBitFile('some/path.2bit'); # returns 1000 bases starting from base 40,000,000: my $bases = $twobit->range(40_000_000, 1000);
    and minimal disk I/O is performed.

    Update: Being able to treat the 2bit file as a virtual string with the ability to run regular expressions on it would also be really cool. Is that something we can do in Perl6?

      pc88mxer,
      With regards to the N blocks, your hypothesis sounds valid. With regards to the virtual API, I was actually thinking of turning it into an iterator. Calling the iterator without any parameters would fetch the next X bytes where X is some default. You could then allow for named parameters to fetch forward by Y, rewind, select a range, etc. It should have gone into the TODO but wasn't really necessary.

      With regards to Perl 6 and applying a regular expressions on streams - the answer is yes unless things have changed since last I looked.

      Cheers - L~R

Re: Parsing .2bit DNA files
by blokhead (Monsignor) on Mar 06, 2008 at 05:49 UTC
    One thing that strikes me as odd/inefficient is that you are explicily converting to strings of ASCII '0' and '1' to then convert to A, C, G, T. It seems like it would be more direct to convert a byte (4 DNA bases) at a time. Here is a cute way to do that, at the expense of having a lookup table for all 256 values:
    my @CONV = glob( "{T,C,A,G}" x 4 ); my $dna = join "", @CONV[ unpack "C*", $raw ];
    On my system, this gives the same output as yours. I don't know if it's better, but it is shorter, and it can conveniently use an array instead of a hash. You could also experiment with different tradeoffs on lookup table sizes:
    ## takes 16 bits (= 8 bases = unsigned short) at a time my @CONV = glob( "{T,C,A,G}" x 8 ); my $dna = join "", @CONV[ unpack "S*", $raw ];
    For some reason, I had byte-order issues doing this. Of course, you must also be careful that $raw is padded to a multiple of 16 bits!

    Another cute trick I can think of is that you can do some bit-twiddling to implement the M-blocks (apparently lowercasing a range of characters). In ASCII, you can toggle the case of an alphabetic character by bitwise-XOR'ing it with the space character. So I think you can rewrite:

    substr($dna, $_, $mblock{$_}, lc(substr($dna, $_, $mblock{$_})))
    as
    substr($dna, $_, $mblock{$_}) ^= (" " x $mblock{$_});
    Alternatively, you could use %mblock to generate a long mask of chr(0)'s and chr(32)'s that you can XOR with the entire $dna. Again, probably not a big deal but certainly higher cute-value.

    Of course you could always fix M,N blocks on-the-fly, as you are unpacking them from $raw, but that would require some more work. Since I'm typing one-handed these days and it takes me forever, I think I will pass on playing with some code that does that! ;)

    blokhead

      Your idea of looking up the meanings of the sequences byte per byte is brilliant. I do have some doubts about using glob for it... but it even appears to do the right thing on ActivePerl on Windows. Still, I'm wondering if this is not just pure luck.

      A reliable way to do it would be to generate a list of integers, in this case from 0 to 255, and convert each to a string using base 4 — admittedly, I don't know how to best do it in Perl... As a second step, I'd convert the digits '0' .. '3' to the letters, for example with

      tr/0123/TCAG/

      Anyway, you say

      at the expense of having a lookup table for all 256 values
      WTF? What expense is that? A few k of memory? Seriously, if the proper way to generate the array of meanings is too expensive, I'd just generate it once at startup, and store it in memory.

      You could also experiment with different trade-offs on lookup table sizes:
      Yes, but in that case, the lookup table gets much bigger: 64k entries of 8 letters each, that is 256k of text plus overhead of the array. Ouch. I don't think it will be much faster, so I don't think it's worth it.
      For some reason, I had byte-order issues doing this.
      Of course you have. You used a machine dependent byte ordering. You should either use 'n' or 'v' as the basic unpack template (probably 'n', for Big Endian), which luckily appears to produce unsigned integers, too.

      I have some doubts about using unpack "C*", $raw to convert the byte sequence into numbers. Ouch. That sequence can be millions or even billions of bytes long, and that is a very long list. I think it's better to convert the $raw string either in short sequences of, say, a few k each (the compromise is in loop count vs. memory usage per loop),

      my $dna = ''; use constant CHUNKSIZE => 2048; for (my $offset = 0; $offset < length($raw); $offset += CHUNKSIZE) { $dna.= join '', @CONV[ unpack 'C*', substr $raw, $offset, CHUNKSIZE + ]; }
      or maybe even byte per byte with s///:
      s/(.)/$CONV[ord $1]/sge
      but I doubt this will be the fastest way. It'll be as memory cheap as possible, that is true.

      Finally: don't forget to cut off the junk at the end of the sequence, making the length the same as the number of entries there were expected according to the record header.

Re: Parsing .2bit DNA files
by BrowserUk (Patriarch) on Mar 06, 2008 at 04:37 UTC

    The first thing I noticed is that you are detecting the byte order for the header, but then ignoring it and using the platform specific 'l' template then on. That's wrong in two ways:

    1. Using 'l' template, which is for signed 32-bit numbers could cause problems.

      You should probably at least use 'L'. But I think you ought to be using 'N' or 'V'.

    2. My interpretation on the spec you linked is that the entire file may need byte order reversal?

      If you've already successfully parsed a 2bit file, then you may have just lucked out on using the same platform/byteordering as it was created on.

      My instinct would be to check the sig and save a template char of 'N' or 'V' as appropriate and use that for all future unpacking. Has a knock-on advantage also. See later.

    The slowest bit of the process seems likely to be

    $dna = join '', map $conv{$_}, unpack('(A2)*', unpack("B" . $bytes * BITS_PER_BYTE , $raw) +);

    You should be able to save a bit of time by building a larger lookup table:

    my %bitMap = ('00' => 'T', '01' => 'C', '10' => 'A', '11' => 'G'); my @byteMap = map{ join '', map $bitMap{ $_ }, unpack '(A2)4', unpack 'B8', chr } 0 .. 255;

    Now you can convert each byte (4 bases) in the packed DNA to it's ascii with a single array lookup rather than 4 hash loopkups:

    ## Omit the braces add a comma for a negligable further improvement my $DNA = join '', map{ $byteMap[ $_ ] } unpack 'C*', $raw;

    Also, build your lookups at compile time not over and over at runtime as now.

    Of course, you can take that idea a little further and do two bytes at a time:

    my %bitMap = ('00' => 'T', '01' => 'C', '10' => 'A', '11' => 'G'); my @byteMap = map{ join '', map $bitMap{ $_ }, unpack '(A2)4', unpack 'B8', chr } 0 .. 255; my @wordMap = map { $byteMap[ $_ >> 8 ] . $byteMap[ $_ | 0xff ] } 0 .. 65535; ... my $DNA = join '', map{ $wordMap[ $_ ] } unpack 'n*', $raw;

    Which ought to be close to an order of magnitude faster than your current method. 1 array lookup -v- 8 hash lookups; 8 times less lower loop overhead.

    Of course, the 'v' template will be byte-order specific. But, if when you determine the byte order from reading the signature, you store a template of 'N' or 'V' for your 32-bit field processing, then you can just lc that template to obtain the unsigned short template.

    Also, how sure are you of your conversion table? Are you certain that you should be using 'B' and not 'b'.

    I might have tested some of this, but it would take me the best part of two days to download the "sample" .2bit file you linked and a quick google didn't locate any others.

    Final thought. If you built an ordered array of offsets, as well as the named index, when processing the toc, you could provide access by position as well as access by name. (Your ordered hash module might work for this also :)


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      ftp://hgdownload.cse.ucsc.edu/gbdb/ce4/ce4.2bit is much smaller, 24.6MB
Re: Parsing .2bit DNA files
by bobf (Monsignor) on Mar 06, 2008 at 04:08 UTC

    Nice example++. Thanks for posting the code.

    The 2bit format is used by the BLAT program for aligning DNA sequences. The BLAT suite of programs contains two utilities that might be of interest:

    • faToTwoBit – convert Fasta format sequence files to a dense randomly-accessable .2bit format
    • twoBitToFa – convert from the .2bit format back to fasta

    The twoBitToFa documentation is as follows:

    twoBitToFa - Convert all or part of .2bit file to fasta usage: twoBitToFa input.2bit output.fa options: -seq=name - restrict this to just one sequence -start=X - start at given position in sequence (zero-based) -end=X - end at given position in sequence (non-inclusive)

    Once the sequence is converted back to fasta format it can be easily manipulated using a variety of tools (Perl, BioPerl, sequence utility programs, etc).

    So yes, a utility does exist that meets at least some of your listed requirements. OTOH, another tool in the toolbox, especially if it is pure Perl and/or a snazzy wrapper around a command-line utility, is always welcome. :-)

Re: Parsing .2bit DNA files
by bart (Canon) on Mar 06, 2008 at 21:39 UTC
    # Parse header my ($sig, $ver, $cnt, $reserved) = unpack('l4', $raw); # TODO: validate (signature, reverse byte order, version) return {SIG => $sig, VER => $ver, CNT => $cnt, RSV => $reserved};
    Man I can't believe you lightly step over what seems to be the most fun part of the whole spec: the fact that these files can be made in the Endianness you like:
    All fields are 32 bits unless noted. If the signature value is not as given, the reader program should byte-swap the signature and check if the swapped version matches. If so, all multiple-byte entities in the file will have to be byte-swapped. This enables these binary files to be used unchanged on different architectures.
    I can hardly believe you use "l" to unpack 32-bit integers. I agree with BrowserUK here: you should be using "N" or "V", actually, try both, and return the one that works.
    for my $template ("N", "V") { # Parse header my ($sig, $ver, $cnt, $reserved) = unpack($template.'4', $raw); if($sig==0x1A412743) { return {unpack => $template, VER => $ver, CNT => $cnt}; } } # no match: not a .2bit header return undef;
    In the rest of your code, always use that $template (or $header->{unpack}) instead of that 'l'.

    Now, that wasn't so hard, was it?

    For the rest... See blokhead's node — and my reply with my remarks. I agree with his approach as it's probably as fast as you can get in pure Perl. I'd stick to decoding whole bytes in one go. Be careful about memory problems, though: these strings can be very long — and your own approach is even worse, as it uses strings double that size, for decoding, and you're copying it around some more, taking up even more temporary space. Ouch.

    And using more memory usually means (much) slower — if you're not simply running out of memory, or you have to close some other programs to be able to run yours. None of them good things.

Re: Parsing .2bit DNA files
by ikegami (Patriarch) on Mar 07, 2008 at 06:41 UTC
    Is there a reason for using sysread instead of read, which is buffered? (Switching requires also switching sysseek to seek.)
Re: Parsing .2bit DNA files
by Anonymous Monk on Oct 03, 2011 at 09:21 UTC

    I'm sorry to tell you that if you compare the output generated from your code and the one from the original twobittofa program, they don't match. So something's wrong...

      Anonymous Monk,
      No, nothing is wrong but thank you for your response. The twobittofa program, which I was not attempting to duplicate converts a .2bit file into a fasta file as I understand it. The intent of my project was not to duplicate that program - if that's what you need, then that's what you should use.

      Cheers - L~R

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://672251]
Approved by pc88mxer
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others surveying the Monastery: (3)
As of 2024-04-16 21:30 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found