Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

Re^3: Unpack Fail to Decompress String?

by BrowserUk (Patriarch)
on Jan 10, 2009 at 10:56 UTC ( [id://735378]=note: print w/replies, xml ) Need Help??


in reply to Re^2: Unpack Fail to Decompress String?
in thread Unpack Fail to Decompress String?

I'd favour putting a count of the number of significant characters in the last byte in the last 2 bits of the that byte. (Which may require a zero last byte.)

Sorry guys, but I don't see any way of making a trailing indicator allow for string comparison.

  • If you do this way, the bigger strings sort before shorter strings:
    input Using a significant bit count in the last 2 bits of the last byte ACGT 1b 00 ACGTA 1b 01 ACGTAA 1b 02 ACGTAAA 1b 03 ACGTAAAA 1b 00 00 ACGTAAAAA 1b 00 01
  • This way is a non starter, things sort all over the place:
    input Using the significant bit count in the first byte ACGT 00 1b 00 ACGTA 01 1b 00 ACGTAA 02 1b 00 ACGTAAA 03 1b 00 ACGTAAAA 00 1b 00 00 ACGTAAAAA 01 1b 00 00
  • With a trailing full length, the last comes first and the first last, with those in the middle keeping their correct relative ordering:
    input packed with trailing network (BE) length ACGT 1b 00 00 00 04 ACGTA 1b 00 00 00 00 05 ACGTAA 1b 00 00 00 00 06 ACGTAAA 1b 00 00 00 00 07 ACGTAAAA 1b 00 00 00 00 08 ACGTAAAAA 1b 00 00 00 00 00 09
  • The only method that I can see working is prefixing the output length in network (BE) order to the front:
    input packed with leading network (BE) length ACGT 00 00 00 04 1b ACGTA 00 00 00 05 1b 00 ACGTAA 00 00 00 06 1b 00 ACGTAAA 00 00 00 07 1b 00 ACGTAAAA 00 00 00 08 1b 00 ACGTAAAAA 00 00 00 09 1b 00

    Update: Even that doesn't work! These two sort in the wrong order.

    in: ACGTACGTACGTACGTATT packed( 9): 000000131b1b1b1b3c in: ACGTACGTACGTACGTAAAA packed(10): 000000141b1b1b1b0000

    And the penalty of that for what is intended as a compression mechanism is prohibitive for short sequences. You might lower the overhead by using a short instead of a long, but 64k is not enough for real genome work.

That said, the sorting and comparing of packed ACGT (other than simple equality), is a fairly non-useful thing anyway. Sequence and subsequence representations don't have any intrisic ordering.

The more frequent operation is to search one sequence for the presence of another (sub)sequence, and doing that with packed sequences means you're only checking every fourth position, rather than every position. Performing shifts or rotates on bitstring greater than register size is a prohibitively expensive operation. It far outweights any gains you might get from searching quarter sized strings, as you have to perform 4 searches anyway, and you need the expensive shift operation inbetween each search.

I think the only useful use for packed ACGT (2bit) format is to reduce storage overhead. It's almost certainly quicker to just unpack the sequences for searching. In which case, the prefix byte of the significant bits in the last byte is a good compromise between compression level and implementation ease.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^4: Unpack Fail to Decompress String?
by gone2015 (Deacon) on Jan 10, 2009 at 15:00 UTC

    Expanding on what I suggested:

    1. strings which compress to the same number of bytes can be compared directly.

      As per your example:

        ACGT       1b 00
        ACGTA      1b 01
        ACGTAA     1b 02
        ACGTAAA    1b 03
      

    2. strings which compress to different numbers of bytes can be compared directly up to the last byte of the shorter. If that doesn't settle the matter, then you need to use the last bits of the shorter to select a mask...

      Taking your example:

        ACGTAAA    1b 03
        ACGTAAAA   1b 00 00
      
      the strings match up to (but, for the avoidance of doubt, not including) the last byte of the shorter, so the '3' selects the mask 0xFC, under that mask the last corresponding bytes are equal, so the longer string is the larger. Similarly:
        ACGT TGCA AGA    1b F4 23
        ACGT TGCA AGAC   1b F4 21 00
      
      Actually, it's not strictly necessary to apply the mask to both strings, or indeed to use any mask other than 0xFC -- but it seemed tidier.

    There is some futzing about to be done to compare the compressed forms, though the decision is likely to be made before the last byte of the shorter.

    Of course, if the comparison is only for searching for equality, then it doesn't really matter where you put the length mod 4 (except that if there's little variation in length, putting it at the end keeps it out of the way). Alphabetic order does appeal to a sense of neatness, though.

    As far as implementation of packing/unpacking goes, placing the length mod 4 at the end means that all the fiddling about happens at the end, rather than at both ends. I have modified the code I posted previously to do this and included that below. I look forward to improvements :-)

    Mind you, if performance is the issue, I'd cast this into C.

Re^4: Unpack Fail to Decompress String?
by samwyse (Scribe) on Jan 13, 2009 at 15:23 UTC
    With a trailing full length, the last comes first and the first last, with those in the middle keeping their correct relative ordering
    So, don't use the actual length, use length mod 4.
    inputpacked
    ACGT1b 00
    ACGTA1b 01
    ACGTAA1b 02
    ACGTAAA1b 03
    ACGTAAAA1b 00 00
    ACGTAAAAA1b 00 01

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://735378]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chilling in the Monastery: (5)
As of 2024-04-23 20:34 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found