Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

Re^2: list of unique strings, also eliminating matching substrings

by lindsay_grey (Novice)
on May 21, 2011 at 03:44 UTC ( [id://906033]=note: print w/replies, xml ) Need Help??


in reply to Re: list of unique strings, also eliminating matching substrings
in thread list of unique strings, also eliminating matching substrings

1. I want to eliminate the duplicates within each file.

2. The strings range from 200 to 400 characters.

3. The complete alphabet is A, G, C, T, N.

Note for point 1. I want to eliminate not just the exact duplicates but also those that are contained within a longer string.

  • Comment on Re^2: list of unique strings, also eliminating matching substrings

Replies are listed 'Best First'.
Re^3: list of unique strings, also eliminating matching substrings
by BrowserUk (Patriarch) on May 21, 2011 at 05:53 UTC

    Presumably you've code the obvious two loops method and it is taking too long. Could you supply a timing for one of your datasets?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      7 hours! for about 200,000 starting strings.

        Try this on that same dataset and let me know how you get on. On my generated test data it takes ~30 minutes for 200,000 strings, but how realistic that dataset is ...

        Use thisScript.pl inFile > outFile:

        #! perl -slw use strict; use Inline C => Config => BUILD_NOISY => 1; use Inline C => <<'END_C', NAME => '_906020', CLEAN_AFTER_BUILD => 0; #include <string.h> int longCmp( SV *needle, SV *haystack, SV *offset ) { STRLEN ln, lh, o = SvIV( offset ); char *n = SvPV( needle, ln ), *h = SvPV( haystack, lh ); char *nl = n + ln - 1; int diff = lh - ln; int flag = 0, i; h += o; lh -= o; diff -= o; if( diff <= 0 ) return 0; for( i = 0; i < diff; ++i ) { if( ! h[ i + ln - 1 ] ) { i += ln; continue; } if( h[ i ] != *n || h[ i+ ln-1 ] != *nl ) continue; if( strncmp( h+i, n, ln ) ) continue; return i; } return 0; } END_C use Time::HiRes qw[ time ]; sub uniq{ my %x; @x{@_} = (); keys %x } my $start = time; my @uniq = uniq <>; chomp @uniq; @uniq = sort{ length $a <=> length $b } @uniq; my $all = join chr(0), @uniq; my $p = 0; for my $x ( @uniq ) { $p += 1+ length $x; next if longCmp( $x, $all, $p ); print $x; } printf STDERR "Took %.3f\n", time() - $start; __END__

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
      7 hours! starting with about 200,00 strings. Thanks!

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://906033]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others cooling their heels in the Monastery: (5)
As of 2024-04-25 04:48 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found