7 hours! for about 200,000 starting strings.
| [reply] |
Try this on that same dataset and let me know how you get on. On my generated test data it takes ~30 minutes for 200,000 strings, but how realistic that dataset is ...
Use thisScript.pl inFile > outFile:
#! perl -slw
use strict;
use Inline C => Config => BUILD_NOISY => 1;
use Inline C => <<'END_C', NAME => '_906020', CLEAN_AFTER_BUILD => 0;
#include <string.h>
int longCmp( SV *needle, SV *haystack, SV *offset ) {
STRLEN ln, lh, o = SvIV( offset );
char *n = SvPV( needle, ln ), *h = SvPV( haystack, lh );
char *nl = n + ln - 1;
int diff = lh - ln;
int flag = 0, i;
h += o; lh -= o; diff -= o;
if( diff <= 0 ) return 0;
for( i = 0; i < diff; ++i ) {
if( ! h[ i + ln - 1 ] ) { i += ln; continue; }
if( h[ i ] != *n || h[ i+ ln-1 ] != *nl ) continue;
if( strncmp( h+i, n, ln ) ) continue;
return i;
}
return 0;
}
END_C
use Time::HiRes qw[ time ];
sub uniq{ my %x; @x{@_} = (); keys %x }
my $start = time;
my @uniq = uniq <>;
chomp @uniq;
@uniq = sort{ length $a <=> length $b } @uniq;
my $all = join chr(0), @uniq;
my $p = 0;
for my $x ( @uniq ) {
$p += 1+ length $x;
next if longCmp( $x, $all, $p );
print $x;
}
printf STDERR "Took %.3f\n", time() - $start;
__END__
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
Starting "make" Stage
i686-apple-darwin10-gcc-4.2.1: Module: No such file or directory
i686-apple-darwin10-gcc-4.2.1: no input files
i686-apple-darwin10-gcc-4.2.1: Module: No such file or directory
i686-apple-darwin10-gcc-4.2.1: no input files
powerpc-apple-darwin10-gcc-4.2.1: Module: No such file or directory
powerpc-apple-darwin10-gcc-4.2.1: no input files
lipo: can't figure out the architecture type of: /var/folders/Jx/Jx+cO
+TNTFTSKERHsDjE+nU+++TI/-Tmp-//cch73GJV.out
make: *** [_906020.o] Error 1
A problem was encountered while attempting to compile and install your
+ Inline
C code. The command that failed was:
make
| [reply] [d/l] |
7 hours! starting with about 200,00 strings. Thanks! | [reply] |