It looks like "PRM" in your sample output is a typo -- or I simply don't understand the conversion. (I would like to help, but think before we optimize we should fully understand what we're optimizing.) I'm going to assume it's just a typo for now.
Also, I realize this is just example code, but have you profiled real code against your real files to see where the time actually goes? If it's mostly spent in I/O, this algorithm may not need improving.
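If you haven't profiled yet, Devel::NYTProf is the usual tool. Something like this (assuming your script is called yourscript.pl) will show where the time goes:

perl -d:NYTProf yourscript.pl   # run under the profiler
nytprofhtml                     # generate the HTML report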
With those caveats in mind, I wanted to give it a shot using a bit-twiddling approach. This solution pushes as much work as I can manage into internals (without resorting to XS) and performs bitwise operations on the strings to derive the result. I think you will find the performance to be decent:
my $str1 = "12345 ABC 987 MNO";
my $str2 = "      CDE";
sub merge {
    # Spaces in the second string become NUL bytes, so they drop out
    # of the bitwise OR below.
    my $s2 = $_[1] =~ tr/ /\x{0}/r;

    # Build a mask that is \xFF wherever $s2 is NUL and \x00 wherever
    # it has content, AND it against the first string to keep only the
    # bytes $s2 doesn't cover, then OR $s2's content back in.
    return $s2 | $_[0] & ( "\x{ff}" x length $_[0] ^ $s2 =~ tr/\x{00}/\x{FF}/cr );
}
print merge( $str1, $str2 ), "\n";
...output...
12345 CDE 987 MNO
I really wish I could get rid of the tr///'s, but my bit-fu is maxed out for now. ;)
This bitwise method works by building a mask from the second string (\xFF where it's blank, \x00 where it has content), ANDing the first string against that mask, and then OR-ing the second string's content back in. It's a little hard to read, but that may be the price one pays for a fairly efficient solution.
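To make the masking concrete, here's a small trace on toy inputs (the variable names here are mine, just for illustration):

my $a = "AB CD";
my $b = "  X  ";
my $s2   = $b =~ tr/ /\x{0}/r;          # "\0\0X\0\0"
my $mask = "\x{ff}" x length($a) ^ $s2 =~ tr/\x{00}/\x{FF}/cr;
# tr/\x{00}/\x{FF}/cr maps every non-NUL byte to \xFF: "\0\0\xFF\0\0"
# XOR-ing that against all-\xFF flips it:              "\xFF\xFF\0\xFF\xFF"
print +( $s2 | $a & $mask ), "\n";      # "ABXCD"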
A last resort, if this doesn't do it for you, would be to re-implement it with Inline::C, though there's so much already pushed into internals that it probably wouldn't gain as much as one would hope. You might also spawn workers that each handle one input file, and limit the number of workers to some sane amount. If you're currently CPU bound but only using one core, this makes sense.
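For the worker idea, a sketch along these lines would do (Parallel::ForkManager from CPAN; process_file() and @input_files are stand-ins for your own code):

use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(4);   # cap the worker count
for my $file (@input_files) {
    $pm->start and next;                  # parent queues the next file
    process_file($file);                  # child handles one file
    $pm->finish;                          # child exits
}
$pm->wait_all_children;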
Beware: Nothing will mojibake your Unicode like treating it as raw bits and bytes. This is for ASCII only.
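If there's any chance of non-ASCII data sneaking in, a cheap guard like this (my own addition, not part of the merge) avoids silent corruption:

for ($str1, $str2) {
    die "non-ASCII byte in input\n" if /[^\x00-\x7f]/;
}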
Update: Benchmark results are in a followup later in this thread: Re: merging strings (xor at char level?). This bit-twiddling approach is a clear winner for both short and multi-megabyte strings.
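If you want to reproduce the comparison on your own data, the standard Benchmark module will do it; merge_other() here is a placeholder for whatever alternative you're timing:

use Benchmark qw(cmpthese);

cmpthese( -3, {                 # run each sub for ~3 CPU seconds
    bitwise => sub { merge( $str1, $str2 ) },
    other   => sub { merge_other( $str1, $str2 ) },
});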