It looks like "PRM" in your sample output is a typo -- or I simply don't understand the conversion. (I would like to help, but think before we optimize we should fully understand what we're optimizing.) I'm going to assume it's just a typo for now.
Also, I realize this is just example code, but have you profiled real code against your real files to see where the time actually goes? If it's mostly spent in I/O, this algorithm may not need improving.
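If you haven't profiled yet, Devel::NYTProf is the usual tool. Something like this (assuming your script is called yourscript.pl) will show where the time goes:

perl -d:NYTProf yourscript.pl   # run under the profiler
nytprofhtml                     # generate the HTML report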
With those caveats in mind, I wanted to give it a shot using a bit-twiddling approach. This solution pushes as much work as I can manage into internals (without resorting to XS) and performs bitwise operations on the strings to derive the result. I think you will find the performance to be decent:
my $str1 = "12345 ABC 987 MNO";
my $str2 = "      CDE";
sub merge {
    # Spaces in the second string become NUL bytes, so they drop out
    # of the bitwise OR below.
    my $s2 = $_[1] =~ tr/ /\x{0}/r;

    # Build a mask that is \xFF wherever $s2 is NUL and \x00 wherever
    # it has content, AND it against the first string to keep only the
    # bytes $s2 doesn't cover, then OR $s2's content back in.
    return $s2 | $_[0] & ( "\x{ff}" x length $_[0] ^ $s2 =~ tr/\x{00}/\x{FF}/cr );
}
print merge( $str1, $str2 ), "\n";
...output...
12345 CDE 987 MNO
I really wish I could get rid of the tr///'s, but my bit-fu is maxed out for now. ;)
This bitwise method works by building a mask from the second string (\xFF where it's blank, \x00 where it has content), ANDing the first string against that mask, and then OR-ing the second string's content back in. It's a little hard to read, but that may be the price one pays for a fairly efficient solution.
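To make the masking concrete, here's a small trace on toy inputs (the variable names here are mine, just for illustration):

my $a = "AB CD";
my $b = "  X  ";
my $s2   = $b =~ tr/ /\x{0}/r;          # "\0\0X\0\0"
my $mask = "\x{ff}" x length($a) ^ $s2 =~ tr/\x{00}/\x{FF}/cr;
# tr/\x{00}/\x{FF}/cr maps every non-NUL byte to \xFF: "\0\0\xFF\0\0"
# XOR-ing that against all-\xFF flips it:              "\xFF\xFF\0\xFF\xFF"
print +( $s2 | $a & $mask ), "\n";      # "ABXCD"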
A last resort, if this doesn't do it for you, would be to re-implement it with Inline::C, though there's so much already pushed into internals that it probably wouldn't gain as much as one would hope. You might also spawn workers that each handle one input file, and limit the number of workers to some sane amount. If you're currently CPU bound but only using one core, this makes sense.
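For the worker idea, a sketch along these lines would do (Parallel::ForkManager from CPAN; process_file() and @input_files are stand-ins for your own code):

use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(4);   # cap the worker count
for my $file (@input_files) {
    $pm->start and next;                  # parent queues the next file
    process_file($file);                  # child handles one file
    $pm->finish;                          # child exits
}
$pm->wait_all_children;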
Beware: Nothing will mojibake your Unicode like treating it as raw bits and bytes. This is for ASCII only.
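If there's any chance of non-ASCII data sneaking in, a cheap guard like this (my own addition, not part of the merge) avoids silent corruption:

for ($str1, $str2) {
    die "non-ASCII byte in input\n" if /[^\x00-\x7f]/;
}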
Update: Benchmark results are in a followup later in this thread: Re: merging strings (xor at char level?). This bit-twiddling approach is a clear winner for both short and multi-megabyte strings.
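If you want to reproduce the comparison on your own data, the standard Benchmark module will do it; merge_other() here is a placeholder for whatever alternative you're timing:

use Benchmark qw(cmpthese);

cmpthese( -3, {                 # run each sub for ~3 CPU seconds
    bitwise => sub { merge( $str1, $str2 ) },
    other   => sub { merge_other( $str1, $str2 ) },
});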