Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

Re^2: CPAN Module to determing overlap of 2 lists?

by wazat (Monk)
on Aug 11, 2020 at 19:03 UTC ( [id://11120621]=note: print w/replies, xml ) Need Help??


in reply to Re: CPAN Module to determing overlap of 2 lists?
in thread CPAN Module to determing overlap of 2 lists?

I hadn't thought much about a regex solution.

To ensure the match start is a complete line, requires a small tweak.

my $combine = "$file1$marker$file2" =~ s/(?:\A|\n)(.*)\K\Q$marker\E\1/ +/sr;

Replies are listed 'Best First'.
Re^3: CPAN Module to determing overlap of 2 lists?
by LanX (Saint) on Aug 12, 2020 at 18:27 UTC
    3 suggestions
    • you don't need complete lines to make it work, but anchoring to line start might prove to be faster
    • I'd include characters below ASCII 8 to the "marker" to play safe, see also discussion surrounding the similar $;
    • you might be interested to check with re "debug" , how the backtracking of the .* submatch performs. I'd guess you prefer it to grow from right to left instead of shrinking from left to right. I know the regex engine can do this depending on the anchors.
    I haven't checked the last point since performance might not be your biggest issue.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      > grow from right to left instead of shrinking from left to right

      This might be much faster if the overlaps are considerably smaller than the total files.

      And it avoids any semipredicate problem with $marker.°

      (Not heavily tested, please check edge-cases)

      use strict; use warnings; my $file1 = join "\n", qw( a b c d c ); my $file2 = join "\n", qw( c d c x ); my $content = "$file2\n$file1"; $content =~ /^(.*)\n.*\1$/s; (substr $file2,0,length $1)=$file1; print $file2;

      a b c d c x

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

      °) unfortunately it doesn't, prove left to the interested reader

      I added the line start anchor as I wanted to match whole lines.

      Agreed, assuming text files, a more "binary" marker is better.

      Currently I feel the regex solution is interesting, but still not my first choice. I'll dig deeper if I start profiling.

        > Currently I feel the regex solution is interesting, but still not my first choice. I'll dig deeper if I start profiling.

        For completeness.

        Another approach would be combining a binary search with string equality eq if your priorities tend towards performance.

        Of course you could change to a division factor other than 0.5 depending on heuristics.

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11120621]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others taking refuge in the Monastery: (3)
As of 2024-04-19 04:18 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found