Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Re^3: CPAN Module to determing overlap of 2 lists?

by LanX (Saint)
on Aug 12, 2020 at 18:27 UTC ( [id://11120662]=note: print w/replies, xml ) Need Help??


in reply to Re^2: CPAN Module to determing overlap of 2 lists?
in thread CPAN Module to determing overlap of 2 lists?

3 suggestions
  • you don't need complete lines to make it work, but anchoring to line start might prove to be faster
  • I'd include characters below ASCII 8 to the "marker" to play safe, see also discussion surrounding the similar $;
  • you might be interested to check with re "debug" , how the backtracking of the .* submatch performs. I'd guess you prefer it to grow from right to left instead of shrinking from left to right. I know the regex engine can do this depending on the anchors.
I haven't checked the last point since performance might not be your biggest issue.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

Replies are listed 'Best First'.
Re^4: CPAN Module to determing overlap of 2 lists? (updated)
by LanX (Saint) on Aug 12, 2020 at 21:08 UTC
    > grow from right to left instead of shrinking from left to right

    This might be much faster if the overlaps are considerably smaller than the total files.

    And it avoids any semipredicate problem with $marker.°

    (Not heavily tested, please check edge-cases)

    use strict; use warnings; my $file1 = join "\n", qw( a b c d c ); my $file2 = join "\n", qw( c d c x ); my $content = "$file2\n$file1"; $content =~ /^(.*)\n.*\1$/s; (substr $file2,0,length $1)=$file1; print $file2;

    a b c d c x

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

    °) unfortunately it doesn't, prove left to the interested reader

Re^4: CPAN Module to determing overlap of 2 lists?
by wazat (Monk) on Aug 13, 2020 at 00:50 UTC

    I added the line start anchor as I wanted to match whole lines.

    Agreed, assuming text files, a more "binary" marker is better.

    Currently I feel the regex solution is interesting, but still not my first choice. I'll dig deeper if I start profiling.

      > Currently I feel the regex solution is interesting, but still not my first choice. I'll dig deeper if I start profiling.

      For completeness.

      Another approach would be combining a binary search with string equality eq if your priorities tend towards performance.

      Of course you could change to a division factor other than 0.5 depending on heuristics.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11120662]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others examining the Monastery: (6)
As of 2024-03-28 21:23 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found