http://qs321.pair.com?node_id=11120576


in reply to CPAN Module to determing overlap of 2 lists?

#!/usr/bin/perl

use strict; # https://perlmonks.org/?node_id=11120564
use warnings;

my $file1 = <<END;
one
two
three
four
five
six
END

my $file2 = <<END;
two
three
four
five
six
seven
END

my $marker = '***MARKER***'; # something not in either string

my $combine = "$file1$marker$file2" =~ s/(.*)\K\Q$marker\E\1//sr;

print $combine;

Outputs:

one
two
three
four
five
six
seven

Re^2: CPAN Module to determing overlap of 2 lists?
by wazat (Monk) on Aug 11, 2020 at 19:03 UTC

    I hadn't thought much about a regex solution.

    Ensuring that the match starts at the beginning of a complete line requires a small tweak.

    my $combine = "$file1$marker$file2" =~ s/(?:\A|\n)(.*)\K\Q$marker\E\1//sr;
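To make the effect of the tweak concrete, here is a minimal sketch that puts both substitutions side by side. The sub names `merge_any` and `merge_lines` are made up for illustration, and the marker is simply assumed to be absent from both inputs. Note that the anchored version relies on the first text ending in a newline (or on the overlap starting at the very beginning); if neither holds there may be no match at all, and the marker then stays in the result.

```perl
use strict;
use warnings;

# Hypothetical helpers contrasting the two substitutions from this thread.
my $marker = '***MARKER***';    # assumed absent from both inputs

# Unanchored: the overlap may begin in the middle of a line.
sub merge_any {
    my ($first, $second) = @_;
    return "$first$marker$second" =~ s/(.*)\K\Q$marker\E\1//sr;
}

# Anchored: the overlap must begin at the start of a line.
sub merge_lines {
    my ($first, $second) = @_;
    return "$first$marker$second" =~ s/(?:\A|\n)(.*)\K\Q$marker\E\1//sr;
}

my $first  = "one\ntwo\n";
my $second = "wo\nthree\n";

print merge_any($first, $second);      # "wo\n" is treated as overlap and dropped
print "---\n";
print merge_lines($first, $second);    # partial line is not an overlap, nothing dropped
```

With these inputs the unanchored version prints one, two, three, while the anchored version keeps the "wo" line.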
      3 suggestions
      • you don't need complete lines to make it work, but anchoring to line start might prove to be faster
      • I'd include characters below ASCII 8 in the marker to play it safe; see also the discussion surrounding the similar $;
      • you might be interested to check with use re 'debug' how the backtracking of the .* submatch performs. I'd guess you'd prefer it to grow from right to left instead of shrinking from left to right. I know the regex engine can do this depending on the anchors.
      I haven't checked the last point since performance might not be your biggest issue.
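The marker suggestion could look roughly like the sketch below. The helper name `pick_marker` and the particular control characters are assumptions for illustration; the only thing it checks is that the marker does not occur inside any of the given texts, and it dies instead of guessing when it does.

```perl
use strict;
use warnings;

# Hypothetical helper: return a marker built from low control characters,
# after confirming that it does not occur in any of the given texts.
sub pick_marker {
    my @texts  = @_;
    my $marker = "\x01\x02\x03";    # assumed choice; unlikely in plain text files
    for my $text (@texts) {
        die "marker occurs in input\n" if index($text, $marker) >= 0;
    }
    return $marker;
}

my $file1  = "one\ntwo\nthree\n";
my $file2  = "two\nthree\nfour\n";
my $marker = pick_marker($file1, $file2);

# Same substitution as in the thread, just with the checked marker.
print "$file1$marker$file2" =~ s/(?:\A|\n)(.*)\K\Q$marker\E\1//sr;
```

For comparison, the default value of `$;` is the single character "\034", which follows the same idea of using a control character that rarely shows up in text.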

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        > grow from right to left instead of shrinking from left to right

        This might be much faster if the overlaps are considerably smaller than the total files.

        And it avoids any semipredicate problem with $marker.°

        (Not heavily tested, please check edge-cases)

        use strict;
        use warnings;

        my $file1 = join "\n", qw( a b c d c );
        my $file2 = join "\n", qw( c d c x );

        my $content = "$file2\n$file1";

        $content =~ /^(.*)\n.*\1$/s;

        (substr $file2,0,length $1)=$file1;

        print $file2;

        a
        b
        c
        d
        c
        x

        Cheers Rolf

        °) unfortunately it doesn't; the proof is left to the interested reader
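One input that appears to trip up the back-reference approach is sketched below. The wrapper name `merge_backref` and the fallback for the no-match case are assumptions added for illustration, not part of the post. The point is that nothing forces `\1` to lie entirely inside the first text: it only has to end the combined string, so it can begin in the tail of the second text.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the back-reference approach shown above.
sub merge_backref {
    my ($first, $second) = @_;
    my $content = "$second\n$first";
    if ($content =~ /^(.*)\n.*\1$/s) {
        # Replace the matched prefix of $second with all of $first.
        substr($second, 0, length $1) = $first;
        return $second;
    }
    # Assumption: with no match, fall back to plain concatenation.
    return "$first\n$second";
}

# The example from the post: lines a b c d c x.
print merge_backref("a\nb\nc\nd\nc", "c\nd\nc\nx"), "\n";
print "---\n";

# No real overlap here, yet "a\nq" is captured because the same text
# also ends the combined string; the result is just the lines q and a.
print merge_backref("q", "a\nq\na"), "\n";
```

In the second call the first two lines of the second text are silently replaced, which is the kind of edge case worth testing before relying on this variant.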

        I added the line start anchor as I wanted to match whole lines.

        Agreed, assuming text files, a more "binary" marker is better.

        Currently I feel the regex solution is interesting, but still not my first choice. I'll dig deeper if I start profiling.
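For comparison, a non-regex version that works on lists of lines might look like the sketch below. This is not taken from any post in the thread; the sub names `overlap_length` and `merge_lists` are hypothetical, and it simply tries the longest candidate overlap first and compares element by element.

```perl
use strict;
use warnings;

# Hypothetical helper: number of elements in the longest suffix of @$first
# that equals a prefix of @$second.
sub overlap_length {
    my ($first, $second) = @_;
    my $max = scalar(@$first) < scalar(@$second) ? scalar(@$first) : scalar(@$second);
    for my $len (reverse 1 .. $max) {
        my $same = 1;
        for my $i (0 .. $len - 1) {
            if ($first->[ scalar(@$first) - $len + $i ] ne $second->[$i]) {
                $same = 0;
                last;
            }
        }
        return $len if $same;
    }
    return 0;
}

# Hypothetical helper: all of @$first followed by the non-overlapping
# remainder of @$second.
sub merge_lists {
    my ($first, $second) = @_;
    my $len = overlap_length($first, $second);
    return (@$first, @{$second}[ $len .. $#{$second} ]);
}

print join("\n", merge_lists([qw( a b c d c )], [qw( c d c x )])), "\n";
```

Working on whole elements sidesteps both the marker question and the partial-line question, at the cost of a quadratic worst case in the length of the overlap.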