PerlMonks  

partial matching of lines in perl

by Sidd@786 (Initiate)
on Jun 12, 2020 at 11:44 UTC

Sidd@786 has asked for the wisdom of the Perl Monks concerning the following question:

We have two files. If some lines of file1 contain some lines of file2, then we should print those lines of file1.

file1 contains:
1. he is man
2. don't you
3. what goes on

Another file contains:
1. he is
2. what are
3. try to do

Output should be:
he is man
what goes on

Replies are listed 'Best First'.
Re: partial matching of lines in perl
by Corion (Patriarch) on Jun 12, 2020 at 12:11 UTC
Re: partial matching of lines in perl
by BillKSmith (Monsignor) on Jun 13, 2020 at 04:05 UTC
    I prepared the following code before I realized that AnomalousMonk had already made almost the same suggestion as an advanced point. I have chosen to post it because it does not produce your expected output. Perhaps we misunderstand your requirement.
        use strict;
        use warnings;
        my $file1 = \<<"END1";
        he is man
        don't you
        what goes on
        END1
        my $file2 = \<<"END2";
        he is
        what are
        try to do
        END2
        open my $h2, '<', $file2 or die "cannot open file2";
        my @a2 = <$h2>;
        close $h2;
        chomp @a2;
        my $match = join '|', @a2;
        $match = qr/$match/;
        open my $h1, '<', $file1 or die "cannot open file1";
        my @a1 = <$h1>;
        close $h1;
        print grep {$_ =~ $match} @a1;

    OUTPUT:

    he is man
    Bill
      ... the following code ... does not produce your expected output. Perhaps we misunderstand your requirement.

      I'm also confused about Sidd@786's expected output. I can see that 'he is man' from file1 should be output because it has 'he is' from file2 as an exact substring. But Sidd@786 also seems to be saying in the OP that 'what goes on' should also be output, and I don't see how that's possible given the (somewhat vaguely presented) data and my (similarly vague) understanding of the requirement. Perhaps Sidd@786 can clarify things for us.

      I have code solutions for both index-based and dynamic regex approaches, but I'm a bit reluctant to post because the OP has too strong a smell of homework about it. Perhaps I'll post them tomorrow.


      Give a man a fish:  <%-{-{-{-<

      Thanks for helping. Also, please help me in finding reverse of the same problem, or I am willing to find partially mismatched lines. Output should be:
      1. don't you
      2. what goes on

        Here's a variation based on index that seems to satisfy your requirement insofar as I understand it as discussed here, here and here.

        Note that this solution is O(n1 * n2) (the product of the number of lines in each file) because it depends on a nested loop, whereas the regex-based solution presented by BillKSmith here is O(n). Unfortunately, the regex-based solution imposes a tighter limit on the size of the substrings file that can be supported: at least several hundred, but surely no more than several thousand substring lines. The index-based solution, while potentially much slower, can support a few, perhaps several, million lines of substrings. (Caveat: These are all estimates.) The number of lines to be searched for substrings is unlimited with both approaches if the lines are processed line-by-line in a while-loop. The code below identifies both lines that match some substring and lines that do not match any substring, so comment out whichever branch of the if-else conditional you do not need. (There's also a bit of ornamental code that highlights the substring that was found.)
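The code referred to above is not reproduced here. As a stand-in, the following is a minimal sketch of an index-based approach, not the original poster's code. The sub names (first_substring_in, classify_lines), the in-memory sample arrays, and the bracket-style highlighting are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for file2 (substrings) and file1 (lines).
my @substrings = ('he is', 'what are', 'try to do');
my @lines      = ('he is man', "don't you", 'what goes on');

# Return the first substring found in $line, or undef if none is found.
sub first_substring_in {
    my ($line, @subs) = @_;
    for my $sub (@subs) {
        next unless length $sub;                 # an empty string would match every line
        return $sub if index($line, $sub) >= 0;  # exact substring test, no regex involved
    }
    return;
}

# Split lines into those containing some substring and those containing none.
# This is the nested loop that makes the approach O(n1 * n2).
sub classify_lines {
    my ($lines, $subs) = @_;
    my (@matched, @unmatched);
    for my $line (@$lines) {
        my $found = first_substring_in($line, @$subs);
        if (defined $found) {
            # Ornamental: bracket the first occurrence of the substring.
            my $shown = $line;
            substr($shown, index($line, $found), length($found), "[$found]");
            push @matched, $shown;
        }
        else {
            push @unmatched, $line;
        }
    }
    return (\@matched, \@unmatched);
}

my ($matched, $unmatched) = classify_lines(\@lines, \@substrings);
print "matched:   $_\n" for @$matched;
print "unmatched: $_\n" for @$unmatched;
```

With the sample data from this thread, only 'he is man' lands in the matched list; the other two lines are reported as unmatched. For large inputs, the lines of file1 could be read one at a time from a filehandle instead of being held in an array.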


        Give a man a fish:  <%-{-{-{-<

        ... finding reverse of the same problem ... find partially mismatched lines.

        There is again a lack of clarity. I would define the "reverse of the same problem" as "find all lines in file1 that do not match any string in file2 as a substring." But "find partially mismatched lines" can be taken IMHO to mean "find all lines in file1 in which some part does not match any string in file2." All lines in file1 have some part that does not match anything in file2, but I doubt this is what you really mean.

        If I take the former of the two interpretations above as your intended requirement ("find all lines in file1 that do not match any string in file2 as a substring"), then the code provided by BillKSmith here can easily be adapted by changing the statement
            print grep { $_ =~ $match } @a1;
        to
            print grep { $_ !~ $match } @a1;
        (!~ vice =~). This change produces the output you seem to be specifying here.

        Again, please see How do I post a question effectively?, How (Not) To Ask A Question and I know what I mean. Why don't you? for help with asking questions more clearly: Please help us to help you. (And please do try to take a look at Short, Self-Contained, Correct Example and How to ask better questions using Test::More and sample data.)


        Give a man a fish:  <%-{-{-{-<

        I cannot give you the "reverse of the same problem" because I still do not understand the original problem. I stated that the code I posted did not pass your single test case. I posted it to demonstrate my understanding of the problem. I expected you to post a clarification.

        You now mention "Partially mismatched" lines. I cannot think of any interpretation of this phrase which is consistent with your new test case. In addition to all the suggestions from AnomalousMonk, I also tried: "Select a line from file1 if it does (not) contain any word which appears in file2" (where "word" is defined as all the text between regex word boundaries).
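For readers who want to try the word-based interpretation mentioned above, here is a minimal sketch. It is not the code that was actually tried; the sub names are hypothetical, and a "word" is assumed to be a run of \w characters, so "don't" counts as the two words "don" and "t".

```perl
use strict;
use warnings;

# Collect every word (run of \w characters) found in the given lines.
sub word_set {
    my %seen;
    for my $line (@_) {
        $seen{$_} = 1 for $line =~ /(\w+)/g;
    }
    return \%seen;
}

# Count the words of $line that also occur in the set; 0 means no shared word.
sub shared_word_count {
    my ($line, $words) = @_;
    my @shared = grep { $words->{$_} } $line =~ /(\w+)/g;
    return scalar @shared;
}

# Sample data from the thread, held in arrays rather than files.
my @file1 = ('he is man', "don't you", 'what goes on');
my @file2 = ('he is', 'what are', 'try to do');
my $words = word_set(@file2);

for my $line (@file1) {
    my $verdict = shared_word_count($line, $words) ? 'shares a word' : 'shares no word';
    print "$line: $verdict\n";
}
```

On the sample data, 'he is man' and 'what goes on' each share a word with file2, while "don't you" shares none. The negated selection therefore yields only "don't you", which does not reproduce the new test case either.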

        Please post unambiguous requirements and several test cases. It is important that they all be exactly correct.

        Bill
Re: partial matching of lines in perl
by Sidd@786 (Initiate) on Jun 12, 2020 at 11:48 UTC
    $file = 'C:/Users/Siddharth/Desktop/file1.txt';
    open(FH, $file) or die("File $file not found");
    open(F2,"C:/Users/Siddharth/Desktop/file2.txt");
    @a1=<FH>;
    foreach $a1(@a1){
        while(@a2 = <F2>) {
            if( grep $a2 =~ /$_/,$a1) {
            }
            else {
                print "$a1\n";
            }
        }
    }
      Note that @a2 and $a2 are unrelated variables. Use strict and warnings to catch similar kinds of errors.

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
      foreach $a1(@a1){ while(@a2 = <F2>) { ... } }

      Note also that with the block structure quoted above, the F2 filehandle in the nested while-loop will be "exhausted" after handling the first item in the foreach-loop. Thereafter, reading from F2 in list context returns an empty list, so the while-loop body is never entered again. To be useful, the F2 filehandle would need to be rewound before each new pass of the foreach-loop; see seek. (bliako has already made this point here.)

      A marginally better loop nesting structure would be

          while (my $line2 = <F2>) {
              foreach my $line1 (@a1) {
                  print $line1 if line2_appears_in_line1($line2, $line1);
              }
          }
      However, this approach still requires a complete pass through @a1 for every line in the F2 file, i.e., it's still O(n1 * n2).

      Another point. If you want to find out if a line from file2 (always remember to chomp this line!) is exactly present within a line from file1, the comparison should be
          if ($line1 =~ /\Q$line2_chomped\E/) { ... }
      or better still (because simpler and faster)
          if (index($line1, $line2_chomped) >= 0) { ... }
      See index. (index is more appropriate here because no real regex matching seems needed, only an exact substring match.) Your code here seems to have this relationship wackbards.
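To make the difference concrete, here is a small sketch comparing the two tests on a substring that contains a regex metacharacter. The sub names and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Exact substring test with index: no regex metacharacters to worry about.
sub contains_by_index {
    my ($line1, $line2_chomped) = @_;
    return index($line1, $line2_chomped) >= 0 ? 1 : 0;
}

# Same test with a regex; \Q...\E quotes metacharacters in the substring.
sub contains_by_regex {
    my ($line1, $line2_chomped) = @_;
    return $line1 =~ /\Q$line2_chomped\E/ ? 1 : 0;
}

my $line1 = "the sum 1+1 is two\n";
my $line2 = "1+1\n";
chomp(my $line2_chomped = $line2);   # always chomp the line taken from file2

print "index:    ", (contains_by_index($line1, $line2_chomped) ? "yes" : "no"), "\n";
print "regex:    ", (contains_by_regex($line1, $line2_chomped) ? "yes" : "no"), "\n";

# Without \Q...\E the "+" acts as a quantifier, so /1+1/ looks for "11", not "1+1".
print "unquoted: ", ($line1 =~ /$line2_chomped/ ? "yes" : "no"), "\n";
```

Both quoted forms report a match here, while the unquoted regex does not, which is one more reason to prefer index when only an exact substring test is needed.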

      A more advanced point. If file2 is small enough, the technique described in haukex's Building Regex Alternations Dynamically article could be used to build a single regex that could be matched against each line of file1 to determine which of these lines are to be printed. This approach would require only a single pass through each file, i.e., it will be O(n), but it will fail if file2 is much more than several hundred (or perhaps several thousand; YMMV) lines. This approach is, however, capable of handling an unlimited number of lines in file1.
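A minimal sketch of this single-regex idea follows. It is written in the spirit of the article rather than copied from it; the sub names and the in-memory filehandle are assumptions, and quoting with quotemeta and trying longer strings first are precautions added here.

```perl
use strict;
use warnings;

# Build a single regex that matches any of the given literal strings.
# Longer strings come first so a short string cannot shadow a longer one.
sub build_alternation {
    my @strings = grep { length } @_;
    return qr/(?!)/ unless @strings;   # no strings: a regex that never matches
    my $alt = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) or $a cmp $b } @strings;
    return qr/$alt/;
}

# Read file1 line by line from an open handle and collect the matching lines.
sub matching_lines {
    my ($fh, $regex) = @_;
    my @out;
    while (my $line = <$fh>) {
        push @out, $line if $line =~ $regex;
    }
    return @out;
}

# Sample data from the thread; file1 is read through an in-memory filehandle.
my $file1 = "he is man\ndon't you\nwhat goes on\n";
my @file2 = ('he is', 'what are', 'try to do');

open my $fh, '<', \$file1 or die "cannot open in-memory file1: $!";
print matching_lines($fh, build_alternation(@file2));
close $fh;
```

Because file1 is consumed in a while-loop, only the regex built from file2 has to fit in memory. Replacing =~ with !~ in matching_lines would select the lines that contain none of the strings.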

      Update: Minor wording and spelling corrections.


      Give a man a fish:  <%-{-{-{-<

      It seems it compares only the first line of FH with all lines from F2. You must either rewind F2 each time the while(@a2 = <F2>){} loop exits, using seek F2, 0, SEEK_SET; (the SEEK_SET constant is available from the Fcntl module), or read @a2 = <F2> once outside both loops, just like @a1=<FH>.
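The rewinding option can be sketched as follows. This is an illustration only: the sub name is hypothetical, in-memory filehandles stand in for the two files, and lexical filehandles are used instead of the bareword FH and F2.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);   # provides the SEEK_SET constant

# Return the lines of the first text that contain some line of the second text.
sub matching_lines_with_seek {
    my ($text1, $text2) = @_;
    open my $fh1, '<', \$text1 or die "cannot open file1 data: $!";
    open my $fh2, '<', \$text2 or die "cannot open file2 data: $!";
    my @out;
    while (my $line1 = <$fh1>) {
        # Rewind file2 so the inner loop sees all of its lines again.
        seek($fh2, 0, SEEK_SET) or die "cannot rewind file2 data: $!";
        while (my $line2 = <$fh2>) {
            chomp $line2;
            next unless length $line2;        # skip blank lines
            if (index($line1, $line2) >= 0) {
                push @out, $line1;
                last;                         # one match is enough for this line
            }
        }
    }
    return @out;
}

# Sample data from the thread.
my $data1 = "he is man\ndon't you\nwhat goes on\n";
my $data2 = "he is\nwhat are\ntry to do\n";
print matching_lines_with_seek($data1, $data2);
```

Reading file2 into an array once, outside both loops, avoids the repeated seeks and is usually the simpler choice when file2 fits in memory.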

Node Type: perlquestion [id://11117975]
Approved by mhearse