partial matching of lines in perl

Sidd@786 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: partial matching of lines in perl by Corion (Patriarch) on Jun 12, 2020 at 12:11 UTC
This is a FAQ. See perlfaq4, "How do I compute the difference of two arrays?"
Re: partial matching of lines in perl by BillKSmith (Monsignor) on Jun 13, 2020 at 04:05 UTC
I prepared the following code before I realized that AnomalousMonk had already made almost the same suggestion as an advanced point. I have chosen to post it because it does not produce your expected output. Perhaps we misunderstand your requirement. `use strict; use warnings; my $file1 = \<<"END1"; he is man don't you what goes on END1 my $file2 = \<<"END2"; he is what are try to do END2 open my $h2, '<', $file2 or die "cannot open file2"; my @a2 = <$h2>; close $h2; chomp @a2; my $match = join '|', @a2; $match = qr/$match/; open my $h1, '<', $file1 or die "cannot open file1"; my @a1 = <$h1>; close $h1; print grep {$_ =~ $match} @a1;` OUTPUT: `he is man` Bill
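The code above lost its line breaks in this copy of the thread, so here is a minimal sketch of the same idea: join the lines of file2 into one alternation and `grep` file1 against it. The sample line splits and the helper name `build_matcher` are assumptions, and `quotemeta` is an addition that the posted code does not use.

```perl
use strict;
use warnings;

# Assumed sample data: the original line breaks are not recoverable,
# so these splits are a guess.
my @file1_lines = ("he is man\n", "don't you\n", "what goes on\n");
my @file2_lines = ("he is\n", "what are\n", "try to do\n");

# Hypothetical helper: build one regex that matches any of the given
# lines as a literal substring.
sub build_matcher {
    my @substrings = @_;
    chomp @substrings;
    # Skip empty lines (they would match everything) and escape
    # regex metacharacters so the data is matched literally.
    my $alternation = join '|',
        map { quotemeta($_) } grep { length($_) } @substrings;
    return qr/$alternation/;
}

my $match   = build_matcher(@file2_lines);
my @matched = grep { $_ =~ $match } @file1_lines;
print @matched;    # prints "he is man" with the assumed data
```

With the assumed data only `he is man` is printed, which agrees with the OUTPUT reported above.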
Re^2: partial matching of lines in perl by AnomalousMonk (Archbishop) on Jun 13, 2020 at 06:54 UTC
"... the following code ... does not produce your expected output. Perhaps we misunderstand your requirement." I'm also confused about Sidd@786's expected output. I can see that `'he is man'` from file1 should be output because it has `'he is'` from file2 as an exact substring. But Sidd@786 also seems to be saying in the OP that `'what goes on'` should also be output, and I don't see how that's possible given the (somewhat vaguely presented) data and my (similarly vague) understanding of the requirement. Perhaps Sidd@786 can clarify things for us. I have code solutions for both index-based and dynamic regex approaches, but I'm a bit reluctant to post because the OP has too strong a smell of homework about it. Perhaps I'll post them tomorrow. Give a man a fish: `<%-{-{-{-<`
Re^2: partial matching of lines in perl by Sidd@786 (Initiate) on Jun 15, 2020 at 08:24 UTC
Thanks for helping. Please also help me find the reverse of the same problem; that is, I want to find partially mismatched lines. The output should be: 1. don't you 2. what goes on
Re^3: partial matching of lines in perl by AnomalousMonk (Archbishop) on Jun 15, 2020 at 10:46 UTC
Here's a variation based on `index` that seems to satisfy your requirement insofar as I understand it as discussed here, here and here. Note that this solution is O(n1 n2) (the product of the number of lines in each file) because it depends on a nested loop, whereas the regex-based solution presented by BillKSmith here is O(n). Unfortunately, the regex-based solution imposes a tighter limit on the size of the substrings file that can be supported: at least several hundred, but surely no more than several thousand substring lines. The `index`-based solution, while potentially much slower, can support a few, perhaps several, million lines of substrings. (Caveat: These are all estimates.) The number of lines to be searched for substrings is unlimited with both approaches if the lines are processed line-by-line in a `while`-loop. The code below identifies both lines that match some substring and lines that do not match any substring, so comment out whichever branch of the if-else conditional you do not need. (There's also a bit of ornamental code that highlights the substring that was found.) [Code not shown: 1354 bytes behind a "Read more..." link.] Give a man a fish: `<%-{-{-{-<`
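The code from this post is not reproduced in this copy of the thread. As a stand-in, here is a minimal sketch of an `index`-based nested loop that sorts lines into those containing some substring and those containing none. The helper name `partition_by_substring` and the sample data are assumptions, and the substring highlighting the post mentions is left out.

```perl
use strict;
use warnings;

# Hypothetical helper: returns two array refs, the lines that contain at
# least one substring and the lines that contain none.
sub partition_by_substring {
    my ($lines, $substrings) = @_;
    my (@hits, @misses);
    LINE: for my $line (@$lines) {
        for my $needle (@$substrings) {
            next unless length $needle;       # an empty needle matches anything
            if (index($line, $needle) >= 0) {
                push @hits, $line;
                next LINE;                    # one match is enough for this line
            }
        }
        push @misses, $line;                  # no substring matched
    }
    return (\@hits, \@misses);
}

my @lines      = ("he is man", "don't you", "what goes on");   # assumed sample
my @substrings = ("he is", "what are", "try to do");           # assumed sample

my ($hits, $misses) = partition_by_substring(\@lines, \@substrings);
print "match:    $_\n" for @$hits;
print "no match: $_\n" for @$misses;
```

The nested loop is what makes this O(n1 n2). For large files the outer loop could read file1 line by line instead of holding it in an array.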
Re^3: partial matching of lines in perl by AnomalousMonk (Archbishop) on Jun 15, 2020 at 09:07 UTC
"... finding reverse of the same problem ... find partially mismatched lines." There is again a lack of clarity. I would define the "reverse of the same problem" as "find all lines in file1 that do not match any string in file2 as a substring." But "find partially mismatched lines" can be taken IMHO to mean "find all lines in file1 in which some part does not match any string in file2." All lines in file1 have some part that does not match anything in file2, but I doubt this is what you really mean. If I take the former of the two interpretations above as your intended requirement ("find all lines in file1 that do not match any string in file2 as a substring"), then the code provided by BillKSmith here can easily be adapted by changing the statement `print grep { $_ =~ $match } @a1;` to `print grep { $_ !~ $match } @a1;` (`!~` vice `=~`). This change produces the output you seem to be specifying here. Again, please see "How do I post a question effectively?", "How (Not) To Ask A Question" and "I know what I mean. Why don't you?" for help with asking questions more clearly: Please help us to help you. (And please do try to take a look at "Short, Self-Contained, Correct Example" and "How to ask better questions using Test::More and sample data".) Give a man a fish: `<%-{-{-{-<`
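A minimal sketch of the `!~` change described above, with the regex written out by hand. The sample lines and the contents of `$match` are assumptions about how the two files were split.

```perl
use strict;
use warnings;

# Assumed sample data for file1 and an assumed alternation built from file2.
my @a1    = ("he is man\n", "don't you\n", "what goes on\n");
my $match = qr/he is|what are|try to do/;

# Keep only the lines that contain none of the substrings.
my @unmatched = grep { $_ !~ $match } @a1;
print @unmatched;
```

With this data the two lines printed are `don't you` and `what goes on`.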
Re^4: partial matching of lines in perl by Sidd@786 (Initiate) on Jun 15, 2020 at 10:13 UTC
Re^5: partial matching of lines in perl by hippo (Bishop) on Jun 15, 2020 at 11:03 UTC
Re^5: partial matching of lines in perl by Sidd@786 (Initiate) on Jun 15, 2020 at 10:41 UTC
Some notes below your chosen depth have not been shown here
Re^3: partial matching of lines in perl by BillKSmith (Monsignor) on Jun 16, 2020 at 15:50 UTC
I cannot give you the "reverse of the same problem" because I still do not understand the original problem. I stated that the code I posted did not pass your single test case. I posted it to demonstrate my understanding of the problem. I expected you to post a clarification. You now mention "partially mismatched" lines. I cannot think of any interpretation of this phrase which is consistent with your new test case. In addition to all the suggestions from AnomalousMonk, I also tried: "Select a line from file1 if it does (not) contain any word which appears in file2" (where "word" is defined as all the text between regex word boundaries). Please post unambiguous requirements and several test cases. It is important that they all be exactly correct. Bill
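The word-based interpretation mentioned above could be sketched as follows. This is only one reading of it: the helper name `lines_sharing_a_word`, the sample data, and the use of `\w+` to define a word are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines that contain, as a whole word,
# any word found in the word-source lines.
sub lines_sharing_a_word {
    my ($lines, $word_source) = @_;
    my %seen;
    $seen{$_}++ for map { /(\w+)/g } @$word_source;   # collect distinct words
    return [] unless %seen;                           # no words, no matches
    my $alternation = join '|', map { quotemeta($_) } sort keys %seen;
    my $word_re     = qr/\b(?:$alternation)\b/;       # whole-word matches only
    return [ grep { $_ =~ $word_re } @$lines ];
}

my @file1 = ("he is man", "don't you", "what goes on");   # assumed sample
my @file2 = ("he is", "what are", "try to do");           # assumed sample

print "$_\n" for @{ lines_sharing_a_word(\@file1, \@file2) };
```

With the assumed data this selects `he is man` and `what goes on`. The line `don't you` is not selected because `do` is not a whole word inside `don't`.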
Re^4: partial matching of lines in perl by AnomalousMonk (Archbishop) on Jun 16, 2020 at 16:25 UTC
Re: partial matching of lines in perl by Sidd@786 (Initiate) on Jun 12, 2020 at 11:48 UTC
`$file = 'C:/Users/Siddharth/Desktop/file1.txt'; open(FH, $file) or die("File $file not found"); open(F2,"C:/Users/Siddharth/Desktop/file2.txt"); @a1=<FH>; foreach $a1(@a1){ while(@a2 = <F2>) { if( grep $a2 =~ /$_/,$a1) { } else { print "$a1\n"; } } }`
Re^2: partial matching of lines in perl by choroba (Cardinal) on Jun 12, 2020 at 12:37 UTC
Note that `@a2` and `$a2` are unrelated variables. Use strict and warnings to catch similar kinds of errors. `map{substr$_->[0],$_->[1]||0,1}[\||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^ARGV,3]`
Re^2: partial matching of lines in perl by AnomalousMonk (Archbishop) on Jun 12, 2020 at 15:29 UTC
`foreach $a1(@a1){ while(@a2 = <F2>) { ... } }`
Note also that with the block structure quoted above, the `F2` filehandle in the nested `while`-loop will be "exhausted" after handling the first item in the `foreach`-loop and will thereafter assign an empty list to the `@a2` array, so the body of the `while`-loop never runs again. To be useful, the `F2` filehandle would need to be rewound after each pass through the `while`-loop; see seek. (bliako has already made this point here.)
A marginally better loop nesting structure would be `while (my $line2 = <F2>) { foreach my $line1 (@a1) { print $line1 if line2_appears_in_line1($line2, $line1); } }` However, this approach still requires a complete pass through `@a1` for every line in the `F2` file, i.e., it's still O(n1 n2).
Another point. If you want to find out if a line from `file2` (always remember to chomp this line!) is exactly present within a line from `file1`, the comparison should be `if ($line1 =~ /\Q$line2_chomped\E/) { ... }` or better still (because simpler and faster) `if (index($line1, $line2_chomped) >= 0) { ... }` See index. (`index` is more appropriate here because no real regex matching seems needed, only an exact substring match.) Your code here seems to have this relationship wackbards.
A more advanced point. If file `file2` is small enough, the technique described in haukex's Building Regex Alternations Dynamically article could be used to build a single regex that could be matched against each line of file `file1` to determine which of these lines were to be printed. This approach would require only a single pass through each file, i.e., will be O(n), but will fail if `file2` is much more than several hundred (or perhaps several thousand; YMMV) lines. This approach is capable of handling an unlimited number of lines in `file1` however.
Update: Minor wording and spelling corrections.
Give a man a fish: `<%-{-{-{-<`
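To illustrate the two literal-substring tests discussed above, here is a minimal sketch that runs both on a needle containing regex metacharacters. The helper names and the sample strings are hypothetical and were chosen only to show why `\Q...\E` and `chomp` matter.

```perl
use strict;
use warnings;

# Hypothetical helpers: both report whether $needle occurs literally in $line.
sub contains_by_index {
    my ($line, $needle) = @_;
    return index($line, $needle) >= 0 ? 1 : 0;
}

sub contains_by_regex {
    my ($line, $needle) = @_;
    # \Q...\E escapes regex metacharacters such as '.' and '?'
    return $line =~ /\Q$needle\E/ ? 1 : 0;
}

my $needle = "a.b?\n";   # assumed sample line from file2, newline included
chomp $needle;           # without chomp the trailing newline must match too

for my $line ("xa.b?y", "xaXby") {
    printf "%-8s index=%d regex=%d\n", $line,
        contains_by_index($line, $needle),
        contains_by_regex($line, $needle);
}
```

Both helpers should agree on every input. Without `\Q...\E` the regex version would treat `.` and `?` as metacharacters and could report a match on the second line.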
Re^2: partial matching of lines in perl by bliako (Monsignor) on Jun 12, 2020 at 13:45 UTC
It seems it compares only the first line of `FH` with all lines from `F2`. You must either rewind `F2` each time the `while(@a2 = <F2>){}` loop exits, using `seek F2, 0, SEEK_SET;`, or read `@a2 = <F2>` once outside both loops, just like `@a1=<FH>`.
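A minimal sketch of the rewind option, using in-memory filehandles so it runs without the OP's files. The sample data is an assumption. Note that the `SEEK_SET` constant has to be imported from the core `Fcntl` module.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);    # provides the SEEK_SET constant

# Assumed sample contents standing in for file1.txt and file2.txt.
my $data1 = "he is man\ndon't you\nwhat goes on\n";
my $data2 = "he is\nwhat are\ntry to do\n";

open my $fh1, '<', \$data1 or die "cannot open file1: $!";
open my $fh2, '<', \$data2 or die "cannot open file2: $!";

my @found;
while (my $line1 = <$fh1>) {
    chomp $line1;
    # Rewind file2 so the inner loop sees all of its lines on every pass.
    seek $fh2, 0, SEEK_SET or die "cannot rewind file2: $!";
    while (my $line2 = <$fh2>) {
        chomp $line2;
        next unless length $line2;           # skip empty lines
        if (index($line1, $line2) >= 0) {
            push @found, $line1;
            last;                            # one match is enough
        }
    }
}
print "$_\n" for @found;
```

Reading file2 into an array once, outside both loops, avoids the repeated rewinding and is usually the simpler choice when the file fits in memory.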


	PerlMonks