PeachCityDude has asked for the wisdom of the Perl Monks concerning the following question:
I'm trying to search a large number of files (file1, file2, etc.). I want to search each file for a different name (Joe, Lee, etc.). I want to see if "Joe" appears in "file1" and "Lee" appears in "file2". I begin with a text file (array.txt) that contains the files and names and put this into an array.
array.txt:
file1 Joe
file2 Lee
... ...
Then, I use "foreach" to work through the items in the array. In the following code, the files "file1" and "file2" will open with each iteration of the "foreach" loop. However, the $name variable in the "if" statement doesn't seem to loop through the names (Joe, Lee) but rather stays on the first name (Joe). Therefore, both "file1" and "file2" are searched for "Joe". I only want to search "file1" for "Joe" then search "file2" for "Lee".
Any help is much appreciated!
#!/usr/bin/perl -w
use strict;
use warnings;
use diagnostics;
my @filelist = ();
my $file='';
my $name='';
open(FILE,"array.txt")||die "Can't open array.txt";
@filelist=<FILE>;
close(FILE);
foreach my $line (@filelist){
my $file=substr($line,0,5);
chomp $file;
my $name=substr($line,6,3);
chomp $name;
open (FILE2,"$file");
while ( <FILE2>) {
if (/$name/ismo) {
print ("$name found in $file\n");
}
}
close(FILE2);
}
Re: Passing variable to if statement
by AppleFritter (Vicar) on Jul 16, 2014 at 20:16 UTC
|
if (/$name/ismo) {
Remove the /o ("compile only once") modifier to your regex, and it'll work just fine.
Here's what Regexp Quote Like Operators has to say on /o:
PATTERN may contain variables, which will be interpolated every time the pattern search is evaluated, except for when the delimiter is a single quote. (Note that $( , $) , and $| are not interpolated because they look like end-of-string tests.) Perl will not recompile the pattern unless an interpolated variable that it contains changes. You can force Perl to skip the test and never recompile by adding a /o (which stands for "once") after the trailing delimiter. Once upon a time, Perl would recompile regular expressions unnecessarily, and this modifier was useful to tell it not to do so, in the interests of speed. But now, the only reasons to use /o are either:
The variables are thousands of characters long and you know that they don't change, and you need to wring out the last little bit of speed by having Perl skip testing for that. (There is a maintenance penalty for doing this, as mentioning /o constitutes a promise that you won't change the variables in the pattern. If you do change them, Perl won't even notice.)
you want the pattern to use the initial values of the variables regardless of whether they change or not. (But there are saner ways of accomplishing this than using /o.)
If the pattern contains embedded code, such as
use re 'eval';
$code = 'foo(?{ $x })';
/$code/
then perl will recompile each time, even though the pattern string hasn't changed, to ensure that the current value of $x is seen each time. Use /o if you want to avoid this.
The bottom line is that using /o is almost never a good idea.
Emphasis mine.
| [reply] [d/l] [select] |
|
For that matter, remove the /sm, too. They're pointless here.
| [reply] [d/l] |
Re: Passing variable to if statement
by PerlSufi (Friar) on Jul 16, 2014 at 20:19 UTC
|
In addition to what AppleFritter said, use the best practice way of file handling:
use strict;
use warnings;
use autodie;
my $file = 'array.txt';
open(my $fh, '<', $file);
| [reply] [d/l] |
|
use strict;
use warnings;
use autodie qw( open close );
binmode STDOUT, ':encoding(UTF-8)'; # Or ASCII, Latin-1, etc.
my $file = 'array.txt';
open my $fh, '<:encoding(UTF-8)', $file;
# ...
close $fh;
Better…
use strict;
use warnings;
use autodie qw( open close );
use open qw( :encoding(UTF-8) :std ); # Or ASCII, Latin-1, etc.
my $file = 'array.txt';
open my $fh, '<', $file;
# ...
close $fh;
| [reply] [d/l] [select] |
|
I think OP's original code is fine in this regard. I'd say autodie is helpful for those who know all the things that it does, or e.g. for dropping into legacy scripts that don't do enough error checking. But, as I've discovered, it can also cause problems when it's used, and its bug trackers (1, 2) show I'm not the only one. Don't get me wrong, it's a good recommendation, I just disagree that it's a "best practice" - it's just MTOWTDI.
| [reply] |
|
| [reply] [d/l] [select] |
|
|
Thank you for the tip on file handling.
| [reply] |
Re: Passing variable to if statement
by neilwatson (Priest) on Jul 16, 2014 at 20:19 UTC
|
Immediate guess the the 'once' in your regex /$name/ismo. That means compile regex only once because I do not expect it change. But it does :)
| [reply] |
Re: Passing variable to if statement
by PeachCityDude (Novice) on Jul 16, 2014 at 20:26 UTC
|
I removed the "once" modifier and now the code works perfectly. Thank you very much! | [reply] |
|
I removed the "once" modifier and now the code works perfectly.
I don't think the code is working perfectly. If any of the names in array.txt have regular expression metacharacters in them, then the regular expression pattern you're using is unlikely to match the text as you expect it to.
See perldoc -f quotemeta.
| [reply] |
Re: Passing variable to if statement
by Jim (Curate) on Jul 17, 2014 at 00:17 UTC
|
use strict;
use warnings;
use autodie qw( open close );
use open qw( :encoding(UTF-8) :std );
my %terms_in; # Hash of hash of terms in each file
my $terms_file = 'array.txt';
open my $fh1, '<', $terms_file;
while (my $record = <$fh1>) {
my ($file, $term) = $record =~ m/^(\S+)\s+(\S+)/;
$terms_in{ $file }{ $term } = 1;
}
close $fh1;
FILE:
for my $file (keys %terms_in) {
open my $fh2, '<', $file;
while (my $line = <$fh2>) {
for my $term (keys %$terms_in{ $file }) {
if (index $line, $term) {
print "Term $term found in file $file\n";
# Don't search terms that have already been found...
delete $terms_in{ $file }{ $term };
}
}
# Don't keep searching file once all terms have been found...
if (scalar keys %$terms_in{ $file } == 0) {
close $fh2;
next FILE;
}
}
close $fh2;
}
# At this point, what remains in %terms_in are all the terms that
# were not found in their respective files, which might be useful.
exit 0;
| [reply] [d/l] |
|
|