gudmo has asked for the wisdom of the Perl Monks concerning the following question:
I have a script that is parsing through large text files
Each section in the file begins with a lower case t and ends with a capital T with lines starting with J, D or S in between.
The problem is in the lower t section. It only executes at the start and never again, and I just cannot wrap my head around the problem.
Any help is greatly appreciated.
#!/usr/bin/perl
my @lines = ('t13:45\n', 'D13:45\n', 'S13:45 Unicorn\n','D13:45\n', 'S
+13:45\n', 'T13:45\n', 't13:45\n', 'D13:45\n', 'T13:46\n','t13:45\n',
+'D13:45\n', 'S13:45\n','D13:45\n', 'S13:45 UNICORN\n', 'T13:45\n', 't
+13:45\n', 'D13:45\n', 'T13:46\n');
my $value = "unicorn";
$i = 0;
$newsection = 0;
#my $debug = "true";
print "Let's start\n" if $debug;
while (my $row = shift(@lines)) {
chomp($row);
if($newsection <= 0) {
if ($row =~ /^[t]/) {
$i++;
print "New section started\n" if $debug;
print "Inserting $row into array\n" if $debug;
push(@section,$row);
}
elsif ($row =~ /^[J]/) {
print "Section continued\n" if $debug;
push(@section,$row);
}
elsif ($row =~ /^[S]/) {
print "Section continued\n" if $debug;
push(@section,$row);
}
elsif ($row =~ /^[D]/) {
print "Section continued\n" if $debug;
push(@section,$row);
}
elsif ($row =~ /^[T]/) {
print "Section Ended\n" if $debug;
push(@section,$row);
$newsection = 1;
}
}
else {
$newsection = 0;
print "Checking for value\n" if $debug;
if ( grep( /$value/i, @section )) {
print "Value discovered, saving section\n" if $debug;
foreach (@section) {
print "$_\n";
}
}
@section = ();
undef(@section);
}
}
exit;
Re: Cannot find the error
by choroba (Cardinal) on Apr 03, 2020 at 12:58 UTC
|
The check for a new section start is in the $row =~ /^[t]/
condition (the regex is equivalent to /^t/, BTW). It's located inside the YES-branch of the
if ($newsection <= 0)
but the end of the previous section sets $newsection to 1, so this condition can't be reached.
I'd move the /^t/ condition into the ELSE part of the outermost if, and initialize $newsection with 1 at the very start.
Also, I miss strict and warnings.
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
| [reply] [d/l] [select] |
|
It's just a cut down version of the real script, that's why there are no strict or warnings.
But anyhow. The ELSE part of the outermost is only to review the section saved and if the value is found in the section then do the output.
The script is supposed to look at each line, starting with t and until it reaches T and save all that into @section.
When it finds T it goes to the start with $newsection = 1 and then goes to ELSE to check if the section should be saved.
It then clears the @section array and goes to the start of the loop.
But why it only checks the ^t at start I don't known, and it doesn't seem to matter if I put it second as an elsif.
I am convinced now that everytime the section is reset and the loop starts to write into the array the first iteration is lost.
| [reply] |
|
> everytime the section is reset and the loop starts to write into the array the first iteration is lost
It's not lost, it's missed by your script. Every iteration of the loop reads one line from the array. Even the execution of the outermost "else" part processes one line - and it's supposed to be the /^t/ line. That's why I suggested to move the check there.
Another option would be to add a
redo
as the last command in the outermost else, to restart the loop without reading a line.
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
| [reply] [d/l] [select] |
|
|
Re: Cannot find the error
by hippo (Bishop) on Apr 03, 2020 at 16:00 UTC
|
Since I'm feeling generous and it's late on a Friday afternoon so nobody is touching prod, here's a tidy-up and fix for you. The steps were as follows:
- Add strict and warnings. Just because this was originally a section of longer code doesn't mean you cannot or should not use them here.
- Fix up any contraventions of strict arising.
- Remove unreferenced variables.
- Shorten the code with postfix loops and de-duplicated elif branches. Simply the regexen while there.
- Run it through perltidy to fix the indenting.
- At this point it's clear where the problem is. Apply a simple fix (which is basically a version of choroba's solution).
#!/usr/bin/perl
use strict;
use warnings;
my @lines = (
't13:45\n', 'D13:45\n',
'S13:45 Unicorn\n', 'D13:45\n',
'S13:45\n', 'T13:45\n',
't13:45\n', 'D13:45\n',
'T13:46\n', 't13:45\n',
'D13:45\n', 'S13:45\n',
'D13:45\n', 'S13:45 UNICORN\n',
'T13:45\n', 't13:45\n',
'D13:45\n', 'T13:46\n'
);
my $value = "unicorn";
my $newsection = 0;
my $debug = 0; # Set to 1 for verbosity
my @section;
print "Let's start\n" if $debug;
while (my $row = shift (@lines)) {
if ($newsection <= 0) {
if ($row =~ /^t/) {
print "New section started\n" if $debug;
print "Inserting $row into array\n" if $debug;
push (@section, $row);
} elsif ($row =~ /^[JSD]/) {
print "Section continued\n" if $debug;
push (@section, $row);
} elsif ($row =~ /^T/) {
print "Section Ended\n" if $debug;
push (@section, $row);
$newsection = 1;
}
}
if ($newsection > 0) {
$newsection = 0;
print "Checking for value\n" if $debug;
if (grep (/$value/i, @section)) {
print "Value discovered, saving section\n" if $debug;
print "$_\n" for @section;
}
@section = ();
}
}
Running this we get:
t13:45\n
D13:45\n
S13:45 Unicorn\n
D13:45\n
S13:45\n
T13:45\n
t13:45\n
D13:45\n
S13:45\n
D13:45\n
S13:45 UNICORN\n
T13:45\n
... which I can only presume is what you expect the output to be. Enjoy. | [reply] [d/l] [select] |
Re: Cannot find the error
by Athanasius (Archbishop) on Apr 03, 2020 at 14:23 UTC
|
Hello gudmo,
...
#my $debug = "true";
print "Let's start\n" if $debug;
...
A better way to do this is with the constant pragma, which plays well with strict and still gives you the benefits of conditional compilation:
#!/usr/bin/perl
use strict;
use warnings;
use constant DEBUG => 0;
...
print "Let's start\n" if DEBUG;
...
Update: Fixed typo: missing constant in the definition of DEBUG (!) Thanks, pryrt.
Hope that’s of interest,
| [reply] [d/l] [select] |
Re: Cannot find the error (updated)
by AnomalousMonk (Archbishop) on Apr 03, 2020 at 18:51 UTC
|
Another way. I've "corrected" all the presumably incorrect 't13:45\n' lines to "t13:45\n" (single-quotes to double quotes so the \n escape means something) and added a blank line to section output for looks.
use strict;
use warnings;
my @lines = (
"t13:45\n", "D13:45\n", "S13:45 Unicorn\n", "D13:45\n", "S13:45\n"
+, "T13:45\n",
"t13:45\n", "D13:45\n", "T13:46\n",
"t13:45\n", "D13:45\n", "S13:45\n","D13:45\n", "S13:45 UNICORN\n",
+ "T13:45\n",
"t13:45\n", "D13:45\n", "T13:46\n",
);
use constant GOOD_LINE => qr{ \A [tTJSD] }xms;
use constant END_SECTION => qr{ \A T }xms;
my $value = "uNiCoRn";
my @section;
LINE:
while (my $line = shift @lines) {
die "bad line '$line'" unless $line =~ GOOD_LINE;
push @section, $line;
next LINE unless $line =~ END_SECTION;
if (grep m{ (?i) \Q$value\E }xms, @section) {
print for @section;
print "\n";
}
@section = ();
}
__END__
t13:45
D13:45
S13:45 Unicorn
D13:45
S13:45
T13:45
t13:45
D13:45
S13:45
D13:45
S13:45 UNICORN
T13:45
The grep in the section-printing conditional at the end of the loop is a little wasteful because it will continue matching after a match is found. See List::Util::any() for an alternative that may be better (because it short-circuits) if the number of lines in a section is very large or matching is very expensive.
Update: It occurs to me that you may want a strict ordering of section markers so that a 't...' line only occurs at the start of a section. If that's the case, define a section start regex
use constant START_SECTION => qr{ \A t }xms;
and add this die statement
die "start line '$line' not at start of section"
if $line =~ START_SECTION and @section;
just before the
push @section, $line;
statement. That oughta fix it. (Update: And if you want a really strict strict ordering, let me know; can do, will do.)
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
Re: Cannot find the error
by tybalt89 (Monsignor) on Apr 03, 2020 at 19:37 UTC
|
#!/usr/bin/perl
use strict; # https://perlmonks.org/?node_id=11114988
use warnings;
my @lines = ("t13:45\n", "D13:45\n", "S13:45 Unicorn\n","D13:45\n",
"S13:45\n", "T13:45\n", "t13:45\n", "D13:45\n", "T13:46\n","t13:45\n
+",
"D13:45\n", "S13:45\n","D13:45\n", "S13:45 UNICORN\n", "T13:45\n",
"t13:45\n", "D13:45\n", "T13:46\n");
my $value = 'unicorn';
my @sections = grep /$value/i, (join '', @lines) =~ /^t.*?\nT.*?\n/gms
+;
print join "\n", @sections; # so there is a gap between sections
Outputs:
t13:45
D13:45
S13:45 Unicorn
D13:45
S13:45
T13:45
t13:45
D13:45
S13:45
D13:45
S13:45 UNICORN
T13:45
| [reply] [d/l] [select] |
Re: Cannot find the error
by jwkrahn (Abbot) on Apr 03, 2020 at 20:51 UTC
|
I have a script that is parsing through large text files
I rewrote it, assuming a text file.
#!/usr/bin/perl
use warnings;
use strict;
my $file = <<FILE;
t13:45
D13:45
S13:45 Unicorn
D13:45
S13:45
T13:45
t13:45
D13:45
T13:46
t13:45
D13:45
S13:45
D13:45
S13:45 UNICORN
T13:45
t13:45
D13:45
T13:46
FILE
my $value = 'unicorn';
open my $FH, '<', \$file or die "Cannot open 'file' because: $!";
my $section = '';
while ( <$FH> ) {
if ( /^t/ .. /^T/ ) {
$section .= $_;
}
if ( /^T/ ) {
print $section if $section =~ /$value/i;
$section = '';
}
}
Debug statements are left as an exercise for the OP.
| [reply] [d/l] |
|
|