Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

Cannot find the error

by gudmo (Novice)
on Apr 03, 2020 at 12:30 UTC ( [id://11114988]=perlquestion: print w/replies, xml ) Need Help??

gudmo has asked for the wisdom of the Perl Monks concerning the following question:

I have a script that is parsing through large text files Each section in the file begins with a lower case t and ends with a capital T with lines starting with J, D or S in between. The problem is in the lower t section. It only executes at the start and never again, and I just cannot wrap my head around the problem. Any help is greatly appreciated.
#!/usr/bin/perl my @lines = ('t13:45\n', 'D13:45\n', 'S13:45 Unicorn\n','D13:45\n', 'S +13:45\n', 'T13:45\n', 't13:45\n', 'D13:45\n', 'T13:46\n','t13:45\n', +'D13:45\n', 'S13:45\n','D13:45\n', 'S13:45 UNICORN\n', 'T13:45\n', 't +13:45\n', 'D13:45\n', 'T13:46\n'); my $value = "unicorn"; $i = 0; $newsection = 0; #my $debug = "true"; print "Let's start\n" if $debug; while (my $row = shift(@lines)) { chomp($row); if($newsection <= 0) { if ($row =~ /^[t]/) { $i++; print "New section started\n" if $debug; print "Inserting $row into array\n" if $debug; push(@section,$row); } elsif ($row =~ /^[J]/) { print "Section continued\n" if $debug; push(@section,$row); } elsif ($row =~ /^[S]/) { print "Section continued\n" if $debug; push(@section,$row); } elsif ($row =~ /^[D]/) { print "Section continued\n" if $debug; push(@section,$row); } elsif ($row =~ /^[T]/) { print "Section Ended\n" if $debug; push(@section,$row); $newsection = 1; } } else { $newsection = 0; print "Checking for value\n" if $debug; if ( grep( /$value/i, @section )) { print "Value discovered, saving section\n" if $debug; foreach (@section) { print "$_\n"; } } @section = (); undef(@section); } } exit;

Replies are listed 'Best First'.
Re: Cannot find the error
by choroba (Cardinal) on Apr 03, 2020 at 12:58 UTC
    The check for a new section start is in the
    $row =~ /^[t]/
    condition (the regex is equivalent to /^t/, BTW). It's located inside the YES-branch of the
    if ($newsection <= 0)
    but the end of the previous section sets $newsection to 1, so this condition can't be reached.

    I'd move the /^t/ condition into the ELSE part of the outermost if, and initialize $newsection with 1 at the very start.

    Also, I miss strict and warnings.

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
      It's just a cut down version of the real script, that's why there are no strict or warnings. But anyhow. The ELSE part of the outermost is only to review the section saved and if the value is found in the section then do the output. The script is supposed to look at each line, starting with t and until it reaches T and save all that into @section. When it finds T it goes to the start with $newsection = 1 and then goes to ELSE to check if the section should be saved. It then clears the @section array and goes to the start of the loop. But why it only checks the ^t at start I don't known, and it doesn't seem to matter if I put it second as an elsif. I am convinced now that everytime the section is reset and the loop starts to write into the array the first iteration is lost.
        > everytime the section is reset and the loop starts to write into the array the first iteration is lost

        It's not lost, it's missed by your script. Every iteration of the loop reads one line from the array. Even the execution of the outermost "else" part processes one line - and it's supposed to be the /^t/ line. That's why I suggested to move the check there.

        Another option would be to add a

        redo
        as the last command in the outermost else, to restart the loop without reading a line.

        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Cannot find the error
by hippo (Bishop) on Apr 03, 2020 at 16:00 UTC

    Since I'm feeling generous and it's late on a Friday afternoon so nobody is touching prod, here's a tidy-up and fix for you. The steps were as follows:

    • Add strict and warnings. Just because this was originally a section of longer code doesn't mean you cannot or should not use them here.
    • Fix up any contraventions of strict arising.
    • Remove unreferenced variables.
    • Shorten the code with postfix loops and de-duplicated elif branches. Simply the regexen while there.
    • Run it through perltidy to fix the indenting.
    • At this point it's clear where the problem is. Apply a simple fix (which is basically a version of choroba's solution).
    #!/usr/bin/perl use strict; use warnings; my @lines = ( 't13:45\n', 'D13:45\n', 'S13:45 Unicorn\n', 'D13:45\n', 'S13:45\n', 'T13:45\n', 't13:45\n', 'D13:45\n', 'T13:46\n', 't13:45\n', 'D13:45\n', 'S13:45\n', 'D13:45\n', 'S13:45 UNICORN\n', 'T13:45\n', 't13:45\n', 'D13:45\n', 'T13:46\n' ); my $value = "unicorn"; my $newsection = 0; my $debug = 0; # Set to 1 for verbosity my @section; print "Let's start\n" if $debug; while (my $row = shift (@lines)) { if ($newsection <= 0) { if ($row =~ /^t/) { print "New section started\n" if $debug; print "Inserting $row into array\n" if $debug; push (@section, $row); } elsif ($row =~ /^[JSD]/) { print "Section continued\n" if $debug; push (@section, $row); } elsif ($row =~ /^T/) { print "Section Ended\n" if $debug; push (@section, $row); $newsection = 1; } } if ($newsection > 0) { $newsection = 0; print "Checking for value\n" if $debug; if (grep (/$value/i, @section)) { print "Value discovered, saving section\n" if $debug; print "$_\n" for @section; } @section = (); } }

    Running this we get:

    t13:45\n D13:45\n S13:45 Unicorn\n D13:45\n S13:45\n T13:45\n t13:45\n D13:45\n S13:45\n D13:45\n S13:45 UNICORN\n T13:45\n

    ... which I can only presume is what you expect the output to be. Enjoy.

Re: Cannot find the error
by Athanasius (Archbishop) on Apr 03, 2020 at 14:23 UTC

    Hello gudmo,

    ... #my $debug = "true"; print "Let's start\n" if $debug; ...

    A better way to do this is with the constant pragma, which plays well with strict and still gives you the benefits of conditional compilation:

    #!/usr/bin/perl use strict; use warnings; use constant DEBUG => 0; ... print "Let's start\n" if DEBUG; ...

    Update: Fixed typo: missing constant in the definition of DEBUG (!) Thanks, pryrt.

    Hope that’s of interest,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Cannot find the error (updated)
by AnomalousMonk (Archbishop) on Apr 03, 2020 at 18:51 UTC

    Another way. I've "corrected" all the presumably incorrect  't13:45\n' lines to  "t13:45\n" (single-quotes to double quotes so the  \n escape means something) and added a blank line to section output for looks.

    use strict; use warnings; my @lines = ( "t13:45\n", "D13:45\n", "S13:45 Unicorn\n", "D13:45\n", "S13:45\n" +, "T13:45\n", "t13:45\n", "D13:45\n", "T13:46\n", "t13:45\n", "D13:45\n", "S13:45\n","D13:45\n", "S13:45 UNICORN\n", + "T13:45\n", "t13:45\n", "D13:45\n", "T13:46\n", ); use constant GOOD_LINE => qr{ \A [tTJSD] }xms; use constant END_SECTION => qr{ \A T }xms; my $value = "uNiCoRn"; my @section; LINE: while (my $line = shift @lines) { die "bad line '$line'" unless $line =~ GOOD_LINE; push @section, $line; next LINE unless $line =~ END_SECTION; if (grep m{ (?i) \Q$value\E }xms, @section) { print for @section; print "\n"; } @section = (); } __END__ t13:45 D13:45 S13:45 Unicorn D13:45 S13:45 T13:45 t13:45 D13:45 S13:45 D13:45 S13:45 UNICORN T13:45
    The grep in the section-printing conditional at the end of the loop is a little wasteful because it will continue matching after a match is found. See List::Util::any() for an alternative that may be better (because it short-circuits) if the number of lines in a section is very large or matching is very expensive.

    Update: It occurs to me that you may want a strict ordering of section markers so that a  't...' line only occurs at the start of a section. If that's the case, define a section start regex
        use constant START_SECTION => qr{ \A t }xms;
    and add this die statement
        die "start line '$line' not at start of section"
            if $line =~ START_SECTION and @section;
    just before the
        push @section, $line;
    statement. That oughta fix it. (Update: And if you want a really strict strict ordering, let me know; can do, will do.)


    Give a man a fish:  <%-{-{-{-<

Re: Cannot find the error
by tybalt89 (Monsignor) on Apr 03, 2020 at 19:37 UTC
    #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11114988 use warnings; my @lines = ("t13:45\n", "D13:45\n", "S13:45 Unicorn\n","D13:45\n", "S13:45\n", "T13:45\n", "t13:45\n", "D13:45\n", "T13:46\n","t13:45\n +", "D13:45\n", "S13:45\n","D13:45\n", "S13:45 UNICORN\n", "T13:45\n", "t13:45\n", "D13:45\n", "T13:46\n"); my $value = 'unicorn'; my @sections = grep /$value/i, (join '', @lines) =~ /^t.*?\nT.*?\n/gms +; print join "\n", @sections; # so there is a gap between sections

    Outputs:

    t13:45 D13:45 S13:45 Unicorn D13:45 S13:45 T13:45 t13:45 D13:45 S13:45 D13:45 S13:45 UNICORN T13:45
Re: Cannot find the error
by jwkrahn (Abbot) on Apr 03, 2020 at 20:51 UTC
    I have a script that is parsing through large text files

    I rewrote it, assuming a text file.

    #!/usr/bin/perl use warnings; use strict; my $file = <<FILE; t13:45 D13:45 S13:45 Unicorn D13:45 S13:45 T13:45 t13:45 D13:45 T13:46 t13:45 D13:45 S13:45 D13:45 S13:45 UNICORN T13:45 t13:45 D13:45 T13:46 FILE my $value = 'unicorn'; open my $FH, '<', \$file or die "Cannot open 'file' because: $!"; my $section = ''; while ( <$FH> ) { if ( /^t/ .. /^T/ ) { $section .= $_; } if ( /^T/ ) { print $section if $section =~ /$value/i; $section = ''; } }

    Debug statements are left as an exercise for the OP.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11114988]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others admiring the Monastery: (4)
As of 2024-04-20 00:37 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found