
Re^4: processing file content as string vs array

by haukex (Bishop)
on May 18, 2019 at 18:58 UTC (#11100227)

in reply to Re^3: processing file content as string vs array
in thread processing file content as string vs array

I have no issue with that for a short (<1 page) piece of code.

For a short script, I don't see the advantage of a sub over just inlining the code. But since TMTOWTDI, it's fine.

I don't see any issue here at all. ... I don't see any case for "more flexible".

Just to be clear, I was talking about the general case, and especially for a longer script, where I disagree with this pattern. Personally, I think it's best to just read from the file in one place in the code, because as I said, I think it's more flexible across different input file formats. In a long script it would also become difficult to keep track of all the places that read the file, and what state they expect the filehandle to be in, and what state they leave it in.
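
A minimal sketch of that concern, assuming a hypothetical two-reader layout: the sub name `read_section` and the sample input are made up for illustration and are not taken from the thread. The sub stops at the next START line, but that line has already been consumed and cannot be put back, so the main loop never sees it:

```perl
use strict;
use warnings;

# Hypothetical input: a section ends where the next START line begins.
my $text = "START\na\nb\nSTART\nc\n";
open my $fh, '<', \$text or die "cannot open in-memory file: $!";

my @sections;

# Second reader: consumes lines until it meets a START line.
sub read_section {
    my ($in) = @_;
    my @items;
    while ( defined( my $line = <$in> ) ) {
        chomp $line;
        last if $line eq 'START';   # consumed here, and there is no "unget"
        push @items, $line;
    }
    return \@items;
}

# First reader: the main loop only looks for START lines.
while ( defined( my $line = <$fh> ) ) {
    chomp $line;
    push @sections, read_section($fh) if $line eq 'START';
}

# The second START was swallowed inside read_section(), so the main loop
# reads "c" next, finds no START, and the second section is lost.
print scalar(@sections), " section(s): @{ $sections[0] }\n";
# prints "1 section(s): a b"
```

Working around this means passing the consumed line back to the caller in some way, which is exactly the kind of filehandle bookkeeping that gets hard to follow in a longer script.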

You said "You are correct in that there is no 'unget' or 'un-read' for a line that has already been read." - that's what I was referring to. I still think a state machine approach is better, but if you disagree, perhaps you could show how you'd use the pattern you showed (a <DATA> in the main loop and a <DATA> in a sub) to read a file like the below __DATA__ section.

#!/usr/bin/env perl
use warnings;
use strict;

my @output;

use constant { STATE_IDLE=>0, STATE_IN_SECTION=>1 };
my $state = STATE_IDLE;
my @buf;
my $end_section = sub {
    if ( $state == STATE_IN_SECTION )
        { push @output, [@buf]; @buf = () }
    $state = STATE_IDLE;
};
while (<DATA>) {
    chomp;
    if ( my ($x,$y) = /^ (?: (.+) \s+ )? START (?: \s+ (.+) )? $/x ) {
        if ( defined $x ) {
            die "unexpected: $_\n" unless $state == STATE_IN_SECTION;
            push @buf, $x;
        }
        $end_section->();
        $state = STATE_IN_SECTION;
        push @buf, $y if defined $y;
    }
    elsif ( my ($z) = /^ (?: (.+) \s+ )? END $/x ) {
        die "unexpected: $_\n" unless $state == STATE_IN_SECTION;
        push @buf, $z if defined $z;
        $end_section->();
    }
    else {
        if ( $state == STATE_IN_SECTION )
            { push @buf, $_ }
        else {} # ignore outside of section
    }
}
$end_section->();

use Test::More tests=>1;
is_deeply \@output, [["a", "b"], ["c" .. "g"], ["h", "i"], ["j", "k"]]
    or diag explain \@output;

__DATA__
START
a
b START c
d
e
f
g END
ignoreme
START h
i START j
k

Re^5: processing file content as string vs array
by Marshall (Canon) on May 21, 2019 at 11:29 UTC
    I like your code and have no problem with it!

    There are a number of techniques to deal with this kind of parsing. I know how to implement several of them and I'm ok with them all.

    Your example data format is unusual because it has more than one significant complicating factor.

    Just for fun, here is an alternate coding that demos some other techniques. I make no claim about "better". There is seldom a coding pattern that works "the best" in all situations. I used your regexes as they looked fine to me. At the end of the day, all of the "states" have to be described and handled.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;
    $|=1;

    # Don't read in another line if we are still working
    # on a START line.  This is caused by the
    # X START Y syntax in conjunction with the idea
    # of END absent a START in this example file format.
    # As a thought redefining the input separator to
    # be 'START' could possibly be productive if the format
    # is not exactly like this?,
    # This format has some of the nastiest things to deal
    # with.  They normally do not occur all at once!

    my @record=();
    my $line_in ='';

    while ( $line_in =~ /START/ or $line_in =<DATA>)
    {
        $line_in = construct_record($line_in) if $line_in =~ /START/;
    }

    sub construct_record
    {
        my $line = shift;
        if ( (my $x) = $line =~ /START\s+(\w+)\s*$/)
        {
            push @record, $x;
        }
        while (defined ($line = <DATA>) and ($line !~ /(START|END)/) )
        {
            $line =~ s/^\s*|\s*$//g;
            push @record, $line;
        }

        $line //= '';  #could be an EOF
        if (my ($b4end) = $line =~ /^ (?: (.+) \s+ )? END $/x)
        {
            push @record, $b4end if $b4end;
            output_record();
            return '';  # no continuation of this record
        }

        if ( my ($x,$y) = $line =~ /^ (?: (.+) \s+ )? START (?: \s+ (.+) ) +? $/x )
        {
            if ($x)
            {
                push @record, $x;
                output_record();
            }
            if ($y)
            {
                output_record();  # might be: "^START 77"?
                return "START $y";
            }
        }
        return '';
    }

    sub output_record # or process it somehow...
    {
        print "Record: @record\n" if (@record >1);
        @record=();
    }
    =Prints
    Record: a b
    Record: c d e f g
    Record: h i
    Record: j k
    =cut
    __DATA__
    START
    a
    b START c
    d
    e
    f
    g END
    ignoreme
    START h
    i START j
    k
    END
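
The code comments above float the idea of redefining the input separator to 'START'. A minimal sketch of how that might look follows; it is not code from the thread. It assumes Perl 5.14+ (for `s///r`), a hypothetical sub name `read_sections`, sample input supplied as a string, that everything before the first START can be ignored, and that the words START and END never occur inside a data value:

```perl
use strict;
use warnings;

# Sample input, built as a string so the sketch is self-contained (assumption).
my $text = join "\n", 'START', 'a', 'b START c', 'd', 'e', 'f',
                      'g END', 'ignoreme', 'START h', 'i START j', 'k', '';

sub read_sections {
    my ($input) = @_;
    open my $fh, '<', \$input or die "cannot open in-memory file: $!";
    local $/ = 'START';    # each "line" now ends at a START marker
    my @sections;
    my $seen_start = 0;
    while ( defined( my $chunk = <$fh> ) ) {
        # chomp removes the trailing 'START' and returns the number of
        # characters removed (0 for the final chunk at EOF).
        my $ended_by_start = chomp $chunk;
        if ($seen_start) {
            $chunk =~ s/\bEND\b.*//s;    # drop END and any ignored text after it
            my @items = grep { length }
                        map  { s/^\s+|\s+$//gr }
                        split /\n/, $chunk;
            push @sections, \@items if @items;
        }
        $seen_start ||= $ended_by_start;    # text before the first START is skipped
    }
    return \@sections;
}

my $sections = read_sections($text);
print "Record: @$_\n" for @$sections;
```

This trades the explicit states for string surgery on each chunk, so it only fits formats where START reliably delimits records; a value that merely contains the letters START would split a record in the wrong place.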

      Of course TMTOWTDI. I just don't see the advantage of this code over just inlining the sub construct_record code directly in the while loop. Plus, you've increased the number of global variables you're using. There are a couple other things I could nitpick, like that you've got five different regexes all checking for the string START.

      This format has some of the nastiest things to deal with. They normally do not occur all at once!

      I disagree - I don't find this format nasty and there are plenty of data formats this complicated. Which was exactly my point - a state machine type approach can handle them all. Anyway, as I said, as long as it works you're free to write code like this - I personally still disagree with it ;-)

        I don't want to get into a big argument. My purpose was to show TMTOWTDI.
        I do not claim that the approach I demo'ed works best for all cases.
        I think knowledge of multiple ways to implement this result is a good idea.
        I do agree that a state machine approach can handle much more complicated formats than in this example.
        Most of the formats I deal with are less complex (thank goodness!) and do not require a state machine approach.
