Re: processing file content as string vs array

in reply to processing file content as string vs array

I see that you are happy with the flip-flop operator as demo'ed by haukex. The flip-flop operator in Perl keeps the state of whether or not you are within the beginning or closing statements of some data record. I like that operator, but it may not be the best in all situations.

In a language without the flip-flop operator, another method is to call a subroutine when the beginning of record is seen. Use that subroutine to process the record. This handles the "state information" of whether or not you are inside the record without having to have a flag value. Of course adjustments are necessary depending upon whether the first or last values of the record need to be included or not.

Here is some possible code:

use warnings;
use strict;

while (<DATA>) 
{
    process_record ($_) if /\@user_info_start/;
}

sub process_record #include first and last line
{
   my $start_of_record = shift;
   my @userinfo;
   
   push @userinfo, $start_of_record;
   
   my $line;
   
   do { 
      $line =<DATA>;
      push @userinfo, $line;
   } until $line =~ /\@user_info_end/;
   
   print @userinfo; 
   print "\n";  
}   
   

=Prints

xxxx*@user_info_start
xxxx*@Title         : Mr
xxxx*@Username   : xxxxx
xxxx*@Filetype    : txt
xxxx*@Version       : 0001
xxxx*@Create_Date       : 20190407
xxxx*@Product       : xxxx
xxxx*@user_info_end

xxxx*@user_info_start
xxxx*@Title         : Mr
xxxx*@Username   : yy
xxxx*@Filetype    : txt
xxxx*@Version       : 0005
xxxx*@Create_Date       : 43
xxxx*@Product       : xxxx
xxxx*@user_info_end

=cut

__DATA__
xxxxxxxxxxx
xxxx*@user_info_start
xxxx*@Title         : Mr
xxxx*@Username   : xxxxx
xxxx*@Filetype    : txt
xxxx*@Version       : 0001
xxxx*@Create_Date       : 20190407
xxxx*@Product       : xxxx
xxxx*@user_info_end
d987sd66bd
xxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxx
xxxx*@user_info_start
xxxx*@Title         : Mr
xxxx*@Username   : yy
xxxx*@Filetype    : txt
xxxx*@Version       : 0005
xxxx*@Create_Date       : 43
xxxx*@Product       : xxxx
xxxx*@user_info_end
somebus here or blank lines? whatever
xxxxxxxxxxxxxxxxxxxxxxxxxxxx
[download]

Or perhaps.

use warnings;
use strict;

while (<DATA>) 
{
    process_record() if /\@user_info_start/;
}

sub process_record
{
  my $line;
  print $line while (defined ($line =<DATA>) and $line !~ /\@user_info
+_end/);
  print "\n";
}   
   

=Prints
xxxx*@Title         : Mr
xxxx*@Username   : xxxxx
xxxx*@Filetype    : txt
xxxx*@Version       : 0001
xxxx*@Create_Date       : 20190407
xxxx*@Product       : xxxx

xxxx*@Title         : Mr
xxxx*@Username   : yy
xxxx*@Filetype    : txt
xxxx*@Version       : 0005
xxxx*@Create_Date       : 43
xxxx*@Product       : xxxx

=cut

__DATA__
xxxxxxxxxxx
xxxx*@user_info_start
xxxx*@Title         : Mr
xxxx*@Username   : xxxxx
xxxx*@Filetype    : txt
xxxx*@Version       : 0001
xxxx*@Create_Date       : 20190407
xxxx*@Product       : xxxx
xxxx*@user_info_end
d987sd66bd
xxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxx
xxxx*@user_info_start
xxxx*@Title         : Mr
xxxx*@Username   : yy
xxxx*@Filetype    : txt
xxxx*@Version       : 0005
xxxx*@Create_Date       : 43
xxxx*@Product       : xxxx
xxxx*@user_info_end
somebus here or blank lines? whatever
xxxxxxxxxxxxxxxxxxxxxxxxxxxx
[download]

Update:

of course the first example could avoid a push...In general, Don't "push" when you can "print"!

use warnings;
use strict;

while (<DATA>) 
{
    process_record ($_) if /\@user_info_start/;
}

sub process_record #include first and last line
{
   my $start_of_record = shift;
     
   print $start_of_record;
   
   my $line;
   
   do { 
      $line =<DATA>;
      print $line;
   } until $line =~ /\@user_info_end/;

   print "\n";  
}   
   

=Prints

xxxx*@user_info_start
xxxx*@Title         : Mr
xxxx*@Username   : xxxxx
xxxx*@Filetype    : txt
xxxx*@Version       : 0001
xxxx*@Create_Date       : 20190407
xxxx*@Product       : xxxx
xxxx*@user_info_end

xxxx*@user_info_start
xxxx*@Title         : Mr
xxxx*@Username   : yy
xxxx*@Filetype    : txt
xxxx*@Version       : 0005
xxxx*@Create_Date       : 43
xxxx*@Product       : xxxx
xxxx*@user_info_end

=cut

__DATA__
xxxxxxxxxxx
xxxx*@user_info_start
xxxx*@Title         : Mr
xxxx*@Username   : xxxxx
xxxx*@Filetype    : txt
xxxx*@Version       : 0001
xxxx*@Create_Date       : 20190407
xxxx*@Product       : xxxx
xxxx*@user_info_end
d987sd66bd
xxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxx
xxxx*@user_info_start
xxxx*@Title         : Mr
xxxx*@Username   : yy
xxxx*@Filetype    : txt
xxxx*@Version       : 0005
xxxx*@Create_Date       : 43
xxxx*@Product       : xxxx
xxxx*@user_info_end
somebus here or blank lines? whatever
xxxxxxxxxxxxxxxxxxxxxxxxxxxx
[download]

I guess yet another way...

use warnings;
use strict;

while (<DATA>) 
{
    process_record ($_) if /\@user_info_start/;
}

sub process_record # include first and last line
{
   my $start_of_record = shift;
     
   print $start_of_record;
   
   my $line;
   
   # a bit of strangeness caused by using <DATA> handle
   
   while (defined ($line=<DATA>) and $line !~ /\@user_info_end/)
   {
      print $line;
   }
   print $line;  #the last line of record

   print "\n";  
}   
   

=Prints

xxxx*@user_info_start
xxxx*@Title         : Mr
xxxx*@Username   : xxxxx
xxxx*@Filetype    : txt
xxxx*@Version       : 0001
xxxx*@Create_Date       : 20190407
xxxx*@Product       : xxxx
xxxx*@user_info_end

xxxx*@user_info_start
xxxx*@Title         : Mr
xxxx*@Username   : yy
xxxx*@Filetype    : txt
xxxx*@Version       : 0005
xxxx*@Create_Date       : 43
xxxx*@Product       : xxxx
xxxx*@user_info_end

=cut

__DATA__
xxxxxxxxxxx
xxxx*@user_info_start
xxxx*@Title         : Mr
xxxx*@Username   : xxxxx
xxxx*@Filetype    : txt
xxxx*@Version       : 0001
xxxx*@Create_Date       : 20190407
xxxx*@Product       : xxxx
xxxx*@user_info_end
d987sd66bd
xxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxx
xxxx*@user_info_start
xxxx*@Title         : Mr
xxxx*@Username   : yy
xxxx*@Filetype    : txt
xxxx*@Version       : 0005
xxxx*@Create_Date       : 43
xxxx*@Product       : xxxx
xxxx*@user_info_end
somebus here or blank lines? whatever
xxxxxxxxxxxxxxxxxxxxxxxxxxxx
[download]

Comment on Re: processing file content as string vs array Select or Download Code

Replies are listed 'Best First'.
Re^2: processing file content as string vs array by haukex (Archbishop) on May 15, 2019 at 21:08 UTC
call a subroutine when the beginning of record is seen It may work if the first and last ~~record~~line are supposed to be processed by the `sub`, but what if the final line is supposed to be processed by some other piece of code? You can't just ungetc a readline... Also, note that your `process_record` is making use of a global variable, `DATA`, and three of your four examples will throw an `undef` warning if the end-of-file is reached before the closing line is seen. I think a state machine type approach would be better, because it is more flexible and can handle the above cases specially, if needed.	[reply] [d/l] [select]
Re^3: processing file content as string vs array by Marshall (Canon) on May 17, 2019 at 01:13 UTC
Good points. but what if the final line is supposed to be processed by some other piece of code? You can't just ungetc a readline... You are correct in that there is no "unget" or "un-read" for a line that has already been read. There are various ways of handling that sort of situation. In the case where the process() sub needs to deal with the first line, I pass that first line as a parameter to the process() sub. Usually these sorts of things are record oriented.... something has to be done with a record that was read and the process() sub's job is to assemble a complete record. If you want the code that "does something to the record" to be in the main driver, then just have process() return a structure or modify a struct ref that is passed in. I don't see any issue here at all. Can't use Perl's single action "if" in that situation, but I don't see any issue. Also, note that your process_record is making use of a global variable, DATA, and three of your four examples will throw an undef warning if the end-of-file is reached before the closing line is seen. As far as global DATA goes, I have no issue with that for a short (<1 page) piece of code. In a larger program I would pass a lexical file handle to the sub. Note: You can make a lexical file handle out of DATA like this: `my $fh = DATA; print while (<fh>);` Pass $fh to the sub. In almost all of the situations I deal with, throwing an error for a malformed file input is the correct behaviour. This is a usually good thing and the input file needs to be fixed. It is rare for me to throw away or silently ignore a malformed record. Of course "seldom" does not mean "never". It could certainly be argued that the program that doesn't throw an undef warning is in error! Of course the programs I demoed can be modified to have either behaviour. I think a state machine type approach would be better, because it is more flexible and can handle the above cases specially, if needed.* I guess we disagree. I don't see any case for "more flexible". However, having said that, there is no real quibble on my part with having a state variable approach. Using a sub() to keep track of the "inside record" state is very clean. I actually think the Perl flip-flop operator is very cool. No problem with that either! When I use it, I have to go to Grandfather's classic post and look at the various start/end regex situations. I often have to write "one-off" programs to convert wierd file formats. I will attach such a program that I wrote a few days ago. For such a thing, efficiency doesn't matter, "general purpose" doesn't matter - I will never see a file like this again. My job was to convert this file as part of a larger project. This is not "perfect" but it did its job. Read more... (4 kB)	[reply] [d/l] [select]
Re^4: processing file content as string vs array by haukex (Archbishop) on May 18, 2019 at 18:58 UTC
I have no issue with that for a short (<1 page) piece of code. For a short script, I don't see the advantage of a `sub` over just inlining the code. But since TMTOWTDI, it's fine. I don't see any issue here at all. ... I don't see any case for "more flexible". Just to be clear, I was talking about the general case, and especially for a longer script, where I disagree with this pattern. Personally, I think it's best to just read from the file in one place in the code, because as I said, I think it's more flexible across different input file formats. In a long script it would also become difficult to keep track of all the places that read the file, and what state they expect the filehandle to be in, and what state they leave it in. You said "You are correct in that there is no 'unget' or 'un-read' for a line that has already been read." - that's what I was referring to. I still think a state machine approach is better, but if you disagree, perhaps you could show how you'd use the pattern you showed (a `<DATA>` in the main loop and a `<DATA>` in a `sub`) to read a file like the below `__DATA__` section. #!/usr/bin/env perl use warnings; use strict; my @output; use constant { STATE_IDLE=>0, STATE_IN_SECTION=>1 }; my $state = STATE_IDLE; my @buf; my $end_section = sub { if ( $state == STATE_IN_SECTION ) { push @output, [@buf]; @buf = () } $state = STATE_IDLE; }; while (<DATA>) { chomp; if ( my ($x,$y) = /^ (?: (.+) \s+ )? START (?: \s+ (.+) )? $/x ) { if ( defined $x ) { die "unexpected: $_\n" unless $state == STATE_IN_SECTION; push @buf, $x; } $end_section->(); $state = STATE_IN_SECTION; push @buf, $y if defined $y; } elsif ( my ($z) = /^ (?: (.+) \s+ )? END $/x ) { die "unexpected: $_\n" unless $state == STATE_IN_SECTION; push @buf, $z if defined $z; $end_section->(); } else { if ( $state == STATE_IN_SECTION ) { push @buf, $_ } else {} # ignore outside of section } } $end_section->(); use Test::More tests=>1; is_deeply \@output, [["a", "b"], ["c" .. "g"], ["h", "i"], ["j", "k"]] or diag explain \@output; __DATA__ START a b START c d e f g END ignoreme START h i START j k [download]	[reply] [d/l] [select]
Re^5: processing file content as string vs array by Marshall (Canon) on May 21, 2019 at 11:29 UTC
Re^6: processing file content as string vs array by haukex (Archbishop) on May 21, 2019 at 16:04 UTC
Some notes below your chosen depth have not been shown here

In Section Seekers of Perl Wisdom