PerlMonks  

processing file content as string vs array

by vinoth.ree (Monsignor)
on May 13, 2019 at 06:55 UTC ( [id://1233681] )

vinoth.ree has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I have a file containing some metadata, and I need to capture a few details (user information) that can fall anywhere in the file. To capture them, I read the entire file content as a string and used a regex (m/\@user_info_start(.*)\@user_info_end/si) to extract the user details. It works fine for smaller files, but it takes a long time for files larger than 1MB.

To fix this, I read the file content into an array and looped over it to capture the user details that fall between the @user_info_start and @user_info_end lines. This gives better results than the regex for bigger files.

I just want to check with you people: is processing the file content from an array the better option, versus a regex on the whole string?


All is well. I learn by answering your questions...

Replies are listed 'Best First'.
Re: processing file content as string vs array
by Eily (Monsignor) on May 13, 2019 at 12:52 UTC

    You only have one @user_info in the whole file, right? Otherwise your regex will give the wrong result: everything from the first @user_info_start to the last @user_info_end. This is because of the '.*' in your regex: since * is 'greedy', it will try to match as much as possible. This means that after @user_info_start has been found, the regex engine will basically jump to the end of the file and move backward one character at a time (this is called backtracking) until it finds @user_info_end.

    To get the reverse behaviour (go forward one character at a time right after finding @user_info_start), you could use (.*?): .*? will start by matching nothing, and only consume an extra character when necessary.
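The difference is easy to see on a tiny invented two-block string (the input below is made up for illustration):

```perl
use strict;
use warnings;

# Invented input containing TWO user_info blocks.
my $content = 'x@user_info_startA@user_info_endy@user_info_startB@user_info_endz';

# Greedy: after the first start marker, .* runs to the end of the string
# and backtracks to the LAST end marker, swallowing both blocks.
my ($greedy) = $content =~ /\@user_info_start(.*)\@user_info_end/s;

# Non-greedy: .*? grows one character at a time, stopping at the FIRST end marker.
my ($lazy) = $content =~ /\@user_info_start(.*?)\@user_info_end/s;

print "greedy: $greedy\n";   # A@user_info_endy@user_info_startB
print "lazy:   $lazy\n";     # A
```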

    That being said, I really like the idiom presented by haukex here, which is quite intuitive when you know that the .. operator is read as "FROM .. TO"; so in haukex's code that would be FROM @user_info_start TO @user_info_end. One thing you can add to his code, if you only have one occurrence of @user_info in the whole file, is an exit from the loop as soon as you have found your data:

use warnings;
use strict;
my @userinfo;
LINE: while (<DATA>) {
    chomp;
    if ( /\@user_info_start/ ... /\@user_info_end/ ) {
        push @userinfo, $_;
    }
    elsif (@userinfo) {
        last LINE; # stop looking
    }
}
use Data::Dumper;
print Dumper(\@userinfo);
__DATA__
xxxxxxxxxxx
xxxx*@user_info_start
xxxx*@Title : Mr
xxxx*@Username : xxxxx
xxxx*@Filetype : txt
xxxx*@Version : 0001
xxxx*@Create_Date : 20190407
xxxx*@Product : xxxx
xxxx*@user_info_end
xxxxxxxxxxxxxxxxxxxxxxxxxxxx

      One thing you can add to his code if you only have one occurrence of @user_info in the whole file is an exit from the loop as soon as you have found your data

      That's a very good point! Here are two more variants: the first if the start and end tags should be captured, the second if they shouldn't (replacing the if/elsif):

if ( my $flag = /\@user_info_start/ ... /\@user_info_end/ ) {
    push @userinfo, $_;
    last LINE if $flag =~ /E0/;
}
# - or -
if ( my $flag = /\@user_info_start/ ... /\@user_info_end/ ) {
    last LINE if $flag =~ /E0/;
    push @userinfo, $_ unless $flag == 1;
}

      See also Behavior of Flip-Flop Operators and Flipin good, or a total flop?

        ++ in the spirit of TIMTOWTDI, but I personally don't like that version because /E0/ is too much of a magic value for me.

        Thank you haukex it works awesome!!!


        All is well. I learn by answering your questions...
Re: processing file content as string vs array
by haukex (Archbishop) on May 13, 2019 at 07:21 UTC

    The performance of regexes can vary depending on the regex itself, there are some cases where excessive backtracking can cause regexes to be quite slow on long inputs. Often adjusting the regex can help, but then it really depends on exactly what the data looks like, which you haven't shown. But if it's possible to read the file into an array and filter that, then it should also be possible to read the file with a while(<>) loop and store only the lines you need, instead of reading the entire file into memory.

      Hi haukex,

      Thanks for your input; please find the sample data below:

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxx xxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxx
xxxxxxxxxxx
xxxx*@user_info_start
xxxx*@Title : Mr
xxxx*@Username : xxxxx
xxxx*@Filetype : txt
xxxx*@Version : 0001
xxxx*@Create_Date : 20190407
xxxx*@Product : xxxx
xxxx*@user_info_end
xxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxx

      Here is the regex I used to capture the user details

if( $file_content =~ m/(.*?\@user_info_start.*?\n)(.*)(.*?\@user_info_end.*$)/si ){
    my $user_info = $2;
}

      All is well. I learn by answering your questions...

        Yes, it looks to me like a simple while loop should be better. Here, I'm using the flip-flop operator to keep state:

use warnings;
use strict;
my @userinfo;
while (<DATA>) {
    chomp;
    if ( /\@user_info_start/ ... /\@user_info_end/ ) {
        push @userinfo, $_;
    }
}
use Data::Dumper;
print Dumper(\@userinfo);
__DATA__
xxxxxxxxxxx
xxxx*@user_info_start
xxxx*@Title : Mr
xxxx*@Username : xxxxx
xxxx*@Filetype : txt
xxxx*@Version : 0001
xxxx*@Create_Date : 20190407
xxxx*@Product : xxxx
xxxx*@user_info_end
xxxxxxxxxxxxxxxxxxxxxxxxxxxx

        Output:

$VAR1 = [
          'xxxx*@user_info_start',
          'xxxx*@Title : Mr',
          'xxxx*@Username : xxxxx',
          'xxxx*@Filetype : txt',
          'xxxx*@Version : 0001',
          'xxxx*@Create_Date : 20190407',
          'xxxx*@Product : xxxx',
          'xxxx*@user_info_end'
        ];

        And if you want to exclude the section markers, you can inspect the return value of the operator, for example:

if ( my $flag = /\@user_info_start/ ... /\@user_info_end/ ) {
    push @userinfo, $_ unless $flag == 1 || $flag =~ /E0/;
}

        You could try setting the Input Record Separator to '@user_info_end':

...
# set the Input Record Separator
$/ = '@user_info_end';
while ( my $file_content = <FILE> ) {
    # remove the Input Record Separator
    chomp $file_content;
    if ( $file_content =~ /(\@user_info_start.*\n)(?s:(.*))/i ) {
        $user_info = $2;
    }
    ...
}
...
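For completeness, here is a self-contained, runnable version of that idea against the thread's sample data (the DATA handle and variable names are mine; note that, as in the original regex, any junk at the start of the end-marker line ends up in the capture):

```perl
use strict;
use warnings;

local $/ = '@user_info_end';   # read in chunks that end at the end marker

while ( my $file_content = <DATA> ) {
    chomp $file_content;       # strip the '@user_info_end' terminator
    if ( $file_content =~ /\@user_info_start\s*\n(?s:(.*))/ ) {
        my $user_info = $1;
        print $user_info, "\n";
    }
}
__DATA__
xxxxxxxxxxx
xxxx*@user_info_start
xxxx*@Title : Mr
xxxx*@Username : xxxxx
xxxx*@user_info_end
xxxxxxxxxxxxxxxxxxxxxxxxxxxx
```

Localizing $/ keeps the changed record separator from leaking into the rest of the program.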

        I think the loopy approach discussed by haukex and others is probably better than using regexes in this application. However...

        if( $file_content =~ m/(.*?\@user_info_start.*?\n)(.*)(.*?\@user_info_end.*$)/si){ ... }

        Some comments on this regex. (BTW: This is all untested.)

        • m/.../si
          The /i case insensitivity modifier has a cost in time. Is it really needed? Will you be processing info block delimiters that might look like '@UsEr_iNfO_StArT' etc.? If there might be some limited variation in capitalization, e.g., '@User_Info_Start', it might be better to use a pattern like /\@[Uu]ser_[Ii]nfo_[Ss]tart/; character classes are usually less expensive than global case insensitivity. (The start/end patterns used in the flip-flop solutions discussed elsewhere are entirely case sensitive.)
        • (.*)
          This greedy little pattern will grab everything remaining in the string/file, forcing the following pattern to backtrack until it finds a block end delimiter substring. In particular, it will capture any junk at the beginning of the line containing the block end delimiter substring and also the newline from the previous line. Greed is one of the seven deadly sins.
        • (.*?\@user_info_start.*?\n)
          This captures everything from the start of the string/file up to the first newline after the block start delimiter substring. Do you really want this? You don't seem to use it, and captures aren't free.
        • (.*?\@user_info_end.*$)
          A similar comment applies to the block end delimiter pattern. This captures everything from the start of the end delimiter substring to the end of the file. Again, you don't seem to use this.
        • (.*?\@user_info_start.*?\n)
          (.*?\@user_info_end.*$)
          The info block start/end delimiter substring patterns are ambiguous: the start delimiter pattern also matches '@user_info_starting_to_rain' and similarly for the end delimiter pattern. There's a nice '@' character anchoring the start of the delimiter substrings, but I would have defined some kind of boundary assertion for their ends. (The start/end patterns used in the flip-flop solutions discussed elsewhere also suffer from this ambiguity.)
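The ambiguity, and one possible end-of-token assertion ((?!\S), "not followed by a non-whitespace character"), can be demonstrated on invented one-line inputs:

```perl
use strict;
use warnings;

my $bogus = 'xxxx*@user_info_starting_to_rain';   # not a real delimiter line
my $real  = 'xxxx*@user_info_start';

# The bare pattern happily matches the bogus line:
print "bare matches bogus\n" if $bogus =~ /\@user_info_start/;

# Requiring that only whitespace (or end of string) follows rejects it:
print "anchored matches bogus\n" if $bogus =~ /\@user_info_start(?!\S)/;   # (no output)
print "anchored matches real\n"  if $real  =~ /\@user_info_start(?!\S)/;
```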

        Here's an untested suggestion for a whole-file info block extraction regex. It assumes:

        • No more than one info block per file (update: although it wouldn't be very difficult to deal with multiple non-nested info blocks);
        • On the lines containing the start/end delimiter substrings, there may be any amount of any junk preceding those substrings;
        • On the lines containing the start/end delimiter substrings, there may only be zero or more whitespace following the start/end delimiter and before the newline;
        • There must be at least one line (i.e., at least one newline) in the info block, although this line may be blank or empty;
        • The start/end delimiter substrings are case sensitive.
if ($file_content =~ m{
        \@user_info_start \s* \n
        (.*?) \n
        [^\n]*? \@user_info_end (?! \S)
    }xms)
{
    my $user_info = $1;
    ...
}
        (Note that the info block will be extracted without an ending newline.) If you have time to play around with this, I'd be interested to know how this regex compares speedwise to the loopy solutions.
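A quick runnable check of that regex against the thread's sample data (the regex was posted untested, so it seemed worth verifying):

```perl
use strict;
use warnings;

my $file_content = <<'EOF';
xxxxxxxxxxx
xxxx*@user_info_start
xxxx*@Title : Mr
xxxx*@Username : xxxxx
xxxx*@user_info_end
xxxxxxxxxxxxxxxxxxxxxxxxxxxx
EOF

if ($file_content =~ m{ \@user_info_start \s* \n (.*?) \n [^\n]*? \@user_info_end (?! \S) }xms) {
    my $user_info = $1;
    # captures the two inner lines only; the delimiter lines and the
    # trailing newline are excluded, as the stated assumptions promise
    print "$user_info\n";
}
```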


        Give a man a fish:  <%-{-{-{-<

Re: processing file content as string vs array
by Marshall (Canon) on May 15, 2019 at 01:26 UTC
    I see that you are happy with the flip-flop operator as demonstrated by haukex. The flip-flop operator in Perl keeps track of whether you are between the beginning and closing lines of some data record. I like that operator, but it may not be the best in all situations.

    In a language without the flip-flop operator, another method is to call a subroutine when the beginning of record is seen. Use that subroutine to process the record. This handles the "state information" of whether or not you are inside the record without having to have a flag value. Of course adjustments are necessary depending upon whether the first or last values of the record need to be included or not.

    Here is some possible code:
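(Marshall's actual examples are collapsed in this copy of the thread. A minimal sketch of the subroutine-per-record idea described above, with all names invented, could look like this; note it excludes the marker lines, which is one of the adjustments mentioned:)

```perl
use strict;
use warnings;

# The record processor owns the "inside the record" state implicitly:
# it reads until the end marker and returns the body lines.
sub process_record {
    my ($fh) = @_;
    my @lines;
    while ( my $line = <$fh> ) {
        chomp $line;
        last if $line =~ /\@user_info_end/;   # closing line seen
        push @lines, $line;
    }
    return \@lines;
}

my $info;
while ( my $line = <DATA> ) {
    # No flag variable in the main loop: the start marker just
    # hands the filehandle to the record processor.
    $info = process_record(\*DATA) if $line =~ /\@user_info_start/;
}
print "$_\n" for @$info;
__DATA__
junk
xxxx*@user_info_start
xxxx*@Title : Mr
xxxx*@Username : xxxxx
xxxx*@user_info_end
junk
```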

    Or perhaps. Update:

    Of course the first example could avoid a push... In general, don't "push" when you can "print"!

    I guess yet another way...
      call a subroutine when the beginning of record is seen

      It may work if the first and last record lines are supposed to be processed by the sub, but what if the final line is supposed to be processed by some other piece of code? You can't just ungetc a readline...

      Also, note that your process_record is making use of a global variable, DATA, and three of your four examples will throw an undef warning if the end-of-file is reached before the closing line is seen.

      I think a state machine type approach would be better, because it is more flexible and can handle the above cases specially, if needed.
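A sketch of what that might look like, with invented state names; the point is that the explicit states and the EOF check give you hooks for the special cases above (the closing line is still in hand when the state flips, and an unterminated record is detected):

```perl
use strict;
use warnings;

my $state = 'OUTSIDE';
my @userinfo;

while ( my $line = <DATA> ) {
    chomp $line;
    if ( $state eq 'OUTSIDE' ) {
        $state = 'INSIDE' if $line =~ /\@user_info_start/;
    }
    elsif ( $state eq 'INSIDE' ) {
        if ( $line =~ /\@user_info_end/ ) {
            $state = 'DONE';
            # the closing line itself is still in $line here,
            # available to whatever other code needs it
        }
        else {
            push @userinfo, $line;
        }
    }
}
warn "EOF before \@user_info_end\n" if $state eq 'INSIDE';
print "$_\n" for @userinfo;
__DATA__
junk
xxxx*@user_info_start
xxxx*@Title : Mr
xxxx*@user_info_end
junk
```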

        Good points.

        but what if the final line is supposed to be processed by some other piece of code? You can't just ungetc a readline...

        You are correct in that there is no "unget" or "un-read" for a line that has already been read. There are various ways of handling that sort of situation. In the case where the process() sub needs to deal with the first line, I pass that first line as a parameter to the process() sub. Usually these sorts of things are record oriented.... something has to be done with a record that was read and the process() sub's job is to assemble a complete record. If you want the code that "does something to the record" to be in the main driver, then just have process() return a structure or modify a struct ref that is passed in. I don't see any issue here at all. Can't use Perl's single action "if" in that situation, but I don't see any issue.

        Also, note that your process_record is making use of a global variable, DATA, and three of your four examples will throw an undef warning if the end-of-file is reached before the closing line is seen.

        As far as global DATA goes, I have no issue with that for a short (<1 page) piece of code. In a larger program I would pass a lexical file handle to the sub. Note: You can make a lexical file handle out of DATA like this: my $fh = *DATA; print while (<$fh>); Pass $fh to the sub.

        In almost all of the situations I deal with, throwing an error for malformed file input is the correct behaviour. This is usually a good thing, and the input file needs to be fixed. It is rare for me to throw away or silently ignore a malformed record. Of course "seldom" does not mean "never". It could certainly be argued that the program that doesn't throw an undef warning is in error! Of course the programs I demoed can be modified to have either behaviour.

        I think a state machine type approach would be better, because it is more flexible and can handle the above cases specially, if needed.

        I guess we disagree. I don't see any case for "more flexible". However, having said that, there is no real quibble on my part with having a state variable approach. Using a sub() to keep track of the "inside record" state is very clean. I actually think the Perl flip-flop operator is very cool. No problem with that either! When I use it, I have to go to Grandfather's classic post and look at the various start/end regex situations.

        I often have to write "one-off" programs to convert weird file formats. I will attach such a program that I wrote a few days ago. For such a thing, efficiency doesn't matter and "general purpose" doesn't matter - I will never see a file like this again. My job was to convert this file as part of a larger project. This is not "perfect" but it did its job.

Node Type: perlquestion [id://1233681]
Approved by marto