http://qs321.pair.com?node_id=1233687


in reply to Re: processing file content as string vs array
in thread processing file content as string vs array

Hi haukex,

Thanks for your input. Please find the sample data below:

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxx
xxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxx
xxxxxxxxxxx
xxxx*@user_info_start
xxxx*@Title : Mr
xxxx*@Username : xxxxx
xxxx*@Filetype : txt
xxxx*@Version : 0001
xxxx*@Create_Date : 20190407
xxxx*@Product : xxxx
xxxx*@user_info_end
xxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxx

Here is the regex I used to capture the user details

if( $file_content =~ m/(.*?\@user_info_start.*?\n)(.*)(.*?\@user_info_end.*$)/si){
    my $user_info= $2;
}

All is well. I learn by answering your questions...

Re^3: processing file content as string vs array
by haukex (Archbishop) on May 13, 2019 at 07:59 UTC

    Yes, it looks to me like a simple while loop should be better. Here, I'm using the flip-flop operator to keep state:

    use warnings;
    use strict;

    my @userinfo;
    while (<DATA>) {
        chomp;
        if ( /\@user_info_start/ ... /\@user_info_end/ ) {
            push @userinfo, $_;
        }
    }
    use Data::Dumper;
    print Dumper(\@userinfo);
    __DATA__
    xxxxxxxxxxx
    xxxx*@user_info_start
    xxxx*@Title : Mr
    xxxx*@Username : xxxxx
    xxxx*@Filetype : txt
    xxxx*@Version : 0001
    xxxx*@Create_Date : 20190407
    xxxx*@Product : xxxx
    xxxx*@user_info_end
    xxxxxxxxxxxxxxxxxxxxxxxxxxxx

    Output:

    $VAR1 = [
              'xxxx*@user_info_start',
              'xxxx*@Title : Mr',
              'xxxx*@Username : xxxxx',
              'xxxx*@Filetype : txt',
              'xxxx*@Version : 0001',
              'xxxx*@Create_Date : 20190407',
              'xxxx*@Product : xxxx',
              'xxxx*@user_info_end'
            ];

    And if you want to exclude the section markers, you can inspect the return value of the operator, for example:

    if ( my $flag = /\@user_info_start/ ... /\@user_info_end/ ) {
        push @userinfo, $_ unless $flag==1 || $flag=~/E0/;
    }
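
To make the return value of the flip-flop a little more concrete, here is a minimal self-contained sketch. The @lines array and the @flags bookkeeping are hypothetical additions for illustration only; the test itself is the one shown above.

```perl
use strict;
use warnings;

# Hypothetical sample lines modeled on the data in this thread.
my @lines = (
    'junk before',
    'xxxx*@user_info_start',
    'xxxx*@Title : Mr',
    'xxxx*@Username : xxxxx',
    'xxxx*@user_info_end',
    'junk after',
);

my ( @userinfo, @flags );
for (@lines) {
    # In scalar context the flip-flop returns '' while "off" and a
    # sequence number (1, 2, 3, ...) while "on"; the last value of a
    # range has the string "E0" appended to it.
    if ( my $flag = /\@user_info_start/ ... /\@user_info_end/ ) {
        push @flags, $flag;
        push @userinfo, $_ unless $flag == 1 || $flag =~ /E0/;
    }
}

print "flag: $_\n" for @flags;
print "kept: $_\n" for @userinfo;
```

The first matching line gets sequence number 1 and the closing line gets a value ending in E0, so the two marker lines are skipped and only the lines between them are kept.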
Re^3: processing file content as string vs array
by jwkrahn (Abbot) on May 13, 2019 at 08:12 UTC

    You could try setting the Input Record Separator to '@user_info_end':

    ...
    # set the Input Record Separator
    $/ = '@user_info_end';
    while ( my $file_content = <FILE> ) {
        # remove the Input Record Separator
        chomp $file_content;
        if ( $file_content =~ /(\@user_info_start.*\n)(?s:(.*))/i ) {
            $user_info = $2;
        }
        ...
    }
    ...
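
For anyone who wants to try the record-separator idea without a real file, here is a rough runnable sketch. It is a variation rather than the exact code above: the in-memory filehandle, the local on $/, the single capture group and the dropped /i are my assumptions for the demo.

```perl
use strict;
use warnings;

# Hypothetical file content modeled on the sample data in this thread.
my $text = <<'END';
xxxxxxxxxxx
xxxx*@user_info_start
xxxx*@Title : Mr
xxxx*@Username : xxxxx
xxxx*@user_info_end
xxxxxxxxxxxxxxxx
END

# Read from an in-memory filehandle so the sketch needs no real file.
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my $user_info;
{
    # Localize the change so reads elsewhere keep the default separator.
    local $/ = '@user_info_end';
    while ( my $record = <$fh> ) {
        # chomp removes the current value of $/, i.e. the end marker.
        chomp $record;
        # Skip the rest of the start-marker line, then capture what follows.
        if ( $record =~ /\@user_info_start.*\n(?s:(.*))/ ) {
            $user_info = $1;
        }
    }
}
close $fh;

print defined $user_info ? "$user_info\n" : "no info block found\n";
```

Note that chomp only strips the marker itself, so the captured text still ends with whatever precedes '@user_info_end' on its line ('xxxx*' in this sample). That may need trimming separately.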
Re^3: processing file content as string vs array
by AnomalousMonk (Archbishop) on May 13, 2019 at 20:15 UTC

    I think the loopy approach discussed by haukex and others is probably better than using regexes in this application. However...

    if( $file_content =~ m/(.*?\@user_info_start.*?\n)(.*)(.*?\@user_info_end.*$)/si){ ... }

    Some comments on this regex. (BTW: This is all untested.)

    • m/.../si
      The /i case insensitivity modifier has a cost in time. Is it really needed? Will you be processing info block delimiters that might look like '@UsEr_iNfO_StArT' etc? If there might be some limited variation in capitalization, e.g., '@User_Info_Start', it might be better to use a pattern like /\@[Uu]ser_[Ii]nfo_[Ss]tart/; character classes are usually less expensive than global case insensitivity. (The start/end patterns used in the flip-flop solutions discussed elsewhere are entirely case sensitive.)
    • (.*)
      This greedy little pattern will grab everything remaining in the string/file, forcing the following pattern to backtrack until it finds a block end delimiter substring. In particular, it will capture any junk at the beginning of the line containing the block end delimiter substring and also the newline from the previous line. Greed is one of the seven deadly sins.
    • (.*?\@user_info_start.*?\n)
      This captures everything from the start of the string/file up to the first newline after the block start delimiter substring. Do you really want this? You don't seem to use it, and captures aren't free.
    • (.*?\@user_info_end.*$)
      A similar comment applies to the block end delimiter pattern. This captures everything from the start of the end delimiter substring to the end of the file. Again, you don't seem to use this.
    • (.*?\@user_info_start.*?\n)
      (.*?\@user_info_end.*$)
      The info block start/end delimiter substring patterns are ambiguous: the start delimiter pattern also matches '@user_info_starting_to_rain' and similarly for the end delimiter pattern. There's a nice '@' character anchoring the start of the delimiter substrings, but I would have defined some kind of boundary assertion for their ends. (The start/end patterns used in the flip-flop solutions discussed elsewhere also suffer from this ambiguity.)
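
A small sketch of the delimiter ambiguity from the last point, and one possible way to tighten it. The (?!\S) lookahead is just one choice of boundary assertion, and the variable names are hypothetical:

```perl
use strict;
use warnings;

# Loose pattern as used in the thread, and a stricter variant that only
# matches when the marker is followed by whitespace or the end of the string.
my $loose_start  = qr/\@user_info_start/;
my $strict_start = qr/\@user_info_start(?!\S)/;

for my $line ( 'xxxx*@user_info_start', 'xxxx*@user_info_starting_to_rain' ) {
    printf "%-35s loose=%-8s strict=%s\n", $line,
        ( $line =~ $loose_start  ? 'match' : 'no match' ),
        ( $line =~ $strict_start ? 'match' : 'no match' );
}
```

The loose pattern accepts both lines, while the stricter one rejects the 'starting_to_rain' line. The same lookahead could be appended to the end delimiter pattern.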

    Here's an untested suggestion for a whole-file info block extraction regex. It assumes:

    • No more than one info block per file (update: although it wouldn't be very difficult to deal with multiple non-nested info blocks);
    • On the lines containing the start/end delimiter substrings, there may be any amount of any junk preceding those substrings;
    • On the lines containing the start/end delimiter substrings, only whitespace (zero or more characters) may follow the start/end delimiter before the newline;
    • There must be at least one line (i.e., at least one newline) in the info block, although this line may be blank or empty;
    • The start/end delimiter substrings are case sensitive.

    if ($file_content =~ m{ \@user_info_start \s* \n (.*?) \n [^\n]*? \@user_info_end (?! \S) }xms) {
        my $user_info = $1;
        ...
    }
    (Note that the info block will be extracted without an ending newline.) If you have time to play around with this, I'd be interested to know how this regex compares speedwise to the loopy solutions.
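
For the speed question, something along these lines might serve as a starting point. It is only a sketch: the test data, the by_regex and by_loop helper names, and the choice of an in-memory filehandle for the loop are all assumptions of mine, and real timings will depend heavily on file size and where the block sits in the file.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: filler lines around a single info block.
my $block = join '', map { "xxxx*\@$_\n" }
    'user_info_start', 'Title : Mr', 'Username : xxxxx', 'user_info_end';
my $file_content = ( "xxxxxxxx\n" x 1000 ) . $block . ( "xxxxxxxx\n" x 1000 );

# Whole-file regex, following the suggestion above.
sub by_regex {
    my ($content) = @_;
    return $content =~ m{
        \@user_info_start \s* \n
        (.*?)
        \n [^\n]*? \@user_info_end (?! \S)
    }xms ? $1 : undef;
}

# Line-by-line flip-flop loop over an in-memory filehandle,
# excluding the two marker lines.
sub by_loop {
    my ($content) = @_;
    open my $fh, '<', \$content or die "Cannot open in-memory file: $!";
    my @lines;
    while (<$fh>) {
        chomp;
        if ( my $flag = /\@user_info_start/ ... /\@user_info_end/ ) {
            push @lines, $_ unless $flag == 1 || $flag =~ /E0/;
        }
    }
    close $fh;
    return join "\n", @lines;
}

# Sanity check: both approaches should extract the same text.
die "results differ" unless by_regex($file_content) eq by_loop($file_content);

# Run each variant for at least 1 CPU second and print a comparison table.
cmpthese( -1, {
    regex => sub { by_regex($file_content) },
    loop  => sub { by_loop($file_content) },
} );
```

The sanity check matters more than the numbers: a benchmark between two routines that return different text would not tell you much.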


    Give a man a fish:  <%-{-{-{-<