Re: processing file content as string vs array

by haukex (Bishop)
on May 13, 2019 at 07:21 UTC


in reply to processing file content as string vs array

The performance of a regex depends heavily on the regex itself: in some cases, excessive backtracking can make a regex quite slow on long inputs. Adjusting the regex often helps, but how to adjust it depends on exactly what the data looks like, which you haven't shown. That said, if it's possible to read the file into an array and filter that, then it should also be possible to read the file with a while(<>) loop and store only the lines you need, instead of reading the entire file into memory.

Replies are listed 'Best First'.
Re^2: processing file content as string vs array
by vinoth.ree (Monsignor) on May 13, 2019 at 07:45 UTC
    Hi haukex,

    Thanks for your input. Please find the sample data below:

    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxx xxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxx xxxxxxxxxxx
    xxxx*@user_info_start
    xxxx*@Title : Mr
    xxxx*@Username : xxxxx
    xxxx*@Filetype : txt
    xxxx*@Version : 0001
    xxxx*@Create_Date : 20190407
    xxxx*@Product : xxxx
    xxxx*@user_info_end
    xxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxx

    Here is the regex I used to capture the user details

    if( $file_content =~ m/(.*?\@user_info_start.*?\n)(.*)(.*?\@user_info_end.*$)/si){
        my $user_info= $2;
    }

    All is well. I learn by answering your questions...

      Yes, it looks to me like a simple while loop should be better. Here, I'm using the flip-flop operator to keep state:

      use warnings;
      use strict;

      my @userinfo;
      while (<DATA>) {
          chomp;
          if ( /\@user_info_start/ ... /\@user_info_end/ ) {
              push @userinfo, $_;
          }
      }
      use Data::Dumper;
      print Dumper(\@userinfo);

      __DATA__
      xxxxxxxxxxx
      xxxx*@user_info_start
      xxxx*@Title : Mr
      xxxx*@Username : xxxxx
      xxxx*@Filetype : txt
      xxxx*@Version : 0001
      xxxx*@Create_Date : 20190407
      xxxx*@Product : xxxx
      xxxx*@user_info_end
      xxxxxxxxxxxxxxxxxxxxxxxxxxxx

      Output:

      $VAR1 = [
                'xxxx*@user_info_start',
                'xxxx*@Title : Mr',
                'xxxx*@Username : xxxxx',
                'xxxx*@Filetype : txt',
                'xxxx*@Version : 0001',
                'xxxx*@Create_Date : 20190407',
                'xxxx*@Product : xxxx',
                'xxxx*@user_info_end'
              ];

      And if you want to exclude the section markers, you can inspect the return value of the operator, for example:

      if ( my $flag = /\@user_info_start/ ... /\@user_info_end/ ) {
          push @userinfo, $_ unless $flag==1 || $flag=~/E0/;
      }
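To see why the test `$flag==1 || $flag=~/E0/` picks out the two marker lines, it may help to print what the flip-flop operator returns for each line. The following is a minimal sketch; the `@lines` data is a shortened, hypothetical version of the sample posted in this thread.

```perl
use strict;
use warnings;

# Hypothetical, shortened sample modeled on the data in this thread.
my @lines = (
    'xxxxxxxxxxx',
    'xxxx*@user_info_start',
    'xxxx*@Title : Mr',
    'xxxx*@Username : xxxxx',
    'xxxx*@user_info_end',
    'xxxxxxxxxxxxxxxxxxxxxxxxxxxx',
);

my @flags;
for (@lines) {
    # In scalar context, ... is the flip-flop operator. It returns the
    # empty string while false, a sequence number (1, 2, 3, ...) while
    # true, and appends "E0" to the number on the last line of the range.
    my $flag = /\@user_info_start/ ... /\@user_info_end/;
    push @flags, $flag;
    printf "%-4s %s\n", "[$flag]", $_;
}
# The flags are, in order: '', 1, 2, 3, '4E0', ''
```

So the start marker is the line where the value is 1, and the end marker is the line whose value contains "E0". Because "4E0" is still a valid number, the numeric comparison `$flag==1` does not warn on it.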

      You could try setting the Input Record Separator to '@user_info_end':

      ...
      # set the Input Record Separator
      $/ = '@user_info_end';
      while ( my $file_content = <FILE> ) {
          # remove the Input Record Separator
          chomp $file_content;
          # opening parenthesis of the first group restored so that $2 is the info block
          if ( $file_content =~ /(\@user_info_start.*\n)(?s:(.*))/i ) {
              $user_info = $2;
          }
          ...
      }
      ...
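A self-contained sketch of the record-separator idea follows. It reads from an in-memory filehandle instead of `FILE`, uses a single capture group, and the sample data is a shortened, hypothetical version of what was posted; all of these are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical, shortened sample modeled on the data in this thread.
my $data = <<'END_SAMPLE';
xxxxxxxxxxx
xxxx*@user_info_start
xxxx*@Title : Mr
xxxx*@Username : xxxxx
xxxx*@user_info_end
xxxxxxxxxxxxxxxxxxxxxxxxxxxx
END_SAMPLE

# Read from the string as if it were a file.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @blocks;
{
    # Each "line" now ends at the end marker instead of at a newline.
    local $/ = '@user_info_end';
    while ( my $record = <$fh> ) {
        chomp $record;    # strips the trailing '@user_info_end', if present
        if ( $record =~ /\@user_info_start.*\n(?s:(.*))/ ) {
            push @blocks, $1;
        }
    }
}
close $fh;

print "[$_]\n" for @blocks;
```

Note that with this sample the captured block still ends with the `xxxx*` junk that precedes the end marker on its line, since chomp only removes the separator itself. That prefix would need to be trimmed separately.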

      I think the loopy approach discussed by haukex and others is probably better than using regexes in this application. However...

      if( $file_content =~ m/(.*?\@user_info_start.*?\n)(.*)(.*?\@user_info_end.*$)/si){ ... }

      Some comments on this regex. (BTW: This is all untested.)

      • m/.../si
        The /i case insensitivity modifier has a cost in time. Is it really needed? Will you be processing info block delimiters that might look like '@UsEr_iNfO_StArT' etc? If there might be some limited variation in capitalization, e.g., '@User_Info_Start', it might be better to use a pattern like /\@[Uu]ser_[Ii]nfo_[Ss]tart/; character sets | classes are usually less expensive than global case insensitivity. (The start/end patterns used in the flip-flop solutions discussed elsewhere are entirely case sensitive.)
      • (.*)
        This greedy little pattern will grab everything remaining in the string/file, forcing the following pattern to backtrack until it finds a block end delimiter substring. In particular, it will capture any junk at the beginning of the line containing the block end delimiter substring and also the newline from the previous line. Greed is one of the seven deadly sins.
      • (.*?\@user_info_start.*?\n)
        This captures everything from the start of the string/file up to the first newline after the block start delimiter substring. Do you really want this? You don't seem to use it, and captures aren't free.
      • (.*?\@user_info_end.*$)
        A similar comment applies to the block end delimiter pattern. This captures everything from the start of the end delimiter substring to the end of the file. Again, you don't seem to use this.
      • (.*?\@user_info_start.*?\n)
        (.*?\@user_info_end.*$)
        The info block start/end delimiter substring patterns are ambiguous: the start delimiter pattern also matches '@user_info_starting_to_rain' and similarly for the end delimiter pattern. There's a nice '@' character anchoring the start of the delimiter substrings, but I would have defined some kind of boundary assertion for their ends. (The start/end patterns used in the flip-flop solutions discussed elsewhere also suffer from this ambiguity.)
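The ambiguity described in the last point can be checked with a short sketch. The `(?!\S)` lookahead used here is one possible boundary assertion (it is the one used in the suggestion further down); the candidate strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical candidate lines: two real start markers and one look-alike.
my @candidates = (
    'xxxx*@user_info_start',
    'xxxx*@user_info_start   ',
    'xxxx*@user_info_starting_to_rain',
);

my ( %loose, %strict );
for my $line (@candidates) {
    # Loose: no boundary after the delimiter, so look-alikes also match.
    $loose{$line} = $line =~ /\@user_info_start/ ? 'match' : 'no match';

    # Strict: the delimiter must be followed by whitespace or end of string.
    $strict{$line} = $line =~ /\@user_info_start(?!\S)/ ? 'match' : 'no match';

    printf "%-36s loose: %-8s strict: %s\n",
        "'$line'", $loose{$line}, $strict{$line};
}
```

The loose pattern matches all three strings, while the strict pattern rejects '@user_info_starting_to_rain'. The same assertion could be appended to the patterns in the flip-flop solutions.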

      Here's an untested suggestion for a whole-file info block extraction regex. It assumes:

      • No more than one info block per file (update: although it wouldn't be very difficult to deal with multiple non-nested info blocks);
      • On the lines containing the start/end delimiter substrings, there may be any amount of any junk preceding those substrings;
      • On the lines containing the start/end delimiter substrings, there may only be zero or more whitespace characters following the start/end delimiter and before the newline;
      • There must be at least one line (i.e., at least one newline) in the info block, although this line may be blank or empty;
      • The start/end delimiter substrings are case sensitive.
      if ($file_content =~ m{
              \@user_info_start \s* \n
              (.*?)
              \n [^\n]*? \@user_info_end (?! \S)
          }xms) {
          my $user_info = $1;
          ...
      }
      (Note that the info block will be extracted without an ending newline.) If you have time to play around with this, I'd be interested to know how this regex compares speedwise to the loopy solutions.
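For anyone who wants to try the comparison, here is a rough sketch using the core Benchmark module. The helper names `extract_with_regex` and `extract_with_loop` and the sample data are hypothetical. The loop variant splits an in-memory string rather than reading a file line by line, so it is only an approximation of the while(<>) solutions, and timings on a sample this small say little about large files.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical, shortened sample modeled on the data in this thread.
my $file_content = <<'END_SAMPLE';
xxxxxxxxxxx
xxxx*@user_info_start
xxxx*@Title : Mr
xxxx*@Username : xxxxx
xxxx*@user_info_end
xxxxxxxxxxxxxxxxxxxxxxxxxxxx
END_SAMPLE

# Whole-string extraction with the regex suggested above.
sub extract_with_regex {
    my ($content) = @_;
    return $content =~ m{
        \@user_info_start \s* \n
        (.*?)
        \n [^\n]*? \@user_info_end (?! \S)
    }xms ? $1 : undef;
}

# Line-by-line extraction with the flip-flop operator, markers excluded.
sub extract_with_loop {
    my ($content) = @_;
    my @userinfo;
    for ( split /\n/, $content ) {
        if ( my $flag = /\@user_info_start/ ... /\@user_info_end/ ) {
            push @userinfo, $_ unless $flag == 1 || $flag =~ /E0/;
        }
    }
    return join "\n", @userinfo;
}

my $by_regex = extract_with_regex($file_content);
my $by_loop  = extract_with_loop($file_content);
print "regex and loop agree\n" if $by_regex eq $by_loop;
print "$by_regex\n";

# Run each variant for at least one CPU second and print a comparison table.
cmpthese( -1, {
    regex => sub { extract_with_regex($file_content) },
    loop  => sub { extract_with_loop($file_content) },
} );
```

To get meaningful numbers, replace the sample with the content of a realistic input file.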


      Give a man a fish:  <%-{-{-{-<
