Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Re^3: processing file content as string vs array

by AnomalousMonk (Bishop)
on May 13, 2019 at 20:15 UTC ( #1233732=note: print w/replies, xml ) Need Help??


in reply to Re^2: processing file content as string vs array
in thread processing file content as string vs array

I think the loopy approach discussed by haukex and others is probably better than using regexes in this application. However...

if( $file_content =~ m/(.*?\@user_info_start.*?\n)(.*)(.*?\@user_info_end.*$)/si){ ... }

Some comments on this regex. (BTW: This is all untested.)

  • m/.../si
    The  /i case insensitivity modifier has a cost in time. Is it really needed? Will you be processing info block delimiters that might look like '@UsEr_iNfO_StArT' etc? If there might be some limited variation in capitalization, e.g., '@User_Info_Start', it might be better to use a pattern like /\@[Uu]ser_[Ii]nfo_[Ss]tart/; character sets | classes are usually less expensive than global case insensitivity. (The start/end patterns used in the flip-flop solutions discussed elsewhere are entirely case sensitive.)
  • (.*)
    This greedy little pattern will grab everything remaining in the string/file, forcing the following pattern to backtrack until it finds a block end delimiter substring. In particular, it will capture any junk at the beginning of the line containing the block end delimiter substring and also the newline from the previous line. Greed is one of the seven deadly sins.
  • (.*?\@user_info_start.*?\n)
    This captures everything from the start of the string/file up to the the first newline after the block start delimiter substring. Do you really want this? You don't seem to use it, and captures aren't free.
  • (.*?\@user_info_end.*$)
    A similar comment applies to the block end delimiter pattern. This captures everything from the start of the end delimiter substring to the end of the file. Again, you don't seem to use this.
  • (.*?\@user_info_start.*?\n)
    (.*?\@user_info_end.*$)
    The info block start/end delimiter substring patterns are ambiguous: the start delimiter pattern also matches '@user_info_starting_to_rain' and similarly for the end delimiter pattern. There's a nice '@' character anchoring the start of the delimiter substrings, but I would have defined some kind of boundary assertion for their ends. (The start/end patterns used in the flip-flop solutions discussed elsewhere also suffer from this ambiguity.)

Here's an untested suggestion for a whole-file info block extraction regex. It assumes:

  • No more than one info block per file (update: although it wouldn't be very difficult to deal with multiple non-nested info blocks);
  • On the lines containing the start/end delimiter substrings, there may be any amount of any junk preceding those substrings;
  • On the lines containing the start/end delimiter substrings, there may only be zero or more whitespace following the start/end delimiter and before the newline;
  • There must be at least one line (i.e., at least one newline) in the info block, although this line may be blank or empty;
  • The start/end delimiter substrings are case sensitive.
if ($file_content =~ m{ \@user_info_start \s* \n (.*?) \n [^\n]*? \@us +er_info_end (?! \S) }xms) { my $user_info = $1; ... }
(Note that the info block will be extracted without an ending newline.) If you have time to play around with this, I'd be interested to know how this regex compares speedwise to the loopy solutions.


Give a man a fish:  <%-{-{-{-<

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1233732]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others rifling through the Monastery: (3)
As of 2020-10-24 07:31 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    My favourite web site is:












    Results (242 votes). Check out past polls.

    Notices?