Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

I think the loopy approach discussed by haukex and others is probably better than using regexes in this application. However...

if( $file_content =~ m/(.*?\@user_info_start.*?\n)(.*)(.*?\@user_info_end.*$)/si){ ... }

Some comments on this regex. (BTW: This is all untested.)

  • m/.../si
    The  /i case insensitivity modifier has a cost in time. Is it really needed? Will you be processing info block delimiters that might look like '@UsEr_iNfO_StArT' etc? If there might be some limited variation in capitalization, e.g., '@User_Info_Start', it might be better to use a pattern like /\@[Uu]ser_[Ii]nfo_[Ss]tart/; character sets | classes are usually less expensive than global case insensitivity. (The start/end patterns used in the flip-flop solutions discussed elsewhere are entirely case sensitive.)
  • (.*)
    This greedy little pattern will grab everything remaining in the string/file, forcing the following pattern to backtrack until it finds a block end delimiter substring. In particular, it will capture any junk at the beginning of the line containing the block end delimiter substring and also the newline from the previous line. Greed is one of the seven deadly sins.
  • (.*?\@user_info_start.*?\n)
    This captures everything from the start of the string/file up to the the first newline after the block start delimiter substring. Do you really want this? You don't seem to use it, and captures aren't free.
  • (.*?\@user_info_end.*$)
    A similar comment applies to the block end delimiter pattern. This captures everything from the start of the end delimiter substring to the end of the file. Again, you don't seem to use this.
  • (.*?\@user_info_start.*?\n)
    The info block start/end delimiter substring patterns are ambiguous: the start delimiter pattern also matches '@user_info_starting_to_rain' and similarly for the end delimiter pattern. There's a nice '@' character anchoring the start of the delimiter substrings, but I would have defined some kind of boundary assertion for their ends. (The start/end patterns used in the flip-flop solutions discussed elsewhere also suffer from this ambiguity.)

Here's an untested suggestion for a whole-file info block extraction regex. It assumes:

  • No more than one info block per file (update: although it wouldn't be very difficult to deal with multiple non-nested info blocks);
  • On the lines containing the start/end delimiter substrings, there may be any amount of any junk preceding those substrings;
  • On the lines containing the start/end delimiter substrings, there may only be zero or more whitespace following the start/end delimiter and before the newline;
  • There must be at least one line (i.e., at least one newline) in the info block, although this line may be blank or empty;
  • The start/end delimiter substrings are case sensitive.
if ($file_content =~ m{ \@user_info_start \s* \n (.*?) \n [^\n]*? \@us +er_info_end (?! \S) }xms) { my $user_info = $1; ... }
(Note that the info block will be extracted without an ending newline.) If you have time to play around with this, I'd be interested to know how this regex compares speedwise to the loopy solutions.

Give a man a fish:  <%-{-{-{-<

In reply to Re^3: processing file content as string vs array by AnomalousMonk
in thread processing file content as string vs array by vinoth.ree

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others chilling in the Monastery: (7)
    As of 2020-12-02 02:39 GMT
    Find Nodes?
      Voting Booth?
      How often do you use taint mode?

      Results (26 votes). Check out past polls.