in reply to Re^2: processing file content as string vs array
in thread processing file content as string vs array
I think the loopy approach discussed by haukex and others is probably better than using regexes in this application. However...
if( $file_content =~ m/(.*?\@user_info_start.*?\n)(.*)(.*?\@user_info_end.*$)/si){ ... }
Some comments on this regex. (BTW: This is all untested.)
- m/.../si
The /i case-insensitivity modifier has a cost in time. Is it really needed? Will you be processing info block delimiters that might look like '@UsEr_iNfO_StArT' etc? If there might be some limited variation in capitalization, e.g., '@User_Info_Start', it might be better to use a pattern like /\@[Uu]ser_[Ii]nfo_[Ss]tart/; character sets/classes are usually less expensive than global case insensitivity. (The start/end patterns used in the flip-flop solutions discussed elsewhere are entirely case sensitive.)
- (.*)
This greedy little pattern will grab everything remaining in the string/file, forcing the following pattern to backtrack until it finds a block end delimiter substring. In particular, it will capture any junk at the beginning of the line containing the block end delimiter substring and also the newline from the previous line. Greed is one of the seven deadly sins.
- (.*?\@user_info_start.*?\n)
This captures everything from the start of the string/file up to the first newline after the block start delimiter substring. Do you really want this? You don't seem to use it, and captures aren't free.
- (.*?\@user_info_end.*$)
A similar comment applies to the block end delimiter pattern. This captures everything from the start of the end delimiter substring to the end of the file. Again, you don't seem to use this.
- (.*?\@user_info_start.*?\n)
  (.*?\@user_info_end.*$)
The info block start/end delimiter substring patterns are ambiguous: the start delimiter pattern also matches '@user_info_starting_to_rain', and similarly for the end delimiter pattern. There's a nice '@' character anchoring the start of the delimiter substrings, but I would have defined some kind of boundary assertion for their ends. (The start/end patterns used in the flip-flop solutions discussed elsewhere also suffer from this ambiguity.)
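To make two of the comments above concrete, here is a small sketch that runs the original regex against a made-up file. The sample text and the variable names ($before, $body, $after, $loose, $strict) are my own assumptions for illustration; only the regex and the delimiter names come from the thread. The (?!\S) lookahead is just one possible boundary assertion.

```perl
use strict;
use warnings;

# Hypothetical sample input; only the delimiter names come from the thread.
my $file_content = join '',
    "header line\n",
    "# \@user_info_start\n",
    "name: alice\n",
    "mail: alice\@example.com\n",
    "# \@user_info_end\n",
    "trailer line\n";

# The original regex, unchanged, with all three captures kept for inspection.
my ($before, $body, $after);
if ($file_content =~ m/(.*?\@user_info_start.*?\n)(.*)(.*?\@user_info_end.*$)/si) {
    ($before, $body, $after) = ($1, $2, $3);
}

# $body ends with "\n# ": the previous line's newline plus the junk
# that precedes the end delimiter on its own line.
print "body:  [$body]\n";

# $after runs from the end delimiter to the end of the file.
print "after: [$after]\n";

# Delimiter ambiguity: the bare pattern also matches a longer word,
# while a negative lookahead for non-whitespace rejects it.
my $loose  = qr/\@user_info_start/;
my $strict = qr/\@user_info_start(?!\S)/;

my $not_a_delimiter = '@user_info_starting_to_rain';
my $loose_hit  = $not_a_delimiter =~ $loose  ? 1 : 0;
my $strict_hit = $not_a_delimiter =~ $strict ? 1 : 0;
print "loose: $loose_hit, strict: $strict_hit\n";
```

With this input the greedy (.*) hands back "name: alice\nmail: alice\@example.com\n# ", so the caller would still have to trim the trailing newline and comment junk by hand.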
Here's an untested suggestion for a whole-file info block extraction regex. It assumes:
- No more than one info block per file (update: although it wouldn't be very difficult to deal with multiple non-nested info blocks);
- On the lines containing the start/end delimiter substrings, there may be any amount of any junk preceding those substrings;
- On the lines containing the start/end delimiter substrings, only whitespace (zero or more characters) may follow the start/end delimiter before the newline;
- There must be at least one line (i.e., at least one newline) in the info block, although this line may be blank or empty;
- The start/end delimiter substrings are case sensitive.
(Note that the info block will be extracted without an ending newline.) If you have time to play around with this, I'd be interested to know how this regex compares speedwise to the loopy solutions.
if ($file_content =~ m{
        \@user_info_start \s* \n
        (.*?)
        \n [^\n]*? \@user_info_end (?! \S)
    }xms) {
    my $user_info = $1;
    ...
}
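For anyone who wants to try the suggested regex before timing it, here is a minimal sketch that applies it to made-up input. The sample text and the names $info_block_re, $one_block, $two_blocks and @all_blocks are assumptions of mine. The second half uses m//g in list context as one possible way to handle multiple non-nested blocks; that part is my guess at an approach, not something worked out in the thread.

```perl
use strict;
use warnings;

# The suggested regex, compiled once so both uses below share it.
my $info_block_re = qr{
    \@user_info_start \s* \n    # start delimiter, optional trailing whitespace
    (.*?)                       # block body, captured lazily
    \n [^\n]*? \@user_info_end  # end delimiter line, possibly with leading junk
    (?! \S)                     # end delimiter must not run into more text
}xms;

# Hypothetical file content with one info block.
my $one_block = join('',
    "junk before\n",
    "/* \@user_info_start   \n",
    "name: alice\n",
    "role: admin\n",
    " * \@user_info_end\n",
    "junk after\n",
);

# List-context match returns the single capture group.
my ($user_info) = $one_block =~ $info_block_re;
print "[$user_info]\n";

# Hypothetical file content with a second, non-nested block appended.
my $two_blocks = $one_block . join('',
    "# \@user_info_start\n",
    "name: bob\n",
    "# \@user_info_end\n",
);

# With /g in list context, every match contributes its capture.
my @all_blocks = $two_blocks =~ /$info_block_re/g;
print scalar(@all_blocks), " blocks found\n";
```

Note that the captured body has no trailing newline, and that the junk before the end delimiter (" * " here) stays out of the capture.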
Give a man a fish: <%-{-{-{-<