Applying regex to each line in a record.

pritesh_ugrankar has asked for the wisdom of the Perl Monks concerning the following question:

Re: Applying regex to each line in a record. by haukex (Archbishop) on Oct 24, 2020 at 22:26 UTC
> Even after multiple attempts, I am at a total loss of how the "m" and "s" works for regex.

`/m` changes the meaning of `^` and `$`:

- Without `/m`:
  - `^` matches only at the very beginning of the string. (This is the same as `\A`, except that `\A` is not affected by `/m`.)
  - `$` matches at the very end of the string, but if the string ends with `\n`, it will match just before and just after this `\n`. (This is the same as `\Z`, except that `\Z` is not affected by `/m`.)
- With `/m`:
  - `^` matches at the very beginning of the string, and just after any `\n`, except if the `\n` is the last character in the string. In other words, it matches at the beginning of each line within the string.
  - `$` matches just before each `\n`, in other words before the end of every line within the string, and at the very end of the string.

`/s` changes the meaning of `.`:

- Without `/s`, `.` matches anything except the newline, i.e. `[^\n]`. In other words, a regex of `/.+/g` is limited to matching one line within the string at a time.
- With `/s`, `.` matches absolutely any character, including `\n`.

Note that `/m` and `/s` are completely independent of one another.

Keep in mind that `^` and `$` are zero-width matches. For example, with `$_ = "a\nb"`, a regex of `/$/gm` will match and leave the regex engine's position just before the `\n`, and a following regex of `/./gs` would then match that `\n`.

Here is some code to play around with (try changing the lists of `$str`ings and `$regex`es). As you can see, `/m` really only becomes important if there is more than one line in the string. And of course there's the WebPerl Regex Tester that visualizes this as well (modern browser required).

    use warnings;
    use strict;
    use open qw/:std :utf8/;
    use Term::ANSIColor qw/colored/;

    for my $str ( "a","a\n","a\nb","a\n\nb","a\nb\nc\n","a\nb\nc\nd") {
        for my $regex ( '/^/g','/^/gm','/$/g','/$/gm','/./g','/./gs' ) {
            # two columns per character; control characters are shown
            # as Unicode "control pictures" (e.g. "\n" becomes U+240A)
            my $o = join( '', map { sprintf "%2s", chr( $_<0x21 ? 0x2400+$_
                : $_==0x7F ? 0x2421 : $_ ) } map ord, split //, $str )." ";
            my @matches;
            eval qq{ push \@matches, [[\@-],[\@+]] while \$str=~$regex ;1}
                or die $@;
            my ($matchcnt,%matches) = (1);
            for my $match (@matches) {
                # zero-width match: mark the column before the character;
                # otherwise mark the column of each matched character
                my @pos = $match->[0][0]==$match->[1][0]
                    ? ( $match->[0][0]*2 )
                    : map { $_*2+1 } $match->[0][0]..$match->[1][0]-1;
                for my $p (@pos) {
                    die "overlapping matches not supported"
                        if exists $matches{$p};
                    $matches{$p} = $matchcnt;
                }
            }
            continue { $matchcnt++ }
            substr($o, $_, 1) = colored(['underline'], substr($o, $_, 1))
                #"<u>".substr($o, $_, 1)."</u>" # alternative for HTML
                for sort { $b<=>$a } keys %matches;
            printf "%6s: %s\n", $regex, $o;
        }
    }

Output (when run in a terminal, the matched positions are underlined):

      /^/g:  a
     /^/gm:  a
      /$/g:  a
     /$/gm:  a
      /./g:  a
     /./gs:  a

      /^/g:  a ␊
     /^/gm:  a ␊
      /$/g:  a ␊
     /$/gm:  a ␊
      /./g:  a ␊
     /./gs:  a ␊

      /^/g:  a ␊ b
     /^/gm:  a ␊ b
      /$/g:  a ␊ b
     /$/gm:  a ␊ b
      /./g:  a ␊ b
     /./gs:  a ␊ b

      /^/g:  a ␊ ␊ b
     /^/gm:  a ␊ ␊ b
      /$/g:  a ␊ ␊ b
     /$/gm:  a ␊ ␊ b
      /./g:  a ␊ ␊ b
     /./gs:  a ␊ ␊ b

      /^/g:  a ␊ b ␊ c ␊
     /^/gm:  a ␊ b ␊ c ␊
      /$/g:  a ␊ b ␊ c ␊
     /$/gm:  a ␊ b ␊ c ␊
      /./g:  a ␊ b ␊ c ␊
     /./gs:  a ␊ b ␊ c ␊

      /^/g:  a ␊ b ␊ c ␊ d
     /^/gm:  a ␊ b ␊ c ␊ d
      /$/g:  a ␊ b ␊ c ␊ d
     /$/gm:  a ␊ b ␊ c ␊ d
      /./g:  a ␊ b ␊ c ␊ d
     /./gs:  a ␊ b ␊ c ␊ d

Update: Note that the perlre section "Repeated Patterns Matching a Zero-length Substring" is relevant here (example).
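For readers who find the script above hard to follow, here is a much smaller sketch of the same experiment. It lists the offsets at which each pattern matches instead of underlining them. The helper name `match_offsets` and the sample string are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: start offsets of all matches of $re in $str.
sub match_offsets {
    my ($str, $re) = @_;
    my @offsets;
    # $-[0] is the start offset of the most recent successful match
    push @offsets, $-[0] while $str =~ /$re/g;
    return @offsets;
}

my $str = "a\nb\n";    # offsets: a=0, \n=1, b=2, \n=3, end of string=4

my @cases = (
    [ '/^/'  => qr/^/  ],
    [ '/^/m' => qr/^/m ],
    [ '/$/'  => qr/$/  ],
    [ '/$/m' => qr/$/m ],
    [ '/./'  => qr/./  ],
    [ '/./s' => qr/./s ],
);

for my $case (@cases) {
    my ($label, $re) = @$case;
    printf "%-5s matches at offsets: %s\n",
        $label, join ',', match_offsets($str, $re);
}
```

With this input, `/^/m` should report offsets 0 and 2 where plain `/^/` reports only 0, and `/./s` should report every offset from 0 to 3 where plain `/./` skips the newlines.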
Re^2: Applying regex to each line in a record. by pritesh_ugrankar (Monk) on Oct 25, 2020 at 16:55 UTC
Hi Haukex, I'm truly at a loss for words. While the code you've written here is truly advanced for me, the output is teaching me a lot.
Re: Applying regex to each line in a record. by tybalt89 (Monsignor) on Oct 24, 2020 at 19:26 UTC
If I understand your problem correctly, this is how I'd do it (without all that mucking around with /s and /m and paragraphs):

    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11123126
    use warnings;

    my %hash;
    my $current;
    while( <DATA> ) {
        if( /(\w+):\n/ ) {
            $hash{$1} = $current = {};
        }
        elsif( /(\w+):(\w+)/ ) {
            $current->{$1} = $2;
        }
    }

    use Data::Dump 'dd'; dd \%hash;

    __DATA__
    first:
    this:that
    here:there
    when:what
    how:where
    now:later
    name:onlyfirst

    second:
    this:that
    here:there
    when:what
    how:where
    now:later
    name:onlysecond

Outputs:

    {
      first => {
        here => "there",
        how  => "where",
        name => "onlyfirst",
        now  => "later",
        this => "that",
        when => "what",
      },
      second => {
        here => "there",
        how  => "where",
        name => "onlysecond",
        now  => "later",
        this => "that",
        when => "what",
      },
    }

Is that the hash-of-hashes you are looking for?
Re: Applying regex to each line in a record. by haukex (Archbishop) on Oct 25, 2020 at 10:45 UTC
To comment on your code specifically:

    local $/ = "";
    while (my @records = <$fh>) {
        foreach my $line (@records) {

You're activating paragraph mode with `$/ = ""`, meaning that for your sample data, each call to `<$fh>` in scalar context (e.g. `my $para = <$fh>`) will return one paragraph (e.g. `"first:\nthis:that\nhere:there..."`). However, in your `while`, you're calling `<$fh>` in list context (because you're assigning to an array), which will cause it to return all records, i.e. all paragraphs. This means that the second time the `while` tries to execute, it won't get anything from `$fh`, so the loop will only execute once, and that makes the `while` loop kind of useless in this code. You can see this yourself by adding a `print Dumper(\@records);` at the top of the `while` (I'd also strongly recommend setting `$Data::Dumper::Useqq=1;`).

Next, based on your variable naming and code, I guess that what you are expecting is that `foreach my $line (@records)` will loop over the lines in each paragraph. However, Perl doesn't do this automatically. With this code, you'd have to split each element of `@records` manually. What you're doing instead is looping over the paragraphs.

Here is the code I think you were trying to write:

    local $/ = "";
    while (my $paragraph = <$fh>) {
        print Dumper($paragraph);
        foreach my $line (split /\n+/, $paragraph) {
            print Dumper($line);
            next if $line =~ /^[a-z]+:$/m;
            print "<$line>\n";
        }
    }

As you can see, the problem actually occurs before your code even gets to the regex.

The above approach is ok, as long as the paragraphs don't get too large to fit comfortably into RAM. Otherwise, you'd have to choose a more efficient approach like reading the file line-by-line and recognizing paragraphs with a state machine type approach. The other monks have shown you several examples of different approaches.
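The scalar-versus-list-context difference described above can be reproduced without a file. The following is only a sketch: the sample data and the helper names `read_paragraphs` and `count_loop_passes` are invented for the illustration, and an in-memory filehandle stands in for `$fh`.

```perl
use strict;
use warnings;

# Assumed sample data: two paragraphs separated by a blank line.
my $data = "first:\nthis:that\n\nsecond:\nhere:there\n";

# Hypothetical helper: scalar context, one paragraph per <$fh> call.
sub read_paragraphs {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = "";                       # paragraph mode
    my @paragraphs;
    while (my $paragraph = <$fh>) {      # scalar assignment: one record
        push @paragraphs, $paragraph;
    }
    return @paragraphs;
}

# Hypothetical helper: list context, <$fh> returns all paragraphs at once.
sub count_loop_passes {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = "";
    my $passes = 0;
    while (my @records = <$fh>) {        # array assignment: slurps everything
        $passes++;
    }
    return $passes;
}

my @paragraphs = read_paragraphs($data);
printf "scalar context: %d paragraphs, read one per iteration\n",
    scalar @paragraphs;
printf "list context:   loop body ran %d time(s)\n",
    count_loop_passes($data);
```

The first helper sees each paragraph in its own iteration, while the loop in the second helper runs once because the first `<$fh>` call already consumed the whole input.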
Re^2: Applying regex to each line in a record. by pritesh_ugrankar (Monk) on Oct 25, 2020 at 16:37 UTC
Hi Haukex, Amazing... Yes, indeed, I was thinking along the lines you described. Thank you so very much.
Re: Applying regex to each line in a record. by choroba (Cardinal) on Oct 24, 2020 at 19:30 UTC
You need to remember the current header (or the current section).

    #!/usr/bin/perl
    use warnings;
    use strict;

    my %hash;
    my $header;
    while (<>) {
        if (my ($h) = /^(.*):$/) {
            $header = $h;
        } elsif (my ($k, $v) = /^(.*):(.*)$/) {
            $hash{$header}{$k} = $v;
        }
    }

    use Data::Dumper; print Dumper \%hash;

`map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]`
Re: Applying regex to each line in a record. by tybalt89 (Monsignor) on Oct 24, 2020 at 21:20 UTC
I tried paragraph mode, still didn't need /s and /m. :)

    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11123126
    use warnings;

    local $/ = ''; # paragraph at a time
    my %hash = map { $1 x /(\w+):\n/, { /(\w+):(\w+)/g } } <DATA>;

    use Data::Dump 'dd'; dd \%hash;

    __DATA__
    first:
    this:that
    here:there
    when:what
    how:where
    now:later
    name:onlyfirst

    second:
    this:that
    here:there
    when:what
    how:where
    now:later
    name:onlysecond
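The `map` one-liner above packs two matches into a single expression. A longer sketch of what I believe is the same logic, written as a loop, may be easier to read. The helper name `parse_records` and the shortened sample data are assumptions made for this illustration, and an in-memory filehandle replaces `DATA`.

```perl
use strict;
use warnings;

# Hypothetical helper: parse blank-line-separated records into a hash of hashes.
sub parse_records {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = '';                       # paragraph at a time
    my %hash;
    while (my $paragraph = <$fh>) {
        # The record name is the only "word:" directly followed by a newline.
        my ($name) = $paragraph =~ /(\w+):\n/ or next;
        # In list context, /g returns every capture: key1, value1, key2, ...
        my %fields = $paragraph =~ /(\w+):(\w+)/g;
        $hash{$name} = \%fields;
    }
    return \%hash;
}

my $records = parse_records(
    "first:\nthis:that\nhere:there\n\nsecond:\nnow:later\n"
);
print join(',', sort keys %$records), "\n";
```

Neither pattern needs `/m` or `/s`: the header pattern looks for a literal `\n` after the colon, and `\w+` never crosses a newline.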
Re: Applying regex to each line in a record. by AnomalousMonk (Archbishop) on Oct 24, 2020 at 20:51 UTC
This approach is very generalized (some might say over-generalized :) and more verbose, but has the advantage that it is very flexible and can be highly specialized. E.g., patterns for top level key and lower level key/values can be individually specified. The script does a fair amount of data validation. I'm using a slightly different example dataset for testing.

Script `extract_HoH_1.pl`: [not shown, 1517 bytes]

Example dataset `11123126_1.dat`: [not shown, 412 bytes]

Example run:

    Win8 Strawberry 5.8.9.5 (32) Sat 10/24/2020 15:46:02
    C:\@Work\Perl\monks\pritesh_ugrankar
    >perl extract_HoH_1.pl
    $VAR1 = {
              'first' => {
                           'when' => 'what',
                           'here' => 'there',
                           'first' => 'firstAsKey',
                           'now' => 'later',
                           'how' => 'where',
                           'firstAsValue' => 'first',
                           'this' => 'that'
                         },
              'second' => {
                            'now2' => 'later2',
                            'here2' => 'there2',
                            'how2' => 'where2',
                            'when2' => 'what2',
                            'secondAsValue' => 'second',
                            'this2' => 'that2',
                            'second' => 'secondAsKey'
                          }
            };

Give a man a fish: `<%-{-{-{-<`
Re: Applying regex to each line in a record. by stevieb (Canon) on Oct 24, 2020 at 20:56 UTC
This can also be accomplished without a regex at all, thanks to the split() function.

    use strict;
    use warnings;

    use Data::Dumper;

    my %hash;
    my $header;

    while (my $line = <DATA>) {
        chomp $line;
        my ($k, $v) = split ':', $line;
        next if ! $k;
        if (! $v) {
            $header = $k;
            next;
        }
        $hash{$header}{$k} = $v;
    }

    print Dumper \%hash;

    __DATA__
    first:
    this:that
    here:there
    when:what
    how:where
    now:later

    second:
    this:that
    here:there
    when:what
    how:where
    now:later

Output:

    $VAR1 = {
              'second' => {
                            'now' => 'later',
                            'how' => 'where',
                            'here' => 'there',
                            'this' => 'that',
                            'when' => 'what'
                          },
              'first' => {
                           'when' => 'what',
                           'how' => 'where',
                           'here' => 'there',
                           'this' => 'that',
                           'now' => 'later'
                         }
            };
Re^2: Applying regex to each line in a record. by choroba (Cardinal) on Oct 24, 2020 at 21:46 UTC
> without a regex at all, thanks to the split() function.

Note that the first argument to split, even if you write it as `':'`, is a regex. Only a single space, `' '`, is special.
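A quick way to convince yourself of this is to split on a string that means something different as a pattern. The sketch below uses made-up sample strings and variable names.

```perl
use strict;
use warnings;

# A string as first argument is still compiled as a pattern:
my @dot_as_regex   = split '.',  'a.b.c';   # "." matches every character
my @dot_as_literal = split /\./, 'a.b.c';   # escaped dot: literal "."

# The single-space string is the special case: leading whitespace is
# skipped and fields are separated by runs of whitespace (like awk).
my @awk_mode    = split ' ', '  x  y ';
my @regex_space = split / /, '  x  y ';     # each single space separates

printf "split '.'   : %d fields\n", scalar @dot_as_regex;
printf "split /\\./  : %d fields\n", scalar @dot_as_literal;
printf "split ' '   : %d fields\n", scalar @awk_mode;
printf "split / /   : %d fields\n", scalar @regex_space;
```

`split '.'` returns no fields at all, because every character is treated as a separator and the resulting empty trailing fields are dropped. That only makes sense if the string was treated as a regex.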
Re^3: Applying regex to each line in a record. by stevieb (Canon) on Oct 25, 2020 at 16:01 UTC
I was hoping that nobody would notice and point that out ;) But yes, choroba is absolutely correct in his assessment.
Re: Applying regex to each line in a record. by pritesh_ugrankar (Monk) on Oct 24, 2020 at 20:41 UTC
Hi Choroba and Tybalt, Thank you for the answer!! Truly amazing. It gets rid of the "m" vs "s" confusion by simply not using them!! I was breaking my head on this for quite a while, and you guys have given such an elegant answer in no time. Sometimes it makes me feel like giving up scripting, but it's too fun and productive. Besides, I still haven't been shooed away from here, so till then I'll stick around. :D
Re^2: Applying regex to each line in a record. by AnomalousMonk (Archbishop) on Oct 24, 2020 at 21:09 UTC
> It gets rid of the "m" vs "s" confusion by simply not using them!!

~~`/pattern/` is equivalent to `m/pattern/`.~~ Nope, pritesh_ugrankar is referring to the `/m` and `/s` modifiers! But the following are still a good read. :)

See perlre, perlretut (highly recommended!) and perlreref for regex info; also perlop for the `m//` and `s///` (and `tr///`) operators.
Re: Applying regex to each line in a record. by perlfan (Vicar) on Oct 25, 2020 at 20:29 UTC
It doesn't help that `s/` and `m/` imply different operation modes when prepended to the regex, "search and replace" (substitution) and "match", respectively. And the absence of anything in front implies `m/`.
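To keep the two meanings apart, it may help to see the letters in both positions side by side: in front of the pattern they pick the operator, behind it they are modifiers. This is only an illustrative sketch; the sample text and variable names are made up.

```perl
use strict;
use warnings;

my $text = "first:\nthis:that\n";

# Letter in front = operator: m// matches, s/// substitutes.
my $has_colon = $text =~ m/:/;             # same as $text =~ /:/
(my $changed = $text) =~ s/this/THIS/;     # copy $text, then edit the copy

# Letter behind = modifier: /m affects ^ and $, /s affects "."
my @line_starts = $text =~ m/^(\w+)/mg;    # ^ matches at each line start
my ($rest)      = $text =~ m/first:(.+)/s; # "." may match "\n" as well

print "line starts: @line_starts\n";
print "changed:     $changed";
```

Here `m/^(\w+)/mg` picks up the first word of every line, and `m/first:(.+)/s` captures everything after `first:` including the newlines, while `s/this/THIS/` has nothing to do with the `/s` modifier.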

