Re: Applying regex to each line in a record.
by haukex (Archbishop) on Oct 24, 2020 at 22:26 UTC
Even after multiple attempts, I am at a total loss of how the "m" and "s" works for regex.
- /m changes the meaning of ^ and $:
  - Without /m,
    - ^ matches only at the very beginning of the string. (This is the same as \A, except that \A is not affected by /m.)
    - $ matches at the very end of the string, but if the string ends with \n, it will match just before and just after this \n. (This is the same as \Z, except that \Z is not affected by /m.)
  - With /m,
    - ^ matches at the very beginning of the string, and just after any \n, except if the \n is the last character in the string. In other words, it matches at the beginning of each line within the string.
    - $ matches just before each \n, in other words before the end of every line within the string, and at the very end of the string.
- /s changes the meaning of .:
  - Without /s, . matches anything except the newline, i.e. [^\n]. In other words, a regex of /.+/g is limited to matching one line within the string at a time.
  - With /s, . matches absolutely any character, including \n.
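A minimal sketch of the rules above, assuming the sample string "a\nb\n"; the helper count_matches is made up for this illustration and simply counts how often a pattern matches under /g:

```perl
use strict;
use warnings;

# Count how often $re matches in $str.
# The empty list assignment forces list context, so /g returns every match.
sub count_matches {
    my ($str, $re) = @_;
    my $count = () = $str =~ /$re/g;
    return $count;
}

my $str = "a\nb\n";
printf "%-5s %d\n", '/^/',  count_matches($str, qr/^/);   # 1: start of the string only
printf "%-5s %d\n", '/^/m', count_matches($str, qr/^/m);  # 2: start of each line
printf "%-5s %d\n", '/$/',  count_matches($str, qr/$/);   # 2: before and after the final \n
printf "%-5s %d\n", '/$/m', count_matches($str, qr/$/m);  # 3: before each \n, plus the very end
printf "%-5s %d\n", '/./',  count_matches($str, qr/./);   # 2: "a" and "b"
printf "%-5s %d\n", '/./s', count_matches($str, qr/./s);  # 4: every character, including \n
```

The modifiers are attached to the qr// objects, so they stay in effect when the pattern is interpolated into the /g match inside the helper.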
Note that /m and /s are completely independent of one another. Keep in mind that ^ and $ are zero-width matches: for example, with $_ = "a\nb", a regex of /$/gm will match and leave the regex engine's position just before the \n*, and a following regex of /./gs would then match that \n.
Here is some code to play around with (try changing the lists of strings and regexes in the two for loops). As you can see, /m only makes a difference if the string contains a \n that is not its last character. And of course there's the WebPerl Regex Tester that visualizes this as well (modern browser required).
use warnings;
use strict;
use open qw/:std :utf8/;
use Term::ANSIColor qw/colored/;
for my $str ( "a","a\n","a\nb","a\n\nb","a\nb\nc\n","a\nb\nc\nd") {
    for my $regex ( '/^/g','/^/gm','/$/g','/$/gm','/./g','/./gs' ) {
        my $o = join( '', map { sprintf "%2s",
            chr( $_<0x21 ? 0x2400+$_ : $_==0x7F ? 0x2421 : $_ ) }
            map ord, split //, $str )." ";
        my @matches;
        eval qq{ push \@matches, [[\@-],[\@+]] while \$str=~$regex ;1}
            or die $@;
        my ($matchcnt,%matches) = (1);
        for my $match (@matches) {
            my @pos = $match->[0][0]==$match->[1][0]
                ? ( $match->[0][0] * 2 )
                : map { $_*2+1 } $match->[0][0]..$match->[1][0]-1;
            for my $p (@pos) {
                die "overlapping matches not supported"
                    if exists $matches{$p};
                $matches{$p} = $matchcnt;
            }
        } continue { $matchcnt++ }
        substr($o, $_, 1) = colored(['underline'], substr($o, $_, 1))
            #"<u>".substr($o, $_, 1)."</u>" # alternative for HTML
            for sort { $b<=>$a } keys %matches;
        printf "%6s: %s\n", $regex, $o;
    }
}
Output:
/^/g: a
/^/gm: a
/$/g: a
/$/gm: a
/./g: a
/./gs: a
/^/g: a ␊
/^/gm: a ␊
/$/g: a ␊
/$/gm: a ␊
/./g: a ␊
/./gs: a ␊
/^/g: a ␊ b
/^/gm: a ␊ b
/$/g: a ␊ b
/$/gm: a ␊ b
/./g: a ␊ b
/./gs: a ␊ b
/^/g: a ␊ ␊ b
/^/gm: a ␊ ␊ b
/$/g: a ␊ ␊ b
/$/gm: a ␊ ␊ b
/./g: a ␊ ␊ b
/./gs: a ␊ ␊ b
/^/g: a ␊ b ␊ c ␊
/^/gm: a ␊ b ␊ c ␊
/$/g: a ␊ b ␊ c ␊
/$/gm: a ␊ b ␊ c ␊
/./g: a ␊ b ␊ c ␊
/./gs: a ␊ b ␊ c ␊
/^/g: a ␊ b ␊ c ␊ d
/^/gm: a ␊ b ␊ c ␊ d
/$/g: a ␊ b ␊ c ␊ d
/$/gm: a ␊ b ␊ c ␊ d
/./g: a ␊ b ␊ c ␊ d
/./gs: a ␊ b ␊ c ␊ d
* Update: Note that Repeated Patterns Matching a Zero length Substring is relevant here (example).
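The zero-width behavior described for "a\nb" can be checked with pos(). This is only a sketch; the variable names are invented for the illustration:

```perl
use strict;
use warnings;

my $str = "a\nb";

# Scalar-context /g match: $ with /m matches just before the \n and consumes nothing.
$str =~ /$/gm or die "expected /\$/gm to match";
my $pos_after_anchor = pos($str);    # 1, i.e. still in front of the \n

# The next /g match resumes at that position; with /s the dot can take the \n.
$str =~ /(.)/gs or die "expected /./gs to match";
my $char          = $1;
my $pos_after_dot = pos($str);       # 2, the \n has been consumed

printf "position after /\$/gm: %d\n", $pos_after_anchor;
printf "/./gs matched the newline: %s (position now %d)\n",
    ( $char eq "\n" ? "yes" : "no" ), $pos_after_dot;
```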
Re: Applying regex to each line in a record.
by tybalt89 (Monsignor) on Oct 24, 2020 at 19:26 UTC
If I understand your problem correctly, this is how I'd do it (without all that mucking around with /s and /m and paragraphs)
#!/usr/bin/perl
use strict; # https://perlmonks.org/?node_id=11123126
use warnings;
my %hash;
my $current;
while( <DATA> )
{
    if( /(\w+):\n/ )
    {
        $hash{$1} = $current = {};
    }
    elsif( /(\w+):(\w+)/ )
    {
        $current->{$1} = $2;
    }
}
use Data::Dump 'dd'; dd \%hash;
__DATA__
first:
this:that
here:there
when:what
how:where
now:later
name:onlyfirst
second:
this:that
here:there
when:what
how:where
now:later
name:onlysecond
Outputs:
{
  first => {
    here => "there",
    how => "where",
    name => "onlyfirst",
    now => "later",
    this => "that",
    when => "what",
  },
  second => {
    here => "there",
    how => "where",
    name => "onlysecond",
    now => "later",
    this => "that",
    when => "what",
  },
}
Is that the hash-of-hashes you are looking for?
Re: Applying regex to each line in a record.
by haukex (Archbishop) on Oct 25, 2020 at 10:45 UTC
To comment on your code specifically:
local $/ = "";
while (my @records = <$fh>) {
foreach my $line (@records) {
You're activating paragraph mode with $/ = "", meaning that for your sample data, each call to <$fh> in scalar context (e.g. my $para = <$fh>) will return one paragraph (e.g. "first:\nthis:that\nhere:there..."). However, in your while, you're calling <$fh> in list context (because you're assigning to an array), which will cause it to return all records, i.e. all paragraphs. This means that the second time the while tries to execute, it won't get anything from $fh, so the while loop will only execute once, and that makes the while loop kind of useless in this code. You can see this yourself by adding a print Dumper(\@records); at the top of the while (I'd also strongly recommend setting $Data::Dumper::Useqq=1;).
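The difference between the two contexts can be seen in a small sketch. It assumes a made-up two-paragraph sample held in $data and read through an in-memory filehandle instead of a real file:

```perl
use strict;
use warnings;

# Hypothetical sample: two paragraphs separated by a blank line.
my $data = "first:\nthis:that\n\nsecond:\nhere:there\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

local $/ = "";          # paragraph mode

my $first = <$fh>;      # scalar context: exactly one paragraph
my @rest  = <$fh>;      # list context: every remaining paragraph at once
my @none  = <$fh>;      # already at end of file, so this is empty

printf "first paragraph starts with: %s\n", ( split /\n/, $first )[0];
printf "paragraphs returned by the list-context read: %d\n", scalar @rest;
printf "paragraphs left after that: %d\n", scalar @none;
```

The empty third read is the reason a while loop whose condition assigns to an array runs its body only once.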
Next, based on your variable naming and code, I guess that what you are expecting is that foreach my $line (@records) will loop over the lines in each paragraph. However, Perl doesn't do this automatically - with this code, you'd have to split each element of @records manually. What you're doing instead is looping over the paragraphs. Here is the code I think you were trying to write:
local $/ = "";
while (my $paragraph = <$fh>) {
    print Dumper($paragraph);
    foreach my $line (split /\n+/, $paragraph) {
        print Dumper($line);
        next if $line =~ /^[a-z]+:$/m;
        print "<$line>\n";
    }
}
As you can see, the problem actually occurs before your code even gets to the regex.
The above approach is ok, as long as the paragraphs don't get too large to fit comfortably into RAM. Otherwise, you'd have to choose a more efficient approach like reading the file line-by-line and recognizing paragraphs with a state machine type approach. The other monks have shown you several examples of different approaches.
Re: Applying regex to each line in a record.
by choroba (Cardinal) on Oct 24, 2020 at 19:30 UTC
You need to remember the current header (or the current section).
#!/usr/bin/perl
use warnings;
use strict;
my %hash;
my $header;
while (<>) {
    if (my ($h) = /^(.*):$/) {
        $header = $h;
    } elsif (my ($k, $v) = /^(.*):(.*)$/) {
        $hash{$header}{$k} = $v;
    }
}
use Data::Dumper;
print Dumper \%hash;
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Applying regex to each line in a record.
by tybalt89 (Monsignor) on Oct 24, 2020 at 21:20 UTC
I tried paragraph mode, still didn't need /s and /m. :)
#!/usr/bin/perl
use strict; # https://perlmonks.org/?node_id=11123126
use warnings;
local $/ = ''; # paragraph at a time
my %hash = map { $1 x /(\w+):\n/, { /(\w+):(\w+)/g } } <DATA>;
use Data::Dump 'dd'; dd \%hash;
__DATA__
first:
this:that
here:there
when:what
how:where
now:later
name:onlyfirst

second:
this:that
here:there
when:what
how:where
now:later
name:onlysecond
Re: Applying regex to each line in a record.
by AnomalousMonk (Archbishop) on Oct 24, 2020 at 20:51 UTC
This approach is very generalized (some might say over-generalized :) and more verbose,
but has the advantage that it is very flexible and can be highly specialized.
E.g., patterns for top level key and lower level key/values can be individually specified.
The script does a fair amount of data validation.
I'm using a slightly different example dataset for testing.
Script extract_HoH_1.pl:
Example dataset 11123126_1.dat:
Example run:
Win8 Strawberry 5.8.9.5 (32) Sat 10/24/2020 15:46:02
C:\@Work\Perl\monks\pritesh_ugrankar
>perl extract_HoH_1.pl
$VAR1 = {
          'first' => {
                       'when' => 'what',
                       'here' => 'there',
                       'first' => 'firstAsKey',
                       'now' => 'later',
                       'how' => 'where',
                       'firstAsValue' => 'first',
                       'this' => 'that'
                     },
          'second' => {
                        'now2' => 'later2',
                        'here2' => 'there2',
                        'how2' => 'where2',
                        'when2' => 'what2',
                        'secondAsValue' => 'second',
                        'this2' => 'that2',
                        'second' => 'secondAsKey'
                      }
        };
Give a man a fish: <%-{-{-{-<
Re: Applying regex to each line in a record.
by stevieb (Canon) on Oct 24, 2020 at 20:56 UTC
This can also be accomplished without a regex at all, thanks to the split() function.
use strict;
use warnings;
use Data::Dumper;
my %hash;
my $header;
while (my $line = <DATA>) {
    chomp $line;
    my ($k, $v) = split ':', $line;
    next if ! $k;
    if (! $v) {
        $header = $k;
        next;
    }
    $hash{$header}{$k} = $v;
}
print Dumper \%hash;
__DATA__
first:
this:that
here:there
when:what
how:where
now:later
second:
this:that
here:there
when:what
how:where
now:later
Output:
$VAR1 = {
          'second' => {
                        'now' => 'later',
                        'how' => 'where',
                        'here' => 'there',
                        'this' => 'that',
                        'when' => 'what'
                      },
          'first' => {
                       'when' => 'what',
                       'how' => 'where',
                       'here' => 'there',
                       'this' => 'that',
                       'now' => 'later'
                     }
        };
| [reply] [d/l] [select] |
|
| [reply] [d/l] [select] |
|
I was hoping that nobody would notice and point that out ;)
But yes, choroba is absolutely correct in his assessment.
Re: Applying regex to each line in a record.
by pritesh_ugrankar (Monk) on Oct 24, 2020 at 20:41 UTC
Hi Choroba and Tybalt,
Thank you for the answer!! Truly amazing. It gets rid of the "m" vs "s" confusion by simply not using them!!
I was breaking my head on this for quite a while, and you guys have given such an elegant answer in no time. Sometimes it makes me feel like giving up scripting, but it's too fun and productive.
Besides, I still haven't been shooed away from here, so till then I'll stick around. :D
| [reply] [d/l] [select] |
|
It gets rid of the "m" vs "s" confusion by simply not using them!!
/pattern/ is equivalent to m/pattern/.
Nope, pritesh_ugrankar is referring to the /m and /s modifiers! But the following are still a good read. :)
See perlre, perlretut (highly recommended!) and perlreref for regex info; also perlop for the m// and s/// (and tr///) operators.
Give a man a fish: <%-{-{-{-<
Re: Applying regex to each line in a record.
by perlfan (Vicar) on Oct 25, 2020 at 20:29 UTC
It doesn't help that s/ and m/ imply different operation modes when prepended to the regex, "search and replace" and "match", respectively. And the absence of anything in front implies m/.
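To keep the two meanings apart, here is a short sketch that uses m and s once as leading operators and once as trailing modifiers. The sample text and variable names are invented for the illustration:

```perl
use strict;
use warnings;

my $text = "one\ntwo\n";

# Leading m: the match operator. With / as delimiter the m is optional.
my $op_explicit = $text =~ m/two/ ? 1 : 0;    # 1
my $op_implicit = $text =~ /two/  ? 1 : 0;    # 1, same operator

# Trailing /m: a modifier that lets ^ and $ match at line boundaries.
my $without_m = $text =~ /^two$/  ? 1 : 0;    # 0, ^ only matches at the start of the string
my $with_m    = $text =~ /^two$/m ? 1 : 0;    # 1

# Leading s: the substitution operator. Trailing /s: a modifier that lets . match \n.
( my $without_s = $text ) =~ s/e.t/X/;        # no match, copy stays "one\ntwo\n"
( my $with_s    = $text ) =~ s/e.t/X/s;       # "e\nt" is replaced, giving "onXwo\n"

print "m// vs. bare //: $op_explicit $op_implicit\n";
print "/^two\$/ without and with /m: $without_m $with_m\n";
print "s/e.t/X/ changed the text without /s: ", ( $without_s ne $text ? "yes" : "no" ), "\n";
print "s/e.t/X/s changed the text with /s: ",   ( $with_s    ne $text ? "yes" : "no" ), "\n";
```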