PerlMonks  

Applying regex to each line in a record.

by pritesh_ugrankar (Monk)
on Oct 24, 2020 at 18:40 UTC (#11123126)

pritesh_ugrankar has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

Even after multiple attempts, I am at a total loss of how the "m" and "s" works for regex. I have a file like this:

    first:
    this:that
    here:there
    when:what
    how:where
    now:later

    second:
    this:that
    here:there
    when:what
    how:where
    now:later

Please note that in the example, "this:that", "here:there", etc. are repeated, but that's not the case with the actual record I am working on. I am trying to write a script that will create a hash of hashes: the top-level key will be "first", and its value will be a hash in which the key "this" has the value "that", the key "here" has the value "there", and so on.

But even before I get there, I need to make sure that I write the correct regex. So, I tried to write a regex that will skip the line if it contains "first" or "second". Of course, if the regex works, I can then capture that part and use it as a hash key, but that's much later.

I've tried the following, but it does not work. And I am pretty sure it's because of my utter failure to understand how "m" and "s" work.

    use strict;
    use warnings;
    use Data::Dumper;

    my $file = "new_testfile.txt";
    my $testhashref;
    open (my $fh, "<",$file) or die "Can't open open file $file:$!";
    {
        local $/ = "";
        while (my @records = <$fh>) {
            foreach my $line (@records) {
                next if $line =~ /^[a-z]+:$/m;
                print "$line";
            }
        }
    }

I tried using "s" instead of "m", but when I run the script, it does not read anything.

    pritesh@pavilion:~/perlscripts$ perl test.pl
    pritesh@pavilion:~/perlscripts$

If I remove the next if $line =~ /^[a-z]+:$/m;, I get the whole file like so:

    pritesh@pavilion:~/perlscripts$ perl test.pl
    first:
    this:that
    here:there
    when:what
    how:where
    now:later

    second:
    this:that
    here:there
    when:what
    how:where
    now:later

So at least I know it's reading the records right. I will be thankful if you could help me with this one.

Replies are listed 'Best First'.
Re: Applying regex to each line in a record.
by haukex (Bishop) on Oct 24, 2020 at 22:26 UTC
    > Even after multiple attempts, I am at a total loss of how the "m" and "s" works for regex.
    • /m changes the meaning of ^ and $:
      • Without /m,
        • ^ matches only at the very beginning of the string. (This is the same as \A, except that \A is not affected by /m.)
        • $ matches at the very end of the string, but if the string ends with \n, it will match just before and just after this \n. (This is the same as \Z, except that \Z is not affected by /m.)
      • With /m,
        • ^ matches at the very beginning of the string, and just after any \n, except if the \n is the last character in the string. In other words, it matches at the beginning of each line within the string.
        • $ matches just before each \n, in other words before the end of every line within the string, and at the very end of the string.
    • /s changes the meaning of .:
      • Without /s, . matches anything except the newline, i.e. [^\n]. In other words, a regex of /.+/g is limited to matching one line within the string at a time.
      • With /s, . matches absolutely any character, including \n.
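    The rules in the list above can be checked by counting matches. This is only a minimal sketch: the sample string and the %count hash are illustrative choices, not part of the original post.

```perl
use strict;
use warnings;

# Sample string: three lines, the last one ending in a newline.
my $str = "a\nb\nc\n";

# "() = ..." forces list context, so scalar() yields the number of matches.
my %count = (
    '/^/g'  => scalar( () = $str =~ /^/g  ),  # 1: start of the string only
    '/^/gm' => scalar( () = $str =~ /^/gm ),  # 3: start of every line
    '/$/g'  => scalar( () = $str =~ /$/g  ),  # 2: before the final \n, and at the very end
    '/$/gm' => scalar( () = $str =~ /$/gm ),  # 4: before every \n, and at the very end
    '/./g'  => scalar( () = $str =~ /./g  ),  # 3: every character except \n
    '/./gs' => scalar( () = $str =~ /./gs ),  # 6: every character, \n included
);

printf "%-6s matches %d time(s)\n", $_, $count{$_}
    for '/^/g', '/^/gm', '/$/g', '/$/gm', '/./g', '/./gs';
```

    Note that /m changes only the counts for ^ and $, while /s changes only the count for the dot.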

    Note that /m and /s are completely independent of one another. Keep in mind that ^ and $ are zero-width matches - for example, this means that with $_ = "a\nb", a regex of /$/gm will match and leave the regex engine's position just before the \n, and a following regex of /./gs would then match that \n. Here is some code to play around with (try changing the lists of strings and regexes). As you can see, /m really only becomes important if there are multiple \n's in the string. And of course there's the WebPerl Regex Tester that visualizes this as well (modern browser required).

    use warnings;
    use strict;
    use open qw/:std :utf8/;
    use Term::ANSIColor qw/colored/;

    for my $str ( "a","a\n","a\nb","a\n\nb","a\nb\nc\n","a\nb\nc\nd") {
        for my $regex ( '/^/g','/^/gm','/$/g','/$/gm','/./g','/./gs' ) {
            my $o = join( '', map { sprintf "%2s", chr(
                    $_<0x21 ? 0x2400+$_ : $_==0x7F ? 0x2421 : $_ ) }
                map ord, split //, $str )." ";
            my @matches;
            eval qq{ push \@matches, [[\@-],[\@+]]
                while \$str=~$regex ;1} or die $@;
            my ($matchcnt,%matches) = (1);
            for my $match (@matches) {
                my @pos = $match->[0][0]==$match->[1][0]
                    ? ( $match->[0][0] * 2 )
                    : map { $_*2+1 } $match->[0][0]..$match->[1][0]-1;
                for my $p (@pos) {
                    die "overlapping matches not supported"
                        if exists $matches{$p};
                    $matches{$p} = $matchcnt;
                }
            }
            continue { $matchcnt++ }
            substr($o, $_, 1) = colored(['underline'], substr($o, $_, 1))
                #"<u>".substr($o, $_, 1)."</u>" # alternative for HTML
                for sort { $b<=>$a } keys %matches;
            printf "%6s: %s\n", $regex, $o;
        }
    }

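    Output (each match is underlined in the terminal; shown here with the <u>…</u> HTML alternative from the code comment):

      /^/g: <u> </u>a 
     /^/gm: <u> </u>a 
      /$/g:  a<u> </u>
     /$/gm:  a<u> </u>
      /./g:  <u>a</u> 
     /./gs:  <u>a</u> 
      /^/g: <u> </u>a ␊ 
     /^/gm: <u> </u>a ␊ 
      /$/g:  a<u> </u>␊<u> </u>
     /$/gm:  a<u> </u>␊<u> </u>
      /./g:  <u>a</u> ␊ 
     /./gs:  <u>a</u> <u>␊</u> 
      /^/g: <u> </u>a ␊ b 
     /^/gm: <u> </u>a ␊<u> </u>b 
      /$/g:  a ␊ b<u> </u>
     /$/gm:  a<u> </u>␊ b<u> </u>
      /./g:  <u>a</u> ␊ <u>b</u> 
     /./gs:  <u>a</u> <u>␊</u> <u>b</u> 
      /^/g: <u> </u>a ␊ ␊ b 
     /^/gm: <u> </u>a ␊<u> </u>␊<u> </u>b 
      /$/g:  a ␊ ␊ b<u> </u>
     /$/gm:  a<u> </u>␊<u> </u>␊ b<u> </u>
      /./g:  <u>a</u> ␊ ␊ <u>b</u> 
     /./gs:  <u>a</u> <u>␊</u> <u>␊</u> <u>b</u> 
      /^/g: <u> </u>a ␊ b ␊ c ␊ 
     /^/gm: <u> </u>a ␊<u> </u>b ␊<u> </u>c ␊ 
      /$/g:  a ␊ b ␊ c<u> </u>␊<u> </u>
     /$/gm:  a<u> </u>␊ b<u> </u>␊ c<u> </u>␊<u> </u>
      /./g:  <u>a</u> ␊ <u>b</u> ␊ <u>c</u> ␊ 
     /./gs:  <u>a</u> <u>␊</u> <u>b</u> <u>␊</u> <u>c</u> <u>␊</u> 
      /^/g: <u> </u>a ␊ b ␊ c ␊ d 
     /^/gm: <u> </u>a ␊<u> </u>b ␊<u> </u>c ␊<u> </u>d 
      /$/g:  a ␊ b ␊ c ␊ d<u> </u>
     /$/gm:  a<u> </u>␊ b<u> </u>␊ c<u> </u>␊ d<u> </u>
      /./g:  <u>a</u> ␊ <u>b</u> ␊ <u>c</u> ␊ <u>d</u> 
     /./gs:  <u>a</u> <u>␊</u> <u>b</u> <u>␊</u> <u>c</u> <u>␊</u> <u>d</u> 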
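
    The zero-width behaviour described earlier in this reply (/$/gm stopping just before the \n, and a following /./gs then consuming that \n) can also be checked in isolation. A minimal sketch; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $str = "a\nb";

# In scalar context, a /g match finds one match and records where it
# stopped in pos($str), so the next /g match continues from there.
my $found_end     = $str =~ /$/gm;   # zero-width match just before the \n
my $pos_after_end = pos($str);       # 1: nothing was consumed

my $found_char     = $str =~ /(.)/gs;  # continues at position 1
my $char           = $1;               # the newline itself
my $pos_after_char = pos($str);        # 2: one character was consumed

print 'pos after /$/gm: ', $pos_after_end, "\n";
print 'pos after /./gs: ', $pos_after_char, "\n";
print 'matched a newline: ', ($char eq "\n" ? 'yes' : 'no'), "\n";
```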
    

      Hi Haukex,

      I'm truly at a loss for words. While the code you've written here is truly advanced for me, the output is teaching me a lot.

Re: Applying regex to each line in a record.
by tybalt89 (Prior) on Oct 24, 2020 at 19:26 UTC

    If I understand your problem correctly, this is how I'd do it (without all that mucking around with /s and /m and paragraphs)

    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11123126
    use warnings;

    my %hash;
    my $current;
    while( <DATA> ) {
        if( /(\w+):\n/ ) {
            $hash{$1} = $current = {};
        }
        elsif( /(\w+):(\w+)/ ) {
            $current->{$1} = $2;
        }
    }

    use Data::Dump 'dd'; dd \%hash;

    __DATA__
    first:
    this:that
    here:there
    when:what
    how:where
    now:later
    name:onlyfirst

    second:
    this:that
    here:there
    when:what
    how:where
    now:later
    name:onlysecond

    Outputs:

    {
      first => {
        here => "there",
        how => "where",
        name => "onlyfirst",
        now => "later",
        this => "that",
        when => "what",
      },
      second => {
        here => "there",
        how => "where",
        name => "onlysecond",
        now => "later",
        this => "that",
        when => "what",
      },
    }

    Is that the hash-of-hashes you are looking for?

Re: Applying regex to each line in a record.
by choroba (Archbishop) on Oct 24, 2020 at 19:30 UTC
    You need to remember the current header (or the current section).
    #!/usr/bin/perl
    use warnings;
    use strict;

    my %hash;
    my $header;
    while (<>) {
        if (my ($h) = /^(.*):$/) {
            $header = $h;
        } elsif (my ($k, $v) = /^(.*):(.*)$/) {
            $hash{$header}{$k} = $v;
        }
    }

    use Data::Dumper; print Dumper \%hash;

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Applying regex to each line in a record.
by haukex (Bishop) on Oct 25, 2020 at 10:45 UTC

    To comment on your code specifically:

    local $/ = "";
    while (my @records = <$fh>) {
        foreach my $line (@records) {

    You're activating paragraph mode with $/ = "", meaning that for your sample data, each call to <$fh> in scalar context (e.g. my $para = <$fh>) will return one paragraph (e.g. "first:\nthis:that\nhere:there..."). However, in your while, you're calling <$fh> in list context (because you're assigning to an array), which will cause it to return all records, i.e. all paragraphs. This means that the second time the while tries to execute, it won't get anything from $fh, so the while loop will only execute once, and that makes the while loop kind of useless in this code. You can see this yourself by adding a print Dumper(\@records); at the top of the while (I'd also strongly recommend setting $Data::Dumper::Useqq=1;).
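
    The difference between the two contexts can be seen without a file on disk. The following is only a sketch under assumptions: the two-paragraph sample in $data is made up, and an in-memory filehandle stands in for new_testfile.txt.

```perl
use strict;
use warnings;

# Two paragraphs separated by a blank line (hypothetical sample data).
my $data = "first:\nthis:that\n\nsecond:\nhere:there\n";

local $/ = "";   # paragraph mode

# List context: a single read returns every paragraph at once ...
open my $fh, '<', \$data or die "Can't open in-memory file: $!";
my @all   = <$fh>;
my $after = <$fh>;   # ... so nothing is left for a second read
close $fh;

# Scalar context: each read returns exactly one paragraph.
open my $fh2, '<', \$data or die "Can't open in-memory file: $!";
my $reads = 0;
while (my $paragraph = <$fh2>) {
    $reads++;
}
close $fh2;

printf "list context returned %d paragraphs in one read\n", scalar @all;
printf "the next read was %s\n", defined $after ? 'defined' : 'undef';
printf "scalar context needed %d reads\n", $reads;
```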

    Next, based on your variable naming and code, I guess that what you are expecting is that foreach my $line (@records) will loop over the lines in each paragraph. However, Perl doesn't do this automatically - with this code, you'd have to split each element of @records manually. What you're doing instead is looping over the paragraphs. Here is the code I think you were trying to write:

    local $/ = "";
    while (my $paragraph = <$fh>) {
        print Dumper($paragraph);
        foreach my $line (split /\n+/, $paragraph) {
            print Dumper($line);
            next if $line =~ /^[a-z]+:$/m;
            print "<$line>\n";
        }
    }

    As you can see, the problem actually occurs before your code even gets to the regex.

    The above approach is ok, as long as the paragraphs don't get too large to fit comfortably into RAM. Otherwise, you'd have to choose a more efficient approach like reading the file line-by-line and recognizing paragraphs with a state machine type approach. The other monks have shown you several examples of different approaches.

      Hi Haukex,

      Amazing... Yes, indeed, I was thinking along the same lines you described. Thank you so very much.

Re: Applying regex to each line in a record.
by tybalt89 (Prior) on Oct 24, 2020 at 21:20 UTC

    I tried paragraph mode, still didn't need /s and /m. :)

    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11123126
    use warnings;

    local $/ = ''; # paragraph at a time
    my %hash = map { $1 x /(\w+):\n/, { /(\w+):(\w+)/g } } <DATA>;

    use Data::Dump 'dd'; dd \%hash;

    __DATA__
    first:
    this:that
    here:there
    when:what
    how:where
    now:later
    name:onlyfirst

    second:
    this:that
    here:there
    when:what
    how:where
    now:later
    name:onlysecond

Re: Applying regex to each line in a record.
by AnomalousMonk (Bishop) on Oct 24, 2020 at 20:51 UTC

    This approach is very generalized (some might say over-generalized :) and more verbose, but has the advantage that it is very flexible and can be highly specialized. E.g., patterns for top level key and lower level key/values can be individually specified. The script does a fair amount of data validation. I'm using a slightly different example dataset for testing.

    Script extract_HoH_1.pl:

    Example dataset 11123126_1.dat:

    Example run:

    Win8 Strawberry 5.8.9.5 (32) Sat 10/24/2020 15:46:02
    C:\@Work\Perl\monks\pritesh_ugrankar
    >perl extract_HoH_1.pl
    $VAR1 = {
              'first' => {
                           'when' => 'what',
                           'here' => 'there',
                           'first' => 'firstAsKey',
                           'now' => 'later',
                           'how' => 'where',
                           'firstAsValue' => 'first',
                           'this' => 'that'
                         },
              'second' => {
                            'now2' => 'later2',
                            'here2' => 'there2',
                            'how2' => 'where2',
                            'when2' => 'what2',
                            'secondAsValue' => 'second',
                            'this2' => 'that2',
                            'second' => 'secondAsKey'
                          }
            };


    Give a man a fish:  <%-{-{-{-<

Re: Applying regex to each line in a record.
by stevieb (Canon) on Oct 24, 2020 at 20:56 UTC

    This can also be accomplished without a regex at all, thanks to the split() function.

    use strict;
    use warnings;

    use Data::Dumper;

    my %hash;
    my $header;

    while (my $line = <DATA>) {
        chomp $line;

        my ($k, $v) = split ':', $line;

        next if ! $k;

        if (! $v) {
            $header = $k;
            next;
        }

        $hash{$header}{$k} = $v;
    }

    print Dumper \%hash;

    __DATA__
    first:
    this:that
    here:there
    when:what
    how:where
    now:later

    second:
    this:that
    here:there
    when:what
    how:where
    now:later

    Output:

    $VAR1 = {
              'second' => {
                            'now' => 'later',
                            'how' => 'where',
                            'here' => 'there',
                            'this' => 'that',
                            'when' => 'what'
                          },
              'first' => {
                           'when' => 'what',
                           'how' => 'where',
                           'here' => 'there',
                           'this' => 'that',
                           'now' => 'later'
                         }
            };

      > without a regex at all, thanks to the split() function.

      Note that the first argument to split, even if you write it as ':', is a regex. Only a space is special.
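
      A short sketch of what that means in practice; the sample strings and variable names are illustrative only. A quoted '.' is still compiled as the regex /./, whereas a single-space string switches split to its special whitespace mode.

```perl
use strict;
use warnings;

# '.' is a regex here: every character is a separator, so all fields
# are empty, and split drops trailing empty fields.
my @dot_regex   = split '.',  'a.b.c';      # ()
my @dot_literal = split /\./, 'a.b.c';      # ('a', 'b', 'c')

# A single-space string is the one special case: leading whitespace is
# skipped and runs of whitespace count as one separator.
my @space_string = split ' ', "  a b  c ";  # ('a', 'b', 'c')
my @space_regex  = split / /, "  a b  c ";  # ('', '', 'a', 'b', '', 'c')

printf "%-14s %d field(s)\n", @$_ for
    [ "split '.'",   scalar @dot_regex ],
    [ 'split /\./',  scalar @dot_literal ],
    [ "split ' '",   scalar @space_string ],
    [ 'split / /',   scalar @space_regex ];
```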

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

        I was hoping that nobody would notice and point that out ;)

        But yes, choroba is absolutely correct in his assessment.

Re: Applying regex to each line in a record.
by pritesh_ugrankar (Monk) on Oct 24, 2020 at 20:41 UTC

    Hi Choroba and Tybalt,

    Thank you for the answer!! Truly amazing. It gets rid of the "m" vs "s" confusion by simply not using them!!

    I was breaking my head on this for quite a while, and you guys have given such an elegant answer in no time. Sometimes it makes me feel like giving up scripting, but it's too fun and productive. Besides, I still haven't been shooed away from here, so till then I'll stick around. :D

      > It gets rid of the "m" vs "s" confusion by simply not using them!!

      /pattern/ is equivalent to m/pattern/. | Nope, pritesh_ugrankar is referring to the /m and /s modifiers! But the following are still a good read. :) See perlre, perlretut (highly recommended!) and perlreref for regex info; also perlop for the m// and s/// (and tr///) operators.


      Give a man a fish:  <%-{-{-{-<

Re: Applying regex to each line in a record.
by perlfan (Vicar) on Oct 25, 2020 at 20:29 UTC
    It doesn't help that s/ and m/ imply different operation modes when prepended to the regex, "search and replace" and "match", respectively. And absence of anything in front implies m/.
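
    To keep the two meanings apart, here is a minimal sketch (the sample text and variable names are illustrative): the letter in front of the pattern selects the operator, while the letter after the closing delimiter is a modifier.

```perl
use strict;
use warnings;

my $text = "one\ntwo\n";

# Leading m (or nothing at all) selects the match operator.
my $bare   = $text =~  /^two$/  ? 1 : 0;  # 0: same as m/^two$/
my $match  = $text =~ m/^two$/  ? 1 : 0;  # 0: ^ only matches at the start of the string
my $with_m = $text =~ m/^two$/m ? 1 : 0;  # 1: trailing /m lets ^ and $ match at each line

# Leading s selects the substitution operator.
(my $no_s   = $text) =~ s/e.t/E-T/;   # unchanged: . does not match the \n
(my $with_s = $text) =~ s/e.t/E-T/s;  # "onE-Two\n": trailing /s lets . match the \n

print "bare=$bare match=$match with_m=$with_m\n";
print $no_s eq $text ? "without /s: unchanged\n" : "without /s: changed\n";
print "with /s: $with_s";
```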
