PerlMonks  

Re: Need help using regex to extract multiple matches

by GrandFather (Saint)
on Nov 26, 2019 at 06:22 UTC ( [id://11109229]=note )


in reply to Need help using regex to extract multiple matches

Something like:

use strict;
use warnings;

my $webStr = <<EOS;
data-src-hq="image location1"
<p>other stuff</p>
<p>data-src-hq="image location2"</p>
EOS

my @matches = $webStr =~ /data-src-hq="([^"]+)"/g;
print join "\n", @matches, '';

Prints:

image location1
image location2

Note the /g switch on the regex.
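
A minimal sketch of what /g changes in list context. The sample string and the variable names ($html, @first_only, @all) are made up for illustration and are not from the original question:

```perl
use strict;
use warnings;

# Hypothetical sample input with two data-src-hq attributes.
my $html = qq{data-src-hq="a.jpg" <p>x</p> data-src-hq="b.jpg"};

# Without /g, a list-context match returns the captures of the first match only.
my @first_only = $html =~ /data-src-hq="([^"]+)"/;

# With /g, a list-context match returns the captures of every match.
my @all = $html =~ /data-src-hq="([^"]+)"/g;

print scalar(@first_only), " without /g: @first_only\n";
print scalar(@all), " with /g: @all\n";
```

Without /g the match stops after the first hit, so only one capture comes back.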

Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond

Re^2: Need help using regex to extract multiple matches
by SergioQ (Beadle) on Dec 02, 2019 at 03:36 UTC
    Thank you for the solution, may I ask a little more? I've read up on regex, but sometimes I think I have a mild reading disorder. Could you break down for me how the solution worked:

    Here's how I believe I read your solution:

    img src=" is the first part to match

    [^"] match everything but a quote

    +" stop when you hit a quote

    () return only what matches within the brackets

    Am also curious what's the difference between +" and +?" since both seem to work

    Thank you again

    SergioQ

      I'm assuming you're referring to the regex in GrandFather's reply:
          /data-src-hq="([^"]+)"/g

      First, let me draw your attention to YAPE::Regex::Explain, which can explain regexes that do not have regex operators or features added after Perl version 5.6:

      c:\@Work\Perl\monks>perl
      use strict;
      use warnings;

      use YAPE::Regex::Explain;

      print YAPE::Regex::Explain->new(qr/data-src-hq="([^"]+)"/)->explain;
      __END__
      The regular expression:

      (?-imsx:data-src-hq="([^"]+)")

      matches as follows:

      NODE                     EXPLANATION
      ----------------------------------------------------------------------
      (?-imsx:                 group, but do not capture (case-sensitive)
                               (with ^ and $ matching normally) (with . not
                               matching \n) (matching whitespace and #
                               normally):
      ----------------------------------------------------------------------
        data-src-hq="            'data-src-hq="'
      ----------------------------------------------------------------------
        (                        group and capture to \1:
      ----------------------------------------------------------------------
          [^"]+                    any character except: '"' (1 or more
                                   times (matching the most amount
                                   possible))
      ----------------------------------------------------------------------
        )                        end of \1
      ----------------------------------------------------------------------
        "                        '"'
      ----------------------------------------------------------------------
      )                        end of grouping
      ----------------------------------------------------------------------

      There are also on-line regex explainers.

      Now let me address your narration.

      img src=" is the first part to match
      Ok.
      [^"] match everything but a quote
      I would word this as "match a single character from the class of all characters except a " (double-quote)". It's important to realize that the [...] regex operator defines a character class or set (see Character Classes and other Special Escapes in perlre and also this topic in perlretut, perlrequick and perlrecharclass), and that all by itself, any [...] matches only a single character.
      +" stop when you hit a quote
      I would quarrel with this description. The + quantifier (see Quantifiers in perlre; see also the topic of quantifiers in perlretut and perlrequick) is associated with the expression before it, i.e., [^"]+, and I would read it as "match one or more characters from the class/set of all characters except a double-quote". Again, the double-quote is not directly associated with the + quantifier in your +", but see below because the two can be related.
      () return only what matches within the brackets
      Ok.
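
      To make the point about [...] and + concrete, here is a small sketch. The sample string and the variable names ($s, $one, $run) are hypothetical, chosen only for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample string.
my $s = 'src="abc" rest';

# A bare character class consumes exactly one character.
my ($one) = $s =~ /"([^"])/;

# The + quantifier applies to the class before it:
# one or more non-quote characters.
my ($run) = $s =~ /"([^"]+)/;

print "class alone:  '$one'\n";
print "class with +: '$run'\n";
```

      The class on its own captures a single character; adding + extends the capture up to, but not including, the next double-quote.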

      Am also curious what's the difference between +" and +?" since both seem to work

      Again, note that the + or +? quantifiers affect the preceding [^"] character class, not the double-quote that follows. In the /data-src-hq="([^"]+)"/g match regex, the final " (double-quote) is not absolutely needed because [^"]+ will match as much as possible until it either hits a " or the end of the string. (I would still tend to use it because I like the feeling of security that well-defined boundaries give me. Also, a final " in the match will prevent a match with a "runaway" quote in a string in which the closing " is missing.) However, if you use a [^"]+? "lazy" or "non-greedy" expression instead, the final " becomes vital to matching the entire contents of the double-quoted substring. Try this:

      c:\@Work\Perl\monks>perl
      use strict;
      use warnings;

      my $s = 'foo "xyzzy" bar';

      print qq{+? (lazy) quantifier with final ": matched '$1' \n} if $s =~ /"([^"]+?)"/;
      print qq{+? (lazy) quantifier without final ": matched '$1' \n} if $s =~ /"([^"]+?)/;
      __END__
      +? (lazy) quantifier with final ": matched 'xyzzy'
      +? (lazy) quantifier without final ": matched 'x'

      A lazy quantifier matches the minimum necessary for an overall match. A final " in the regex is necessary in this case to capture the entire quoted substring. Take a look at this and be sure you understand what's going on, i.e., the difference between lazy and greedy matching.
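
      As a further sketch (the strings and variable names below are made up for illustration), the greedy/lazy difference shows up most clearly when the quantified item can itself match a double-quote, as . can. The last two lines illustrate the "runaway" quote case mentioned above:

```perl
use strict;
use warnings;

# Hypothetical string with two quoted substrings.
my $s = 'say "one" and "two"';

# Greedy: .+ grabs as much as it can, so it runs to the last quote.
my ($greedy) = $s =~ /"(.+)"/;

# Lazy: .+? grabs as little as it can, so it stops at the first closing quote.
my ($lazy) = $s =~ /"(.+?)"/;

# [^"] cannot cross a quote, so greedy and lazy agree
# as long as the closing quote is part of the pattern.
my ($class_greedy) = $s =~ /"([^"]+)"/;
my ($class_lazy)   = $s =~ /"([^"]+?)"/;

print "greedy .+     : '$greedy'\n";
print "lazy   .+?    : '$lazy'\n";
print "greedy [^\"]+  : '$class_greedy'\n";
print "lazy   [^\"]+? : '$class_lazy'\n";

# Hypothetical "runaway" case: the closing quote is missing.
my $broken = 'alt="no closing quote';
my ($loose)    = $broken =~ /"([^"]+)/;     # no final quote in the pattern
my ($anchored) = $broken =~ /"([^"]+)"/;    # final quote required

print "without final quote: '", $loose    // 'no match', "'\n";
print "with final quote   : '", $anchored // 'no match', "'\n";
```

      With . the greedy and lazy forms capture different text; with [^"] plus a closing quote they capture the same text, which is why both seemed to work in the original regex.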


      Give a man a fish:  <%-{-{-{-<
