PerlMonks  

//s modifier

by kettle (Beadle)
on Mar 21, 2006 at 10:57 UTC

kettle has asked for the wisdom of the Perl Monks concerning the following question:

How exactly does the //s modifier work? Can I use this modifier to help me find occurrences of a word in a multiline context, e.g., to return positive if the word 'RESULTADOS' is found between certain SGML tags in the following manner:

<TITLE>
RESULTADOS Y CLASIFICACIONES DE LA NBA
</TITLE>

I know I could chomp all the line returns and newlines and concatenate all the lines of text in the document, then search through it; but this strikes me as an AWFUL method. I could also use a series of boolean switches to make sure that the word I'm seeking appears in the correct context, but again this seems like terrible overkill.

I'm guessing that what I want to do can be rather easily and efficiently done with some clever modifier - perhaps the //s - but I'm not sure.

Any help would be greatly appreciated! --joe

Replies are listed 'Best First'.
Re: //s modifier
by davorg (Chancellor) on Mar 21, 2006 at 11:30 UTC

    The effect of the /s modifier is to change . so that it also matches a newline character (which it doesn't do by default).

    The effect of the /m modifier is to change ^ and $ so they match the start and end of a line (rather than the start and end of the string).

    So /s changes a single metacharacter and /m changes multiple metacharacters. That's how I remember it.

    And, yes, I think that /s will solve your problem.
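A minimal sketch of the two modifiers described above, run against the three-line title from the question. The variable names are made up for illustration; only the behaviour of `.` and `^` is the point.

```perl
use strict;
use warnings;

my $text = "<TITLE>\nRESULTADOS Y CLASIFICACIONES DE LA NBA\n</TITLE>\n";

# Without /s, "." stops at each newline, so the pattern cannot span lines.
my $dot_default = $text =~ m{<TITLE>.*RESULTADOS.*</TITLE>}  ? 1 : 0;

# With /s, "." also matches "\n", so the same pattern spans all three lines.
my $dot_s       = $text =~ m{<TITLE>.*RESULTADOS.*</TITLE>}s ? 1 : 0;

# Without /m, "^" anchors only at the very start of the string.
my $caret_default = $text =~ /^RESULTADOS/  ? 1 : 0;

# With /m, "^" also anchors right after each embedded newline.
my $caret_m       = $text =~ /^RESULTADOS/m ? 1 : 0;

print "dot:   default=$dot_default /s=$dot_s\n";
print "caret: default=$caret_default /m=$caret_m\n";
```

Neither modifier changes what the other one affects: /s leaves `^` and `$` alone, and /m leaves `.` alone.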

    --
    <http://dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

Re: //s modifier
by tirwhan (Abbot) on Mar 21, 2006 at 11:43 UTC

    You don't necessarily need the s-modifier here, only if you want to use the dot to match newline characters:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $string = << "STRING_END";
    <TITLE>
    RESULTADOS Y CLASIFICACIONES DE LA NBA
    </TITLE>
    STRING_END

    print "matched without modifier\n"
        if ($string =~ m{<TITLE>[^<]*RESULTADOS[^<]*</TITLE>});
    print "matched with s modifier\n"
        if ($string =~ m{<TITLE>.*?RESULTADOS.*?</TITLE>}s);

    Note that both these solutions are imperfect: the first will not work for nested tags, and the second will match if the keyword appears anywhere between a <TITLE> and a later </TITLE>, even if it's outside a title, e.g. <TITLE>something</TITLE>RESULTADOS<TITLE>else</TITLE> will match. This is why regexes are usually a bad solution for this kind of problem; it would be better to parse the SGML and check the contents of TITLE nodes directly.
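The two failure modes can be reproduced with a short sketch. The `$nested` input (with a `<B>` element inside the title) is a made-up example; the `$outside` input is the one given above.

```perl
use strict;
use warnings;

my $negated = qr{<TITLE>[^<]*RESULTADOS[^<]*</TITLE>};
my $lazy    = qr{<TITLE>.*?RESULTADOS.*?</TITLE>}s;

# Hypothetical input: the keyword sits inside a nested <B> element.
my $nested  = "<TITLE>\n<B>RESULTADOS</B> DE LA NBA\n</TITLE>\n";
# Input from the post: the keyword sits between two TITLE elements.
my $outside = "<TITLE>something</TITLE>RESULTADOS<TITLE>else</TITLE>";

# False negative: [^<]* cannot step over the "<" of the nested tag.
my $negated_on_nested  = $nested  =~ $negated ? 1 : 0;
my $lazy_on_nested     = $nested  =~ $lazy    ? 1 : 0;

# False positive: .*? keeps expanding past the first </TITLE>.
my $negated_on_outside = $outside =~ $negated ? 1 : 0;
my $lazy_on_outside    = $outside =~ $lazy    ? 1 : 0;

print "nested:  negated=$negated_on_nested lazy=$lazy_on_nested\n";
print "outside: negated=$negated_on_outside lazy=$lazy_on_outside\n";
```

The non-greedy `.*?` only prefers the shortest match; it does not stop the regex engine from crossing a closing tag when that is the only way to succeed.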


    All dogma is stupid.
Re: //s modifier
by jonadab (Parson) on Mar 21, 2006 at 13:07 UTC

    In the example you give, a regular expression will probably do what you want, because it is very unlikely that a document will contain two TITLE elements. However, in a slightly different example, e.g., if we were looking for certain text in a CAPTION element, then the regular expression that works for your example might fail, if the text in question occurs between two of the elements in question but not within either of them. It is possible to work around that with a much more complicated regular expression, but it's hairy, and it will still fail if the element in question can be nested within itself, either directly or indirectly. In such cases, you really need to use a module that parses the SGML and hands you a DOM. HTML::TreeBuilder and XML::Twig make this sort of thing easy for HTML and XML respectively, and there are various alternatives to them as well. I don't know as much about SGML modules, since I've never worked much with SGML (except for legacy versions of HTML that were SGML-based), but you might check the CPAN.

    Of course, if the example you gave is really all you want to do, then you may not need a parser, since the regex will probably be good enough.
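One possible form of the "more complicated regular expression" workaround is sketched below. It assumes the element is never nested within itself, and `element_contains` is a hypothetical helper name, not something from a module. A negative lookahead keeps each capture inside a single element, and the keyword is then looked for in the captured text only.

```perl
use strict;
use warnings;

# Return 1 if $word occurs inside any single <$tag>...</$tag> element.
# Hypothetical helper; assumes $tag is never nested within itself.
sub element_contains {
    my ($text, $tag, $word) = @_;

    # (?:(?!</?TAG\b).)* consumes one character at a time but refuses to
    # step onto an opening or closing TAG, so $1 never spans two elements.
    while ($text =~ m{<\Q$tag\E\b[^>]*>((?:(?!</?\Q$tag\E\b).)*)</\Q$tag\E>}gis) {
        return 1 if index($1, $word) >= 0;
    }
    return 0;
}

my $between = "<CAPTION>one</CAPTION>\nRESULTADOS\n<CAPTION>two</CAPTION>";
my $inside  = "<CAPTION>\nRESULTADOS DE LA NBA\n</CAPTION>";

print "between: ", element_contains($between, 'CAPTION', 'RESULTADOS'), "\n";
print "inside:  ", element_contains($inside,  'CAPTION', 'RESULTADOS'), "\n";
```

As noted above, this still breaks down once the element can contain itself, which is where a real parser becomes the safer choice.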


    Sanity? Oh, yeah, I've got all kinds of sanity. In fact, I've developed whole new kinds of sanity. Why, I've got so much sanity it's driving me crazy.
      The problem is actually considerably more complex than the example I gave. I decided I'll have to use an SGML parser, as you and the previous poster suggested. Thanks for the regex help and the SGML suggestions! --joe
Re: //s modifier
by Melly (Chaplain) on Mar 21, 2006 at 11:17 UTC

    AFAIK you will need either the /m or /s operator - otherwise your regex will only ever look at a single line.

    If you use /m, then you will still need to handle newlines (since . won't match a newline). If you use /s, then . will match newlines, so will lead to a shorter regex.

    (Not tested)

    /<title>.*resultados.*<\/title>/is

    is equivalent to:

    /<title>.*\n?.*resultados.*\n.*<\/title>/im
    Tom Melly, tom@tomandlu.co.uk
      AFAIK you will need either the /m or /s operator - otherwise your regex will only ever look at a single line.

      No, that's not true, see below one-liner.

      perl -e '$t="hello\nmoto";print "yep\n" if $t=~m/hello\smoto/;'

      All the s modifier does is to change dot (.) to match newline characters and all the m modifier does is to make ^ and $ match at the beginning/end of each line instead of the whole string. See perldoc perlre.
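The single-line limit usually comes from how the file is read, not from the regex. The sketch below is an illustration under assumptions: it uses an in-memory filehandle in place of a real file and made-up variable names. It compares line-by-line reading with slurping the whole document in one read by localizing `$/`.

```perl
use strict;
use warnings;

my $document = "<TITLE>\nRESULTADOS Y CLASIFICACIONES DE LA NBA\n</TITLE>\n";
my $pattern  = qr{<TITLE>.*?RESULTADOS.*?</TITLE>}s;

# Line by line: each string holds a single line, so the pattern never matches.
open my $by_line, '<', \$document or die "Cannot open in-memory file: $!";
my $line_hits = grep { $_ =~ $pattern } <$by_line>;
close $by_line;

# Slurp: with $/ undefined, one read returns the entire file as one string.
open my $whole, '<', \$document or die "Cannot open in-memory file: $!";
my $content = do { local $/; <$whole> };
close $whole;
my $slurp_hit = $content =~ $pattern ? 1 : 0;

print "lines that matched: $line_hits\n";
print "slurped text matched: $slurp_hit\n";
```

Slurping keeps the newlines in place, so there is no need to chomp and concatenate lines by hand before searching.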


      All dogma is stupid.
      I don't think that the two regexes are equivalent. The first one matches <title>resultados\n\n\n</title>; the second doesn't.
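A quick check of that claim, as a sketch using the two patterns exactly as posted. The variable names are made up, and `$original` is the three-line title from the question.

```perl
use strict;
use warnings;

my $with_s = qr/<title>.*resultados.*<\/title>/is;
my $with_m = qr/<title>.*\n?.*resultados.*\n.*<\/title>/im;

# The original three-line title: both patterns accept it.
my $original = "<TITLE>\nRESULTADOS Y CLASIFICACIONES DE LA NBA\n</TITLE>\n";
my $s_on_original = $original =~ $with_s ? 1 : 0;
my $m_on_original = $original =~ $with_m ? 1 : 0;

# The counterexample: three newlines before the closing tag.
my $blank_lines = "<title>resultados\n\n\n</title>";
my $s_on_blank = $blank_lines =~ $with_s ? 1 : 0;
my $m_on_blank = $blank_lines =~ $with_m ? 1 : 0;

print "original:    /s=$s_on_original /m=$m_on_original\n";
print "blank lines: /s=$s_on_blank /m=$m_on_blank\n";
```

The /m version hard-codes how many newlines may appear, while the /s version lets `.` absorb any number of them.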

        Sorry - should have said "similar to" ;)

        Anyway, my real bad was saying he'd need one of the modifiers to ever hope of matching a multi-line regex... what was I thinking? (Well, I know what I was thinking - got in a muddle over iterating, line-by-line, through a file, etc.)

        Tom Melly, tom@tomandlu.co.uk

Node Type: perlquestion [id://538158]
Approved by Corion