kettle has asked for the wisdom of the Perl Monks concerning the following question:
How exactly does the //s modifier work? Can I use it to find occurrences of a word in a multiline context, e.g., to return true if the word 'RESULTADOS' is found between certain SGML tags, as in the following:
<TITLE>
RESULTADOS Y CLASIFICACIONES DE LA NBA
</TITLE>
I know I could chomp all the line returns and newlines and concatenate all the lines of text in the document, then search through it; but this strikes me as an AWFUL method. I could also use a series of boolean switches to make sure that the word I'm seeking appears in the correct context, but again this seems like terrible overkill.
I'm guessing that what I want to do can be rather easily and efficiently done with some clever modifier - perhaps the //s - but I'm not sure.
Any help would be greatly appreciated! --joe
Re: //s modifier
by davorg (Chancellor) on Mar 21, 2006 at 11:30 UTC
The effect of the /s modifier is to change . so that it also matches a newline character (which it doesn't do by default).
The effect of the /m modifier is to change ^ and $ so they match the start and end of a line (rather than the start and end of the string).
So /s changes a single metacharacter and /m changes multiple metacharacters. That's how I remember it.
And, yes, I think that /s will solve your problem.
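A minimal sketch of the difference between the two modifiers, assuming a made-up three-line string (the variable names are purely illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: the keyword sits on its own line inside TITLE.
my $text = "<TITLE>\nRESULTADOS\n</TITLE>\n";

# Without /s, "." stops at a newline, so the pattern cannot span lines.
my $plain  = $text =~ /<TITLE>.*RESULTADOS/  ? 1 : 0;
# With /s, "." also matches "\n".
my $dotall = $text =~ /<TITLE>.*RESULTADOS/s ? 1 : 0;
# Without /m, "^" matches only at the very start of the string.
my $anchor = $text =~ /^RESULTADOS$/         ? 1 : 0;
# With /m, "^" and "$" match at the start and end of every line.
my $multi  = $text =~ /^RESULTADOS$/m        ? 1 : 0;

print "plain=$plain dotall=$dotall anchor=$anchor multi=$multi\n";
# prints: plain=0 dotall=1 anchor=0 multi=1
```

The two modifiers are independent, so they can be combined (//sm) when a pattern needs both behaviours.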
--
<http://dave.org.uk>
"The first rule of Perl club is you do not talk about
Perl club." -- Chip Salzenberg
Re: //s modifier
by tirwhan (Abbot) on Mar 21, 2006 at 11:43 UTC
You don't necessarily need the /s modifier here; you need it only if you want the dot to match newline characters:
#!/usr/bin/perl
use strict;
use warnings;
my $string = << "STRING_END";
<TITLE>
RESULTADOS Y CLASIFICACIONES DE LA NBA
</TITLE>
STRING_END
print "matched without modifier\n"
    if ($string =~ m{<TITLE>[^<]*RESULTADOS[^<]*</TITLE>});
print "matched with s modifier\n"
    if ($string =~ m{<TITLE>.*?RESULTADOS.*?</TITLE>}s);
Note that both of these solutions are imperfect. The first will not work if another tag is nested inside the TITLE element, and the second will match whenever the keyword appears anywhere between a <TITLE> and a later </TITLE>, even if it's outside a title; e.g., <TITLE>something</TITLE>RESULTADOS<TITLE>else</TITLE> will match. This is why regexes are usually a bad solution for this kind of problem: it would be better to parse the SGML and check the contents of TITLE nodes directly.
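Both failure modes can be reproduced with a short sketch. The two test strings are made up for illustration, and the <B> tag is only a stand-in for any nested markup:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Keyword sits between two TITLE elements, not inside either one.
my $outside = "<TITLE>something</TITLE>RESULTADOS<TITLE>else</TITLE>";
# Keyword is inside TITLE, but wrapped in a nested (hypothetical) <B> tag.
my $nested  = "<TITLE>\n<B>RESULTADOS</B> DE LA NBA\n</TITLE>";

my $re_class = qr{<TITLE>[^<]*RESULTADOS[^<]*</TITLE>};
my $re_dot   = qr{<TITLE>.*?RESULTADOS.*?</TITLE>}s;

my %result;
for my $case ([outside => $outside], [nested => $nested]) {
    my ($name, $str) = @$case;
    $result{$name}{class} = $str =~ $re_class ? 1 : 0;
    $result{$name}{dot}   = $str =~ $re_dot   ? 1 : 0;
    printf "%-8s class=%d dot=%d\n",
        $name, $result{$name}{class}, $result{$name}{dot};
}
```

The dot version reports a match for the "outside" string (a false positive), while the character-class version misses the "nested" string (a false negative).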
Re: //s modifier
by jonadab (Parson) on Mar 21, 2006 at 13:07 UTC
In the example you give, a regular expression will probably do what you
want, because it is very unlikely that a document will contain two
TITLE elements. However, in a slightly different example, e.g., if
we were looking for certain text in a CAPTION element, then the regular
expression that works for your example might fail, if the text in
question occurs between two of the elements in question but not within
either of them. It is possible to work around that with a much more
complicated regular expression, but it's hairy, and it will still fail
if the element in question can be nested within itself, either directly
or indirectly. In such cases, you really need to use a module that
parses the SGML and hands you a DOM. HTML::TreeBuilder and XML::Twig
make this sort of thing easy for HTML and XML respectively, and there
are various alternatives to them as well. I don't know as much about
SGML modules, since I've never worked much with SGML (except for legacy
versions of HTML that were SGML-based), but you might check the
CPAN.
Of course, if the example you gave is really all you want to do, then
you may not need a parser, since the regex will probably be good enough.
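As one possible shape for the "more complicated regular expression" mentioned above, here is a sketch that uses a negative lookahead so the match cannot cross a CAPTION tag. The test strings are invented, and as noted it still breaks down if CAPTION can be nested within itself:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $re = qr{
    <CAPTION>
    (?: (?! </?CAPTION> ) . )*?   # any run that does not cross a CAPTION tag
    RESULTADOS
    (?: (?! </?CAPTION> ) . )*?
    </CAPTION>
}sx;

# Keyword between two CAPTION elements: should not match.
my $between = "<CAPTION>one</CAPTION>RESULTADOS<CAPTION>two</CAPTION>";
# Keyword inside a CAPTION that also holds a (hypothetical) <B> tag.
my $within  = "<CAPTION>\nRESULTADOS <B>NBA</B>\n</CAPTION>";

my $between_hit = $between =~ $re ? 1 : 0;
my $within_hit  = $within  =~ $re ? 1 : 0;
print "between=$between_hit within=$within_hit\n";
# prints: between=0 within=1
```

This is only a workaround for flat markup; for anything more involved, a real parser remains the safer route.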
Sanity? Oh, yeah, I've got all kinds of sanity. In fact, I've developed whole new kinds of sanity. Why, I've got so much sanity it's driving me crazy.
|
The problem is actually considerably more complex than the example I gave. I've decided I'll have to use an SGML parser, as you and the previous poster suggested. Thanks for the regex help and the SGML suggestions!
joe
Re: //s modifier
by Melly (Chaplain) on Mar 21, 2006 at 11:17 UTC
AFAIK you will need either the /m or /s operator - otherwise your regex will only ever look at a single line.
If you use /m, then you will still need to handle newlines (since . won't match a newline). If you use /s, then . will match newlines, so will lead to a shorter regex.
(Not tested)
/<title>.*resultados.*<\/title>/is
is equivalent to:
/<title>.*\n?.*resultados.*\n.*<\/title>/im
Tom Melly, tom@tomandlu.co.uk
|
"AFAIK you will need either the /m or /s operator - otherwise your regex will only ever look at a single line."
No, that's not true; see the one-liner below.
perl -e '$t="hello\nmoto";print "yep\n" if $t=~m/hello\smoto/;'
All the s modifier does is to change dot (.) to match newline characters and all the m modifier does is to make ^ and $ match at the beginning/end of each line instead of the whole string. See perldoc perlre.
|
I don't think that the two regexes are equivalent. The first one matches <title>resultados\n\n\n</title>; the second doesn't.
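A quick check of that claim, using the two regexes exactly as posted and the string quoted above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $s = "<title>resultados\n\n\n</title>";

# /s version: "." crosses all three newlines.
my $with_s = $s =~ /<title>.*resultados.*<\/title>/is           ? 1 : 0;
# /m version: only one explicit \n is allowed after the keyword.
my $with_m = $s =~ /<title>.*\n?.*resultados.*\n.*<\/title>/im  ? 1 : 0;

print "with_s=$with_s with_m=$with_m\n";
# prints: with_s=1 with_m=0
```

The /m version fails because, without /s, the dot cannot absorb the extra newlines and the pattern spells out only a single \n after the keyword.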
|
Sorry - should have said "similar to" ;)
Anyway, my real bad was saying he'd need one of the modifiers to ever hope of matching a multi-line regex... what was I thinking? (Well, I know what I was thinking - I got in a muddle over iterating, line-by-line, through a file, etc.)
Tom Melly, tom@tomandlu.co.uk