http://qs321.pair.com?node_id=715529

lima1 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I can't find a working regex for a problem that doesn't sound very difficult: I want to match sentences (everything that is not just a long word) of a minimum length. Leading and trailing spaces should be ignored. My first idea was

m{\A \s* .{50,}? \s* \z}xms
but this also matches long words, and it includes the surrounding spaces in the match.
m{\A \s* \w .{48,}? \w \s* \z}xms
seems to fix my spaces problem, but still matches long words.

Thanks!

Re: RE question: Sentence with a minimum length
by salva (Canon) on Oct 06, 2008 at 09:02 UTC
    You can use a look-ahead assertion (see perlre) to ensure that there are at least two words in the sentence:
    /\s*(?=\w+\s+\w+)[\w\s]{49,}\w/
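A quick way to check the look-ahead version is to run it against a single long word, a long multi-word string, and a short one. This is only a sketch: the test strings and variable names are made up for illustration, and the inputs avoid punctuation because the character class [\w\s] does not cover it.

```perl
use strict;
use warnings;

# The pattern from the post above, unchanged.
my $re = qr/\s*(?=\w+\s+\w+)[\w\s]{49,}\w/;

# Illustrative inputs (assumptions, not from the original thread).
my $long_word = 'x' x 65;                  # one 65-character word
my $sentence  = join ' ', ('word') x 12;   # 12 words, 59 characters
my $short     = 'too short';               # two words, 9 characters

for my $case ([long_word => $long_word],
              [sentence  => $sentence],
              [short     => $short]) {
    my ($name, $text) = @$case;
    printf "%-9s %s\n", $name, ($text =~ $re ? 'match' : 'no match');
}
```

Only the second case should match: the long word fails the look-ahead because it contains no whitespace, and the short string cannot satisfy the {49,} quantifier.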
      lima1 wants minimal matches, so the {49,} should actually be {49,}?, in which case it stops working.

      The reason is that the look-ahead is not limited to what the [\w\s]{49,}? matches. A small demonstration:

      #!/usr/bin/perl
      use strict;
      use warnings;

      my $re = qr{^\s*(?=\w+\s+\w+)[\w\s]{49,}?\w};
      my $str = ('x' x 65) . ' x';
      if ($str =~ m/$re/) {
          print $&, $/;
      }
      __END__
      xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

      You can see that the match only includes a long word, not a sentence. There is a sentence present, but it's not matched.

      Update: In Perl 6 you could use & like this:

      regex short_word { \s* [ & \w+ \s+ \w+ & .**?{1..50} ] }
        Well, I am not a native English speaker, but as I understand it...
        I want to match sentences (everything that is not just a long word) of a minimum length
        ...doesn't mean that the length of the sentence has to be minimal, but that it has to be greater than or equal to a $minimum_length.

        Anyway, if a minimal match is what you want, it still can be done with a regexp (without embedded code!), though a complicated one:

        sub make_re {
            my $len = shift;
            $len > 3 or die "len <= 3";
            my $re = "\\b"
                . join('|',
                       map("\\w(?:\\s[\\s\\w]{$_,}?", reverse 1 .. ($len - 3)),
                       "\\w+\\s+")
                . (")" x ($len - 3))
                . "\\w+";
            warn "re: /$re/\n";
            return qr/($re)/;
        }

        my $re = make_re(5);
        while (<DATA>) {
            print "$1\n" if $_ =~ $re;
        }
        __DATA__
        foo foo foooooooo foooooooo fooo foo foo foo foo foo foo foooooooo foo foo foo f foo fo fo foo f fo foo f fo
      Ah, nice. Seems to work perfectly! Thank you very much...

      Update:

      ++moritz. But that's still ok for me, because it filters my problematic case (just a long word). That's all I need here.

Re: RE question: Sentence with a minimum length
by moritz (Cardinal) on Oct 06, 2008 at 08:52 UTC
    You could try something along these lines (untested):
    m/\A \s* (\w+\s+\w+?.*?) (??{ length($1) >= 50 }) \s* \z/xs;

    (Update: this regex might be very inefficient because \w+? and .*? can both match the same characters. Use (\w+\s+\w.*?) instead.)
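One caveat about the untested pattern: (??{ ... }) uses the block's return value as a pattern, so a true comparison would ask for a literal 1 in the input. A conditional with a (?{ ... }) test seems closer to the intent. The following is a minimal sketch under that assumption; the helper name match_sentence is hypothetical, the limit of 50 is hard-coded, and the code-block constructs are still described as experimental in perlre.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the trimmed sentence, or undef when the
# input is a single word or shorter than 50 characters after trimming.
sub match_sentence {
    my ($text) = @_;
    return $text =~ m/
        \A \s*
        ( \w+ \s+ \w .*? )                  # at least two words, extended lazily
        (?(?{ length($1) < 50 }) (?!) )     # force a failure while the capture is too short
        \s* \z
    /xs ? $1 : undef;
}
```

Because .*? grows one character at a time, the first successful attempt is the one where only whitespace remains after the capture, so the result has the leading and trailing spaces stripped.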

    In general that sounds like a problem for which regexes aren't the best solution (split and friends might be better).
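For comparison, a non-regex check along the lines of "split and friends" might look like the sketch below. The helper name is_long_sentence and the $min parameter are assumptions for illustration, and "sentence" is taken to mean at least two whitespace-separated words.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $text, once trimmed, is at least $min
# characters long and consists of two or more words.
sub is_long_sentence {
    my ($text, $min) = @_;

    # Strip leading and trailing whitespace from a copy of the input.
    (my $trimmed = $text) =~ s/\A\s+|\s+\z//g;

    # split ' ' splits on runs of whitespace and skips leading whitespace.
    my @words = split ' ', $trimmed;

    return @words >= 2 && length($trimmed) >= $min;
}
```

A call such as is_long_sentence($line, 50) would then replace the pattern match wherever a plain function is acceptable.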

      Thanks! Yeah, this was also the only way I saw, but I was scared by the experimental warning in perlre. I need a regex here because the API I use requires a regex.