Re: RE question: Sentence with a minimum length

Re^2: RE question: Sentence with a minimum length by moritz (Cardinal) on Oct 06, 2008 at 09:22 UTC
lima1 wants minimal matches, so the `{49,}` should actually be `{49,}?`, in which case it stops working. The reason is that the look-ahead is not limited to what the `[\w\s]{49,}?` matches. A small demonstration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $re  = qr{^\s(?=\w+\s+\w+)[\w\s]{49,}?\w};
# leading space, so that the anchored \s has something to match
my $str = ' ' . ('x' x 65) . ' x';
if ($str =~ m/$re/) {
    print $&, $/;
}
__END__
 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```

You see that the match includes only a long word, not a sentence. There is a sentence present, but it's not matched.

Update: In Perl 6 you could use `&` like this: `regex short_word { \s [ & \w+ \s+ \w+ & .**?{1..50} ] }`
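A minimal sketch of the same effect, assuming a test string that starts with whitespace so the anchored `\s` can match (the names `$greedy` and `$lazy` are only for illustration). It runs the pattern once with a greedy and once with a lazy quantifier against the same input. The look-ahead succeeds both times, because it is free to look past the point where the lazy match ends.

```perl
use strict;
use warnings;

# Assumption: the input starts with a space so that ^\s can match.
my $str = ' ' . ('x' x 65) . ' x';

# Same pattern, once with a greedy and once with a lazy quantifier.
my ($greedy) = $str =~ /^(\s(?=\w+\s+\w+)[\w\s]{49,}\w)/;
my ($lazy)   = $str =~ /^(\s(?=\w+\s+\w+)[\w\s]{49,}?\w)/;

for my $case ([greedy => $greedy], [lazy => $lazy]) {
    my ($name, $match) = @$case;
    # Ignore the leading space when looking for a word separator.
    my $has_gap = substr($match, 1) =~ /\s/ ? 'yes' : 'no';
    printf "%-6s length %2d, whitespace between words: %s\n",
        $name, length($match), $has_gap;
}
```

The greedy version covers the whole string, including the second word. The lazy version stops inside the first word, even though the look-ahead was satisfied.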
Re^3: RE question: Sentence with a minimum length by salva (Canon) on Oct 06, 2008 at 11:01 UTC
Well, I am not a native English speaker, but as I understand it, "I want to match sentences (everything that is not just a long word) of a minimum length" doesn't mean that the length of the sentence has to be minimal, but that it has to be equal to or greater than `$minimum_length`.

Anyway, if a minimal match is what you want, it can still be done with a regexp (without embedded code!), though a complicated one:

```perl
sub make_re {
    my $len = shift;
    $len > 3 or die "len <= 3";
    # nested alternation: each level consumes one more leading \w
    my $re = "\\b"
           . join('|',
                  map("\\w(?:\\s[\\s\\w]{$_,}?", reverse 1 .. ($len - 3)),
                  "\\w+\\s+")
           . (")" x ($len - 3))
           . "\\w+";
    warn "re: /$re/\n";
    return qr/($re)/;
}

my $re = make_re(5);
while (<DATA>) {
    print "$1\n" if $_ =~ $re;
}
__DATA__
foo foo foooooooo foooooooo fooo foo foo foo foo foo foo foooooooo foo foo foo f foo fo fo foo f fo foo f fo
```
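To see what the generator produces, here is a small sketch that prints the pattern for a length of 5 and tries it on a few inputs. `make_re_string` is a hypothetical helper introduced only for this illustration. It uses the same construction as `make_re`, assumes the alternatives are joined with a plain `|`, and returns the pattern as a string instead of a compiled regexp.

```perl
use strict;
use warnings;

# Hypothetical helper: same construction as make_re, but it returns the
# pattern as a string and joins the alternatives with a plain '|'.
sub make_re_string {
    my $len = shift;
    $len > 3 or die "len <= 3";
    return "\\b"
        . join('|',
               map("\\w(?:\\s[\\s\\w]{$_,}?", reverse 1 .. ($len - 3)),
               "\\w+\\s+")
        . (")" x ($len - 3))
        . "\\w+";
}

my $pattern = make_re_string(5);
print "$pattern\n";

# Try the compiled pattern on a long word, a too-short sentence,
# and a few sentences that reach the minimum length.
my $re = qr/($pattern)/;
for my $text ('foooooooo', 'f fo', 'f foo', 'fo foo', 'foo foo foo foo') {
    my $result = $text =~ $re ? "matched '$1'" : 'no match';
    print "'$text': $result\n";
}
```

For a length of 5 the pattern comes out as `\b\w(?:\s[\s\w]{2,}?|\w(?:\s[\s\w]{1,}?|\w+\s+))\w+`. Each nesting level accounts for one more word character before the first whitespace, so every alternative needs at least five characters and at least one whitespace in between. A single long word therefore never matches, and for `foo foo foo foo` the match stops at `foo foo`.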
Re^4: RE question: Sentence with a minimum length by salva (Canon) on Oct 06, 2008 at 16:41 UTC
I was curious about the efficiency of the regular expression generated in my previous post. I ran some benchmarks, and the results were somewhat unexpected, at least for me!

```perl
my $len = 300;

# build about a thousand lines of random "words" and "sentences"
my @lines;
push @lines, join('', (map { 'f' . ('o' x rand $len * 1.5), (rand > .8 ? '. ' : ' ') } 0..rand 20 ), "\n") for 0 .. 1000;

sub make_re {
    my $len = shift;
    my $re = "\\b"
           . join('|',
                  map("\\w(?:\\s[\\s\\w]{$_,}?", reverse 1 .. ($len - 3)),
                  "\\w+\\s+")
           . (")" x ($len - 3))
           . "\\w+";
    qr/($re)/;
}

# match maximal length sentence
my $len_minus_two = $len - 2;
sub max { my @m = grep /\s\b\w(?=\w\s+\w+)[\w\s]{$len_minus_two,}\w/o, @lines }

# match minimal length sentence
my $re = make_re $len;
sub min { my @m = grep /$re/, @lines }

use Benchmark qw(cmpthese);
cmpthese(-1, { max => \&max, min => \&min } );
__OUTPUT__
      Rate  max  min
max 24.3/s   -- -26%
min 33.0/s  36%   --
```

Note that the two regexps used match different things.
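A small sketch of how the two patterns differ on the same input, using a length of 5 instead of 300 to keep it readable. The sample line is an assumption chosen for illustration. It starts with a space and a two-letter word, because the "max" pattern as written wants leading whitespace and exactly one more word character before the first gap.

```perl
use strict;
use warnings;

my $len  = 5;                # small length, for illustration only
my $line = ' ab cd ef gh';   # assumed sample input

# "max" pattern from the benchmark: greedy, wants leading whitespace.
my $n      = $len - 2;
my $max_re = qr/(\s\b\w(?=\w\s+\w+)[\w\s]{$n,}\w)/;

# "min" pattern: the nested alternation that make_re generates.
my $min_re = do {
    my $p = "\\b"
        . join('|',
               map("\\w(?:\\s[\\s\\w]{$_,}?", reverse 1 .. ($len - 3)),
               "\\w+\\s+")
        . (")" x ($len - 3))
        . "\\w+";
    qr/($p)/;
};

my ($max_match) = $line =~ $max_re;
my ($min_match) = $line =~ $min_re;
print "max: '$max_match'\n";
print "min: '$min_match'\n";
```

On this input the "max" pattern takes the whole line, leading space included, while the "min" pattern stops at `ab cd`, the first stretch that reaches the minimum length and ends on a word boundary.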
Re^2: RE question: Sentence with a minimum length by lima1 (Curate) on Oct 06, 2008 at 09:08 UTC
Ah, nice. Seems to work perfectly! Thank you very much... Update: ++moritz. But that's still ok for me, because it filters my problematic case (just a long word). That's all I need here.

