PerlMonks  

Re^3: RE question: Sentence with a minimum length

by salva (Canon)
on Oct 06, 2008 at 11:01 UTC ([id://715550])


in reply to Re^2: RE question: Sentence with a minimum length
in thread RE question: Sentence with a minimum length

Well, I am not a native English speaker, but as I understand it...
I want to match sentences (everything that is not just a long word) of a minimum length
...doesn't mean that the sentence has to be as short as possible, but that its length has to be equal to or greater than $minimum_length.

Anyway, if a minimal match is what you want, it can still be done with a regexp (without embedded code!), though a complicated one:

    sub make_re {
        my $len = shift;
        $len > 3 or die "len <= 3";
        my $re = "\\b"
               . join('|',
                      map("\\w(?:\\s[\\s\\w]{$_,}?", reverse 1 .. ($len - 3)),
                      "\\w+\\s+")
               . (")" x ($len - 3))
               . "\\w+";
        warn "re: /$re/\n";
        return qr/($re)/;
    }

    my $re = make_re(5);

    while (<DATA>) {
        print "$1\n" if $_ =~ $re;
    }

    __DATA__
    foo foo foooooooo foooooooo fooo foo foo foo foo foo foo foooooooo foo foo foo f foo fo fo foo f fo foo f fo
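To make the generated pattern easier to follow, here is a sketch of what make_re(5) should build, expanded by hand with the /x flag. The layout, the comments and the sample strings are my own assumptions for illustration; they are not part of the original post.

```perl
use strict;
use warnings;

# Hand-expanded form of the pattern built for a minimum length of 5.
# Each branch covers one possible position of the first whitespace.
my $re = qr/
    (
        \b \w
        (?:
            \s [\s\w]{2,}?       # whitespace is 2nd char: pad lazily with 2 or more
          |
            \w (?:
                \s [\s\w]{1,}?   # whitespace is 3rd char: pad lazily with 1 or more
              |
                \w+ \s+          # 3 or more word chars, then whitespace
            )
        )
        \w+                      # always end on a whole word
    )
/x;

# Hypothetical sample inputs, not taken from the thread.
for my $s ('a b c d e f', 'f fo', 'foooooooo', 'foo foo foo') {
    if ($s =~ $re) { print "'$s' => '$1'\n" }
    else           { print "'$s' => no match\n" }
}
```

With these inputs the script should print 'a b c' for the first string, no match for 'f fo' (only 4 characters) and for 'foooooooo' (a single long word), and 'foo foo' for the last one.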

Replies are listed 'Best First'.
Re^4: RE question: Sentence with a minimum length
by salva (Canon) on Oct 06, 2008 at 16:41 UTC
    I was curious about the efficiency of the regular expression generated in my previous post. I ran some benchmarks, and the results were somewhat unexpected, at least for me:

        my $len = 300;

        my @lines;
        push @lines, join('', (map { 'f' . ('o' x rand $len * 1.5),
                                     (rand > .8 ? '. ' : ' ') } 0..rand 20 ), "\n")
            for 0 .. 1000;

        sub make_re {
            my $len = shift;
            my $re = "\\b"
                   . join('|',
                          map("\\w(?:\\s[\\s\\w]{$_,}?", reverse 1 .. ($len - 3)),
                          "\\w+\\s+")
                   . (")" x ($len - 3))
                   . "\\w+";
            qr/($re)/;
        }

        # match maximal length sentence
        my $len_minus_two = $len - 2;
        sub max {
            my @m = grep /\s*\b\w(?=\w*\s+\w+)[\w\s]{$len_minus_two,}\w/o, @lines
        }

        # match minimal length sentence
        my $re = make_re $len;
        sub min {
            my @m = grep /$re/, @lines
        }

        use Benchmark qw(cmpthese);
        cmpthese(-1, { max => \&max, min => \&min } );

        __OUTPUT__
              Rate  max  min
        max 24.3/s   -- -26%
        min 33.0/s  36%   --
    Note that the two regexps used match different things.
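As a small illustration of that difference, the sketch below runs both patterns against the same string. It assumes a much shorter length of 5 and a made-up input, and it adds a capture group to the greedy pattern so the matched text can be printed; none of that comes from the benchmark itself.

```perl
use strict;
use warnings;

my $len = 5;    # assumed small length, chosen only for illustration

# Generator from the post above, without the diagnostic warn.
sub make_re {
    my $len = shift;
    my $re = "\\b"
           . join('|',
                  map("\\w(?:\\s[\\s\\w]{$_,}?", reverse 1 .. ($len - 3)),
                  "\\w+\\s+")
           . (")" x ($len - 3))
           . "\\w+";
    return qr/($re)/;
}

my $len_minus_two = $len - 2;

# Greedy pattern from the benchmark, with a capture group added here.
my $max_re = qr/\s*(\b\w(?=\w*\s+\w+)[\w\s]{$len_minus_two,}\w)/;
my $min_re = make_re($len);

my $text = 'one two three four. five';    # hypothetical input

my ($max) = $text =~ $max_re;
my ($min) = $text =~ $min_re;

print "max: '$max'\n";
print "min: '$min'\n";
```

On this input the greedy pattern should capture 'one two three four' (everything up to the full stop), while the generated pattern stops at 'one two', the shortest run of whole words that reaches 5 characters.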
