regex at word boundary

mikeraz has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: regex at word boundary by japhy (Canon) on Dec 07, 2005 at 17:54 UTC
Backreferences in a regex are `\NUMBER`, not `$NUMBER`. Jeff `japhy` Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and `perl` hacker How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart 2005-12-12 Retitled by jdporter: s/boundry/boundary/ Original title: 'regex at word boundry'	[reply] [d/l] [select]
Re^2: regex at word boundary by mikeraz (Friar) on Dec 07, 2005 at 17:56 UTC
Ow! Ow, ow, ow. Thank you. Off to get more coffee. Be Appropriate && Follow Your Curiosity	[reply]
Re: regex at word boundary by davido (Cardinal) on Dec 07, 2005 at 18:01 UTC
japhy correctly mentioned that $2 isn't a backreference, \2 is. But there's another problem. If you wish for such a RE to match "God dog", that one won't be selective enough to exclude non-palindromes. This is because it only fails to match if the string you're comparing doesn't have two of the same letter. For example, "Pete" would match, because it has two "e's". But "God" wouldn't match, because it doesn't have two of any particular letter in it. Dave	[reply]
Re: regex at word boundary by ww (Archbishop) on Dec 07, 2005 at 18:09 UTC
mikeraz: Welcome. Briefly, re your actual question... `\b` marks a word_boundary, not the letter "b" -- except inside a character_class, where it's the metachar for backspace `/i` is a modifier which tells the regex to search in case_INsensitive mode. The Monastery does some magic with regard to rendering, so read Perl Monks Approved HTML tags. For example, use `[` to write an open_square_bracket here; a literal bracket, except inside <code>... </code> or <c>... </c> tags operates in the Monastery's specialized parsing as a link token. In short, you need to read how to post here, starting with Welcome to the Monastery and How do I post a question effectively?-- but even before that, you should probably read perlretut and similar regex introductions (including Tutorials#regexdata) as your question suggests you've missed some very basic info.	[reply] [d/l] [select]
Re: regex at word boundary by mikeraz (Friar) on Dec 07, 2005 at 22:06 UTC
I reviewed the link that swkronenfeld provided. The code samples, or at least as many as I waded through, did not find palindromes. The found mirror strings. I prefer pi is a palindrome. aaabbcbbaaa is not a palindrome. After writing to steele1381 and QM about intent and all that I decided I'd better step up and show the code. I believe this word end searching should be faster than a pure regex version but have not tested it yet. Here's the trivial program I wrote to take a text file and locate all the palindromes. Note: it's trivial and should be easy to break. In particular the punct filter is incomplete, HTML will break it . . . #!/usr/bin/perl -w use strict; # find palindromes in text file my ($le, @lines, @F, $test, $pal, %pals, $start_char, $i, $word); # cross line boundries but not paragraph boundries $le = $/; $/ = "\n\n"; @lines = <>; $/ = $le; foreach (@lines) { # strip punct - no posix on my system s/[)(\?".,\/]//g; s/-/ /g; chomp; (@F) = split; while (int @F) { # select array slices where last letter of last word in # slice equals first letter of first word $start_char = lc substr $F[0], 0, 1; foreach $i (1 .. $#F) { if( (lc substr $F[$i], -1) eq $start_char) { # test for palindrome $test = lc join "", @F[0..$i]; if($test eq reverse $test) { $pal = join " ", @F[0..$i]; $pals{$pal}++; } } } # grab single word palindromes $word = shift @F; if(length $word > 2 && lc $word eq lc reverse $word) { $pals{$word}++; } } } foreach $pal (sort keys %pals) { print "$pals{$pal}\t$pal\n"; } [download] Be Appropriate && Follow Your Curiosity	[reply] [d/l]
Re: regex at word boundary by QM (Parson) on Dec 07, 2005 at 19:49 UTC
As others have mentioned, you aren't going to find palindromes with that. Single word palindromes are problematic without some code magic. You really need regex recursion, which is what Regexp::Common::lingua does. There are ways to do it with embedded regex `??{code}`, without being as fancy as in R::C::l, but I'd go with R::C::l. If you really want to do it yourself, it's easy enough to Google for these, so I'll leave that up to you. As far as I know, there's not an easy way to skip over whitespace, without explicitly stripping it out first (but maybe I'm just not being creative enough today). -QM -- Quantum Mechanics: The dreams stuff is made of 2005-12-12 Retitled by jdporter: s/boundry/boundary/ Original title: 'regex at word boundry'	[reply] [d/l]
Re^2: regex at word boundary by davido (Cardinal) on Dec 08, 2005 at 04:27 UTC
You're right about the `(??{code})` construct facilitating the use of regular expressions in detecting palindromes. Behold: `use strict; use warnings; my $string = "God dog"; my $re = qr/^ (.+) (??{ ( length $^N > 1 and lc $^N eq reverse lc $^N ) ? '' : '1\b2' }) $/ix; print "$string is", ($string =~ $re) ? "" : "n't", " a palindrome.\n";` [download] One trick here; since `(??{code})` must spawn a regular subexpression that will then be evaluated for truth, the code block I used plays a little trick. If the boolean test of reversability succeeds, the (??{code}) expression returns an empty string (which, in other words, adds no additional criteria to be matched). If the reversability test fails, the code returns a regular subexpression that can never succeed; a search for '1\b2' (in other words, the literal number one, followed by the literal number two, but with a word boundry sandwiched between; an impossibility). That way the outcome of the reversability test can force a failure to match for the entire regular expression. Also, the use of `$^N` is a convenience described in perlvar and perlre. It contains the most recent parenthetical capture. That way you don't need to count capturing parens, not that it's an issue in this case. By the way, in my solution I chose to accept palindromes regardless of whether they contain only alpha characters or not. Why? Wikipedia says, "A palindrome is a word, phrase, number or any other sequence of units (like a strand of DNA) which has the property of reading the same in either direction..." It also says that the position of spaces can usually be adjusted as necessary. My regexp doesn't accomodate that possibility; it treats space like any other character. But why not; it's just a proof of concept. ;) Dave	[reply] [d/l] [select]
Re^3: regex at word boundary by QM (Parson) on Dec 08, 2005 at 04:50 UTC
Very good. `'1\b2'` [download] I would point out the existence of `(?!)`, which I'm told always fails, though I'm not sure I understand it. Update 1: By the way, in my solution I chose to accept palindromes regardless of whether they contain only alpha characters or not. Yes. The OP was looking for multi-word palindromes, perhaps more along the lines of the "interesting to humans" variety, which seems to be what started the thread in the first place. Update 2: After further examination, your idea could be adapted for intervening whitespace (or indeed any noise characters) if the regex engine was re-entrant (is that with or without the hyphen?). Something like: `(??{ local $N; ($N = $^N) =~ s/\w+//g; (lc $N eq reverse lc $N) ? '' : (?!) })` [download] which might be further streamlined to `(??{ local $N; ($N = $^N) =~ s/\w+//g; (?!) if (lc $N ne reverse lc $N) })` [download] I think `length $^N > 1` is superfluous, as `length $^N` is sufficient as a test, and `(.+)` would always be positive anyway (or is there a zero length character that would match?) -QM -- Quantum Mechanics: The dreams stuff is made of	[reply] [d/l] [select]
Re^4: regex at word boundary by davido (Cardinal) on Dec 08, 2005 at 05:03 UTC
Re^5: regex at word boundary by QM (Parson) on Dec 08, 2005 at 05:12 UTC
Re^4: regex at word boundary by davido (Cardinal) on Dec 08, 2005 at 05:10 UTC
Re^5: regex at word boundary by QM (Parson) on Dec 08, 2005 at 05:25 UTC
Re^4: regex at word boundary by davido (Cardinal) on Dec 08, 2005 at 05:57 UTC
Re^5: regex at word boundary by QM (Parson) on Dec 10, 2005 at 05:46 UTC
Some notes below your chosen depth have not been shown here
Re^2: regex at word boundary by mikeraz (Friar) on Dec 07, 2005 at 20:32 UTC
You're right, the (corrected) code won't find palindromes. It wasn't intended to. It was the first step in finding a palindrome candidate. After the candidate is found I can strip out spaces and check to see if the word group does qualify as a palindrome. Ultimately I don't care about finding palindromes. It is just a little puzzle to apply some time to. Some people do soduko puzzles, some crosswords, I look for little problems that I don't know the answer to and explore the possibilities. Be Appropriate && Follow Your Curiosity	[reply]
Re^3: regex at word boundary by QM (Parson) on Dec 07, 2005 at 21:07 UTC
You're right, the (corrected) code won't find palindromes. It wasn't intended to. It was the first step in finding a palindrome candidate. OK, fair enough. So what did you want it to do? Did you just want to check for repeated characters at the proper end of word boundaries? [Insert "Lightbulb Over Head" graphic here ] You just want a filter to grab candidates, then strip whitespace, then test for palindrome-ness. I think you'd do us a favor by coming up with an appropriate regex with `??{code}`, something near: `/(\b (([a-z]) # at least one char ?{local @suffix; push @suffix, $^N}) # build up suffix (([a-z) ?{local @suffix; push @suffix, $^N} \| \S+)+ [a-z]? # possible odd char (([a-z]) (?{if ($^N eq $suffix[-1]) {pop @suffix}}) \| \S+)+) (([a-z]) (?{if ($^N eq $suffix[-1]) {pop @suffix}}) \b) /x` [download] Though I don't think that works, and it's untested. (Hey, I have to leave something up to you.) Besides, there's a lot left out of the perlre with `?{code}` and `??{code}`, so please come back and tell us what it should say :) -QM -- Quantum Mechanics: The dreams stuff is made of	[reply] [d/l] [select]
Palindrome regex (was: regex at word boundary) by Roy Johnson (Monsignor) on Dec 12, 2005 at 17:10 UTC
I came up with this recursive-regex code. It's a bit simpler than most of the others posted here. It finds the leftmost longest palindrome in each line. `my $palin_re; $palin_re = qr/(([a-z]) # First letter [^a-z]* # any irrelevant chars (?:(??{$palin_re})\|[a-z]?) # The interior [^a-z]* # any other irrelevant chars \2) # the last letter matches the first /xi; while (<DATA>) { if (/\b$palin_re\b/) { print "Matched $1\n"; } else { print "No palindrome found in '$_'\n"; } } __DATA__ oo madam in eden, I'm Adam, prince of eternia is god a dog? nested testest detsen nested i prefer pi ip referp nothing to see here, folks.` [download] Caution: Contents may have been coded under pressure.	[reply] [d/l]
Re: regex at word boundary by QM (Parson) on Dec 10, 2005 at 15:48 UTC
Is the code at Re^5: regex at word boundary useful? I thought you might compare it with your filter version, and see what falls out. Also, it could be made to search across newlines, but at considerable slowdown if not done carefully. -QM -- Quantum Mechanics: The dreams stuff is made of	[reply]
Re^2: regex at word boundary by mikeraz (Friar) on Dec 12, 2005 at 16:45 UTC
The code at that Re^5: regex at word boundary fails on overlapping palindromes, not that I'm aware of any that exist in the wild. Such that adding nested testest detsen nested i prefer pi ip referp yeilds: line 7: (0) "nested testest detsen nested" (7) "testest detsen nested" (15) "detsen nested" (22) "nested" line 8: (0) "i prefer pi ip referp" (2) "prefer pi ip referp" (9) "pi ip referp" (12) "ip referp" (15) "referp" I also tested it on a handy text file of 79,569 lines and it ran much slower than the code I listed above, modified to just test on each line, not each paragraph. sunorccws04 ~$ time ./mr_pal.pl trf > mr.out real 1m2.161s user 1m1.210s sys 0m0.280s sunorccws04 ~$ time ./qm_pal.pl trf > qm.out real 2m53.492s user 2m49.070s sys 0m1.690s trf is the output of a tcpdump session. Other data sets are sure to produce differing comparative speeds. Be Appropriate && Follow Your Curiosity	[reply]
Re^3: regex at word boundary by QM (Parson) on Dec 12, 2005 at 18:16 UTC
Are you comparing apples to apples? Does the other code find overlapping palindromes? -QM -- Quantum Mechanics: The dreams stuff is made of	[reply]
Re^4: regex at word boundary by mikeraz (Friar) on Dec 12, 2005 at 18:22 UTC
Re: regex at word boundary by swkronenfeld (Hermit) on Dec 07, 2005 at 18:00 UTC
Edit: removing comments except for link due to <pre> <code> issue. Have a look here Try this in Regexp palindrome (found using the super search) for a discussion on this topic.	[reply]


Think about Loose Coupling
	PerlMonks