mikeraz has asked for the wisdom of the Perl Monks concerning the following question:
I'm writing a bit to look for palindromes. To search for those with embedded whitespace, like "God dog" I started with this regex:
/(\b([a-zA-Z]).*$2\b)/i;
But it matches any line. For instance:
Being able to script this in Perl gives a lot more flexibility, and
is arguably easier to use and extend.
I'm blind as to where the $2 match is occuring in those two lines. There is no word ending in b on the first line. There is no word ending in i on the second.
What is the problem with my perception of the regex?
update: OP used <pre> instead of <code> tags. This confused the issue for the first few replies. Question was really and completely addressed by japhy - I was asking for the smack upside the head to clear out temporary blindness.
Be Appropriate && Follow Your Curiosity
Re: regex at word boundary
by japhy (Canon) on Dec 07, 2005 at 17:54 UTC
|
| [reply] [d/l] [select] |
|
| [reply] |
Re: regex at word boundary
by davido (Cardinal) on Dec 07, 2005 at 18:01 UTC
|
japhy correctly mentioned that $2 isn't a backreference, \2 is. But there's another problem. If you wish for such a RE to match "God dog", that one won't be selective enough to exclude non-palindromes. This is because it only fails to match if the string you're comparing doesn't have two of the same letter. For example, "Pete" would match, because it has two "e's". But "God" wouldn't match, because it doesn't have two of any particular letter in it.
| [reply] |
Re: regex at word boundary
by ww (Archbishop) on Dec 07, 2005 at 18:09 UTC
|
mikeraz:
Welcome. Briefly, re your actual question...
\b marks a word_boundary, not the letter "b" -- except inside a character_class, where it's the metachar for backspace
/i is a modifier which tells the regex to search in case_INsensitive mode.
The Monastery does some magic with regard to rendering, so read Perl Monks Approved HTML tags. For example, use [ to write an open_square_bracket here; a literal bracket, except inside <code>... </code> or <c>... </c> tags operates in the Monastery's specialized parsing as a link token.
In short, you need to read how to post here, starting with Welcome to the Monastery and How do I post a question effectively?-- but even before that, you should probably read perlretut and similar regex introductions (including Tutorials#regexdata) as your question suggests you've missed some very basic info.
| [reply] [d/l] [select] |
Re: regex at word boundary
by mikeraz (Friar) on Dec 07, 2005 at 22:06 UTC
|
I reviewed the link that swkronenfeld provided. The code samples, or at least as many as I waded through, did not find palindromes. The found mirror strings. I prefer pi is a palindrome. aaabbcbbaaa is not a palindrome.
After writing to steele1381 and QM about intent and all that I decided I'd better step up and show the code.
I believe this word end searching should be faster than a pure regex version but have not tested it yet.
Here's the trivial program I wrote to take a text file and locate all the palindromes. Note: it's trivial and should be easy to break. In particular the punct filter is incomplete, HTML will break it . . .
#!/usr/bin/perl -w
use strict;
# find palindromes in text file
my ($le, @lines, @F, $test, $pal, %pals, $start_char, $i, $word);
# cross line boundries but not paragraph boundries
$le = $/;
$/ = "\n\n";
@lines = <>;
$/ = $le;
foreach (@lines) {
# strip punct - no posix on my system
s/[)(\?".,\/]//g;
s/-/ /g;
chomp;
(@F) = split;
while (int @F) {
# select array slices where last letter of last word in
# slice equals first letter of first word
$start_char = lc substr $F[0], 0, 1;
foreach $i (1 .. $#F) {
if( (lc substr $F[$i], -1) eq $start_char) {
# test for palindrome
$test = lc join "", @F[0..$i];
if($test eq reverse $test) {
$pal = join " ", @F[0..$i];
$pals{$pal}++;
}
}
}
# grab single word palindromes
$word = shift @F;
if(length $word > 2 && lc $word eq lc reverse $word) {
$pals{$word}++;
}
}
}
foreach $pal (sort keys %pals) {
print "$pals{$pal}\t$pal\n";
}
Be Appropriate && Follow Your Curiosity
| [reply] [d/l] |
Re: regex at word boundary
by QM (Parson) on Dec 07, 2005 at 19:49 UTC
|
As others have mentioned, you aren't going to find palindromes with that. Single word palindromes are problematic without some code magic. You really need regex recursion, which is what Regexp::Common::lingua does.
There are ways to do it with embedded regex ??{code}, without being as fancy as in R::C::l, but I'd go with R::C::l. If you really want to do it yourself, it's easy enough to Google for these, so I'll leave that up to you.
As far as I know, there's not an easy way to skip over whitespace, without explicitly stripping it out first (but maybe I'm just not being creative enough today).
-QM
--
Quantum Mechanics: The dreams stuff is made of
2005-12-12 Retitled by jdporter: s/boundry/boundary/ Original title: 'regex at word boundry' | [reply] [d/l] |
|
use strict;
use warnings;
my $string = "God dog";
my $re = qr/^
(.+)
(??{
(
length $^N > 1
and lc $^N eq reverse lc $^N
) ? '' : '1\b2'
})
$/ix;
print "$string is", ($string =~ $re)
? ""
: "n't",
" a palindrome.\n";
One trick here; since (??{code}) must spawn a regular subexpression that will then be evaluated for truth, the code block I used plays a little trick. If the boolean test of reversability succeeds, the (??{code}) expression returns an empty string (which, in other words, adds no additional criteria to be matched). If the reversability test fails, the code returns a regular subexpression that can never succeed; a search for '1\b2' (in other words, the literal number one, followed by the literal number two, but with a word boundry sandwiched between; an impossibility). That way the outcome of the reversability test can force a failure to match for the entire regular expression.
Also, the use of $^N is a convenience described in perlvar and perlre. It contains the most recent parenthetical capture. That way you don't need to count capturing parens, not that it's an issue in this case.
By the way, in my solution I chose to accept palindromes regardless of whether they contain only alpha characters or not. Why? Wikipedia says, "A palindrome is a word, phrase, number or any other sequence of units (like a strand of DNA) which has the property of reading the same in either direction..." It also says that the position of spaces can usually be adjusted as necessary. My regexp doesn't accomodate that possibility; it treats space like any other character. But why not; it's just a proof of concept. ;)
| [reply] [d/l] [select] |
|
(??{
local $N;
($N = $^N) =~ s/\w+//g;
(lc $N eq reverse lc $N)
? '' : (?!)
})
which might be further streamlined to
(??{
local $N;
($N = $^N) =~ s/\w+//g;
(?!) if (lc $N ne reverse lc $N)
})
I think length $^N > 1 is superfluous, as length $^N is sufficient as a test, and (.+) would always be positive anyway (or is there a zero length character that would match?)
-QM
--
Quantum Mechanics: The dreams stuff is made of
| [reply] [d/l] [select] |
|
|
|
|
|
|
|
|
You're right, the (corrected) code won't find palindromes. It wasn't intended to. It was the first step in finding a palindrome candidate. After the candidate is found I can strip out spaces and check to see if the word group does qualify as a palindrome.
Ultimately I don't care about finding palindromes. It is just a little puzzle to apply some time to. Some people do soduko puzzles, some crosswords, I look for little problems that I don't know the answer to and explore the possibilities.
Be Appropriate && Follow Your Curiosity
| [reply] |
|
You're right, the (corrected) code won't find palindromes. It wasn't intended to. It was the first step in finding a palindrome candidate.
OK, fair enough. So what did you want it to do? Did you just want to check for repeated characters at the proper end of word boundaries?
[Insert "Lightbulb Over Head" graphic here ]
You just want a filter to grab candidates, then strip whitespace, then test for palindrome-ness.
I think you'd do us a favor by coming up with an appropriate regex with ??{code}, something near:
/(\b
(([a-z]) # at least one char
?{local @suffix; push @suffix, $^N}) # build up suffix
(([a-z)
?{local @suffix; push @suffix, $^N}
| \S+)+
[a-z]? # possible odd char
(([a-z])
(?{if ($^N eq $suffix[-1]) {pop @suffix}})
| \S+)+)
(([a-z])
(?{if ($^N eq $suffix[-1]) {pop @suffix}})
\b)
/x
Though I don't think that works, and it's untested. (Hey, I have to leave something up to you.) Besides, there's a lot left out of the perlre with ?{code} and ??{code}, so please come back and tell us what it should say :)
-QM
--
Quantum Mechanics: The dreams stuff is made of
| [reply] [d/l] [select] |
|
I came up with this recursive-regex code. It's a bit simpler than most of the others posted here. It finds the leftmost longest palindrome in each line.
my $palin_re;
$palin_re = qr/(([a-z]) # First letter
[^a-z]* # any irrelevant chars
(?:(??{$palin_re})|[a-z]?) # The interior
[^a-z]* # any other irrelevant chars
\2) # the last letter matches the first
/xi;
while (<DATA>) {
if (/\b$palin_re\b/) {
print "Matched $1\n";
}
else {
print "No palindrome found in '$_'\n";
}
}
__DATA__
oo
madam in eden, I'm Adam, prince of eternia
is god a dog?
nested testest detsen nested
i prefer pi ip referp
nothing to see here, folks.
Caution: Contents may have been coded under pressure.
| [reply] [d/l] |
Re: regex at word boundary
by QM (Parson) on Dec 10, 2005 at 15:48 UTC
|
Is the code at Re^5: regex at word boundary useful? I thought you might compare it with your filter version, and see what falls out. Also, it could be made to search across newlines, but at considerable slowdown if not done carefully.
-QM
--
Quantum Mechanics: The dreams stuff is made of
| [reply] |
|
nested testest detsen nested
i prefer pi ip referp
yeilds:
line 7:
(0) "nested testest detsen nested"
(7) "testest detsen nested"
(15) "detsen nested"
(22) "nested"
line 8:
(0) "i prefer pi ip referp"
(2) "prefer pi ip referp"
(9) "pi ip referp"
(12) "ip referp"
(15) "referp"
I also tested it on a handy text file of 79,569 lines and it ran much slower than the code I listed above, modified to just test on each line, not each paragraph.
sunorccws04 ~$ time ./mr_pal.pl trf > mr.out
real 1m2.161s
user 1m1.210s
sys 0m0.280s
sunorccws04 ~$ time ./qm_pal.pl trf > qm.out
real 2m53.492s
user 2m49.070s
sys 0m1.690s
trf is the output of a tcpdump session. Other data sets are sure to produce differing comparative speeds.
Be Appropriate && Follow Your Curiosity
| [reply] |
|
Are you comparing apples to apples? Does the other code find overlapping palindromes?
-QM
--
Quantum Mechanics: The dreams stuff is made of
| [reply] |
|
Re: regex at word boundary
by swkronenfeld (Hermit) on Dec 07, 2005 at 18:00 UTC
|
Edit: removing comments except for link due to <pre> <code> issue.
Have a look here Try this in Regexp palindrome (found using the super search) for a discussion on this topic.
| [reply] |
|
|