Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number

Weird "soundex" algorithm

by giulienk (Curate)
on Aug 28, 2003 at 14:59 UTC ( #287394=perlquestion: print w/replies, xml ) Need Help??

giulienk has asked for the wisdom of the Perl Monks concerning the following question:

I need to work out a bunch of data with an algorithm i didn't choose. The algorithm has nothing to do with the "real" soundex algorithm (see Text::Soundex), even though it was called like that where i read it. So from now on i will baptize this algorithm Weirdex; this is described to receive a string in input and then:
  • Strip chars outside [a-zA-Z] range
  • Take all consonants except doubles (that should be reduced to one)
  • Take just the first vowel

larry walllarywl

My first coding of Weirdex is

sub weirdex { local $_ = shift; my ($weirdex, $vowel) = ('', 0); s/[^a-zA-Z]//g; for (split '') { if (/[aeiou]/i) { $vowel++ or $weirdex .= $_; } else { $weirdex .= $_ if substr($weirdex, -1, 1) ne $_; } } return $weirdex; }

I'm interested in more elegant/efficient solutions, only regexp solutions and even Golf/Obfu ones.
Thank you.


Replies are listed 'Best First'.
Re: Weird "soundex" algorithm
by broquaint (Abbot) on Aug 28, 2003 at 15:17 UTC
    sub weirdex { local $_ = shift; tr/a-zA-Z//dc; tr/a-zA-Z//s; m/[aeiou]/g and substr($_, pos) =~ s/[aieuo]//g; $_; } print "$_ - ", weirdex($_), $/ for "giulienk", "larry wall", "etheroskedasticity"; __output__ giulienk - gilnk larry wall - larywl etheroskedasticity - ethrskdstcty
    tr//, m// and s///, pos and substr are your friends :)


      Thanks to yours and jmcnamara's reply: i totally forgot about the power of tr/// modifiers, especially c and s, which i never used before. :)


      Nice. I was unhappy with your vowel stripping though. You use a looping match for offsets into $_ but really, you only have to find the *first* vowel and then no looping is required.

      # m/[aeiou]/g and substr($_, pos) =~ s/[aieuo]//g; /[aeiou]/ and substr( $_, $+[0] ) =~ tr/aeiou//d;
        You use a looping match for offsets into $_ but really ...
        Er, what looping? The /g matches the first vowel then saves the position of the match for substr which the replace then operates on. I didn't want to use the $+ variable because of the overhead it invokes.


Re: Weird "soundex" algorithm
by jmcnamara (Monsignor) on Aug 28, 2003 at 15:20 UTC

    sub weirdex { local $_ = $_[0]; tr/a-zA-Z//s; tr/a-zA-Z//cd; my $i; s/([aeiou])/$1if!$i++/eg; $_; }


Re: Weird "soundex" algorithm
by rcaputo (Chaplain) on Aug 28, 2003 at 15:46 UTC
Re: Weird "soundex" algorithm
by Aristotle (Chancellor) on Aug 28, 2003 at 16:03 UTC
    use strict; use Test::More 'no_plan'; sub weirdex { local $_ = shift; tr/A-Za-z//cd; tr/A-Za-z//s; /[aeiou]/g; substr($_, pos()) =~ s/[aeiou]+//g; return $_; } my %test = ( 'giulienk' => 'gilnk', 'larry wall' => 'larywl', 'etheroskedasticity' => 'ethrskdstcty', ); my ($input, $result); is(weirdex($input), $result, "$input") while ($input, $result) = each +%test; __END__ ok 1 - etheroskedasticity ok 2 - larry wall ok 3 - giulienk 1..3

    Makeshifts last the longest.

Re: Weird "soundex" algorithm
by Anonymous Monk on Aug 28, 2003 at 15:37 UTC
    /[aeiou]/i should be  /[aeiouy]/i.

    Y is a vowel, isn't it?

      Sometimes, so it may not be necessary for this particular algorithm. It would also appear that two Welsh words in the dictionary use 'w' as a vowel.


      Y and W are pseudo-vowels. They represent vowel sounds, but are not vowel letters. This is just one of the many contradictions between spoken and written language. I think W is a vowel in Croatian.
Re: Weird "soundex" algorithm
by giulienk (Curate) on Sep 01, 2003 at 06:56 UTC
    I add a couple comments for posterity:
    • All the solutions (except the original one) have the same problem: they don't care about upper-case vowels. All examples were in lower cases so nobody noticed it.
    • The original implementation was wrong has it would consider consonant separated by a "non-first-vowel" as contigous ones and squash them if they are equal.
    • The illustrious surname "Schwartz" with its consonant-to-vowel ratio of 7.0 render this algorithm pretty useless :)


Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://287394]
Approved by jwest
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chilling in the Monastery: (5)
As of 2021-03-05 17:30 GMT
Find Nodes?
    Voting Booth?
    My favorite kind of desktop background is:

    Results (113 votes). Check out past polls.