Splitting strings into words when there are no separators

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Splitting strings into words when there are no separators (trie & !perl) by tye (Sage) on Sep 14, 2005 at 15:09 UTC
I wouldn't do this in Perl¹. I'd build a trie from the dictionary of possible search words (and stop words) and process the string much like the regex engine does, trying to greedily match the longest words in the dictionary as you inch along the string character-by-character, keeping a stack so you can backtrack when "yellowner" tries for "yellow" and fails on "ner" and so should try "yell" so it can match "owner". Fuzzy matching gets even trickier, but I'd also handle that via backtracking. Off the top of my head, I'd probably have a an acceptable error rate for the current word that starts out at 0 and increases only if backtracking requires it to. And have some upper limit on number of errors both as an absolute number of errors per word (say 3 or 4) and as a fraction of the number of letters in the words (say 15% or 20%). So, for each letter in your string, you look down the trie one letter until there aren't any longer words having what you've matched so far as a prefix. Each time you move over a letter, you push your state onto a stack so you can backtrack. Your state consists of the word prefix you've built up so far, the matching position in the trie, your position in the string, and how many errors you've tolerated and how many errors you are willing to tolerate for the current word. When you backtrack a word choice the first time, you simply look for a shorter word. If you have to backtrack after that, then you increment the allowed number of errors. If, going forward again, you don't find a matching letter or can't use the obvious matching letter since that was what you tried the last time, then you (if you are allowed to have more errors for this word) look for an error instead. Common errors would be: Dropping a letter: This means you search for grandchildren in the trie that match your next letter. Transposing two adjacent letters: This means you look in the trie for next two letters of the string but in the reverse order Inserting a letter: This means you just skip the next letter of the string and look in the trie for the letter after it Note how the middle item is a combination of the first and last items. So you'd probably want to try that error first and then move on to the other two types after that. Alternately, you could preload your dictionary with acceptable approximations to your search words, marked with their error score. If your dictionary of search words is not too large, then this would certainly be the way I would go. An efficient trie can hold quite a large dictionary. Allowing even a few simple errors balloons the number of "words" up very quickly, so you might want to make trade-offs on only allowing a very small number of errors or only allowing more likely errors (such as nearby letters phonetically or nearby on the keyboard or such) so that your dictionary stays small enough to be practical to process. - tye ¹ Perl is particularly slow at general character-at-a-time state machines -- though a mix of Perl and Inline::C could be considered, especially since demerphq already has trie support in such.	[reply]
combining (?(condition)yes\|no) and (?{code}) by halley (Prior) on Sep 14, 2005 at 14:35 UTC
It took me a bit to get the syntax correct, but there are two special regular-expression features which can be used in combination. `(?(condition)yes\|no)` and `(?{code})`. The perldoc perlre page explains that they can be combined, but gives no example. I then use the `(?!pattern)` construct with no pattern to force a backtrack for each non-word. Here's my example. `use strict; use warnings; my %vocab = map { $_ => 1 } qw/one two three four five six seven eight nine/; my $text = "onetwoeightxfour"; my $finder = qr/ (\w+?) (?(?{ not exists $vocab{$1} }) (?!) \| (?=) ) /x; for ($text =~ m/$finder/g) { print $_,$/; }` [download] Output: `one two eight four` [download] This particular solution is non-greedy: it finds the shortest known word, and leaves the rest for future matches. A more complicated solution would try harder to consume more letters for an early word if it led to fewer un-matched letters in the long-run: "bekindtostewardessesplease" should find 'stewardesses', not 'stew'. Luckily, one possible solution is simple: change `(\w+?)` to `(\w+)`, and be patient with the engine as it chugs through the additional backtracking work. Of course, you can fill the vocabulary hash with whatever you want, or use different code in the `(?{code})` construct to achieve the solution. You can also replace the `(?=)` success case to deal with extra unknown letters between words. -- `[ e d @ h a l l e y . c c ]`	[reply] [d/l] [select]
Re: Splitting strings into words when there are no separators by japhy (Canon) on Sep 14, 2005 at 15:15 UTC
Your employer (or whoever) should consider relaxing to `http://green-brick.foobar` and `http://foo.bar/green-brick`. A hyphen is perfectly valid in a server name, and saves you a lot of trouble. However, there are ways to do what you ask. Here's a very naive approach: #!/usr/bin/perl -l use strict; use warnings; # build word list in %words my %words; open my $dict, "/usr/dict/words"; chomp, $words{lc $_} = 1 while <$dict>; # UPDATE: added 'lc' close $dict; my $str = "perfumesmellslikecheese"; $str =~ m{ ^ # anchor to beginning of string (?{ [ ] }) # start $^R as an empty array ref (?: # match this block << (\w{2,}) # capture 2 or more letters to $1 (?(?{ $words{lc $1} }) # if lowercase '$1' is in %words... (?{ [ @{$^R}, $1 ] }) # add this word to the current list \| # otherwise... (?!) # fail (force \w{2,} to backtrack) ) )+ # >> one or more times $ # anchor to end of string (?{ print "@{$^R}" }) # print the words (with spaces) (?!) # fail (cause everything to backtrack) }x; [download] You can make the engine a great deal smarter by making it dynamically adjust -- making it only possible to match things you KNOW to be words, for example. Jeff `japhy` Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and `perl` hacker How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart	[reply] [d/l] [select]
Re^2: Splitting strings into words when there are no separators by inman (Curate) on Sep 15, 2005 at 16:39 UTC
This treatment compares each entry in the dictionary to the data and stores matching positions. All of the possible matches are then reconstructed in the printwords function. The words are read from the dictionary and compared without needing to be stored in memory. Read more... (1330 Bytes)	[reply] [d/l]
Re^2: Splitting strings into words when there are no separators by pingo (Hermit) on Jul 16, 2009 at 12:26 UTC
Ok, so this may be four years late, but still... For the benefit of someone who stumbles across this. If you don't mind being English-specific, the `(\w{2,})` bit could be extended slightly to `(\w{2,}\|[aAiI])`. Makes "a" and "I" match too.	[reply] [d/l] [select]
Re^3: Splitting strings into words when there are no separators by japhy (Canon) on Jul 23, 2009 at 16:21 UTC
Good point. Jeff `japhy` Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and `perl` hacker Nos autem praedicamus Christum crucifixum (1 Cor. 1:23) - The Cross Reference (My Blog)	[reply]
Re: Splitting strings into words when there are no separators by Tanktalus (Canon) on Sep 14, 2005 at 14:32 UTC
The most straight-forward approach, coming from a person who has never done this before, making it probably a very naive approach, is to take the whole string and pass it into a spell checker, such as aspell or ispell. If the spell check says "no", remove the last character. Repeat. Once you've found the longest possible word starting from the beginning, remove it, and repeat the entire process again until you've removed everything. Pitfalls: compound words, such as "downstairs" or even "into", are preferred over splitting the words; brute force method is probably as slow as it gets; figuring out where extra words go requires a knowledge of grammar which a spell checker can't do. Note that I've heard you can open a pipe to aspell so that you're not re-running it each time, so that would give you the speed of loading the dictionary into a hash and using C instead of perl to scan it, while reducing your development time and probably the runtime (since it may be able to skip loading parts of the dictionary file). And, with the right spell-checker, it may offer suggestions - although figuring out if it's a sensical suggestion is beyond me. Short version: my condolences on the job assignment. :-(	[reply]
Re: Splitting strings into words when there are no separators by japhy (Canon) on Sep 14, 2005 at 16:16 UTC
As far as fuzzy matching goes, I posted code once that showed how to do fuzzy matching with Perl regexes. Here's an updated version: sub mk_fuzzy { our ($m, $i, $d) = splice @_, 0, 3; use re 'eval'; qr{ (?{ [ $i, $d, $m ] }) ^ @{[ map $_[$_] =~ /!$/ ? substr($_[$_],0,-1) : qq{ (?: $_[$_] (?: \| (?(?{ \$^R->[0] }) @{[ $_ < $#_ and "(?! $_[$_+1] + )" ]} (?s: . ) (?{ [ \$^R->[0] - 1, \$^R->[1], \$^R->[2] ] }) \| (?!) + ) ) \| (?(?{ \$^R->[1] }) (?{ [ \$^R->[0], \$^R->[1] - 1, \$^R->[2] + ] }) \| (?!) ) \| (?(?{ \$^R->[2] }) (?! $_[$_] ) (?s: . ) (?{ [ \$^R->[0], \$ +^R->[1], \$^R->[2] - 1 ] }) \| (?!) ) ) }, 0 .. $#_ ]} $ (?{ [ [$m-$^R->[2], $m], [$i-$^R->[0], $i], [$d-$^R->[1], $d] ] }) }x; } [download] The use of the function is as follows: `my $test = mk_fuzzy( 1, # max number of modifications to allow 1, # max number of insertions to allow 1, # max number of deletions to allow qw( p e r l ) ); for my $word (qw( pearl earl pearly peely )) { if ($word =~ $test) { print "$word is close enough to 'perl'\n"; } else { print "$word isn't enough like 'perl'\n"; } }` [download] This reports that pearl, earl, and peely are close to "perl". The reason I send the letters of the word individually is because the function allows you to follow a letter with a ! which means it MUST appear in the word. (And the 'letters' don't have to be just letters, they could be multi-character strings.) Example: `my $test = mk_fuzzy( 1, # max number of modifications to allow 1, # max number of insertions to allow 1, # max number of deletions to allow qw( p e r! l ) # must have the 'r' in this relative location ); for my $word (qw( pearl earl pearly peely )) { if ($word =~ $test) { print "$word is close enough to 'perl'\n"; } else { print "$word isn't enough like 'perl'\n"; } }` [download] This one only reports pearl and earl, since peely didn't keep the 'r'. After a successful match, by the way, you can inspect `$^R` to see how many modifications, insertions, and deletions were necessary: `if ($word =~ $test) { my ($m_used, $m_allowed) = @{ $^R->[0] }; my ($i_used, $i_allowed) = @{ $^R->[1] }; my ($d_used, $d_allowed) = @{ $^R->[2] }; print "Using $m_used mods, $i_used inserts, and $d_used dels, $word +matched\n"; }` [download] Jeff `japhy` Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and `perl` hacker How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart	[reply] [d/l] [select]
Re: Splitting strings into words when there are no separators by punch_card_don (Curate) on Sep 14, 2005 at 14:40 UTC
Possible line of research: If you Google "truckwheel", it replies "Did you mean 'truck wheel'?" So clearly they have something like this already working. Form Google: (http://www.googleguide.com/spelling_corrections.html) In just the first few months on the job, Google engineer Noam Shazeer developed a spelling correction (suggestion) system based on what other users have entered. The system automatically checks whether you are using the most common spelling of each word in your query. Apparently it's not a trivial task. Forget that fear of gravity, Get a little savagery in your life.	[reply]
Re^2: Splitting strings into words when there are no separators by Anonymous Monk on Sep 16, 2005 at 03:27 UTC
If you are not doing this frequently, you can use Google's API and make google do it for you. -Dash.	[reply]
Re: Splitting strings into words when there are no separators by svenXY (Deacon) on Sep 14, 2005 at 14:15 UTC
Do you have any sample code (even if it does not work)? Can you please provide examples of how the string looks like and in what pieces you would like to split it? svenXY	[reply]
Re: Splitting strings into words when there are no separators by jonadab (Parson) on Sep 14, 2005 at 16:56 UTC
update: Oops! I posted this, clearly, in the wrong thread. I intended to post it in the thread about which other languages are good to learn along with Perl. Sorry for any confusion. My first suggestion would be lisp. It's a very nice language, similar enough to Perl that if you like one you may like the other, but different enough from Perl to stretch the way you think about programming just a little. Lisp, like Perl, is a general-purpose programming language. Another possibility is Inform. Inform is not a general-purpose language, but is targeted toward a fairly narrow (but very interesting) problem domain. It's Turing equivalent, of course, and quite flexible, but really geared toward a specific type of program. The advantages of Inform are twofold. First, it will teach you to really appreciate the object-oriented paradigm, because the object model in Inform is pretty advanced (WAY beyond the Perl5 one), and a very good fit for the intended problem domain. Second, Inform has a really excellent book, the Designer's Manual, which is without qualification the best computer-related book I have ever seen. (It's also available online (free of charge), as well as in print.) Inform will change the way you think about object-oriented programming, guaranteed. Oh, and the intended problem domain is one that immediately captures the imagination, so it's fun to program in.	[reply]


Just another Perl shrine
	PerlMonks