PerlMonks  

Re: Junk NOT words

by bart (Canon)
on Oct 31, 2002 at 10:39 UTC ( [id://209379]=note )


in reply to Junk NOT words

Focus on the vowels. There are always fewer vowels than consonants, especially in junk, and all (English) words need at least one. So pick the next vowel, and see if you can make at least one word of it if you take a few letters in front and after it as well. Or: is this snippet of a few letters a substring of an existing word?
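The vowel-first idea above can be sketched roughly as follows. This is only an illustration in Python, not code from the original post: the word list, the `reach` parameter and the function name are all assumptions, and a real run would load a proper dictionary file.

```python
VOWELS = set("aeiou")  # assumption: "y" is treated as a consonant here

# Hypothetical tiny word list; a real run would load a dictionary file.
WORDS = {"a", "real", "he", "her", "the"}

def candidates_around_vowels(text, words, reach=3):
    """For every vowel in `text`, collect the dictionary words that cover it,
    looking at most `reach` letters before and after the vowel."""
    found = set()
    for i, ch in enumerate(text):
        if ch not in VOWELS:
            continue
        for start in range(max(0, i - reach), i + 1):
            # `end` is exclusive, so the slice always contains position i.
            for end in range(i + 1, min(len(text), i + reach + 1) + 1):
                piece = text[start:end]
                if piece in words:
                    found.add((start, piece))
    return sorted(found)

print(candidates_around_vowels("thereal", WORDS))
```

Each hit is only a candidate: the position is kept alongside the word so that a later step can check whether the candidates tile the whole string.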

For many letters, that will be the case. For example, in your phrase, I can recognize "a", "real", "he", "her", "age". So don't throw away too many candidates upfront: none of these words here are the words of which the phrase actually consists. It's not because you found a word that it is the word. Note that you can hardly find any words at all in the junk string. So this one is quickly dismissed.

Well, eventually, you'll have them all fit the whole string, so you have only words and no excess letters between them. (Er, what do you do with spelling errors?) So if you do find a likely substring around a vowel, concentrate on the next vowel that isn't part of this word. You should find two words that fit both sequences with no letters between them. If you can't, for any of the candidate groups around each vowel, the string must be junk.
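The "only words and no excess letters" requirement amounts to splitting the string completely into dictionary words. A minimal sketch in Python follows, with one loud caveat: it swaps in plain left-to-right backtracking instead of the vowel-anchored search described above, and the function name and word lists are hypothetical.

```python
from functools import lru_cache

def segment(text, words):
    """Return one way to split `text` entirely into dictionary words,
    or None if no such split exists (i.e. the string looks like junk)."""
    max_len = max(map(len, words), default=0)

    @lru_cache(maxsize=None)  # remember dead ends so they are not retried
    def solve(pos):
        if pos == len(text):
            return ()
        # Try the longest candidate first, then progressively shorter ones.
        for end in range(min(len(text), pos + max_len), pos, -1):
            piece = text[pos:end]
            if piece in words:
                rest = solve(end)
                if rest is not None:
                    return (piece,) + rest
        return None  # nothing fits here: the caller backtracks

    result = solve(0)
    return list(result) if result is not None else None

print(segment("hereal", {"he", "her", "real"}))
```

In the example, "her" is tried first, leaves "eal" unmatched, and is abandoned in favour of "he" followed by "real". Finding a word is not the same as finding the right word.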

Well, HTH.

Replies are listed 'Best First'.
Re: Re: Junk NOT words
by false (Novice) on Oct 31, 2002 at 15:17 UTC

      There are always fewer vowels than consonants, especially in junk, and all (English) words need at least one.

    'Rhythms' doesn't have a vowel. ;)

      a e i o u and sometimes y.
      I remember hearing once that the only word in the English dictionary that doesn't have a vowel is 'nth'.
      bizzach
Re: Re: Junk NOT words
by John M. Dlugosz (Monsignor) on Oct 31, 2002 at 22:14 UTC
    Always? You bait us.

    I recall the word "strength", which has a lone vowel and, count 'em, 7 consonants!

    I have a word file I made by stripping the entry tags from (and massaging a bit) the dictionary on Project Gutenberg. I suppose I could scan that to see what a histogram of the actual proportion is, if I were so inclined.
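    Such a scan might look roughly like the Python sketch below. It is not from the thread: the function name and the ten-bucket layout are assumptions, "y" is counted as a consonant, and the four sample words stand in for the real word file.

```python
from collections import Counter

VOWELS = set("aeiou")  # assumption: "y" counts as a consonant

def vowel_ratio_histogram(words, bins=10):
    """Count words per bucket of vowel proportion (vowels / letters).
    Bucket 0 covers [0.0, 0.1), bucket 1 covers [0.1, 0.2), and so on;
    a ratio of exactly 1.0 is folded into the last bucket."""
    hist = Counter()
    for word in words:
        letters = [c for c in word.lower() if c.isalpha()]
        if not letters:
            continue  # skip blank or non-alphabetic entries
        ratio = sum(c in VOWELS for c in letters) / len(letters)
        hist[min(int(ratio * bins), bins - 1)] += 1
    return hist

sample = ["strength", "nth", "a", "area"]  # stand-in for a real word file
for bucket, count in sorted(vowel_ratio_histogram(sample).items()):
    print(f"{bucket / 10:.1f}+ : {count}")
```

    Any word that lands above the 0.5 mark is a counterexample to "always fewer vowels than consonants".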

Re: Re: Junk NOT words
by seattlejohn (Deacon) on Nov 01, 2002 at 01:45 UTC
    all (English) words need at least one (vowel)

    The vast majority, but not all. Two counterexamples that come to mind offhand are "cwm" (meaning cirque) and "nth" (first, second, ..., nth). Those are both uncommon enough that they might never occur in the data in question, but we don't really know enough about the problem to make that assumption.

    Update: Sorry, I didn't notice at first that bizzach had already mentioned "nth" above. No plagiarism intended ;-)

            $perlmonks{seattlejohn} = 'John Clyman';

Re: Re: Junk NOT words
by BrowserUk (Patriarch) on Nov 01, 2002 at 03:33 UTC

    There are also a whole host of places in English stripped of punctuation where non-vowel-containing "words" would crop up. E.g. Mr Mrs Dr Jnr etc.


    Nah! You're thinking of Simon Templar, originally played by Roger Moore and later by Ian Ogilvy
      OK, OK, I concur! So there are a few sequences of consonants with no vowel that can be considered a word. However, there aren't many. The whole mechanism can remain the same, except that, if you're left with a string of consonants between words, that doesn't immediately mean failure. Now you'll have to do an additional check to see if such a string consists of a sequence of these exceptions. I think that they are such a small minority that, for speed, it must be worth extracting them all from a dictionary and storing them separately in a data file, before you even start.
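      A rough Python sketch of that extra check, under stated assumptions: both function names are made up for illustration, "y" is treated as a consonant, and the word list is a hypothetical sample rather than a real dictionary.

```python
VOWELS = set("aeiou")  # assumption: "y" is treated as a consonant

def vowelless_words(words):
    """Pull the vowel-free entries out of a word list, once, up front."""
    return {w for w in words if w and not (set(w.lower()) & VOWELS)}

def run_is_exceptions(run, exceptions):
    """True if a leftover consonant run splits entirely into exception words."""
    # reachable[k] means: the first k letters of `run` can be covered exactly.
    reachable = [True] + [False] * len(run)
    for end in range(1, len(run) + 1):
        reachable[end] = any(
            reachable[start] and run[start:end] in exceptions
            for start in range(end)
        )
    return reachable[len(run)]

# Hypothetical sample dictionary mixing ordinary words and exceptions.
dictionary = {"nth", "cwm", "mr", "mrs", "dr", "real", "age"}
exceptions = vowelless_words(dictionary)
print(sorted(exceptions))
print(run_is_exceptions("mrsdr", exceptions))
```

      The exception set could just as well be written out one word per line and loaded at startup, so the full dictionary only has to be scanned once.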

      Am I alone in feeling that the whole search system, as I proposed it, is very similar to how a regex may try to match a pattern, in a "penny machine"? Pick a candidate, try every possibility with it in turn, backtrack...

        Sorry Bart. I read your original post in isolation of the full thread and hadn't realised that I was repeating what others had already said.

        If you've seen my attempt at this at Re: Junk NOT words you'll have seen that my word list manages to match just about anything with one or two characters as a word. I decided to go through the 1 & 2 char entries by hand and remove those that were nonsensical, but discovered to my surprise that many more of them are valid in some contexts than you might suppose.

        For instance, 'x' - Outside of math or computing this doesn't seem like a valid word, but I ran across two uses in a scan of the correspondence that I have sent and received. The first was in the phrase "X marks the spot", the second in an email from my sister signed "x. jj".

        In other notes this became "xx. jj" and "xxx. jj". I guess I'm more loveable at some times than others. 'jj' are her first 2 initials BTW, so that meant that had to stay. Ah! 'BTW', there's another one. And so it went on. I found it extremely difficult to remove any of the single chars or many of the digraphs, as I could, without much effort, find (or think of) legitimate cases where they could crop up in 'normal' correspondence.

        I wasn't jumping on the bandwagon with this, just reflecting my own, somewhat surprising discovery.


