Re: detecting the language of a word?

by jreades (Friar)
on Dec 06, 2002 at 18:13 UTC (#218116)

in reply to detecting the language of a word?

Many of these technical approaches have merit; however, as one very astute monk pointed out, if the problem were easy (or even automatable), Babelfish wouldn't be so hilariously inadequate.

In terms of approach, I think that you can consider several different lines of attack that would allow you to automate most of the markup, if not all of it:

  1. Identify key foreign words for automatic flagging (can you always assume that the word "email" indicates an email address?). In doing your research, you've probably identified the basic words that should get caught in order to avoid howlers. Work on that list to make sure you're not missing anything obvious.
  2. Look for patterns in foreign word usage. This will require more intuition than anything else, but I would guess, again, that you are beginning to develop a feel for where foreign words are likely to occur. Use automated tools to look for and flag those pages/sections for manual follow-up.
  3. In my very limited experience, I would guess that these types of words will tend to occur in 1) headers, 2) footers, and 3) business and IT terminology. Headers and footers are where you are likely to find contact information, and business and IT terminology tends to be dominated by English (despite the ongoing French crusade to use the word ordinateur).
  4. If you think that you need to mount what is essentially a dictionary attack, then, to my mind, you need to look at ways to streamline the attack. Could you start off by making the (admittedly arbitrary) decision that words of less than five characters are either 1) not in a foreign language, or 2) not significant enough to be worth looking up in a foreign language? This could rapidly reduce the number of lookups that you need to do on any given page.
  5. Or, you could again take a contextual approach and mount a dictionary attack based on words of, say, ten characters or more, working from the assumption that foreign words will occur in clumps and that at least one of those words will be ten or more characters in length. Then you do manual follow-up on the sections flagged as containing such a word. Over time, you could streamline your parser to ignore sections already flagged as containing a foreign language, and gradually reduce the length of the words that you examine for foreign content.

This is a really hard problem; good luck.
