PerlMonks  

Questions: how to exclude substring having Evil meanings

by lihao (Monk)
on Dec 02, 2009 at 16:54 UTC

lihao has asked for the wisdom of the Perl Monks concerning the following question:

Hi, monks

I have a list which contains about 5M records, and to uniquely identify them I am using a 7-char CODE system. These codes are visible to users, so it's important that the CODEs do NOT contain any substrings with evil meanings, e.g. K8KKKL9, 1GTSH1T, etc. In other words, I am trying to prevent (or at least limit, if prevention is not possible) such evil English words from showing up in my CODE system.

My question is: are there any Perl modules that handle a similar situation? Or any web links or search terms that would help me analyze it?

thanks in advance

lihao

UPDATE: Sorry, guys, I did NOT describe my question well in my original post. Here are the points I forgot:

  • the 7-char code system has: the first two characters identify the event, and the remaining 5 chars cover the 5M UNIQUE records
  • I've actually removed all vowels (a, o, i, e, u, case-insensitive) from the CODE system. It's just that some people do NOT want to see a string like KKK, which I have not yet covered
  • The original requirement from my boss is that the CODE should be as short as possible
  • Also, we need to keep the codes OCR-discernible and add a certain amount of randomness to the generated CODE
  • the list is not fixed, and we have to expect growth over the next few years

Thanks again to all who replied

lihao

Replies are listed 'Best First'.
Re: Questions: how to exclude substring having Evil meanings
by roboticus (Chancellor) on Dec 02, 2009 at 17:02 UTC
    lihao:

    I've never had to do such a thing, but I'd suggest simply removing all vowels from the alphabet you use to generate the codes. That way there wouldn't be anything pronounceable. It's certainly a lot simpler than worrying about stop word lists and the attendant worries of missing some foul words in other languages...

    ...roboticus

    Update: At least I hope that no languages have pronounceable words without vowels in the standard ASCII set.

      At least until someone gets in a huff about getting a code SH1T455 or 455H0L3 or . . . :)

      There's always Regexp::Common::profanity (and variants specialized for other locales and languages), but you'd (the OP, that is) want to examine them to make sure they handle everything you're worried about.

      Update: Or another idea: just use a hex representation of the record number (which'll be longer than 7 chars, but you'll only have to worry about offending Hindus or vegans with DEADBEEF :).

      The cake is a lie.
      The cake is a lie.
      The cake is a lie.

        Too true ... I think the problem is somewhat artificial, as people can find all sorts of things to be unreasonable about. ;^)

        Sure, we could omit digits that look like vowels from the alphabet (0, 1, 3, 4), but then someone'll complain about stuff when you turn the code upside down or some such.

        ...roboticus
        ... you'll only have to worry about ...  DEADBEEF ...

        I once had someone complain to me – only semi-facetiously – about the presence of 666 in some identifier or other. And then you have to worry about 13 (Europe/NA), 4 (China), etc., etc.

      At least I hope that no languages have pronounceable words without vowels in the standard ASCII set.

      Words can still be quite readable when removing letters. Look at personalized license plates for an example. We're trained to see words in letters even when there aren't any.

        ikegami:

        True, but if you're going to try to prevent people from reading their own meaning into things, we may as well abandon all forms of communication.

        ...roboticus
      Tsk tsk tsk, hmm, shh, and the rare cwm, to name a few.

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

Re: Questions: how to exclude substring having Evil meanings
by Utilitarian (Vicar) on Dec 02, 2009 at 17:30 UTC
Re: Questions: how to exclude substring having Evil meanings
by roboticus (Chancellor) on Dec 02, 2009 at 17:24 UTC
    lihao:

    Well, with 7 characters and 5 million records, you don't need much of a character set. You could get by with the digits 0-9 and wouldn't have to revisit the issue until you double the table size. Adding just a few more symbols would greatly increase the space to work in, so Fletch's idea of using hex digits would be a good way to go. (That'd give you over 250 million records before you need to expand your character set.)

    Personally, though, I think I'd just use upper-case letters without vowels. That gives you over a billion records, and leaves out the I/1, 0/O/Q, etc. confusions...
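
A minimal sketch of the consonant-only idea, for illustration. The alphabet (the 21 upper-case letters left after dropping A, E, I, O, U), the helper name `encode_code`, and the fixed width of 7 are assumptions, not something taken from the OP's system. It also prints the capacities quoted above (10^7, 16^7, 21^7).

```perl
use strict;
use warnings;

# Assumed alphabet: the 21 upper-case consonants (no A, E, I, O, U).
my @ALPHABET = grep { !/[AEIOU]/ } 'A' .. 'Z';

# Encode a non-negative integer as a fixed-width code in that alphabet.
# encode_code is a hypothetical helper name.
sub encode_code {
    my ($n, $width) = @_;
    my $base = @ALPHABET;
    my $code = '';
    for (1 .. $width) {
        $code = $ALPHABET[ $n % $base ] . $code;
        $n    = int($n / $base);
    }
    die "record number does not fit in $width characters\n" if $n > 0;
    return $code;
}

# Number of distinct 7-character codes for each alphabet discussed here.
my %capacity = (
    digits     => 10**7,
    hex        => 16**7,
    consonants => scalar(@ALPHABET)**7,
);

print encode_code(4_999_999, 7), "\n";
printf "%-10s %d\n", $_, $capacity{$_} for sort keys %capacity;
```

Dropping the vowels also drops I and O, which is where the I/1 and 0/O confusions go away; Q and the digits would still need a separate decision.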

    ...roboticus
Re: Questions: how to exclude substring having Evil meanings
by tirwhan (Abbot) on Dec 02, 2009 at 17:05 UTC
    use NoEvil;
    ?

    Losing the sarcasm, the simplest thing I can think of is don't use vowels in your codes. Of course that then leaves you in the situation of having leetspeakisms such as 3v1l and sh1t appear in your codes, so you'd need to exclude a lot of numbers as well. Or maybe only show the codes to users who aren't so incredibly prissy?
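
To illustrate the leetspeak point, here is a rough sketch of a filter that maps look-alike digits back to letters before checking a blocklist. The blocklist contents, the digit-to-letter mapping, and the names `normalize` and `is_offensive` are all assumptions for the example; a real list would be longer and would need human review.

```perl
use strict;
use warnings;

# Hypothetical blocklist; only a few entries to keep the example short.
my @BLOCKED = qw(KKK SHIT ASS);

# Map look-alike digits to the letters they resemble (an assumed mapping):
# 0->O, 1->I, 3->E, 4->A, 5->S, 7->T.
sub normalize {
    my ($code) = @_;
    (my $plain = uc $code) =~ tr/013457/OIEAST/;
    return $plain;
}

# True if the normalized code contains any blocked substring.
sub is_offensive {
    my ($code) = @_;
    my $plain = normalize($code);
    return scalar grep { index($plain, $_) >= 0 } @BLOCKED;
}

for my $code (qw(K8KKKL9 1GTSH1T BCDFGHJ)) {
    printf "%s %s\n", $code, is_offensive($code) ? 'rejected' : 'ok';
}
```

A generator could simply skip any candidate for which `is_offensive` is true and move on to the next one, at the cost of slightly reducing the usable code space.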


    All dogma is stupid.
Re: Questions: how to exclude substring having Evil meanings
by ikegami (Patriarch) on Dec 02, 2009 at 17:17 UTC
    Effectively, you have a base 36 number. You could switch the number to base 10 (11 digits if you use the full range) or hex (10 digits if you use the full range, 9 covers most) when showing it to the user, although that might require you to do some changes to allow both formats in some inputs. Bonus: it avoids confusing 1/l and 0/O.
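
A small sketch of that conversion, assuming the code uses the full 0-9A-Z alphabet and that Perl was built with 64-bit integers (36^7 does not fit in 32 bits). The helper name `base36_to_int` is made up for the example.

```perl
use strict;
use warnings;

# Decode a base-36 code (digits 0-9, then letters A-Z) into an integer.
# Assumes a Perl with 64-bit integers; base36_to_int is a hypothetical name.
sub base36_to_int {
    my ($code) = @_;
    my $n = 0;
    for my $ch (split //, uc $code) {
        die "invalid character '$ch'\n" unless $ch =~ /\A[0-9A-Z]\z/;
        my $digit = $ch =~ /[0-9]/ ? $ch : ord($ch) - ord('A') + 10;
        $n = $n * 36 + $digit;
    }
    return $n;
}

my $max = base36_to_int('ZZZZZZZ');    # largest 7-character code
printf "decimal: %d (%d digits)\n", $max, length sprintf('%d', $max);
printf "hex:     %X (%d digits)\n", $max, length sprintf('%X', $max);
```

The printed lengths are where the "11 digits" and "10 digits" figures for the full range come from.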
Re: Questions: how to exclude substring having Evil meanings
by mikelieman (Friar) on Dec 02, 2009 at 17:20 UTC
    7 chars, eh?
    N NNN NNN
    If you just use the digits 0-9, you're set.
    It's just a coincidence that your code is the set of integers from 1 - 5,000,000
      Where did you get 5,000,000 from? 36^7 is significantly larger: 78,364,164,096
        I would imagine from the OP:
        I have a list which contains about 5M records...
        But it has been updated to add that the list may grow substantially over time, so a solution optimized for 5M items still wouldn't be the Right Answer.
Re: Questions: how to exclude substring having Evil meanings
by ikegami (Patriarch) on Dec 02, 2009 at 22:51 UTC

    The original requirement from my boss is that the CODE should be as short as possible

    Given your new specs, what I suggested is just one char longer than what you have now.

    xxyyyyyy
    xx = Two alphanums. Event type. Supports 1,296 event types.
    yyyyyy = Six hex digits. Record num. Supports 16,777,215 records.

    Advantages:

    • No bad words.
      • Without having to build a dictionary of bad words.
      • Without having to maintain a dictionary of bad words.
      • Without spending computation time going looking for bad words.
    • Short identifier.
    • Very easy to compute.
    • Very cheap to compute.
    • OCR-friendly.

    You can easily and cheaply substitute [0-9A-F] for another set of characters if you have OCR or bad words issues.

    Do you really need to support that many event types? If not, you could probably shorten the id to six digits.
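
The xxyyyyyy layout could be sketched roughly as follows. The function names `make_code`, `parse_code`, and `relabel`, and the 16 replacement symbols used in `relabel`, are assumptions for illustration only.

```perl
use strict;
use warnings;

# Build an 8-character code: two-character event type plus six hex digits.
sub make_code {
    my ($event, $record) = @_;
    die "event type must be two alphanumerics\n"
        unless $event =~ /\A[0-9A-Z]{2}\z/;
    die "record number out of range\n"
        unless $record =~ /\A[0-9]+\z/ && $record <= 0xFFFFFF;
    return sprintf '%s%06X', $event, $record;
}

# Split a code back into its event type and record number.
sub parse_code {
    my ($code) = @_;
    my ($event, $hex) = $code =~ /\A([0-9A-Z]{2})([0-9A-F]{6})\z/
        or die "malformed code '$code'\n";
    return ($event, hex $hex);
}

# Optional: swap the hex digits of the record part for another
# 16-symbol set (a hypothetical choice of consonants).
sub relabel {
    my ($code) = @_;
    my $event = substr $code, 0, 2;
    (my $rec = substr $code, 2) =~ tr/0-9A-F/BCDFGHJKLMNPQRST/;
    return $event . $rec;
}

my $code = make_code('K8', 5_000_000);
print "$code\n";
print relabel($code), "\n";
```

Since the record part is a plain counter rendered with `sprintf`, no dictionary lookup is needed at generation time; `relabel` only changes how the same 16 values are displayed.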

Re: Questions: how to exclude substring having Evil meanings
by leocharre (Priest) on Dec 02, 2009 at 20:58 UTC

    This sounds like an impossible and petty set of requirements. Isn't the remote chance of these evil codes part of the fun? If I were developing this and my boss made such a stringent requirement (I must be a poor employee), I think I'd throw a fit and cry like a chimp on coke.

    I mean... 5m records.. this sounds like some serious work.

Re: Questions: how to exclude substring having Evil meanings
by JavaFan (Canon) on Dec 02, 2009 at 23:34 UTC
    Perhaps you should use codes that don't contain letters? With Unicode, that still gives you thousands of possibilities for each position. That way, you satisfy both conditions at once: the codes will be short (shorter than ASCII-only codes), and they're unlikely to offend anyone.

      There are limitations and drawbacks.

      • It's hard to name most characters. "I have a problem with invoice latin-small-letter-a-with-dot-above-and-macron;devanagari-letter-vocalic-r;left-right-white-arrow." (ǡऋ⬄)
      • There are font problems ("I have a problem with invoice box-box-box.")
      • Encoding problems are still common too.

      What symbols are you suggesting?

      • You'd need a set of 22 chars to maintain a record num of 5 chars. (22^5 = 5,153,632).
      • You'd need a set of 48 chars to bring the record num down to 4 chars. (48^4 = 5,308,416).
      • You'd need a set of 171 chars to bring the record num down to 3 chars. (171^3 = 5,000,211).
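
The three figures in the list above can be checked with a short loop. This is only a sketch; `min_alphabet_size` is a made-up helper name, and 5,000,000 is the record count from the original question.

```perl
use strict;
use warnings;

# Smallest alphabet size whose $width-character codes cover $records values.
sub min_alphabet_size {
    my ($records, $width) = @_;
    my $size = 1;
    $size++ while $size**$width < $records;
    return $size;
}

for my $width (5, 4, 3) {
    my $size = min_alphabet_size(5_000_000, $width);
    printf "%d chars: %d symbols (%d codes)\n", $width, $size, $size**$width;
}
```

A linear search is used instead of taking a fractional root so that rounding in floating-point arithmetic cannot push the answer off by one.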

      I suppose you could use the horizontal dominoes. Each domino can be read as two digits from 0 to 6. For example, this is node 🀷🁜🁑🀵 (06:61:44:04 in base 7).

      Using dominoes would reduce the record num to 4 chars ((7^2)^4 = 49^4 = 5,764,801) assuming you didn't want the sequence to be a legal domino sequence. Both the UTF-8 and the UTF-16 encoding of 4 dominoes would take 16 bytes. (UTF-32 too, for what it's worth.)
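
For the curious, the domino encoding can be sketched like this. It relies on the Unicode layout in which U+1F031 is the horizontal 0-0 tile and the remaining 48 horizontal tiles follow in order up to 6-6. The names `to_dominoes` and `domino_digits` are hypothetical, and 810709 is simply the value that the digits 06:61:44:04 work out to in base 7.

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';

# U+1F031 is the horizontal 0-0 tile; the tiles through 6-6 follow in order.
my $FIRST_TILE = 0x1F031;

# Encode a non-negative integer as a string of $width horizontal dominoes,
# i.e. as $width base-49 digits.
sub to_dominoes {
    my ($n, $width) = @_;
    my $out = '';
    for (1 .. $width) {
        $out = chr($FIRST_TILE + $n % 49) . $out;
        $n   = int($n / 49);
    }
    die "number does not fit in $width dominoes\n" if $n > 0;
    return $out;
}

# Read the tiles back as colon-separated pairs of base-7 digits.
sub domino_digits {
    my ($tiles) = @_;
    return join ':', map {
        my $v = ord($_) - $FIRST_TILE;
        int($v / 7) . $v % 7;
    } split //, $tiles;
}

my $tiles = to_dominoes(810_709, 4);
print "$tiles (", domino_digits($tiles), ")\n";
```

This treats every tile as legal in every position, so it does not address the "legal domino sequence" variant.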

      Update: Added everything after the question.

        [...] assuming you didn't want the sequence to be a legal domino sequence.
        What if it had to be legal sequences?


        holli

        You can lead your users to water, but alas, you cannot drown them.
        I suppose you could use the horizontal dominoes. Each domino can be read as two digits from 0 to 6.
        This is brilliant.
        $,=qq.\n.;print q.\/\/____\/.,q./\ \ / / \\.,q.    /_/__.,q..

Node Type: perlquestion [id://810605]
Approved by Corion
Front-paged by redgreen