Re: Questions: how to exclude substring having Evil meanings
by roboticus (Chancellor) on Dec 02, 2009 at 17:02 UTC
|
lihao:
I've never had to do such a thing, but I'd suggest simply removing all vowels from the alphabet you use to generate the codes. That way there wouldn't be anything pronounceable. It's certainly a lot simpler than maintaining stop-word lists and worrying about missing some foul words in other languages...
...roboticus
Update: At least I hope that no languages have pronounceable words without vowels in the standard ASCII set.
|
At least until someone gets in a huff about getting a code SH1T455 or 455H0L3 or . . . :)
There's always Regexp::Common::profanity (and variants specialized for other locales and languages), but you'd (the OP, that is) want to examine them to make sure they handle everything you're worried about.
Update: Or another idea: just use a hex representation of the record number (which'll be longer than 7 chars, but you'll only have to worry about offending Hindus or vegans with DEADBEEF :).
The cake is a lie.
|
Too true ... I think the problem is somewhat artificial, as people can find all sorts of things to be unreasonable about. ;^)
Sure, we could omit digits that look like vowels from the alphabet (0, 1, 3, 4), but then someone'll complain about stuff when you turn the code upside down or some such.
...roboticus
|
... you'll only have to worry about ... DEADBEEF ...
I once had someone complain to me – only semi-facetiously – about the presence of 666 in some identifier or other. And then you have to worry about 13 (Europe/NA), 4 (China), etc., etc.
|
Tsk tsk tsk, hmm, shh, and the rare cwm, to name a few.
-QM
--
Quantum Mechanics: The dreams stuff is made of
Re: Questions: how to exclude substring having Evil meanings
by Utilitarian (Vicar) on Dec 02, 2009 at 17:30 UTC
Re: Questions: how to exclude substring having Evil meanings
by roboticus (Chancellor) on Dec 02, 2009 at 17:24 UTC
|
lihao:
Well, with 7 characters and 5 million records, you don't need much of a character set. You could get by with the digits 0-9 and wouldn't have to revisit the issue until you double the table size. Adding just a few more symbols would greatly increase the space to work in, so Fletch's idea of using hex digits would be a good way to go. (That'd give you over 250 million records before you need to expand your character set.)
Personally, though, I think I'd just use upper-case letters without vowels. That gives you over a billion records, and leaves out the I/1, 0/O/Q, etc. confusions...
...roboticus
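A minimal sketch of the arithmetic above, in Python for brevity. The names `CONSONANTS`, `capacity` and `encode` are made up for illustration, and it assumes record numbers are simply written out in base 21 and padded to 7 symbols; nothing in the thread prescribes that mapping.

```python
# Hypothetical names; the thread only describes the idea, not an implementation.
CONSONANTS = "BCDFGHJKLMNPQRSTVWXYZ"  # 21 upper-case letters, no A/E/I/O/U
CODE_LENGTH = 7


def capacity(alphabet_size: int, length: int = CODE_LENGTH) -> int:
    """Number of distinct fixed-length codes over an alphabet of the given size."""
    return alphabet_size ** length


def encode(record_num: int, alphabet: str = CONSONANTS, length: int = CODE_LENGTH) -> str:
    """Write record_num in base len(alphabet), left-padded to `length` symbols."""
    base = len(alphabet)
    if not 0 <= record_num < base ** length:
        raise ValueError("record number does not fit in the code length")
    symbols = []
    for _ in range(length):
        record_num, digit = divmod(record_num, base)
        symbols.append(alphabet[digit])
    return "".join(reversed(symbols))


print(capacity(10))  # digits only: 10000000
print(capacity(16))  # hex digits: 268435456
print(capacity(21))  # consonants only: 1801088541
```

Sequential record numbers give sequential-looking codes here; if that matters, the record number would have to be scrambled before encoding, which is a separate design choice.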
Re: Questions: how to exclude substring having Evil meanings
by tirwhan (Abbot) on Dec 02, 2009 at 17:05 UTC
|
use NoEvil; ?
Losing the sarcasm, the simplest thing I can think of is don't use vowels in your codes. Of course that then leaves you in the situation of having leetspeakisms such as 3v1l and sh1t appear in your codes, so you'd need to exclude a lot of numbers as well. Or maybe only show the codes to users who aren't so incredibly prissy?
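To make the leetspeak point concrete, here is a rough sketch (in Python) of what a check would have to do if digits stay in the alphabet. The digit-to-letter table and the one-word blocklist are placeholders I made up; a real list would be far longer, and one digit can stand for several letters (1 for I or L), so a single mapping like this one is incomplete by design.

```python
# Hypothetical mapping and blocklist, for illustration only.
LEET_TO_LETTER = str.maketrans({"0": "O", "1": "I", "3": "E", "4": "A", "5": "S"})
BLOCKLIST = {"EVIL"}  # placeholder entry taken from the example above


def looks_offensive(code: str) -> bool:
    """True if the code contains a blocklisted word after undoing digit look-alikes."""
    normalized = code.upper().translate(LEET_TO_LETTER)
    return any(word in normalized for word in BLOCKLIST)


print(looks_offensive("3v1l"))     # True
print(looks_offensive("BCDFGHJ"))  # False
```

Every generated code would need to pass through a check like this, which is exactly the maintenance burden the no-vowels, no-look-alike-digits approach avoids.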
Re: Questions: how to exclude substring having Evil meanings
by ikegami (Patriarch) on Dec 02, 2009 at 17:17 UTC
|
Effectively, you have a base 36 number. You could switch the number to base 10 (11 digits if you use the full range) or hex (10 digits if you use the full range, 9 covers most) when showing it to the user, although that might require you to do some changes to allow both formats in some inputs. Bonus: it avoids confusing 1/l and 0/O.
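A small sketch of that conversion in Python, assuming the stored code is a 7-character base-36 string using 0-9 and A-Z. The function names are hypothetical; `int(text, 36)` is the standard built-in doing the parsing.

```python
CODE_LENGTH = 7  # assumed length of the stored base-36 code


def to_decimal(code: str) -> str:
    """Show a base-36 code as a plain decimal number."""
    return str(int(code, 36))


def to_hex(code: str) -> str:
    """Show a base-36 code as upper-case hex digits."""
    return format(int(code, 36), "X")


largest = "Z" * CODE_LENGTH      # the biggest 7-character base-36 value
print(len(to_decimal(largest)))  # 11
print(len(to_hex(largest)))      # 10
```

Any input form that accepts both the stored and the displayed format would need to know which base it is reading, since a string like "1234567" is valid in all three.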
Re: Questions: how to exclude substring having Evil meanings
by mikelieman (Friar) on Dec 02, 2009 at 17:20 UTC
|
7 chars, eh?
N NNN NNN
If you just use the digits 0-9, you're set.
It's just a coincidence that your code is the set of integers from 1 - 5,000,000
|
Where did you get 5,000,000 from? 36^7 is significantly larger: 78,364,164,096
|
I would imagine from the OP:
I have a list which contains about 5M records...
But it has been updated to add that the list may grow substantially over time, so a solution optimized for 5M items still wouldn't be the Right Answer.
Re: Questions: how to exclude substring having Evil meanings
by ikegami (Patriarch) on Dec 02, 2009 at 22:51 UTC
|
The original requirement from my boss is that the CODE should be as short as possible
Given your new specs, what I suggested is just one char longer than what you have now.
xxyyyyyy
xx = Two alphanums. Event Type. Supports 1,296 event types.
yyyyyy = Six hex digits. Record num. Supports 16,777,215 records.
Advantages:
- No bad words.
- Without having to build a dictionary of bad words.
- Without having to maintain a dictionary of bad words.
- Without spending computation time going looking for bad words.
- Short identifier.
- Very easy to compute.
- Very cheap to compute.
- OCR-friendly.
You can easily and cheaply substitute [0-9A-F] for another set of characters if you have OCR or bad words issues.
Do you really need to support that many event types? If not, you could probably shorten the id to six digits.
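A minimal sketch of the xxyyyyyy layout in Python. It assumes the event type is kept as an integer and mapped to two base-36 symbols, and that record numbers start at 0; both are my assumptions, and `make_code` / `parse_code` are hypothetical names.

```python
import string

ALPHANUM = string.digits + string.ascii_uppercase  # 36 symbols: 0-9, A-Z


def make_code(event_type: int, record_num: int) -> str:
    """Build an 8-character id: two base-36 symbols, then six hex digits."""
    if not 0 <= event_type < 36 ** 2:
        raise ValueError("event type needs more than two alphanumerics")
    if not 0 <= record_num < 16 ** 6:
        raise ValueError("record number needs more than six hex digits")
    high, low = divmod(event_type, 36)
    return ALPHANUM[high] + ALPHANUM[low] + format(record_num, "06X")


def parse_code(code: str) -> tuple[int, int]:
    """Split an id back into (event_type, record_num)."""
    return int(code[:2], 36), int(code[2:], 16)


print(make_code(0, 255))              # 000000FF
print(parse_code("ZZFFFFFF"))         # (1295, 16777215)
```

Swapping `[0-9A-F]` for another 16-symbol set would only change how the last six characters are rendered, not the arithmetic.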
Re: Questions: how to exclude substring having Evil meanings
by leocharre (Priest) on Dec 02, 2009 at 20:58 UTC
|
This sounds like an impossible and petty set of requirements. Isn't the remote chance of these evil codes part of the fun? If I were developing this and my boss made such a stringent requirement (I must be a poor employee), I think I'd throw a fit and cry like a chimp on coke.
I mean... 5m records.. this sounds like some serious work.
Re: Questions: how to exclude substring having Evil meanings
by JavaFan (Canon) on Dec 02, 2009 at 23:34 UTC
|
Perhaps you should use codes that don't contain letters? With Unicode, that still gives you thousands of possibilities for each position. That way, you satisfy both conditions at once: the codes will be short (shorter than ASCII-only codes), and they're unlikely to offend anyone.
|
There are limitations and drawbacks.
- It's hard to name most characters. "I have a problem with invoice latin-small-letter-a-with-dot-above-and-macron;devanagari-letter-vocalic-r;left-right-white-arrow." (ǡऋ⬄)
- There are font problems ("I have a problem with invoice box-box-box.")
- Encoding problems are still common too.
What symbols are you suggesting?
- You'd need a set of 22 chars to maintain a record num of 5 chars. (22^5 = 5,153,632).
- You'd need a set of 48 chars to bring the record num down to 4 chars. (48^4 = 5,308,416).
- You'd need a set of 171 chars to bring the record num down to 3 chars. (171^3 = 5,000,211).
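Those set sizes can be reproduced with a few lines of Python. This is just a sketch: the 5,000,000 figure is the "about 5M records" from the original question, and `min_alphabet_size` is a name invented here.

```python
def min_alphabet_size(records: int, length: int) -> int:
    """Smallest alphabet whose fixed-length codes can number `records` items."""
    size = 1
    while size ** length < records:
        size += 1
    return size


for length in (5, 4, 3):
    size = min_alphabet_size(5_000_000, length)
    print(length, size, size ** length)
# 5 22 5153632
# 4 48 5308416
# 3 171 5000211
```

Integer exponentiation is used instead of a floating-point root so the boundary cases cannot be off by one.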
I suppose you could use the horizontal dominoes. Each domino can be read as two digits from 0 to 6. For example, this is node 🀷🁜🁑🀵 (06:61:44:04 in base 7).
Using dominoes would reduce the record num to 4 chars ((7^2)^4 = 49^4 = 5,764,801) assuming you didn't want the sequence to be a legal domino sequence. Both the UTF-8 and the UTF-16 encoding of 4 dominoes would take 16 bytes. (UTF-32 too, for what it's worth.)
Update: Added everything after the question.
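For the curious, a sketch of the domino encoding in Python. It assumes each tile is one base-49 digit (left pips times 7 plus right pips), which matches the order of the Unicode horizontal domino tiles starting at U+1F031; the function names are hypothetical. Under this reading the four tiles quoted above decode to 810709.

```python
FIRST_TILE = 0x1F031  # DOMINO TILE HORIZONTAL-00-00
TILE_COUNT = 49       # horizontal tiles 00-00 through 06-06


def to_dominoes(number: int, length: int = 4) -> str:
    """Write `number` in base 49, one horizontal domino tile per digit."""
    if not 0 <= number < TILE_COUNT ** length:
        raise ValueError("number does not fit in that many tiles")
    tiles = []
    for _ in range(length):
        number, digit = divmod(number, TILE_COUNT)
        tiles.append(chr(FIRST_TILE + digit))
    return "".join(reversed(tiles))


def from_dominoes(tiles: str) -> int:
    """Inverse of to_dominoes."""
    number = 0
    for tile in tiles:
        number = number * TILE_COUNT + (ord(tile) - FIRST_TILE)
    return number


code = to_dominoes(810709)
# Byte lengths only, so the example does not depend on the terminal's font or encoding.
print(len(code.encode("utf-8")), len(code.encode("utf-16-be")), len(code.encode("utf-32-be")))
# 16 16 16
```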
|
I suppose you could use the horizontal dominoes. Each domino can be read as two digits from 0 to 6.
This is brilliant.
$,=qq.\n.;print q.\/\/____\/.,q./\ \ / / \\.,q. /_/__.,q..