http://qs321.pair.com?node_id=441430

qq has asked for the wisdom of the Perl Monks concerning the following question:

I'm maintaining a web page that organizes a list of items into ranges based on first letter: 0-9, A-E, F-H, etc.

The existing code makes no provision for non-ASCII characters and silently skips any that do not match the current character class: m/^[A-Fa-f]/. The expected input range will be Latin-1, but it would be nice to have a place for other characters if they come up.

After reading this thread, and googling, the best option seems to be to use Text::Unidecode to "convert" unicode to ascii before using ascii regexes. This has the advantage of being quick, simple, and ensuring that all items will fall under some category.

But this seems like a common problem, so how have others approached it?

tia, qq

update: added regex snippet for clarity. And typos.

Replies are listed 'Best First'.
Re: unicode [A-F] equivalent?
by ambrus (Abbot) on Mar 22, 2005 at 19:23 UTC

    If you use locale, then the string comparison operators (and the default sort block too) will sort by the current locale, as given by LC_COLLATE. For example, this code

    LC_ALL=hu_HU perl -wle 'use locale; print join " ", sort "op\xf3\xf5\xf6"=~/\S/g;'
    gives the correct order of the letters: o ó ö ő p.

    Thus, instead of a regular expression such as m/^[A-Fa-f]/, you could use a comparison like $_ ge "g". I don't know how you can do this with regular expressions.

    (Updated one typo.)
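A rough sketch of the comparison idea, for anyone who wants to try it without depending on which system locales are installed. Note that this swaps in the core Unicode::Collate module instead of `use locale`, so it follows the default Unicode collation order rather than any one language's rules (no Hungarian tailoring, for instance). The range table and the name `range_for` are made up for illustration, and the input is assumed to be already-decoded character strings.

```perl
use strict;
use warnings;
use Unicode::Collate;

# level => 1 compares base letters only, ignoring accents and case.
my $collator = Unicode::Collate->new(level => 1);

# Hypothetical range table: [label, first letter, last letter].
my @ranges = (
    [ 'A-E', 'a', 'e' ],
    [ 'F-H', 'f', 'h' ],
    [ 'I-Z', 'i', 'z' ],
);

# Return the label of the range whose bounds enclose the item's
# first character, or 'Other' if no range matches.
sub range_for {
    my ($item) = @_;
    my $first = substr($item, 0, 1);
    for my $r (@ranges) {
        my ($label, $lo, $hi) = @$r;
        return $label
            if $collator->ge($first, $lo) && $collator->le($first, $hi);
    }
    return 'Other';
}
```

With this table, `range_for("\x{c9}mission")` (a leading É) should land in 'A-E', and anything that collates outside a..z falls through to 'Other'.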

      This looks like the most 'correct' solution, thanks.

Re: unicode [A-F] equivalent?
by Anonymous Monk on Mar 22, 2005 at 13:28 UTC
    But this seems like a common problem,
    Really? I don't think the problem is common at all. Texts that require words from multiple scripts are not common, and if they are used, it's typically single words or short phrases that are used, and certainly not indexed.

    I don't think there's a canned solution that works for all. For instance, Chinese doesn't have the notion of "alphabetical" ordering of words - at least, not in the way we are used to in the Western world. If you have a Chinese friend, ask him/her to explain how a Chinese dictionary works. I once did, and that was a learning experience. Your suggested solution will probably work if you have a handful of non-Western words - but does it scale if 70% of your list consists of Chinese and Korean words?

      It would not scale at all well, I agree. Luckily I'm not creating a multi-lingual dictionary, but organizing a list of English-language radio show names. Occasionally an accented character will come through, but anything else will be a surprise.

      I did once work on an international Who's Who book. The ordering of names was "solved" by having romanized equivalents. But it was the editor's job to decide the order, not mine.

        Even the accented letters are a problem. Accented letters often come from Western or Northern European countries, which all use the ISO LATIN-1 alphabet. But while an accented letter may look the same in different countries, it is not the same thing everywhere. In some languages, an accent just means the letter is pronounced differently, but it's still the same letter. In another language, the same accent can turn it into a different letter altogether. And even if you have two languages that use the same accented letter, it doesn't necessarily mean the letters sort the same.

        Which is why we have locales. And which means that whatever solution you will pick - there are people that will be surprised.

        If only we all spoke (and wrote) Egyptian hieroglyphs, we wouldn't have this mess.

Re: unicode [A-F] equivalent?
by ysth (Canon) on Mar 22, 2005 at 18:29 UTC
    A quick search showed Diacritic-Insensitive and Case-Insensitive Sorting, which may help you. I would go ahead and lump accented characters in with their non-accented versions (my reply in that thread shows one way to convert them) and have a separate category for anything that doesn't convert into /^[A-Za-z]/.
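For illustration, here is a minimal sketch of that lumping approach. It is not the conversion from the linked thread: it assumes the core Unicode::Normalize module, decomposes each title, and strips the combining marks before applying the ordinary ASCII character classes. The range table and the name `bucket_for` are hypothetical, and the input is assumed to be decoded character strings, not raw bytes.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical range table; labels and patterns are illustrative.
my @ranges = (
    [ '0-9' => qr/^[0-9]/ ],
    [ 'A-E' => qr/^[A-E]/i ],
    [ 'F-H' => qr/^[F-H]/i ],
    [ 'I-Z' => qr/^[I-Z]/i ],
);

# Return the range label for a title, or 'Other' if its first
# character is not an ASCII letter or digit after accent stripping.
sub bucket_for {
    my ($title) = @_;
    # Decompose, then drop combining marks: E-acute becomes plain E.
    (my $plain = NFD($title)) =~ s/\p{Mn}+//g;
    for my $r (@ranges) {
        return $r->[0] if $plain =~ $r->[1];
    }
    return 'Other';
}
```

Letters with no canonical decomposition (Ø, for example) are left as they are and end up in 'Other', so every item still gets a category and nothing is silently dropped.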