http://qs321.pair.com?node_id=264617

qq has asked for the wisdom of the Perl Monks concerning the following question:

Hi Reverend Monks:

I want to determine a bi level sort order from a given set of already sorted records for use with Sort::ArbBiLex. (Well, actually a drop in replacement that I've written.)

The existing records are first name, last name have escaped entity sequences in them in the form: é.

I think this might be an interesting problem - but excuse the long post.

Sort::ArbBiLex has a very well written explanation of what bi level sorting is, and why you have to do it.

Decleration format and usage is like:

use Sort::ArbBiLex ( 'fulani_sort', # ask for a &fulani_sort to be defined "a A c C ch Ch CH ch' Ch' CH' e E l L lh Lh LH n N r R s S u U z Z " ); @words = <>; @stuff = fulani_sort(@words); foreach (@stuff) { print "<$_>\n" }

The decleration form above is just a shorter way of writing an AoA. Its that decleration, or at least a good approximation, that I'm hoping to extract from existing records (see DATA below).

The resulting decleration does not have to be perfect, it will be checked by hand.

Solution:

Well, if I knew I wouldn't be writing this.

First step, I believe, is to extract a list of glyghs from the data. Since glyghs can be variable widths, you can't do this entirely automatically. For my purposes, since all glyghs longer than one character are easily matched, this is easy. Put them as keys to a hash to remove duplicates (%glyghs below).

Next step I'm lost. I think something like:

  1. build a list of first glyphs of last name
  2. build a list af second glyphs of last name
  3. repeat, giving:
  4. A, B, D L, N, R, A, E B, C, D, R, B, I, A, &rsquo;, C E, A, R, O, I, N, R, A, C
  5. do something clever

but some glyphs are 'equal' on the first level sort. Ie we want an AoA, not just an array. And some glyphs might be invisible to the sort order - ep whitespace or '&rsquo';

#/usr/bin/perl -w use strict; my %glyphs; my @words; while ( <DATA> ) { push @words, split /,\s*/; } @glyphs{ map { /(&.+?;)|(.)/g } @words } = 1; print join( "\n", keys %glyphs); __DATA__ ALBEVERIO MANZONI, Solvejg ALCAL&Aacute;, Kathleen ANDR&Eacute;E, Alice ARROYO-GOMEZ, Mario Vernon BABINEAU, Jean Joseph BAINBRIDGE, Dame Beryl (Margaret) BAINBRIDGE, Cyril DEARDEN, James Shackley DE&rsquo;ATH, Richard DECKER, Donna

Thanks for you're ideas, qq

Replies are listed 'Best First'.
Re: Reverse Engineering Sort Order
by BrowserUk (Patriarch) on Jun 10, 2003 at 11:58 UTC

    Thanks for you're ideas

    I've read your post 4 times now, I'm afraid I'm still not sure quite what your question is?

    I think a simple example of "given this input, I would like a routine that produces this output" would be a good starting point.

    Oh, and sometimes, less is more:)


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


      I did this rereading excercise too - but eventually I think I know what he needs. He has a sorted list of records and he want's to determine what kind of sort was used to sort it.

      More precisely he has a definition of a family of sorting functions, and the result of applying one of those functions to some data record. He want's to determine what was that function.

      Perhaps he want's an algorithm that given the sorted list would output the definition of the sorting function.

        Thats about it. The original data is hand sorted. I want to know what rules, or rather character order, was used to sort it.

      less is more

      Sorry about that - I didn't want to put a hard problem before you without showing that I had at least begun to think about it.

      So, given the input in the original posts __DATA__, I'm trying to extract a decleration for use with Sort::ArbBiLex, so that, if the original records lose their ordering, I could resort them.

        I think the problem was more mine than yours. zby managed to work out what you were asking, as evidenced by

        Perhaps he want's an algorithm that given the sorted list would output the definition of the sorting function.

        But no matter how many times I read it, I didn't get your drift.

        However, since then I have given the problem some pretty serious thunking, and I'm really not sure that what your asking is possible. Leastwise, not if your after complete accuracy unless you have a very large sample set that reflects all the possible comparisons. Even then, there is another, trickier problem.

        Lets assume somewhere in your dataset you had

        cuerno chile

        According to the POD for Sort::ArbBiLex, in a spanish dictionary, cuerno sorts before chile, because the 'ch' in chile, is considered a "single glyph" that sorts after the single glyph 'c' in cuerno.

        The problem is, whos to says that there isn't a language where the second glyph in cuerno is 'ue', and that this is designated as sorting before the second glyph in chile, 'h'?

        I simply cannot see any mechanistic way of making the determination of when 2 character codes represent a single glyph. It strikes me that the only way to do this is heuristically, and that means that you would have to already know what pairs consituted single glyphs up front.

        I don't envy you the problem, but if you do solve it, I'd be very interested to see how you do it.

        Good luck:)


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


Re: Reverse Engineering Sort Order
by zakb (Pilgrim) on Jun 10, 2003 at 12:04 UTC

    I agree with BrowserUk, I'm not sure what your question is.

    I would like to point out:

      HTML::Entities does do the encoding/decoding (and I think I will use it for a related task). However it can't help with the more general problem of determining a bi level sort decleration from a given set of sorted records.

      perllocale doesn't, I think, give me a solution. The records I'm concerned with are truly international, and won't correspond to any one locale.

      thanks for the pointers.