http://qs321.pair.com?node_id=180952

ChOas has asked for the wisdom of the Perl Monks concerning the following question:

A colleague of mine is working on parsing a file, which
is part of a database... The person who wrote this
database has kinda screwed up, because he/she used 1 field
for 3 values. I'll explain:

In The Netherlands, a surname CAN be prepended by (amongst
others) one of the following:
'VAN DER','VAN DE','DEN','DE','VAN' ...
I have no clue what this is called in English, but
I would like to call it a 'prependition' :)

Anyways... The fields in every line are:

NAME<Mandatory> PREPENDITION<maybe> HAVENT_GOT_A_CLUE<maybe>

I wanted to help, and I've tried different ways to parse
this. I only have a little sample data, but I came up
with this:

#!/usr/bin/perl -w
use strict;

my $Prep_Re=join '|',('VAN DER','VAN DE','DEN','DE','VAN');

print "Name\t\tPrependition\t\tWhatever\n";
while (<DATA>)
{
    chomp;
    s/\s{2,}/ /g;
    my ($Name,$Prep,$Unknown);
    ($Name,$Prep,$Unknown)=($`,$1,$') if (/ ($Prep_Re)/);
    ($Name,$Unknown)=($`,$1) if ((!$Name)&&(/ (\S*?$)/));
    $Prep||='';
    print "$Name\t\t$Prep\t\t$Unknown\n";
};

__DATA__
WINTER DE <A240>
ZANDEN VAN DER «A»
JENSEN 230
WOODHEAD <D>
BRINK 130,-
HEYDIER DEN <240>
SMITSER (4X115PJ)
LINDEN VAN DER
MOTEL GOLDEN LEEUW <A225>

It works for my sample data, but I can already think of
cases where it won't work... and... I have a gut feeling
this can be done better (I'm not a regex wiz) ...

Any pointers ?

GreetZ!,

print "profeth still\n" if /bird|devil/;

Replies are listed 'Best First'.
Re: Can this be parsed ?
by dws (Chancellor) on Jul 11, 2002 at 06:23 UTC
    Can names have embedded spaces? Can the "haven't got a clue" part have embedded spaces?

    Without knowing this, we can't know whether to break

      MOTEL GOLDEN LEEUW <A225>

    into

      (MOTEL)()(GOLDEN LEEUW <A225>)

    or

      (MOTEL GOLDEN LEEUW)()(<A225>)

    Assuming for a moment that the latter is the correct way to divide the field, you could do something like

    while ( <DATA> ) {
        if ( /^(.*)\s*($Pred_re)\s*(.*)$/ ) {
            ($name,$pred,$unknown) = ($1, $2, $3);
        } else {
            ($name,$unknown,$pred) = /^(.*)\s*(\S*)$/;
        }
    }

    Updated: to tweak the whitespace matching.

    Update 2: Bah. Forget the feeble effort above.

    my $Prep_Re=join '|',('VAN DER','VAN DE','DEN','DE','VAN');

    while ( <DATA> ) {
        chomp;
        if ( /^(.*)\s+($Prep_Re)\b\s*(.*)$/ ) {
            ($name, $prep, $other) = ($1, $2, $3);
        } elsif ( /^(.*)\s+($Prep_Re)\s*$/ ) {
            ($name, $prep, $other) = ($1, $2, "");
        } elsif ( /^(.*)\b\s+(\S+)$/ ) {
            ($name, $prep, $other) = ($1, "", $2);
        } else {
            ($name, $prep, $other) = ($_, "", "");
        }
        print "$name|$prep|$other\n";
    }

    __DATA__
    WINTER DE <A240>
    ZANDEN VAN DER «A»
    JENSEN 230
    WOODHEAD <D>
    BRINK 130,-
    HEYDIER DEN <240>
    SMITSER (4X115PJ)
    LINDEN VAN DER
    MOTEL GOLDEN LEEUW <A225>
    __END__
    WINTER|DE|<A240>
    ZANDEN|VAN DER|«A»
    JENSEN||230
    WOODHEAD||<D>
    BRINK||130,-
    HEYDIER|DEN|<240>
    SMITSER||(4X115PJ)
    LINDEN|VAN DER|
    MOTEL GOLDEN LEEUW||<A225>
      Okay, I'll reply here, might be easier :) Results:

      WINTER|DE|<A240>             <- Parsed correctly
      ZANDEN VAN |DE|R «A»         <- should be: ZANDEN|VAN DER|«A»
      JENSEN 230||                 <- should be: JENSEN||230
      WOODHEAD <D>||               <- should be: WOODHEAD||<D>
      BRINK 130,-||                <- should be: BRINK||130,-
      HEYDIER |DEN|<240>           <- Parsed correctly
      SMITSER (4X115PJ)||          <- should be: SMITSER||(4X115PJ)
      LINDEN VAN |DE|R             <- should be: LINDEN|VAN DER|
      MOTEL GOL|DEN|LEEUW <A225>   <- should be: MOTEL GOLDEN LEEUW||<A225>

      does this help ? ... I will try to find a larger data set...

      btw, this is the result of my original code:
      WINTER|DE| <A240>
      ZANDEN|VAN DER| «A»
      JENSEN||230
      WOODHEAD||<D>
      BRINK||130,-
      HEYDIER|DEN| <240>
      SMITSER||(4X115PJ)
      LINDEN|VAN DER|
      MOTEL GOLDEN LEEUW||<A225>

      GreetZ!,
        ChOas

      print "profeth still\n" if /bird|devil/;
Re: Can this be parsed ?
by ChOas (Curate) on Jul 11, 2002 at 06:55 UTC
    Hmmm yeah... I should have given some more information:

    Names CAN have embedded spaces, and I THINK that the unknown
    field always starts with  [^A-Z] ...

    One case where my solution will not work is a name without
    spaces, no prependition (God, I'm starting to like that word :),
    and no Unknown field... But I expect there to be more...

    I tried your code, dws, but it does not seem to work
    (probably because I hadn't set the prerequisites
    straight)...

    And hossman, we gave up on the poetry magnets for choosing
    last names; it's now based on quantum mechanics, but if I
    were to explain it to you, I'd have to... well.. let's not go
    there :))

    GreetZ!,
      ChOas

    print "profeth still\n" if /bird|devil/;
      Names CAN have embedded spaces, and I THINK that the unknown field always starts with [^A-Z] ...

      Hmm, unless you're certain about that [^A-Z] thing, it sounds like you're screwed.

      There's no clear way to parse records that don't contain a "prependition", because you can never be sure where the name ends. Consider the simplest example...

      A B

      There are two totally valid parsings for this.
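
If the [^A-Z] guess does hold, though, the split becomes mechanical. Here is a minimal sketch under that assumption only: the name (plus optional prependition) is the leading run of words that start with A-Z, and everything from the first word starting with another character is the unknown field. The helper name split_record is made up for illustration, and the sample records are taken from the question.

```perl
use strict;
use warnings;

# Longer alternatives first, so 'VAN DER' is tried before 'VAN DE' and 'VAN'.
my $Prep_Re = join '|', ('VAN DER', 'VAN DE', 'DEN', 'DE', 'VAN');

# Hypothetical helper: returns (name, prependition, unknown),
# or an empty list if the record does not start with an A-Z word.
# ASSUMPTION: the unknown field never starts with A-Z.
sub split_record {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;    # trim both ends
    $line =~ s/\s{2,}/ /g;      # collapse runs of whitespace

    # $head: leading words that start with A-Z; $unknown: the rest, if any.
    my ($head, $unknown) =
        $line =~ /^([A-Z]\S*(?: [A-Z]\S*)*)(?: ([^A-Z ].*))?$/
        or return;

    # A prependition only counts when it closes the run of A-Z words.
    my ($name, $prep) = ($head, '');
    if ($head =~ /^(.+?) ($Prep_Re)$/) {
        ($name, $prep) = ($1, $2);
    }
    return ($name, $prep, defined $unknown ? $unknown : '');
}

for my $record ('WINTER DE <A240>', 'JENSEN 230', 'LINDEN VAN DER',
                'MOTEL GOLDEN LEEUW <A225>', 'A B') {
    print join('|', split_record($record)), "\n";
}
```

With this, 'A B' comes back as one name with nothing after it, which is only right if the assumption is right. A record whose unknown field starts with a letter will be swallowed into the name.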

Re: Can this be parsed ?
by cLive ;-) (Prior) on Jul 11, 2002 at 08:05 UTC
    If NAME and HAVENT_GOT_A_CLUE can both contain spaces, then the instance where there is no 'prependition' will cause you real problems IF the last field can start with a letter. But if it can't, here's what I'd do:
    #!/usr/bin/perl -w
    use strict;

    my $Prep_Re = join '|',('VAN DER','VAN DE','DEN','DE','VAN');

    while (<DATA>) {
        s/\s{2,}/ /g;
        s/^\s*(.*?)\s*$/$1/;
        if ( /(.+?) ($Prep_Re) ?((?:[^A-Za-z].*)?)/ ) {
            my ($Name,$Prep,$Unknown) = ($1,$2,$3);
            print "$Name == $Prep == $Unknown\n";
        }
        elsif ( /(.+) ([^A-Za-z].*)?/ ) {
            my ($Name,$Prep,$Unknown) = ($1,'',$2);
            print "$Name == $Prep == $Unknown\n";
        }
        else {
            print "No idea for $_\n";
        }
    }

    __DATA__
    WINTER DE <A240>
    ZANDEN VAN DER «A»
    JENSEN 230
    WOODHEAD <D>
    BRINK 130,-
    HEYDIER DEN <240>
    SMITSER (4X115PJ)
    LINDEN VAN DER
    MOTEL GOLDEN LEEUW <A225>

    .02

    ps - the nearest word I could find was cognomen, but I bet that's not right either :)

    --
    seek(JOB,$$LA,0);

Re: Can this be parsed ?
by Abigail-II (Bishop) on Jul 11, 2002 at 09:47 UTC
    You will not be able to parse it that easily. Consider ALICE, born VAN DER VLIET, now married to Mr. MARIE. Her name would be ALICE MARIE VAN DER VLIET. But surely, you do not want to split that into: (ALICE MARIE) (VAN DER) (VLIET)?

    Abigail
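
To see that failure concretely, here is a small sketch of the kind of split discussed in this thread: cut at the first prependition that follows a space. The helper name naive_split is hypothetical; the input is the name from the example above.

```perl
use strict;
use warnings;

my $Prep_Re = join '|', ('VAN DER', 'VAN DE', 'DEN', 'DE', 'VAN');

# Hypothetical helper: split at the first prependition found after a space.
# Returns (name, prependition, rest); no prependition means the whole
# line is returned as the name.
sub naive_split {
    my ($line) = @_;
    if ($line =~ /^(.+?) ($Prep_Re)\b\s*(.*)$/) {
        return ($1, $2, $3);
    }
    return ($line, '', '');
}

# prints "ALICE MARIE|VAN DER|VLIET"
print join('|', naive_split('ALICE MARIE VAN DER VLIET')), "\n";
```

The regex cannot tell a prependition that belongs to a surname in the middle of the field from one that sits in its own column, so no amount of tweaking the pattern fixes this without extra knowledge about the data.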

Re: Can this be parsed ?
by Popcorn Dave (Abbot) on Jul 11, 2002 at 18:45 UTC
    After playing with this for a few hours, I can see what a headache this is for you. : |

    I did come up with a solution that isn't regex oriented, but I think it will work for all cases. It could probably be tightened a bit, although the regex way may be better in the end.

    #!/usr/bin/perl -w
    use strict;

    my $temp='';
    my $flag=0;

    print "Name\t\tPrependition\t\tWhatever\n";
    while (<DATA>)
    {
        chomp;
        my @line = split('\s+',$_);
        for my $i(0..$#line){
            if ((length($line[$i])<4)&& ($line[$i] =~ m/[A-Z][A-Z]+/o)){
                $temp .= $line[$i];
                $line[$i] = '*' if +$flag;
                $line[$i] = '' if ! +$flag;
                $flag = 1;
                $temp .= ' ';
            }
        }
        $flag = 0;
        for my $i(0..$#line){
            print "$temp\t" if $line[$i] eq '';
            print "$line[$i]\t" unless $line[$i] eq '*';;
        }
        print "\n";
        $temp = '';
    }

    __DATA__
    WINTER DE <A240>
    ZANDEN VAN DER «A»
    JENSEN 230
    WOODHEAD <D>
    BRINK 130,-
    HEYDIER DEN <240>
    SMITSER (4X115PJ)
    LINDEN VAN DER
    MOTEL GOLDEN LEEUW <A225>

    Hope that helps!

    Some people fall from grace. I prefer a running start...

Re: Can this be parsed ?
by hossman (Prior) on Jul 11, 2002 at 06:24 UTC
    It's hard to understand your problem, mainly because I don't know anything about names in the Netherlands ...
    • What is your field separator? (i.e., can white space appear in names?)
    • What are the cases you can think of where it won't work?
    • Is "MOTEL GOLDEN LEEUW" a valid name? How do you guys pick names in the Netherlands -- throw a box of poetry magnets in the air and see what sticks to the light fixtures?
      How do you guys pick names in the Netherlands -- throw a box of poetry magnets in the air and see what sticks to the light fixtures?

      That's a bit rude, or at least culturally insensitive.

      Is "MOTEL GOLDEN LEEUW" a valid name? That's the name of a motel, surely. What's weird about that?

      If all this data is just in the same field, I think my pseudocode would be:

      while (<DATA>){
          $name = everything up to the first space.
          $rest = everything else
          if ($rest starts with one of the "van"-type prepositions){
              deal with it
          } else {
              print "can't parse this one into a name: $_"
          }
      }
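
For anyone who wants to run that, here is one possible reading of the pseudocode as real Perl. It is only a sketch: the helper name parse_first_word is made up, "deal with it" is taken to mean "return the three pieces", and the prependition list is the one from the question.

```perl
use strict;
use warnings;

my $Prep_Re = join '|', ('VAN DER', 'VAN DE', 'DEN', 'DE', 'VAN');

# Hypothetical helper: the name is the first word, and the rest must
# start with a complete prependition. Returns (name, prependition, unknown),
# or an empty list when the record cannot be parsed this way.
sub parse_first_word {
    my ($line) = @_;
    my ($name, $rest) = split ' ', $line, 2;   # first word, everything else
    return unless defined $name;
    $rest = '' unless defined $rest;

    # The prependition must be followed by whitespace or the end of the
    # record, so a word like 'DENVER' is not cut into 'DEN' + 'VER'.
    if ($rest =~ /^($Prep_Re)(?:\s+(.*))?$/) {
        return ($name, $1, defined $2 ? $2 : '');
    }
    return;
}

for my $record ('WINTER DE <A240>', 'LINDEN VAN DER', 'JENSEN 230',
                'MOTEL GOLDEN LEEUW <A225>') {
    my @fields = parse_first_word($record);
    if (@fields) {
        print join('|', @fields), "\n";
    } else {
        print "can't parse this one into a name: $record\n";
    }
}
```

Note that this rejects every record without a prependition, including ones like JENSEN 230, and it cannot handle names with embedded spaces.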

      --
      ($_='jjjuuusssttt annootthheer pppeeerrrlll haaaccckkeer')=~y/a-z//s;print;
        That's a bit rude, or at least culturally insensitive.

        Chalk it up to cultural insensitivity, I guess; humor isn't necessarily universal.

        Is "MOTEL GOLDEN LEEUW" a valid name?

        That's the name of a motel, surely. What's weird about that?

        no, not surely ... he said he was dealing with surnames, which is what prompted my question (it didn't seem like a name, which is why I wasn't clear how that line should be parsed)