http://qs321.pair.com?node_id=88801

Perl Newby has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to parse data out of a text file. I have a regular expression that takes a person's name and puts it in the format "J. Arcain". However, if the name is like the third entry below, my code does not parse it. Here is a piece of what the file looks like:
24|Janeth Arcain|6|6|217|36.2|51|106|.481|
321|Elen Chakirova|5|0|27|5.4|2|4|.500|
380|Kelley Gibson-White|6|0|85|14.2|3|17|.176|8|8|1.000|
This is the expression I am using to parse out the name.
~ s/^([A-Z])\w*( \w+)$/$1.$2/g
If anyone knows how I can parse the name into the format "K. Gibson-White", I would greatly appreciate it.

Replies are listed 'Best First'.
Re: Parsing Names in a Text File
by davorg (Chancellor) on Jun 15, 2001 at 18:30 UTC

    You're using the escape sequence \w to match characters in the surname. \w matches the characters A-Z, a-z, 0-9 and the underscore (_). Your example contains a dash (-) character, so you'll need to add that to the list of allowed characters. Something like this will work:

    s/^([A-Z])\w*( [-\w]+)$/$1.$2/g
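As a quick check of the character-class change, here is a minimal sketch that applies the substitution to the three names from the question. The helper name abbreviate() is made up for illustration, and the sketch assumes the name has already been isolated from the record, since the pattern is anchored with ^ and $.

```perl
use strict;
use warnings;

# abbreviate() is a hypothetical helper used only for this illustration.
# It expects the bare name, not the whole pipe-delimited record.
sub abbreviate {
    my ($name) = @_;
    # [-\w] allows hyphens as well as word characters in the surname.
    $name =~ s/^([A-Z])\w*( [-\w]+)$/$1.$2/;
    return $name;
}

# Sample names taken from the question's data.
for my $name ('Janeth Arcain', 'Elen Chakirova', 'Kelley Gibson-White') {
    print abbreviate($name), "\n";
}
```

With the original \w+ in the second group, the hyphenated name would not match and would come back unchanged.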

    --
    <http://www.dave.org.uk>

    Perl Training in the UK <http://www.iterative-software.com>

Re: Parsing Names in a Text File
by lemming (Priest) on Jun 15, 2001 at 18:33 UTC

    Check Name Parsing from early this year for more on this subject. It will give more cases to think about.

    update:
    I just looked at your strings a bit more closely.
    Why don't you split on the "|" and get the second value?

    ($something, $name, $junk) = split(/\|/, $line, 3);
    There are better ways of writing that split, but I'm running on no sleep.

    update on update: looks like more people gave the split answer while I did that...

Re: Parsing Names in a Text File
by enoch (Chaplain) on Jun 15, 2001 at 18:35 UTC
    I would pass on using a regex for this one.
    while (<FILE>) {
        @line = split /\|/;                          # split line on literal '|' (a bare '|' is regex alternation)
        $firstLetter = substr $line[1], 0, 1;        # grab first letter
        $lastName = (split(' ', $line[1]))[1];       # split the name entry on whitespace and grab the last name portion
        $name = $firstLetter . ". " . $lastName;     # concatenate with period after first letter
    }
    Jeremy
      I'd recommend taking a slight variant on this in order to handle the middle name problem:
      while (<FILE>) {
          my @line = split /\|/;    # split line on literal '|'
          # so far as above, apart from the my declaration

          # now split on whitespace
          my @names = split ' ', $line[1];

          # same idiom for getting the first letter
          my $firstletter = substr($names[0], 0, 1);

          # then get the last item in the name array
          my $lastname = $names[-1];

          # now do whatever you want to do with the letter and lastname
      }
      Of course this assumes that all the names are in a givenname middlenames familyname format.
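The split-based approach above can be sketched end to end as follows. This is only an illustration under the same givenname/middlenames/familyname assumption: short_name() is a hypothetical helper, and the records are the sample lines from the question held in an array rather than read from a file.

```perl
use strict;
use warnings;

# short_name() is a hypothetical helper, not part of any module.
# It returns "X. Lastname" for a pipe-delimited record, or nothing
# if the record has no usable name field.
sub short_name {
    my ($record) = @_;
    my $full = (split /\|/, $record)[1];    # second pipe-delimited field
    return unless defined $full;
    my @parts = split ' ', $full;           # split the name on whitespace
    return unless @parts >= 2;
    # First letter of the first word, then the last word of the name.
    return substr($parts[0], 0, 1) . '. ' . $parts[-1];
}

# Sample records copied from the question.
my @records = (
    '24|Janeth Arcain|6|6|217|36.2|51|106|.481|',
    '321|Elen Chakirova|5|0|27|5.4|2|4|.500|',
    '380|Kelley Gibson-White|6|0|85|14.2|3|17|.176|8|8|1.000|',
);

for my $record (@records) {
    my $name = short_name($record);
    print "$name\n" if defined $name;
}
```

Because the name is split on whitespace only, the hyphen in "Gibson-White" needs no special handling here.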
Re: Parsing Names in a Text File
by Masem (Monsignor) on Jun 15, 2001 at 18:35 UTC
    It's probably easier to use split since your file is nicely set up for that; regexes aren't always the right cure for every problem.
    my @names;
    while (<FILE>) {
        # $id is the leading number; named $id rather than $a to stay clear of sort's special $a/$b
        my ( $id, $name, @rest ) = split /\|/;
        push @names, $name;
    }

    Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
Re: Parsing Names in a Text File
by runrig (Abbot) on Jun 15, 2001 at 18:51 UTC
    Do it in two steps; it is much clearer than coming up with one regex to do everything:
    my $str = '380|Kelley Gibson-White|6|0|85|14.2|3|17|.176|8|8|1.000|';

    # Get the name
    my ($name) = (split /\|/, $str, 3)[1];

    # Initialize (abbreviate) the first name
    $name =~ s/(\w)\S*\s+(.*)/$1. $2/;
    print $name;
Re: Parsing Names in a Text File
by marvell (Pilgrim) on Jun 15, 2001 at 19:53 UTC
    Look out for middle names and surnames that start with "De", etc. They are easily confused. I bet there is a CPAN module that does this, and I bet it has a list of "probable" two-word surname prefixes.

    --
    Brother Marvell

      It can be totally ambiguous, too. Ricky Van Shelton, the singer, has a FIRST NAME of "Ricky Van".
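To make the prefix idea concrete, here is a rough sketch of a prefix-aware heuristic. Everything in it is an assumption for illustration: the prefix list is made up and deliberately incomplete, short_name_with_prefix() is a hypothetical helper rather than a CPAN function, and the example names are not from the original data.

```perl
use strict;
use warnings;

# Hypothetical, deliberately incomplete list of surname prefixes.
my %is_prefix = map { lc($_) => 1 } qw(de del della di du la le van von);

# Hypothetical helper: initial of the first word, then the surname,
# where the surname is the last word plus any prefix words before it.
sub short_name_with_prefix {
    my ($full) = @_;
    my @parts = split ' ', $full;
    return $full if @parts < 2;              # nothing to abbreviate

    my $initial = substr(shift @parts, 0, 1);

    # Start from the last word and pull in preceding prefix words.
    my @surname = (pop @parts);
    unshift @surname, pop @parts
        while @parts && $is_prefix{ lc $parts[-1] };

    return "$initial. @surname";
}

for my $name ('Oscar De La Hoya', 'Johann Sebastian Bach', 'Ricky Van Shelton') {
    print short_name_with_prefix($name), "\n";
}
```

The last example shows the limit of any such list: the heuristic yields "R. Van Shelton", which is wrong if "Van" belongs to the first name, so truly ambiguous names still need a human or better source data.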
Re: Parsing Names in a Text File
by mothra (Hermit) on Jun 15, 2001 at 20:08 UTC
    Easy enough:
    open(CUST, "cust_info.txt");
    while (<CUST>) {
        print ((split ('\|'))[1], "\n");
    }
    Update: My apologies...even after reading the question twice before answering, I still missed what you were trying to do. :)
    open(CUST, "cust_info.txt");
    while (<CUST>) {
        $full_name = (split '\|')[1];
        $full_name =~ s/(\w)\w*\s+(.+$)/$1. $2/;
        print $full_name, "\n";
    }
    is probably more what you're looking for.
Re: Parsing Names in a Text File
by Hofmator (Curate) on Jun 15, 2001 at 19:38 UTC

    I like regexes :) and, well, they are not that difficult to understand

    use strict;
    use warnings;

    while (<DATA>) {
        my $shortname = join '. ', /\|([A-Z])\S*\s+([^|]+)/;
        print $shortname, " or ";

        # or if the rest of the line is to be left alone
        s/(\|[A-Z])\S*\s+([^|]+)/$1. $2/;
        print;
    }
    __DATA__
    24|Janeth Arcain|6|6|217|36.2|51|106|.481|
    321|Elen Chakirova|5|0|27|5.4|2|4|.500|
    380|Kelley Gibson-White|6|0|85|14.2|3|17|.176|8|8|1.000|

    Of course this still puts some constraints on the names, e.g. two first names as in 'Johann Sebastian Bach' are not allowed ... and there are probably a lot more special cases.

    -- Hofmator