http://qs321.pair.com?node_id=596527

rtremaine has asked for the wisdom of the Perl Monks concerning the following question:

Perl Monks, I'm a newbie having trouble with a regular expression. I know it should be easy, but... I'm looking for the most efficient way to parse the following types of input so that I can place them into the three variables below. I thought 'split' might be the way to go, but I'm having difficulties! Many thanks.
' Smith, John'
' Thompson, Frank A'
' Smith, John A JR'
' Smith, John A III'
' Smith, John A (Johnny)'
just need last name, first name and middle initial, like so:
$last_name = 'Smith'
$first_name = 'John'
$middle_initial = 'A'

Replies are listed 'Best First'.
Re: Newbie parsing problem
by McDarren (Abbot) on Jan 25, 2007 at 16:52 UTC
    Why do all your own barking when you can get yourself a perfectly good dog? ;)

    Lingua::EN::Parse::PersonsName does a pretty good job with this:

    #!/usr/bin/perl -wl
    use strict;
    use Lingua::EN::Parse::PersonsName;

    while (<DATA>) {
        chomp();
        # Get rid of the stuff we don't need on either end of the string
        my ($fullname) = $_ =~ /^\'\s+(.*?)\'$/;
        my $parser = Lingua::EN::Parse::PersonsName->new($fullname);
        print join(" ",$parser->fname, $parser->mi, $parser->lname);
    }
    __DATA__
    ' Smith, John'
    ' Thompson, Frank A'
    ' Smith, John A JR'
    ' Smith, John A III'
    ' Smith, John A (Johnny)'
    ' Robert E Smith'
    ' Fred E.J.K Flintstone III'
    Output:
    John Smith
    Frank A Thompson
    John A Smith
    John A Smith
    John A Smith
    Robert E Smith
    Fred E Flintstone
    (Note that a warning is thrown on the first one when we try to print the middle initial - because there is none.)

    Cheers,
    Darren :)
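
One way to avoid the uninitialized-value warning mentioned above is to drop undefined or empty parts before joining. This is only a minimal sketch in plain Perl with no parser module involved; join_name_parts is a hypothetical helper and the sample values are illustrative, not output from the module:

```perl
use strict;
use warnings;

# Hypothetical helper: join only the name parts that are defined and non-empty,
# so a missing middle initial causes neither a warning nor a double space.
sub join_name_parts {
    my @parts = @_;
    return join ' ', grep { defined $_ && length $_ } @parts;
}

print join_name_parts('John',  undef, 'Smith'),    "\n";   # no middle initial
print join_name_parts('Frank', 'A',   'Thompson'), "\n";
```

Filtering with grep keeps the calling code short; an alternative would be to default the middle initial to an empty string before printing.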

Re: Newbie parsing problem
by liverpole (Monsignor) on Jan 25, 2007 at 16:11 UTC
    Hi rtremaine,

    If you're sure there's always going to be whitespace between names, you could do something like this:

    sub parse_name {
        my ($namestr) = @_;
        my @names = split(/\s+/, $namestr);
        my ($last, $first, $middle) = ($names[0], $names[1], $names[2]);
        $last =~ s/,$//;
        $middle ||= "";
        return [ $first, $middle, $last ];
    }

    Note that the subroutine also trims any comma from the end of the last name, and returns a blank middle name if one wasn't defined.

    Now call the subroutine parse_name() with a name string, and you'll get a reference to a list containing the first, middle, and last names.  For example:

    use strict;
    use warnings;

    my @data = (
        'Smith, John',
        'Thompson, Frank A',
        'Smith, John A JR',
        'Smith, John A III',
        'Smith, John A (Johnny)',
    );

    foreach my $name (@data) {
        my $p = parse_name($name);
        printf "First(%10.10s) Middle(%5.5s) Last(%10.10s)\n", @$p;
    }

    __END__

    Output:
    First(      John) Middle(     ) Last(     Smith)
    First(     Frank) Middle(    A) Last(  Thompson)
    First(      John) Middle(    A) Last(     Smith)
    First(      John) Middle(    A) Last(     Smith)
    First(      John) Middle(    A) Last(     Smith)

    Update:  Added test code.
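
The sample strings in the original question begin with a space, and split(/\s+/, ...) returns an empty leading field for such input, so $names[0] would come back empty. A hedged variant is sketched below, assuming the same [first, middle, last] return convention; parse_name_ws is a hypothetical name, not part of the reply above. It relies on the special-case split ' ' form, which skips leading whitespace:

```perl
use strict;
use warnings;

# Hypothetical variant of parse_name(): split ' ' (a single-space string, not a
# regex) discards leading whitespace, so ' Smith, John' parses like 'Smith, John'.
sub parse_name_ws {
    my ($namestr) = @_;
    my ($last, $first, $middle) = split ' ', $namestr;
    $last   = '' unless defined $last;      # guard against an empty input string
    $first  = '' unless defined $first;
    $middle = '' unless defined $middle;    # blank when there is no middle part
    $last =~ s/,$//;                        # trim the comma after the last name
    return [ $first, $middle, $last ];
}

for my $name (' Smith, John', ' Thompson, Frank A', ' Smith, John A JR') {
    printf "First(%s) Middle(%s) Last(%s)\n", @{ parse_name_ws($name) };
}
```

Like the original, this takes the third whitespace-separated token as the middle part without checking that it is a single letter.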


    s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/
Re: Newbie parsing problem
by vaticide (Scribe) on Jan 25, 2007 at 16:17 UTC
    You may be able to use a regular expression for this, such as:
    my ($last_name, $first_name, $middle_initial) = ( $_ =~ /(\w+)\W*?(\w+)\s?(\w)?/ );
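    Here's a subroutine using that regexp, which you can plug right into liverpole's solution:
    sub parse_name {
        my ($last, $first, $mi) = ( $_[0] =~ /(\w+)\W*?(\w+)\s?(\w)?/ );
        $mi = '' unless defined $mi;    # no middle initial captured
        # Same order as liverpole's parse_name(): first, middle, last
        return [ $first, $mi, $last ];
    }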
    Updated: Noticed I missed the first no middle-initial case, oops!
      Your solution ignores the name when there's no middle initial (or just no trailing space), which is the case in the first example. Using \W* or \s* instead of a single blank space fixes this:
      /(\w+)\W*?(\w+)\W*(\w)?/

      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      *women.pm
        Thanks, I noticed that, too, when I changed it to plug into the test case. Fairly embarrassing, considering I do name parsing such as this regularly at work!
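
To see the revised pattern from this subthread at work, here is a small sketch that applies it to the five inputs from the original question. parse_name_re is a hypothetical name, and returning a plain (last, first, middle initial) list is an assumption made to match the three variables the question asks for:

```perl
use strict;
use warnings;

# Sketch: wrap the revised pattern /(\w+)\W*?(\w+)\W*(\w)?/ in a subroutine.
# Returns (last, first, middle initial), or an empty list if nothing matches.
sub parse_name_re {
    my ($namestr) = @_;
    my ($last, $first, $mi) = $namestr =~ /(\w+)\W*?(\w+)\W*(\w)?/
        or return;                      # no match at all
    $mi = '' unless defined $mi;        # name without a middle initial
    return ($last, $first, $mi);
}

for my $name (' Smith, John', ' Thompson, Frank A', ' Smith, John A JR',
              ' Smith, John A III', ' Smith, John A (Johnny)') {
    my ($last_name, $first_name, $middle_initial) = parse_name_re($name);
    print "last=$last_name first=$first_name middle=$middle_initial\n";
}
```

Because \w does not match apostrophes or hyphens, names such as O'Brien or Smith-Jones would need a wider character class than this sketch uses.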