PerlMonks  

Split first and last names

by Bod (Priest)
on Nov 09, 2022 at 22:41 UTC ( #11148076=perlquestion )

Bod has asked for the wisdom of the Perl Monks concerning the following question:

I need to split a name string into first and last names. The name comes from user input on a web form and the only validation is that there is data submitted. It could be just one character, it could have multiple spaces, trailing spaces, etc.

This is the regexp I have for doing this task.

my ($fname, undef, $sname) = $xmas->{'fromName'} =~ /(\w+)( +|\Z)(\w*) +/;

It works for my simple testing but is there a "better" way to do it? However you define better!

The obvious problem is that it fails with extended characters such as Zo.

Replies are listed 'Best First'.
Re: Split first and last names
by hippo (Bishop) on Nov 09, 2022 at 23:43 UTC

    The short answer is that it's not a solvable problem. See Falsehoods Programmers Believe About Names for why that is.

    If you insist on storing data labelled as first name and last name then the best plan is to ask the person for those data items as separate fields (ie. change your web form).


    🦛

      Falsehoods Programmers Believe About Names

      It goes even further.

      I know Mies van der Rohe (architect) from a street name, from an address change that once was important to me. I googled his name a long time ago, and that's almost all I remembered.

      The name "Mies van der Rohe" follows a common pattern for people with a history from the Netherlands, "$firstname van (der|den) $lastname". So, I almost automatically split his name into "Mies" and "van der Rohe". Guess how surprised I was when I heard "Ludwig Mies van der Rohe" recently. OK, so LMvdR surely had two first names, that's quite common here. "Ludwig" and "Mies" must be the first names, and "van der Rohe" the last name. WRONG! "Mies van der Rohe" was his last name.

      Just to annoy anyone handling names in computers, he was commonly referred to as "Mies" (which translates to English as lousy, crappy, bad). And to really annoy anyone handling names in computers, he was born as Maria Ludwig Michael Mies. A female first firstname, and no "van der Rohe" at all.

      Some more examples: Re^7: regex to return line with name but not if it has a number - and I guess the name of the second example in that post has drastically changed this year.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
        > I guess the name of the second example in that post has drastically changed this year.

        You shouldn't confuse legal name and nobility titles.

        I'm pretty sure I read once, that he signed his marriage papers with Diana simply as "Charles Mountbatten-Windsor".

        Cheers Rolf
        (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
        Wikisyntax for the Monastery

      If you insist on storing data labelled as first name and last name then the best plan is to ask the person for those data items as separate fields (ie. change your web form).

      As a marketer first and a programmer second and a DBA third, the form will not be changing! The reason is simple - more people fill in the form when we ask for the name as a single field than if we ask for it in two or three parts.

      We've tested pretty much every combination and found (for our audience) that the optimal form uses placeholders, not labels. It has three fields on the initial page which are labelled

      Your Name
      Your Phone (optional)
      Your Best Email

      If we need extra information (we rarely do) then we ask for it after we have captured the key details and stored them safely.

      As for the database storage...
      We store name information as follows:

      prefix
      firstname
      middle name(s)
      nickname
      lastname
      preferred firstname
      preferred fullname
      suffix

      The problem I see with storing the full name in the database exactly as they type it into a web form is that we need to use both their full name (in an email "To" field perhaps, or on the front of an envelope) and just their firstname in the salutation. So we have to split it somewhere and, to my mind, it's best to do this early in the process - i.e. before storing it.

      We have discussed adding a separator component for terms like von but we have so few in our database that we currently treat them as part of the lastname.
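
A minimal sketch of the approach described above, where particles such as von stay with the lastname. The sub name split_name, the particle list and the sample names are assumptions for illustration, not code from this thread:

```perl
use strict;
use warnings;

# Hypothetical, deliberately short list of lower-case particles.
my %PARTICLE = map { ( $_ => 1 ) } qw(von van der den de la);

# Returns (firstname, lastname). The lastname starts at the first
# particle if there is one, otherwise it is the final token; any
# tokens in between (middle names) are ignored in this sketch.
sub split_name {
    my ($full) = @_;
    my @tok = split ' ', ( $full // '' );   # trims and collapses whitespace
    return ( '', '' ) unless @tok;
    my $first = shift @tok;
    return ( $first, '' ) unless @tok;

    my $start = $#tok;
    for my $i ( 0 .. $#tok - 1 ) {
        if ( $PARTICLE{ lc $tok[$i] } ) {
            $start = $i;
            last;
        }
    }
    return ( $first, join( ' ', @tok[ $start .. $#tok ] ) );
}

for my $n ( "Otto von Bismarck", "Ludwig Mies van der Rohe", "Cher" ) {
    my ( $first, $last ) = split_name($n);
    print "$n => first='$first' last='$last'\n";
}
```

Note that this splits "Ludwig Mies van der Rohe" into "Ludwig" and "van der Rohe", which is exactly the mistake described earlier in the thread, so the result is a guess that still needs a human check.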

        we need to use both their full name (in an email to field perhaps or on the front of an envelope) but just their firstname in the salutation

        Bod, though I know nothing of your (superb) business, it won't stop me from offering some free advice.

        If you're politely allowing them to enter anything their heart desires in the "Your Name" field -- and further assuming you're not being overwhelmed by a huge volume of registrations -- how about simply storing their preferred "Your Name" in your database ... and then later manually editing your new registrations, checking for goofs (and rude Turkish words) ... but also to manually enter a nice "salutation" field.

        Parsing human names is one niche where I suspect humans still outperform computers, as I discovered years ago when my company sent out a letter to "Dear Captain Cruises" after the computer program incorrectly derived a salutation from Captain Cook Cruises. :-)
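
One way to sketch that suggestion: keep the name as typed (with whitespace tidied), derive only a tentative salutation, and flag every record for a manual check. The sub name make_record and the hash keys are hypothetical, not part of anyone's actual schema here:

```perl
use strict;
use warnings;

# Hypothetical helper: store the full name as entered and guess a
# salutation from the first token, to be confirmed by a human.
sub make_record {
    my ($raw) = @_;
    my $full = $raw // '';
    $full =~ s/^\s+|\s+$//g;    # strip leading/trailing whitespace
    $full =~ s/\s+/ /g;         # collapse internal runs of whitespace

    my ($guess) = split / /, $full;    # first token; undef if empty

    my %rec = (
        fullname     => $full,
        salutation   => $guess // '',
        needs_review => 1,             # always queued for manual editing
    );
    return \%rec;
}

my $rec = make_record("  Captain Cook   Cruises ");
print "Full: $rec->{fullname}\n";
print "Salutation (to confirm): $rec->{salutation}\n";
```

The guess for "Captain Cook Cruises" is still "Captain", which is why the review flag matters more than the split itself.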

Re: Split first and last names
by kcott (Archbishop) on Nov 10, 2022 at 00:42 UTC

    G'day Bod,

    I agree with others that you should change the form. Ask specifically for first name and last name.

    "The obvious problem is that it fails with extended characters such as Zo."

    Take a look at perlrecharclass and follow links from there.

    This code is not intended as a solution to your problem; it's just to demonstrate some options that are available:

    $ perl -Mstrict -Mwarnings -Mutf8 -C -E '
        my $n = "Zo ct-Smythe";
        my ($f, undef, $l)
            = $n =~ /([[:alpha:]]+)( +|\Z)([\p{Alpha}\p{Punct}]*)/;
        say $f;
        say $l;
    '
    Zo
    ct-Smythe
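
As a further illustration of the character classes documented in perlrecharclass, here is a small sketch that reports which classes accept a name containing a non-ASCII letter. The sample string, the %class table and the sub name class_report are assumptions made for this example:

```perl
use strict;
use warnings;
use feature 'unicode_strings';    # Unicode rules for all strings

# Hypothetical sample: "Zo" followed by U+00EB, a non-ASCII letter.
my $sample = "Zo\x{EB}";

my %class = (
    '\w'          => qr/^\w+$/,
    '[[:alpha:]]' => qr/^[[:alpha:]]+$/,
    '\p{Alpha}'   => qr/^\p{Alpha}+$/,
    '\w with /a'  => qr/^\w+$/a,      # /a restricts \w to ASCII
);

# Returns a hashref: class label => 1 if the whole string matches.
sub class_report {
    my ($str) = @_;
    my %result;
    $result{$_} = ( $str =~ $class{$_} ) ? 1 : 0 for keys %class;
    return \%result;
}

my $report = class_report($sample);
printf "%-12s %s\n", $_, ( $report->{$_} ? 'matches' : 'no match' )
    for sort keys %$report;
```

Under these settings only the /a variant should reject the sample; the other three treat the accented letter as alphabetic.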
    

    — Ken

      Thanks Ken for the helpful code sample that I will muse over

      With regards changing the form, that won't be happening as explained in Re^2: Split first and last names

        "Thanks Ken for the helpful code ..."

        You're welcome.

        "With regards changing the form, that won't be happening as explained in Re^2: Split first and last names"

        That's new information (only posted an hour or so ago) but does add some clarity.

        "... full name (... envelope) ... firstname ... salutation."

        Perhaps something along these lines:

        #!/usr/bin/env perl

        use strict;
        use warnings;
        use utf8;
        use open OUT => qw{:encoding(UTF-8) :std};

        my @names = ("Zo", "Zo ct-Smythe");

        my $re = qr{(?x:
            ^
                (
                    ( [\p{Alpha}'_-]+ )
                    [\s\p{Alpha}'_-]*
                )
            $
        )};

        for my $name (@names) {
            my ($full, $first) = $name =~ $re;
            print "Name: $name\n";
            print "First: $first\n";
            print "Full: $full\n";
        }

        Output:

        Name: Zo
        First: Zo
        Full: Zo
        Name: Zo ct-Smythe
        First: Zo
        Full: Zo ct-Smythe

        Note that I allowed three punctuation characters ('_-); alter as necessary. I know, from earlier posts, that you're across SQL injection issues. Be aware, that between reading data from the web and supplying it to SQL, there may be other code injection issues. Without knowing anything more about your code, that's something you'll need to assess for yourself: I didn't include any validation; but you should.
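
A minimal sketch of the kind of validation meant here, assuming the input has already been decoded to a character string. The length limit, the sub name validate_name and the exact set of allowed characters are assumptions to adjust, not a recommendation from this thread:

```perl
use strict;
use warnings;
use feature 'unicode_strings';

# Hypothetical limit; tune to your own data.
my $MAX_LEN = 100;

# Returns the tidied name, or undef if it fails the checks.
sub validate_name {
    my ($name) = @_;
    return unless defined $name;
    $name =~ s/^\s+|\s+$//g;    # trim
    $name =~ s/\s+/ /g;         # collapse runs of whitespace
    return if $name eq '' or length($name) > $MAX_LEN;

    # Letters, combining marks, spaces, and the three punctuation
    # characters allowed in the regex above: ' _ -
    return unless $name =~ /\A[\p{Alpha}\p{Mark} '_-]+\z/;
    return unless $name =~ /\p{Alpha}/;    # at least one letter
    return $name;
}

for my $input ( "  Ada   Lovelace ", "x; rm -rf /" ) {
    my $ok = validate_name($input);
    print defined($ok) ? "accepted: $ok\n" : "rejected\n";
}
```

This is a whitelist check only; it does not replace placeholders in SQL or escaping in any other output context.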

        "As for the database storage..."

        It looks like most of that would be covered by "If we need extra information ... we ask for it ...". The majority wouldn't be covered by user input anyway (e.g. nickname, preferred names). Again, something for you to determine using the same principles as above (i.e. limited regex capture, code injection & validation).

        — Ken

Re: Split first and last names
by soonix (Canon) on Nov 10, 2022 at 08:19 UTC
Re: Split first and last names
by LanX (Sage) on Nov 09, 2022 at 23:11 UTC
    > but is there a "better" way to do it?

    you could start with two input fields instead of trying to parse one.

    Do you know the famous author "Orson Scott Card"?

    Whenever I stumble over him, my neural parser puzzles all over again about which name category the "Scott" belongs to.

    (I've looked it up already, every time...)

    And names can get much harder than this ... see also Names_of_Sun_Yat-sen

    > The obvious problem is that it fails with extended characters such as Zo.

    Well, the obvious issue is encoding

    Could be your HTML/HTTP settings or your script, or both.

    Unicode rulez...
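
To illustrate the encoding point, here is a sketch that applies the same \w+ capture to raw UTF-8 bytes and to the decoded string. The byte string stands in for what a form handler might receive and the sub name first_word is hypothetical; whether your script actually sees undecoded bytes depends on your CGI/framework setup:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Captures the first run of word characters, or undef if none.
sub first_word {
    my ($str) = @_;
    my ($word) = $str =~ /^\s*(\w+)/;
    return $word;
}

# "Zo" + U+00EB as UTF-8 bytes (hypothetical undecoded form input).
my $bytes = "Zo\xC3\xAB Smith";
my $chars = decode( 'UTF-8', $bytes );

printf "bytes:   %d word characters\n", length( first_word($bytes) );
printf "decoded: %d word characters\n", length( first_word($chars) );
```

On undecoded bytes the capture stops after "Zo", because the two bytes of the accented letter are not word characters under Perl's default rules; after decoding, the letter is a single character that \w accepts.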

    Cheers Rolf
    (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
    Wikisyntax for the Monastery

      Unicode rulez...

      Yes indeed...

      Instead of using the English spelling of my firstname on Facebook - Ian - I dot the first letter so it becomes İan. English speakers don't generally notice but I am learning Turkish and I am part of a few Turkish-speaking Facebook groups.

      Unfortunately, LAN is a rude word in Turkish and Turks see my firstname and it causes them great amusement. After the sixth time it stopped being funny to me... Strangely, whilst English speakers pay no attention to the dot, Turkish speakers see it completely differently and don't associate it in the same way so it solves that problem but does mean I need to be careful with code that reads names from Facebook - which at least one of our site logins does.

        LAN is a rude word in Turkish and Turks see my firstname and it causes them great amusement

        ... while in German, LANX is a rude word causing great amusement. :) Why not use "Bod"? (we Australians are fond of using shortened surnames as nicknames).
