Re: Split first and last names
by hippo (Bishop) on Nov 09, 2022 at 23:43 UTC
|
The short answer is that it's not a solvable problem. See Falsehoods Programmers Believe About Names for why that is.
If you insist on storing data labelled as first name and last name then the best plan is to ask the person for those data items as separate fields (i.e. change your web form).
|
> Falsehoods Programmers Believe About Names
It goes even further.
I know Mies van der Rohe (architect) from a street name, from an address change that once was important to me. I googled his name a long time ago, and that's almost all I remembered.
The name "Mies van der Rohe" follows a common pattern for people with a history from the Netherlands, "$firstname van (der|den) $lastname". So, I almost automatically split his name into "Mies" and "van der Rohe". Guess how surprised I was when I heard "Ludwig Mies van der Rohe" recently. OK, so LMvdR surely had two first names, that's quite common here. "Ludwig" and "Mies" must be the first names, and "van der Rohe" the last name. WRONG! "Mies van der Rohe" was his last name.
Just to annoy anyone handling names in computers, he was commonly referred to as "Mies" (which translates to English as lousy, crappy, bad). And to really annoy anyone handling names in computers, he was born as Maria Ludwig Michael Mies. A female first firstname, and no "van der Rohe" at all.
Some more examples: Re^7: regex to return line with name but not if it has a number - and I guess the name of the second example in that post has drastically changed this year.
Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
|
Great examples! ++
> In the midst of a contractual dispute with Warner Bros. in 1993, he changed his stage name to the unpronounceable love symbol and was often referred to as The Artist Formerly Known as Prince (or TAFKAP)
-- from The late (mindbogglingly brilliant) musician Prince (wikipedia)
I find this stuff very entertaining for some reason.
I remember being amused in the extreme by Prince's new unpronounceable name.
Since I don't have a list of references on this topic yet, I'll remedy that oversight here so as not to disappoint the LanX. :)
Further suggestions welcome!
References on Parsing Names and Addresses
Parsing names:
Related:
Address de-duplication:
CPAN
Note: this node was updated with new references long after originally written.
|
> I guess the name of the second example in that post has drastically changed this year.
You shouldn't confuse legal name and nobility titles.
I'm pretty sure I once read that he signed his marriage papers with Diana simply as "Charles Mountbatten-Windsor".
|
> If you insist on storing data labelled as first name and last name then the best plan is to ask the person for those data items as separate fields (i.e. change your web form).
As a marketer first and a programmer second and a DBA third, the form will not be changing! The reason is simple - more people fill in the form when we ask for the name as a single field than if we ask for it in two or three parts.
We've tested pretty much every combination and found (for our audience) that the optimal form uses placeholders, not labels. It has three fields on the initial page which are labelled
Your Name
Your Phone (optional)
Your Best Email
If we need extra information (we rarely do) then we ask for it after we have captured the key details and stored them safely.
As for the database storage...
We store name information as follows:
prefix
firstname
middle name(s)
nickname
lastname
preferred firstname
preferred fullname
suffix
The problem I see with storing the full name in the database exactly as typed into the web form is that we need both the full name (in an email "To" field, perhaps, or on the front of an envelope) and just the firstname (in the salutation). So we have to split it somewhere and, to my mind, it's best to do this early in the process, i.e. before storing it.
We have discussed adding a separator component for terms like von, but we have so few in our database that we currently treat them as part of the lastname.
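For anyone following along, here is a rough sketch of what "split early, keep particles with the lastname" might look like. The helper name `split_name`, the particle list, and the hash keys are my own assumptions for illustration, not Bod's actual code or schema.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use open qw(:std :encoding(UTF-8));

# Assumed particle list; extend or shrink to suit your own data.
my %PARTICLE = map { $_ => 1 } qw(von van der den de);

# Hypothetical helper: naive split of a single "Your Name" field.
# First token -> firstname, final token -> lastname, the rest -> middle.
# Particles directly before the final token are pulled into the lastname.
sub split_name {
    my ($full) = @_;
    $full =~ s/^\s+|\s+$//g;                 # trim leading/trailing space
    my @parts = split /\s+/, $full;
    return { firstname => '', middle => '', lastname => '' } unless @parts;

    my $first = shift @parts;
    my @last  = @parts ? (pop @parts) : ();
    unshift @last, pop @parts while @parts && $PARTICLE{ lc $parts[-1] };

    return {
        firstname => $first,
        middle    => join(' ', @parts),
        lastname  => join(' ', @last),
    };
}

for my $name ('Zoë', 'Otto von Bismarck', 'Ludwig Mies van der Rohe') {
    my $n = split_name($name);
    print "$name => first [$n->{firstname}] middle [$n->{middle}] last [$n->{lastname}]\n";
}
```

Note that this gets "Ludwig Mies van der Rohe" wrong in exactly the way described earlier in the thread: "Mies" lands in the middle name rather than the lastname. Any rule-based split will have cases like that.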
|
> we need to use both their full name (in an email to field perhaps or on the front of an envelope) but just their firstname in the salutation
Bod, though I know nothing of your (superb) business, it won't stop me from offering some free advice.
If you're politely allowing them to enter anything their heart desires in the "Your Name" field -- and further assuming you're not being overwhelmed by a huge volume of registrations -- how about simply storing their preferred "Your Name" in your database ... and then later manually editing your new registrations, checking for goofs (and rude Turkish words) ... but also to manually enter a nice "salutation" field.
Parsing human names is one niche where I suspect humans still outperform computers, as I discovered years ago when my company sent out a letter to "Dear Captain Cruises" after the computer program incorrectly derived a salutation from Captain Cook Cruises. :-)
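A minimal sketch of that idea, assuming a record with a verbatim `name_as_entered` field and a `salutation` field that stays undefined until a human has reviewed it. Both field names and the `greeting` helper are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical record layout: 'name_as_entered' is stored verbatim;
# 'salutation' is undef until someone has reviewed the registration.
sub greeting {
    my ($rec) = @_;
    # Fall back to the full name rather than guessing a first name.
    my $who = $rec->{salutation} // $rec->{name_as_entered};
    return "Dear $who,";
}

print greeting({ name_as_entered => 'Captain Cook Cruises' }), "\n";
print greeting({ name_as_entered => 'Zoe Smythe', salutation => 'Zoe' }), "\n";
```

The fallback is a bit stiff, but it avoids a machine-derived "Dear Captain Cruises".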
|
Re: Split first and last names
by kcott (Archbishop) on Nov 10, 2022 at 00:42 UTC
|
G'day Bod,
I agree with others that you should change the form.
Ask specifically for first name and last name.
"The obvious problem is that it fails with extended characters such as Zoë."
Take a look at perlrecharclass and follow links from there.
This code is not intended as a solution to your problem;
it's just to demonstrate some options that are available:
$ perl -Mstrict -Mwarnings -Mutf8 -C -E '
    my $n = "Zoë Åcçéñt-Smythe";
    my ($f, undef, $l)
        = $n =~ /([[:alpha:]]+)( +|\Z)([\p{Alpha}\p{Punct}]*)/;
    say $f;
    say $l;
'
Zoë
Åcçéñt-Smythe
|
Thanks Ken for the helpful code sample that I will muse over.
With regard to changing the form, that won't be happening, as explained in Re^2: Split first and last names.
|
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use open OUT => qw{:encoding(UTF-8) :std};

my @names = ("Zoë", "Zoë Åcçéñt-Smythe");

# Outer capture: the full name. Inner capture: the first name only.
my $re = qr{(?x:
    ^
    (
        (
            [\p{Alpha}'_-]+
        )
        [\s\p{Alpha}'_-]*
    )
    $
)};

for my $name (@names) {
    my ($full, $first) = $name =~ $re;
    print "Name: $name\n";
    print "First: $first\n";
    print "Full: $full\n";
}
Output:
Name: Zoë
First: Zoë
Full: Zoë
Name: Zoë Åcçéñt-Smythe
First: Zoë
Full: Zoë Åcçéñt-Smythe
Note that I allowed three punctuation characters ('_-); alter as necessary.
I know, from earlier posts, that you're across SQL injection issues.
Be aware that, between reading data from the web and supplying it to SQL, there may be other code injection issues.
Without knowing anything more about your code, that's something you'll need to assess for yourself:
I didn't include any validation; but you should.
"As for the database storage..."
It looks like most of that would be covered by "If we need extra information ... we ask for it ...".
The majority wouldn't be covered by user input anyway (e.g. nickname, preferred names).
Again, something for you to determine using the same principles as above
(i.e. limited regex capture, code injection & validation).
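As a rough illustration of the validation step, here is a sketch that assumes the input has already been decoded to characters. The `validate_name` helper and the 100-character limit are assumptions of mine; the allowed punctuation is the same three characters as in the regex above.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Assumed upper bound; pick whatever suits your schema.
my $MAX_LEN = 100;

# Hypothetical helper: returns a cleaned-up name, or undef if rejected.
sub validate_name {
    my ($input) = @_;
    return unless defined $input;

    my $name = NFC($input);        # one canonical form for accented letters
    $name =~ s/\s+/ /g;            # collapse runs of whitespace
    $name =~ s/^ | $//g;           # trim

    return if $name eq '' or length($name) > $MAX_LEN;

    # Letters, combining marks and the three allowed punctuation
    # characters, in words separated by single spaces.
    return unless $name =~ /\A[\p{L}\p{M}'_-]+(?: [\p{L}\p{M}'_-]+)*\z/;

    return $name;
}

for my $try ("  Zo\x{EB}   Smythe ", "Robert'); DROP TABLE names;--") {
    my $ok = validate_name($try);
    print defined $ok ? "accepted\n" : "rejected\n";
}
```

This only limits what gets through; it is not a substitute for placeholders when the value reaches SQL.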
Re: Split first and last names
by soonix (Canon) on Nov 10, 2022 at 08:19 UTC
|
Re: Split first and last names
by LanX (Sage) on Nov 09, 2022 at 23:11 UTC
|
> but is there a "better" way to do it?
You could start with two input fields instead of trying to parse one.
Do you know the famous author "Orson Scott Card"?
Whenever I stumble over him, my neural parser puzzles again over which name category the "Scott" belongs to.
(I've looked it up already, every time...)
And names can get much harder than this ... see also Names_of_Sun_Yat-sen
> The obvious problem is that it fails with extended characters such as Zoë.
Well, the obvious issue is encoding.
Could be your HTML/HTTP settings or your script, or both.
Unicode rulez...
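To make the encoding point concrete, here is a small sketch. It assumes the name arrives as raw UTF-8 octets, which is a guess on my part since the original script isn't shown, and `first_word` is a made-up helper.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: the leading run of letters in a string.
sub first_word {
    my ($str) = @_;
    my ($word) = $str =~ /^(\p{L}+)/;
    return $word;
}

my $bytes = "Zo\xC3\xAB Smythe";        # "Zoë Smythe" as UTF-8 octets
my $chars = decode('UTF-8', $bytes);    # the same name as characters

printf "octets: %d, characters: %d\n", length($bytes), length($chars);

# Matched against the undecoded octets, each byte is treated as a
# separate Latin-1 character, so the result is not the name "Zoë".
print "decoded match is Zoe-with-diaeresis: ",
    (first_word($chars) eq "Zo\x{EB}" ? "yes" : "no"), "\n";
print "undecoded match is Zoe-with-diaeresis: ",
    (first_word($bytes) eq "Zo\x{EB}" ? "yes" : "no"), "\n";
```

Decode on the way in, encode on the way out, and the regex character classes have a chance of doing what you expect.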
|
> Unicode rulez...
Yes indeed...
Instead of using the English spelling of my firstname on Facebook - Ian - I dot the first letter so it becomes İan. English speakers don't generally notice but I am learning Turkish and I am part of a few Turkish-speaking Facebook groups.
Unfortunately, LAN is a rude word in Turkish and Turks see my firstname and it causes them great amusement. After the sixth time it stopped being funny to me... Strangely, whilst English speakers pay no attention to the dot, Turkish speakers see it completely differently and don't associate it in the same way. So it solves that problem, but it does mean I need to be careful with code that reads names from Facebook, which at least one of our site logins does.
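In case it helps anyone else, this is the sort of care I mean. It's only a sketch: `match_key` is a hypothetical helper for comparing names, and stripping combining marks is my own assumption about what counts as "the same name".

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: build a key for *matching* names only.
# Never store or display this in place of the name as entered.
sub match_key {
    my ($name) = @_;
    my $key = NFD($name);      # U+0130 decomposes to "I" + U+0307
    $key =~ s/\p{Mn}+//g;      # drop combining marks such as U+0307
    return lc $key;
}

my $facebook = "\x{130}an";    # dotted capital I, then "an"
my $typed    = "Ian";

# Plain lc() is not enough: lc of U+0130 is "i" followed by U+0307.
printf "lc equal: %s\n",
    (lc($facebook) eq lc($typed) ? "yes" : "no");
printf "match_key equal: %s\n",
    (match_key($facebook) eq match_key($typed) ? "yes" : "no");
```

The same key would also fold Zoë to "zoe", so it is only suitable for lookups, not for anything shown to the person.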