PerlMonks  

Split confusion

by swampyankee (Parson)
on Jun 03, 2020 at 15:09 UTC (#11117653)

swampyankee has asked for the wisdom of the Perl Monks concerning the following question:

I'm getting back to Perl after about a decade, so I am, at times, running into cluelessness issues. One I just had is some confusion with split. I have to process surnames, some of which contain hyphens or whitespace. If the name contains neither whitespace nor a hyphen, I want to change it to proper case (ucfirst(lc($_))), but if it's hyphenated or has embedded whitespace, I want to split it at either the hyphen or the whitespace, upcase the first letter of each chunk, and rejoin it.

The way (doubtless close to pessimal) that I'm using right now is something like this:

    my $name = $record[0]; # it's being pulled from a roster written (for some completely inane reason) in ALL CAPS)
    if($name =~ /-| /) {
        my @temp = split(/(-| )/,$name) {
        ucfirst(lc($_)) foreach(@temp);
        $name='';
        $name =. $_ foreach(@temp);
    }
    else {
        $name = ucfirst(lc($name));
    }

I was expecting a name (say SMITH-JONES) to be divided into three pieces: "SMITH", "-", "JONES". This is not what happened: I got "S","","M","","I","","T","","H","-","J","","O","","N","","E","","S". What did I do wrong? The split's documentation seems to say that / / doesn't split between every character, but " " does.

Information about American English usage here and here. Floating point issues? Please read this before posting. — emc

Replies are listed 'Best First'.
Re: Split confusion
by AnomalousMonk (Bishop) on Jun 03, 2020 at 16:27 UTC

    If you're processing a list of upper-cased names and want to turn them into "properly" cased names, why not just do that?

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my @names = (
       'JAMES SMITH-JONES',
       'BOB SMITH-SMYTHE-SMITH',
       'J. JONAH JAMESON',
       'BILLY BOB THORNTON',
       );
     ;;
     for my $name (@names) {
        printf qq{'$name' -> };
        $name =~ s{ \b ([[:upper:]]+) \b }{\u\L$1}xmsg;
        print qq{'$name'};
        }
    "
    'JAMES SMITH-JONES' -> 'James Smith-Jones'
    'BOB SMITH-SMYTHE-SMITH' -> 'Bob Smith-Smythe-Smith'
    'J. JONAH JAMESON' -> 'J. Jonah Jameson'
    'BILLY BOB THORNTON' -> 'Billy Bob Thornton'

    Update: See also Falsehoods Programmers Believe About Names.


    Give a man a fish:  <%-{-{-{-<

      If you're processing a list of upper-cased names and want to turn them into "properly" cased names, why not just do that?

      See also Falsehoods Programmers Believe About Names.

      It simply does not work correctly:

      #30 There exists an algorithm which transforms names and can be reversed losslessly.

      X:\>perl oops.pl
      'JAMES SMITH-JONES' -> 'James Smith-Jones''BOB SMITH-SMYTHE-SMITH' -> 'Bob Smith-Smythe-Smith''J. JONAH JAMESON' -> 'J. Jonah Jameson''BILLY BOB THORNTON' -> 'Billy Bob Thornton''LUDWIG VAN BEETHOVEN' -> 'Ludwig Van Beethoven'
      X:\>type oops.pl
      my @names = (
         'JAMES SMITH-JONES',
         'BOB SMITH-SMYTHE-SMITH',
         'J. JONAH JAMESON',
         'BILLY BOB THORNTON',
         'LUDWIG VAN BEETHOVEN',
         );
       ;;
       for my $name (@names) {
          printf qq{'$name' -> };
          $name =~ s{ \b ([[:upper:]]+) \b }{\u\L$1}xmsg;
          print qq{'$name'};
          }

      X:\>

      Ol' Ludwig needs a lower case 'v' in his name. Quoting Wikipedia:

      The prefix van to the surname "Beethoven" reflects the Flemish origins of the family; the surname suggests that "at some stage they lived at or near a beet-farm".

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

        ... "properly" ...   Nobody ever mentioned lossless reversal. :)


        Give a man a fish:  <%-{-{-{-<

      Thank you.

      I was considering using a regex, but those seem to be the first clues I've lost.

      While I'm processing names in a formulaic manner, I actually know how they write them at least when using some extended version of the Roman alphabet (some of my students have names that are transliterated from Arabic, Serbian, and Macedonian). The names were upcased by the software from the system producing the reports.


      31. I can safely assume that this dictionary of bad words contains no people's names in it.
      My theorem is this:
      For any name of any person, there is a language in which it is a swearword.

        I have the same theory about car models

      Heh, #28 . . . Qapla'

      The cake is a lie.
      The cake is a lie.
      The cake is a lie.

Re: Split confusion
by hippo (Chancellor) on Jun 03, 2020 at 15:19 UTC

    Your split expression as written works for me:

    $ perl -E 'say join ":", split /(-| )/, "SMITH-JONES";'
    SMITH:-:JONES

    Can you provide an SSCCE?

    Update: There's a lot in your supplied code which looks suspiciously incorrect, I'm afraid. The open brace after the split has no matching closing brace and =. is not the concatenation operator, for example.
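
    A sketch of how the posted snippet could be repaired while keeping its shape. The sub name proper_case is hypothetical (the original read from $record[0] instead of taking an argument). Besides the stray brace and the =. typo noted above, the bare ucfirst(lc($_)) statement computes a value and throws it away, so this version assigns it back through the loop alias:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the logic from the original post.
sub proper_case {
    my ($name) = @_;
    if ($name =~ /-| /) {
        # The capturing group makes split return the separators as well.
        my @temp = split /(-| )/, $name;
        # $_ aliases each element, so assigning to it updates @temp in place.
        $_ = ucfirst(lc($_)) foreach @temp;
        $name = '';
        $name .= $_ foreach @temp;    # .= appends; =. is a syntax error
    }
    else {
        $name = ucfirst(lc($name));
    }
    return $name;
}

print proper_case($_), "\n" for 'SMITH-JONES', 'DE LA CRUZ', 'OSWALD';
```

    With the separators kept in @temp, joining the pieces back together restores the hyphens and spaces unchanged.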

Re: Split confusion
by kcott (Bishop) on Jun 04, 2020 at 08:30 UTC

    G'day swampyankee,

    "I was expecting ... "SMITH", "-", "JONES" ... I got "S","","M","", ... What did I do wrong?"

    The documentation for split describes what happens with capturing. The section at the end (starting with "If the PATTERN contains capturing groups, ...") has a description followed by several examples.

    Here's your regex without capturing:

    $ perl -E 'say "|$_|" for split /-| /, "A B-C"'
    |A|
    |B|
    |C|

    Now with capturing (and what I think you intended):

    $ perl -E 'say "|$_|" for split /(-| )/, "A B-C"'
    |A|
    | |
    |B|
    |-|
    |C|

    If you coded /(-|)/ instead of /(-| )/, you would get the output you're seeing:

    $ perl -E 'say "|$_|" for split /(-|)/, "A B-C"'
    |A|
    ||
    | |
    ||
    |B|
    |-|
    |C|

    That, of course, is just a guess; however, given other issues (already noted by hippo) in your posted code, possibly a good guess.
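
    The three cases can also be put side by side in one script. This is only a sketch; show is a hypothetical helper that prints the fields on one line in the same |...| style as the one-liners:

```perl
use strict;
use warnings;

# Hypothetical helper: wrap each field in bars so empty fields stay visible.
sub show { return join ' ', map { "|$_|" } @_ }

my $str = 'A B-C';

# No capture group: separators are discarded.
print 'no capture:   ', show(split /-| /, $str), "\n";

# Capture group: each separator comes back as its own field.
print 'with capture: ', show(split /(-| )/, $str), "\n";

# Empty alternative: the pattern can match between characters,
# and the capture then returns an empty string for those matches.
print 'empty branch: ', show(split /(-|)/, $str), "\n";
```

    The empty-string fields in the last line are the captured zero-length separators, which is what the "S","","M","",... result in the question looks like.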

    "The split's documentation seems to say that / / doesn't split between every character, but " " does."

    I expect you've misread or misunderstood something. Had you quoted the text that you thought seems to say what you suggest, I could comment further. There can be errors in documentation and those errors can be fixed; perhaps there's not an error but a clarification of the current text would help — obviously, the source of the confusion needs to be identified as a first step.

    Anyway, neither / / nor " " will "split between every character":

    $ perl -E 'say "|$_|" for split / /, "A B-C"'
    |A|
    |B-C|
    $ perl -E 'say "|$_|" for split " ", "A B-C"'
    |A|
    |B-C|

    Without the spaces, both will "split between every character":

    $ perl -E 'say "|$_|" for split //, "A B-C"'
    |A|
    | |
    |B|
    |-|
    |C|
    $ perl -E 'say "|$_|" for split "", "A B-C"'
    |A|
    | |
    |B|
    |-|
    |C|

    Regardless, I don't see how / / or " " relate to /(-| )/.
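
    For what it's worth, the split documentation does single out a PATTERN that is the one-space string " ": it splits on runs of whitespace and skips leading whitespace, whereas / / splits on each individual space. Whether that is the passage the question had in mind is a guess. The difference only shows up with leading or repeated spaces, as in this small sketch (show is a hypothetical helper, and the input string is made up):

```perl
use strict;
use warnings;

# Hypothetical helper: wrap each field in bars so empty fields stay visible.
sub show { return join ' ', map { "|$_|" } @_ }

my $str = '  A  B-C';    # two leading spaces, two spaces after the A

# " " is the special case: leading whitespace is skipped, runs are collapsed.
print 'split " ": ', show(split ' ', $str), "\n";    # |A| |B-C|

# / / is an ordinary pattern: every single space is a separator.
print 'split / /: ', show(split / /, $str), "\n";    # || || |A| || |B-C|
```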

    — Ken

Re: Split confusion
by bliako (Prior) on Jun 03, 2020 at 21:15 UTC

    Lingua::EN::NameParse and Text::Names can be of help here.

    use Lingua::EN::NameParse qw(clean case_surname);
    use Text::Names;
    use strict;
    use warnings;

    # optional configuration arguments
    my %args = (
        auto_clean      => 1,
        lc_prefix       => 1,
        initials        => 3,
        allow_reversed  => 1,
        joint_names     => 0,
        extended_titles => 0
    );

    my $parser = Lingua::EN::NameParse->new(%args);
    my $error;
    for my $input (
        'JAMES SMITH-JONES',
        'BOB SMITH-SMYTHE-SMITH',
        'J. JONAH JAMESON',
        'BILLY BOB THORNTON'
    ){
        print "\n\ninput name is '$input'\n";
        $error = $parser->parse($input);
        die "error: $error" if $error;
        print $parser->report;
        my $name3 = Text::Names::cleanName($input);
        print "name3: $name3\n";
    }

    btw if you are inserting names into DB :) xkcd#327

    bw, bliako

Re: Split confusion
by davido (Cardinal) on Jun 03, 2020 at 22:07 UTC

    You can't really safely apply a set of capitalization rules to a person's name. You pretty much have to take it as they write it. And when you lose that information, you can't re-generate it. How to capitalize author names explains that de, d', van, and von may not be capitalized. So James Van Den Berghe could be spelled as I have done, or it could be James van den Berghe, or there could be some other magic combination. And some names defy all conventions.

    For your problem statement I would do this:

    #!/usr/bin/env perl

    use strict;
    use warnings;

    my @names = (
        'VAN DEN BERGHE',
        'OSWALD',
        'ANDERSON',
        'LLOYD-WRIGHT',
    );

    foreach my $name (@names) {
        my $altered = join('', map {ucfirst(lc($_))} split /(\s+|-)/, $name);
        print "$name => $altered\n";
    }

    which produces:

    VAN DEN BERGHE => Van Den Berghe
    OSWALD => Oswald
    ANDERSON => Anderson
    LLOYD-WRIGHT => Lloyd-Wright

    But that doesn't make any attempt at dealing with the nuances discussed above.
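
    If a few known exceptions are worth handling in code, one possible variation is to keep an explicit list of particles that stay lower case. This is a sketch only: the particle list is an illustrative assumption, not a rule, the sub name is hypothetical, and it will be wrong for anyone who writes their own name as "Van" or "De".

```perl
use strict;
use warnings;

# Assumption: an illustrative, incomplete list of particles kept in lower case.
my %lower_particle = map { $_ => 1 } qw(van den der de von);

# Hypothetical helper: same split/map/join approach, with the exception list applied.
sub proper_case_with_particles {
    my ($name) = @_;
    return join '', map {
        my $word = lc;    # lc with no argument operates on $_
        $lower_particle{$word} ? $word : ucfirst $word;
    } split /(\s+|-)/, $name;
}

print "$_ => ", proper_case_with_particles($_), "\n"
    for 'VAN DEN BERGHE', 'LLOYD-WRIGHT', 'LUDWIG VAN BEETHOVEN';
```

    Anything the list gets wrong still has to be corrected by hand, since the original capitalization is not recoverable from the upper-cased input.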


    Dave

      The list I'm case-correcting is from the rosters for several high school classes I teach; there are few enough exceptions to either split/join or regex codes that editing the remaining one or two by hand is trivial.


Re: Split confusion
by perlfan (Priest) on Jun 03, 2020 at 15:21 UTC
    I'm getting back to Perl after about a decade

    Welcome back!

    use strict;
    use warnings;
    use Data::Dumper ();

    my $name = "SMITH-JONES";
    my @temp = split(/(-| )/,$name);
    print Data::Dumper::Dumper(\@temp);

    $name = "SMITH JONES";
    @temp = split(/(-| )/,$name);
    print Data::Dumper::Dumper(\@temp);
    Output:
    $VAR1 = [
              'SMITH',
              '-',
              'JONES'
            ];
    $VAR1 = [
              'SMITH',
              ' ',
              'JONES'
            ];
    Seems to work. Are you sure you have an actual space on the RHS of the |?

    Also, seems like you could be kind to your computer and not save the capture results:

    my @temp = split(/(?:-| )/,$name);
    Final suggestion is to use the exact same regex in the if as you do in the split.
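
    One thing to keep in mind when choosing between the two forms: without the capture group, split no longer returns the separators, so rejoining the pieces with an empty string loses the hyphen or space. A minimal sketch of the difference (rejoin is a hypothetical helper, not code from the question):

```perl
use strict;
use warnings;

# Hypothetical helper: split on the given pattern, proper-case each field,
# and glue the fields back together with nothing in between.
sub rejoin {
    my ($pattern, $name) = @_;
    return join '', map { ucfirst(lc($_)) } split $pattern, $name;
}

# Capturing: the separator is one of the returned fields, so it survives.
print rejoin(qr/(-| )/, 'SMITH-JONES'), "\n";      # Smith-Jones

# Non-capturing: the separator is dropped by split.
print rejoin(qr/(?:-| )/, 'SMITH-JONES'), "\n";    # SmithJones
```

    The non-capturing form fits when the separator is known and can be passed to join; the capturing form fits when a name may mix hyphens and spaces.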
