PerlMonks  

Why split function treats single quotes literals as regex, instead of a special case?

by likbez (Sexton)
on Aug 14, 2020 at 02:21 UTC ( [id://11120703]=perlquestion )

likbez has asked for the wisdom of the Perl Monks concerning the following question:

It looks like Perl's split function treats a single-quoted literal in a way that is semantically inconsistent with other constructs.

But not always :-). For example

($line)=split(' ',$line,1);
is treated consistently (in the AWK way). This is the only way I know to avoid using a regex for the very common task of trimming leading blanks.
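
A minimal sketch of the idiom above, assuming Perl 5.10 or later (for the // operator). The helper names ltrim_split and ltrim_regex are hypothetical and exist only for illustration. With LIMIT 1 no splitting takes place, but the leading whitespace is still dropped because the pattern is the single-space string; trailing whitespace is left alone.

```perl
use strict;
use warnings;

# Hypothetical helper: trim leading whitespace via split's awk-style
# special case. LIMIT 1 means "return at most one field", so the rest
# of the string (including trailing blanks) comes back unchanged.
sub ltrim_split {
    my ($text) = @_;
    my ($trimmed) = split ' ', $text, 1;
    # split returns an empty list for an empty or all-blank string
    return $trimmed // '';
}

# Hypothetical helper: the conventional regex version, for comparison.
sub ltrim_regex {
    my ($text) = @_;
    (my $trimmed = $text) =~ s/^\s+//;
    return $trimmed;
}

my $line = "   head xxx tail  ";
print '[', ltrim_split($line), "]\n";   # [head xxx tail  ]
print '[', ltrim_regex($line), "]\n";   # [head xxx tail  ]
```

Both helpers give the same result here; the split form simply avoids writing a pattern.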

In general, the split function should behave differently if the first argument is a string and not a regex. But right now a single-quoted literal is treated as a regular expression. For example:

$line="head xxx tail";
say split('x+',$line);
will print
head  tail
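
For readers who want the literal-string behavior with split as it works today, one possible sketch uses quotemeta (equivalent to \Q...\E inside a pattern) to escape the metacharacters first. The sample string and the helper name split_literal are assumptions made for illustration, not part of the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: split on a separator taken literally,
# by escaping any regex metacharacters before split compiles it.
sub split_literal {
    my ($sep, $text) = @_;
    my $quoted = quotemeta $sep;
    return split /$quoted/, $text;
}

my $line = "head x+ xxx tail";

# The quoted string 'x+' is compiled as a pattern: one or more "x".
my @as_regex = split 'x+', $line;

# Here the separator is the literal two-character string "x+".
my @as_literal = split_literal('x+', $line);

print join('|', @as_regex), "\n";     # head |+ | tail
print join('|', @as_literal), "\n";   # head | xxx tail
```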

Am I missing something? BTW this would be similar to Python distinguishing between split and re.split but in a more elegant, Perlish way. And a big help for sysadmins.

Replies are listed 'Best First'.
Re: Why split function treats single quotes literals as regex, instead of a special case?
by jwkrahn (Abbot) on Aug 14, 2020 at 03:33 UTC

    The single space character is a special case for split; anything else is treated as a regular expression, be it a string, a function call, etc.

    Regular expressions are also treated a bit differently than regular expressions in qr//, m// and s///.

      The single space character is a special case for split ...
      I.e., per split:
      As another special case, split emulates the default behavior of the command line tool awk when the PATTERN is either omitted or a string composed of a single space character (such as ' ' or "\x20", but not e.g. / /). In this case, any leading whitespace in EXPR is removed before splitting occurs, and the PATTERN is instead treated as if it were /\s+/; in particular, this means that any contiguous whitespace (not just a single space character) is used as a separator.
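
The difference described in the quoted documentation can be seen in a short sketch. The sample text is an assumption chosen to have leading and repeated blanks.

```perl
use strict;
use warnings;

my $text = "  alpha   beta gamma";

# Single-space string: awk emulation. Leading whitespace is dropped,
# then the string is split on runs of whitespace.
my @awk = split ' ', $text;

# A real pattern matching exactly one space: every space is a separator,
# so leading and repeated spaces produce empty fields.
my @space = split / /, $text;

# /\s+/ without the awk special case: the leading run of whitespace
# still produces one empty leading field.
my @ws = split /\s+/, $text;

print scalar(@awk),   ": ", join('|', @awk),   "\n";   # 3: alpha|beta|gamma
print scalar(@space), ": ", join('|', @space), "\n";   # 7: ||alpha|||beta|gamma
print scalar(@ws),    ": ", join('|', @ws),    "\n";   # 4: |alpha|beta|gamma
```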
      You also write:
      Regular expressions are also treated a bit differently than regular expressions in qr//, m// and s///.
      I don't understand this statement. Can you elaborate?


      Give a man a fish:  <%-{-{-{-<

        The regular expression // works differently in split than elsewhere:

        $ perl -le'
          my $x = "1234 abcd 5678";
          print $& if $x =~ /[a-z]+/;
          print $& if $x =~ //;
          print map qq[ "$_"], split /[a-z]+/, $x;
          print map qq[ "$_"], split //, $x;
        '
        abcd
        abcd
         "1234 " " 5678"
         "1" "2" "3" "4" " " "a" "b" "c" "d" " " "5" "6" "7" "8"

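        Also, the line anchor /^/ doesn't require the /m option to match at the start of each line in a string: split treats /^/ as if it were /^/m. (The split documentation describes this for /^/ only; /$/ still needs /m.)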
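
A small sketch of the /^/ case, assuming a three-line sample string: in split the bare /^/ behaves as /^/m, while in an ordinary match it anchors only at the start of the string unless /m is given.

```perl
use strict;
use warnings;

my $text = "one\ntwo\nthree\n";

# In split, /^/ is treated as /^/m, so the string is cut at the start
# of every line and each field keeps its newline.
my @lines = split /^/, $text;

# In an ordinary match, ^ without /m anchors only at the start of the string.
my @plain = ($text =~ /^(\w+)/g);

# With /m, ^ also matches after each embedded newline.
my @multi = ($text =~ /^(\w+)/mg);

print scalar(@lines), "\n";       # 3
print join(',', @plain), "\n";    # one
print join(',', @multi), "\n";    # one,two,three
```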

Re: Why split function treats single quotes literals as regex, instead of a special case?
by Anonymous Monk on Aug 14, 2020 at 10:02 UTC
Re: Why split function treats single quotes literals as regex, instead of a special case?
by perlfan (Vicar) on Aug 14, 2020 at 16:51 UTC
    >Am I missing something?

    Yes, this is Perl not Python.

    >Why?

    I can assert that contextually, splitting on all characters for split //, $string is a lot more meaningful than splitting on nothing and returning just the original $string. The big surprise actually happens for users (like me) who don't realize the first parameter of split is a regular expression. But that surprise quickly turns into joy.

    >In general, split function should behave differently if the first argument is string and not a regex.

    Should? That's pretty presumptuous. You'll notice that Perl has FAR fewer built-in functions (particularly string functions) than PHP, JavaScript, or Python. This is because they've all been generalized away into regular expressions. You must also understand that the primary design philosophy is more related to spoken linguistics than written code. The implication here is that humans are lazy and don't want to learn more words than they need to communicate - not true of all humans, of course. But true enough for 99% of them. This is also reflected in the Huffmanization of most Perl syntax. This refers to Huffman coding, which assigns the shortest symbols to the most frequently used things (characters, words, etc). I mean Perl isn't APL, but certainly gets this idea from it.

    The balkanization of built-in functions that are truly special cases of a general case is against any philosophical underpinnings that Perl follows. I am not saying it's perfect, but it is highly resistant to becoming a tower of babble. If that's your interest (not accusing you of being malicious), there are more fruitful avenues to attack Perl. Most notably, the areas of object orientation and threading. But you'll have pretty much zero success convincing anyone who has been around Perl for a while that the approach to split is incorrect.

    Oh, also a string (as you're calling it) is a regular expression in the purest sense of the term. It's best described as a concatenation of a finite set of symbols in fixed ordering. For some reason a lot of people think this regex magic is only present in patterns that may have no beginning or no end, or neither. In your case it just happens to have both. Doesn't make it any less of a regular expression, though.

      The balkanization of built-in functions that are truly special cases of a general case is against any philosophical underpinnings that Perl follows. I am not saying it's perfect, but it is highly resistant to becoming a tower of babble. If that's your interest (not accusing you of being malicious), there are more fruitful avenues to attack Perl

      I respectfully disagree. Perl philosophy states that there should be shortcuts for special cases if they are used often. That's the idea behind suffix conditionals (  return if (index($line,'EOL')>-1) ) and the bash-style if statement ( ($debug) && say $line; )

      You are also missing the idea. My suggestion is that we can enhance the power of Perl by treating a single-quoted string differently from a regex in split, and do this without adding to balkanization.

      Balkanization of built-ins is generally what Python got by having two different functions. Perl can avoid this by providing the same functionality with a single function. That's the idea.

      And my point is that this particular change requires minimal work in the interpreter, as it already treats ' ' in a special way (the AWK way).

      So this is a suggestion for improving the language, not for balkanization, IMHO. And intuitively it is logical as people understand (and expect) the difference in behavior between single quoted literals and regex in split. So, in a way, the current situation can be viewed as a bug, which became a feature.

        >So, in a way, the current situation can be viewed as a bug, which became a feature.

        To be fair, this is a lot of perl. But I can't rightfully assert that this behavior was unintentional, in fact it appears to be very intentional (e.g., awk emulation).

        >You also are missing the idea.

        My understanding is that you wish for "strings" (versus "regexes") to invoke the awk behavior of trimming leading white space. Is that right? I'm not here to judge your suggestion, but I can easily think of several reasons why adding another special case to split is not a great idea.

        All I can say is you're the same guy who was looking for the trim method in Perl. If that's not a red flag for being okay with balkanization, I don't know what is.

        Finally, I must reiterate. A "string" is a regular expression. The single quoted whitespace is most definitely a special exception since it is also a regular expression. You're recommending not only removing one regex from the pool of potential regexes, but an entire class of them available via quoting - i.e., fixed length strings of a fixed ordering. I am not sure how this is really a suggestion of making all quoted things not be regexes, because then how do you decide if it is "regex" or not? (maybe use a regex? xD)

Node Type: perlquestion [id://11120703]
Approved by Athanasius