PerlMonks  

can split() use a regex?

by Anonymous Monk
on Jun 17, 2006 at 15:13 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a line that starts with a 4-digit number (no decimals or commas), then a space, then the remaining string.

I was wondering if it's possible to split() it using a regex-type function to get the number saved and then everything else.

my ($one, $two) = split("\d+\s+", $line);

Is this possible? I know I could just regex this baby out but thought I'd ask.

Replies are listed 'Best First'.
Re: can split() use a regex?
by BrowserUk (Patriarch) on Jun 17, 2006 at 16:07 UTC

    If you don't want to discard the number, you'd need to use capture brackets

    my ($one, $two) = split( /(\d+)\s+/, $line, 2);

    Probably easier to use m//:

    my ($one, $two) = $line =~ m[(\d+)\s+(.*)$];

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Make that
      my ($one, $two, $three) = split( /(\d+)\s+/, $line, 2);
      The number itself will be put into $two, what comes before it into $one, what comes after it (except for the leading whitespace, that is dropped) into $three.

      So yes, if you use capturing parens in the regex for split, you'll get more return values. From the docs:

      If the PATTERN contains parentheses, additional array elements are created from each matching substring in the delimiter.
      split(/([,-])/, "1-10,20", 3);
      produces the list value
      (1, '-', 10, ',', 20)
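Applied to the OP's case, the capture behaviour described above can be sketched as follows. The sample line and the variable names are assumptions made for illustration, not taken from the OP's data.

```perl
use strict;
use warnings;

# Hypothetical sample line in the OP's format: four digits, a space, then text.
my $line = '1234 this is the remaining string';

# The capture group makes split return the number as its own field.
# The limit of 2 stops splitting after the first delimiter; captured
# fields do not count towards that limit.
my ($before, $number, $rest) = split /(\d+)\s+/, $line, 2;

# $before is the empty string that precedes the leading number.
print "[$before][$number][$rest]\n";
```

The empty first field is the "implicit null match at the beginning": the delimiter matches at position 0, so whatever precedes it (nothing) is returned first.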

      I recommend using better variable names.

        Right++ I forgot about the implicit null match at the beginning, but then I'd always use m// for this anyway :)


Re: can split() use a regex?
by bobf (Monsignor) on Jun 17, 2006 at 19:40 UTC

    As other monks have mentioned, you can certainly use a regex in split. If you want to divide a string into parts based on field length rather than a specific delimiter (pattern), however, you can also use unpack.

    use strict;
    use warnings;

    my $string = '1234 this is the remaining string';

    # A4 = take the first 4 ASCII characters
    # x  = skip the next byte
    # A* = take all remaining ASCII characters
    my ( $num, $text ) = unpack( 'A4xA*', $string );

    print "[$num][$text]\n";
    Prints:
    [1234][this is the remaining string]

    pack and unpack can be confusing. Pack/Unpack Tutorial (aka How the System Stores Data) is a great tutorial to get you started, and Super Search will help you find other references.

    Update:
    unpack significantly outperforms both split and the regex approach, as shown below (I used BrowserUk's code in Re: can split() use a regex? for the other two approaches):

    use strict;
    use warnings;
    use Benchmark qw( cmpthese );

    my $string = '1234 this is the remaining string';

    cmpthese( -5, {
        unpack => sub { unpack( 'A4xA*', $string ) },
        regex  => sub { $string =~ m[(\d+)\s+(.*)$] },
        split  => sub { split( /(\d+)\s+/, $string, 2 ) },
    } );
    Results:
                Rate  split  regex unpack
    split   789472/s     --   -44%   -70%
    regex  1422218/s    80%     --   -45%
    unpack 2594984/s   229%    82%     --

    HTH

Re: can split() use a regex?
by Fletch (Bishop) on Jun 17, 2006 at 15:14 UTC

    You want to give a limit to the number of fields to split. In your case you want to split on whitespace and limit to 2 fields (discarding the whitespace).
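A minimal sketch of that suggestion, assuming a sample line in the OP's format (the variable names are made up for the example):

```perl
use strict;
use warnings;

my $line = '1234 this is the remaining string';  # hypothetical sample

# Split on runs of whitespace, but stop after two fields so the
# spaces inside the remaining text are left alone.
my ($number, $rest) = split /\s+/, $line, 2;

print "[$number][$rest]\n";
```

Without the limit, every word of the remaining string would become its own field and only the first word would land in the second variable.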

Re: can split() use a regex?
by eXile (Priest) on Jun 17, 2006 at 15:18 UTC
    If you really wanted to split on '\d+\s+' you'd need to single quote the regex, ie:
    my ($one, $two) = split('\d+\s+', $line);
      or use (and IMHO better because it clearly designates it as a regex to the reader) the normal //:
      my ($one, $two) = split( /\d+\s+/, $line);
      BUT.. you need the delimiter as $one .. so need to capture it:
      my ($one, $two) = split( /(\d+\s+)/, $line);
      BUT .. that will keep the trailing space with the number .. so use a look-behind (see perlre) instead .. (and add the LIMIT). The look-behind has to be fixed-width, so \d rather than \d+:
      my ($one, $two) = split /(?<=\d)\s+/, $line, 2;
      so to OP -- yes, it's possible w/split & regex ;)
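To compare what these variants return, here is a rough sketch on a made-up sample line. It assumes a fixed-width look-behind, (?<=\d), because a variable-length one such as (?<=\d+) is rejected by perl. The array names are hypothetical.

```perl
use strict;
use warnings;

my $line = '1234 this is the remaining string';  # hypothetical sample

# No capture: the number is part of the delimiter and is thrown away.
my @plain = split /\d+\s+/, $line;

# Capture around the whole delimiter: the number comes back with its
# trailing space, after an empty leading field.
my @captured = split /(\d+\s+)/, $line;

# Fixed-width look-behind: only the whitespace is the delimiter.
my @lookbehind = split /(?<=\d)\s+/, $line, 2;

for my $fields (\@plain, \@captured, \@lookbehind) {
    print join('', map { "[$_]" } @$fields), "\n";
}
```

Only the look-behind form puts the bare number in the first field. The other two start with an empty field because the delimiter matches at the very beginning of the line.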
Re: can split() use a regex?
by Moron (Curate) on Jun 19, 2006 at 09:36 UTC
    To my mind, with the exception of the suggestion to use unpack and the one to use m//, both of which should work, the rest of the above doesn't DWYM. If you want the number to be transferred to the first variable, split won't do that, because it excludes as delimiters whatever matches the regexp. Normally the bracketed part of the regexp would transfer matches to $1, $2 etc., NOT to the list returned by split, by the way.

    Unfortunately, in the case of running the regexp through split, this will cause match variables to be rendered undefined at the point where split returns.

    There might be some nasty trick to force split to abort and leave match variables intact, but I can hardly recommend such an approach.

    The OP's implied alternative of using a regex without split would, when coded correctly, look something like:

    # \s+ keeps the separating space out of $two
    $line =~ /^(\d{4})\s+(.*)$/ or die "unexpected content at line $.\n";
    my ( $one, $two ) = ( $1, $2 );

    -M

    Free your mind

Re: can split() use a regex?
by Maroder (Initiate) on Jun 19, 2006 at 13:05 UTC
    Yes, it is possible:
    my ($one, $two) = (split /(\d+)\s+/, $line, 2)[1,2];
Re: can split() use a regex?
by Irrelevant (Sexton) on Jun 20, 2006 at 09:57 UTC

    I would instinctively use a regex with captures for this. In most cases, I would reserve split() for when I had many (or an arbitrary number of) similarly-delimited fields. Intuitive regex solution:

    my ($number, $tail) = ($line =~ /^(\d{4}) (.*)/) or die "no match";

    bobf's suggestion of using unpack() would be more efficient, but it doesn't validate as it goes: it would split "ABCDEFGHIJK" into "ABCD" and "FGHIJK" without batting a proverbial eyelid. It's good if you trust your data, but it'd be premature optimisation to use it over a regex otherwise, IMO.
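A small sketch of that difference, assuming the same 'A4xA*' template bobf used. parse_line is a hypothetical helper name introduced only for this example.

```perl
use strict;
use warnings;

# Returns (number, text), or an empty list when the line does not
# start with four digits followed by a space.
sub parse_line {
    my ($line) = @_;
    return $line =~ /^(\d{4}) (.*)/ ? ($1, $2) : ();
}

# unpack only counts characters, so it accepts malformed input.
my @from_unpack = unpack 'A4xA*', 'ABCDEFGHIJK';

# The regex refuses the same input and accepts a well-formed line.
my @from_regex = parse_line('ABCDEFGHIJK');
my @good       = parse_line('1234 this is the remaining string');

print "unpack: [@from_unpack]\n";
print "regex:  ", ( @from_regex ? "[@from_regex]" : "no match" ), "\n";
print "good:   [$good[0]][$good[1]]\n";
```

Here unpack returns "ABCD" and "FGHIJK" and silently drops the "E", while the regex version returns nothing for the malformed line, so the caller can detect bad input.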

    use subs map{uc,lc}"a".."z";AUTOLOAD{map{print/j|p/ ?uc:lc}${(caller!1)[3]}=~/.$/g;v32}(S.t)->(U\j),n(A ),(e,l~R)->(p!r->(E&O|H~t)),(E,r,q)((.))->(k\H,a^c)

Node Type: perlquestion [id://555979]
Approved by wfsp
Front-paged by grinder