can split() use a regex?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: can split() use a regex? by BrowserUk (Patriarch) on Jun 17, 2006 at 16:07 UTC
If you don't want to discard the number, you'd need to use capture brackets: `my ($one, $two) = split( /(\d+)\s+/, $line, 2);` Probably easier to use m//: `my ($one, $two) = $line =~ m[(\d+)\s+(.*)$];`
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.
Re^2: can split() use a regex? by bart (Canon) on Jun 18, 2006 at 11:33 UTC
Make that `my ($one, $two, $three) = split( /(\d+)\s+/, $line, 2);` The number itself will be put into $two, what comes before it into $one, and what comes after it (except for the leading whitespace, which is dropped) into $three. So yes, if you use capturing parens in the regex for split, you'll get more return values. From the docs: "If the PATTERN contains parentheses, additional array elements are created from each matching substring in the delimiter." `split(/([,-])/, "1-10,20", 3);` produces the list value `(1, '-', 10, ',', 20)`. I recommend using better variable names.
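To see the three return values for yourself, here is a minimal sketch. The sample `$line` is made up for illustration and is not the OP's data.

```perl
use strict;
use warnings;

# Hypothetical sample line: a number, whitespace, then free text.
my $line = '1234 this is the remaining string';

# The capture brackets add the matched number to the returned list.
# LIMIT counts only the ordinary fields, so a LIMIT of 2 still yields
# three values: the empty text before the number, the number itself,
# and everything after the separator.
my @parts = split /(\d+)\s+/, $line, 2;

print join('', map { "[$_]" } @parts), "\n";
# prints [][1234][this is the remaining string]
```

The empty first element is the field in front of the match at the very start of the string, which is why three variables are needed on the left-hand side.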
Re^3: can split() use a regex? by BrowserUk (Patriarch) on Jun 18, 2006 at 11:40 UTC
Right++ I forgot about the implicit null match at the beginning, but then I'd always use m// for this anyway :)
Re: can split() use a regex? by bobf (Monsignor) on Jun 17, 2006 at 19:40 UTC
As other monks have mentioned, you can certainly use a regex in split. If you want to divide a string into parts based on field length rather than a specific delimiter (pattern), however, you can also use unpack.
`use strict;
use warnings;
my $string = '1234 this is the remaining string';
# A4 = take the first 4 ASCII characters
# x  = skip the next byte
# A* = take all remaining ASCII characters
my ( $num, $text ) = unpack( 'A4xA*', $string );
print "[$num][$text]\n";`
Prints: `[1234][this is the remaining string]`
pack and unpack can be confusing. Pack/Unpack Tutorial (aka How the System Stores Data) is a great tutorial to get you started, and Super Search will help you find other references.
Update: unpack significantly outperforms both split and the regex approach, as shown below (I used BrowserUk's code in Re: can split() use a regex? for the other two approaches):
`use strict;
use warnings;
use Benchmark qw( cmpthese );
my $string = '1234 this is the remaining string';
cmpthese( -5, {
    unpack => sub { unpack( 'A4xA*', $string ) },
    regex  => sub { $string =~ m[(\d+)\s+(.*)$] },
    split  => sub { split( /(\d+)\s+/, $string, 2 ) },
} );`
Results:
`           Rate  split  regex unpack
split   789472/s     --   -44%   -70%
regex  1422218/s    80%     --   -45%
unpack 2594984/s   229%    82%     --`
HTH
Re: can split() use a regex? by Fletch (Bishop) on Jun 17, 2006 at 15:14 UTC
You want to give a limit to the number of fields to split. In your case you want to split on whitespace and limit to 2 fields (discarding the whitespace).
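A minimal sketch of that suggestion, assuming a line shaped like "number, whitespace, text" (the sample `$line` and the variable names are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical sample line, not the OP's actual data.
my $line = '1234 this is the remaining string';

# Split on whitespace, but stop after two fields: the first run of
# whitespace is discarded and the rest of the line is left untouched.
my ($num, $text) = split /\s+/, $line, 2;

print "[$num][$text]\n";
# prints [1234][this is the remaining string]
```

Without the LIMIT of 2, every word of the tail would come back as a separate field.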
Re: can split() use a regex? by eXile (Priest) on Jun 17, 2006 at 15:18 UTC
If you really wanted to split on '\d+\s+' you'd need to single-quote the regex, i.e.: `my ($one, $two) = split('\d+\s+', $line);`
Re^2: can split() use a regex? by davidrw (Prior) on Jun 17, 2006 at 16:02 UTC
or use the normal `//` (IMHO better, because it clearly designates it as a regex to the reader): `my ($one, $two) = split( /\d+\s+/, $line);` BUT you need the delimiter as $one, so you need to capture it: `my ($one, $two) = split( /(\d+\s+)/, $line);` BUT that will have the space in $one, so use a look-behind (see perlre) instead, and add the LIMIT: `my ($one, $two) = split /(?<=\d)\s+/, $line, 2;` So to the OP: yes, it's possible with split and a regex ;)
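A minimal sketch of the look-behind variant. perl rejects an unbounded `\d+` inside a look-behind, so this uses a single `\d`, which is enough to require a digit right before the whitespace. Both sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample line, not the OP's actual data.
my $line = '1234 this is the remaining string';

# Split only on whitespace that directly follows a digit. The digit is
# asserted by the look-behind but not consumed, so it stays in $one.
my ($one, $two) = split /(?<=\d)\s+/, $line, 2;
print "[$one][$two]\n";
# prints [1234][this is the remaining string]

# Whitespace that does not follow a digit is not a separator here.
my @other = split /(?<=\d)\s+/, 'id 1234 rest', 2;
print join('', map { "[$_]" } @other), "\n";
# prints [id 1234][rest]
```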
Re: can split() use a regex? by Moron (Curate) on Jun 19, 2006 at 09:36 UTC
To my mind, with the exception of the suggestion to use unpack and the one to use m//, both of which should work, the rest of the above doesn't DWYM. If you want the number to be transferred to the first variable, split won't do that, because it excludes as delimiters whatever matches the regexp. Normally the bracketed part of the regexp would transfer matches to $1, $2 etc., NOT to the list returned by split, by the way. Unfortunately, in the case of running the regexp through split, this will cause match variables to be rendered undefined at the point where split returns. There might be some nasty trick to force split to abort and leave match variables intact, but I can hardly recommend such an approach. The OP's implied alternative for regexping without split, when coded correctly, would look something like: `$line =~ /^(\d{4})(.*)$/ or die "unexpected content at line $.\n"; my ( $one, $two ) = ( $1, $2 );` -M Free your mind*
Re: can split() use a regex? by Maroder (Initiate) on Jun 19, 2006 at 13:05 UTC
Yes, it is possible: `($one,$two) = (split /\d+\s+/)[0,1];`
Re: can split() use a regex? by Irrelevant (Sexton) on Jun 20, 2006 at 09:57 UTC
I would instinctively use a regex with captures for this. In most cases, I would reserve `split()` for when I had many (or an arbitrary number of) similarly-delimited fields. Intuitive regex solution: `my ($number, $tail) = ($line =~ /^(\d{4}) (.*)/) or die "no match";` bobf's suggestion of using `unpack()` would be more efficient, but it doesn't validate as it goes: it would split "ABCDEFGHIJK" into "ABCD" and "FGHIJK" without batting a proverbial eyelid. It's good if you trust your data, but it'd be premature optimisation to use it over a regex otherwise, IMO.
`use subs map{uc,lc}"a".."z";AUTOLOAD{map{print/j\|p/ ?uc:lc}${(caller!1)[3]}=~/.$/g;v32}(S.t)->(U\j),n(A ),(e,l~R)->(p!r->(E&O\|H~t)),(E,r,q)((.))->(k\H,a^c)`
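The validation point can be checked with a short sketch. It assumes the `'A4xA*'` template and the `/^(\d{4}) (.*)/` pattern discussed in this thread; the sample lines and the `@report` array are made up for illustration.

```perl
use strict;
use warnings;

my @report;

# One well-formed line and one that has no number or space at all.
for my $line ('1234 this is the remaining string', 'ABCDEFGHIJK') {
    # unpack cuts by position only: 4 characters, skip 1, take the rest.
    my ($head, $tail) = unpack 'A4xA*', $line;

    # The regex only succeeds if the line really is "4 digits, space, text".
    my $verdict = $line =~ /^(\d{4}) (.*)/ ? 'match' : 'no match';

    push @report, "unpack: [$head][$tail]  regex: $verdict";
}

print "$_\n" for @report;
# prints:
# unpack: [1234][this is the remaining string]  regex: match
# unpack: [ABCD][FGHIJK]  regex: no match
```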


"be consistent"