split and join

Regular expressions are used to match delimiters with the split function, to break up strings into a list of substrings. The join function is in some ways the inverse of split. It takes a list of strings and joins them together again, optionally, with a delimiter. We'll discuss split first, and then move on to join.

A simple example...

Let's first consider a simple use of split: split a string on whitespace.

$line = "Bart  Lisa Maggie Marge Homer";
@simpsons = split ( /\s/, $line );  # Splits line and uses single whitespaces
                                    # as the delimiter.

@simpsons now contains "Bart", "", "Lisa", "Maggie", "Marge", and "Homer".

There is an empty element in the list that split placed in @simpsons. That is because \s matches exactly one whitespace character, but in our string, $line, there are two spaces between Bart and Lisa. Split, using single whitespace characters as delimiters, creates an empty string at the point where two whitespace characters sit next to each other. The same goes for leading whitespace, which produces an empty first element. In fact, adjacent delimiters anywhere in the string result in empty strings being returned as part of the list, with one exception: empty fields at the very end of the list are dropped by default.

We can specify a more flexible delimiter that eliminates the creation of an empty string in the list.

@simpsons = split ( /\s+/, $line ); #Now splits on one-or-more whitespaces.

@simpsons now contains "Bart", "Lisa", "Maggie", "Marge", and "Homer". Because the delimiter is now one or more whitespace characters, multiple whitespaces next to each other are consumed as a single delimiter.
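To see the difference between the two patterns side by side, here is a minimal sketch. The input string is a made-up example with a leading space and a double space; it is not taken from the text above. Each field is printed in brackets so the empty strings are visible.

```perl
use strict;
use warnings;

# Hypothetical input: note the leading space and the double space.
my $line = " Bart  Lisa Maggie";

my @single = split /\s/,  $line;   # each whitespace character is a delimiter
my @multi  = split /\s+/, $line;   # a run of whitespace is one delimiter

# Wrap every field in brackets so empty strings show up.
print join( ',', map { "[$_]" } @single ), "\n";   # [],[Bart],[],[Lisa],[Maggie]
print join( ',', map { "[$_]" } @multi ),  "\n";   # [],[Bart],[Lisa],[Maggie]
```

Notice that /\s+/ gets rid of the empty field in the middle, but the leading whitespace still produces an empty first element.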

Where do delimiters go?

"What does split do with the delimiters?" Usually it discards them, returning only what is found to either side of the delimiters (including empty strings if two delimiters are next to each other, as seen in our first example). Let's examine that point in the following example:

$string = "Just humilityanother humilityPerl humilityhacker.";
@japh = split ( /humility/, $string );

The delimiter is something visible: 'humility'. And after this code executes, @japh contains four strings, "Just ", "another ", "Perl ", and "hacker.". 'humility' bit the bit-bucket, and was tossed aside.

Preserving delimiters

If you want to keep the delimiters you can. Here's an example of how. Hint: you use capturing parentheses.

$string = "alpha-bravo-charlie-delta-echo-foxtrot";
@list = split ( /(-)/, $string );

@list now contains "alpha","-", "bravo","-", "charlie", and so on. The parentheses cause the delimiters to be captured into the list assigned to @list, right alongside the stuff between the delimiters.
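Capturing becomes more useful when the delimiter can vary, because then the captured text tells you which delimiter was found. The following is only a sketch; the expression string and the variable names are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical arithmetic string; the character class matches either operator.
my $expr   = "10+20-5";
my @tokens = split /([-+])/, $expr;

# The operators are kept in the list, between the numbers they separated.
print join( ' ', @tokens ), "\n";   # 10 + 20 - 5
```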

The null delimiter

What happens if the delimiter is indicated to be a null string (a string of zero characters)? Let's find out.

$string = "Monk";
@letters = split ( //, $string );

Now @letters contains a list of four letters, "M", "o", "n", and "k". If split is given a null string as a delimiter, it splits on each null position in the string, or in other words, every character boundary. The effect is that the split returns a list broken into individual characters of $string.

Split's return value

Earlier I mentioned that split returns a list. That list, of course, can be stored in an array, and often is. But another use of split is to store its return values in a list of scalars. Take the following code:

@mydata = ( "Simpson:Homer:1-800-000-0000:40:M",
            "Simpson:Marge:1-800-111-1111:38:F",
            "Simpson:Bart:1-800-222-2222:11:M",
            "Simpson:Lisa:1-800-333-3333:9:F",
            "Simpson:Maggie:1-800-444-4444:2:F" );

foreach ( @mydata ) {
    ( $last, $first, $phone, $age ) = split ( /:/ );
    print "You may call $age year old $first $last at $phone.\n";
}

What happened to the person's sex? It's just discarded because we're only accepting four of the five fields into our list of scalars. And how does split know what string to split up? When split isn't explicitly given a string to split up, it assumes you want to split the contents of $_. That's handy, because foreach aliases $_ to each element (one at a time) of @mydata.
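Fields can be skipped in the middle of the list too, by putting undef in the position you don't care about. This is a small sketch reusing one of the records above; the choice to skip the phone number is just for illustration.

```perl
use strict;
use warnings;

my $record = "Simpson:Lisa:1-800-333-3333:9:F";

# undef acts as a placeholder: the phone number is split off but not stored.
my ( $last, $first, undef, $age, $sex ) = split /:/, $record;

print "$first $last, age $age, sex $sex\n";   # Lisa Simpson, age 9, sex F
```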

Words about Context

Put to its normal use, split is used in list context. It may also be used in scalar context, where it returns the number of fields found. In perls older than 5.12, split in scalar context also split into the @_ array, overwriting whatever was there, and that usage was deprecated. It's easy to see why that was not desirable, and thus why using split in scalar context has long been frowned upon.
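If all you want is the number of fields, one approach that avoids the scalar-context question entirely is to split into an array and then ask the array for its size. A minimal sketch, using one of the earlier records:

```perl
use strict;
use warnings;

my $record = "Simpson:Bart:1-800-222-2222:11:M";

my @fields = split /:/, $record;   # ordinary list-context split
my $count  = @fields;              # an array in scalar context yields its size

print "$count fields\n";           # 5 fields
```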

The limit argument

Split can optionally take a third argument, the limit. If you specify one, as in @list = split ( /\s+/, $string, 3 ); then split returns no more than that number of fields, and the last field holds whatever remains of the string, unsplit. So if you combine that with our previous example...

( $last, $first, $everything_else) = split ( /:/, $_, 3 );

Now, $everything_else contains Bart's phone number, his age, and his sex, delimited by ":", because we told split to stop early. If you specify a negative limit value, split treats it as an arbitrarily large limit, and it also keeps any trailing empty fields that would otherwise be dropped.
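The effect of a negative limit is easiest to see on a string that ends in delimiters. The row below is a made-up example, not one of the records used earlier.

```perl
use strict;
use warnings;

# Hypothetical row whose last three fields are empty.
my $row = "a:b:c:::";

my @default = split /:/, $row;       # trailing empty fields are dropped
my @all     = split /:/, $row, -1;   # negative limit keeps them

print scalar(@default), "\n";   # 3
print scalar(@all),     "\n";   # 6
```

That matters when you are reading fixed-layout records and need every column to be present, even the empty ones at the end.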

Unspecified split pattern

As mentioned before, limit is an optional parameter. If you leave limit off, you may also, optionally, choose to not specify the split string. Leaving out the split string causes split to attempt to split the string contained in $_. And if you leave off the split string (and limit), you may also choose to not specify a delimiter pattern.

If you leave off the pattern, split assumes you want to split on whitespace, much as if you had written /\s+/, with one difference: it first skips any leading whitespace, so no empty leading field is produced. It then splits on any run of one or more whitespace characters, and trailing whitespace produces no trailing empty field. Specifying the string literal " " (a quoted space) is the special case that does exactly the same thing as specifying no pattern at all.
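Here is a short sketch of how the quoted-space special case differs from an explicit /\s+/. The padded string is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical input with leading and trailing whitespace.
my $line = "  Bart Lisa  ";

my @awk   = split ' ',   $line;   # special case: leading whitespace is skipped
my @regex = split /\s+/, $line;   # ordinary pattern: leading empty field is kept

print join( ',', map { "[$_]" } @awk ),   "\n";   # [Bart],[Lisa]
print join( ',', map { "[$_]" } @regex ), "\n";   # [],[Bart],[Lisa]
```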

The star quantifier (zero or more)

Finally, consider what happens if we specify a split delimiter of /\s*/. The quantifier "*" means zero or more of the item it is quantifying. So this split can split on nothing (character boundaries) as well as on any amount of whitespace. And remember, delimiters get thrown away. See this in action:

$string = "Hello world!";
@letters = split ( /\s*/, $string );

@letters now contains "H", "e", "l", "l", "o", "w", "o", "r", "l", "d", and "!".
Notice that the whitespace is gone. You just split $string, character by character (because null matches boundaries), and on whitespace (which gets discarded because it's a delimiter).

Using split versus Regular Expressions

There are cases where it is equally easy to use a regexp in list context to split a string as it is to use the split function. Consider the following examples:

my @list = split /\s+/, $string;
my @list = $string =~ /(\S+)/g;

In the first example you're defining what to throw away. In the second, you're defining what to keep. As long as $string has no leading whitespace, you get the same results either way (with leading whitespace, split /\s+/ returns an extra empty first element). That is a case where it's equally easy to use either syntax.

But what if you need to be more specific as to what you keep, and perhaps are a little less concerned with what comes between what you're keeping? That's a situation where a regexp is probably a better choice. See the following example:

my @bignumbers = $string =~ /(\d{4,})/g;

That type of a match would be difficult to accomplish with split. Try not to fall into the pitfall of using one where the other would be handier. In general, if you know what you want to keep, use a regexp. If you know what you want to get rid of, use split. That's an oversimplification, but start there and if you start tearing your hair out over the code, consider taking another approach. There is always more than one way to do it.
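To make the "define what to keep" case concrete, here is the big-number match run against a sample string. The string is an assumption made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical text containing numbers of various lengths.
my $string = "ids: 42, 1234, 987, 100000";

# Keep only the runs of four or more digits; everything else is ignored.
my @bignumbers = $string =~ /(\d{4,})/g;

print "@bignumbers\n";   # 1234 100000
```

Doing the same with split would mean describing everything that is not a long number, which is much harder to write down.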


That's enough for split, let's take a look at join.

join: Putting it back together

If you're exhausted by the many ways to use split, you can rest assured that join isn't nearly so complicated. We can over-simplify by saying that join does the inverse of split. If we said that, we would be mostly accurate. But there are no pattern matches going on. Join takes a string that specifies the delimiter to be placed between each item in the list supplied by the subsequent parameter(s). Where split accommodates delimiters through a regular expression, allowing for different delimiters as long as they match the regexp, join makes no attempt to allow for differing delimiters. You specify the same delimiter for each item in the list being joined, or you specify no delimiter at all. Those are your choices. Easy.

To join a list of scalars together into one colon delimited string, do this:

$string = join ( ':', $last, $first, $phone, $age, $sex );

Whew, that was easy. You can also join lists contained in arrays:

$string = join ( ':', @array );
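As a quick check on the "inverse of split" idea, here is a round trip through split and join. It is only a sketch; it works here because the record uses a single, fixed delimiter and has no trailing empty fields.

```perl
use strict;
use warnings;

my $record = "Simpson:Homer:1-800-000-0000:40:M";

my @fields = split /:/, $record;    # take the record apart
my $again  = join ':', @fields;     # and put it back together

print $again eq $record ? "round trip ok\n" : "differs\n";   # round trip ok
```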

Use join to concatenate

It turns out that join is an efficient and idiomatic way to concatenate many strings together at once, and it is usually tidier than a long chain of '.' operators.

How do you do that? Like this:

$string = join ( '', @array );

As any good Perlish function should, join will accept an actual list, not just an array holding a list. So you can say this:

$string = join ( '*', "My", "Name", "Is", "Dave" );

Or even...

$string = join ( 'humility', ( qw/My name is Dave/ ) );

Which puts humility between each word in the list.

By specifying a null delimiter (nothing between the quotes), you're telling join to join up the elements in @array, one after another, with nothing between them. Easy.

Hopefully you've still got some energy left. If you do, dive back into the Tutorial.



Replies are listed 'Best First'.
Re: Understanding Split and Join
by duff (Parson) on Dec 28, 2006 at 14:44 UTC

    I'd put more emphasis on the fact that the first argument to split is always, always, always a regular expression (except for the one special case where it isn't :-). Too often do I see people write code like this:

    @stuff = split "|", $string;

    # or worse ...

    $delim = "|";
    @stuff = split $delim, $string;
    And expect it to split on the pipe symbol because they have fooled themselves into thinking that the first argument is somehow interpreted as a string rather than a regular expression.
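A minimal sketch of the pitfall described above and two ways around it. The sample string and variable names are assumptions for illustration; \Q...\E (quotemeta) is the standard way to make an interpolated string match literally.

```perl
use strict;
use warnings;

my $string = "a|b|c";   # hypothetical pipe-delimited data

# "|" is treated as the pattern /|/, an alternation of two empty patterns,
# so it matches the empty string and splits between every character.
my @wrong = split "|", $string;

# Escape the pipe, or quote an interpolated delimiter with \Q...\E.
my @right = split /\|/, $string;
my $delim = "|";
my @also  = split /\Q$delim\E/, $string;

print scalar(@wrong), "\n";   # 5
print "@right\n";             # a b c
print "@also\n";              # a b c
```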
Re: Understanding Split and Join
by jwkrahn (Abbot) on Dec 28, 2006 at 13:23 UTC
    There are cases where it is equally easy to use a regexp in list context to split a string as it is to use the split function. Consider the following examples:
    my @list = split /\s+/, $string;
    my @list = $string =~ /(\S+)/g;
    In the first example you're defining what to throw away. In the second, you're defining what to keep. But you're getting the same results. That is a case where it's equally easy to use either syntax.

    In your regexp example you don't need the parentheses, it will work the same without them.

    If $string contains leading whitespace then you will NOT get the same results. To demonstrate examples that produce the same results:

    my @list = split ' ', $string;
    my @list = $string =~ /\S+/g;
Re: Understanding Split and Join
by chromatic (Archbishop) on Dec 29, 2006 at 00:52 UTC
    What happens if the delimiter is indicated to be a null string (a string of zero characters)?

    perl behaves inconsistently with regard to the "empty" regex:

    my $string = 'Monk';

    exit unless $string =~ /(o)/;
    my @matches = $string =~ //;
    warn join('=', @matches), "\n";

    exit unless $string =~ /(o)/;
    my @letters = split( //, $string );
    warn join('-', @letters), "\n";
Re: Understanding Split and Join
by ysth (Canon) on Dec 29, 2006 at 08:02 UTC
    chromatic has pointed out that split treats an empty pattern normally, not as a directive to reuse the last successfully matching pattern, as m// and s/// do.

    A pattern that split treats specially but m// and s/// treat normally is /^/. Normally, ^ only matches at the beginning of a string. Given the /m flag, it also matches after newlines in the interior of the string. It's common to want to break a string up into lines without removing the newlines as splitting on /\n/ would do. One way to do this is @lines = /^(.*\n?)/mg. Another, perhaps more straightforward, is @lines = split /^/m. Without the /m, the ^ should match only at the beginning of the string, so the split should return only one element, containing the entire original string. Since this is useless, and splitting on /^/m instead is common, /^/ silently becomes /^/m.

    This only applies to a pattern consisting of just ^; even the apparently equivalent /^(?#)/ or /^ /x are treated normally and don't split the string at all.
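The /^/ special case described above can be sketched like this. The three-line text is a made-up example.

```perl
use strict;
use warnings;

# Hypothetical multi-line text.
my $text = "one\ntwo\nthree\n";

# split treats a bare /^/ as /^/m, so it breaks at the start of each line
# and every element keeps its trailing newline.
my @lines = split /^/, $text;

print scalar(@lines), "\n";   # 3
print $lines[1];              # two
```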

      Both exceptions, the special treatment of // and /^/ by split, are documented in split. Both may deserve to be mentioned in the tutorial quickly for the profit of the unaware. The last remark by ysth about the non-equivalence of /^(?#)/ and /^ /x with // for split purposes is a subtle thing. More subtle if you compare to the fact that / /x, / # /x or even / (?#)/x have the same treatment as // when passed to this function. Looks like a case to be fixed either in the docs or in the code of the Perl interpreter itself (if not barred by compatibility issues).
        Looks like a case to be fixed either in the docs or in the code of the Perl interpreter itself
        I'm not sure what you mean by "fixed"? split doesn't have the special logic for // that match and substitution do, but even those operations don't have special logic for / /x, / # /x or / (?#)/x.
Re: Understanding Split and Join
by FabsBSD (Acolyte) on Apr 08, 2011 at 04:25 UTC
    Hi! How could you remove the odd characters in a string? For example, in $string = "111213141516171" the idea is to remove the odd 1s. Thanks Monks!
Re: Understanding Split and Join
by mvcorrea (Novice) on Jul 16, 2014 at 11:57 UTC
    Looking 4 wisdom in a situation.. I have a huge string. I need to split the text at max col80 "{80}" in the space (without breaking the words). If we find "/\.\s/" (dot space), that should be a nice place to split it. tks in advance,

      Have you looked at the FAQ, How do I reformat a paragraph? The core module Text::Wrap may be all you need:

      use strict;
      use warnings;
      use Text::Wrap qw($columns wrap);

      $columns = 80;
      print wrap('', '', <DATA>);

      __DATA__
      The quick brown fox jumped over the lazy dog. The quick brown fox jumped over the lazy dog. The quick brown fox jumped over the lazy dog. The quick brown fox jumped over the lazy dog. The quick brown fox jumped over the lazy dog. The quick brown fox jumped over the lazy dog. The quick brown fox jumped over the lazy dog. The quick brown fox jumped over the lazy dog. The quick brown fox jumped over the lazy dog. The quick brown fox jumped over the lazy dog. The quick brown fox jumped over the lazy dog.

      Output:

      2:13 >perl 935_SoPW.pl
      The quick brown fox jumped over the lazy dog. The quick brown fox jumped over
      the lazy dog. The quick brown fox jumped over the lazy dog. The quick brown fox
      jumped over the lazy dog. The quick brown fox jumped over the lazy dog. The
      quick brown fox jumped over the lazy dog. The quick brown fox jumped over the
      lazy dog. The quick brown fox jumped over the lazy dog. The quick brown fox
      jumped over the lazy dog. The quick brown fox jumped over the lazy dog. The
      quick brown fox jumped over the lazy dog.
      2:13 >

      Hope that helps,

      Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Understanding Split and Join
by phamalda (Novice) on Jan 29, 2016 at 19:18 UTC

    First and foremost, thank you guys for the unending information. Great stuff! I am working on a script that splits a line into an array delimited by tildes. I need to completely remove a couple of those resulting fields if there is a way to do that. This is rather than removing them based on the regular expressions identified. Below is the piece of code that does that:

    my @field = split("~", $line);
    if ($field[11] eq '000015') {
        $line =~ s/$field[12]/92/g;
        $line =~ s/~~~~~~~~~10~/~10~/g;
        push(@newlines,$line);
    }

    I'd really appreciate your advice on this. Thanks in advance.

      Welcome to the Monastery, phamalda!

      Could you please repost this in Seekers of Perl Wisdom? I understand why you asked this here, but you'll get a wider audience if you post the question in the proper location.

      Thanks!
        Okay, thank you so much. I appreciate it. Sorry about that. Pham