Other form of greediness in Perl

When we talk about 'greediness' in Perl, we always think about regex, around 97% of the chances. At least, that's what I got when I used 'greedy' or 'greediness' in the Search box and in the Super Search. There are some functions, constructs, or areas where 'greediness' in fact exists to present occasional surprises to us who aren't aware as well as to provide expected conveniences.

List construct is always greedy. It eats up everything up to after the last comma, whether you want it or not. And this can be a problem when you need to provide a list inside another list.

my %chart = (
  name => 'vote',
  data => get_vote_data(), # assumed to return a scalar
  colors => join ',', 'red', 'green', 'blue',
  shape => 'bar',
);

print $chart{name}, "\n";   # 'vote'
print $chart{colors}, "\n"; # oops: 'red,green,blue,shape,cubic'
[download]

In my early days of Perl programming I was beaten by this. A few days ago I wrote similiar code above but the element with joined list as the value was the last element. I was aware to write it such. It just reminded me the mistake I made then. And this was the moment when it crossed my mind that "greediness is not all about regex". Then I decided to write something about it. This is not, by any means, an attempt to provide a solid reference or a complete tutorial about "other form of greediness in Perl".

List argument

Many Perl built-in functions take list as their arguments. Sometimes I like to use shortcuts like $var = $value, next if $true (something PBP does seems to discourage) instead of,

if ($true) {
  $var = $value;
  next;
}
[download]

This is a problem, for example, when we use warn. I admit that this is not particularly useful example. The whole line is always evaluated to true due to if 1. But I think it's enough.

while (1) {
  warn 1, 2, 3, last if 1;
}
[download]

One might expect that it prints 123 at ... line ... to STDERR and exits the while loop. Instead, it exits without showing anything. Why? According to the docs, the warn function takes LIST. So it swallows the stuffs including last because perl sees only one statement. As soon as the last is taken into account, it is executed and it exits the loop right away without warn having any chance to be executed. The B::Deparse helps to make it clear.

$ perl -MO=Deparse,-p -e 'while (1) { warn 1, 2, 3, last if 1 }'
while (1) {
  warn(1, 2, 3, last);
}
-e syntax OK
[download]

As you can see, last is considered part of list elements for warn. The parentheses around the list are added by B::Deparse due to -p switch. The if part is also removed because the condition is simple and always true.

Parentheses to the rescue

Just like regex that has non-greedy quantifiers (the ? mark), the list construct also provides non-greedy behavior by using parentheses. Of course, this is not something surprising. Parens have been there from the start. It's used for, among other things, explicitly declaring the order of precedences, or limiting the boundary of list elements. The latter is exactly what we need to solve our problems above.

The first problem with join is easy. By using parentheses around the arguments to join, the problem is solved right away.

my %chart = (
  name => 'vote',
  data => get_vote_data(), # assumed to return a scalar
  colors => join(',', 'red', 'green', 'blue'),
  shape => 'bar',
);

print $chart{name}, "\n";   # 'vote'
print $chart{colors}, "\n"; # 'red,green,blue'
[download]

The closing paren effectively marks the end of what join is supposed to grab. IOW, the closing paren limits the boundary of list elements join is expected to take. The first comma after the name => 'vote' pair acts normally as pair separator. However after the join function, the comma acts as argument separator. The list becomes greedy it could eat the last pair as the last two args for join. The => operator is really just a comma in disguise (with a power to force-stringify its left operand). After the closing paren that terminates the join arguments, the list greedinees is tamed, and the next comma becomes pair separator again. The %chart contains four pairs of elements as expected. The same thing for the warn example. When we use the parens, it's clear that we have two separate execution units. We already got hint from B::Deparse at the previous section. We only need to make sure where to put the closing paren.

while (1) {
  warn(1, 2, 3), last if 1;
}
[download]

It now prints 123 at ... line ... to STDERR then exits the loop. In this case, the comma operator after the closing paren acts as statement separator. It's exactly the same if it was written as,

while (1) {
  if (1) {
    warn 1, 2, 3;
    last;
  }
}
[download]

Unary operators are not greedy

On the contrary, functions in unary operators category are never greedy because they never take more than one argument. Take chr for example.

$ perl -e 'print chr $ARGV[0], $ARGV[1]' 66 67
B67
[download]

The code consist of two functions, print and chr, and 'a list' of arguments taken from @ARGV, the command line arguments. The print operator, as we know, takes list as its argument (after a filehandle, explicit or not) so it will spit out what ever we feed to it. On the other hand, the chr takes only one argument (or it will use $_ if none is supplied). Perl knows for sure that chr will only use the first argument ($ARGV[0]) and passes the rest to print. So what happens is chr takes 66 and returns the character that number represents (B). The print operator takes the return value and prints it together with 67, so we see B67. Now, let's see it through B::Deparse.

$ perl -MO=Deparse,-p -e 'print chr $ARGV[0], $ARGV[1]' 66 67
print(chr($ARGV[0]), $ARGV[1]);
-e syntax OK
[download]

Looking at how the parens are placed, we can see that the comma operator terminates rather than separates the arguments for chr. As for the print, the comma does separate the arguments. The same thing applies for functions that takes no argument at all, such as time.

$ perl -e 'print time, "TZ"'
1191177816TZ
[download]

Greedy and non-greedy behaviors

Interestingly, Perl also has functions that provide both greedy and non-greedy behaviors, depending on how many arguments we pass in to them. First, let's take examples of the split. This function breaks up string on specified pattern (or predefined pattern if one is omitted) into parts.

my $line = 'book|perl|programming';
my @fields = split /\|/, $line;
[download]

The split is greedy in this code because it will fill up the array @fields with as many elements as it can find. In this case, it contains book, perl, and programming. However, if we use the limit argument, then we get the non-greedy behavior. We can tell it how many elements exactly do we want.

my $line = 'book|perl|programming';
my @fields = split /\|/, $line, 2;
[download]

No matter how many elements it could split, @fields would always contain at the maximum two elements. From the code above, @fields contains book and perl|programming.

The second example is the substr function. It returns part of a string based on start position and how many to return. If we don't specify how many we want, it's greedy. The good side, we don't have to worry about how many characters (for some definition of "character") are there, just get them all.

my $line = 'Practical extraction and report language';
my $part = substr $line, 10;
[download]

The substr returns partial string from $line starting at position 10 (counting from 0) onward, extraction and report language in this case. Let's say we want certain number of charcters, then we get the non-greedy behavior and we can explicitly ask for the specific part of the string to return.

my $line = 'Practical extraction and report language';
my $what = substr $line, 10, 21; # extraction and report
[download]

Subroutine: Passing params and returning values

Somehow I feel that passing parameters to subroutines and returning values from subroutines are classical topics regarding some features of Perl. They also happen to have strong relationship with this greediness matter. Parameters passed in to a subroutine will be in the array @_, one of Perl special globals. We can then assign variables from this array. However, an array assigned from @_ is greedy. Nothing left for the next variable to be assigned from @_ anymore.

sub upstream {
  my(@source, @destination) = @_;
  my $src_place = $source[0]; # 'here'
  my $dest_place = $destination[0]; # undefined
}

my @src = qw(here sender OK);
my @dest = qw(there recipient UNKNOWN);
upstream(@src, @dest);
[download]

From the code above, the array @destination in the subroutine upstream() absolutely contains empty element, not the content of @dest as one might expect. OTOH, the array @source has all the combined elements of original @src and @dest from the caller. It's clear: @source is greedy. No matter how many storage units are passed in, it will be all in a single array, once the array is assigned from @_. Yes, prototyping can help, if it's really wanted. And its purpose can be beaten as well. Nevertheless, discussion on subroutine prototyping is out of scope of this article.

But there's another elegant solution to overcome this greediness problem. And it's also a good time to introduce the benefit of references. So to make the subroutine upstream() treats the two arrays as distinct storage units, they must be passed in as references. Likewise, to access the arrays, they must be dereferenced first.

sub upstream {
  my($source, $destination) = @_;
  my $src_place = $source->[0]; # 'here'
  my $dest_place = $destination->[0]; # 'there'
}

my @src = qw(here sender OK);
my @dest = qw(there recipient UNKNOWN);
upstream(\@src, \@dest);
[download]

This time, each array can be passed inand accessed individually. Two storage unist will be accepted as two storageunits as well. No greedy behavior.

Just like when passing in the parameters, greediness also happens when returning values from a subroutine to the caller. Perl uses the same stack model for parameters passing and values returning, as stated in the perlsub:

The Perl model for function call and return values is simple: all functions are passed as parameters one single flat list of scalars, and all functions likewise return to their caller one single flat list of scalars. Any arrays or hashes in these call and return lists will collapse, losing their identities....

The subroutine get_group() in the following code is supposed to return a list of groups a user is assgined to. There's no way we can distinct the type of groups returned even if we know it.

sub get_group {
  my $user_id = shift;
  my @special_groups = get_special_group($user_id); # ex: 'cdrom', 'ga
+mes'
  my @normal_groups = get_normal_group($user_id); # ex: 'westteam', 'm
+arketing', 'committee'

  return @special_groups, @normal_groups;
}

my(@specail, @normal) = get_group(1002);
print $special[-1]; # 'games'
print $normal[0]; # undefined
[download]

The whole elements of both arrays are returned as flat list, and the first array that captures the return values is greedy. Indeed, the array @special will have cdrom, games, westteam, marketing, and committee while the array @normal is just an empty one. To capture them as two different units, we again use references to return the groups, and dereference them to access the elements.

sub get_group {
  my $user_id = shift;
  my @special_groups = get_special_group($user_id); # ex: 'cdrom', 'ga
+mes'
  my @normal_groups = get_normal_group($user_id); # ex: 'westteam', 'm
+arketing', 'committee'

  return \@special_groups, \@normal_groups;
}

my($specail, $normal) = get_group(1002);
print $special->[-1]; # 'games'
print $normal->[-1]; # 'committee'
[download]

Aggregat, list, and scalar arguments in the function syntax descriptions

The perlfunc contains explanation about Perl functions at general as well as a syntax description for each function. A function can be generally called in many ways (various number of arguments) and context-sensitive (void, scalar, or list). Functions can take aggregat, list or scalar arguments.

Aggregat variables mean that they can have more than one scalar, so the function explicitly takes an ARRAY or a HASH as argument. If what we have is a reference to an array or a hash, then it must be dereferenced first, appropriately. In the following examples, the %$rgb, %ENV, @names, and @$another_members are each treated as a single unit, individually.

# syntax: <c>keys HASH</c>
my $rgb = { red => 0x00f, green => 0xf00, blue => 0x0f0 }; # from perl
+data
my @keys = keys $rgb; # fatal: Type of arg 1 to keys must be hash
my @keys = keys %$rgb; # correct
my @keys = keys %ENV; # correct

# syntax: <c>push ARRAY,LIST</c>
# first example
my @names = original_member();
my @new_members = get_new_members('20071001');
push @names, @new_members;

# second example
my $another_members = [qw(ray tom dave)];
push $another_members, @names; # fatal: Type of arg 1 to push must be 
+array
push @$another_members, @names; # correct
[download]

If we emulate it,

sub mypush {
  my(@target_array, @new_elements) = @_;
  ...
}
[download]

There's no greediness here. It's guaranteed that @target_array is the same as @names and @new_elements is the same as new_members in the first example. Likewise, @target_array is the same as @$another_members and @new_elements is the same as @names in the second one. List argument (denoted as LIST) is, however, greedy, as we discussed earlier in this article. Everything after the first array will be treated as @new_elements, no matter how many arrays or scalars are there.

A function can take a number of scalars too. For example, the splice syntax is splice ARRAY,OFFSET,LENGTH,LIST. The argument ARRAY must be a real array or a dereferenced array reference. OFFSET and LENGTH are two scalar arguments. They will be treated individually. So the first three arguments are not treated in greediness. But LIST, the last argument is, once again, greedy.

Conclusion

Well, I actually didn't meant to make some conclusion, but I need a closing to summarize what we have seen so far :-)

Greediness behavior in Perl is not just all about regex.
List constructs are greedy.
Array assignment in parameters passing and return values is greedy. (updated, per ikegami, see below)
A function can be both greedy and non-greedy, depending on the arguments we use.
Functions that take LIST argument are greedy, using parentheses is required to avoid greediness.
Aggregat arguments (HASH or ARRAY) and scalars arguments (OFFSET, LENGTH, NUMBER, FILEHANDLE, etc) won't be treated in greediness.

(Updated)I'd be more than glad to hear other monks experiences and feelings about this issue, including opinion on whether or not this issue is important at all. Thanks a lot.

Happy Perl programming, folks!

Open source softwares? Share and enjoy. Make profit from them if you can. Yet, share and enjoy!

Comment on Other form of greediness in Perl Select or Download Code

Replies are listed 'Best First'.
Re: Other form of greediness in Perl by ikegami (Patriarch) on Oct 01, 2007 at 16:55 UTC
Parameters passing and return values are greedy. It has nothing to do with parameter passing or values being returned. It's assigning a list to an array that's "greedy". It doesn't matter if the RHS of the assignment is `@_` or a function call. `my (@array, $scalar) = (1, 2, 3); print(scalar(@array), "\n"); # 3 print(defined($scalar)?1:0), "\n"); # 0` [download]	[reply] [d/l] [select]
Re^2: Other form of greediness in Perl by naikonta (Curate) on Oct 01, 2007 at 17:38 UTC
It's supposed to be related with the opening, There are some functions, constructs, or areas where 'greediness' in fact exists... So, yes, it's not about the passing and returning. Rather, the greediness could happen in both events. I did mention about the array assignment that is greedy. Thanks, ikegami, I will update the sentence. Open source softwares? Share and enjoy. Make profit from them if you can. Yet, share and enjoy!	[reply]
Re: Other form of greediness in Perl by ikegami (Patriarch) on Oct 01, 2007 at 19:55 UTC
If list contruction is greedy, that would make all binary ops be greedy as well. If `foo $x,$y,$z` is greedy beacuse it means `foo($x,$y,$z)` and not `foo($x),$y,$z`, then `foo $x+$y+$z` is greedy beacuse it means `foo($x+$y+$z)` and not `foo($x)+$y+$z`. I think we don't talk of list being because it's easier to think it terms of how many operands are required. (e.g. 0, 3 or 4, any number, etc) Besides, it's just too inconsistent. `foo $x,$y,$z means foo($x,$y,$z) For functions, bar $x,$y,$z doesn't mean bar($x,$y,$z) it depends on the +prototype. print $x,$y,$z means print($x,$y,$z) For named operators, chr $x,$y,$z doesn't mean chr($x,$y,$z) it depends on the +prototype. not $x+$y+$z means not($x+$y+$z) For other operators, ! $x+$y+$z doesn't mean !($x,$y,$z) it depends on the +precedence.` [download]	[reply] [d/l] [select]
Re^2: Other form of greediness in Perl by naikonta (Curate) on Oct 04, 2007 at 15:29 UTC
I think we don't talk of list being because it's easier to think it terms of how many operands are required. (e.g. 0, 3 or 4, any number, etc) It's not the same in term of whether a function takes list, or more than one scalar (also list), or aggregat variables. That's why I took a different approach. Besides, 'list constructs are greedy' is a general statement, based on the examples I presented. So it's not about list construct per se for the whole statement. But rather, how many execution units are there. To determine this, we need to look at which functions are involved and what kind of argument they take. Here, "kinds of argument" I excerpt from the perlfunc are: aggregats (HASH, ARRAY), such in `keys HASH`, `unshift ARRAY` lists (LIST), such as `warn LIST`, `reverse LIST` scalars (NUMBER, FILEHANDLE, FILENAME), such as `chr NUMBER`, `chdir FILEHANDLE`, `chroot FILENAME` Of course, some functions doesn't take any argument at all, and some functions take all the kinds of argument, such as `splice ARRAY,OFFSET,LENGTH,LIST`. ARRAY is an aggregat, OFFSET and LENGTH are scalars, LIST is, well, a list. In this case, LIST is always greedy. If foo $x,$y,$z is greedy beacuse it means foo($x,$y,$z) and not foo($x),$y,$z, then foo $x+$y+$z is greedy beacuse it means foo($x+$y+$z) and not foo($x)+$y+$z. From this example, if `foo` is a unary operator, then in `foo $x, $y, $z` we have two execution units, `foo $x` and `$y, $z` separated by a comma (as a statement separator). The whole context then is `foo($x), $y, $z`. This is the list construct I'm talking about, `foo($x)`, `$y`, and `$z`, so `foo($x)` is not greedy. But the whole list can be greedy depending what's before `foo`. Similiarly for `foo $x + $y + $z`. I agree that it means `foo($x+$y+$z)`, but this is not greedy. I see there are two execution units here, `$x+$y+$z` and `foo($x+$y+$z)`. Without parens, perl will without doubt parse it as `foo($x+$y+$z)` and not `foo($x)+$y+$z`, I think because the precedence of + operator. As expression (as in `foo EXPR` syntax description), the addition will be performed first, and the result will be fed to `foo`. This is similiar to, `$ perl -le 'print chr 66' B $ perl -le 'print chr 60 + 5 + 1' B $ perl -le 'print chr 60 + 5 + 1, 10' B10` [download] Open source softwares? Share and enjoy. Make profit from them if you can. Yet, share and enjoy!	[reply] [d/l] [select]


The stupid question is the question not asked
	PerlMonks