http://qs321.pair.com?node_id=920093

regan99 has asked for the wisdom of the Perl Monks concerning the following question:

I have two PHP scripts that I am using for Hadoop streaming to crunch some JSON data. For various reasons, I am trying to translate these scripts to Perl, but I'm having a tough time. I actually managed to get the mapping script translated (I kinda cheated and used a regex to get around processing the JSON data), but the reduce script is more complex. For starters, here's the sorted output from the mapping script (each line is a key, followed by a tab, followed by a value, with a newline ending each line):

36dc0d7d0ac25ce60898c36ca135fbbd [[12051,840,501,33],{"23602":22}]
4c38528ffe96a15c90e8cfcaaad048e3 [[13308,124,-1,62],{"8002":12}]
5557a6bed3793133754d288e2b58763a [[2197,840,751,6],{"16501":1}]
5a9c1f69434c1a8b1d7880ef03ae4264 [[7525,616,-1,14347],{"24902":37}]
87f63173118df680a4c1d63b7953faf3 [[2765,458,-1,11937],{"3102":15}]
901d1a5dbd4ed87fd68db2513fb29762 [[1828,124,-1,63],{"8002":379}]
c23a2b2c10af8af96b1b24ddd4cc53d4 [[62,840,820,38],{"16801":303}]
d7af9cd8573ecbec6d42e453439e3e0f [[4680,124,-1,63],{"1012":1896}]
d93adab6b345608d38ea84811012dce8 [[114,840,819,48],{"22502":322,"8002":3}]
ffd50dd8b4986f40634d6b5925dc04c6 [[6089,840,803,5],{"1252":1}]

And here is the PHP code that does the reducing:

#!/usr/bin/php
<?php
$data = array();
while (($line = fgets(STDIN)) !== false) {
    list($key, $value) = explode("\t", trim($line));
    $value =& json_decode($value);
    $value[1] = get_object_vars($value[1]);
    if (isset($data[$key])) {
        foreach ($value[1] as $k => $v) {
            $data[$key][1][$k] += $v;
        }
    } else {
        $data[$key] = $value;
    }
}
foreach ($data as $key => $value) {
    echo $key . "\t" . json_encode(array($key => $value)) . "\n";
}

Particularly, this is the part I can't figure out how to translate:

$value =& json_decode($value);
$value[1] = get_object_vars($value[1]);

I placed a couple of echos in that PHP code to see what values wind up in $value and $value[1], and here's what they get with the first line of the input data:

Input line:
36dc0d7d0ac25ce60898c36ca135fbbd {"36dc0d7d0ac25ce60898c36ca135fbbd":[[12051,840,501,33],{"23602":22}]}

$value before the json_decode call:
[[12051,840,501,33],{"23602":22}]

$value after json_decode call (output via print_r):
Array
(
    [0] => Array
        (
            [0] => 12051
            [1] => 840
            [2] => 501
            [3] => 33
        )
    [1] => Array
        (
            [23602] => 22
        )
)

$value[1] output via print_r:
Array
(
    [23602] => 22
)

I can see that the json_decode call basically takes the value, converts it to a multidimensional array, and assigns it to $value. I don't quite understand what the get_object_vars call does to $value[1], but the end result is that it contains another array holding a key->value pair.

My question is, how hard would it be to do the same thing in Perl? I took a look at the JSON module documentation, but didn't understand how to wind up with the same results this PHP code gets. Any takers on giving me a hand with this?

Re: Converting some MapReduce PHP scripts to Perl
by Tanktalus (Canon) on Aug 12, 2011 at 22:30 UTC

    The json_decode in PHP there looks like the same thing as decode_json in the JSON (or JSON::XS) module in perl. Note, however, that PHP is kinda lying to you - key-value pairs are stored in maps or, in perl, hashes. That PHP doesn't differentiate may make PHP easier, but makes translating what you've learned in PHP to other languages harder. (To be fair, translating from Perl to, say, C++, isn't trivial, either, for similar abstraction-based reasons.)

    Presumably, that get_object_vars function exists elsewhere in the code, though your printouts make it look like it doesn't do anything, as $value[1] will already have that value by way of the decode_json function. So, a rough copy would be:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use JSON;

    my %data;
    while (<DATA>) {
        my ($key, $value) = split ' ', $_;
        $value = decode_json($value);
        if (exists $data{$key}) {
            # $value is an array reference, so its second element is
            # $value->[1]; add its counts into the hash already stored
            # at index 1 for this key (as the PHP does with $data[$key][1]).
            while (my ($k, $v) = each %{ $value->[1] }) {
                $data{$key}[1]{$k} += $v;
            }
        }
        else {
            $data{$key} = $value;
        }
    }

    use Data::Dumper qw(Dumper);
    print Dumper(\%data);
    __END__
    36dc0d7d0ac25ce60898c36ca135fbbd [[12051,840,501,33],{"23602":22}]
    4c38528ffe96a15c90e8cfcaaad048e3 [[13308,124,-1,62],{"8002":12}]
    5557a6bed3793133754d288e2b58763a [[2197,840,751,6],{"16501":1}]
    5a9c1f69434c1a8b1d7880ef03ae4264 [[7525,616,-1,14347],{"24902":37}]
    87f63173118df680a4c1d63b7953faf3 [[2765,458,-1,11937],{"3102":15}]
    901d1a5dbd4ed87fd68db2513fb29762 [[1828,124,-1,63],{"8002":379}]
    c23a2b2c10af8af96b1b24ddd4cc53d4 [[62,840,820,38],{"16801":303}]
    d7af9cd8573ecbec6d42e453439e3e0f [[4680,124,-1,63],{"1012":1896}]
    d93adab6b345608d38ea84811012dce8 [[114,840,819,48],{"22502":322,"8002":3}]
    ffd50dd8b4986f40634d6b5925dc04c6 [[6089,840,803,5],{"1252":1}]
    Not sure if that's entirely correct, but it seems to work.

      Presumably, that get_object_vars function exists elsewhere in the code

      Actually, it's a standard PHP function. It turns an object into an (associative) array of its properties.

      PHP's json_decode turns the map/hash/whatever into technically an object rather than an array. Little odd, though perhaps fitting the 'JS' part of 'JSON' in a way. In that case, the extra step is needed to get something you can refer to as an array. From the print_r() output, it looks like it may have actually turned it into an array since it only had numeric keys, but that's too deep in PHP's DWIM for me to suss out.
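
To make the point about get_object_vars concrete on the Perl side, here is a minimal sketch of the decode step. It assumes the core JSON::PP module purely so that it runs without installing anything; the JSON and JSON::XS modules mentioned above export a decode_json function that is used the same way. The variable names $fields and $counts are made up for illustration.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# One value taken from the mapper output shown in the question.
my $value = decode_json('[[12051,840,501,33],{"23602":22}]');

# decode_json hands back a reference to a two-element array.
my $fields = $value->[0];    # array reference: [12051, 840, 501, 33]
my $counts = $value->[1];    # hash reference:  { 23602 => 22 }

print ref($fields), "\n";        # ARRAY
print ref($counts), "\n";        # HASH
print $counts->{23602}, "\n";    # 22
```

Because the JSON object already arrives as a hash reference, no separate conversion step in the spirit of get_object_vars should be needed before reading or updating its keys.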

      DISREGARD BELOW!!! I figured it out... I forgot to use encode_json on $data{$key} to properly format the output. Thanks for the help and suggestions!

      That works! Thank you for the reply. The only issue I am having now is with the output. I'm trying to get the output to be in the form:

      $key . "\t" . $data{$key} . "\n";

      But when I iterate through the %data hash, the key and the tab come out as expected, but the value comes out as an array reference. I've tried to dereference it with no success. The dumper output for one key->value pair of the %data hash shows this:

      '5557a6bed3793133754d288e2b58763a' => [
        [ 2197, 840, 751, 6 ],
        { '16501' => 1 }
      ]

      This looks as though the value contains both an array and a hash. Is there a way I can dereference those so that my output prints those values out as a string rather than the reference? For example, I currently get this:

      5557a6bed3793133754d288e2b58763a ARRAY(0x441c780)

      When what I'm after is this:

      5557a6bed3793133754d288e2b58763a [[2197,840,751,6],{"16501":1}]

      ...and here's the code I'm using to output the pairs:

      foreach $key (keys %data) {
          print $key . "\t" . "$data{$key}" . "\n";
      }

      Any ideas how to dereference that $data{$key} to get the string instead of the array reference?

        As you mentioned, the data contains an array and a hash, so it needs more than just a print. I'm sure there are many ways to do it, but here is what I have:
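        foreach my $key ( keys %data ) {
            my $out = '';
            foreach my $s ( @{ $data{$key} } ) {
                if ( ref $s eq 'ARRAY' ) {
                    $out .= '[' . ( join ',', @$s ) . ']';
                }
                else {
                    # Put every pair inside a single {...} so that a hash
                    # with several keys still comes out as one JSON object.
                    $out .= ',{' . join( ',', map { "\"$_\":$s->{$_}" } sort keys %$s ) . '}';
                }
            }
            print $key . "\t[$out]\n";
        }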
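
For comparison, the encode_json fix the original poster mentions can be sketched as a complete reducer. This is only a sketch under a few assumptions: it uses the core JSON::PP module (with canonical key ordering so the output is reproducible), the helper name reduce_lines is hypothetical, and the two input lines are made-up variations of one record from the question, chosen so that the merge branch actually runs.

```perl
use strict;
use warnings;
use JSON::PP;

# canonical() sorts hash keys, which keeps the output stable between runs.
my $json = JSON::PP->new->canonical;

# Hypothetical helper: takes "key<TAB>json" lines, merges the counts of
# repeated keys, and returns one "key<TAB>json" output line per key.
sub reduce_lines {
    my @lines = @_;
    my %data;
    for my $line (@lines) {
        chomp $line;
        my ($key, $value) = split /\t/, $line, 2;
        $value = $json->decode($value);
        if (exists $data{$key}) {
            # Key seen before: add the new counts to the stored ones.
            my $counts = $value->[1];
            $data{$key}[1]{$_} += $counts->{$_} for keys %$counts;
        }
        else {
            $data{$key} = $value;
        }
    }
    # Serialize each merged value back to JSON instead of printing the reference.
    return map { "$_\t" . $json->encode($data{$_}) . "\n" } sort keys %data;
}

print reduce_lines(
    qq(5557a6bed3793133754d288e2b58763a\t[[2197,840,751,6],{"16501":1}]\n),
    qq(5557a6bed3793133754d288e2b58763a\t[[2197,840,751,6],{"16501":2,"8002":3}]\n),
);
```

In a real streaming job the lines would come from STDIN rather than a literal list; the point here is only that encoding the stored structure yields the JSON text where plain string interpolation yields ARRAY(0x...).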
Re: Converting some MapReduce PHP scripts to Perl
by Anonymous Monk on Aug 12, 2011 at 22:27 UTC

    This is Perl. Use CPAN and it becomes trivial to deal with JSON