http://qs321.pair.com?node_id=1096847

vagabonding electron has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have a table which has two columns: the number and the previous number. There can be more than two items in the chain. I build a hash where the key is the number and the value is the very first item in the chain.

Below is my code by now (a minimal example to play with). It just works, but it is fragile, since it relies on the assumption that there is at most 5 items in the chain. In the real task it should be so, but it is not guaranteed. What would be the better way to do it?

Thanks in advance!

#!/perl use strict; use warnings FATAL => qw(all); use Text::CSV_XS; use Data::Dump; my $csv_par = { binary => 1, auto_diag => 1, allow_whitespace => 1, sep_char => ';', eol => $/, quote_char => undef, }; my $csv = Text::CSV_XS->new($csv_par); my @data = @{ $csv->getline_all(*DATA) }; shift @data; # simply throw away the header. my %main; my %first_pass; for my $row ( @data ) { $first_pass{$row->[0]} = $row->[1]; $main{$row->[0]} = $row->[1]; } my %second_pass; for my $row (@data) { if ( $first_pass{$row->[1]} ) { $second_pass{$row->[0]} = $row->[1]; $main{ $row->[0] } = $main{ $row->[1] } || $row->[1]; } } my %third_pass; for my $row (@data) { if ( $second_pass{$row->[1]} ) { $third_pass{$row->[0]} = $row->[1]; $main{ $row->[0] } = $main{ $row->[1] } || $row->[1]; } } my %fourth_pass; for my $row (@data) { if ( $third_pass{$row->[1]} ) { $fourth_pass{$row->[0]} = $row->[1]; $main{ $row->[0] } = $main{ $row->[1] } || $row->[1]; } } for my $row ( @data ) { $main{$row->[1]} = "" unless exists $main{$row->[1]} or $row->[1] eq ""; } for my $num ( sort {$a <=> $b} keys %main ) { $main{$num} = $num unless length($main{$num}); print "$num => $main{$num}$/"; } # Sample chains: # 123-234-345-456-567 # 117-228-339 # 131 # 213-324-435 # 372 __DATA__ NUM;NUMPRED 567;456 456;345 345;234 234;123 339;228 228;117 131; 435;324 324;213 372;
The output:
117 => 117 123 => 123 131 => 131 213 => 213 228 => 117 234 => 123 324 => 213 339 => 117 345 => 123 372 => 372 435 => 213 456 => 123 567 => 123

Replies are listed 'Best First'.
Re: Looking for the first item in the chain
by Corion (Patriarch) on Aug 09, 2014 at 16:20 UTC

    Have you considered using a loop instead of writing out the passes verbatim?

    Consider starting with a for loop (perlsyn) that does the five steps:

    for my $step (1..5) { # Perform one step };

    Then you could consider repeating until you have not found any predecessor anymore:

    my $changed= 1; while( $changed ) { $changed= 0; ... for my $row (@table) { if( $main{ $row->[0] } ) { $changed= 1; # we still need to keep working }; }; };

      Thank you very much, Corion .

      I did think about a loop but it seems that I needed that the pope gives me a kick :-) I must admit however that I could not wrap my head around the flag $changed yet. Could you please give me another hint on this? Below is my new attempt with the do ... while loop. I updated the sample data to see that the longer chain will be proceeded.

      !/perl use strict; use warnings FATAL => qw(all); use Text::CSV_XS; use Data::Dump; my $csv_par = { binary => 1, auto_diag => 1, allow_whitespace => 1, sep_char => ';', eol => $/, quote_char => undef, }; my $csv = Text::CSV_XS->new($csv_par); my @data = @{ $csv->getline_all(*DATA) }; shift @data; # simply throw away the header. my %main; my %hash_old; ############### The first pass: for my $row ( @data ) { $hash_old{$row->[0]} = $row->[1]; $main{$row->[0]} = $row->[1]; } my %hash_new; ############### All the remaining steps: do { for my $row (@data) { if ( $hash_old{$row->[1]} ) { $hash_new{$row->[0]} = $row->[1]; $main{ $row->[0] } = $main{ $row->[1] } || $row->[1]; } } %hash_old = (); $hash_old{$_} = $hash_new{$_} for keys %hash_new; %hash_new = (); } while ( scalar keys %hash_old ); ############### Filling in the main number ############### for the main number themselves: for my $row ( @data ) { $main{$row->[1]} = $row->[1] unless exists $main{$row->[1]} or $row->[1] eq ""; $main{$row->[0]} = $row->[0] unless length($main{$row->[0]}); } # Sample chains: # 123-234-345-456-567-678-789 # 117-228-339 # 131 # 213-324-435 # 372 dd \%main;

      The output now:

      { 117 => 117, 123 => 123, 131 => 131, 213 => 213, 228 => 117, 234 => 123, 324 => 213, 339 => 117, 345 => 123, 372 => 372, 435 => 213, 456 => 123, 567 => 123, 678 => 123, 789 => 123, }
Re: Looking for the first item in the chain
by Cristoforo (Curate) on Aug 09, 2014 at 23:21 UTC
    Borrowing from Laurent_R's answer, you might want to consider using Graph::Undirected to get the connections. I also borrowed this graph example from somewhere, probably from here at Perl Monks but can't find the original in a search.
    #!/usr/bin/perl use strict; use warnings; use Graph; my $g = Graph::Undirected->new; while (<DATA>) { chomp; my ($x, $y) = split /;/; $y = $x if $y eq ''; # will pass $y == 0 if that's the case $g->add_edge($x, $y); } my %pred; for my $aref ($g->connected_components) { my @items = sort {$a <=> $b} @$aref; my $first = $items[0]; for my $element (@items) { $pred{$element} = $first; } } print "$_ => $pred{$_}\n" for sort {$a <=> $b} keys %pred; __DATA__ 567;456 456;345 345;234 234;123 339;228 228;117 131; 435;324 324;213 372;
    Prints:
    117 => 117 123 => 123 131 => 131 213 => 213 228 => 117 234 => 123 324 => 213 339 => 117 345 => 123 372 => 372 435 => 213 456 => 123 567 => 123
    Update: Changed $g->add_edge($x, $y || $x); to $g->add_edge($x, $y); and added $y = $x if $y eq ''; so to allow $y to be a possible zero.

      Thank you very much Cristoforo for showing me this possibility, I did not know about the module. It seems that it could be useful in other tasks in this project as well, and the "chains" are built for free :) I am reading the documentation now. By the way, I noticed that Graph is not maintained anymore. Do you think it could be a problem in future?

        You're welcome - glad you found this useful. I don't know whether there will be a problem if it isn't being maintained anymore. Its such a useful module, so that surprises me.

        I am not well versed on the use of this module. All I've learned is from examples here on Perl Monks. For example, I don't know if your chains could have been constructed using a Directed graph. I've seen an example solving a problem using a directed graph but I found a solution to your problem using an undirected graph,

Re: Looking for the first item in the chain
by graff (Chancellor) on Aug 09, 2014 at 23:20 UTC
    I think you only need one hash, and you just have to loop over it as often as necessary, to eliminate intermediate links in the chains:
    use strict; use warnings; my %parent_of; <DATA>; # skip header while (<DATA>) { chomp; my ( $child, $parent ) = split( /;/ ); $parent ||= $child; $parent_of{$child} = $parent; $parent_of{$parent} = $parent unless ( exists( $parent_of{$parent} + )); } my $changes; do { $changes = 0; for my $child ( keys %parent_of ) { my $parent = $parent_of{$child}; while ( $parent_of{$parent} != $parent ) { my $next_parent = $parent_of{$parent}; $parent_of{$child} = $next_parent; $parent = $next_parent; $changes++; } } } while ( $changes ); for my $child ( sort keys %parent_of ) { print " $child => $parent_of{$child}\n"; } __DATA__ NUM;NUMPRED 567;456 456;345 345;234 234;123 339;228 228;117 131; 435;324 324;213 372; 789;678 678;567
    (I commend you for using Text::CSV_XS, and you can certainly stick with that, but I assumed the input would be simple enough to do without it.) The DATA above includes some extra samples, to create a longer chain.

    The point of the do {...} while ( $changes ) loop is simply to keep iterating over the "parent_of" hash, changing values for child keys until all the values represent "primary nodes" in all the chains.

    Apart from that, I like the other suggestion given above: if possible, it would make more sense to create this form output as part of the same process that produces the sample input you've shown us here.

      Thank you very much graff ! Your example is very informative and self-explanatory. As for the data itself, it exists in the database in the form I showed, I am free to build the chains of numbers, but they does not exist in the raw form.

      Update: I benchmarked your solution with this of mine ( Re^2: Looking for the first item in the chain ) - your code is 48% faster!

Re: Looking for the first item in the chain
by Laurent_R (Canon) on Aug 09, 2014 at 20:25 UTC
    How about this:
    use strict; use warnings; my %pred; while (<DATA>) { chomp; my @items = split /-/, $_; my $first = $items[0]; for my $element (@items) { $pred{$element} = $first; } } print "$_ => $pred{$_}\n" for sort keys %pred; __DATA__ 123-234-345-456-567 117-228-339 131 213-324-435 372
    Output:
    $ perl predecessor.pl 117 => 117 123 => 123 131 => 131 213 => 213 228 => 117 234 => 123 324 => 213 339 => 117 345 => 123 372 => 372 435 => 213 456 => 123 567 => 123

      Thank you Laurent_R ! However the raw data does not exists as "chains". It is in a table where each row presents the number and the previous number. I just showed the chains in the commented lines to illustrate the problem. The chain building is very interesting itself however and can be useful in this project too. Cristoforo shows an example below. Thank you for the idea.

        OK, vagabonding electron, sorry that I misunderstood your requirement, I thought the chains were the input. This a quick (probably not optimal) way to do it with your actual source data:
        use strict; use warnings; use Data::Dumper; my %predec; while (<DATA>) { chomp; my ($succ, $pred) = split /;/, $_; $pred = $succ if $pred eq ''; push @{$predec{$pred}}, $succ; } my $continue = 1; while ($continue) { $continue = 0; for my $pred (keys %predec) { my $succ = @{$predec{$pred}}[-1]; if (defined $succ and exists $predec{$succ} and scalar @{$pred +ec{$succ}} != 0) { next if $succ eq $pred; push @{$predec{$pred}}, @{$predec{$succ}}; delete $predec{$succ}; $continue = 1; } $succ = @{$predec{$pred}}[0]; if (defined $succ and exists $predec{$succ} and scalar @{$pred +ec{$succ}} != 0) { unshift @{$predec{$pred}}, @{$predec{$succ}}; delete $predec{$succ}; $continue = 1; } } } print "Data structure: \n", Dumper \%predec; print "\n\nResult:\n\n"; for my $key (keys %predec) { next if scalar @{$predec{$key}} == 0; print "$key => $key\n" and next if $key eq @{$predec{$key}}[0]; print "$key => $key\n"; print "$_ => $key\n" for @{$predec{$key}}; } __DATA__ 567;456 456;345 345;234 234;123 339;228 228;117 131; 435;324 324;213 372;
        This will print the following:
        $ perl predecessor.pl Data structure: $VAR1 = { '123' => [ '234', '345', '456', '567' ], '131' => [ '131' ], '234' => [], '213' => [ '324', '435' ], '345' => [], '324' => [], '372' => [ '372' ], '117' => [ '228', '339' ] }; Result: 123 => 123 234 => 123 345 => 123 456 => 123 567 => 123 131 => 131 213 => 213 324 => 213 435 => 213 372 => 372 117 => 117 228 => 117 339 => 117
        This seems to be what you want, except that the lines are not printed out in the same order as your example. If the order matters, it should be quite simple to sort the lines in the desired order.
Re: Looking for the first item in the chain
by oakb (Scribe) on Aug 09, 2014 at 21:09 UTC
    May I ask, how do the numbers go from one of your chains into the CSV data table that you are using? I ask because it appears you might be able to combine the chain-to-data process with the data manipulation process you show here, and realize a considerable savings over working with the data twice. The only way I can imagine this not being true is if you never see the original chains, only the resulting CSV data.

    Also, are earlier chain numbers always lesser than later numbers, as they are in the sample chains you provided? This isn't a deal breaker, but it provides some interesting possibilities.

      Hi oakb! The raw data does not exist as chains. I see only the csv data (in the real life a table in the database). The earlier chain numbers should be always lesser than later numbers, this is correct.

        Sorry for the delayed reply on this... I've been busy, but I've been using spare gray cells to consider your situation. :)

        Given the rules you have provided and implied, a solution would seem to be fairly simple and straightforward. Indeed, overthinking the issue is a distinct danger to be avoided. The one big reservation I have is that, though you have implied it, you have never explicitly stated that each number is unique to only one chain. That is to say, something like this could never happen:

        123-234-339-456-567 117-234-339 131 213-339-456 372

        In the sample above, I have altered the numbers so that 234, 339, and 456 each appear in at least two chains. Your data sample, which shows only a number and its predecessor in the chain, cannot adequately represent the data in this altered environment, since there now exist multiple paths in the data for determining the first number in the chain. For instance, the first number of the chain ending in 567 is now indistinguishable; using the data as it's formatted, the first number could be 123, 117, or 213 -- although, of course, the only correct answer is 123.

        The reasonable assumption, then, must be to accept and rely on your implicit rule: each three-digit number can only appear in one chain.

        From there, the solution is fairly easy. Since your original question did not involve how to pull in your data, I'm going to simplify my sample code by starting with the data already formatted in a way that Perl understands. I am also going to stick with built-in Perl functions, and not require any extra modules.

        #!/usr/bin/perl -w use strict; my @data = ( [ 567, 456 ], [ 456, 345 ], [ 345, 234 ], [ 234, 123 ], [ 339, 228 ], [ 228, 117 ], [ 131, undef ], [ 435, 324 ], [ 324, 213 ], [ 372, undef ] ); sub replace { my ( $num_first_ref, $pred_num_ref, $num, $pred ) = @_; # Catch the first number in the chain $num_first_ref->{ $pred } = $pred; # Assign current $pred as possible first $num_first_ref->{ $num } = $pred; # Use reverse hash to look up later/higher numbers and update if ( exists( $pred_num_ref->{ $num } ) && ( my $higher_num = $pred +_num_ref->{ $num } ) ) { # Reassign current $pred as possible first $num_first_ref->{ $higher_num } = $pred; replace( $num_first_ref, $pred_num_ref, $higher_num, $pred ); } } my %num_first; # Calculated data for output goes here my %pred_num; # 'Reverse' hash for recursive lookup for my $nums ( @data ) { # Handle single values unless ( defined( $nums->[ 1 ] ) ) { $num_first{ $nums->[ 0 ] } = $nums->[ 0 ]; next; } # Load reverse hash with predecessor as key $pred_num{ $nums->[ 1 ] } = $nums->[ 0 ]; replace( \%num_first, \%pred_num, $nums->[ 0 ], $nums->[ 1 ] ); } for my $high_num ( sort { $a <=> $b } keys %num_first ) { print "$high_num => $num_first{ $high_num }\n"; } Output: 117 => 117 123 => 123 131 => 131 213 => 213 228 => 117 234 => 123 324 => 213 339 => 117 345 => 123 372 => 372 435 => 213 456 => 123 567 => 123

        This uses a reverse hash, %pred_num, and a recursive subroutine, replace, to keep track of later chain numbers that have already been seen, and update them to the current possible first number. It doesn't matter how many data rows you have, or how long the chains are, this will handle any amount of data you throw at it -- as long as the data follows the rules. Boilerplate code for handling rule exceptions can always be added as desired.