Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask

Parse a list of path strings into a nested hash

by jdporter (Canon)
on Sep 03, 2010 at 20:05 UTC ( #858801=CUFP: print w/replies, xml ) Need Help??

This has been tried (and done) many times before; it's almost a coming-of-age ritual. Heck, I've even done it myself before, long ago.

This is the way I feel like doing it today. What I don't like about it is that its performance is something like O(nm), where m is average number of parts in each path.

I'm sure there are better (for most definitions of "better") ways to do it; for example, tye suggested his Data::Diver module.

sub paths2tree { my $hr = {}; @{$hr}{@_} = map { {} } @_; my $n_repls; do { $n_repls=0; for ( sort { length($b) <=> length($a) } keys %$hr ) { if ( /(.*)\\(.*)/ ) { $hr->{$1}{$2} = delete $hr->{$_}; $n_repls++; } } } while ( $n_repls ); $hr }

NB - This is hardcoded for the CP/M-style backslash path separator. You could generalize it if you want to. ;-)

Update: Upon reflection, I'm not sure you can get away from the O(nm) performance. At best, you can make the inner loop more efficient, e.g. by eliminating an explicit for loop and using a //g regex instead. But that doesn't change the "big O".

What is the sound of Windows? Is it not the sound of a wall upon which people have smashed their heads... all the way through?

Replies are listed 'Best First'.
Re: Parse a list of path strings into a nested hash
by GrandFather (Sage) on Sep 04, 2010 at 03:14 UTC

    Often clean and fast go hand in hand:

    use strict; use warnings; use Benchmark qw(cmpthese); srand (1); my @testPaths = map { join '\\', ${['c:', 'd:', 'e:']}[rand 2], map {int rand 100} 0 .. rand 6 } 1 .. 100; cmpthese ( -3, { JDP_c => sub {paths2treeJDP_c (@testPaths)}, JDP => sub {paths2treeJDP (@testPaths)}, GF => sub {paths2TreeGF (@testPaths)}, } ); sub paths2TreeGF { my %pTree; for my $path (@_) { my @parts = split /\\/, $path; my $scan = \%pTree; $scan = $scan->{shift @parts} ||= {} while @parts; } return \%pTree; } sub paths2treeJDP { my $hr = {}; @{$hr}{@_} = map {{}} @_; my $n_repls; do { $n_repls = 0; for (sort {length ($b) <=> length ($a)} keys %$hr) { if (/(.*)\\(.*)/) { $hr->{$1}{$2} = delete $hr->{$_}; $n_repls++; } } } while ($n_repls); $hr } sub paths2treeJDP_c { my $hr = {map {$_ => {}} @_}; my $n_repls; do { $n_repls = 0; for (keys %$hr) { if (/(.*)\\(.*)/) { $hr->{$1}{$2} = delete $hr->{$_}; $n_repls++; } } } while ($n_repls); $hr }


    Rate JDP JDP_c GF JDP 625/s -- -19% -47% JDP_c 773/s 24% -- -34% GF 1175/s 88% 52% --

    The _c variant omits the sort for a modest improvement but the more interesting improvement is omitting the need for the delete.

    True laziness is hard work

      Can you please briefly explain paths2TreeGF to us, especially the tricky bit? It looks slick. I just can't get my head around it.

      $scan = $scan->{shift @parts} ||= {} while @parts;


        Sure. For each path split the path into parts:

        my @parts = split /\\/, $path;

        Then for each part in the path starting at the 'root' (drive specifier for example) add it to the parent hash. %pTree is the parent hash for the whole tree so we start out by setting $scan to reference that:

        my $scan = \%pTree;

        then the 'tricky' bit adds the parts. shift @parts removes the left most part of the path. $scan->{shift @parts} ||= {} will create a new hash element if there isn't one already. In any case it returns the hash ref for the current part of the path and that becomes the new parent $scan = ... while @parts;. Taken all together the 'tricky' bit walks down the tree generating entries as required:

        $scan = $scan->{shift @parts} ||= {} while @parts;

        A few important tricks:

        1. $scan->{shift @parts} ||= {} autovivifies (creates) an entry in the hash if it doesn't exist yet.
        2. ||= {} only assigns a new (empty) hash if a new entry has been created (or in the nasty special case that the directory name is 0).
        3. assignment operators (=, ||=, +=, ...) 'return' the value assigned which is how $scan gets updated.

        Note that with Perl 5.10 and later you can use //= in place of ||= and it fixes the 0 issue mentioned in item 2.

        True laziness is hard work
      Here's another way that produces a tree whose leaf nodes have undef values:
      sub paths2treeREPEL { my %tree; for (@_) { my $last = \\%tree; $last = \$$last->{$_} for split /\\/; } return \%tree; }

      Rate JDP JDP_c GF REPEL JDP 881/s -- -23% -53% -63% JDP_c 1140/s 29% -- -39% -52% GF 1867/s 112% 64% -- -21% REPEL 2377/s 170% 109% 27% --
Re: Parse a list of path strings into a nested hash
by ambrus (Abbot) on Sep 06, 2010 at 14:09 UTC
      Why aren't you using split /\\/, ?

      For precisely the way(s) in which split /\\/, is different from /(.*)\\(.*)/.

      What is the sound of Windows? Is it not the sound of a wall upon which people have smashed their heads... all the way through?

        You mean that the regex will split from the trailing end, whereas split splits from the leading end?

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: CUFP [id://858801]
Approved by GrandFather
Front-paged by ww
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (3)
As of 2020-09-20 12:18 GMT
Find Nodes?
    Voting Booth?
    If at first I donít succeed, I Ö

    Results (121 votes). Check out past polls.