Parse a list of path strings into a nested hash

This has been tried (and done) many times before; it's almost a coming-of-age ritual. Heck, I've even done it myself before, long ago.

This is the way I feel like doing it today. What I don't like about it is that its performance is something like O(nm), where m is average number of parts in each path.

I'm sure there are better (for most definitions of "better") ways to do it; for example, tye suggested his Data::Diver module.


sub paths2tree
{
    my $hr = {};
    @{$hr}{@_} = map { {} } @_;
    my $n_repls;
    do
    {
        $n_repls=0;
        for ( sort { length($b) <=> length($a) } keys %$hr )
        {
            if ( /(.*)\\(.*)/ )
            {
                $hr->{$1}{$2} = delete $hr->{$_};
                $n_repls++;
            }
        }
    }
    while ( $n_repls );
    $hr
}
[download]

NB - This is hardcoded for the CP/M-style backslash path separator. You could generalize it if you want to. ;-)

Update: Upon reflection, I'm not sure you can get away from the O(nm) performance. At best, you can make the inner loop more efficient, e.g. by eliminating an explicit for loop and using a //g regex instead. But that doesn't change the "big O".

What is the sound of Windows? Is it not the sound of a wall upon which people have smashed their heads... all the way through?

Comment on Parse a list of path strings into a nested hash Select or Download Code

Replies are listed 'Best First'.
Re: Parse a list of path strings into a nested hash by GrandFather (Saint) on Sep 04, 2010 at 03:14 UTC
Often clean and fast go hand in hand: use strict; use warnings; use Benchmark qw(cmpthese); srand (1); my @testPaths = map { join '\\', ${['c:', 'd:', 'e:']}[rand 2], map {int rand 100} 0 .. rand 6 } 1 .. 100; cmpthese ( -3, { JDP_c => sub {paths2treeJDP_c (@testPaths)}, JDP => sub {paths2treeJDP (@testPaths)}, GF => sub {paths2TreeGF (@testPaths)}, } ); sub paths2TreeGF { my %pTree; for my $path (@_) { my @parts = split /\\/, $path; my $scan = \%pTree; $scan = $scan->{shift @parts} \|\|= {} while @parts; } return \%pTree; } sub paths2treeJDP { my $hr = {}; @{$hr}{@_} = map {{}} @_; my $n_repls; do { $n_repls = 0; for (sort {length ($b) <=> length ($a)} keys %$hr) { if (/(.)\\(.)/) { $hr->{$1}{$2} = delete $hr->{$_}; $n_repls++; } } } while ($n_repls); $hr } sub paths2treeJDP_c { my $hr = {map {$_ => {}} @_}; my $n_repls; do { $n_repls = 0; for (keys %$hr) { if (/(.)\\(.)/) { $hr->{$1}{$2} = delete $hr->{$_}; $n_repls++; } } } while ($n_repls); $hr } [download] Prints: `Rate JDP JDP_c GF JDP 625/s -- -19% -47% JDP_c 773/s 24% -- -34% GF 1175/s 88% 52% --` [download] The _c variant omits the sort for a modest improvement but the more interesting improvement is omitting the need for the delete. True laziness is hard work	[reply] [d/l] [select]
Re^2: Parse a list of path strings into a nested hash by Jim (Curate) on Sep 06, 2010 at 01:52 UTC
Can you please briefly explain paths2TreeGF to us, especially the tricky bit? It looks slick. I just can't get my head around it. `$scan = $scan->{shift @parts} \|\|= {} while @parts;` [download] Thanks.	[reply] [d/l]
Re^3: Parse a list of path strings into a nested hash by GrandFather (Saint) on Sep 06, 2010 at 02:32 UTC
Sure. For each path split the path into parts: `my @parts = split /\\/, $path;` [download] Then for each part in the path starting at the 'root' (drive specifier for example) add it to the parent hash. `%pTree` is the parent hash for the whole tree so we start out by setting `$scan` to reference that: `my $scan = \%pTree;` [download] then the 'tricky' bit adds the parts. `shift @parts` removes the left most part of the path. `$scan->{shift @parts} \|\|= {}` will create a new hash element if there isn't one already. In any case it returns the hash ref for the current part of the path and that becomes the new parent `$scan = ... while @parts;`. Taken all together the 'tricky' bit walks down the tree generating entries as required: `$scan = $scan->{shift @parts} \|\|= {} while @parts;` [download] A few important tricks: `$scan->{shift @parts} \|\|= {}` autovivifies (creates) an entry in the hash if it doesn't exist yet. `\|\|= {}` only assigns a new (empty) hash if a new entry has been created (or in the nasty special case that the directory name is 0). assignment operators (=, \|\|=, +=, ...) 'return' the value assigned which is how $scan gets updated. Note that with Perl 5.10 and later you can use //= in place of \|\|= and it fixes the 0 issue mentioned in item 2. True laziness is hard work	[reply] [d/l] [select]
Re^2: Parse a list of path strings into a nested hash by repellent (Priest) on Sep 09, 2010 at 23:37 UTC
Here's another way that produces a tree whose leaf nodes have `undef` values: `sub paths2treeREPEL { my %tree; for (@_) { my $last = \\%tree; $last = \$$last->{$_} for split /\\/; } return \%tree; }` [download] FWIW: `Rate JDP JDP_c GF REPEL JDP 881/s -- -23% -53% -63% JDP_c 1140/s 29% -- -39% -52% GF 1867/s 112% 64% -- -21% REPEL 2377/s 170% 109% 27% --` [download]	[reply] [d/l] [select]
Re: Parse a list of path strings into a nested hash by ambrus (Abbot) on Sep 06, 2010 at 14:09 UTC
See also Lexicographic tree. By the way, Why aren't you using `split /\\/,` ?	[reply] [d/l]
Re^2: Parse a list of path strings into a nested hash by jdporter (Paladin) on Sep 06, 2010 at 19:05 UTC
Why aren't you using `split /\\/,` ? For precisely the way(s) in which `split /\\/,` is different from `/(.)\\(.)/`. What is the sound of Windows? Is it not the sound of a wall upon which people have smashed their heads... all the way through?	[reply] [d/l] [select]
Re^3: Parse a list of path strings into a nested hash by ambrus (Abbot) on Sep 06, 2010 at 20:53 UTC
You mean that the regex will split from the trailing end, whereas split splits from the leading end?	[reply]
Re^4: Parse a list of path strings into a nested hash by jdporter (Paladin) on Sep 07, 2010 at 14:32 UTC

Back to Cool Uses for Perl