Re^2: Building a UNIX path from Irritating data

My mistake, kyle. My original code pulls the data directly from the command-line utilities provided with the software and formats it on-the-fly, and in my original post, I just created some sample data from memory. I've created some static versions of the data produced by those command-line utilities (see below code), and here's a modified version of code that uses those files:

#!/usr/bin/perl -w
use strict;

my $foldername_file = shift @ARGV;
my $folderfolder_file = shift @ARGV;
my %folders = ();
my %folderpaths = ();

open(FNAMES,"$foldername_file") or die "Can't open $foldername_file: $
+!\n";
foreach my $folderline (<FNAMES>) {
        chomp $folderline;
        if ($folderline =~ /^Folder (\d{3,5})\s+\-\s+(.*)\.$/) {
                $folders{$1} = $2;
        }
}
close(FNAMES);

open(FFINFO,"$folderfolder_file") or die "Can't open $folderfolder_fil
+e: $!\n";
my @subfolders = <FFINFO>;

foreach my $k (sort (keys (%folders))) {
        $folderpaths{$k} = build_path($k,@subfolders);
        print "$k => $folderpaths{$k}\n";
}
exit 0;

sub build_path {
        my $folderid = shift @_;
        my @dumpff = @_;
        my $path = "$folderid";
        my $parentid = "";
        foreach my $line (@dumpff) {
                if ($line =~ /^Folder (\d{3,5})\s+\-\s+.*\./) {
                        $parentid = $1;
                }
                elsif ($line =~ /\s+subfolder\s+$folderid\s+\-\s+.*\./
+) {
                        $path = join('/', build_path($parentid,@subfol
+ders),$folderid);
                }
        }
        return $path;
}
[download]

And you can use the following two data files to make it work:

foldernames.txt
subfolders.txt

Basically, the code above works fine for most all of the data, but there are about 60-70 folder ID's that appear as subfolders to more than one parent folder ID. Folder ID 3053 is a good example. It is a subfolder under 3051, 3057, 3063, and 3067. If you run my code, however, the path printed for 3053 is:

3053 => 100/3051/3053

To be totally complete, I would need to also print:

100/3057/3053
100/3063/3053
100/3067/3053

The only role build_dirs has is to find the parent folder ID of the folder ID it is given, and if the parent ID it finds isn't a root-level node, it calls itself again with the parent ID until a full path is created. However, since I'm building the path layer by layer, I can't figure out when to check for a duplicate path. Each iteration of build_dirs doesn't have any knowledge of any complete paths that have been found. I suppose I could return all the results from a search, but I'm not sure how that would look. Are you suggesting something like this?

3053 => 100/3051:3057:3063:3067/3053

Thank you for your assistance!

Comment on Re^2: Building a UNIX path from Irritating data Download Code

Replies are listed 'Best First'.
Re^3: Building a UNIX path from Irritating data by kyle (Abbot) on Nov 26, 2009 at 02:14 UTC
I replaced everything after the second open with this: my %parents_of; my $parentid; while ( my $line = <FFINFO> ) { if ($line =~ /^Folder (\d{3,5})\s+\-\s+.\./) { $parentid = $1; } elsif ($line =~ /\s+subfolder\s+(\S+)\s+\-\s+.\./) { push @{ $parents_of{ $1 } }, $parentid; } } sub build_path { my $folderid = shift @_; my @parents = @{ $parents_of{ $folderid } \|\| [] }; return @parents ? [ map { map { "$_/$folderid" } @{ build_path( +$_ ) } } @parents ] : [ $folderid ]; } foreach my $k (sort (keys (%folders))) { $folderpaths{$k} = build_path($k); print "$k => $_\n" for @{ $folderpaths{$k} }; } [download] The highlights of the changes are: I read the subfolders file only once. I store the subfolders data in a `%parents_of` hash of arrays, which maps each folder ID to the folders that have it as a subfolder. In `build_path`, I use the hash of arrays instead of iterating over the whole subfolders file. It's still recursive. Since `build_path` returns an array reference now, the output loop has to iterate its contents. For the ID that you pointed out, the new output is: `3053 => 100/3051/3053 3053 => 100/3057/3053 3053 => 100/3063/3053 3053 => 100/3066/3053` [download] The new code produces all the output of the old code plus another 1108 lines, and it runs in less than 1 second while the original takes about 220 seconds. I hope this helps.	[reply] [d/l] [select]
Re^4: Building a UNIX path from Irritating data by roswell1329 (Acolyte) on Nov 29, 2009 at 23:22 UTC
Hi kyle -- I've been working on your code for a while, and I cannot yet see how it builds a path, and I cannot get it to work for me. I copied your code above exactly as you said and used the data file that I provided, but here's the output I get: 100 => 100 101 => 101 1013 => 1013 1014 => 1014 1015 => 1015 1053 => 1053 1057 => 1057 1059 => 1059 1065 => 1065 119 => 119 1198 => 1198 1227 => 1227 1228 => 1228 1229 => 1229 1230 => 1230 1231 => 1231 1232 => 1232 1233 => 1233 1234 => 1234 1238 => 1238 1239 => 1239 1241 => 1241 1298 => 1298 1299 => 1299 1300 => 1300 1311 => 1311 1317 => 1317 1318 => 1318 1320 => 1320 1321 => 1321 1322 => 1322 1323 => 1323 1324 => 1324 1325 => 1325 1327 => 1327 1328 => 1328 1329 => 1329 1347 => 1347 1348 => 1348 1349 => 1349 1350 => 1350 1351 => 1351 1352 => 1352 1374 => 1374 1375 => 1375 1376 => 1376 1377 => 1377 1390 => 1390 1393 => 1393 1394 => 1394 1395 => 1395 1396 => 1396 1397 => 1397 1398 => 1398 ... [download] I think the line that may be wrong in my code is this one: `return @parents ? [ map { map { "$_/$folderid" } @{ build_path( $_ ) } } @parents ] : [ $folderid ];` [download] I've never seen a "return" command used in a conditional this way. It would seem to me that in order for the conditional to be true, the focus would have already returned to the line calling build_path originally. Does this actually mean if "return @parents" would succeed then do X, if not, Y? I really appreciate your help. I think this is very close to what I need, but I'm missing some keystone. To make sure I got your code correct, here's the exact script I'm running (in cygwin on a windows system, and on an HPUX system): #!/usr/bin/perl -w use strict; my $foldername_file = shift @ARGV; my $folderfolder_file = shift @ARGV; my %folders = (); my %folderpaths = (); open(FNAMES,"$foldername_file") or die "Can't open $foldername_file: $ +!\n"; foreach my $folderline (<FNAMES>) { chomp $folderline; if ($folderline =~ /^Folder (\d{3,5})\s+\-\s+(.)\.$/) { $folders{$1} = $2; } } close(FNAMES); open(FFINFO,"$folderfolder_file") or die "Can't open $folderfolder_fil +e: $!\n"; my @subfolders = <FFINFO>; my %parents_of; my $parentid; while ( my $line = <FFINFO> ) { if ($line =~ /^Folder (\d{3,5})\s+\-\s+.\./) { $parentid = $1; } elsif ($line =~ /\s+subfolder\s+(\S+)\s+\-\s+.*\./) { push @{ $parents_of{ $1 } }, $parentid; } } sub build_path { my $folderid = shift @_; my @parents = @{ $parents_of{ $folderid } \|\| [] }; return @parents ? [ map { map { "$_/$folderid" } @{ build_path( +$_ ) } } @parents ] : [ $folderid ]; } foreach my $k (sort (keys (%folders))) { $folderpaths{$k} = build_path($k); print "$k => $_\n" for @{ $folderpaths{$k} }; } [download]	[reply] [d/l] [select]
Re^5: Building a UNIX path from Irritating data by kyle (Abbot) on Nov 30, 2009 at 13:24 UTC
I think the problem is that what you have still reads everything into `@subfolders`. Here's the snippet I'm talking about. `open(FFINFO,"$folderfolder_file") or die "Can't open $folderfolder_fil +e: $!\n"; # This reads in every line from FFINFO my @subfolders = <FFINFO>; my %parents_of; my $parentid; # This loop also wants to read every line, # but the handle is already at EOF, # so it gets nothing. while ( my $line = <FFINFO> ) {` [download] If you need `@subfolders` for some purpose outside the code we're talking about, you could still have it. Just declare `my @subfolders;` at the same scope as `$parentid` and `%parents_of`, without initializing it. Then, right inside the `FFINFO` read loop, push every `$line` in. It winds up looking like this: `open(FFINFO,"$folderfolder_file") or die "Can't open $folderfolder_fil +e: $!\n"; my @subfolders; my %parents_of; my $parentid; while ( my $line = <FFINFO> ) { push @subfolders, $line;` [download] You also had a question about this: `return @parents ? [ map { map { "$_/$folderid" } @{ build_path( +$_ ) } } @parents ] : [ $folderid ];` [download] What this means is, "if `@parents` is non-empty, return first expression (with nested maps), else return the second expression (`$folderid` alone in an array reference)." Your description leads me to believe that you have the precedence mixed up. What you describe is this: `( return @parents ) ? [ X ] : [ Y ];` [download] What's happening is this: `return ( @parents ? [ X ] : [ Y ] );` [download] You can see this kind of thing yourself using B::Deparse. I ran this command line: `perl -MO=Deparse,-p input-file.pl` [download] With the `-p`, it puts parentheses in to clarify how expressions are interpreted. As a result, it showed me this (along with all the other code): `return((@parents ? [map({map({"$_/$folderid";} @{build_path($_);});} @ +parents)] : [$folderid]));` [download] That's not pretty, but it does show that the `return` is "outside" the whole rest of the line. I hope this helps. I feel I've written in some haste, so I wouldn't be surprised if I've been unclear. If I have left you with any other questions, feel free to ask.	[reply] [d/l] [select]
Re^6: Building a UNIX path from Irritating data by roswell1329 (Acolyte) on Nov 30, 2009 at 23:09 UTC


Come for the quick hacks, stay for the epiphanies.
	PerlMonks