Beefy Boxes and Bandwidth Generously Provided by pair Networks
Come for the quick hacks, stay for the epiphanies.
 
PerlMonks  

Re^2: Building a UNIX path from Irritating data

by roswell1329 (Acolyte)
on Nov 25, 2009 at 22:50 UTC ( [id://809452]=note: print w/replies, xml ) Need Help??


in reply to Re: Building a UNIX path from Irritating data
in thread Building a UNIX path from Irritating data

My mistake, kyle. My original code pulls the data directly from the command-line utilities provided with the software and formats it on-the-fly, and in my original post, I just created some sample data from memory. I've created some static versions of the data produced by those command-line utilities (see below code), and here's a modified version of code that uses those files:
#!/usr/bin/perl -w use strict; my $foldername_file = shift @ARGV; my $folderfolder_file = shift @ARGV; my %folders = (); my %folderpaths = (); open(FNAMES,"$foldername_file") or die "Can't open $foldername_file: $ +!\n"; foreach my $folderline (<FNAMES>) { chomp $folderline; if ($folderline =~ /^Folder (\d{3,5})\s+\-\s+(.*)\.$/) { $folders{$1} = $2; } } close(FNAMES); open(FFINFO,"$folderfolder_file") or die "Can't open $folderfolder_fil +e: $!\n"; my @subfolders = <FFINFO>; foreach my $k (sort (keys (%folders))) { $folderpaths{$k} = build_path($k,@subfolders); print "$k => $folderpaths{$k}\n"; } exit 0; sub build_path { my $folderid = shift @_; my @dumpff = @_; my $path = "$folderid"; my $parentid = ""; foreach my $line (@dumpff) { if ($line =~ /^Folder (\d{3,5})\s+\-\s+.*\./) { $parentid = $1; } elsif ($line =~ /\s+subfolder\s+$folderid\s+\-\s+.*\./ +) { $path = join('/', build_path($parentid,@subfol +ders),$folderid); } } return $path; }
And you can use the following two data files to make it work:

foldernames.txt
subfolders.txt

Basically, the code above works fine for most all of the data, but there are about 60-70 folder ID's that appear as subfolders to more than one parent folder ID. Folder ID 3053 is a good example. It is a subfolder under 3051, 3057, 3063, and 3067. If you run my code, however, the path printed for 3053 is:

3053 => 100/3051/3053

To be totally complete, I would need to also print:

100/3057/3053
100/3063/3053
100/3067/3053

The only role build_dirs has is to find the parent folder ID of the folder ID it is given, and if the parent ID it finds isn't a root-level node, it calls itself again with the parent ID until a full path is created. However, since I'm building the path layer by layer, I can't figure out when to check for a duplicate path. Each iteration of build_dirs doesn't have any knowledge of any complete paths that have been found. I suppose I could return all the results from a search, but I'm not sure how that would look. Are you suggesting something like this?

3053 => 100/3051:3057:3063:3067/3053

Thank you for your assistance!

Replies are listed 'Best First'.
Re^3: Building a UNIX path from Irritating data
by kyle (Abbot) on Nov 26, 2009 at 02:14 UTC

    I replaced everything after the second open with this:

    my %parents_of; my $parentid; while ( my $line = <FFINFO> ) { if ($line =~ /^Folder (\d{3,5})\s+\-\s+.*\./) { $parentid = $1; } elsif ($line =~ /\s+subfolder\s+(\S+)\s+\-\s+.*\./) { push @{ $parents_of{ $1 } }, $parentid; } } sub build_path { my $folderid = shift @_; my @parents = @{ $parents_of{ $folderid } || [] }; return @parents ? [ map { map { "$_/$folderid" } @{ build_path( +$_ ) } } @parents ] : [ $folderid ]; } foreach my $k (sort (keys (%folders))) { $folderpaths{$k} = build_path($k); print "$k => $_\n" for @{ $folderpaths{$k} }; }

    The highlights of the changes are:

    • I read the subfolders file only once.
    • I store the subfolders data in a %parents_of hash of arrays, which maps each folder ID to the folders that have it as a subfolder.
    • In build_path, I use the hash of arrays instead of iterating over the whole subfolders file. It's still recursive.
    • Since build_path returns an array reference now, the output loop has to iterate its contents.

    For the ID that you pointed out, the new output is:

    3053 => 100/3051/3053 3053 => 100/3057/3053 3053 => 100/3063/3053 3053 => 100/3066/3053

    The new code produces all the output of the old code plus another 1108 lines, and it runs in less than 1 second while the original takes about 220 seconds.

    I hope this helps.

      Hi kyle --

      I've been working on your code for a while, and I cannot yet see how it builds a path, and I cannot get it to work for me. I copied your code above exactly as you said and used the data file that I provided, but here's the output I get:

      100 => 100 101 => 101 1013 => 1013 1014 => 1014 1015 => 1015 1053 => 1053 1057 => 1057 1059 => 1059 1065 => 1065 119 => 119 1198 => 1198 1227 => 1227 1228 => 1228 1229 => 1229 1230 => 1230 1231 => 1231 1232 => 1232 1233 => 1233 1234 => 1234 1238 => 1238 1239 => 1239 1241 => 1241 1298 => 1298 1299 => 1299 1300 => 1300 1311 => 1311 1317 => 1317 1318 => 1318 1320 => 1320 1321 => 1321 1322 => 1322 1323 => 1323 1324 => 1324 1325 => 1325 1327 => 1327 1328 => 1328 1329 => 1329 1347 => 1347 1348 => 1348 1349 => 1349 1350 => 1350 1351 => 1351 1352 => 1352 1374 => 1374 1375 => 1375 1376 => 1376 1377 => 1377 1390 => 1390 1393 => 1393 1394 => 1394 1395 => 1395 1396 => 1396 1397 => 1397 1398 => 1398 ...
      I think the line that may be wrong in my code is this one:
      return @parents ? [ map { map { "$_/$folderid" } @{ build_path( $_ ) } } @parents ] : [ $folderid ];
      I've never seen a "return" command used in a conditional this way. It would seem to me that in order for the conditional to be true, the focus would have already returned to the line calling build_path originally. Does this actually mean if "return @parents" would succeed then do X, if not, Y?

      I really appreciate your help. I think this is very close to what I need, but I'm missing some keystone. To make sure I got your code correct, here's the exact script I'm running (in cygwin on a windows system, and on an HPUX system):

      #!/usr/bin/perl -w use strict; my $foldername_file = shift @ARGV; my $folderfolder_file = shift @ARGV; my %folders = (); my %folderpaths = (); open(FNAMES,"$foldername_file") or die "Can't open $foldername_file: $ +!\n"; foreach my $folderline (<FNAMES>) { chomp $folderline; if ($folderline =~ /^Folder (\d{3,5})\s+\-\s+(.*)\.$/) { $folders{$1} = $2; } } close(FNAMES); open(FFINFO,"$folderfolder_file") or die "Can't open $folderfolder_fil +e: $!\n"; my @subfolders = <FFINFO>; my %parents_of; my $parentid; while ( my $line = <FFINFO> ) { if ($line =~ /^Folder (\d{3,5})\s+\-\s+.*\./) { $parentid = $1; } elsif ($line =~ /\s+subfolder\s+(\S+)\s+\-\s+.*\./) { push @{ $parents_of{ $1 } }, $parentid; } } sub build_path { my $folderid = shift @_; my @parents = @{ $parents_of{ $folderid } || [] }; return @parents ? [ map { map { "$_/$folderid" } @{ build_path( +$_ ) } } @parents ] : [ $folderid ]; } foreach my $k (sort (keys (%folders))) { $folderpaths{$k} = build_path($k); print "$k => $_\n" for @{ $folderpaths{$k} }; }

        I think the problem is that what you have still reads everything into @subfolders. Here's the snippet I'm talking about.

        open(FFINFO,"$folderfolder_file") or die "Can't open $folderfolder_fil +e: $!\n"; # This reads in every line from FFINFO my @subfolders = <FFINFO>; my %parents_of; my $parentid; # This loop also wants to read every line, # but the handle is already at EOF, # so it gets nothing. while ( my $line = <FFINFO> ) {

        If you need @subfolders for some purpose outside the code we're talking about, you could still have it. Just declare my @subfolders; at the same scope as $parentid and %parents_of, without initializing it. Then, right inside the FFINFO read loop, push every $line in. It winds up looking like this:

        open(FFINFO,"$folderfolder_file") or die "Can't open $folderfolder_fil +e: $!\n"; my @subfolders; my %parents_of; my $parentid; while ( my $line = <FFINFO> ) { push @subfolders, $line;

        You also had a question about this:

        return @parents ? [ map { map { "$_/$folderid" } @{ build_path( +$_ ) } } @parents ] : [ $folderid ];

        What this means is, "if @parents is non-empty, return first expression (with nested maps), else return the second expression ($folderid alone in an array reference)." Your description leads me to believe that you have the precedence mixed up. What you describe is this:

        ( return @parents ) ? [ X ] : [ Y ];

        What's happening is this:

        return ( @parents ? [ X ] : [ Y ] );

        You can see this kind of thing yourself using B::Deparse. I ran this command line:

        perl -MO=Deparse,-p input-file.pl

        With the -p, it puts parentheses in to clarify how expressions are interpreted. As a result, it showed me this (along with all the other code):

        return((@parents ? [map({map({"$_/$folderid";} @{build_path($_);});} @ +parents)] : [$folderid]));

        That's not pretty, but it does show that the return is "outside" the whole rest of the line.

        I hope this helps. I feel I've written in some haste, so I wouldn't be surprised if I've been unclear. If I have left you with any other questions, feel free to ask.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://809452]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others taking refuge in the Monastery: (6)
As of 2024-04-25 15:28 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found