fourmi has asked for the wisdom of the Perl Monks concerning the following question:
Hi,
I have a base directory containing (amongst other files/dirs) eleven sub-directories (Sub1-Sub11), each of which contains approx 10,000 html files (amongst other files). These don't actually contain html, just a list of keywords (30 max). I wish to produce a sorted list of unique keywords. There is a lot of repetition, and new keywords can be added at any time to any file (hence this needs to be done dynamically). Worst case is that I might have to sort and uniq a list of (11 x 10,000 x 30) 3,300,000 items. I can't flatten the directory structure as the OS (Windows) has difficulty with directories containing such large numbers of small files. My code is definitely not optimised; I've never really worried about optimisation for saving milliseconds, but given the magnitude here, I am happy to save minutes! Apologies for any offensive code!
Any solutions, pointers, references, or just thoughts and comments will be well received.
Thanks ant
use strict;
use Cwd;
my ($PWD) = getcwd;
my (@DirList);
my ($DirItem);
my (@SUBDirList);
my ($SUBDirItem);
my ($CurrentHtmlFile);
my (@Lines);
my ($Line);
my (%een);
my (@KeyWords);
opendir(DIR, $PWD) || die "Cannot Open The Directory \"$PWD\"\n";
@DirList = readdir(DIR);
closedir DIR;
foreach $DirItem (@DirList)
{
if ($DirItem =~ /^Sub/)
{
opendir(SUBDIR, "$PWD\\$DirItem") || die "Cannot Open The Directory \"$PWD\\$DirItem\"\n";
@SUBDirList = readdir(SUBDIR);
closedir SUBDIR;
foreach $SUBDirItem (@SUBDirList)
{
if ( $SUBDirItem =~ /html$/)
{
$CurrentHtmlFile = "$PWD\\$DirItem\\$SUBDirItem";
open (READ, "<$CurrentHtmlFile") || die "Couldn't Read From $CurrentHtmlFile";
$Line = <READ>;
@Lines = split (/,/,$Line);
close (READ);
push (@KeyWords, @Lines);
}
}
}
}
foreach (@KeyWords) {++$een{$_};}
print sort keys %een;
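For comparison, a minimal sketch of the same collection with the counting hash filled as each file is read, so the 3.3-million-element @KeyWords array never has to exist in memory. The glob pattern and one-keyword-line-per-file layout are taken from the question; add_keywords() is a hypothetical helper name, not from the post.

```perl
use strict;
use warnings;

# add_keywords(): split one comma-separated keyword line into the
# counting hash.  Factored out so it can be exercised on its own.
sub add_keywords {
    my ($line, $seen) = @_;
    return unless defined $line;
    chomp $line;
    $seen->{$_}++ for split /,/, $line;
}

my %seen;
for my $file (glob "Sub*/*html") {      # same layout as in the question
    open my $fh, '<', $file or die "Couldn't read $file: $!";
    add_keywords(scalar <$fh>, \%seen); # each file holds one keyword line
    close $fh;
}
print "$_\n" for sort keys %seen;
```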
Re: Code Efficiency
by PodMaster (Abbot) on Mar 25, 2004 at 11:42 UTC
looking up Plucene. the dying is just at this stage, will lose it on implementation. very nice point re the timestamp, and again with the giant hash & giant array
thanks!
Re: Code Efficiency
by Jaap (Curate) on Mar 25, 2004 at 11:31 UTC
1. Lose the parens () around a variable declaration (just my $PWD;). Edit, re Abigail: this is just a style issue; unless you need the parens, i'd omit them.
2. Do not use @KeyWords, just put them in the hash as you find them:
foreach (split (/,/,$Line))
{
$een{$_}++;
}
3. Declare variables as late as possible (if you want the advantages of use strict;).
4. I assume you use @DirList and @SUBDirList because you can't 'flatten' the directory structure? There is no real speed problem here, but my instinct tells me to lose the two arrays and open the subdirs in a while loop over readdir(DIR).
These are all minor items; none of them speeds up the most time-consuming part.
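A sketch of points 2 and 4 combined: stream readdir in while loops (no intermediate @DirList/@SUBDirList arrays) and count keywords straight into the hash. gather() is a hypothetical name; the Sub*/​*html layout is from the question.

```perl
use strict;
use warnings;

# gather($base): walk $base/Sub*/ *html and return a hashref of keyword
# counts.  Each while loop consumes readdir entries as they arrive.
sub gather {
    my ($base) = @_;
    my %een;
    opendir my $dh, $base or die "Cannot open $base: $!";
    while (my $sub = readdir $dh) {
        next unless $sub =~ /^Sub/ && -d "$base/$sub";
        opendir my $sdh, "$base/$sub" or die "Cannot open $base/$sub: $!";
        while (my $file = readdir $sdh) {
            next unless $file =~ /html$/;
            open my $fh, '<', "$base/$sub/$file"
                or die "Couldn't read $base/$sub/$file: $!";
            my $line = <$fh>;           # files hold one keyword line
            close $fh;
            next unless defined $line;
            chomp $line;
            $een{$_}++ for split /,/, $line;
        }
        closedir $sdh;
    }
    closedir $dh;
    return \%een;
}

print "$_\n" for sort keys %{ gather('.') };
```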
Lose the parens () around a variable declaration (just my $PWD;).
Why? It's utterly useless to give "suggestions" without motivating them. It's even more useless to give suggestions that appear to be general, but which are sometimes plain wrong. There's a difference between my $var and my ($var), and while sometimes it doesn't matter whether you place the parens (like in the code being discussed), it sometimes changes the meaning of the expression. You seem to think it matters in this case. Could you explain why?
Declare variables as late as possible (if you want the advantages of use strict;).
Huh? While declaring variables in as restrictive a scope as necessary is usually a good thing, what does that have to do with any advantage of use strict?
Abigail
Declare variables as late as possible (if you want the advantages of use strict;).
--
Huh? While declaring variables in as restrictive a scope as necessary is usually a good thing, what does that have to do with any advantage of use strict?
The first use of a variable is usually important for consequent uses.
If you declare the variable as late as possible, you make sure the program breaks when you remove the first use, but not all others.
my $foo; # Predeclared: bad style
# Usually done by C-coders, who predeclare a whole bunch of variables:
# my ($foo, $bar, $baz, $quux, $whatever, $i, $think, $is, $needed, $anywhere);
...
$foo = foo();
...
print $foo;
Now, I remove the assignment:
my $foo;
...
print $foo; # No error, just a warning at run time.
However, if you declare $foo as late as possible:
my $foo = foo(); # Declared when needed
...
print $foo;
Removing the assignment again:
...
print $foo; # Error at compile time!
cool, thanks a lot, useful stuff i wouldn't have thought about, i agree that the guts are going to take ages anyway though!
cheers
What is een? Are you one of those people who have rrays and calars?
Actually, calar would be a very nice looking name (when compared to the other two) :D.
Re: Code Efficiency
by Abigail-II (Bishop) on Mar 25, 2004 at 13:10 UTC
If I understand correctly, this can be done in the shell using less than 20 characters (yeah, shell!):
sort -u Sub*/*html
With the added benefit that sort knows how to sort huge lists efficiently. (This is assuming that your shell doesn't have problems with the number of arguments that result from the expansion, but even then a shell solution is way smaller.)
But if you really want to save time, you shouldn't look at the sorting/uniquing phase. Look at the updating phase. Keep files sorted. Reduce the number of files. Reduce the number of files per directory.
Abigail
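One caveat worth hedging here: each .html file in the question holds a single comma-separated line, so sort -u on those files deduplicates whole lines rather than individual keywords. A sketch of a small Perl filter that splits on commas first; unique_keywords() is a hypothetical name, and the script would be invoked as `perl uniq_kw.pl Sub*/*html`.

```perl
use strict;
use warnings;

# unique_keywords($fh): read comma-separated keyword lines from a
# filehandle and return the sorted unique keywords.
sub unique_keywords {
    my ($fh) = @_;
    my %seen;
    while (my $line = <$fh>) {
        chomp $line;
        $seen{$_}++ for split /,/, $line;
    }
    return sort keys %seen;
}

if (@ARGV) {                                  # <ARGV> walks the file list
    print "$_\n" for unique_keywords(\*ARGV);
}
```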
i WISH i was using nix! just came to the same conclusion re: updating too. i can't change the tree/file structure, but the updating is a cinch now hopefully (cue BSOD) cheers!
windows also has a built-in sort.
run help sort from the command line for options.
Re: Code Efficiency
by Hena (Friar) on Mar 25, 2004 at 12:27 UTC
You could lose one if structure (the test for html in the last foreach loop) by using grep on the readdir.
# instead of
# @SUBDirList = readdir(SUBDIR);
@SUBDirList = grep {/html$/} readdir (SUBDIR);
Don't know how much speed it gives though :).
I'll take anything that goes, thanks a lot. Also applicable to the other regexp if (less involved there, only 11 iterations, but all the same)!!
cheers!
Re: Code Efficiency
by fourmi (Scribe) on Mar 25, 2004 at 13:16 UTC
a bit of background: i have an image library, keywords for image \d+.jpg are stored in \d+.html, and this can be updated
Slight paradigm shift. Am not going to run this dynamically each time a keyword file is updated, but instead just update the keyword list (trawl through the keyword list; if seen, nothing, else append the new keyword to the end). Keywords can only be added, not removed, which should reduce the load considerably.
Thanks to all that commented, it will definitely help with the initial collection, and i'm a better perl programmer now too!
cheers
Re: Code Efficiency
by MidLifeXis (Monsignor) on Mar 25, 2004 at 18:16 UTC
If you have 10,000 files in a directory, you may also be running into problems with the OS's ability to handle directories that large. See my writeup on this in a different thread.
Depending on the OS, you are better off (from a memory / OS standpoint) making your directory structure deeper, so that when the OS is opening the file, it does not need to read more than necessary. The wider / flatter the directory structure, the more resources may be necessary to get to an individual file (on open() for example).
--MidLifeXis
Hi,
yeah i definitely found that windows tends to get screwy once a dir has more than 30,000 files in it, hence splitting into smaller chunks; even then it's still fairly screwy. that is a VERY interesting thread you link to, thanks very much, i didn't catch it initially.
cheers!
Re: Code Efficiency
by TilRMan (Friar) on Mar 25, 2004 at 19:49 UTC
You say "new keywords can be added at any time." What about changing and deleting? If you only have keywords added, then you could optimize (for time) very efficiently by just detecting what new keywords show up and then inserting them into the sorted list. For instance, make a cache (copy) of the tree and compare the old vs. the new.
The next optimization step: Keep a timestamp (and/or hash) of every file instead of a copy. Since you are only adding keywords, any time a file changes, dump its whole contents into the sorted list.
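The timestamp idea above can be sketched like this: remember each keyword file's mtime from the last run, and only re-read files whose mtime has changed. changed_files() and the shape of the cache are hypothetical names, not from the post.

```perl
use strict;
use warnings;

# changed_files(\%cache, @files): return the files whose mtime differs
# from the cached value (or that are new), updating the cache in place.
sub changed_files {
    my ($last_seen, @files) = @_;        # $last_seen: { filename => mtime }
    my @changed;
    for my $file (@files) {
        my $mtime = (stat $file)[9];     # element 9 is the modify time
        next unless defined $mtime;      # skip files that vanished
        if (!exists $last_seen->{$file} || $last_seen->{$file} != $mtime) {
            push @changed, $file;
            $last_seen->{$file} = $mtime;
        }
    }
    return @changed;   # only these need their keywords merged in
}
```

Since keywords are only ever added, dumping the whole contents of each changed file into the sorted list (as suggested above) is safe; nothing ever needs to be retracted.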
Hi
It's a photo archive, so hopefully previous keywords won't become incorrect (carrots shouldn't disappear from the image!).
Basically what i have done now is make a file 'UBER1' containing
PicRefID: Keyword1, Keyword2 ... The keyword list is basically another file 'UBER2', identical but without the PicRefIDs, sorted, and uniq'd. If a keyword is added to a picture, it is added to the relevant PicRefID line in 'UBER1', and if the new keyword does not already appear in 'UBER2', it is inserted in the relevant place.
It's all much much nicer now. All the data in the 100,000-ish files is in one file, and the keyword list only needs occasional updating.
Basically my initial plan was based on a tiny subset running well, not expecting such massive overheads. Now it only takes a minute or two, and that's more the upload of 15M of keywords than the script load (and then the upload!)
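The UBER2 maintenance just described can be sketched as follows. The one-keyword-per-line layout and the add_new_keywords() helper name are assumptions, not fourmi's actual code; the rewrite-sorted step stands in for "places it in the relevant place".

```perl
use strict;
use warnings;

# add_new_keywords($listfile, @keywords): load the existing unique
# keyword list into a hash, merge in any genuinely new keywords, and
# rewrite the file sorted.  Returns the keywords that were new.
sub add_new_keywords {
    my ($listfile, @keywords) = @_;
    my %known;
    if (open my $in, '<', $listfile) {   # a missing file == empty list
        chomp(my @lines = <$in>);
        $known{$_} = 1 for @lines;
        close $in;
    }
    my @new = grep { !$known{$_}++ } @keywords;   # also dedups @keywords
    return () unless @new;
    open my $out, '>', $listfile or die "Can't write $listfile: $!";
    print $out "$_\n" for sort keys %known;       # keep UBER2 sorted
    close $out;
    return @new;
}
```

Rewriting the whole file keeps it sorted at the cost of one linear pass per update, which for a 15M keyword list is still cheap compared with re-scanning 100,000 files.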
Cheers to all that helped, much appreciated!!
ant