Code Efficiency

by fourmi (Scribe)
on Mar 25, 2004 at 11:19 UTC

fourmi has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I have a base directory containing (amongst other files/dirs) eleven subdirectories (Sub1-Sub11). Each of these contains approx 10,000 html files (amongst other files), which don't actually contain html, just a list of keywords (30 max). I wish to produce a sorted list of unique keywords. There is a lot of repetition, and new keywords can be added at any time to any file (hence this needs to be done dynamically). Worst case is that i might have to sort and uniq a list of (11 x 10,000 x 30) 3,300,000 items. I can't flatten the directory structure as the OS (Windows) has difficulty with directories containing such large numbers of small files.

My code is definitely not optimised; I've never really worried about optimisation for saving milliseconds, but given the magnitude here, I am happy to save minutes! Apologies for any offensive code!

Any solutions, pointers, references, or just thoughts and comments will be well received.

Thanks
ant
use strict;
use Cwd;

my ($PWD) = getcwd;
my (@DirList);    my ($DirItem);
my (@SUBDirList); my ($SUBDirItem);
my ($CurrentHtmlFile);
my (@Lines); my ($Line);
my (%een);   my (@KeyWords);

opendir(DIR, $PWD) || die "Cannot Open The Directory \"$PWD\"\n";
@DirList = readdir(DIR);
closedir DIR;

foreach $DirItem (@DirList) {
    if ($DirItem =~ /^Sub/) {
        opendir(SUBDIR, "$PWD\\$DirItem")
            || die "Cannot Open The Directory \"$PWD\\$DirItem\"\n";
        @SUBDirList = readdir(SUBDIR);
        closedir SUBDIR;
        foreach $SUBDirItem (@SUBDirList) {
            if ($SUBDirItem =~ /html$/) {
                $CurrentHtmlFile = "$PWD\\$DirItem\\$SUBDirItem";
                open(READ, "<$CurrentHtmlFile")
                    || die "Couldn't Read From $CurrentHtmlFile";
                $Line  = <READ>;              # each file holds one comma-separated line
                @Lines = split(/,/, $Line);
                close(READ);
                push(@KeyWords, @Lines);
            }
        }
    }
}

foreach (@KeyWords) { ++$een{$_}; }           # tally; the keys of %een are the unique keywords
print sort keys %een;




Replies are listed 'Best First'.
Re: Code Efficiency
by PodMaster (Abbot) on Mar 25, 2004 at 11:42 UTC
    What is this for -- what is your ultimate goal? Heard of Plucene? Heard of DB_File (it would help with memory overhead and the sorting)?

    Now looking at the code...

    • One easy way to speed this up is to implement caching. If the timestamp hasn't changed, no need to rescan the entire file/directory.
    • Don't build a giant array only so you can build a giant hash afterwards; just build the giant hash ($wordlist{$word}++). See the sketch after this list.
    • Why are you keeping a @DirList? You don't appear to be doing anything with it; might think about timestamps again.
    • You can write foreach my $SUBDirItem( @SUBDirList ){ ... }
    • Are you sure you wanna die (e.g. you've reached the last two files and can't read the one before last; why not just move on to the next one)?
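    A minimal sketch of the build-the-hash and don't-die points together (the glob pattern is a hypothetical stand-in for however the file list gets built):

    use strict;
    my %een;
    foreach my $file (glob "Sub*/*.html") {
        open(my $fh, "<$file") or do {
            warn "Couldn't read from $file: $!";   # skip it, don't abort the run
            next;
        };
        my $line = <$fh>;
        close($fh);
        next unless defined $line;
        $een{$_}++ foreach split /,/, $line;       # build the giant hash directly
    }
    print join(", ", sort keys %een), "\n";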

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.

      looking up Plucene now; the dying is just for this stage, will lose it on implementation. very nice point re the timestamp, and again with the giant hash & giant array
      thanks!
Re: Code Efficiency
by Jaap (Curate) on Mar 25, 2004 at 11:31 UTC
    1. Lose the parens () around a variable declaration (just my $PWD;). Edit, re Abigail's reply below: this is just a style issue; unless you need the parens, i'd omit them.
    2. Do not use @KeyWords, just put them in the hash as you find them:
    foreach (split (/,/,$Line)) { $een{$_}++; }
    3. Declare variables as late as possible (if you want the advantages of use strict;).
    4. I assume you use @DirList and @SUBDirList because you can't 'flatten' the directory structure? There is no real speed problem here, but my instinct tells me to lose the two arrays and open the subdirs in a while loop over readdir(DIR); a sketch follows below.

    These are all minor items, none of them speed up the most time-consuming part.
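
    Item 4 as a sketch (nesting the subdirectory read inside the readdir while loop, with no intermediate arrays; error handling kept minimal):

    use strict;
    use Cwd;

    my $PWD = getcwd;
    my %een;

    opendir(DIR, $PWD) or die "Cannot open $PWD: $!";
    while (defined(my $dir = readdir DIR)) {
        next unless $dir =~ /^Sub/;
        opendir(SUBDIR, "$PWD\\$dir") or next;        # skip unreadable subdirs
        while (defined(my $file = readdir SUBDIR)) {
            next unless $file =~ /html$/;
            open(my $fh, "<$PWD\\$dir\\$file") or next;
            my $line = <$fh>;
            close($fh);
            $een{$_}++ foreach split /,/, (defined $line ? $line : "");
        }
        closedir SUBDIR;
    }
    closedir DIR;

    print join(", ", sort keys %een), "\n";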
      Lose the parens () around a variable declaration (just my $PWD;).
      Why? It's utterly useless to give "suggestions" without motivating them. It's even more useless to give suggestions that appear to be general, but which are sometimes plain wrong. There's a difference between my $var and my ($var), and while sometimes it doesn't matter whether you place the parens (like in the code being discussed), it sometimes changes the meaning of the expression. You seem to think it matters in this case. Could you explain why?
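
      (For illustration, a minimal example of where the parens do change meaning: assignment to my ($var) happens in list context, assignment to my $var in scalar context.)

      sub f { return ('alpha', 'beta') }

      my $x   = f();    # scalar context: the comma operator yields 'beta'
      my ($y) = f();    # list context: $y gets the first element, 'alpha'

      my @kw  = ('a', 'b', 'c');
      my $n   = @kw;    # scalar context: $n is 3, the element count
      my ($m) = @kw;    # list context: $m is 'a'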

      Declare variables as late as possible (if you want the advantages of use strict;).
      Huh? While declaring variables in an as restrictive scope as necessary is usually a good thing, what does that have to do with any advantage of use strict?

      Abigail

        Declare variables as late as possible (if you want the advantages of use strict;). -- Huh? While declaring variables in an as restrictive scope as necessary is usually a good thing, what does that have to do with any advantage of use strict?

        The first use of a variable is usually important for subsequent uses. If you declare the variable as late as possible, you make sure the program breaks at compile time when you remove the first use but not the others.

        my $foo; # Predeclared: bad style
        # Usually done by C-coders, who predeclare a whole bunch of variables:
        # my ($foo, $bar, $baz, $quux, $whatever, $i, $think, $is, $needed, $anywhere);
        ...
        $foo = foo();
        ...
        print $foo;
        Now, I remove the assignment:
        my $foo;
        ...
        print $foo;   # No error, just a warning at run time.
        However, if you declare $foo as late as possible:
        my $foo = foo();   # Declared when needed
        ...
        print $foo;
        Removing the assignment again:
        ...
        print $foo;   # Error at compile time!

        Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

      cool, thanks a lot, useful stuff i wouldn't have thought about, i agree that the guts are going to take ages anyway though!
      cheers

      een

      What is een? Are you one of those people who have rrays and calars?

      Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

        What is een? Are you one of those people who have rrays and calars?

        Actually, calar would be a very nice looking name (when compared to the other two) :D.
        yup ;o)
Re: Code Efficiency
by Abigail-II (Bishop) on Mar 25, 2004 at 13:10 UTC
    If I understand correctly, this can be done in the shell using less than 20 characters (yeah, shell!):
    sort -u Sub*/*html
    With the added benefit that sort knows how to sort huge lists efficiently. (This is assuming that your shell doesn't have problems with the number of arguments that result from the expansion - but even then, a shell solution is way smaller)

    But if you really want to save time, you shouldn't look at the sorting/uniquing phase. Look at the updating phase. Keep files sorted. Reduce the number of files. Reduce the number of files per directory.

    Abigail

      i WISH i was using *nix! just came to the same conclusion re: updating too. i can't change the tree/file structure, but the updating is a cinch now hopefully (cue BSOD)
      cheers!

        i WISH i was using *nix!

        Unixish utilities are available for many platforms, including MS Windows.

        Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

        Windows also has a built-in sort; run help sort from the command line for options.
Re: Code Efficiency
by Hena (Friar) on Mar 25, 2004 at 12:27 UTC
    You could lose one if structure (the test for html in the last foreach loop) by using grep on the readdir.
    # instead of
    # @SUBDirList = readdir(SUBDIR);
    @SUBDirList = grep { /html$/ } readdir(SUBDIR);
    Don't know how much speed it gives though :).
      I'll take anything that goes, thanks a lot. It's also applicable to the other regexp if (see the one-liner below); less involved there, only 11 iterations, but all the same!!
      cheers!
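
      The same grep would tidy the top-level scan too, e.g.:

      @DirList = grep { /^Sub/ } readdir(DIR);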
Re: Code Efficiency
by fourmi (Scribe) on Mar 25, 2004 at 13:16 UTC
    a bit of background. i have an image library, keywords for image \d+.jpg are stored in \d+.html, and this can be updated
    Slight paradigm shift. I'm not going to run this dynamically each time a keyword file is updated, but instead just update the keyword list (trawl through the keyword list; if seen, do nothing, else append the new keyword to the end). Keywords can only be added, not removed, which should reduce the load considerably.
    Thanks to all that commented; it will definitely help with the initial collection, and i'm a better perl programmer now too!
    cheers
Re: Code Efficiency
by MidLifeXis (Monsignor) on Mar 25, 2004 at 18:16 UTC

    If you have 10000 files in a directory, you may also be running into problems with the OS being able to handle directories that large. See my writeup on this on a different thread.

    Depending on the OS, you are better off (from a memory / OS standpoint) making your directory structure deeper, so that when the OS is opening the file, it does not need to read more than necessary. The wider / flatter the directory structure, the more resources may be necessary to get to an individual file (on open() for example).
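
    For example, a deeper layout (hypothetical, not the OP's current scheme) might bucket files by digits of the name, so each directory stays small:

    use strict;

    # e.g. 123456.html lives at Keys\12\34\123456.html
    my $file = '123456.html';
    my ($d1, $d2) = $file =~ /^(\d\d)(\d\d)/;
    print "Keys\\$d1\\$d2\\$file\n";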

    --MidLifeXis

      Hi,
      yeah, i definitely found that windows tends to get screwy once a dir has more than 30,000 files in it, hence splitting into smaller chunks; even then it's still fairly screwy. that is a VERY interesting thread you link to, thanks very much, i didn't catch it initially.
      cheers!
Re: Code Efficiency
by TilRMan (Friar) on Mar 25, 2004 at 19:49 UTC

    You say "new keywords can be added at any time." What about changing and deleting? If you only have keywords added, then you could optimize (for time) very efficiently by just detecting what new keywords show up and then inserting them into the sorted list. For instance, make a cache (copy) of the tree and compare the old vs. the new.

    The next optimization step: Keep a timestamp (and/or hash) of every file instead of a copy. Since you are only adding keywords, any time a file changes, dump its whole contents into the sorted list.
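
    A minimal sketch of the timestamp idea (load_cache and save_cache are hypothetical helpers, e.g. thin wrappers around Storable):

    use strict;

    my %mtimes = %{ load_cache() };    # hypothetical: file path => last seen mtime
    my %keywords;                      # master keyword set

    foreach my $file (glob "Sub*/*.html") {
        my $mtime = (stat $file)[9];
        next if exists $mtimes{$file} && $mtimes{$file} == $mtime;

        # New or changed file: since keywords are only ever added,
        # dump its whole contents into the set.
        open(my $fh, "<$file") or next;
        my $line = <$fh>;
        close($fh);
        $keywords{$_}++ foreach split /,/, (defined $line ? $line : "");

        $mtimes{$file} = $mtime;
    }

    save_cache(\%mtimes);              # hypothetical persist step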

      Hi
      It's a photo archive, so hopefully previous keywords won't become incorrect (carrots shouldn't disappear from the image!).
      Basically what i have done now is made a file 'UBER1' containing
      PicRefID: Keyword1, Keyword2 ...
      The keyword list is basically another file, 'UBER2', identical but without the PicRefIDs, sorted and uniq'd. If a keyword is added to a picture, it is appended to the relevant PicRefID line in 'UBER1', and, if it does not already appear in 'UBER2', inserted there in the correct position.
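
      A sketch of the UBER2 update step (assuming one keyword per line, kept sorted; a simple linear scan for clarity):

      use strict;

      # Insert $new_kw into the sorted, unique list in UBER2 unless it's already there.
      sub add_to_uber2 {
          my ($new_kw) = @_;

          open(my $in, "<UBER2") or die "UBER2: $!";
          chomp(my @kws = <$in>);
          close($in);

          return if grep { $_ eq $new_kw } @kws;    # already present

          open(my $out, ">UBER2") or die "UBER2: $!";
          print $out "$_\n" foreach sort @kws, $new_kw;
          close($out);
      }

      add_to_uber2('carrot');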

      It's all much much nicer now. All the data in the 100,000ish files is in one file, and the keyword list only needs occasional updating.

      Basically my initial plan was based on a tiny subset running well, and i wasn't expecting such massive overheads. now it only takes a minute or two, and that's more the upload of 15M of keywords than the script load (and then the upload!)
      Cheers to all that helped, much appreciated!!
      ant
