
Hi BrowserUk

Your indexing program is very neat. I never realized that it could be so simple and effective (in Perl). Thanks!

I tried it on some large files (50 to 500 MB, with average line length about 150 characters).

I quickly spotted a speed problem with

$index .= pack 'd', tell FILE while <FILE>;

... it has to copy all previously packed data in every .= operation, so the time grows with the square of the number of lines: O(x^2), to be precise.

Here is my (almost) drop-in replacement, which trades memory space for indexing time:

my @index = ( pack 'd', 0 );
push @index, pack 'd', tell FILE while <FILE>;
pop @index;
my $index = join '', @index;
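For anyone who wants to try the idea outside the original script, here is a minimal, self-contained sketch of the same array-then-join approach, plus a lookup that uses the packed index to jump to a given line. The names build_index and read_line are hypothetical helpers of mine, and I use a lexical filehandle instead of the bareword FILE; neither comes from BrowserUk's program.

```perl
use strict;
use warnings;

# Width in bytes of one packed double (8 on common platforms).
my $ENTRY = length( pack 'd', 0 );

# Build a packed index holding the start offset of every line.
# Offsets are collected in an array and joined once at the end.
sub build_index {
    my ($fh) = @_;
    my @index = ( pack 'd', 0 );            # the first line starts at offset 0
    while ( defined( my $line = <$fh> ) ) {
        push @index, pack 'd', tell($fh);   # offset where the next line starts
    }
    pop @index;                             # last entry points at end of file
    return join '', @index;
}

# Return line number $n (0-based) by seeking to its recorded offset.
sub read_line {
    my ( $fh, $index, $n ) = @_;
    my $offset = unpack 'd', substr( $index, $n * $ENTRY, $ENTRY );
    seek( $fh, $offset, 0 ) or die "seek failed: $!";
    return scalar <$fh>;
}
```

With a handle opened for reading, build_index($fh) returns the packed string, and read_line($fh, $index, 41) would return the 42nd line, trailing newline included.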

... and the timing, which shows roughly O(x) time for mine and O(x^2) for yours (you can see the parabola in the table, if you look at it sideways).

In the last test case (the 527 MB file), with my version of the script, the process memory usage peaked at +270 MB for a final index size of 27.5 MB.

Indexing mine : time   1 s, size   19949431, lines  136126
Indexing mine : time   2 s, size   40308893, lines  258457
Indexing mine : time   5 s, size   95227350, lines  634392
Indexing mine : time  29 s, size  527423877, lines 3441911

Indexing yours: time   2 s, size   19949431, lines  136127
Indexing yours: time   6 s, size   40308893, lines  258458
Indexing yours: time  31 s, size   95227350, lines  634393
Indexing yours: time 809 s, size  527423877, lines 3441912
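As a rough cross-check of the table, the sketch below estimates the exponent k in "time ~ lines**k" from the smallest and largest test case of each version, and recomputes the final index size. The helper names growth_exponent and index_bytes are hypothetical, and with timings rounded to whole seconds the result is only indicative.

```perl
use strict;
use warnings;

# Estimate k in time ~ lines**k from two [lines, seconds] measurements.
sub growth_exponent {
    my ( $first, $last ) = @_;
    return log( $last->[1] / $first->[1] ) / log( $last->[0] / $first->[0] );
}

# Size of a packed index: one double per indexed line.
sub index_bytes {
    my ($lines) = @_;
    return $lines * length( pack 'd', 0 );
}

# Smallest and largest rows from the timing table above.
printf "mine : k = %.2f\n", growth_exponent( [ 136126, 1 ], [ 3441911, 29 ] );
printf "yours: k = %.2f\n", growth_exponent( [ 136127, 2 ], [ 3441912, 809 ] );
printf "index: %d bytes\n", index_bytes(3441911);
```

The exponents come out close to 1 for the array version and a little under 2 for the .= version, and the index size (3441911 lines times 8 bytes per double, assuming 8-byte doubles) agrees with the 27.5 MB quoted above.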

I also added pop @index; to get rid of the last index entry, which points to the end of the file, after the last line.

Rudif

In reply to Re^2: Displaying/buffering huge text files by Rudif
in thread Displaying/buffering huge text files by spurperl
