Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
Hi BrowserUk

Your indexing program is very neat. I never realized that it can be so simple and effective (in perl). Thanks!

I tried it on some large files (50 to 500 MB, with average line length about 150 characters).

I quickly spotted speed a problem with

$index .= pack 'd', tell FILE while <FILE>;
... it has to copy all previously packed data in every .= operation, so that the time grows with the square of number of lines. A big oh, O(x^2) to be precise.

Here is my (almost) drop-in replacement which trades memory space for indexing time

my @index = ( pack 'd', 0 ); push @index, pack 'd', tell FILE while <FILE>; pop @index; my $index = join '', @index;
... and the timing that shows roughly O(x) times for mine, and O(x^2) for yours (you can see the parabola in the table, if you look at it sideways).

In the last test case (the 527 MB file) with my script version the process memory usage peaked at +270 MB for a final index size of 27.5 MB.

Indexing mine : time 1 s, size 19949431, lines 136126 Indexing mine : time 2 s, size 40308893, lines 258457 Indexing mine : time 5 s, size 95227350, lines 634392 Indexing mine : time 29 s, size 527423877, lines 3441911 Indexing yours: time 2 s, size 19949431, lines 136127 Indexing yours: time 6 s, size 40308893, lines 258458 Indexing yours: time 31 s, size 95227350, lines 634393 Indexing yours: time 809 s, size 527423877, lines 3441912
I also added pop @index;, to get rid of the last index - it points to the end of the file, after the last line.

Rudif


In reply to Re^2: Displaying/buffering huge text files by Rudif
in thread Displaying/buffering huge text files by spurperl

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others sharing their wisdom with the Monastery: (2)
As of 2024-04-20 06:00 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found