http://qs321.pair.com?node_id=433614


in reply to Displaying/buffering huge text files

This builds an index of a million-line file in around 20 seconds, and accesses and prints 10,000 lines at random in under 1 second.

If the 20 seconds is too long for startup, then you could always read enough to allow you to populate your listbox, and then push the rest of the index building off into a background thread to finish, while you populate the listbox with the first 1000 lines or so.
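For illustration, a minimal sketch of that split using ithreads and a shared array ($filename and the listbox wiring are placeholders; the real index below is a packed string, but a plain shared array of offsets keeps the sketch simple):

use threads;
use threads::shared;

# offsets land in a shared array so the GUI thread can read them
# while the background thread is still appending
my @index :shared = ( 0 );

my $indexer = async {
    open my $fh, '<', $filename or die $!;
    push @index, tell $fh while <$fh>;
};

# ... populate the listbox from the first entries of @index here ...

$indexer->join;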

The indexing could be sped up by a more intelligent indexer that reads larger chunks and searches them for the newlines, rather than reading one line at a time. A minimal sketch of that chunked approach follows (the 64 KB buffer size and ':raw' layer are my choices, not benchmarked here):
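open my $fh, '<:raw', $ARGV[ 0 ] or die $!;

my $index = pack 'd', 0;
my $base  = 0;
my $chunk;

while( my $read = read $fh, $chunk, 64 * 1024 ) {
    my $from = 0;
    # a newline at offset $base + $nl means the next line starts 1 byte later
    while( ( my $nl = index $chunk, "\n", $from ) >= 0 ) {
        $index .= pack 'd', $base + $nl + 1;
        $from = $nl + 1;
    }
    $base += $read;
}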

The index requires just 8 MB of RAM. That could be halved by using 'N' or 'V' as your pack format rather than 'd', if your files will never go over 4 GB. By using 'd' you are good for files up to around 8,500 terabytes, which should see you through the next couple of machine changes or so :)
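Each 'd' entry costs 8 bytes where an 'N' or 'V' entry costs 4, as a quick check shows:

perl -le "print length pack 'd', 0"   # 8
perl -le "print length pack 'N', 0"   # 4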

#! perl -slw
use strict;

$| = 1;

open FILE, '<', $ARGV[ 0 ] or die $!;

print 'Before indexing: ', time;
my $index = pack 'd', 0;
$index .= pack 'd', tell FILE while <FILE>;
print 'After indexing: ', time;
print 'Size of index: ', length $index;

for my $i ( map{ int rand( length( $index )/8 ) } 1 .. 10_000 ) {
    my $line = unpack( 'd', substr $index, $i*8, 8 );
    seek FILE, $line, 0;
    chomp( $_ = <FILE> );
    printf "\r$line : '%s'", $_;
}

print "\nAfter reading 10,000 random lines: ", time;

__END__
P:\test>433953 data\1millionlines.dat
Before indexing: 1109146372
After indexing: 1109146392
Size of index: 8000008
1865840 : '00186585'
After reading 10,000 random lines: 1109146393

Replies are listed 'Best First'.
Re^2: Displaying/buffering huge text files
by spurperl (Priest) on Feb 23, 2005 at 08:35 UTC
    First of all, thanks for the detailed answer. One can always expect that from you :-)

    Now, indexing may indeed be my way to go. In fact, I may need to do it in C++. I also ran indexing: it took 14 seconds on a 3e6-line file (the file itself is ~60 MB) using STL streams, and 8 seconds using C file access (fgets(), ftell()). The memory consumption is pretty good, only 12 MB for these 3e6 lines (in a vector). Accessing random lines with fseek() is, as expected, almost immediate.

    This closes in on acceptable, and I'm starting to believe that it is indeed possible to keep it in memory. (I don't need to handle files > 4 GB).

    Another problem that I forgot to mention is that the file is being updated "live" and I should keep up with it. I can get notifications, and can probably just append new index entries (the file always grows, never shrinks).
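    For illustration, a minimal sketch of that incremental append against the 'd'-packed $index from above (the sub name, and the assumption that the writer appends only whole lines, are mine):

    # re-scan from the last indexed offset; seek also clears the
    # filehandle's EOF flag, so later reads pick up appended lines
    sub update_index {
        my( $fh, $index_ref ) = @_;
        seek $fh, unpack( 'd', substr( $$index_ref, -8 ) ), 0;
        $$index_ref .= pack 'd', tell $fh while <$fh>;
    }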

      Yes, appending to the index should cause no problems at all. As you're limiting yourself to under 4 GB, using pack 'J' (Perl's native unsigned 32-bit, which saves a little conversion) rather than 'd' effects a speedup of the indexing of around 4x, giving under 5 seconds for the 1e6-line file.

      #! perl -slw
      use strict;

      $| = 1;

      open FILE, '<', $ARGV[ 0 ] or die $!;

      print 'Before indexing: ', time;
      my $index = pack 'J', 0;
      $index .= pack 'J', tell FILE while <FILE>;
      print 'After indexing: ', time;
      print 'Size of index: ', length $index;

      for my $i ( map{ int rand( length( $index )/4 ) } 1 .. 10_000 ) {
          my $line = unpack( 'J', substr $index, $i*4, 4 );
          seek FILE, $line, 0;
          chomp( $_ = <FILE> );
          printf "\r$line : '%s'", $_;
      }

      print "\nAfter reading 10,000 random lines: ", time;

      __END__
      P:\test>433953 data\1millionlines.dat
      Before indexing: 1109148435
      After indexing: 1109148440
      Size of index: 4000004
      1087640 : '00108765'
      After reading 10,000 random lines: 1109148441

      Almost quick enough that you could avoid dropping to C++ :)

      Win32 also supports memory-mapped files natively, complete with copy-on-write where applicable. I also think that tye's Win32API::File may give you access to some, if not all, of the APIs required to use them. I'm not sure that it would work out any quicker than indexing though, and the fact that the file is being written to may cause problems.


        You could look at Win32::MMF, which claims to provide native Memory Mapped File Service for shared memory support under Windows.


        acid06
        perl -e "print pack('h*', 16369646), scalar reverse $="
Re^2: Displaying/buffering huge text files
by Rudif (Hermit) on Mar 28, 2005 at 19:39 UTC
    Hi BrowserUk

    Your indexing program is very neat. I never realized that it could be so simple and effective (in Perl). Thanks!

    I tried it on some large files (50 to 500 MB, with average line length about 150 characters).

    I quickly spotted a speed problem with

    $index .= pack 'd', tell FILE while <FILE>;
    ... it has to copy all the previously packed data in every .= operation, so the time grows with the square of the number of lines. A big oh, O(n^2) to be precise.

    Here is my (almost) drop-in replacement which trades memory space for indexing time

    my @index = ( pack 'd', 0 );
    push @index, pack 'd', tell FILE while <FILE>;
    pop @index;
    my $index = join '', @index;
    ... and the timing, which shows roughly O(n) for mine and O(n^2) for yours (you can see the parabola in the table, if you look at it sideways).

    In the last test case (the 527 MB file) with my script version the process memory usage peaked at +270 MB for a final index size of 27.5 MB.

    Indexing mine : time   1 s, size  19949431, lines  136126
    Indexing mine : time   2 s, size  40308893, lines  258457
    Indexing mine : time   5 s, size  95227350, lines  634392
    Indexing mine : time  29 s, size 527423877, lines 3441911
    Indexing yours: time   2 s, size  19949431, lines  136127
    Indexing yours: time   6 s, size  40308893, lines  258458
    Indexing yours: time  31 s, size  95227350, lines  634393
    Indexing yours: time 809 s, size 527423877, lines 3441912
    I also added pop @index; to get rid of the last index entry - it points to the end of the file, after the last line.

    Rudif

Re^2: Displaying/buffering huge text files
by Anonymous Monk on Dec 22, 2011 at 17:28 UTC
    Hey! I was looking for a way to handle big files and this looks awesome. I already tried it and it runs great. However, I am kind of a newbie at Perl. I was trying to modify it to get sequential access, not random, but I haven't had good luck with this. I just tried to remove the int rand( length( $index )/8 ) and use the length of $index instead. It didn't work; the seek doesn't move the pointer in the file. Could someone give me a hint? Thank you in advance!
      I was trying to modify it to get sequential access, not random.

      Sorry, but that makes no sense at all. You have to read the entire file sequentially once in order to build an index.

      If you want to process the file sequentially, just read it sequentially. There is nothing to be gained by using an index unless you need random access.
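      For comparison, plain sequential processing needs no index at all ($filename here is a placeholder):

      open my $fh, '<', $filename or die $!;
      while( my $line = <$fh> ) {
          chomp $line;
          # process $line here
      }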



      Use N instead of the non-portable J.
      my $index = pack 'N', 0;
      $index .= pack 'N', tell FILE while <FILE>;

      for( my $line_idx = $s; $line_idx < $fileLastLine; $line_idx++ ) {
          my $line = unpack( 'N', substr( $index, $line_idx * 4, 4 ) );
          seek FILE, $line, 0;
          chomp( $_ = <FILE> );
          my $line_str = $_;    # process $line_str here
      }
        Use N instead of the non-portable J.

        'J' may be non-portable, but why oh why would you build an index on one machine and then export it to another? (Hint: you wouldn't, unless you are naive!)

        The reason for using 'J' in preference to 'N' is that it is 'native'. That is, on whatever machine architecture you run it, it uses the format that is natural, and therefore most efficient, for that machine.
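        The difference is easy to demonstrate; 'N' is always big-endian 32-bit, while 'J' is whatever the local perl's native unsigned integer happens to be (second output shown for a little-endian 64-bit build; yours may differ):

        perl -le "print unpack 'H*', pack 'N', 1"   # 00000001 on every machine
        perl -le "print unpack 'H*', pack 'J', 1"   # 0100000000000000 here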



        In case it isn't clear, $s is for start, the line to start on. For $fileLastLine I used:
        my $fileLastLine = `wc -l ${filename}`;
        chomp $fileLastLine;
        $fileLastLine = int( $fileLastLine );
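        As an aside, the index itself already knows the line count, so the shell-out could be avoided; a sketch, assuming the trailing end-of-file entry is still present in the 'N' index above:

        my $fileLastLine = length( $index ) / 4 - 1;   # entries minus the trailing one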