http://qs321.pair.com?node_id=433593

spurperl has asked for the wisdom of the Perl Monks concerning the following question:

Fellow monks,

My GUI application contains multiple windows. One of them displays a text file, read-only. The user can scroll the file, select portions of it, run "Find" on it - simple stuff.

The problems begin when this file is very big - hundreds of MBs. The application naturally tries to load it wholly into memory, and BOOM.

Web searches brought up surprisingly few results. Even most of the popular text editors ignore this problem and simply collapse on files that are too big.

But I know it's possible, because some editors do it, and it sounds possible in theory.

The problem with text files is that we can't seek in them freely: with fixed-size binary records you can compute exactly where record N starts, but line lengths vary, so there is no way to compute where line N starts.

I can display a single page to the user (perhaps padded by a buffer above and below) - as far as he is concerned, all the rest is virtual. But there's a problem: say the user drags the scroll bar to some far-away location in the file - line 999999, for example. How do I get there quickly?

One solution is to just read the file line by line until line 999999. This is slow.

Another solution: when the file is initially opened, I read it once and create an index table mapping line -> byte offset. Say line 225 starts at byte 1069 of the file; then I can go immediately to any desired line by seeking in the file.

There's a problem: 1 million lines => about 8 million bytes to store the index. Still quite a lot of memory. (There can also be 10 million lines, as far as I'm concerned.)

So I can keep this index in a separate binary file. When the user asks for line 999999, I go to my binary index file, quickly seek to record 999999, read the byte offset into the text file, and jump there.
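Something like this is what I have in mind - a rough sketch only (the file names and the 8-byte pack 'Q' record format are placeholders, and 'Q' needs a 64-bit Perl):

use strict;
use warnings;

# Build the index: one fixed-width 8-byte offset per line, in a side file.
open my $txt, '<', 'big.txt' or die $!;
open my $idx, '>:raw', 'big.txt.idx' or die $!;
print {$idx} pack 'Q', 0;                       # line 0 starts at offset 0
print {$idx} pack 'Q', tell $txt while <$txt>;  # offset where each following line starts
close $idx;

# Look up line 999999 (counting from 0): seek straight to record 999999 in the
# index, read its 8 bytes, then seek the text file to that offset.
open $idx, '<:raw', 'big.txt.idx' or die $!;
my $want = 999_999;
seek $idx, $want * 8, 0;
read( $idx, my $packed, 8 ) == 8 or die "line $want is not in the index";
seek $txt, unpack( 'Q', $packed ), 0;
print scalar <$txt>;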

Does this sound logical? Can you think of simpler solutions?

Thanks in advance

Replies are listed 'Best First'.
Re: Displaying/buffering huge text files
by BrowserUk (Patriarch) on Feb 23, 2005 at 08:26 UTC

    This builds an index of a million-line file in around 20 seconds, and accesses and prints 10,000 lines at random in under 1 second.

    If the 20 seconds is too long for startup, you could read just enough to populate your listbox with the first 1000 lines or so, and push the rest of the index building off into a background thread to finish.

    The indexing could be sped up by a more intelligent indexer that reads larger chunks and searches for the newlines, rather than reading one line at a time.

    The index requires just 8 MB of RAM. That could be halved by using 'N' or 'V' as your pack format rather than 'd', if your files will never go over 4GB. By using 'd' you are good for files up to around 8,500 terabytes, which should see you through the next couple of machine changes or so :)

    #! perl -slw
    use strict;

    $| = 1;

    open FILE, '<', $ARGV[ 0 ] or die $!;

    print 'Before indexing: ', time;

    my $index = pack 'd', 0;
    $index .= pack 'd', tell FILE while <FILE>;

    print 'After indexing: ', time;
    print 'Size of index: ', length $index;

    for my $i ( map{ int rand( length( $index )/8 ) } 1 .. 10_000 ) {
        my $line = unpack( 'd', substr $index, $i*8, 8 );
        seek FILE, $line, 0;
        chomp( $_ = <FILE> );
        printf "\r$line : '%s'", $_;
    }

    print "\nAfter reading 10,000 random lines: ", time;

    __END__
    P:\test>433953 data\1millionlines.dat
    Before indexing: 1109146372
    After indexing: 1109146392
    Size of index: 8000008
    1865840 : '00186585'
    After reading 10,000 random lines: 1109146393

    Examine what is said, not who speaks.
    Silence betokens consent.
    Love the truth but pardon error.
      First of all, thanks for the detailed answer. One can always expect that from you :-)

      Now, indexing may indeed be the way to go. In fact, I may need to do it in C++. I also ran the indexing myself: it took 14 seconds on a 3e6-line file (the file itself is ~60MB) using STL streams and 8 seconds using C file access (fgets(), ftell()). The memory consumption is pretty good, only 12 MB for these 3e6 lines (in a vector). Accessing random lines with fseek() is, as expected, almost immediate.

      This is getting close to acceptable, and I'm starting to believe that it is indeed possible to keep the index in memory. (I don't need to handle files > 4 GB.)

      Another problem that I forgot to mention is that the file is being updated "live" and I should keep up with it. I can get notifications, and probably just append new index entries (the file always grows, never shrinks).

        Yes, appending to the index should cause no problems at all. As you're limiting yourself to under 4 GB, using pack 'J' (Perl's native unsigned 32-bit, which saves a little conversion) rather than 'd' speeds the indexing up by around x4, giving under 5 seconds for the 1e6-line file.

        #! perl -slw
        use strict;

        $| = 1;

        open FILE, '<', $ARGV[ 0 ] or die $!;

        print 'Before indexing: ', time;

        my $index = pack 'J', 0;
        $index .= pack 'J', tell FILE while <FILE>;

        print 'After indexing: ', time;
        print 'Size of index: ', length $index;

        for my $i ( map{ int rand( length( $index )/4 ) } 1 .. 10_000 ) {
            my $line = unpack( 'J', substr $index, $i*4, 4 );
            seek FILE, $line, 0;
            chomp( $_ = <FILE> );
            printf "\r$line : '%s'", $_;
        }

        print "\nAfter reading 10,000 random lines: ", time;

        __END__
        P:\test>433953 data\1millionlines.dat
        Before indexing: 1109148435
        After indexing: 1109148440
        Size of index: 4000004
        1087640 : '00108765'
        After reading 10,000 random lines: 1109148441

        Almost quick enough that you could avoid dropping to C++ :)

        Win32 also supports memory-mapped files natively, complete with copy-on-write where applicable. I also think that tye's Win32API::File may give you access to some, if not all, of the APIs required to use them. I'm not sure that it would work out any quicker than indexing, though, and the fact that the file is being written to may cause problems.


        Examine what is said, not who speaks.
        Silence betokens consent.
        Love the truth but pardon error.
      Hi BrowserUk

      Your indexing program is very neat. I never realized that it could be so simple and effective (in Perl). Thanks!

      I tried it on some large files (50 to 500 MB, with average line length about 150 characters).

      I quickly spotted a speed problem with

      $index .= pack 'd', tell FILE while <FILE>;
      ... it has to copy all the previously packed data on every .= operation, so the time grows with the square of the number of lines. A big oh, O(x^2), to be precise.

      Here is my (almost) drop-in replacement which trades memory space for indexing time

      my @index = ( pack 'd', 0 );
      push @index, pack 'd', tell FILE while <FILE>;
      pop @index;
      my $index = join '', @index;
      ... and the timing, which shows roughly O(x) behaviour for mine and O(x^2) for yours (you can see the parabola in the table, if you look at it sideways).

      In the last test case (the 527 MB file) with my script version the process memory usage peaked at +270 MB for a final index size of 27.5 MB.

      Indexing mine : time   1 s, size  19949431, lines  136126
      Indexing mine : time   2 s, size  40308893, lines  258457
      Indexing mine : time   5 s, size  95227350, lines  634392
      Indexing mine : time  29 s, size 527423877, lines 3441911
      Indexing yours: time   2 s, size  19949431, lines  136127
      Indexing yours: time   6 s, size  40308893, lines  258458
      Indexing yours: time  31 s, size  95227350, lines  634393
      Indexing yours: time 809 s, size 527423877, lines 3441912
      I also added the pop @index; to get rid of the last index entry - it points to the end of the file, after the last line.

      Rudif

      Hey! I was looking for a way to handle big files and this looks awesome. I already tried it and it runs very nicely. However, I am kind of a newbie in Perl. I was trying to modify it to get sequential access instead of random access, but I haven't had much luck. I tried removing the int rand( length( $index )/8 ) and just iterating over the length of $index, but it didn't work - the seek doesn't move the pointer in the file. Could someone give me a hint? Thank you in advance!
        I was trying to modify it to get sequential access instead of random access.

        Sorry, but that makes no sense at all. You have to read the entire file sequentially once in order to build an index.

        If you want to process the file sequentially, just read it sequentially. There is nothing to be gained by using an index unless you need random access.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        The start of some sanity?

        Use N instead of the non-portable J.
        my $index = pack 'N', 0;
        $index .= pack 'N', tell FILE while <FILE>;

        for ( my $line_idx = $s; $line_idx < $fileLastLine; $line_idx++ ) {
            my $line = unpack( 'N', substr( $index, $line_idx*4, 4 ) );
            seek FILE, $line, 0;
            chomp( $_ = <FILE> );
            my $line_str = $_;
        }
Re: Displaying/buffering huge text files
by ctilmes (Vicar) on Feb 23, 2005 at 12:20 UTC
    Instead of indexing every single line, you could index a fraction of them (say every 10th or every 25th) and just skip a few lines after seeking.

    If they want to see line 57, seek to line 50 using your index, and skip 7 lines after reading.
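
    A rough sketch of how that could look in Perl - the step size, file name and 0-based line numbering are my own choices here:

    use strict;
    use warnings;

    my $STEP = 10;                          # index every 10th line
    open my $fh, '<', 'huge.txt' or die $!;

    my @offset = (0);                       # byte offsets of lines 0, 10, 20, ...
    my $count = 0;
    while (<$fh>) {
        push @offset, tell $fh if ++$count % $STEP == 0;
    }

    # Line numbers are 0-based: to show line 57, seek to indexed line 50 and
    # read 8 lines, keeping the last one.
    sub get_line {
        my ($want) = @_;
        seek $fh, $offset[ int( $want / $STEP ) ], 0;
        my $text;
        $text = <$fh> for 0 .. $want % $STEP;
        chomp $text if defined $text;
        return $text;
    }

    The index is then only a tenth (or a twenty-fifth) the size of a full line index.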

      This is an excellent idea - I'll definitely keep it in mind. For now, it looks like full indexing is good enough, but I may stumble upon corner cases...
Re: Displaying/buffering huge text files
by graff (Chancellor) on Feb 23, 2005 at 06:33 UTC
    I haven't tried it myself yet (at least, not in Perl), but this sounds like a good case for using Mmap. The idea would be that you set yourself a comfortable buffer size (a few meg, or whatever makes sense for the GUI), and you map this amount of the big file into the scalar that gets put into the Text widget.

    When the user scrolls the widget to a point outside the current mapping window, remap the scalar to the appropriate offset in the file. Assuming that it actually works, it ought to be pretty fast, and could probably be built into a callback function from the scroller widget.

    (Unless of course you happen to be using a system that doesn't support mmap, in which case this is a useless idea.)
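
    For what it's worth, here is a minimal sketch of that windowing scheme using Sys::Mmap (one CPAN module that exposes mmap()). The window size, page size, file name and error handling are assumptions, and real code would clamp the offset and length to the file size:

    use strict;
    use warnings;
    use Sys::Mmap;                       # mmap(), munmap(), PROT_*, MAP_* constants

    my $WINDOW = 4 * 1024 * 1024;        # 4 MB view into the file
    my $PAGE   = 4096;                   # mmap offsets must be page-aligned

    open my $fh, '<', 'huge.txt' or die $!;

    my $view;
    mmap( $view, $WINDOW, PROT_READ, MAP_SHARED, $fh, 0 ) or die "mmap: $!";
    # ... put $view (or the relevant slice of it) into the Text widget ...

    # Scroll callback: remap the window around a new byte position.
    sub remap_to {
        my ($pos) = @_;
        my $offset = int( $pos / $PAGE ) * $PAGE;    # round down to a page boundary
        munmap( $view );
        mmap( $view, $WINDOW, PROT_READ, MAP_SHARED, $fh, $offset )
            or die "mmap: $!";
    }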

      I have a sinking feeling that mmap is not supported on Windows... I will check it more thoroughly, though, thanks.
        I realize this is probably too late to be of use for your text-viewing task (which is presumably solved by now), but for future reference, you might want to look up a Windows port of the UNIX environment and tools provided by AT&T Research: http://www.research.att.com/sw/tools/uwin/.

        The ATT "UNIX for Windows" package is called "UWIN" (how clever?), it's free/open-source, and I noticed that the description of features includes:

        Memory mapping and shared memory:
        Both mmap() and the system V shared memory facilities are provided.
Re: Displaying/buffering huge text files
by crenz (Priest) on Feb 23, 2005 at 14:51 UTC

    This is an excellent question with excellent answers. May I suggest you put the whole thing into a module and on CPAN? It might be useful for many others as well.

    I like the idea of only indexing every 10th or 25th line, then skipping on read. Most OSes will read a whole block at a time anyway, so for most files you will be reading a lot of lines from the hard disk at once in any case - you might as well make use of them. Of course, if it's in a module, the skipping could even be handled transparently (and customized by setting a parameter), and the user could just do a $file->GetLine(100_000) without worrying about what's going on.

    One more idea: you could read and index only $n lines initially, then provide a callback routine that can be called regularly to read and index $m more lines, until the file is fully indexed. This way, a text editor can display the first few lines very quickly, then continue indexing in the background by calling your callback routine in a separate thread or in the main thread's GUI loop.
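
    That callback could be as small as a closure over the filehandle and the index. A sketch, with the chunk size, names, and the GUI scheduling calls in the comments left as placeholders:

    use strict;
    use warnings;

    # Returns a sub that indexes up to $chunk more lines per call and returns
    # true while there is still work left - suitable for an idle/timer callback.
    sub make_indexer {
        my ( $fh, $index, $chunk ) = @_;
        push @$index, 0 unless @$index;          # line 0 starts at offset 0
        return sub {
            for ( 1 .. $chunk ) {
                defined( my $line = <$fh> ) or return 0;   # indexing finished
                push @$index, tell $fh;
            }
            return 1;                                      # call me again
        };
    }

    # open my $fh, '<', 'huge.txt' or die $!;
    # my @index;
    # my $more = make_indexer( $fh, \@index, 10_000 );
    # schedule_idle_callback( sub { $more->() or cancel_idle_callback() } );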

      In fact, you can stat the file to get the system's preferred block size, and then make your index a mapping from line range to block number: seek to block * block_size, and then skip lines from there.

      Going through even 500 lines is nearly instantaneous - probably less than a screen redraw - and I guess that's about as much as you can expect to fit in a 4k block, which is pretty much the standard.

      The advantage is that, if you're careful about off-by-one errors, you can probably minimize disk access so cleanly that after several seeks the relevant items will all be in cached pages, and subsequent reads will be cheap.

      As for implementation, Event has good IO handling, idle callbacks, and Tk integration - it could be used to index the file incrementally, without threading, if that's scary.

      Update: I suddenly remembered a snippet in some perlfaq that uses tr to count the number of lines in a file efficiently. It reads in 4k increments. This should make indexing very quick... keep a running sum, increment it for every block, and record the intermediate results somewhere.
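
      For example, a sketch of that block counting (chunk size and names assumed):

      use strict;
      use warnings;

      my $CHUNK = 4096;
      open my $fh, '<:raw', 'huge.txt' or die $!;

      my @lines_before = (0);        # lines completed before block 0, 1, 2, ...
      my $total = 0;
      while ( read $fh, my $buf, $CHUNK ) {
          $total += ( $buf =~ tr/\n// );
          push @lines_before, $total;
      }

      # To reach line $n (0-based): take the last block $b whose count is still
      # below $n (block 0 for line 0), seek to $b * $CHUNK, then skip
      # $n - $lines_before[$b] newlines - the next byte starts line $n.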

      -nuffin
      zz zZ Z Z #!perl
Re: Displaying/buffering huge text files
by spurperl (Priest) on Feb 24, 2005 at 06:35 UTC
    The idea of storing only every n-th offset in the index table, which a couple of monks brought up, works even better than expected. I wasn't sure about it at first, but since my reading of the file is buffered, it imposes no performance cost.

    My approach now is: I have a BUFFER_BLOCK, currently 1000 lines long. I store every BUFFER_BLOCKth line in the index - that is, the 0th line, the 1000th line, etc. When the class is asked for a line which is not in the buffer, it rounds the line down to the nearest multiple of BUFFER_BLOCK (i.e. for line 7781 it goes to 7000), grabs another BUFFER_BLOCK lines below and another above (that is, lines 6000-9000 for line 7781) and returns the desired line.
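
    In Perl, the lookup logic looks roughly like this - a sketch of the scheme just described, not the actual (C++) class; names and edge handling are simplified:

    use strict;
    use warnings;

    my $BUFFER_BLOCK = 1000;
    open my $fh, '<', 'huge.txt' or die $!;

    # Sparse index: byte offset of every BUFFER_BLOCKth line (0, 1000, 2000, ...).
    my @block_offset = (0);
    my $count = 0;
    while (<$fh>) {
        push @block_offset, tell $fh if ++$count % $BUFFER_BLOCK == 0;
    }

    my ( $buf_first, @buffer );          # first line number held, and the lines

    sub get_line {
        my ($want) = @_;
        unless ( defined $buf_first
            && $want >= $buf_first
            && $want < $buf_first + @buffer )
        {
            # Round down to the nearest block, then load the block below and
            # the block above as well (lines 6000..8999 for line 7781).
            my $base  = int( $want / $BUFFER_BLOCK ) * $BUFFER_BLOCK;
            my $first = $base - $BUFFER_BLOCK;
            $first = 0 if $first < 0;
            seek $fh, $block_offset[ $first / $BUFFER_BLOCK ], 0;
            @buffer = ();
            for ( 1 .. 3 * $BUFFER_BLOCK ) {
                defined( my $l = <$fh> ) or last;
                chomp $l;
                push @buffer, $l;
            }
            $buf_first = $first;
        }
        return $buffer[ $want - $buf_first ];
    }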

    This works like magic, and is blazingly fast. I'm experimenting with a 100MB file now (~6 million lines). Reading and indexing it (I'm doing it in C++ now, and the smaller number of push_backs into the vector gives gains) takes under 2 seconds! Afterwards, accessing lines that are not in the buffer takes ~70 ms (lines already in the buffer are immediate, of course).

    Memory consumption: the index table takes 4/BUFFER_BLOCK bytes per line, i.e. 4 bytes for every 1000 lines. For the gigantic file I'm testing, that is only 24 KB.

    The buffer itself is 3000 lines at about 30 chars per line on average - only 90 KB or so.

    So, the class "mirrors" a 100 MB file, taking only about 120 KB of memory and working blazingly fast.

    Thanks for all the good and interesting answers, monks. I wonder, though, if Perl can match C++'s speed here. Indexing a 100 MB file in 1.7 seconds is quite impressive.

Re: Displaying/buffering huge text files
by sfink (Deacon) on Feb 24, 2005 at 05:40 UTC
    If you're sure that indexing the start offset of every line will fit in memory, then go for it. You can handle a lot of data that way -- but just be sure you don't hit the nasty cases, such as a file containing only newlines.

    For that case, and the general case of even larger files, consider a variant of the index-every-nth line idea: index based on disk block (or more likely, some multiple of the disk block). Say you use a block size of 8KB. Then keep the line number of the first complete line starting within each block. When seeking to a given line number, you do a binary search in your index to find the block number that contains the largest line number less than or equal to the line you're looking for. Then you read the block in, scanning linearly through the text for the line you want.
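
    A sketch of that structure in Perl (one assumption of mine: each index entry also remembers the byte offset of its first line, which keeps the lookup simple; block size and names are placeholders):

    use strict;
    use warnings;

    my $BLOCK = 8 * 1024;        # index granularity in bytes

    # Each entry is [ line_number, byte_offset ] for the first line that starts
    # at or after a block boundary.
    sub build_block_index {
        my ($fh) = @_;
        my @index = ( [ 0, 0 ] );
        my ( $line, $next ) = ( 0, $BLOCK );
        while (<$fh>) {
            $line++;
            my $pos = tell $fh;                  # where line $line starts
            if ( $pos >= $next ) {
                push @index, [ $line, $pos ];
                $next = ( int( $pos / $BLOCK ) + 1 ) * $BLOCK;
            }
        }
        return \@index;
    }

    # Binary search for the last entry at or before the wanted line, seek there,
    # then read forward line by line.
    sub fetch_line {
        my ( $fh, $index, $want ) = @_;
        my ( $lo, $hi ) = ( 0, $#$index );
        while ( $lo < $hi ) {
            my $mid = int( ( $lo + $hi + 1 ) / 2 );
            $index->[$mid][0] <= $want ? ( $lo = $mid ) : ( $hi = $mid - 1 );
        }
        my ( $line, $offset ) = @{ $index->[$lo] };
        seek $fh, $offset, 0;
        my $text;
        for ( $line .. $want ) {
            defined( $text = <$fh> ) or return undef;
        }
        chomp $text;
        return $text;
    }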

    This approach deals with the problematic cases more gracefully -- if you have a huge number of newlines, you'll still only read the block containing the line you want. (Well, you might have to read the following block too, to get the whole line.) Or, if you have enormous single lines, you'll never have the problem of your index giving you a starting position way before the line you want, as might happen if you were indexing every 25th line.

    Generally speaking, your worst-case performance is defined in terms of the time to process some number of bytes, not lines, so you'll be better off if your index thinks in terms of bytes.

    All that being said, I would guess that this would be overkill for your particular application, and you'd be better off with an offset-per-line index. It's simpler and good enough. And if that gets too big, you can always store the index in a gdbm file.

      sfink: You can handle a lot of data that way -- but just be sure you don't hit the nasty cases, such as a file containing only newlines.

      A file containing *only* newlines is not a nasty case and requires no indexing at all. If you can figure out before you read it that the file contains only newlines, you change your "figure out the seek offset" subroutine so that to get to line 120000 it seeks to byte 120000. It doesn't get any easier!

Re: Displaying/buffering huge text files
by bibliophile (Prior) on Feb 23, 2005 at 15:02 UTC
    If you *do* turn this into a CPAN module, can you update this thread? I know *I'd* love to play with this :-)
Re: Displaying/buffering huge text files
by scooper (Novice) on Feb 23, 2005 at 23:35 UTC
    Extremely large text files do not give up useful information readily. Even if the tools used to index them or jump to a particular location are fast, it's still a chore for users to get information out of them. Databases are much better at storing and using huge amounts of data. Nobody looks at a huge text file and thinks "I'd like to jump to line 150,000 at this point"; they think "what a useless large text file - how on earth am I going to find the information I want here?"

    Enforcing a fixed limit on the size of text file that your GUI application can handle would not be unreasonable. Handling huge files with reduced functionality is another option - for example, the Linux "less" pager gives you the option to abort the line-number calculation when it reads in huge files, so you get the file up faster at the cost of losing the "go to line 235,000" functionality.
Re: Displaying/buffering huge text files
by JavaFan (Canon) on Dec 23, 2011 at 03:04 UTC
    8 million bytes? Surely on modern computers that isn't a lot, is it? It may be a lot on a small embedded system, but I doubt you need to interactively edit files hundreds of MB large there. And 8 bytes for each offset? Do you expect files that large? Your file has to exceed 64 PB before it needs 8-byte offsets. (1-byte offsets: up to 256 bytes in a file; 2 bytes: 64 KB; 3 bytes: 16 MB; 4 bytes: 4 GB; 5 bytes: 1 TB; 6 bytes: 256 TB; 7 bytes: 64 PB; 8 bytes: 16 EB.)

    Anyway, if storing all the offsets in an index makes the index too large, why not store the offset of every 100th line? That makes your index 99% smaller. You do have to read up to 100 lines if your user jumps to a certain line, but unless you're writing a line editor, you want to read multiple lines on jumps anyway.

      And 8 bytes for each offset? ... 5 bytes: 1 TB;

      I, for one, would like to see your code for building a 5-byte index?


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      The start of some sanity?

        I, for one, would like to see your code for building a 5-byte index?
        Storing a million 5-byte integers using 5 million bytes for data, and 44 bytes additional overhead:
        my $buffer = ""; my $BYTES = 5; my $BITS_IN_BYTE = 8; my $FULL_BYTE = (1 << $BITS_IN_BYTE) - 1; sub store { my ($pos, $value) = @_; for (0 .. $BYTES - 1) { vec($buffer, $pos * $BYTES + $_, $BITS_IN_BYTE) = ($value >> ($BITS_IN_BYTE * ($BYTES - 1 - $_))) & $FULL_BYT +E; } } sub fetch { my ($pos) = @_; my $sum = 0; for (0 .. $BYTES - 1) { $sum <<= $BITS_IN_BYTE; $sum += vec($buffer, $pos * $BYTES + $_, $BITS_IN_BYTE); } $sum; } # # Testing # my $TEST_SIZE = 1_000_000; my @offsets = map {int rand 1 << ($BYTES * $BITS_IN_BYTE)} 1 .. $TEST_ +SIZE; for (my $i = 0; $i < @offsets; $i++) { store $i, $offsets[$i]; } for (my $i = 0; $i < @offsets; $i++) { my $val = fetch $i; die unless $val == $offsets[$i]; } use Devel::Size 'size'; use 5.010; say "Index size: ", size $buffer; __END__ Index size: 5000044