http://qs321.pair.com?node_id=548913

The topic seems to come up almost weekly: "How do I jump directly to the 300th line of a file without scanning through the file line by line?" I assume this question comes to us again and again because the people asking it perceive that it's a little inefficient to scan through a huge file, line by line, just to find the Nth line. ...they may be right, if they're performing this task often on the same file, and the file is big enough.

Of course the problem here is one of logic: how do you find the Nth line of a file, where lines can be of unknown length, without first skimming through the file to find each line ending? The answer is, you can't. I've previously suggested this is not a problem specific to Perl; it's not a problem specific to computers, either. My previous object lesson has been to point out that it's impossible to find the Nth occurrence of some phrase or word in a book without opening the book and counting your way through it. Files on computers aren't all that different in this regard.

One solution, if a person happens to need to skim through a long file frequently, is to build an index file that keeps track of each record's starting offset so that any record can be found quickly later. The following code creates a module, File::Index, and puts it through its paces as a proof of concept: building an index file to keep track of the lines in a data file.

package File::Index;

use strict;
use warnings;
use Carp;
use Readonly;
use Config;

# 'L!' packs a native unsigned long, whose size matches $Config{longsize}.
Readonly my $CONFIG_LONG_SIZE => $Config{longsize};

sub new {
    my( $class, $data_filename, $options ) = @_;
    my $self = {};
    bless $self, $class;
    $options = defined( $options ) ? $options : {};
    $self->{data_filename} = $data_filename;
    $self->{idx_filename}  = exists( $options->{idx_file} )
        ? $options->{idx_file}
        : $data_filename . '.idx';
    $self->{record_sep}    = exists( $options->{record_sep} )
        ? $options->{record_sep}
        : "\n";
    $self->{data_binmode}  = exists( $options->{data_binmode} )
        ? $options->{data_binmode}
        : 0;
    if( not -f $self->{idx_filename} ) {
        $self->create_index_file();
    }
    return $self;
}

sub create_index_file {
    my( $self ) = @_;
    open my $datafile_handle, '<', $self->{data_filename} or croak $!;
    if( $self->{data_binmode} ) {
        binmode $datafile_handle;
    }
    open my $index_handle, '>', $self->{idx_filename} or croak $!;
    local $/ = $self->{record_sep};
    binmode( $index_handle );
    print $index_handle pack( 'L!', 0 ) or croak $!;
    while( <$datafile_handle> ) {
        my $entry = pack 'L!', tell( $datafile_handle );
        print $index_handle $entry or croak $!;
    }
    close $datafile_handle;
    close $index_handle or croak $!;
}

sub get_index {
    my( $self, $item_no ) = @_;
    open my $index_handle, '<', $self->{idx_filename} or croak $!;
    binmode( $index_handle );
    local $/ = \$CONFIG_LONG_SIZE;    # Read fixed-length records.
    seek( $index_handle, $item_no * $CONFIG_LONG_SIZE, 0 ) or croak $!;
    defined( my $packed_index = <$index_handle> )
        or croak "No record for item number $item_no\n";
    my $unpacked_index = unpack( 'L!', $packed_index );
    return $unpacked_index;
}

1;

package main;

use strict;
use warnings;
use Readonly;

# Uncomment the next line if File::Index is not in the
# same script file as the main program.
# use File::Index;

# Initialize some filenames.
Readonly my $BIG_FILENAME   => 'bigfile.tmp';
Readonly my $INDEX_FILENAME => $BIG_FILENAME . '.idx';

# If we already have a "big file" don't bother creating a new one.
unless( -f $BIG_FILENAME ) {
    make_big_file( $BIG_FILENAME, 10000 );
}

# Initialize an index object based on newlines for the record
# separator.
my $idx_newline = File::Index->new(
    $BIG_FILENAME,
    {
        idx_file   => $INDEX_FILENAME,
        record_sep => "\n",
    }
);

print "Choose a record number (blank entry exits).\n";
while( <> ) {
    chomp;
    last unless m/^\d+$/;
    my $index = $idx_newline->get_index( $_ );
    print "\tIndex for record $_: $index\n";
    open my $infile, '<', $BIG_FILENAME or die $!;
    seek( $infile, $index, 0 ) or die $!;
    defined( my $line = <$infile> )
        or die "Line $_ doesn't exist!\n";
    print "\t$line\n";
}

# ----- Helper subs for demonstration purposes. -----

# Creates a file of N lines length, where each line is of random
# width (from 0 through 78), and filled with random
# printable characters.
sub make_big_file {
    my( $filename, $lines ) = @_;
    open my $bigfile, '>', $filename or die $!;
    foreach( 0 .. $lines ) {
        my $line = generate_line( 0, 78 );
        print $bigfile $line;
    }
    close $bigfile or die $!;
}

# Generates a line of random printable characters.
sub generate_line {
    my( $min_length, $max_length ) = @_;
    $max_length++;    # Make the range "inclusive".
    my $line_length = int( rand( $max_length - $min_length ) ) + $min_length;
    my $line = '';    # Zero length is possible.
    $line .= generate_char() foreach 1 .. $line_length;
    $line .= "\n";
    return $line;
}

# Generate a single random printable character.
sub generate_char {
    # Range is limited to printable, from ASCII space
    # through ASCII 'z'.
    return chr( int( rand( 91 ) ) + 32 );
}

This incomplete yet functional module makes a few assumptions. First, you'll have to decide for yourself whether the index file and the data file are actually still in synch with each other. Second, you're dealing with record separators, not text matching. Third, we're assuming that the file being indexed doesn't change frequently, that it's long enough to take a while to scan through, and that we're scanning through it often enough to justify indexing it. Also, no attempt at file locking is made, so don't use this for anything serious. ...and a bunch of other assumptions that one might expect of a 'rough draft' of a module.

Here's how the module works:

  1. Create an index object with File::Index->new(). The arguments should be self-explanatory. The hashref of options is optional, and sane defaults exist.
  2. If an index for the datafile already exists, a new one is not created. If it doesn't exist, create_index_file() is called, and an index file is generated.
  3. Anytime the user suspects that the index file is out of synch, it's OK to invoke create_index_file() again to recreate the file.
  4. get_index() finds a seekable location for the Nth record.

The index file consists of packed unsigned longs in native byte order and size, so it's not portable across platforms; if the data file is moved to a different platform, it should be re-indexed there. Each long integer represents a tell/seek offset for the start of a single record.
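If portability ever mattered, one option (not implemented above, just a sketch) would be to store each offset in a fixed, endian-neutral format such as pack 'N' (a 32-bit big-endian unsigned integer), at the cost of a 4 GB limit on the data file:

# Hypothetical portable variant, not part of File::Index as written:
# each entry is a 32-bit big-endian unsigned integer, so the index file
# reads identically on any platform (but offsets are limited to 4 GB).
my $ENTRY_SIZE = length pack 'N', 0;    # always 4 bytes

sub get_portable_index {
    my( $index_handle, $item_no ) = @_;
    seek( $index_handle, $item_no * $ENTRY_SIZE, 0 ) or die $!;
    read( $index_handle, my $packed, $ENTRY_SIZE ) == $ENTRY_SIZE
        or die "No record for item number $item_no\n";
    return unpack 'N', $packed;
}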

Try it out and let me know what you think of the concept, and early stages of its implementation. "It sucks." is OK too, as long as you explain how you would make it 'not suck so much'. :)

This would probably be a lot more useful if it were implemented as a subclass of Tie::File. Tie::File::Indexed would create an index file so that subsequent ties wouldn't require re-indexing, assuming the file doesn't get modified without our knowledge. But I don't know Tie::File well enough to subclass it, so we get File::Index instead. In a more advanced draft, it would be worthwhile to implement regexp matching instead of only simple 'record separator' searching.

Other thoughts: They say flat files can only carry you so far. This module might be treading where non-flat-file solutions already do a better job. But the fact that the question comes up so often has to indicate there's at least a small need.


Dave

Re: Proof of concept: File::Index
by spurperl (Priest) on May 12, 2006 at 08:28 UTC
    This is indeed a common task, and it's nice that you're trying to build a generalized solution.

    I had to solve this problem before, and I used the kind help of the Monks to compose a solution. You might find the following threads interesting:

    This one and this one

Re: Proof of concept: File::Index (detect updates)
by tye (Sage) on May 12, 2006 at 16:45 UTC

    I think this is great. I'd add the following features...

    When you create the index, allow the user to specify "only appends expected". Record in the index the inode of the indexed file. When the file is re-tied, if the file has been modified more recently than the index file has been modified (alternately, store a timestamp in the index file rather than relying on the file system's timestamp for it), then the index might need to be recreated...

    When recreating the index, if "only appends expected" is true and the inode of the indexed file hasn't changed, then seek to the character before the Nth-to-last line and verify that it is "\n". Repeat for the Nth last lines. If all of that passes, then just scan the end of the file starting at the last indexed line and add indices for any additional lines (and update the index's timestamp).

    When grabbing the Nth line, if N > 0, actually read starting from one character before the Nth line and then verify that the first character is "\n". Verify that the last character is either "\n" or is the last character of the file. If either of these tests fail, reindex the file from the beginning. Strip the leading "\n" before returning the line, of course.

    You could also index the file lazily. That is, if they ask for the Nth line, then only index up to the Nth line. Next week, they could tie the file, the module would note that nothing had changed, then they ask for the 2*Nth line and the module would then index the N+1st line through the 2*Nth line.
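
    A rough sketch of the "only appends expected" re-check described above (hypothetical helper and data layout; it assumes the index records the data file's inode and that, as in File::Index, the final offset entry is the old end-of-file position):

    # Sketch only: extend the index in place if the data file has merely
    # grown; return 0 so the caller can fall back to a full rebuild.
    sub extend_index_if_appended {
        my( $data_filename, $idx ) = @_;
        return 0 if ( stat $data_filename )[1] != $idx->{inode};  # inode changed
        open my $fh, '<', $data_filename or die $!;
        my $old_eof = $idx->{offsets}[-1];
        if( $old_eof > 0 ) {
            # The byte just before the old end-of-file must still be "\n".
            seek( $fh, $old_eof - 1, 0 ) or die $!;
            my $byte = '';
            read( $fh, $byte, 1 );
            return 0 unless $byte eq "\n";
        }
        # The old entries still look valid: index only the appended tail.
        seek( $fh, $old_eof, 0 ) or die $!;
        while( <$fh> ) {
            push @{ $idx->{offsets} }, tell $fh;
        }
        return 1;
    }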

    Yes, the checks for updates aren't bullet-proof and users need to be able to force a re-index manually, of course, but I think the logic I outlined could be very valuable.

    Thanks,

    - tye        

Re: Proof of concept: File::Index
by Zaxo (Archbishop) on May 12, 2006 at 19:22 UTC

    A Makefile suffix rule to call a similar program (which gets $BIGFILENAME from the command line) whenever a foo.txt's mtime is newer than foo.txt.idx's:

    .txt.idx .txt:
    [tab]/path/to/davido_indexer $<
    make is for more than installations.

    There is a neat non-Perl solution to the related problem you mention of finding word locations. GNU idutils constructs a big dbm file of distinct words and their locations for a whole directory tree. File selection and language rules are customizable. Renamed from id-utils, maintenance was resumed last year. Recommended.

    After Compline,
    Zaxo

Re: Proof of concept: File::Index
by thor (Priest) on May 12, 2006 at 12:34 UTC
    What does this give us that Tie::File doesn't? I didn't read all of your node, but I didn't see that point addressed.

    thor

    The only easy day was yesterday

      Tie::File builds the index every time you tie the file, and doesn't store the index. That means every time a script starts up, Tie::File has to skim through the entire 'big file' to find all line endings or record separators. Building this index consumes O(n) time, every time the index is built. The entire point of File::Index is to avoid rebuilding the index file unless you specifically tell it to do so.

      File::Index is helpful for 'big files' that don't change frequently. File::Index builds an index and stores it for future use. That way, the next time the script is run, the index already exists, and item lookups will occur in O(1) time. Building the index still consumes O(n) time, but the index is only built once, or at worst, when you tell it to be rebuilt.

      I also used packed longs for the index so that the index file is as compact as is practical. And finding the Nth entry in the index file doesn't require reading through the entire file, it just involves seeking to LONG_SIZE * entry-number into the index file... an O(1) operation.


      Dave

Re: Proof of concept: File::Index
by spiritway (Vicar) on May 12, 2006 at 18:00 UTC

    This looks like a great solution to a fairly common and annoying problem. I do have a question, though. If the user thinks the file and index are out of sync, (s)he can recreate the index. No problem there. But what happens if some other process diddles with the file? Or rather, what would protect the file, once it's created?

    I could see some other program writing to the file, unaware of its special status, and trashing the relationship with the index. If then a user relies on the index, bad things could happen. Perhaps a sort of sanity check (modification times match)?

      Yes, future enhancements will perform basic sanity checks. It's a slippery slope though; I could perform a checksum check, but that defeats the efficiency purpose of storing the index in the first place. The best suggestions have been to keep track of modification times and file sizes. While not idiot-proof, such enhancements will catch a lot of out-of-sync problems.

      Ultimately it will be up to the module's users to decide if the file being indexed is reliable enough to be worth indexing, and how often it will need to be re-indexed. There are some situations where it just won't be practical; such as in environments where many sources are contributing to the data file at unknown intervals.
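
      A minimal sketch of that size/mtime check (hypothetical helper; the saved values are assumed to be recorded when the index is built):

      # Returns true if the data file no longer matches the size and mtime
      # that were recorded at index-creation time.
      sub index_looks_stale {
          my( $data_filename, $saved_size, $saved_mtime ) = @_;
          my( $size, $mtime ) = ( stat $data_filename )[ 7, 9 ];
          return 1 unless defined $size;    # data file is gone
          return $size != $saved_size || $mtime != $saved_mtime;
      }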


      Dave

        I feel your pain - when is it ever enough, vs. when does it cost too much to check these things?

        Even so, this looks to be a great solution to a whole lot of semi-minor major problems, where you don't want to just slurp the file, but don't much care to walk through it every time, either. I think it's great ;-).

        Update: Corrected "minor" mistake.

Re: Proof of concept: File::Index (consistancy)
by BrowserUk (Patriarch) on May 12, 2006 at 20:03 UTC

    Also, when the data file has been modified and has increased in size, check the consistency of the last dozen indexes. If they are consistent, only index the new records appended rather than reindexing the whole file.

    Actually, checking say log(n) (where n = number of lines) or so randomly selected lines spread across the file for consistency is probably (probabilistically) sufficient to detect changes, given that any 1-byte change in the length of any line will invalidate the offsets of all subsequent records until an exactly opposite change of length re-syncs them.

    Ignoring the possibility of deliberate tampering, do any of the maths wizards feel like calculating the odds of a randomly re-written file consistently matching more than, say, log(n) randomly chosen index offsets from the previous contents?
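
    A sketch of that spot check (hypothetical helper; $offsets holds the stored start-of-record positions, as File::Index records them):

    # Verify roughly log2(n) randomly chosen offsets: each sampled offset
    # must either be 0 or sit immediately after a "\n" in the data file.
    sub index_spot_check_ok {
        my( $data_filename, $offsets ) = @_;
        open my $fh, '<', $data_filename or die $!;
        my $n = scalar @$offsets;
        return 1 unless $n;                      # nothing to check
        my $samples = 1 + int( log( $n ) / log( 2 ) );
        for ( 1 .. $samples ) {
            my $offset = $offsets->[ int rand $n ];
            next if $offset == 0;                # start of file is always a record start
            seek( $fh, $offset - 1, 0 ) or die $!;
            my $byte = '';
            read( $fh, $byte, 1 );
            return 0 unless $byte eq "\n";       # this offset no longer starts a record
        }
        return 1;
    }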


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Proof of concept: File::Index
by xdg (Monsignor) on May 12, 2006 at 22:12 UTC
    This would probably be a lot more useful if it were implemented as a subclass of Tie::File.

    Actually, this seems like a job for inside-out objects. I just uploaded to CPAN a new, quick-and-dirty version of File::Marker (0.11) that adds the ability to save/load markers from an external file. It works similarly to what you've described, though you have to do the line-numbering work yourself. On the flip side, it handles the seeking for you and works like an IO::File object (which it subclasses).

    Creating the index:

    use File::Marker;
    my $file = File::Marker->new( $filename );

    # make the (zero-based) index
    my $i = 0;
    while ( ! $file->eof ) {
        $file->set_marker( $i++ );
        <$file>;
    }

    # save and rewind
    $file->save_markers( $index_filename );
    $file->goto_marker('LAST');    # rewind

    Re-using the index:

    use File::Marker;
    my $file = File::Marker->new( $filename );
    $file->load_markers( $index_filename );
    $file->goto_marker( 22 );    # start of 23rd line

    You could probably extend it to focus on line-numbers instead of marking any type of position in a file or add the various special features people have mentioned. Alternatively -- perhaps even better -- would be to reimplement it using Class::InsideOut, since File::Marker is only really intended to be a teaching module. If I get around to figuring out how to sanely and safely serialize IO::File based inside-out objects with Storable, this could be even easier.

    -xdg

    Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

Re: Proof of concept: File::Index (touch the index file)
by BrowserUk (Patriarch) on May 12, 2006 at 19:51 UTC

    Instead of adding extra metadata to the index file and messing up the direct relationship between lineno and offset, just 'touch' the index file's stats to mirror those of the datafile.
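
    A minimal sketch of that, assuming $data_filename and $idx_filename are the two paths involved:

    # After (re)building the index, copy the data file's timestamps onto the
    # index file; a later mismatch in mtimes then signals that the data file
    # has changed behind the index's back.
    my( $atime, $mtime ) = ( stat $data_filename )[ 8, 9 ];
    utime( $atime, $mtime, $idx_filename )
        or warn "Couldn't touch $idx_filename: $!";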


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Proof of concept: File::Index
by ioannis (Abbot) on May 13, 2006 at 04:16 UTC
    Instead of creating indexes, we could use the indexed array already available in BerkeleyDB using the recno structure; then, we walk along the cursor from any desired line_num. The database, of course, is always updatable.

    This command constructs such a Recno database from our given text file:

        cat foo.txt | db4.3_load -T -t recno bar.db

    Although simple, the command above needs further refinement if we are unsatisfied with the default definition of newline, with the escaping of inserted text, or problems due to binary input. (See db4.3_load(1) )
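
    On the Perl side, a sketch of the same recno idea using the DB_File module bundled with most Perls (a slightly different route: it ties the flat text file itself as a recno database, rather than loading it into a separate .db file first):

    use DB_File;
    use Fcntl;

    # Tie the text file to an array; each element is one newline-terminated
    # record, addressable directly by line number.
    tie my @lines, 'DB_File', 'foo.txt', O_RDONLY, 0666, $DB_RECNO
        or die "Cannot tie foo.txt: $!";
    print $lines[299], "\n";    # the 300th line (zero-based index)
    untie @lines;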

Re: Proof of concept: File::Index
by mattr (Curate) on May 18, 2006 at 04:31 UTC
    Nice, looks interesting and eminently useful. Of course the "any time the user suspects" part is a bit weak.

    it's impossible to find the Nth occurrence of some phrase or word in a book without opening the book and counting your way through it.

    I would like to note that after practice I became able to consistently open a book to the correct page in the case of a thick Japanese character dictionary (Nelson's) back in school. I think this may be like a lookup table that matches thickness of pages before one's thumb to a list of >100 chapters. At least it always worked for the most important chapter.

    So if you know where a change had been made in a file, you could in fact jump to a prestudied location before that point, which you know has X occurrences of a pattern before it, and then count N-X occurrences starting from there instead. Metadata describing the various prestudied points (or results of prerun pattern matches) could be saved in a memo at the head of the index file.

    You could also save a series of checksums per chapter (if not per line) and this could help determine where a change was made, though maybe Diff could do something similar. This would let you enjoy the benefits of a flat file, i.e. do regex pattern matching or tie the file to some module's object model like Config, while also enjoying some of the structure given by a record-based object store.

    Personally I would probably rather have an index that operated based on keywords or patterns than using a recno. If the text file has a list of paragraphs, I could save a few words describing each paragraph in the index and then later jump to the Nth article matching a given keyword or above a certain score. Or perhaps I have a list of events in a calendar, and each would have an event type or event owner associated with it. In this case maybe I would like to have multiple lines per record, in other words the delimiter would not be "\n". Maybe I'd like a (not necessarily unique) date-based key, or a certain format serial number. These are just ideas.

    I am trying to think of when I would want to use your new module, and I keep thinking of extracting descriptive words from text as in NLP (natural language processing) and saving them with each paragraph or sentence. Regardless of whether this is a single flat file or not, it would be useful, and a tool to navigate the precompiled index with pointers into the data would seem useful. Perhaps a callback or plugins for index creation would be useful.

    At the moment I am thinking of indexing books, which make nice flat files. I wrote a little program that lets me read books from my server on my cell phone when on the train (it turns out that's not cheap, but anyway), and I read 10KB per page (the max that fits in RAM, and enough to reach the next station). It would be nice if I had an index built so as to allow me to make one page end at the end of a sentence, within the 10K limit. It is so much of a pain that currently I even split words across pages. A recno could be used as a bookmark, if the recno is created based on a page length and a "try not to break sentences across pages" heuristic. So to make a long story short, it would be interesting if your module would support creation of indices based on pages whose length is decided somewhat intelligently. Would that be possible with your module? Keep up the good work!