comment on

The topic seems to come up almost weekly: "How do I jump directly to the 300th line of a file without scanning through the file line by line?" I assume this question comes to us again and again because the people asking it perceive that it's a little inefficient to scan through a huge file, line by line, just to find the Nth line. ...they may be right, if they're performing this task often on the same file, and the file is big enough.

Of course the problem here is one of logic: How do you find the Nth line of a file, where lines can be of unknown length, without first skimming through the file to find each line ending? The answer is, you can't. I've previously suggested this is not a problem specific to Perl. It's also not a problem specific to computers either. My previous object lesson has been to point out that it's impossible to find the Nth occurrence of some phrase or word in a book without opening the book and counting your way through it. Files on computers aren't all that different in this regard.

One solution, if a person happens to need to skim through a long file frequently, is to build an index file to keep track of each 'record separator' so that they can be found quickly later. The following snippet of code creates a module: File::Index, and puts it through the paces to demonstrate a proof of concept of developing an index file to keep track of lines in a file.

package File::Index;

use strict;
use warnings;

use Carp;
use Readonly;
use Config;

Readonly my $CONFIG_LONG_SIZE => $Config{longsize};

sub new {
    my( $class, $data_filename, $options ) = @_;
    
    my $self = {};
    bless $self, $class;
    
    $options = defined( $options ) ? $options : {};

    $self->{data_filename} = $data_filename;

    $self->{idx_filename} = 
        exists( $options->{idx_file} ) 
        ? $options->{idx_file} 
        : $data_filename . '.idx';
                        
    $self->{record_sep} =       
        exists( $options->{record_sep} ) 
        ? $options->{record_sep} 
        : "\n";
    $self->{data_binmode} = 
        exists( $options->{data_binmode}    )
        ? $options->{data_binmode}
        : 0;

    if( not -f $self->{idx_filename} ) {
        $self->create_index_file();
    }

    return $self;
}


sub create_index_file {
    my( $self ) = @_;
    
    open my $datafile_handle, '<', $self->{data_filename} or croak $!;

    if( $self->{data_binmode} ) {
        binmode $datafile_handle;
    }
    
    open my $index_handle   , '>', $self->{idx_filename}  or croak $!;
    
    local $/ = $self->{record_sep};
    binmode( $index_handle );
    
    print $index_handle pack( 'L', 0 ) or croak $!;
    
    while( <$datafile_handle> ) {
        my $entry = pack 'L', tell();
        print $index_handle $entry or croak $!;
    }
    
    close $datafile_handle;
    close $index_handle or croak $!;
}


sub get_index {
    my( $self, $item_no ) = @_;
    
    open my $index_handle, '<', $self->{idx_filename} or croak $!;

    binmode( $index_handle );
    local $\ = \$CONFIG_LONG_SIZE;

    seek( $index_handle, $item_no * $CONFIG_LONG_SIZE, 0 ) 
        or croak $!;
    
    defined( my $packed_index = <$index_handle>) 
        or croak "No record for item number $item_no\n";
    
    my $unpacked_index = unpack( 'L', $packed_index );
    
    return $unpacked_index;
}

1;







package main;

use strict;
use warnings;

use Readonly;

# Uncomment the next line if File::Index is not in the 
# same script file as the main program.
# use File::Index; 

# Initialize some filenames.
Readonly my $BIG_FILENAME   => 'bigfile.tmp'         ;
Readonly my $INDEX_FILENAME => $BIG_FILENAME . '.idx';

# If we already have a "big file" don't bother creating a new one.
unless( -f $BIG_FILENAME ) {
    make_big_file( $BIG_FILENAME, 10000 );
}

# Initialize an index object based on newlines for the record
# separator.
my $idx_newline = File::Index->new( 
    $BIG_FILENAME, 
    { 
        index_filename => $INDEX_FILENAME, 
        record_sep     => "\n" 
    }
);

print "Choose a record number (blank entry exits).\n";

while( <> ) {

    chomp;
    last unless m/^\d+$/;

    print "\tIndex for record $_: ", 
          my $index = $idx_newline->get_index( $_ ), 
          "\n"; 

    open my $infile, '<', $BIG_FILENAME or die $!;
    seek( $infile, $index, 0 ) or die $!;

    defined( my $line = <$infile> ) or die "Line $_ doesn't exist!\n";

    print "\t$line\n";
}






# ----- Helper subs for demonstration purposes. -----

# Creates a file of N lines length, where each line is of random
# width (from 0 through 78), and filled with random
# printable characters.
sub make_big_file {
    my( $filename, $lines ) = @_;

    open my $bigfile, '>', $filename or die $!;

    foreach( 0 .. $lines ) {
        my $line = generate_line( 0, 78 );
        print $bigfile $line;
    }

    close $bigfile or die $!;
}

# Generates a line of random printable characters.
sub generate_line {
    my( $min_length, $max_length ) = @_;
    
    $max_length++;      # Make the range "inclusive".
    
    my $line_length = 
        int( rand( $max_length - $min_length ) ) + $min_length;
    
    my $line = '';

    # Zero length is possible.
    $line .= generate_char() foreach 1 .. $line_length;  

    $line .= "\n";

    return $line;
}


# Generate a single random printable character.
sub generate_char {
    # Range is limited to printable, from ASCII space 
    # through ASCII 'z'.
    return chr( int( rand( 90 ) ) + 32 );
}
[download]

This incomplete yet functional module makes a few assumptions: First, you are going to have to decide for yourself whether the index file and the data file are actually still in synch with each other. Second, you're dealing with record separators, not text matching. Third, the we're making the assumption that the file being indexed doesn't change 'frequently', that it's long enough to take awhile to scan through, and that we're scanning through it often enough to justify indexing it. And also, no attempt at file locking is made, so don't use this for anything serious. ...and a bunch of other assumptions that one might expect of a 'rough draft' of a module.

Here's how the module works:

Create an index object with File::Index->new(). The arguments should be self explanatory. The hashref args are optional, and sane defaults exist.
If an index for the datafile already exists, a new one is not created. If it doesn't exist, create_index() is called, and an index file is generated.
Anytime the user suspects that the index file is out of synch, it's ok to invoke create_index() again, to recreate the file.
get_index() finds a seekable location for the Nth record.

The index file consists of packed unsigned longs. It's not portable across platforms, so if a file is moved, it should be re-indexed. Each long integer represents a tell/seek location of a single record.

Try it out and let me know what you think of the concept, and early stages of its implementation. "It sucks." is OK too, as long as you explain how you would make it 'not suck so much'. :)

This would probably be a lot more useful if it were implemented as a subclass of Tie::File. Tie::File::Indexed would create an index file so that subsequent ties wouldn't require re-indexing, assuming the file doesn't get modified without our knowledge. But I don't know Tie::File well enough to subclass it, so we get File::Index instead. In a more advanced draft, it would be worthwhile to implement regexp matching instead of only simple 'record separator' searching.

Other thoughts: They say flat files can only carry you so far. This module might be treading where non-flat-file solutions already do a better job. But the fact that the question comes up so often has to indicate there's at least a small need.

Dave

In reply to Proof of concept: File::Index by davido

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


Syntactic Confectionery Delight
	PerlMonks