PerlMonks  

Re: search a large text file

by Your Mother (Archbishop)
on Feb 08, 2011 at 17:13 UTC


in reply to search a large text file

I'm boostering for KinoSearch because I think it's undervalued or perhaps just not known well enough.

    use warnings;
    use strict;
    use KinoSearch::Plan::Schema;
    use KinoSearch::Plan::StringType;
    use KinoSearch::Index::Indexer;
    use KinoSearch::Search::IndexSearcher;
    use Time::HiRes qw( tv_interval gettimeofday );

    # "term" is indexed for exact-match lookup; "number" is stored but not searchable.
    my $schema      = KinoSearch::Plan::Schema->new;
    my $string_type = KinoSearch::Plan::StringType->new;
    my $store_only  = KinoSearch::Plan::StringType->new( indexed => 0 );
    $schema->spec_field( name => "term",   type => $string_type );
    $schema->spec_field( name => "number", type => $store_only );

    my $index_dir = "./ks-index";

    # Build the index only if the index directory does not exist yet.
    unless ( -d $index_dir ) {
        my $data_file = shift || die "Give me a data file!\n";
        mkdir $index_dir or die $!;
        my $indexer = KinoSearch::Index::Indexer->new(
            index  => $index_dir,
            schema => $schema,
            create => ! -d $index_dir,
            # Truncate each run or duplicate content.
            truncate => 1,
        );
        open my $data, "<", $data_file
            or die "Couldn't open $data_file to read: $!";
        while (<$data>) {
            my ( $term, $number ) = split /\s+/;
            $indexer->add_doc({
                term   => $term,
                number => $number,
            });
        }
        $indexer->commit;
    }

    print "I'm going to search as long as you give me input...\n";
    my $searcher = KinoSearch::Search::IndexSearcher->new( index => $index_dir );

    # Each line read from STDIN is treated as one query.
    while (<STDIN>) {
        chomp;
        my $t0   = [gettimeofday];
        my $hits = $searcher->hits(
            query      => $_,
            offset     => 0,
            num_wanted => 5,
        );
        while ( my $hit = $hits->next ) {
            printf "%20s --> %d\n", $hit->{term}, $hit->{number};
        }
        printf qq{Found %d matches looking for "%s"\n}, $hits->total_hits, $_;
        printf "Search took %.3f seconds\n", tv_interval( $t0, [gettimeofday] );
    }
    exit 0;

Using this dataset (one million records)-

    perl -Minteger -le 'printf"text%d\t%d\n",rand(100), rand(100) for 1 .. 1_000_000'

That code gives these results (on a fairly modest *nix box)-

    I'm going to search as long as you give me input...
    text13
                  text13 --> 31
                  text13 --> 22
                  text13 --> 69
                  text13 --> 81
                  text13 --> 96
    Found 10044 matches looking for "text13"
    Search took 0.002 seconds
    text99
                  text99 --> 66
                  text99 --> 76
                  text99 --> 11
                  text99 --> 59
                  text99 --> 26
    Found 9964 matches looking for "text99"
    Search took 0.002 seconds
    text100
    Found 0 matches looking for "text100"
    Search took 0.000 seconds

It scales extremely well. It may not look like an obvious conceptual match for your problem space, but it has the goods and might be exactly what you need.

Replies are listed 'Best First'.
Re^2: search a large text file
by creamygoodness (Curate) on Feb 09, 2011 at 22:24 UTC

    I suspect that KinoSearch would work about as well as a database like SQLite or PostgreSQL for this. It's actually a decent conceptual match -- inverted indexers like KinoSearch, Lucene, Xapian, etc. are optimized for many reads and fewer inserts, as opposed to the typical B-tree indexes on databases which handle inserts a little better. The only thing that's odd is that the original poster doesn't seem to need the relevance-based ranking that inverted indexes do well.

    Regardless, the problem is straightforward and there are lots of good options for solving it.
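
To make the inverted-index idea from the reply above concrete, here is a toy sketch in plain Perl: each term maps to the list of numbers recorded with it, so a lookup is a single hash access. The function names `build_index` and `search_index` are made up for this illustration, and the in-memory hash is only an assumption-laden stand-in; it says nothing about how KinoSearch, Lucene, or Xapian actually store their postings on disk.

```perl
use strict;
use warnings;

# Build a toy inverted index: term => [ numbers seen with that term ].
# Input lines are assumed to look like "term<whitespace>number".
sub build_index {
    my (@lines) = @_;
    my %index;
    for my $line (@lines) {
        my ( $term, $number ) = split ' ', $line;
        next unless defined $number;    # skip blank or malformed lines
        push @{ $index{$term} }, $number;
    }
    return \%index;
}

# Look up a term. Returns the total number of matches and an array
# reference holding at most $num_wanted of the stored numbers.
sub search_index {
    my ( $index, $term, $num_wanted ) = @_;
    my $postings = $index->{$term} || [];
    my $last     = $#{$postings};
    $last = $num_wanted - 1 if $last > $num_wanted - 1;
    return ( scalar @$postings, [ @{$postings}[ 0 .. $last ] ] );
}

# Small demonstration with three hand-written records.
my $index = build_index( "text13\t31", "text99\t66", "text13\t22" );
my ( $total, $hits ) = search_index( $index, "text13", 5 );
printf "%20s --> %d\n", "text13", $_ for @$hits;
printf qq{Found %d matches looking for "%s"\n}, $total, "text13";
```

All of the work happens at build time, and reads are cheap afterwards, which is the "many reads, fewer inserts" trade-off in miniature. A real engine adds on-disk segments, ranking, and query parsing on top of this.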

      PostgreSQL does indeed have btree indexes, but it also has inverted indexes (GIN) and the excellent GiST index type. (It seems to me the btree type does well enough in this case; see my example below, where searching a table of 223-million+ rows takes a tenth of a millisecond.)

      PostgreSQL index-type docs here.

      I'm just reacting to the juxtaposition of SQLite and Postgres; really, SQLite, handy as it often is, cannot be compared with a powerful database system like PostgreSQL.

      (And I should really try & compare Your Mother's example with KinoSearch, and see if he is right; maybe over the weekend... )
