search a large text file

perl_lover_always has asked for the wisdom of the Perl Monks concerning the following question:

Re: search a large text file by chrestomanci (Priest) on Feb 08, 2011 at 11:40 UTC
Considering the size of the file, nothing will be fast. However, unless you need to do lots of searches, there is no need to sort the file. You are only interested in the lines that match your search term. If you don't have many searches to do, then the approach I would take would be to filter the file using Unix tools to get the lines of interest, e.g.: `grep 'text2' <filename>` This will give you the lines containing the data you want in linear time. From there you can easily turn it into a Perl array. If you need to do lots of searches, then you will need to construct an index of some sort, or ingest the whole thing into a database, and let the database engine worry about the index.
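A minimal sketch of the grep-then-array step described above. It assumes a Unix-like system with `grep` on the PATH and lines of the form `text2 2` (key, whitespace, number); that format is inferred from the rest of the thread, and `grep_numbers` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: let grep filter the big file, then collect the
# number at the end of each matching line into a Perl array.
# Assumed line format: key, whitespace, integer (e.g. "text2 2").
sub grep_numbers {
    my ($term, $path) = @_;

    # List-form pipe open bypasses the shell; -F treats $term as a
    # fixed string rather than a regular expression.
    open my $grep, '-|', 'grep', '-F', '--', $term, $path
        or die "Cannot run grep: $!";

    my @found;
    while (my $line = <$grep>) {
        # grep -F matches substrings ("text2" also hits "text22"),
        # so re-check that the key is the whole first field.
        push @found, $2
            if $line =~ /^(.*\S)\s+(\d+)\s*$/ && $1 eq $term;
    }
    close $grep;    # grep exits non-zero on "no match"; not an error here
    return @found;
}
```

Each call still scans the whole file once, so this only suits the "few searches" case.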
Re: search a large text file by davido (Cardinal) on Feb 08, 2011 at 11:45 UTC
No need to sort, given your description of the problem.

```perl
my @found;
while( <file> ) {
    next unless m/text2/;    # Reject disqualified lines quickly.
    m/\s+(\d+)/ or die "Unexpected data format at file line $.\n";
    push @found, $1;
}
```

Dave
Re^2: search a large text file by perl_lover_always (Acolyte) on Feb 08, 2011 at 13:03 UTC
Since my file is big and my searches are frequent and numerous, searching a large file with this method is very slow.
Re: search a large text file by erix (Prior) on Feb 08, 2011 at 17:27 UTC
I put together an example in case you want to use PostgreSQL. The file I used is available here: ftp://ftp.ncbi.nih.gov/genbank/livelists It's similar to yours, but it has three columns. I unzipped it and loaded it into Postgres, in a table t; there are more than 223 million rows.

```
$ ls -lh GbAccList.0206.2011
-rw-rw-r-- 1 aardvark aardvark 4.6G Feb 8 17:21 GbAccList.0206.2011

$ head -n 3 GbAccList.0206.2011
AACY024124353,1,129566152
AACY024124495,1,129566175
AACY024124494,1,129566176

$ time < GbAccList.0206.2011 psql -qc "
    create table t (c text, i1 integer, i2 integer);
    copy t from stdin csv delimiter E',';"
real 3m47.448s

$ time echo "
    create index t_i2_idx on t (i2);
    analyze t;" | psql -q
real 5m50.291s
```

Searches are now around a tenth of a millisecond:

```
# 5 'random' searches like:
echo "explain analyze select * from t where i2 = $gi;" | psql
```

Just showing the timings of five searches:

```
Index Cond: (i2 = 2017697)
Total runtime: 0.157 ms
Index Cond: (i2 = 6895719)
Total runtime: 0.109 ms
Index Cond: (i2 = 3193323)
Total runtime: 0.119 ms
Index Cond: (i2 = 8319666)
Total runtime: 0.091 ms
Index Cond: (i2 = 1573171)
Total runtime: 0.119 ms
```

Of course, performance depends on the hardware used.

(A similar problem/solution here: Re^3: sorting very large text files (slander))
Re^2: search a large text file by BrowserUk (Patriarch) on Feb 08, 2011 at 17:35 UTC
Nice one again++ :)
Re: search a large text file by BrowserUk (Patriarch) on Feb 08, 2011 at 12:48 UTC
"I have a very large text file (~5g)"
Is this a single 5GB file used over and over? Or a new 5GB file each time?

"I need to search each time"
How many searches do you need to do? How often? What is your target time?

"I want to search for 'text2' and retrieve 2 and 3,"
How long are the texts? Are they ASCII or Unicode?

"but the sorting process takes ages"
How long is "ages"?

The more clearly you explain your task, the more likely that someone will see a viable solution.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.
Re^2: search a large text file by perl_lover_always (Acolyte) on Feb 08, 2011 at 13:01 UTC
The file is a single file; it is created once and does not change afterwards, so I need to use it as a kind of dictionary. It will not be updated. The searches are frequent and happen inside loops, so once I access my file I have source terms that I have to search for in the big file. The texts are at most two words long, in Unicode. I've run it for a few days and it is still running! I even split it into small portions (less than 200 MB each), and after 24 hours they are still running.
Re^3: search a large text file by chrestomanci (Priest) on Feb 08, 2011 at 13:31 UTC
So in short you have a static 5G dataset that you need to search frequently. I think your best bet would be to use a database to index the data, and let it worry about how to create an optimised index. I would put the entire file contents into the database, and discard the original file. If each line also contains lots of other stuff that you will not be searching on, then I would still keep it in the database, but I would put it in a different column without an index so as not to bloat the database too much.
Re^3: search a large text file by BrowserUk (Patriarch) on Feb 08, 2011 at 13:33 UTC
This really does sound like a perfect application for using a database. Especially if you are generating the file and can load it directly into the DB and cut out the middle-man file. That said, loading the DB via the tool's bulk loader is often faster than loading it via DBI one record at a time.
Re^3: search a large text file by jethro (Monsignor) on Feb 08, 2011 at 13:34 UTC
This is the ideal application for a hash tied to a file. You might like to take a look at DBM::Deep. This is a well-tested and well-liked implementation of a disk-based hash. Just use a script to generate your hash once (that will take a while); after that any search will be nearly as fast as a single disk access. Store multiple values either concatenated as a string or, better, use an array for that. Since DBM::Deep is multilevel, storing a hash of arrays is no further problem.
Re^4: search a large text file by perl_lover_always (Acolyte) on Feb 08, 2011 at 14:41 UTC
Re^5: search a large text file by jethro (Monsignor) on Feb 08, 2011 at 17:34 UTC
Re^3: search a large text file by moritz (Cardinal) on Feb 08, 2011 at 13:58 UTC
Sounds like a perfect match for dictd, which is very fast and has a Perl client on CPAN: Net::Dict

Perl 6 - second systems done right
Re^4: search a large text file by perl_lover_always (Acolyte) on Feb 08, 2011 at 14:44 UTC
Re: search a large text file by fisher (Priest) on Feb 08, 2011 at 11:37 UTC
1. What exactly did you do? Can we see your code? 2. Is your file static, or does it change constantly on the fly? If it changes, is that in append mode or random access?
Re: search a large text file by pajout (Curate) on Feb 08, 2011 at 13:43 UTC
You can preprocess it using, for instance, the DB_File module. See the DB_HASH chapter of http://perldoc.perl.org/DB_File.html#A-Simple-Example -> you can have a huge hash which is stored in a file but behaves relatively transparently in Perl code, like a normal hash. Or you can use one table in some SQL database, having just two columns, 'key' and 'val', with an index on 'key'.
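A minimal sketch of the DB_File approach, under some assumptions: lines look like `text2 2` (key, whitespace, number), which is inferred from the thread rather than confirmed, and multiple numbers per key are stored as one comma-separated string. The names `build_index`, `open_index` and `lookup` are hypothetical.

```perl
use strict;
use warnings;
use DB_File;
use Fcntl qw(O_RDWR O_CREAT O_RDONLY);

# Build step, run once against a fresh database file: read the text
# file and store each key's numbers as a comma-separated string.
# Keys are kept as raw bytes, so look-up terms must use the same
# encoding as the file (e.g. UTF-8 bytes).
sub build_index {
    my ($text_path, $db_path) = @_;
    tie my %index, 'DB_File', $db_path, O_RDWR | O_CREAT, 0644, $DB_HASH
        or die "Cannot create $db_path: $!";
    open my $in, '<', $text_path or die "Cannot open $text_path: $!";
    while (my $line = <$in>) {
        # Assumed format: key (may contain spaces), whitespace, integer.
        my ($key, $num) = $line =~ /^(.*\S)\s+(\d+)\s*$/ or next;
        my $old = $index{$key};
        $index{$key} = defined $old ? "$old,$num" : $num;
    }
    close $in;
    untie %index;
    return;
}

# Open the finished index read-only; keep the returned hash reference
# around so that repeated searches reuse the same open file.
sub open_index {
    my ($db_path) = @_;
    tie my %index, 'DB_File', $db_path, O_RDONLY, 0644, $DB_HASH
        or die "Cannot open $db_path: $!";
    return \%index;
}

# Return the list of numbers stored for $key (empty list if absent).
sub lookup {
    my ($index, $key) = @_;
    my $value = $index->{$key};
    return defined $value ? split(/,/, $value) : ();
}
```

After the one-off build, something like `my $idx = open_index('big.db'); my @nums = lookup($idx, 'text2');` does a single keyed look-up instead of a scan of the whole file.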
Re: search a large text file by Your Mother (Archbishop) on Feb 08, 2011 at 17:13 UTC
I'm boostering for KinoSearch because I think it's undervalued or perhaps just not known well enough. The code (not shown here) gives these results (on a fairly modest *nix box):

```
I'm going to search as long as you give me input...
text13
text13 --> 31
text13 --> 22
text13 --> 69
text13 --> 81
text13 --> 96
Found 10044 matches looking for "text13"
Search took 0.002 seconds
text99
text99 --> 66
text99 --> 76
text99 --> 11
text99 --> 59
text99 --> 26
Found 9964 matches looking for "text99"
Search took 0.002 seconds
text100
Found 0 matches looking for "text100"
Search took 0.000 seconds
```

It scales extremely well. It seems like it's not a conceptual or obvious match for your problem space but it has the goods and might be exactly what you need.
Re^2: search a large text file by creamygoodness (Curate) on Feb 09, 2011 at 22:24 UTC
I suspect that KinoSearch would work about as well as a database like SQLite or PostgreSQL for this. It's actually a decent conceptual match -- inverted indexers like KinoSearch, Lucene, Xapian, etc. are optimized for many reads and fewer inserts, as opposed to the typical B-tree indexes on databases which handle inserts a little better. The only thing that's odd is that the original poster doesn't seem to need the relevance-based ranking that inverted indexes do well. Regardless, the problem is straightforward and there are lots of good options for solving it.
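To illustrate the idea of an inverted index in its simplest form, here is a toy in-memory sketch. It is not how KinoSearch or any other engine is implemented; it only shows the key-to-postings mapping. The line format (`text13 31`) is inferred from the thread, the names `build_postings` and `search` are hypothetical, and a 5 GB file may well not fit in memory this way.

```perl
use strict;
use warnings;

# Read "key number" lines from a filehandle and map each key to the
# list of numbers ("postings") seen with it. Build once, read often.
sub build_postings {
    my ($fh) = @_;
    my %postings;
    while (my $line = <$fh>) {
        # Assumed format: key, whitespace, integer; skip anything else.
        my ($key, $num) = $line =~ /^(.*\S)\s+(\d+)\s*$/ or next;
        push @{ $postings{$key} }, $num;
    }
    return \%postings;
}

# Look up one key; returns an empty list when the key is unknown.
sub search {
    my ($postings, $key) = @_;
    return @{ $postings->{$key} // [] };
}
```

Building costs one pass over the data; each later look-up is a single hash access, which is the read-heavy trade-off described above.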
Re^3: search a large text file by erix (Prior) on Feb 10, 2011 at 13:46 UTC
PostgreSQL does indeed have btree indexes, but also inverted indexes (GIN), and the excellent GiST index type. (It seems to me the btree type does well enough in this case: see my example elsewhere in this thread, where searching a table of more than 223 million rows takes a tenth of a millisecond.) PostgreSQL index-type docs here. I'm just reacting to the juxtaposition of SQLite and Postgres; really, SQLite, handy as it often is, cannot be compared with a powerful database system like PostgreSQL. (And I should really try and compare Your Mother's example with KinoSearch, and see if he is right; maybe at the weekend...)
Re: search a large text file by tilly (Archbishop) on Feb 08, 2011 at 15:36 UTC
For previous discussion of this problem see statistics of a large text and Reaped: a large text file into hash. As I pointed out to you in the previous discussions, this is likely to be slow. The next step that I suggested is to parallelize work with Hadoop. Have you tried that yet?
Re^2: search a large text file by BrowserUk (Patriarch) on Feb 08, 2011 at 17:31 UTC
Doesn't Hadoop require a cluster of servers and extensive software setup?
Re^3: search a large text file by tilly (Archbishop) on Feb 08, 2011 at 17:52 UTC
Yes. But EC2 is reasonably priced, and the setup is not that bad. See http://wiki.apache.org/hadoop/AmazonEC2 for a guide.
Re: search a large text file by JavaFan (Canon) on Feb 08, 2011 at 13:53 UTC
```sql
CREATE TABLE foo (
    column1 varchar(...),
    column2 integer,
    KEY(column1)
);

SELECT column1, column2
  FROM foo
 WHERE column1 IN ('text1', 'text2', 'text3');
```

In between, of course, you need to populate the table. Once.
Re: search a large text file by sundialsvc4 (Abbot) on Feb 08, 2011 at 15:08 UTC
“5 gigabytes” is not what one would consider a particularly “large” file anymore. Simply create a database ... SQLite would be an excellent choice since it requires no server and runs everywhere ... and load the information into it using any one of several existing SQLite tools. I suspect that you may have persuaded yourself that “I must write a (Perl) program to do this,” when in fact you probably don’t have to write anything at all. Load your reference data in one table, and your search into another, and do an `INNER JOIN`. Mission accomplished. “To a hammer, everything is a nail.”
Re^2: search a large text file by perl_lover_always (Acolyte) on Feb 08, 2011 at 15:17 UTC
I don't need to just load it into memory! I need to extract the info from the file and use it. I'm working in Perl: I want to search for a specific string and extract the numbers that are assigned to that string in my so-called 'big file'. Since my code is in Perl, I just want to write a function to do it for me!


	PerlMonks