Re: search a large text file
by chrestomanci (Priest) on Feb 08, 2011 at 11:40 UTC
|
Considering the size of the file, nothing will be fast. However, unless you need to do lots of searches, there is no need to sort the file: you are only interested in the lines that match your search term.
If you don't have many searches to do, the approach I would take would be to filter the file using Unix tools to get the lines of interest, e.g.:
grep 'text2' <filename>
This will give you the lines containing the data you want in linear time. From there you can easily turn it into a Perl array.
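For instance, a short sketch of the grep-then-parse idea. The data format is an assumption (one "key number" pair per line, as in the OP's description), and the sample file here is just a tiny stand-in for the real 5GB file:

```perl
use strict;
use warnings;

# Tiny stand-in for the real file, so the sketch is self-contained.
my $file = 'bigfile_sample.txt';
open my $out, '>', $file or die "Cannot write $file: $!";
print $out "text1 10\ntext2 2\ntext2 3\ntext10 99\n";
close $out;

# -w matches whole words only, so 'text2' does not also match 'text22'.
open my $grep, '-|', 'grep', '-w', 'text2', $file
    or die "Cannot run grep: $!";

my @numbers;
while ( my $line = <$grep> ) {
    push @numbers, $1 if $line =~ /\s(\d+)\s*$/;
}
close $grep;
unlink $file;

print "@numbers\n";    # prints "2 3"
```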
If you need to do lots of searches, then you will need to construct an index of some sort, or ingest the whole thing into a database and let the database engine worry about the index.
| [reply] [d/l] |
Re: search a large text file
by davido (Cardinal) on Feb 08, 2011 at 11:45 UTC
|
No need to sort, given your description of the problem.
my @found;
open my $fh, '<', $file or die "Can't open $file: $!";
while ( <$fh> ) {
    next unless m/text2/;    # Reject disqualified lines quickly.
    m/\s+(\d+)/ or die "Unexpected data format at file line $.\n";
    push @found, $1;
}
close $fh;
| [reply] [d/l] |
|
since my file is big and my searches are frequent and numerous, scanning the whole file like this for every search is very slow.
| [reply] |
Re: search a large text file
by erix (Prior) on Feb 08, 2011 at 17:27 UTC
|
I put together an example in case you want to use PostgreSQL:
The file I used is available here:
ftp://ftp.ncbi.nih.gov/genbank/livelists
It's similar to yours, but it has three columns.
I unzipped it and put it into postgres, in a table t; there are more than 223 million rows.
$ ls -lh GbAccList.0206.2011
-rw-rw-r-- 1 aardvark aardvark 4.6G Feb 8 17:21 GbAccList.0206.2011
$ head -n 3 GbAccList.0206.2011
AACY024124353,1,129566152
AACY024124495,1,129566175
AACY024124494,1,129566176
$ time < GbAccList.0206.2011 psql -qc "
create table t (c text, i1 integer, i2 integer);
copy t from stdin csv delimiter E',';"
real 3m47.448s
$ time echo "
create index t_i2_idx on t (i2);
analyze t;" | psql -q
real 5m50.291s
Searches are now around a tenth of a millisecond:
# 5 'random' searches like:
echo "explain analyze
select * from t
where i2 = $gi;" | psql
Just showing the timings of five searches:
Index Cond: (i2 = 2017697) Total runtime: 0.157 ms
Index Cond: (i2 = 6895719) Total runtime: 0.109 ms
Index Cond: (i2 = 3193323) Total runtime: 0.119 ms
Index Cond: (i2 = 8319666) Total runtime: 0.091 ms
Index Cond: (i2 = 1573171) Total runtime: 0.119 ms
Of course, performance depends on the hardware used.
(a similar problem/solution here: Re^3: sorting very large text files (slander))
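For the Perl side, the table above could then be queried via DBI. This is just a sketch: it assumes DBD::Pg, and the connection details (dbname, user) are placeholders for your own setup:

```perl
use strict;
use warnings;
use DBI;

# Hypothetical connection settings; adjust dbname/user to your install.
my $dbh = DBI->connect( 'dbi:Pg:dbname=test', 'aardvark', '',
                        { RaiseError => 1, AutoCommit => 1 } );

# Prepare once, execute many times -- this is where the index pays off.
my $sth = $dbh->prepare('SELECT c, i1, i2 FROM t WHERE i2 = ?');

for my $gi ( 2017697, 6895719 ) {
    $sth->execute($gi);
    while ( my ( $acc, $version, $gi2 ) = $sth->fetchrow_array ) {
        print "$acc.$version -> $gi2\n";
    }
}
$dbh->disconnect;
```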
| [reply] [d/l] [select] |
Re: search a large text file
by BrowserUk (Patriarch) on Feb 08, 2011 at 12:48 UTC
|
I have a very large text file (~5g)
Is this a single 5GB file used over and over? Or a new 5GB file each time?
I need to search each time
How many searches do you need to do? How often? What is your target time?
I want to search for 'text2' and retrieve 2 and 3,
How long are the texts? Are they ascii or unicode?
but the sorting process takes ages
How long is "ages"?
The more clearly you explain your task, the more likely that someone will see a viable solution.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
|
The file is a single file: it is created once and does not change any more, so I need to use it as a kind of dictionary! It will not be updated.
The searches are frequent and happen inside loops: once I access my file, I have source terms that I have to search for in the big file.
The texts are at most two words long, in Unicode format!
I've run it for a few days and it is still running! I even split the file into small portions (less than 200 MB) and after 24 hours they are still running.
| [reply] |
|
So in short you have a static 5GB dataset that you need to search frequently.
I think your best bet would be to use a database to index the data, and let it worry about how to create an optimised index.
I would put the entire file contents into the database, and discard the original file. If each line also contains lots of other stuff that you will not be searching on, then I would still keep it in the database, but I would put it in a different column without an index so as not to bloat the database too much.
| [reply] |
|
This really does sound like a perfect application for using a database. Especially if you are generating the file and can load it directly into the DB, cutting out the intermediate file.
That said, loading the DB via the database's bulk loader is often faster than loading it via DBI one record at a time.
| [reply] |
|
This is the ideal application for a hash tied to a file. You might like to take a look at DBM::Deep. This is a well-tested and well-liked implementation of a disk-based hash.
Just use a script to generate your hash once (that will take a while); after that, any search will be nearly as fast as a single disk access. Store multiple values either concatenated as a string or, better, use an array. Since DBM::Deep is multilevel, storing a hash of arrays is no problem.
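A minimal sketch of that idea, using a tiny stand-in file (the file names and data format are examples, not the OP's actual data):

```perl
use strict;
use warnings;
use DBM::Deep;

# Tiny stand-in for the real 5GB file, so the sketch is self-contained.
open my $out, '>', 'sample.txt' or die "Cannot write sample.txt: $!";
print $out "text1 10\ntext2 2\ntext2 3\n";
close $out;

# Build the disk-based hash once (slow for 5GB, but done only once).
my $db = DBM::Deep->new('lookup.db');
open my $in, '<', 'sample.txt' or die "Cannot read sample.txt: $!";
while (<$in>) {
    my ( $key, $num ) = split;        # assumes "key number" lines
    push @{ $db->{$key} }, $num;      # DBM::Deep autovivifies the array
}
close $in;

# Afterwards every lookup is a few disk seeks, not a full scan.
my $values = $db->{text2};            # array ref, or undef if absent
print "@$values\n";                   # prints "2 3"
```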
| [reply] |
Sounds like a perfect match for dictd, which is very fast, and has a Perl client on CPAN: Net::Dict
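A rough, untested sketch of what the client side might look like; the host name here is a placeholder for wherever your dictd server runs, and you would first need to load your data into a dictd database:

```perl
use strict;
use warnings;
use Net::Dict;

# Placeholder host: point this at your own dictd server.
my $dict = Net::Dict->new('localhost');

# define() returns a reference to an array of [database, definition] pairs.
my $entries = $dict->define('text2');
for my $entry (@$entries) {
    my ( $db, $definition ) = @$entry;
    print "$db: $definition\n";
}
```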
| [reply] |
Re: search a large text file
by fisher (Priest) on Feb 08, 2011 at 11:37 UTC
|
1. What did you do exactly? Can we see your code?
2. Is your file static, or does it change constantly on the fly? And if it changes, is it append-only or modified at random positions?
| [reply] |
Re: search a large text file
by Your Mother (Archbishop) on Feb 08, 2011 at 17:13 UTC
|
I'm boosting KinoSearch because I think it's undervalued, or perhaps just not well enough known.
That code gives these results (on a fairly modest *nix box):
I'm going to search as long as you give me input...
text13
text13 --> 31
text13 --> 22
text13 --> 69
text13 --> 81
text13 --> 96
Found 10044 matches looking for "text13"
Search took 0.002 seconds
text99
text99 --> 66
text99 --> 76
text99 --> 11
text99 --> 59
text99 --> 26
Found 9964 matches looking for "text99"
Search took 0.002 seconds
text100
Found 0 matches looking for "text100"
Search took 0.000 seconds
It scales extremely well. It may not seem like a conceptual or obvious match for your problem space, but it has the goods and might be exactly what you need.
| [reply] [d/l] [select] |
|
I suspect that KinoSearch would work about as well as a database like SQLite or PostgreSQL for this. It's actually a decent conceptual match -- inverted indexers like KinoSearch, Lucene, Xapian, etc. are optimized for many reads and fewer inserts, as opposed to the typical B-tree indexes on databases which handle inserts a little better. The only thing that's odd is that the original poster doesn't seem to need the relevance-based ranking that inverted indexes do well.
Regardless, the problem is straightforward and there are lots of good options for solving it.
| [reply] |
|
PostgreSQL does indeed have btree indexes, but also inverted indexes (GIN) and the excellent GiST index type. (It seems to me the btree type does well enough in this case; see my example elsewhere in this thread, where a search in a table of more than 223 million rows takes a tenth of a millisecond.)
PostgreSQL index-type docs here.
I'm just reacting to the juxtaposition of SQLite and Postgres; really, SQLite, handy as it often is, cannot be compared with a powerful database system like PostgreSQL.
(And I should really try to compare Your Mother's KinoSearch example and see if he is right; maybe on the weekend...)
| [reply] |
Re: search a large text file
by pajout (Curate) on Feb 08, 2011 at 13:43 UTC
|
You can preprocess it using, for instance, the DB_File module. See the DB_HASH section of
http://perldoc.perl.org/DB_File.html#A-Simple-Example
You can have a huge hash which is stored in a file but behaves almost transparently in Perl code, like a normal hash.
Or you can use one table in some SQL database, having just two columns, 'key' and 'val', with an index on 'key'.
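A minimal DB_HASH sketch along those lines (the file name is an example; since DB_File values are flat strings, multiple numbers per key are stored joined by a separator):

```perl
use strict;
use warnings;
use DB_File;
use Fcntl;

# Tie a hash to an on-disk Berkeley DB file (example file name).
my %h;
tie %h, 'DB_File', 'lookup.dbm', O_RDWR | O_CREAT, 0666, $DB_HASH
    or die "Cannot tie lookup.dbm: $!";

# Values are plain strings, so join multiple numbers with a separator
# when storing and split them again on retrieval.
$h{text2} = '2';        # first value for this key
$h{text2} .= ',3';      # append a second value

my @numbers = split /,/, $h{text2};
print "@numbers\n";     # prints "2 3"

untie %h;
```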
| [reply] |
Re: search a large text file
by tilly (Archbishop) on Feb 08, 2011 at 15:36 UTC
|
| [reply] |
Re: search a large text file
by sundialsvc4 (Abbot) on Feb 08, 2011 at 15:08 UTC
|
“5 gigabytes” is not what one would consider a particularly “large” file anymore. Simply create a database ... SQLite would be an excellent choice since it requires no server and runs everywhere ... and load the information into it using any one of several existing SQLite tools.
I suspect that you may have persuaded yourself that “I must write a (Perl) program to do this,” when in fact you probably don’t have to write anything at all. Load your reference data in one table, and your search into another, and do an INNER JOIN. Mission accomplished.
“To a hammer, everything is a nail.”
| |
|
I don't need to just load it into memory, I need to extract the info from the file and use it! I'm working in Perl: I want to search for a specific string and extract the numbers that are assigned to that string in my so-called 'big file'. Since my code is in Perl, I just want to write a function that does it for me!
| [reply] |
Re: search a large text file
by JavaFan (Canon) on Feb 08, 2011 at 13:53 UTC
|
CREATE TABLE foo (
    column1 varchar(...),
    column2 integer,
    KEY(column1)
);
SELECT column1, column2 FROM foo WHERE column1 IN ('text1', 'text2', 'text3');
In between of course, you need to populate the table. Once. | [reply] [d/l] |