http://qs321.pair.com?node_id=886929

perl_lover_always has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I have a very large text file (~5 GB) and each time I need to search for a string in the first column and retrieve the second column of the matching lines. E.g.:

text1 9
text2 2
text3 1
text2 3

I want to search for 'text2' and retrieve 2 and 3, preferably in the form of an array. I tried to sort my file and use file::seek or Search::Dict, but the sorting process takes ages and I'm running out of time! Any suggestion to handle this problem would be appreciated.

Replies are listed 'Best First'.
Re: search a large text file
by chrestomanci (Priest) on Feb 08, 2011 at 11:40 UTC

    Considering the size of the file, nothing will be fast. However, unless you need to do lots of searches, there is no need to sort the file: you are only interested in the lines that match your search term.

    If you don't have many searches to do, then the approach I would take would be to filter the file using Unix tools to get the lines of interest, e.g.:

    grep 'text2' <filename>

    This will give you the lines containing the data you want in linear time. From there you can easily turn it into a Perl array.
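
    For example, a minimal sketch of driving that grep from Perl and collecting the second column; the file name 'bigfile.txt' and the search term are assumptions:

    use strict;
    use warnings;

    my $file = 'bigfile.txt';    # hypothetical path to the big file
    my $term = 'text2';

    # Let grep do the linear scan, then pull out the second column in Perl.
    open my $fh, '-|', 'grep', '--', $term, $file
        or die "Cannot run grep on '$file': $!";

    my @values;
    while (<$fh>) {
        my ( $key, $val ) = split ' ';
        push @values, $val if $key eq $term;
    }
    close $fh;

    print "@values\n";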

    If you need to do lots of searches, then you will need to construct an index of some sort, or ingest the whole thing into a database and let the database engine worry about the index.

Re: search a large text file
by davido (Cardinal) on Feb 08, 2011 at 11:45 UTC

    No need to sort, given your description of the problem.

    my @found;
    open my $fh, '<', $filename    # $filename: path to the big file
        or die "Cannot open '$filename': $!";
    while ( <$fh> ) {
        next unless m/text2/;    # Reject disqualified lines quickly.
        m/\s+(\d+)/ or die "Unexpected data format at file line $.\n";
        push @found, $1;
    }
    close $fh;

    Dave

      Since my file is big and my searches are frequent and numerous, searching over the whole file with this method every time is very slow.
Re: search a large text file
by erix (Prior) on Feb 08, 2011 at 17:27 UTC

    I put together an example in case you want to use PostgreSQL:

    The file I used is available here:

    ftp://ftp.ncbi.nih.gov/genbank/livelists

    It's similar to yours, but it has three columns.

    I unzipped it and put it into Postgres, in a table t; there are more than 223 million rows.

    $ ls -lh GbAccList.0206.2011
    -rw-rw-r-- 1 aardvark aardvark 4.6G Feb 8 17:21 GbAccList.0206.2011
    $ head -n 3 GbAccList.0206.2011
    AACY024124353,1,129566152
    AACY024124495,1,129566175
    AACY024124494,1,129566176
    $ time < GbAccList.0206.2011 psql -qc "
        create table t (c text, i1 integer, i2 integer);
        copy t from stdin csv delimiter E',';"
    real    3m47.448s
    $ time echo "
        create index t_i2_idx on t (i2);
        analyze t;" | psql -q
    real    5m50.291s

    Searches are now around a tenth of a millisecond:

    # 5 'random' searches like:
    echo "explain analyze select * from t where i2 = $gi;" | psql

    Just showing the timings of five searches:

    Index Cond: (i2 = 2017697)   Total runtime: 0.157 ms
    Index Cond: (i2 = 6895719)   Total runtime: 0.109 ms
    Index Cond: (i2 = 3193323)   Total runtime: 0.119 ms
    Index Cond: (i2 = 8319666)   Total runtime: 0.091 ms
    Index Cond: (i2 = 1573171)   Total runtime: 0.119 ms

    Of course, performance depends on the hardware used.
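
    If the lookups need to happen from inside a Perl loop, a minimal DBI/DBD::Pg sketch along these lines should do; the database name, user, and the example gi value are assumptions:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:Pg:dbname=test', 'aardvark', '',
                            { RaiseError => 1, AutoCommit => 1 } );

    # Prepare once, execute many times inside the search loop.
    my $sth = $dbh->prepare('SELECT c, i1, i2 FROM t WHERE i2 = ?');

    $sth->execute(2017697);
    while ( my ( $acc, $ver, $gi ) = $sth->fetchrow_array ) {
        print "$acc\t$ver\t$gi\n";
    }

    $dbh->disconnect;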

    (a similar problem/solution here: Re^3: sorting very large text files (slander))

      Nice one again++ :)

Re: search a large text file
by BrowserUk (Patriarch) on Feb 08, 2011 at 12:48 UTC
    I have a very large text file (~5g)

    Is this a single 5GB file used over and over? Or a new 5GB file each time?

    I need to search each time

    How many searches do you need to do? How often? What is your target time?

    I want to search for 'text2' and retrieve 2 and 3,

    How long are the texts? Are they ASCII or Unicode?

    but the sorting process takes ages

    How long is "ages"?

    The more clearly you explain your task, the more likely that someone will see a viable solution.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      The file is a single file; it is created once and does not change any more, so I need to use it as a kind of dictionary! It will not be updated.
      The searches are frequent and happen inside loops: once I access my file, I have source terms that I have to search for in the big file.
      The texts are at most two words long, in Unicode format!
      I've run it for a few days and it is still running! I even split the file into small portions (less than 200 MB) and after 24 hours they are still running.

        So, in short, you have a static 5 GB dataset that you need to search frequently.

        I think your best bet would be to use a database to index the data, and let it worry about how to create an optimised index.

        I would put the entire file contents into the database and discard the original file. If each line also contains lots of other stuff that you will not be searching on, then I would still keep it in the database, but I would put it in a different column without an index so as not to bloat the database too much.

        This really does sound like a perfect application for using a database, especially if you are generating the file yourself and can load the data directly into the DB, cutting out the middle-man file.

        That said, loading the DB via the tool's bulk loader is often faster than loading it via DBI one record at a time.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        This is the ideal application for a hash tied to a file. You might like to take a look at DBM::Deep. This is a well-tested and well-liked implementation of a disk-based hash.

        Just use a script to generate your hash once (that will take a while); after that, any search will be nearly as fast as a single disk access. Store multiple values either concatenated as a string or, better, use an array for that. Since DBM::Deep is multilevel, storing a hash of arrays is no problem, as the sketch below shows.
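
        A minimal sketch of building and then querying such a hash of arrays with DBM::Deep; the file names ('index.db', 'bigfile.txt') are assumptions:

        use strict;
        use warnings;
        use DBM::Deep;

        my $db = DBM::Deep->new('index.db');

        # One-time build: collect every value for a key into an array.
        open my $fh, '<', 'bigfile.txt' or die "Cannot open bigfile.txt: $!";
        while (<$fh>) {
            my ( $key, $val ) = split ' ';
            $db->{$key} = [] unless exists $db->{$key};
            push @{ $db->{$key} }, $val;
        }
        close $fh;

        # Later lookups: roughly one disk access per search term.
        my @values = @{ $db->{text2} || [] };
        print "@values\n";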

Re: search a large text file
by fisher (Priest) on Feb 08, 2011 at 11:37 UTC
    1. What did you do exactly? Can we see your code?
    2. Is your file static, or does it change constantly on the fly? And if it changes, is it append-only or random-access?
Re: search a large text file
by Your Mother (Archbishop) on Feb 08, 2011 at 17:13 UTC

    I'm boostering for KinoSearch because I think it's undervalued or perhaps just not known well enough.

    That code gives these results (on a fairly modest *nix box)-

    I'm going to search as long as you give me input...
    text13
    text13 --> 31
    text13 --> 22
    text13 --> 69
    text13 --> 81
    text13 --> 96
    Found 10044 matches looking for "text13"
    Search took 0.002 seconds
    text99
    text99 --> 66
    text99 --> 76
    text99 --> 11
    text99 --> 59
    text99 --> 26
    Found 9964 matches looking for "text99"
    Search took 0.002 seconds
    text100
    Found 0 matches looking for "text100"
    Search took 0.000 seconds

    It scales extremely well. It seems like it's not a conceptual or obvious match for your problem space but it has the goods and might be exactly what you need.

      I suspect that KinoSearch would work about as well as a database like SQLite or PostgreSQL for this. It's actually a decent conceptual match -- inverted indexers like KinoSearch, Lucene, Xapian, etc. are optimized for many reads and fewer inserts, as opposed to the typical B-tree indexes on databases which handle inserts a little better. The only thing that's odd is that the original poster doesn't seem to need the relevance-based ranking that inverted indexes do well.

      Regardless, the problem is straightforward and there are lots of good options for solving it.

        PostgreSQL does indeed have btree indexes, but also inverted indexes (GIN) and the excellent GiST index type. (It seems to me the btree type does well enough in this case; see my example below, where searching a table of more than 223 million rows takes a tenth of a millisecond.)

        PostgreSQL index-type docs here.

        I'm just reacting to the juxtaposition of SQLite and Postgres; really, SQLite, handy as it often is, cannot be compared with a powerful database system like PostgreSQL.

        (And I should really try and compare Your Mother's example with KinoSearch, and see if he is right; maybe at the weekend...)

Re: search a large text file
by pajout (Curate) on Feb 08, 2011 at 13:43 UTC
    You can preprocess it using, for instance, the DB_File module. See the DB_HASH section of
    http://perldoc.perl.org/DB_File.html#A-Simple-Example
    -> you can have a huge hash which is stored in a file but behaves relatively transparently in Perl code, just like a normal hash.
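
    A minimal sketch of such a DB_File-tied hash, storing multiple values per key as a comma-separated string; the file names and that joining scheme are assumptions:

    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    tie my %index, 'DB_File', 'index.dbm', O_RDWR | O_CREAT, 0666, $DB_HASH
        or die "Cannot open index.dbm: $!";

    # One-time build: append each value to the string stored under its key.
    open my $fh, '<', 'bigfile.txt' or die "Cannot open bigfile.txt: $!";
    while (<$fh>) {
        my ( $key, $val ) = split ' ';
        $index{$key} = exists $index{$key} ? "$index{$key},$val" : $val;
    }
    close $fh;

    # Later lookups: split the stored string back into an array.
    my @values = split /,/, ( $index{text2} || '' );
    print "@values\n";

    untie %index;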

    Or you can use one table in some SQL database, having just two columns, 'key' and 'val', with an index on 'key'.

Re: search a large text file
by tilly (Archbishop) on Feb 08, 2011 at 15:36 UTC

      Doesn't Hadoop require a cluster of servers and extensive software setup?

Re: search a large text file
by sundialsvc4 (Abbot) on Feb 08, 2011 at 15:08 UTC

    “5 gigabytes” is not what one would consider a particularly “large” file anymore. Simply create a database ... SQLite would be an excellent choice since it requires no server and runs everywhere ... and load the information into it using any one of several existing SQLite tools.

    I suspect that you may have persuaded yourself that “I must write a (Perl) program to do this,” when in fact you probably don’t have to write anything at all. Load your reference data into one table and your search terms into another, then do an INNER JOIN (a sketch follows below). Mission accomplished.
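
    For what it's worth, a minimal DBI/DBD::SQLite sketch of that approach; the file, table, and column names are assumptions:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=lookup.db', '', '',
                            { RaiseError => 1, AutoCommit => 1 } );

    $dbh->do('CREATE TABLE ref_data (k TEXT, v INTEGER)');
    $dbh->do('CREATE TABLE terms    (k TEXT)');
    $dbh->do('CREATE INDEX ref_data_k_idx ON ref_data (k)');

    # Load the big reference file once; a single transaction keeps it fast.
    $dbh->begin_work;
    my $ins = $dbh->prepare('INSERT INTO ref_data (k, v) VALUES (?, ?)');
    open my $fh, '<', 'bigfile.txt' or die "Cannot open bigfile.txt: $!";
    while (<$fh>) {
        my ( $key, $val ) = split ' ';
        $ins->execute( $key, $val );
    }
    close $fh;
    $dbh->do( 'INSERT INTO terms (k) VALUES (?)', undef, 'text2' );
    $dbh->commit;

    # One INNER JOIN pulls back every value for every search term.
    my $rows = $dbh->selectall_arrayref(
        'SELECT r.k, r.v FROM ref_data r INNER JOIN terms t ON r.k = t.k');
    print "$_->[0] --> $_->[1]\n" for @$rows;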

    “To a hammer, everything is a nail.”

      I don't just need to load it into memory! I need to extract the info from the file and use it. I'm working in Perl: I want to search for a specific string and extract the numbers that are assigned to that string in my so-called 'big file'. Since my code is already in Perl, I just want to write a function that does it for me!
Re: search a large text file
by JavaFan (Canon) on Feb 08, 2011 at 13:53 UTC
    CREATE TABLE foo (
        column1 varchar(...),
        column2 integer,
        KEY(column1)
    );
    SELECT column1, column2 FROM foo WHERE column1 IN ('text1', 'text2', 'text3');
    In between, of course, you need to populate the table. Once.
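
    A minimal sketch of running that query from Perl with DBI; the data source (DBD::mysql is assumed here, since KEY(...) is MySQL syntax), credentials, and search terms are assumptions:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:mysql:test', 'user', 'password',
                            { RaiseError => 1 } );

    my @terms        = ( 'text1', 'text2', 'text3' );
    my $placeholders = join ', ', ('?') x @terms;

    my $rows = $dbh->selectall_arrayref(
        "SELECT column1, column2 FROM foo WHERE column1 IN ($placeholders)",
        undef, @terms );

    print "$_->[0] --> $_->[1]\n" for @$rows;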