Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

Re^2: Efficient search through a huge dataset

by johnnywang (Priest)
on Oct 20, 2004 at 01:03 UTC ( [id://400724]=note: print w/replies, xml ) Need Help??


in reply to •Re: Efficient search through a huge dataset
in thread Efficient search through a huge dataset

I guess you meant to store the first file as database records in a table, and for each record in the second file, do a select from the table? That will be millions of selects, can a perl/batch be faster? Thanks.
  • Comment on Re^2: Efficient search through a huge dataset

Replies are listed 'Best First'.
Re^3: Efficient search through a huge dataset
by pg (Canon) on Oct 20, 2004 at 02:11 UTC

    Millions of records? Not true at all!

    You can do simply one query, say you want to find out all records that are in table1 but not table2, then you can either:

    select blah from table1 t1 where not exists (select 1 from table2 t2 where t1.blah = t2.blah )

    Or (if munus is supported):

    select blah from table1 minus select blah from table2

    It should be easy for you to modify the query a bit to satisfy your needs.

      Faster and better would be outer joins
      SELECT t1.blah FROM t1 LEFT OUTER JOIN t2 ON (t1.blah = t2.blah) WHERE t1.blah IS NOT NULL AND t2.blah IS NULL

      It's the correct way to rewrite the NOT EXISTS (about 100x faster in my tests) and is faster than the MINUS because it doesn't have to pull both result sets into memory, then do an expensive ARRAYCOMPARE on them.

      Being right, does not endow the right to be rude; politeness costs nothing.
      Being unknowing, is not the same as being stupid.
      Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
      Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

        SQLite doesn't do the full outer join thing, so you have to fake it with two left outer joins :

        -- in table2 but not in table1: SELECT r.id FROM table2 AS r LEFT JOIN table1 AS l ON r.id = l.id WHERE r.system = ? and l.id IS NULL and r.id IS NOT NULL -- in table1 but not in table2: SELECT l.id FROM table2 AS l LEFT JOIN table1 AS r ON l.id = r.id WHERE l.system = ? and l.id IS NOT NULL and r.id IS NULL SQL

        But other than that, SQLite is up to the task, as it includes a nice bulk import statement (SQLite 2.x) and a bulk loader program (SQLite 3.x).

        Update: Added the "full" to "full outer join", as a "left (outer) join" is an outer join too, but one which SQLite does...

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://400724]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others lurking in the Monastery: (3)
As of 2024-04-25 12:51 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found