Re: Re: It works, it's fast, and it's scalable!

by joealba (Hermit)
on Jan 25, 2002 at 01:12 UTC ( [id://141336] )


in reply to Re: It works, it's fast, and it's scalable!
in thread DBI - Oracle 8 - Load many rows into a temp table w/o using SQL*Loader

Sorry if I was unclear. I was trying to keep the question relatively short when I first posted it. But, I want to make sure there's at least one clear description, because people could use this solution to save LOTS of money. Who needs Oracle's $500,000 text search? :)

Here's another attempt at explaining the problem and solution:

I have 6000 classified advertisements - all in plain text flat files, with a few HTML comments to help pull the first date of publication. These files are updated once a day at 4:00 AM EST.

Just like you mention, I have an Oracle table which stores the relevant data for sorting. This table gets updated at 4:00 AM from the flat files.

The Oracle table looks a little like this:
    TABLE SEARCH_CLASSIFIEDS
        FILENAME      VARCHAR(40)
        FIRST50CHARS  VARCHAR(50)
        PUB_DATE      DATE
Indexes are created on FIRST50CHARS and PUB_DATE for optimal sorting.
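For reference, those indexes amount to something like this (the index names here are just illustrative, not my actual ones):

    CREATE INDEX SC_FIRST50CHARS_IDX ON SEARCH_CLASSIFIEDS (FIRST50CHARS);
    CREATE INDEX SC_PUB_DATE_IDX     ON SEARCH_CLASSIFIEDS (PUB_DATE);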

My CGI script conducts keyword searches using swish-e to return the list of filenames matching the user's input. So that gives me a list of files, but nothing relevant to sort them by. That's where my Oracle table comes in: I toss Oracle that list of files matching the keyword search and ask it to return the list, sorted appropriately.
    SELECT FILENAME
      FROM SEARCH_CLASSIFIEDS
     WHERE FILENAME IN ( filename_list )
     ORDER BY FIRST50CHARS
Filename_list would normally be something like:

'0100/0102/203434523.html','0100/0103/303144563.html',...
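To make that concrete, here is roughly how the query gets built from the swish-e results in the CGI script. The connect string and variable names below are illustrative, not copied from my actual code:

    use DBI;

    # @files holds the filenames returned by the swish-e keyword search
    my $dbh = DBI->connect('dbi:Oracle:MYSID', 'username', 'password',
                           { RaiseError => 1 });

    # quote each filename and join them into the IN ( ... ) list
    my $filename_list = join ',', map { $dbh->quote($_) } @files;

    my $sorted = $dbh->selectcol_arrayref(
        "SELECT FILENAME
           FROM SEARCH_CLASSIFIEDS
          WHERE FILENAME IN ( $filename_list )
          ORDER BY FIRST50CHARS"
    );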

Oracle limits the size of filename_list to 1000 elements. So for searches that return more than 1000 files, in order to get my full list of filenames into the query above, I have to do this:
    CREATE TABLE CLS_TMP_$$ ( FILENAME VARCHAR(40) ) NOLOGGING
Then the query above becomes:
    SELECT FILENAME
      FROM SEARCH_CLASSIFIEDS
     WHERE FILENAME IN ( SELECT FILENAME FROM CLS_TMP_$$ )
     ORDER BY FIRST50CHARS
That gets past the 1000-element limit, since the limit applies to literal IN lists but not to subqueries.

My main objective in this thread was to find a way to populate CLS_TMP_$$ very quickly. I didn't want to do something silly like this:
    foreach (@files) { $dbh->do("INSERT INTO CLS_TMP_$$ VALUES ('$_')") }
because each query would then do a COMMIT, making it VERY slow. So, I now use SQL*Loader to populate this temporary table. SQL*Loader is a command line program which reads in a text file and populates a table with the data from that file all in one shot.
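Roughly speaking, the load looks like this. The file names below are placeholders, and the real table name ends in the CGI's process ID (Perl's $$) rather than a fixed number:

    -- load_cls.ctl: filenames.dat has one filename per line
    LOAD DATA
    INFILE 'filenames.dat'
    APPEND
    INTO TABLE CLS_TMP_12345
    FIELDS TERMINATED BY ','
    (FILENAME)

and the CGI script shells out with something like:

    sqlldr userid=username/password control=load_cls.ctl log=load_cls.log direct=true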

Since my CGI script was already connected to Oracle, I was hoping that there was some hook into the Oracle DBD which would let me do this database load quickly over that existing connection. But calling the external program works well enough, and it scales very well.
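(As an aside, newer versions of DBI document an execute_array method that binds a whole array of values to a single INSERT over the existing connection. I haven't tested whether it approaches SQL*Loader speeds, so treat this as a sketch; the connect string is a placeholder:)

    use DBI;

    my $dbh = DBI->connect('dbi:Oracle:MYSID', 'username', 'password',
                           { RaiseError => 1, AutoCommit => 0 });
    my $sth = $dbh->prepare("INSERT INTO CLS_TMP_$$ (FILENAME) VALUES (?)");

    # bind all of @files to the single placeholder in one call
    $sth->execute_array({ ArrayTupleStatus => \my @status }, \@files);
    $dbh->commit;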

Replies are listed 'Best First'.
Re:x3 It works, it's fast, and it's scalable!
by grinder (Bishop) on Jan 25, 2002 at 14:58 UTC
    My main objective in this thread was to find a way to populate CLS_TMP_$$ very quickly. I didn't want to do something silly like this:
    foreach (@files) { $dbh->do("INSERT INTO CLS_TMP_$$ VALUES ('$_')") }
    because each query would then do a COMMIT, making it VERY slow.

    Are we still talking about DBI here? If so, why not just create a db handle with {AutoCommit => 0} so that it doesn't perform a commit on each insertion. Insert the x thousand records, and do a single commit at the end. You might also want to drop the indexes before the insert, and recreate them after all records have been inserted.
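    Something along these lines (connect string and variable names are only illustrative):

        use DBI;

        my $dbh = DBI->connect('dbi:Oracle:MYSID', 'username', 'password',
                               { AutoCommit => 0, RaiseError => 1 });
        my $sth = $dbh->prepare("INSERT INTO CLS_TMP_$$ (FILENAME) VALUES (?)");
        $sth->execute($_) for @files;   # one prepared statement, many executes
        $dbh->commit;                   # a single commit at the very end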

    Of course, I'm sure you know all of this already.

    --
    g r i n d e r
    print@_{sort keys %_},$/if%_=split//,'= & *a?b:e\f/h^h!j+n,o@o;r$s-t%t#u';
      I thought about that after I posted my response... But I still think that the overhead for all those individual INSERTs (even without the commits) would be higher than a nice, fast bulk load.

      I'll benchmark it and post the results here -- after my coffee break. :)
        I am *really* curious - how did the benchmarks turn out? I am solving almost exactly the same problem. Thank you for asking it. Isn't PM great?

        pmas
        To make errors is human. But to make a million errors per second, you need a computer.
