http://qs1969.pair.com?node_id=141063

joealba has asked for the wisdom of the Perl Monks concerning the following question:

Background: I have a CGI script which performs a search of 5000+ text files (classified ads from a newspaper) and displays the results (20 at a time). The user can choose to sort the results alphabetically or by the first date of publication.

The pub date and first 50 text characters are stored in an indexed Oracle table for quick sorting. The optimal keyword search is performed by swish-e.

So when the search runs, swish-e gives me the list of files matching the search criteria. Then I ask Oracle to return the same list of files, sorted by those indexed columns. Finally, I take the sorted list of filenames and print the appropriate 20 results by reading in each flat file.

Problem: I can't say select filename from search_classifieds where filename in (list o' filenames) order by... in cases where swish-e returns more than 1000 filenames. So, I need to create a temporary table and get the filename list using an embedded query.

I suppose I could output a file and use SQL*Loader, but I was hoping that someone hooked that functionality into DBI somewhere. Node 106999 shows a MySQL solution, but that doesn't work in Oracle 8. This thread references an Oracle 8 feature called "direct load insert", but I can't dig up any more information on how to access it through DBI.

My question: Does anyone know how I can do a quick insert of 1000+ rows to a temporary table in Oracle 8 through DBI?
  • Comment on DBI - Oracle 8 - Load many rows into a temp table w/o using SQL*Loader
  • Download Code

Replies are listed 'Best First'.
Re: DBI - Oracle 8 - Load many rows into a temp table w/o using SQL*Loader
by perrin (Chancellor) on Jan 24, 2002 at 06:17 UTC
    You may not be thrilled with this advice, but I think you're fighting a losing battle. You really need to have the sorting and the searching done by one engine or you will kill your performance moving large result sets around like this.

    If swish-e is not able to sort, maybe you should consider using a different search engine or doing the word search in Oracle. Besides the Oracle full-text search, there are also good alternatives like DBIx::FullTextSearch. Or you could keep the sorting data in a dbm file and then use a Schwartzian Transform to sort the swish-e results by it.

Re: DBI - Oracle 8 - Load many rows into a temp table w/o using SQL*Loader
by wardk (Deacon) on Jan 24, 2002 at 04:12 UTC
    well this isn't a DBI solution, but it's in "perl", and it works on my production system.... just set up the sql*load control files and run them from perl, capturing the output...
    my @aap_rc = `cd /web/$gsVersion/bess/db;$ENV{ORACLE_HOME}/bin/sqll +oad $ORACLE_ACCT/$ORACLE_PW CONTROL=/web/$gsVersion/bess/db/aap.ctl L +OG=/web/$gsVersion/bess/db/aap.log`;
      Heh. Thanks!

      I have this solution as a backup right now. But, I haven't given up on that "direct load insert" thing.

      When I find (or write) a way to make this happen, I'll post it here. :) But, I WILL make this work.
        How are you trying to do the direct load insert? It should be as easy as: insert /* +APPEND */ into tableA (name, address) values ("Joe", "123 Main St"); The APPEND hint tells Oracle to use direct load insert.
Re: DBI - Oracle 8 - Load many rows into a temp table w/o using SQL*Loader
by Ryszard (Priest) on Jan 24, 2002 at 09:11 UTC
    I Agree with perrin, the above solution wont scale.

    For a web app youre after performance ( sub 2s response for a good app), doing the temp table thing is going to take you over that (with I/O generated by the temp table creation). If you are in a high transaction environment there is the potential you may run out of memory and chew thru' disk (or become I/O bound) with the temp tables being created and destroyed.

    Are you able to ditch, or replicate the swish-e functionality?

    If you can, you could fetch the results (in one oracle parse) into an array, and set a 'last_record_seen' flag in a cookie. When the user selects the next page, you'd go get all the results again, and only display the elements in the array between 21-40.

    If you wanted, you could minimise the Oracle hits by using Storable.pm or Cache . Of course, you then have the trade off between stale information and the length of time to cache results. If new ads are appearing infrequently, it may not be such a hastle.

    Unfort, if you ditch Swish-e in favour of a like clause, it precludes the use of an index when you use %variable% which you could use in order to preform a comprehensive search. Have you considered writing a custom package you could call from a query?

    This sounds like a legacy system that is having the web bolted on. IMHO, if there are only a few (5+) thousand records, it may be worthwhile putting all the information into the database. This will scale further and be more flexible. You should only get into trouble when you start storing millions of records, the trade off then becomes flexibility v's performance. Typically for lots of content (ala yahoo, excite) there will be pointers in a database to files at the filesystem level, which is where you may be coming from.

    This is slightly off topic, but never store graphics in your database for web apps, you'll almost always get a faster result from storing graphics at the filesystem level.

    To answer your question directly, use an insert statement and use placeholders if you want to use DBI.

It works, it's fast, and it's scalable!
by joealba (Hermit) on Jan 24, 2002 at 23:56 UTC
    Thanks for the input everybody, but -- regardless of public opinion -- this solution works great! Searches that used to take 45 seconds under heavy load (to open 2000 flat files, alphabetize them, then print 20 results) now take no more than 3 seconds. And, the I/O load is MUCH happier too!

    I wound up using SQL*Loader for searches with >1000 results, which adds about 1 second of overhead. SQL*Loader supports "direct load insert", but it slows things down in my situation. Direct load insert works best when creating a table from a query to another table, and it is not specific to Oracle 8. Oracle 7 can do it too.

    Thanks again! If anyone decides to use my Alba Transform :) and wants some more specific code, let me know.

      I'm sorry, which solution works well?

      And why not just set up a cron job that opens and loads the files, with the relevant data you need for sorting and output, into a DB every couple of minutes? You can just pre-calculate whatever you need and avoid all the temp table overhead.

      Is this what SQL Loader does? I've never used it.

      -Any sufficiently advanced technology is
      indistinguishable from doubletalk.

        Sorry if I was unclear. I was trying to keep the question relatively short when I first posted it. But, I want to make sure there's at least one clear description, because people could use this solution to save LOTS of money. Who needs Oracle's $500,000 text search? :)

        Here's another attempt at explaining the problem and solution:

        I have 6000 classified advertisements - all in plain text flat files, with a few HTML comments to help pull the first date of publication. These files are updated once a day at 4:00 AM EST.

        Just like you mention, I have an Oracle table which stores the relevant data for sorting. This table gets updated at 4:00 AM with the flat files.

        The Oracle table looks a little like this:
        TABLE SEARCH_CLASSIFIEDS FILENAME VARCHAR(40) FIRST50CHARS VARCHAR(50) PUB_DATE DATE
        Indexes are created on FIRST50CHARS and PUB_DATE for optimal sorting.

        My CGI script conducts keyword searches using swish-e to return the list of filenames matching the user's input. So, that gives me a list of files, but it doesn't tell me anything relevant to sort them. That's where my Oracle table comes in. I toss Oracle that list of files matching the keyword search and ask it to return the list, sorted appropriately.
        SELECT FILENAME FROM SEARCH_CLASSIFIEDS WHERE FILENAME IN ( filename_list ) SORT BY FIRST50CHARS
        Filename_list would normally be something like:

        '0100/0102/203434523.html','0100/0103/303144563.html',...

        Oracle limits the size of filename_list to 1000 elements. So for searches that return >1000 files, in order to pass the above query my full list of filenames, I have to do this:
        CREATE TABLE CLS_TMP_$$ FILENAME VARCHAR(40) NOLOGGING
        Then the query above becomes:
        SELECT FILENAME FROM SEARCH_CLASSIFIEDS WHERE FILENAME IN ( SELECT FILENAME FROM CLS_TMP_$$ ) SORT BY FIRST50CHARS
        Since that gets past the 1000 element limit.

        My main objective to this thread was that I was looking for a way to populate CLS_TMP_$$ very quickly. I didn't want to do something silly like this:
        foreach (@files) { # INSERT INTO CLS_TMP_$$ values('$_') }
        because each query would then do a COMMIT, making it VERY slow. So, I now use SQL*Loader to populate this temporary table. SQL*Loader is a command line program which reads in a text file and populates a table with the data from that file all in one shot.

        Since my CGI script was already connected to Oracle, I was hoping that there was some hook into the Oracle DBD which would allow me to do this database load fast over that connection. But, calling the external program works well enough, and it scales very well.