kbeen has asked for the wisdom of the Perl Monks concerning the following question:

Hello.
(I mean that with all my heart)

I have a multi-part situation. The project I am working on now has users coming to my client's web-site, connecting to a partner company's database for a search function, and displaying the results on my client's web-site. We have permission to do this, but only via HTTP. I cannot connect directly to their database; I have to emulate their search form, have the users send the values to our server, where the script then connects to the partner company's website as a normal web client would, gets the search result, parses the HTML, and sends it back to the original users in my client's look and feel (also in a different language, which is the whole purpose of this).

The problem I am facing is speed. As you can see, the server in the middle doesn't help speed the process up, but there is one more aspect... Among the results from the partner's web-site, there are those that my client can legally display, and those that they cannot. The only way to know if they can display one is to follow the resulting link, and then parse THAT page for a certain phrase. I have decided to create a database (just a text file) that records the ID of every photo that is legal, so future queries can use this rather than connecting to the partner's server again. So, in effect, the process looks like this.

1) Original user comes to client's website.
2) Original user enters a search phrase
3) query sent to our server
4) Script takes needed info from Original query
5) Script sends http query to partner's web site
6) Partner's site sends results back to our server as HTML
7) Script parses results for needed result ids
8) Script must determine individually if each result id is legally acceptable
so...
9) Script checks each result against database file of OK/NOT OK entries
10) If the result is not in the file... reconnect to partner server to get the next page
11) Parse the resulting page looking for phrase that tells me if I can use it or not
12) Record results (OK / NOT OK) in the database file for future.
13) Add OK results to OK list
14) Finally I am ready to display the OK results to the original user (40+ seconds later in some cases!!!)
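
For reference, steps 5 through 7 boil down to something like this (simplified, not my actual code; the URL, form field and id pattern are placeholders):

use strict;
use LWP::UserAgent;
use HTTP::Request::Common qw(POST);

my $search_phrase = 'whatever the user typed';   # really comes from the user's form

# Step 5: send the search to the partner site as a normal web client would
my $ua  = LWP::UserAgent->new;
my $res = $ua->request(POST 'http://partner.example.com/search',
                       [ keyword => $search_phrase ]);

# Steps 6 and 7: pull the result ids out of the returned HTML
my @thumb_id_list;
if ($res->is_success) {
    @thumb_id_list = $res->content =~ /id=(\d{14})/g;   # placeholder pattern
}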

I think that the database file will work nicely if people often search with the same keywords (in that case there is no problem). I will also stock the database file as much as I can with common queries before going live, but I really want to make the parsing of this file as efficient as possible, as it could grow to 100,000 or more entries, and until it reaches that stage, any results NOT in the file must be looked up via HTTP, and this takes valuable time. Once it reaches that stage, I am afraid that my parsing method will be slow... Anyplace I can cut out even a second will be tremendously helpful.

Currently, the dbfile ($image_list) has entries like the following:

Y12398983981293\n
Y23981098310983\n
N98230498209480\n
Y23487289374987\n

Where Y means (Y)es, ok to use, and N is (N)o. The number is then the result id from the partner site.

Then this snippet:
open (IMAGE_LIST, "$image_list");
while (<IMAGE_LIST>) {
    if ($_ =~ /^Y/) { $approve = "$approve$_"; }
    if ($_ =~ /^N/) { $condem  = "$condem$_";  }
}
close (IMAGE_LIST);

open (IMAGE_LIST, ">>$image_list");
my $full_image;
for ($a = 0; $a < $thumb_count; $a++) {
    if ($approve =~ /Y$thumb_id_list[$a]/) {
        print "already had $thumb_id_list[$a] on record!!!<BR>";
        push (@display_thumbs, $thumb_id_list[$a]);
    }
    if (($approve !~ /Y$thumb_id_list[$a]/) && ($condem !~ /N$thumb_id_list[$a]/)) {
        # This connects to the partner website to check.
        my $confirmed_thumb = &check_image($session, $thumb_id_list[$a]);
        if ($confirmed_thumb) {
            print IMAGE_LIST "Y$confirmed_thumb\n";
            push (@display_thumbs, $thumb_id_list[$a]);
        }
        if (!$confirmed_thumb) {
            print IMAGE_LIST "N$thumb_id_list[$a]\n";
        }
    }
}
close (IMAGE_LIST);

Is there a better way to do it... like losing the line breaks, or a better matching expression? The file will grow quite large, and saving a second or two will really help.

One function I would like to add: while the top page is being displayed, the script would gather further results via HTTP in the background, so that when the user hits the next-page button, the results are already prepared. What is the best / safest way to make a process run in the background, where the original process doesn't wait for the background process to finish? Even if the user never hits the "next" button, I will have stocked valuable info in the database file, so I would like to do this.

In summary, my questions are...
1) If I want to save a second or two on the parsing of the database file, what would be the best format to write this file, and parse it?
2) Any advice or clues about the best way to start the background process while the original CGI process is able to finish independently?

Any suggestions regarding any aspect of this would be appreciated. If I am way offbase with my method above... please let me know... I still have time to make a total turnaround.

Kbeen.

Replies are listed 'Best First'.
Re: I need speed...
by dirthurts (Hermit) on Oct 07, 2001 at 09:33 UTC
    If you are stuck with using flat text files, you could probably split the files up some way, like hashing off the third and fourth digits in the number. Then you could store the result for item Y245624222348 in a file called 56 and search through that file. An even faster way might be to create individual files for each result, and store them in directories that are hashed the same way... so the above file would be called Y245624222348 in the /56/ directory. Your operating system is probably quicker looking for files in its file system than searching for a line in a text file. Of course, you know how many files you might expect to have... these are just suggestions.
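
    A rough, untested sketch of the file-per-result idea (the cache directory and helper names are made up):

    use strict;

    my $cache_dir = '/path/to/cache';      # made-up location

    # has this id already been recorded with this flag ('Y' or 'N')?
    # e.g. if (lookup('Y', $id)) { ... } elsif (lookup('N', $id)) { ... } else { check via HTTP }
    sub lookup {
        my ($flag, $id) = @_;
        my $bucket = substr($id, 2, 2);    # third and fourth digits
        return -e "$cache_dir/$bucket/$flag$id";
    }

    # remember the answer by touching an empty marker file
    sub record {
        my ($flag, $id) = @_;
        my $bucket = substr($id, 2, 2);
        mkdir "$cache_dir/$bucket", 0755 unless -d "$cache_dir/$bucket";
        open(MARK, ">$cache_dir/$bucket/$flag$id") or return;
        close(MARK);
    }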

    Jay

      Actually, I'm not "stuck" with a flat text file, but I can't find a really good way to split things up... for instance, if the user searches for "boy", making a file called "boy.results" with that info would not work, because some of the results from a "boy" search also come up in a "girl" search, and a "child" search. But not all of them. If I did that I would have to open and search all the files anyway, defeating the whole purpose of splitting them up.

      If, as the results came in, I put them in separate files based on the first digit, like a "1" file, a "2" file, a "3" file, and then opened each file as it is needed in future searches, would this be faster do you think?
      Is it faster to open one file with all entries and search it, or to open various files with fewer entries and search each one individually as needed?

      Would it be faster to put them in MySQL and then, for each result, do a search like "SELECT * FROM okphotos WHERE id LIKE '$id';"? I am trying all these as we speak, but your suggestions will save me from trying something in vain... Thanks.

      kbeen.
        MySQL could/would cache that data in memory if the table is used often; however, you lose speed if the MySQL server is on a different machine than the script. If you have a select for each ID you gain speed again, though, especially as the cache becomes larger. I think a SQL server is a good way to go.

        Don't use LIKE though, use =, although LIKE without % and _ might be optimized that way by the server... dunno.

        Use numeric fields if possible; if not, then create an index on a substring of the id. For example,

        id char(10),            # Update: don't use varchar ...
        index index_id (id(3))
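
        From the script, a lookup could then go something like this (untested; the okphotos table name is from your post above, while the approved column and connection details are just examples):

        use strict;
        use DBI;

        # connection details are placeholders
        my $dbh = DBI->connect('DBI:mysql:database=photocache;host=localhost',
                               'user', 'password', { RaiseError => 1 });

        # prepare once, execute per id; note '=' rather than LIKE
        my $sth = $dbh->prepare('SELECT approved FROM okphotos WHERE id = ?');

        sub lookup_id {
            my ($id) = @_;
            $sth->execute($id);
            my ($approved) = $sth->fetchrow_array;   # 'Y', 'N', or undef if unknown
            $sth->finish;
            return $approved;
        }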


        Tiago
Re: I need speed...
by blakem (Monsignor) on Oct 07, 2001 at 11:25 UTC
    If you're asking how to make a fast lookup table, you should explore the various DBM systems, since that's essentially what they are. You create a tied hash, which behaves like a normal hash, but actually saves the data in a binary file designed for fast lookups. See DB_File for more information.

    Warning... untested code.

    #!/usr/bin/perl -wT
    use strict;
    use Fcntl;        # for O_RDWR and O_CREAT
    use DB_File;

    my $dbmfile = '/tmp/lookuptable';
    tie my %lookuptable, "DB_File", $dbmfile, O_RDWR|O_CREAT, 0640, $DB_HASH
        or die "Cannot open file '$dbmfile': $!\n";

    # do whatever it is to get the ids...
    my $id = '123456789';

    # check lookup cache (empty string if we've never seen this id):
    my $cachevalue = $lookuptable{$id} || '';

    if ($cachevalue eq 'Y') {
        # link is ok
    } elsif ($cachevalue eq 'N') {
        # link is bad
    } else {
        # determine if it is good or bad.
        my $isok = 'Y';

        # add the calculated value to the cache
        $lookuptable{$id} = $isok;
    }

    -Blake

      A good suggestion; but I'd be careful about which DBM module actually gets used, because most of them have pretty limiting restrictions. SDBM, for example, the only DBM module that ships with Perl/Win32, can only store very small key/value pairs, making it next to useless for anything beyond persistent configuration information or such.
        Good point. I've always found BerkeleyDB to be more than sufficient for anything like this. I'd recommend it over most of the other dbmish implementations.

        -Blake

Re: I need speed...
by tstock (Curate) on Oct 07, 2001 at 10:46 UTC
    I can't answer questions 1 or 2, but you can gain some speed by replacing the regular expressions in your script with substr.

    For example, /^N/ becomes substr($a,0,1) eq 'N', and you can extract a thumb_id with $id = substr($a,1).

    You would probably do better performance-wise to store the IDs in a hash with the ID as the key and Y or N as the value, or to have one array for approved and one for not approved, then grep to see whether IDs are approved or not.

    Your &check_image function can also be called from a &check_images wrapper that accepts a list of images and forks several &check_image instances. That will really speed things up, since the checks have to go over the web.
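
    For example, an untested sketch of such a wrapper, passing each answer back through the child's exit status (it assumes your existing &check_image and $session):

    use strict;

    sub check_images {
        my ($session, @ids) = @_;
        my %pid_for;

        for my $id (@ids) {
            my $pid = fork;
            die "fork failed: $!" unless defined $pid;
            if ($pid == 0) {
                # child: do the slow HTTP check, report the answer via exit status
                exit(&check_image($session, $id) ? 0 : 1);
            }
            $pid_for{$pid} = $id;
        }

        # parent: reap the children and collect their answers
        my %approved;
        while (%pid_for) {
            my $pid = wait;
            last if $pid == -1;
            $approved{ delete $pid_for{$pid} } = ($? >> 8) == 0 ? 1 : 0;
        }
        return %approved;    # id => 1 (ok) or 0 (not ok)
    }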

    Touching a bit on question 1 after all... if you store that hash, or two Perl arrays (Y and N), in a file using the FreezeThaw module, you don't need to parse the data for approved/not-approved every time you read it in. Just an idea.
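
    Something like this, for instance (untested; the file name is arbitrary):

    use strict;
    use FreezeThaw qw(freeze thaw);

    my $cache_file = 'id_cache.frozen';    # arbitrary name

    # save: serialize the hash to a string and write it out
    sub save_cache {
        my ($cache_ref) = @_;
        open(CACHE, ">$cache_file") or die "can't write $cache_file: $!";
        print CACHE freeze($cache_ref);
        close(CACHE);
    }

    # load: slurp the string back in and thaw it
    sub load_cache {
        open(CACHE, $cache_file) or return {};    # no cache yet
        my $frozen = do { local $/; <CACHE> };
        close(CACHE);
        my ($cache_ref) = thaw($frozen);
        return $cache_ref;
    }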

    Hope something up there was helpful...

    Tiago

    Update: as you mentioned, MySQL will probably be faster once a certain number of IDs is cached; use numeric fields and an enum for Y/N. Variable-length column types (varchar, blob, text) will not perform as well. LIKE is also evil in that sense.
Re: I need speed...
by pjf (Curate) on Oct 07, 2001 at 13:40 UTC
    G'day Kbeen,

    Everyone has given good advice above me, but I thought I'd add my personal experience as well. I've also worked on projects where speed was critical, and for the most speed-critical parts we used custom persistent caches to keep things speedy.

    The implementation of such a cache is fairly straightforward: pretty much a hash with a daemon running around it. During quiet moments the cache is cleaned, with old nodes removed and/or flushed to disk/database. Connection is usually via a Unix domain socket.

    If you have your front-end scripts populating the cache on misses, then you can have a very simple and very fast single-threaded cache. It can be used with any of the other strategies (databases, disk files, DBM files, etc) mentioned above. It can pre-populate itself with information if desired.
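
    The front-end side of such a cache can be tiny, something like this untested sketch (the socket path and the one-line protocol are made up):

    use strict;
    use Socket;               # for SOCK_STREAM
    use IO::Socket::UNIX;

    sub cache_lookup {
        my ($id) = @_;
        my $sock = IO::Socket::UNIX->new(
            Type => SOCK_STREAM,
            Peer => '/tmp/photo-cache.sock',     # made-up path
        ) or return undef;                       # cache down? treat it as a miss

        print $sock "GET $id\n";                 # made-up one-line protocol
        my $answer = <$sock>;                    # expect "Y", "N" or "MISS"
        close $sock;
        return undef unless defined $answer;
        chomp $answer;
        return $answer eq 'MISS' ? undef : $answer;
    }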

    I'd suggest you think twice before having your cache make HTTP connections to the remote server. While this seems very elegant from a design standpoint (ask cache, get back answer, don't care about hit or miss), it means you run the risk of your cache being blocked on HTTP, or seriously increasing the complexity of the code. A misbehaving cache can kill your overall performance when everything depends upon it.

    Hope that your project goes well and that your customers are blinded by its speed. ;)

    Cheers,
    Paul

Re: I need speed...
by Aristotle (Chancellor) on Oct 07, 2001 at 14:38 UTC

    I will only touch on Q1 since I have no experience with what you want to do in Q2... however, it might well turn out that you don't need to go to such lengths of arm-twisting as your Q2 suggests, if the following notes work well enough for you, and I believe you'll see a huge difference.

    First of all, a few bad mistakes in your code. :-( The biggest one being that you try to emulate a hash's functionality by using a string you run a regex over, instead of just using a hash. Rely on Perl to do the work for you; hash lookup is very fast and efficient and will do the job better than any crutch you may be able to think of.

    Sidenotes:
    • as has been mentioned, use substr() instead of a regex to extract the Y/N part
    • It is un-Perlish and 99.9% of the time a bad idea to use the three-expression for(;;){}
    • If you say if($confirmed_thumb){} if(!$confirmed_thumb){};, why not use an else?
    local $_;       # should always do this unless you know why you don't want it
    my %id_cache;
    open (IMAGE_LIST, $image_list);      # you didn't need quotes here..
    while (<IMAGE_LIST>) {
        chomp;                           # strip the newline so the keys match the ids
        $id_cache{substr($_,1)} = substr($_,0,1) eq 'Y';
    }
    close (IMAGE_LIST);

    open (IMAGE_LIST, ">>$image_list");  # although you do here
    for (@thumb_id_list) {
        unless (exists $id_cache{$_}) {
            # IDs that didn't appear in the file won't exist in the hash,
            # therefore we have to connect to the partner website for them
            $id_cache{$_} = &check_image($session, $_);
            print IMAGE_LIST $id_cache{$_} ? "Y$_\n" : "N$_\n";
        }
        # $id_cache{$_} is now set,
        # regardless of whether it was in the file or not
        push (@display_thumbs, $_) if $id_cache{$_};
    }
    close (IMAGE_LIST);

    Now that's much shorter and clearer, no? It will also run a lot faster - even if from an algorithmic point of view it is still anything but satisfactory. (Note: you must take proper care of file locking, or you'll end up corrupting your image list file.) Since your IDs appear to be of fixed length, I could go into all sorts of optimizations possible with that knowledge, here, first and foremost being to store the keys in a binary file to get rid of the overhead of reading and parsing a textfile for every invocation.

    However, I won't. Because I think you're stepping down the wrong road. You will have a problem with that approach very soon because you're slurping the entire list into memory from scratch every time. That will become slow very soon, but the killer argument is that your memory usage will go through the roof - something you should avoid at all cost for medium or higher traffic CGIs. A partial solution to the speed problem would be to use Storable to put the %id_cache into a binary file; this will be at least an order of magnitude faster than anything you are likely to be able to code yourself.
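
    For example (untested; the file location is arbitrary):

    use strict;
    use Storable qw(store retrieve);

    my $cache_file = '/path/to/id_cache.stor';   # arbitrary location

    # read the whole cache back in one go (empty on the first run)
    my %id_cache = -e $cache_file ? %{ retrieve($cache_file) } : ();

    # ... add new Y/N results as before, then write the lot back out
    store(\%id_cache, $cache_file) or die "couldn't store cache: $!";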

    However, there's still the problem of memory usage. It would be better if you can check your keys from disk without reading the entire list into memory. Unfortunately that requires very clever storage in order to be fast, and coming up with a good way to do this is what the people at IBM and Oracle get paid a lot of money for - it is an extraordinarily tricky task. I would advise against the method that has been mentioned by others to abuse the file system as a database, because while that works, on just about every Unix filesystem the number of files you can create is limited; subdividing the key into a path with several small bits increases the problem a lot, while on the other hand keeping it as one single filename makes for very large directories that some filesystems are pretty slow to search.

    Really, you should use a database.

    Also, accumulate the unknown permission images rather than calling &check_image() on each, and use LWP::Parallel to check them at once. This should save your script a couple seconds of waiting for each reply in turn; just make sure you don't totally cripple the target server with a flood of requests (if memory serves, the module lets you define how many of the requests to fire off at once).
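
    An untested sketch of how that might look (the URL is a placeholder; check the LWP::Parallel::UserAgent docs for the exact knobs):

    use strict;
    use LWP::Parallel::UserAgent;
    use HTTP::Request;

    my @unknown_ids = qw(12398983981293 23981098310983);   # the ids not in the cache

    my $pua = LWP::Parallel::UserAgent->new;
    $pua->max_req(5);    # don't flood the partner: at most 5 requests at once

    for my $id (@unknown_ids) {
        $pua->register(HTTP::Request->new(GET => "http://partner.example.com/photo/$id"));
    }

    my $entries = $pua->wait(30);    # wait up to 30 seconds for all replies

    for my $key (keys %$entries) {
        my $res = $entries->{$key}->response;
        # parse $res->content for the telltale phrase, just as &check_image
        # does now, and record Y or N for that id
    }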

Re: I need speed...
by theorbtwo (Prior) on Oct 07, 2001 at 11:03 UTC

    Hm. If you've got the authority to do it, you might well want to change from running a CGI on your client's site to something more persistent (like a mod_perl handler thingy), so you don't have to parse some out-of-core version of your permission list on every search.

    Failing that (or even not failing that), you could also...

    Go the multiple-files idea one further, on the theory that your OS's FS layer is probably smarter about caching than you are: apparently all you really need to do is look up each 48-bit (or so) result ID and get an answer of Allowed, Not Allowed, or Dunno (in other words, a one-trigit result).

    Therefore, create a file like allowed-to-show-cache/12/39/89/83/98/12/93 with a length of 0 or 1 depending on whether you cannot or can show the image, then use a -f test to see if you already know, and if that works, do a -s _ to see if you can forward the result's picture or not. This would be really good with something like ReiserFS (IIRC), which is designed to be really efficient on directory-tree lookups like this by having some sort of natural tree structure. You never have to actually open the files at all. While this probably takes more disk space than the other solutions, it involves less parsing.
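
    An untested sketch of that lookup (the directory name follows the example path above; record_answer is just an illustrative helper):

    use strict;
    use File::Path;    # for mkpath

    my $cache_root = 'allowed-to-show-cache';

    # split the 14-digit id into two-digit path components
    sub cache_path {
        my ($id) = @_;
        return join('/', $cache_root, $id =~ /(\d\d)/g);
    }

    # 1 = allowed, 0 = not allowed, undef = don't know yet
    sub is_allowed {
        my ($id) = @_;
        my $path = cache_path($id);
        return undef unless -f $path;
        return -s _ ? 1 : 0;          # reuses the stat from -f
    }

    sub record_answer {
        my ($id, $allowed) = @_;
        my $path = cache_path($id);
        (my $dir = $path) =~ s{/\d\d$}{};
        mkpath($dir) unless -d $dir;
        open(MARK, ">$path") or return;
        print MARK '1' if $allowed;   # length 1 = allowed, length 0 = not
        close(MARK);
    }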

    There's probably something else none of us are seeing... but I don't know what.

    Hmm...

      Don't write a file-hashing thing like that yourself; use Cache::Cache which does it for you.
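
        For instance (untested; the namespace is arbitrary, and $session/&check_image are from the original code):

        use strict;
        use Cache::FileCache;

        my $cache = Cache::FileCache->new({ namespace => 'photo_permissions' });

        sub is_allowed {
            my ($session, $id) = @_;
            my $flag = $cache->get($id);
            unless (defined $flag) {
                # not cached yet: do the slow HTTP check, then remember the answer
                $flag = &check_image($session, $id) ? 'Y' : 'N';
                $cache->set($id, $flag);
            }
            return $flag eq 'Y';
        }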
Re: I need speed...
by kbeen (Sexton) on Oct 07, 2001 at 10:39 UTC
    A follow-up question...
      The way the script is written above, in order to confirm the legality of using each result, I am downloading HTML from the partner company and checking it for each result. This is happening one at a time. How do I avoid waiting until the previous result is checked before checking the next one, i.e. connect for each result at the same time (or almost the same time)?

    Kbeen
      Just be careful not to pound the other server with more requests than it can handle.

      I would say a dual Pentium III 1000 MHz can probably handle 50 hits per second as a top limit before active connections start to accumulate and it eventually refuses new connections. This of course varies according to several factors, but it's probably a good gauge for a server running Apache on BSD or Linux.
Re: I need speed...
by ralphie (Friar) on Oct 07, 2001 at 20:53 UTC
    we really do numbers on ourselves, don't we? i'd urge you to make sure you've done everything you can on an organizational level to try to define what you can get to vs. what you can't. there must be some sort of pattern that determines that. if you've only been dealing with the intervening people i'd suggest trying to get to the source body to see if you can find that answer.

    i know i may be stating the obvious, but sometimes i have to do that to myself. (grin)