TedYoung has asked for the wisdom of the Perl Monks concerning the following question:

Good Day Monks,

I am getting ready to write a Java Applet that will pull data from many local database servers and publish it to a central Perl/CGI content management system. This publishing process won't be done more than a couple of times a month per server, but each run will involve transmitting well over half a GB of data.

I am trying to decide The Best Way ™ to communicate the database data to the server. Here are the options I have come up with:

- Just post all of the data to a CGI in one request. I have never attempted to post this much data in a single HTTP request. Can it even be done?

- Have the applet post one record at a time. Since each record would mean a separate request to a CGI (upwards of a million requests), this would cause a lot of overhead on the server.

- Have the applet post individual records to a mod_perl script. This seems like the best HTTP-based option so far. There would be a small amount of data in each request, and the persistent state of mod_perl would avoid most of the overhead involved in the second option (but there would still be just under a million requests).

- Don't use HTTP at all: open a single socket to some server process and do all of the communication through that. This would be the most flexible and powerful option, and would probably be the most efficient. For example, as the applet is writing data to the socket, the server could be reading and processing it at the same time. But it also entails a lot more work.

I definitely won't need an RPC framework (like RMI, CORBA or SOAP). That would be overkill for the kinds of processing I will be doing.

I also have the option of submitting several records in each request. In general, I have complete control over both ends of this.
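
To make the "several records per request" idea concrete, here is a rough sketch of what the receiving CGI could look like. The 'records' field, the tab/newline record format, and the table are placeholders I made up for illustration, not anything I've settled on:

    #!/usr/bin/perl
    # Sketch only: accept one POST carrying many newline-separated records.
    use strict;
    use warnings;
    use CGI;
    use DBI;

    my $q   = CGI->new;
    my $dbh = DBI->connect('dbi:mysql:cms', 'user', 'password',
                           { RaiseError => 1, AutoCommit => 0 });
    my $sth = $dbh->prepare('INSERT INTO records (id, payload) VALUES (?, ?)');

    # One request carries a batch of records, one per line, tab-delimited.
    for my $line (split /\n/, scalar $q->param('records')) {
        my ($id, $payload) = split /\t/, $line, 2;
        $sth->execute($id, $payload);
    }

    $dbh->commit;
    print $q->header('text/plain'), "OK\n";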

So, has anyone else had to do something like this? What have you tried? How well has it worked for you?

Thanks for your time,

Ted

Ted Young

($$<<$$=>$$<=>$$<=$$>>$$) always returns 1. :-)

Replies are listed 'Best First'.
Re: Submitting large data to CGI
by izut (Chaplain) on Mar 13, 2006 at 20:14 UTC

    I don't think HTTP was made to handle this amount of data. Why not use FTP or another kind of secure file transfer method?
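
    The sending side could be as simple as stock Net::FTP. This is only a sketch; the host, credentials and file name are made up:

        use strict;
        use warnings;
        use Net::FTP;

        # Push one pre-built export file; all names here are placeholders.
        my $ftp = Net::FTP->new('cms.example.com', Passive => 1)
            or die "connect failed: $@";
        $ftp->login('publish', 'secret') or die 'login failed: ', $ftp->message;
        $ftp->binary;
        $ftp->put('export.dat.gz')       or die 'put failed: ',   $ftp->message;
        $ftp->quit;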

    Igor 'izut' Sutton
    your code, your rules.

Re: Submitting large data to CGI
by jhourcle (Prior) on Mar 13, 2006 at 20:23 UTC

    I've done similar batch work -- it's just a tuning issue, and it's going to be different for every application that does this sort of work.

    When I've done such stuff in the past, I've left a configuration variable in my code that tells it how many records to push at a time. That way, I can easily set it to push single records, crank it up to an insanely high number to push all of the records at once, or pick a number in the middle to do lots of medium-sized blocks.

    You can then make an educated guess as to what will work best, and adjust down the road should you need to tune this aspect of the process.
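
    Something along these lines is all I mean -- $BATCH_SIZE is the knob, and the URL, field name and record source here are just placeholders for the example:

        use strict;
        use warnings;
        use LWP::UserAgent;

        my $BATCH_SIZE = 500;   # tune me: 1 per request, everything at once, or in between
        my @records    = map { "record $_" } 1 .. 10_000;   # stand-in for the real rows

        my $ua = LWP::UserAgent->new;
        while (my @chunk = splice @records, 0, $BATCH_SIZE) {
            my $res = $ua->post(
                'http://cms.example.com/cgi-bin/import.cgi',   # made-up URL
                { records => join "\n", @chunk },
            );
            die 'upload failed: ' . $res->status_line unless $res->is_success;
        }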

    Oh, and as for the mod_perl comment: I'd probably avoid it for this application. If you're only going to be making a few calls against it a month, it's not worth keeping resident in memory outside the times it's actually being run. _Maybe_ if you're going to push single records it'll be worth it, but my gut tells me the costs will outweigh any benefits for the one load.

    (if the time to do the bulk processing is more important than the normal processing done by the machine, then maybe mod_perl, but if it were that important, you'd have done the 'don't use HTTP' path, without question.)

Re: Submitting large data to CGI
by samtregar (Abbot) on Mar 13, 2006 at 20:06 UTC
    Don't try to do a 500MB post. That's likely to cause memory problems, and it'll be hell to debug if your process fails halfway through. Do use mod_perl, of course! Don't write your own server - the Apache developers have done lots of good work for you already!
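
    A bare-bones mod_perl 2 handler is about all it takes to receive the chunks. The package name, chunk format and storage are placeholders, not a finished design:

        package CMS::Import;
        # Sketch of a mod_perl 2 response handler that reads a modest
        # request body (one chunk of records), never a single 500MB post.
        use strict;
        use warnings;
        use Apache2::RequestRec ();
        use Apache2::RequestIO  ();
        use Apache2::Const -compile => qw(OK);

        sub handler {
            my $r = shift;

            my ($body, $buf) = ('', '');
            while ($r->read($buf, 8192)) {
                $body .= $buf;
            }

            # ... parse $body and insert the records here; a persistent
            # DBI handle is where mod_perl really pays off ...

            $r->content_type('text/plain');
            $r->print("OK\n");
            return Apache2::Const::OK;
        }

        1;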

    -sam

Re: Submitting large data to CGI
by TedYoung (Deacon) on Mar 14, 2006 at 17:49 UTC
    Update:

    Thank you for your time. Thanks to your feedback, it seems I am thinking along the right lines.

    It turns out that processing the records on the server isn't time-critical; it can be done later. So, I am going to look into having the applet submit the records as files (compressed, probably several hundred records per file) over SFTP to the server. Then the server can process those files when it sees fit.

    This minimizes the work the server has to do when it receives the data, gives me fairly good control when handling modem issues, and keeps me from having to write yet another server.
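
    For completeness, the pickup side will probably be little more than this. The incoming directory, file naming and record handling are placeholders at this point:

        use strict;
        use warnings;
        use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

        my $incoming = '/home/cms/incoming';   # where the applet drops files via SFTP

        for my $file (glob "$incoming/*.gz") {
            my $data;
            gunzip $file => \$data
                or die "gunzip failed on $file: $GunzipError";

            for my $record (split /\n/, $data) {
                # ... insert the record into the CMS database ...
            }

            unlink $file or warn "could not remove $file: $!";
        }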

    Thanks,

    Ted Young

    ($$<<$$=>$$<=>$$<=$$>>$$) always returns 1. :-)