dragonchild has asked for the wisdom of the Perl Monks concerning the following question:

I'm in the process of writing a CGI that will display data for a given thing. Now, there are 5 categories of thing and about 200 things in each category. A thing has about 10 attributes, and those attributes are always the same at any given time, but it's very likely that what those attributes are will change every 3-4 months. (Not my choice.) However, the number of things isn't going to increase very much; maybe to 300 things in 6 categories.

Now, I've got to figure out what the underlying data structure is going to be. Before people give a knee-jerk response of "Just use MySQL/Oracle/other RDBMS", please read further.

I'm not going to be maintaining this thing. In fact, it's very likely that someone with no understanding of RDBMS's, but some understanding of Perl/CGI, will be the one handed this. That person will be responsible for making changes to the attributes of the things in the future. This CGI will probably live for about 10 years (or so), but has to be as low maintenance as possible.

I'd like to avoid using an RDBMS, if possible, primarily due to two reasons:

  1. The relatively small number of data points
  2. The fact that the data is highly UN-normalized
#2 is important, because it means that whoever takes this over needs to understand RDBMS's and normalization to correctly update the DB for changes in the thing attributes.

So, what I'm finally getting at is this: is there a way I can have this data in some process just sitting out there that another CGI process can query? I don't want to have to parse a file containing all the data every time I want to get at one data point. It'd be really nice to have all that data in memory (as it is only updated once per day, by a batch process I will be writing) and not have to re-create it every time a CGI submit button is pushed.

Any thoughts?

Oh - I have been given enough time to implement the correct solution, which is a rather nice luxury. So, time is not a (major) concern.

Replies are listed 'Best First'.
Re (tilly) 1: CGI and static data...
by tilly (Archbishop) on Jul 20, 2001 at 18:36 UTC
    Given your estimated data requirements, I would suggest not trying to optimize this. The attempt to optimize is going to be needless complexity. Managing several processes is going to require maintenance work; if performance is not critical, I would avoid that work.

    But suppose that performance matters. Well then, my solution would be to write a module which takes requests for a data element and returns just that data. Inside that module, if the data is not initialized, or if it was initialized from an older file than the one now on disk, (re)load it. Otherwise return the memoized data.

    And now use mod_perl. If the page is heavily hit, on most requests the value will already be memoized, so the module merely has to check that the file hasn't been updated and then return the current value. Sure, the data isn't shared between processes, but you get most of the way to ideal performance very easily. And if the memory requirements do get out of hand, you can revisit the design later quite easily, since the necessary code is divided between a configuration file and a simple module.
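
    A minimal sketch of such a module, assuming the nightly batch job writes a single Data::Dumper-style file whose last expression is the data hashref (the package name, path, and file format are all illustrative):

        package ThingData;
        use strict;

        # Assumed location of the nightly dump; the file's last
        # expression must be the data hashref (Data::Dumper output).
        my $data_file = '/path/to/things.dat';
        my ($data, $loaded_at);

        sub get {
            my ($key) = @_;
            my $mtime = (stat $data_file)[9]
                or die "can't stat $data_file: $!";
            # (Re)load only on first use, or when the file has changed.
            if (!$data or $mtime > $loaded_at) {
                $data      = do $data_file
                    or die "can't load $data_file";
                $loaded_at = $mtime;
            }
            return $data->{$key};
        }

        1;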

(ichimunki) Re: CGI and static data...
by ichimunki (Priest) on Jul 20, 2001 at 18:02 UTC
    While I don't think it is worthwhile to do this in memory (unless there are just HUGE amounts of data, and I mean HUGE; how that would be useful in a single HTML page I don't know, since it would have to be a hundred pages long before I considered it huge), you should read the manual page for Perl interprocess communication (perlipc) and anything you can find about mod_perl. You are trying to speed up data access, but server instantiation is going to be the bottleneck, if I understand your description correctly. And now we have completely baffled the non-gurus who will be maintaining this thing after you.

    I would set this up as a series of plain old .txt files or use Data::Dumper if you want to keep it simple and easily maintained.
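
    For instance, a minimal sketch of the Data::Dumper round trip (the file name and sample data are made up):

        use strict;
        use Data::Dumper;

        # Dump the structure to a plain text file:
        my %things = ( sword => { weight => 3, cost => 100 } );
        open my $fh, '>', 'things.dat' or die "write things.dat: $!";
        print {$fh} Dumper(\%things);
        close $fh;

        # Later, read it back in.  'do' returns the value of the last
        # expression in the file, which for Dumper() output is the
        # restored hashref.
        my $things = do './things.dat' or die "load things.dat";
        print $things->{sword}{cost}, "\n";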
Re: CGI and static data...
by spudzeppelin (Pilgrim) on Jul 20, 2001 at 20:05 UTC

    How about making sure you create a unique key for each point, then serializing the data points (Freeze() is one way, Conway's OOP book lists several), and shoving them in a tied DBM structure?

    The caveat here is making sure you know WHICH DBM your product will be using: "Bucket-size" limitations vary greatly, and with 1000 objects, if you are using (for example) the "standard" DBM (nDBM, IIRC) on Solaris 2.6, you're going to either have 65-byte objects or run out of space; other DBM implementations have much larger bucket-size limitations, and at least one is much smaller.

    The big advantages of DBM are that it's easy to implement/maintain (there's even the explicit-but-deprecated dbmopen() command, if tie() is too heady for your future maintainers), data manipulation is easy (just regular hashes), and it's fairly quick (YMMV, again depending on which version of DBM you're using).
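
    A sketch of the tied-DBM idea, using the SDBM_File that ships with Perl and Storable for the serialization (the file name, key, and attributes are made up; remember the bucket-size caveat above when picking your DBM):

        use strict;
        use Fcntl;
        use SDBM_File;
        use Storable qw(freeze thaw);

        # Tie a hash to an on-disk DBM file.
        tie my %db, 'SDBM_File', 'things', O_RDWR|O_CREAT, 0644
            or die "can't tie things: $!";

        # Store each thing, serialized, under its unique key...
        $db{'widget-042'} = freeze({ weight => 3, cost => 100 });

        # ...and later pull just that one thing back out.
        my $thing = thaw($db{'widget-042'});
        print $thing->{cost}, "\n";

        untie %db;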

    The big drawback (apart from the bucket size problem described above) is that there is no internal locking mechanism in the DBM specification; if you need locks (will multiple scripts be rewriting the same object simultaneously?), you have to implement them yourself.
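
    A common way to hand-roll that locking is to flock() a sentinel file around each update (the lock file name is illustrative; readers would take LOCK_SH instead):

        use Fcntl qw(:flock);

        open my $lock, '>', 'things.lock' or die "open lock: $!";
        flock($lock, LOCK_EX) or die "flock: $!";  # exclusive, for a writer
        # ... tie, update, and untie the DBM here ...
        flock($lock, LOCK_UN);
        close $lock;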

    That said, there are some other games you can play to push the performance even higher. One is to use a ramdisk, or a system that shares swap space with the /tmp filesystem (many commercial versions of Unix overlay the two: if the machine is not swapping, /tmp is memory-mapped; if it is, swap data starts getting written into /tmp). Put the DBM in that memory-mapped space and reads/writes through the tied DBM never actually touch disk. It is a good idea, though, to copy the file somewhere more static every so often as a backup.

    As far as optimizing the CGI (and serializing the DBM connections), one approach might be to have a separate daemon which babysits the DBM, and communicates back and forth with the CGI instances via named pipes. This way, you wouldn't have to worry about the overhead of the CGI actually opening the DBM each time.
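
    A very rough sketch of such a daemon, answering one lookup at a time over a pair of named pipes (the pipe paths and the one-key-per-request protocol are made up, and real code would need to serialize concurrent clients):

        use strict;
        use POSIX qw(mkfifo);
        use Fcntl;
        use SDBM_File;

        my ($req, $resp) = ('/tmp/things.req', '/tmp/things.resp');
        mkfifo($req,  0666) unless -p $req;
        mkfifo($resp, 0666) unless -p $resp;

        # Open the DBM once, at startup, and hold it.
        tie my %db, 'SDBM_File', 'things', O_RDONLY, 0644
            or die "can't tie things: $!";

        while (1) {
            # Block until a CGI writes a key into the request pipe.
            open my $in, '<', $req or die "open $req: $!";
            my $key = <$in>;
            close $in;
            next unless defined $key;
            chomp $key;

            # Hand back the still-serialized value; the client thaw()s it.
            open my $out, '>', $resp or die "open $resp: $!";
            print {$out} $db{$key} if exists $db{$key};
            close $out;
        }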

    Of course, you never specified what platform you were implementing this on. All of my advice with this goes for naught if you aren't on a Unix or Linux (or OS X??) system.

    Spud Zeppelin * spud@spudzeppelin.com

Re: CGI and static data...
by Rhandom (Curate) on Jul 20, 2001 at 19:48 UTC
    I can understand your pain... Often it is nice to be able to just connect to an already-running process, give it a keyword, and get back an answer that has already been loaded. The easiest way to do this would probably be with a client/server model. In essence, that is what you have with a DBI/RDBMS solution, and that is what you get with an LWP/mod_perl solution.

    Sometimes, a solution like that may be too much overhead (maintaining your Apache, trying to maintain the database). There are a couple of modules that could do the client/server part for you. There is RPC::XML, which gives you a nice procedural interface and has the option of not using Apache at all. Another possibility is Net::Server, on top of which you can write your own interface. It can be HUPed (restarted), which allows for rereading of configuration files. And if you run into a high-load situation, you can choose Net::Server::PreFork, which manages child processes in the same way that Apache would, except that it's pure Perl.
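
    A minimal Net::Server sketch, to give the flavor (the port, the line-per-key protocol, and how %things gets loaded are all made up):

        package ThingServer;
        use strict;
        use base qw(Net::Server);  # or Net::Server::PreFork under load

        my %things;  # load your data here at startup; a HUP re-reads it

        # Net::Server ties the client socket to STDIN/STDOUT for us.
        sub process_request {
            my $self = shift;
            while (my $key = <STDIN>) {
                $key =~ s/[\r\n]+$//;
                print exists $things{$key}
                    ? "$things{$key}\r\n" : "NOT FOUND\r\n";
            }
        }

        ThingServer->run(port => 8123);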

    There are many options, it just depends on what you need.

    my @a=qw(random brilliant braindead); print $a[rand(@a)];
Re: CGI and static data...
by Xanatax (Scribe) on Jul 20, 2001 at 18:29 UTC
    Well, I haven't worked with Data::Dumper, but it looks promising...

    I was thinking you should look at XML... there are modules to do most of the work, it is very standard (or rather, is a standard), it will handle arbitrary data sets quite well, and it will handle changes to the structure as well.
    Xan.
      I was going to suggest XML also. The structure that comes to mind is an XML file that describes the categories and includes a directory name for each category. Then, in each category directory, a file for each thing, possibly with an index XML file that lists all the things. Depending on your comfort level with XML versus plain text, you might find it easier to do the same thing with plain text. You can pick different storage methods for each level (part) of the data structure, depending on what that data needs. (For example, you didn't specify how much data is associated with each category; if it's just a name, there might be no need for a description of each category, just a set of directories, like /usr/data/Category1, /usr/data/Category2, etc.)
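
      A sketch of reading one such per-category index with XML::Simple (the directory layout and element names are made up for illustration):

          use XML::Simple;

          # /usr/data/Category1/index.xml might look like:
          #   <category name="Category1">
          #     <thing id="sword"><weight>3</weight><cost>100</cost></thing>
          #     <thing id="shield"><weight>5</weight><cost>80</cost></thing>
          #   </category>
          my $cat = XMLin('/usr/data/Category1/index.xml',
                          KeyAttr => { thing => 'id' });
          print $cat->{thing}{sword}{cost}, "\n";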
Re: CGI and static data...
by bikeNomad (Priest) on Jul 20, 2001 at 22:13 UTC
    BerkeleyDB may be a good solution for you, in conjunction with Storable for serialization. Assuming that your entities don't have to point to each other (you don't say whether the objects refer to one another or not), this has several advantages (a quick sketch follows the list):
    • Can run using a remote RPC server if you wish
    • Allows for arbitrarily large objects with diverse structure to be saved and retrieved
    • Simple interface, without requiring RDBMS overhead, limitations, or complexity
    • Good performance, locking, etc.
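
    A sketch of that combination (the file name and sample record are made up):

        use strict;
        use BerkeleyDB;
        use Storable qw(freeze thaw);

        tie my %db, 'BerkeleyDB::Hash',
            -Filename => 'things.db',
            -Flags    => DB_CREATE
            or die "can't open things.db: $BerkeleyDB::Error";

        # Store and retrieve arbitrarily structured things:
        $db{'widget-042'} = freeze({ weight => 3, cost => 100 });
        my $thing = thaw($db{'widget-042'});
        print $thing->{cost}, "\n";

        untie %db;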
Re: CGI and static data...
by InfiniteSilence (Curate) on Jul 20, 2001 at 17:54 UTC
    The quick and painless answer to this question is yes. You can use the Data::Dumper module to push out data and read it back in. Check the perldoc. If that is not to your liking, just push this stuff out to a series of text files and read it back in. Write your own access methods. Why did you even submit this question to perlmonks? You seem to be answering your own question throughout the question.

    Celebrate Intellectual Diversity

      Data::Dumper, Storable and friends are not always the right answer. Yes, he knows that he can do it that way, but he would prefer not to for performance reasons.

      That is a perfectly legitimate question. Algorithmically, it is really stupid to have to parse a potentially large data file when you only want one item out of it. While my gut response is not to worry about performance, it can easily become a bottleneck, and it is good to know how to handle the situation if it does become a problem and you don't have a structure suited to an RDBMS.

      However, your rudeness is another story entirely...