Hero Zzyzzx has asked for the wisdom of the Perl Monks concerning the following question:

Dearest Monks,

I'm seeking validation and the experiences of others.

I'm creating a content management system and having difficulty with the SQL side of it: I have so many properties to take into account that the SQL statements get wickedly complicated, and I've spent the last couple of days working on them to no avail. I was trying to do everything with one statement, and it just wasn't happening.

I'm building a complicated reference-based data structure to pass to a recursive HTML::Template loop. Then I got to thinking that I'm going about it the wrong way: incurring a DBI hit, or multiple DBI hits, on every request would cap my script's performance no matter what else I did.

My thoughts are to create my data structures every five minutes or so with a cron-tabbed script, storing them with Storable, and then opening them in my display module, saving on the DBI overhead.

I'm expecting a bunch of concurrent connections. Assuming I use file locking properly, will this be OK? The Cookbook has a great description of how to get this to work, and I'll probably generalize it into a subroutine.
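
Here's a rough sketch of what I have in mind, using Storable's lock_store()/lock_retrieve(), which handle the flock() for you; the file path, connection details, and query below are just placeholders:

    use strict;
    use DBI;
    use Storable qw(lock_store lock_retrieve);

    my $cache_file = '/var/cache/cms/doc_tree.storable';

    # Cron side: rebuild the structure every five minutes.
    sub rebuild_cache {
        my $dbh  = DBI->connect('dbi:mysql:cms', 'user', 'pass',
                                { RaiseError => 1 });
        my $docs = $dbh->selectall_arrayref(
            'SELECT id, title, category FROM documents', { Slice => {} });
        lock_store($docs, $cache_file);    # exclusive flock() while writing
    }

    # Display side: read the pre-built structure instead of hitting DBI.
    sub load_cache {
        return lock_retrieve($cache_file); # shared flock() while reading
    }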

My questions:

  1. If you've done something similar, what was your experience?
  2. What should I be looking out for, in terms of traps/pitfalls?
  3. Is there a better way to store this than using Storable?
  4. Am I making this too complicated?

-Any sufficiently advanced technology is
indistinguishable from doubletalk.

Re: mySQL hits or Storable retrieves?
by perrin (Chancellor) on Oct 02, 2001 at 18:41 UTC
    Let me re-state your question: "What's a good way to efficiently load a complex but infrequently changing data structure that I want to display in my mod_perl application?"

    MySQL is fast enough for most uses like this, but if you have an unusual amount of traffic or a poorly designed schema and your data does not need to be 100% up-to-date, caching is tremendously helpful. The approach you described will work just fine. I would suggest you use Cache::Cache or MLDBM::Sync instead of rolling your own filesystem database though. The biggest problem with this approach is that your data set may become too large to efficiently regenerate all of it every 5 minutes.

    I used a load-on-demand approach with caching for the eToys.com product data. This avoids problems stemming from a large data set, but does result in a slower load time for the first user. Here's a description:

    • Use Perl objects to represent your data. Give them a load() method that takes a unique ID.
    • Inside the load method, check the local cache (I used BerkeleyDB, but you can use one of the modules I suggested above) and retrieve this object if it's there.
    • If it isn't in the cache, load it from the database. Put the loaded data in the cache.

    You can vary how much your cached data gets out of sync by adjusting the time-to-live for objects in the cache.
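
    In code, the load() method ends up looking roughly like this. I've sketched it with Cache::FileCache (from the Cache::Cache distribution mentioned above); the Product class, table, and connection details are all made up:

        package Product;
        use strict;
        use DBI;
        use Cache::FileCache;

        # One shared file cache with a five-minute time-to-live.
        my $cache = Cache::FileCache->new({
            namespace          => 'products',
            default_expires_in => 300,
        });

        sub load {
            my ($class, $id) = @_;

            # Check the local cache first.
            my $self = $cache->get($id);
            return $self if defined $self;

            # Cache miss: load from the database instead.
            my $dbh = DBI->connect('dbi:mysql:shop', 'user', 'pass',
                                   { RaiseError => 1 });
            $self = bless $dbh->selectrow_hashref(
                'SELECT * FROM products WHERE id = ?', undef, $id), $class;

            # Put the loaded data in the cache for next time.
            $cache->set($id, $self);
            return $self;
        }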

    Caching is a big subject, and there are lots of tricks you can use (lazy reloading, pre-filling the cache with hot objects, etc.).

      Wow. Thanks for the very helpful response. Part of my problem is that I think I've normalized my data too much (if that's possible): it's split across so many tables that writing the SQL properly is getting beyond me.

      Basically, I'm at the point where I need to run the entire universe of records through a set of filters, akin to:

      1. What documents are in this category/subcategory?
      2. Is the user public, staff, probono attorney, admin, etc.?
      3. What organization is this user from, and what docs are specifically excluded from this organization at this access level?

      It would be so much easier if I didn't have all this conditional crap; I'd normally do it with one SQL join, but I can't here. I need more than one statement (at least in my head), and I think caching the results is the only way I'm going to get a usable, fast suite of sites.

      -Any sufficiently advanced technology is
      indistinguishable from doubletalk.

        Creating a few views might help you to get your head around the queries...

        A database view is a sort of virtual table. You create it with a SELECT statement and then access it like a regular table.

        CREATE VIEW view_red_cross_docs AS SELECT <code here>
        CREATE VIEW view_pro_bono_docs AS SELECT <code here>

        SELECT <stuff> FROM <joined views>


        email: mandog
        Object-level security? It sounds like you want SPOPS, or even the app server built on top of it, OpenInteract. It has hooks to add caching as well.
Re: mySQL hits or Storable retrieves?
by pjf (Curate) on Oct 02, 2001 at 18:15 UTC
    It's past midnight and my mental capacity is rapidly approaching that of a pumpkin, so take this all with a sprinkle of salt.

    The answer to question 4 (are you making it too complicated?), is probably yes. Grab your HTML::Template man-page, and do a search for "file_cache". Here's one I prepared earlier:

    file_cache - if set to 1 the module will store its cache in a file using the Storable module. It uses no additional memory, and my simplistic testing shows that it yields a 50% performance advantage. Like shared_cache, it will work in a CGI environment. Default is 0.
    Now isn't that neat? HTML::Template already knows about the Storable module, and the docs note that it uses flock() to make sure everything behaves. (The regular caveat about file locking being broken on NFS applies.)
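
    Turning it on is just a couple of constructor options; a minimal sketch, with a made-up template name and cache directory (file_cache_dir is required once file_cache is on):

        use HTML::Template;

        my $template = HTML::Template->new(
            filename       => 'docs.tmpl',
            file_cache     => 1,                 # cache parsed templates...
            file_cache_dir => '/tmp/tmpl_cache', # ...in this directory
        );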

    Hope that you find the above useful.

    Cheers,
    Paul

      One other clarification: I'm already using HTML::Template's cache features to store parsed templates in memory. As far as I can tell, though, HTML::Template's caching only stores the parsed templates, not the data. That doesn't help me; I'm talking about storing the data structures.

      I wouldn't want HTML::Template to store my data in memory anyway. I'd rather manage that myself, including deciding when it needs to change.

      -Any sufficiently advanced technology is
      indistinguishable from doubletalk.

        When I put up the warning about my mental capabilities approaching those of a pumpkin, I meant it. I woke up this morning thinking exactly the same thing (you want to cache the data, not the template). Doh!

        The suggestions I would have given if I had been thinking have already been mentioned, so I won't repeat them here.

        Cheers,
        Paul

      That stuff in the HTML::Template docs is only referring to the cache used for the compiled templates. That cache is not available through a public interface in HTML::Template. However, there's no shortage of good caching modules on CPAN.
Re: mySQL hits or Storable retrieves?
by George_Sherston (Vicar) on Oct 02, 2001 at 18:13 UTC
    Just a thought - I'm working on a MySQL-based CGI which gets lots of hits, and in retrospect I slightly wish I'd worked with PostgreSQL, or at least used DBD instead of DBI, because this would have allowed me to use transactions to manage conflicts between different users at the same time. The alternative I settled on, file locking, feels a bit clunky. This thread has more if you want it.

    § George Sherston
      or at least used DBD instead of DBI

      That is meaningless.

      When you use DBI you use a DBD as well - namely the Database Driver for the database system you use.

        Sorry not to be more clear. I intended to contrast what I understand to be two separate modules, namely DBI and DBD::mysql. At the moment I use the former, which is very straightforward and easy to learn; but (so far as I can make out) unlike DBD::mysql it doesn't support MySQL transactions. If DBI does support MySQL transactions I'd be glad to hear of it! But as you can see from the thread I linked to, my interim conclusion is that it doesn't, whereas DBD::mysql does. My main point is that if I start out on a project like this in future, I'll make sure I use a DBMS / driver combo that lets me do transactions.
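
        For reference, this is roughly what transactions look like through the plain DBI API when the driver and table type support them; the connection details and statements here are invented:

            use strict;
            use DBI;

            # AutoCommit => 0 turns on transactions (where supported).
            my $dbh = DBI->connect('dbi:mysql:cms', 'user', 'pass',
                                   { RaiseError => 1, AutoCommit => 0 });

            eval {
                $dbh->do('UPDATE docs SET owner = ? WHERE id = ?', undef, 7, 42);
                $dbh->do('DELETE FROM locks WHERE doc_id = ?',     undef, 42);
                $dbh->commit;   # both statements succeed or neither does
            };
            if ($@) {
                warn "Transaction aborted: $@";
                $dbh->rollback;
            }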

        § George Sherston
Re: mySQL hits or Storable retrieves?
by toma (Vicar) on Oct 02, 2001 at 19:08 UTC
    We used the approach you describe and it worked fine. In our case, we had a fully normalized schema which required a six-table join to get to useful data. The queries weren't terribly slow, but we were expecting lots of them.

    If you want to be buzzword-compliant, call your system a 'data warehouse.' Commercial database companies will sell you this same type of solution starting at six-figure USD prices. I sat through a long sales pitch, and as far as I could tell, your idea is the only database improvement involved in data warehousing.

    I was able to get the mod_perl processes to share the memory used by the Storable data structure, so the many httpd daemons didn't consume too much RAM. We had three data structures using a total of about 100MB of RAM.

    You have to set the operating system process size limit large enough so that you don't run out of memory for the data structure.
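
    The usual way to get that sharing is to load the structure in startup.pl, before Apache forks, so the child processes share the pages copy-on-write (as long as they never write to the structure); a sketch with a made-up path and package name:

        # startup.pl, pulled in before Apache forks its children
        use strict;
        use Storable ();

        package My::SharedData;

        # Loaded once in the parent; children share it copy-on-write.
        our $docs = Storable::retrieve('/var/cache/cms/doc_tree.storable');

        1;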

    One very minor gotcha we ran into was that the Storable data was OS dependent. It wasn't a network order problem; we just upgraded our web server OS and somehow it wouldn't read Storable data from the old OS any more. It shouldn't have had that problem, so there was probably a bug somewhere.

    It should work perfectly the first time! - toma

      OH MAN IS THAT SWEET!

      I just got the "stored array of arrays" thing working and it's so damn fast! My god!

      This is so cool! Basically, I'm going to pre-create each of my access level/category combinations and store them on disk with Storable. Then on each invocation I just retrieve the structure, do a little massaging, and VOILA! I've got my complex data structure, created with multiple DBI queries, straight off disk.

      My preliminary testing shows that it's at least 50% faster, and that's with just one DBI query (the whole reason I need to do this is that I'm going to need more than one) and with the structure stored on disk, not in memory.

      I'll still use DBI for selecting individual documents and lookups, but for the complicated, multi-query stuff I'm going Storable!

      -Any sufficiently advanced technology is
      indistinguishable from doubletalk.

        One simple question: Why are you storing your structure on disk? Why not use a CLOB or BLOB (the datatype is called TEXT in MySQL I think, but I forget) and store it in a table?

        You may have to think a bit about how to structure the database for this, but you wouldn't have to worry about file locking and the other issues a database can abstract away for you.

        Just a thought ....
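
        A rough sketch of what I mean, using Storable's freeze()/thaw() against an invented cache table with a BLOB column (REPLACE INTO is MySQL-specific):

            use strict;
            use DBI;
            use Storable qw(freeze thaw);

            my $dbh = DBI->connect('dbi:mysql:cms', 'user', 'pass',
                                   { RaiseError => 1 });

            # Write: replace the cached structure for one access-level/category key.
            sub save_cache {
                my ($key, $ref) = @_;
                $dbh->do('REPLACE INTO cache (cache_key, frozen) VALUES (?, ?)',
                         undef, $key, freeze($ref));
            }

            # Read: thaw the BLOB back into a live Perl structure.
            sub fetch_cache {
                my ($key) = @_;
                my ($frozen) = $dbh->selectrow_array(
                    'SELECT frozen FROM cache WHERE cache_key = ?', undef, $key);
                return defined $frozen ? thaw($frozen) : undef;
            }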

Re: mySQL hits or Storable retrieves?
by Hero Zzyzzx (Curate) on Oct 02, 2001 at 18:07 UTC

    Why can't you edit your post sometimes? Strange.

    A couple of clarifications: I'm developing under mod_perl and using persistent DBI connections via Apache::DBI.
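
    (Enabling that is just a matter of loading Apache::DBI before DBI, typically in startup.pl; DBI->connect calls are then cached transparently.)

        # startup.pl
        use Apache::DBI;  # must be loaded before DBI
        use DBI;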

    -Any sufficiently advanced technology is
    indistinguishable from doubletalk.

Re: mySQL hits or Storable retrieves?
by tstock (Curate) on Oct 02, 2001 at 21:59 UTC
    Consider, if possible, actually storing the HTML results, and only calling the CGI if the HTML file is not there. I do this on my site, where the data in the DB is only updated once a day, using the ErrorDocument 404 directive in .htaccess to call the CGI.

    A cron job deletes all the stored html files when new data is available.
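
    The handler side is roughly this: build the page, save it where the next request will find it as a static file, and serve it with a 200 status. The paths and the page-building sub here are invented:

        #!/usr/bin/perl
        # Sketch of the CGI that Apache invokes via
        # "ErrorDocument 404 /cgi-bin/render.cgi" when no static file exists.
        use strict;
        use CGI;

        my $q = CGI->new;

        # Apache passes the originally requested URL in REDIRECT_URL.
        my $path = $ENV{REDIRECT_URL} || '/index.html';

        my $html = render_page($path);   # invented page-building sub

        # Save the page so the next request is served statically.
        my $file = "/var/www/html$path";
        open my $fh, '>', $file or die "can't write $file: $!";
        print $fh $html;
        close $fh;

        # Override the 404 status Apache would otherwise send.
        print $q->header(-status => '200 OK'), $html;

        sub render_page {
            my ($path) = @_;
            return "<html><body>page for $path</body></html>";
        }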

    I also store a gzipped version of the HTML and serve it if the browser accepts gzip encoding (using mod_rewrite); this saves beaucoup de bandwidth... but that's neither here nor there, I guess :-)

      Yeah, I thought about this option initially, but the system I'm creating uses a pretty complex templating scheme and conditional output of pages. I use cron-created HTML on another system I built and it's pretty sweet; I just regenerate the HTML when there's a change to a document.

      This project is a little more intense. I'm making it so that multiple non-profits with multiple domain names and different HTML templates can all share the same database of docs. Each program also has the ability to exclude documents based on specific properties (document type, subject matter, etc.) and has multiple access levels. This could quickly turn into an absurd number of static HTML pages, and I only have a T1 (for now. . .) to host it on. As long as I can keep my page generation rate (in bits per second) above the speed of the T1, I think I'll be OK.

      If it turns out down the road that I'm processor bound (I don't think it will) I'll revisit this option again.

      One thing I do use is HTML::Clean. At a slight cost in speed (about 2% in my testing), even at its lowest setting it gives me a 20-30% savings in the total size of my HTML. It's worth it! I may look into the gzip option too; I have a post-processing sub I could easily plug it into.
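
      The post-processing step itself is tiny; a sketch of the HTML::Clean part, with a sample string standing in for a rendered page:

          use strict;
          use HTML::Clean;

          my $html = "<html>\n  <body>\n    <p>Hello,   world!</p>\n  </body>\n</html>\n";

          my $h = HTML::Clean->new(\$html);   # work on the string in place
          $h->level(1);                       # lowest (most conservative) setting
          $h->strip();                        # drop whitespace, comments, etc.

          print ${ $h->data() };              # data() returns a scalar ref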

      -Any sufficiently advanced technology is
      indistinguishable from doubletalk.

        Though it's not pretty, you may want to look into "componentizing" your pages using SSI. Since "virtual" includes via SSI can call a CGI or whatever else you want, you can cache HTML "chunks" of an overall page while having other portions of the page dynamically generated.
Re: mySQL hits or Storable retrieves?
by cyberscribe (Acolyte) on Oct 03, 2001 at 08:05 UTC

    Hi Hero Zzyzzx,

    There are many variables that go into answering your questions, but the one that sprang to my mind is: does the flavor of SQL you're using support temp tables? I know MySQL does not, but Informix, for example, does.

    Temp tables are a great way to store data between queries inside SQL itself, which can be an ideal out-of-memory storage solution for many applications. By using temp tables properly, you can split up complicated queries into several simple queries and even perform queries, functions, and sorts that would be impossible in a single SQL statement with a given table layout.
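
    To make that concrete, here is a rough sketch through DBI; the tables, columns, and connection details are invented, and the exact temp-table syntax varies by database:

        use strict;
        use DBI;

        # Connection details are placeholders.
        my $dbh = DBI->connect('dbi:Informix:cms', 'user', 'pass',
                               { RaiseError => 1 });

        # Stage 1: narrow the universe of documents into a temp table.
        $dbh->do(q{
            CREATE TEMPORARY TABLE tmp_visible_docs AS
                SELECT d.id, d.title
                  FROM documents d
                  JOIN doc_category dc ON dc.doc_id = d.id
                 WHERE dc.category = 'housing'
        });

        # Stage 2: apply per-organization exclusions against the
        # much smaller temp table.
        my $docs = $dbh->selectall_arrayref(q{
            SELECT t.id, t.title
              FROM tmp_visible_docs t
              LEFT JOIN org_exclusions e
                ON e.doc_id = t.id AND e.org_id = 42
             WHERE e.doc_id IS NULL
        }, { Slice => {} });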

    So, that's my advice. Any time I hear about in-memory storage that could take a performance hit, with cron-tabbed scripts to pick up the slack, I take a look at the given tool set for more elegant methods. In the case of SQL there are many, and formulating the right query can save you a lot of the time you'd otherwise spend massaging data structures on the front end.

    You could say my answer is off-topic to Perl, but in a way I believe it is "perlish" in philosophy:

    The elegant and the appropriate, sufficiently understood and applied, have the power to transform the compound and confusing into the concise and beautiful.

    -cyberscribe

      Maybe I'm confused (in fact I am confused), but maybe I'm specifically confused...

      MySQL does have TEMP tables, which are in-memory, SQL-accessible tables. I've used them for small data-warehouse-style problems in the past. They're very fast. Is this what you mean?

      -- Anthony Staines

        The problem is that each "view", if you will (a combination of the user's access level and the category they're requesting, filtered down by the domain the request is coming from), requires me to run multiple queries. My understanding was that MySQL had a memory-caching feature for regular tables, in addition to the RAM-only HEAP tables that you're alluding to. These views could also get large down the road. (I'm migrating a Lotus Domino (yuck!) NSF that currently has 2,700 documents in it.)

        I guess I just thought it'd be faster and more elegant to periodically run the queries that create the data structures I need, store them to disk with Storable, and retrieve them when I need them, rather than keep two different sets of SQL: one for selecting from a memory-only HEAP table and one for updating the main database. It's actually far simpler for me to solve this problem with multiple SQL queries, too.

        I'll post example code soon. It's surprisingly (at least to me) simple!

        -Any sufficiently advanced technology is
        indistinguishable from doubletalk.