in reply to Re: redesign everything engine?
in thread redesign everything engine?

The speed issues may be fixable without a rewrite, but frankly the Everything code is not very well suited to its current use on PerlMonks. It has some fundamental design decisions (keeping the code in the database, storing most of the data as generic blobs of XML) which are a major cause of slowness and race conditions, but more importantly make it really hard for most people to contribute to the code and make testing nearly impossible. How could anyone profile this code effectively? Just retrieving it to run on your own system requires significant work (because you have to get it out of the PerlMonks database, and test data is not available).

It's very cool that the system was designed flexibly enough to work in this way, but a more focused codebase that works specifically for PerlMonks would be able to run much more efficiently. PerlMonks is essentially a separate codebase now, since it branched off the Everything codebase a long time ago and is not able to take updates from that code unless someone manually merges them in.

I would like to believe that a gradual process of rewriting could fix these issues, but I'm not sure it will because the things that need to be changed are so fundamental to the current design. Your point about all the accumulated knowledge in this code is a very good one though, and not to be dismissed lightly. Rewriting would be a lot of work and it would be hard to get all of the current functionality right. Migrating the data would be REALLY hard.

Caching with mod_perl, on the other hand, is trivial. I gave a talk about it at OSCON last year and I'd be happy to help if you have questions about it. Tye was concerned that using shared caching anywhere other than the nodelets would make the race conditions worse, but doing the caching itself is simple. (Of course caching across a cluster is hard, but that has nothing to do with mod_perl and may not be required.)
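
To make that concrete, here's roughly the shape of it: a minimal sketch using Cache::FileCache from the Cache::Cache distribution (one plausible choice; any shared cache module would do), with an invented node id, expiry time, and render routine.

    # Cache rendered pages in a file-backed cache shared by all httpd
    # processes. Everything here is illustrative, not real PerlMonks code.
    use strict;
    use Cache::FileCache;

    my $cache = Cache::FileCache->new({
        namespace          => 'perlmonks_pages',
        default_expires_in => 300,    # five minutes
    });

    sub render_node { my ($id) = @_; "<html>node $id</html>" }  # stand-in

    sub get_node_html {
        my ($node_id) = @_;

        # Serve from the cache when we can...
        my $html = $cache->get($node_id);
        return $html if defined $html;

        # ...otherwise render once and cache the result.
        $html = render_node($node_id);
        $cache->set($node_id, $html);
        return $html;
    }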

Re: Re: Re: redesign everything engine?
by chromatic (Archbishop) on Jan 28, 2003 at 21:09 UTC

    I profiled the current CVS in single-user mode on my laptop this weekend. I'm not really concerned about any one specific site, just the framework and general behavior. If I can speed that up, I'll have met my goal.

    I'm not terribly concerned about the XML, though using XML::DOM is a performance killer. That's mostly during the installation, though, so it's a low priority along the performance axis. The only place it's really used internally in the live system is in the workspacing code, and I don't think there's any of that on Perl Monks at the moment.

    Caching is complicated by the fact that the current CVS has subrefs in it. That's why I'm betting my managed-forking approach will have better performance in certain circumstances.

    The performance killers, as I see them:

    • pages are optimized for writing -- parsing links every time, processing page templates on every hit. This is ameliorated somewhat by code caching in the 1.0 series
    • nodelets are cached for the whole site or not at all -- they could be cached per user for a speed improvement
    • nodes have a custom inheritance scheme to deal with nodemethods, which was between 10 and 20% of the profiled time in my tests -- this could be reduced further
    • inefficient database queries, fetching hashrefs when bound scalars and explicit column names are 20-50% faster -- I'm working on this (see the DBI sketch after this list)
    • inefficient code -- we're better programmers now
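
    For the query point above, here's roughly what I mean (a sketch with a made-up connection and placeholder nodetype id, not actual Everything code):

        # Compare fetching generic hashrefs against binding explicit
        # columns to scalars; the latter avoids building a fresh hash
        # for every row.
        use strict;
        use DBI;

        my $dbh = DBI->connect('dbi:mysql:everything', 'monk', 'secret',
                               { RaiseError => 1 });   # invented credentials
        my $type_id = 1;                               # placeholder nodetype

        # Slower: SELECT * plus a new hashref per row.
        my $sth = $dbh->prepare('SELECT * FROM node WHERE type_nodetype = ?');
        $sth->execute($type_id);
        while (my $row = $sth->fetchrow_hashref) {
            print $row->{title}, "\n";
        }

        # Faster: explicit columns bound once to reused scalars.
        $sth = $dbh->prepare(
            'SELECT node_id, title FROM node WHERE type_nodetype = ?');
        $sth->execute($type_id);
        $sth->bind_columns(\my ($node_id, $title));
        print "$title\n" while $sth->fetch;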

    I'm working on all of these, but it's at the tip of CVS. My goal is to make migrating to Everything 1.2 an attractive option for Perl Monks.

      By grabbing the latest Everything from CVS, you're kind of highlighting the problem here: there is no current CVS of PerlMonks, because the code is kept in the database. There is no convenient way to get the latest code, let alone branch it for a major revision. It also makes the task of incorporating updates from Everything that much harder. This is why I think storing code in the database is not a good idea at this point. I'm sure there were reasons for it at the time, but it is counter-productive now.

      When I referred to XML, what I was really thinking of was the way nodes are stored and the resulting update problems (some of them are described here). I don't think this would be such a problem with a more normalized database schema and a codebase that allowed for finer-grained locking during updates.

      About the cache: subrefs are okay as long as Storable can handle them. Objects that can't be serialized can't be cached between processes at this point. At the moment, Perl threads are not very good at sharing objects, so mod_perl 2 may not solve this issue any time soon. I'm not sure what your managed-forking idea is, but I don't see why it wouldn't have to deal with exactly the same issues mod_perl does.
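
      For example, newer Storable releases will handle coderefs if you opt in; this little sketch (with invented node data) shows the trick and its cost:

          # Storable can serialize coderefs when $Storable::Deparse and
          # $Storable::Eval are set: freeze deparses the sub to source
          # with B::Deparse, and thaw evals it back (a compile per thaw).
          use strict;
          use Storable qw(freeze thaw);

          local $Storable::Deparse = 1;
          local $Storable::Eval    = 1;

          my $node = {
              title  => 'example node',            # made-up test data
              render => sub { "<p>$_[0]</p>" },
          };

          my $copy = thaw(freeze($node));
          print $copy->{render}->('hello'), "\n";  # prints <p>hello</p>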

      I don't want to sound like I'm just whining about the code. I am grateful for the existence of this site and your part in creating the code that made it happen. I do think that some of the design ideas have not scaled well though, and that it will be hard to fix it completely without fundamental changes.

        I do appreciate your comments, perrin, and you're the first person I'll ask about an inter-process cache.

        All of the code for the base install of the system is stored in CVS, though -- including the core nodes. It would be nice to do this with Perl Monks as well. (There'd probably be three or four specific nodeballs.) I'm planning to revise the XML format slightly so it's even easier to see changes between node revisions.

        An inter-process cache with its own locking mechanism could help, but there are other ways to avoid it. I'm inclined to propose a rule that all updates are committed to the database at the end of a request.
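
        In rough outline (a sketch with invented names, not a patch):

            # Handlers queue writes instead of hitting the database
            # immediately; one flush applies everything when the
            # request finishes.
            use strict;

            my @pending;

            sub queue_update {
                my ($table, $id, %fields) = @_;
                push @pending, [ $table, $id, \%fields ];
            }

            sub flush_updates {
                my ($dbh) = @_;
                for my $update (@pending) {
                    my ($table, $id, $fields) = @$update;
                    my $set = join ', ', map { "$_ = ?" } keys %$fields;
                    $dbh->do("UPDATE $table SET $set WHERE ${table}_id = ?",
                             undef, values %$fields, $id);
                }
                @pending = ();
            }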

        Any suggestions to improve the normalization of the database are welcome. For speed reasons, I'm tempted to move the doctype doctext field to a separate table. I'm definitely going to fix the hacky settings by making a one-to-many table for individual settings. That's another post-1.0 change.
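
        Something along these lines for the settings table (a sketch; the column names are guesses, not the final schema):

            # One row per setting instead of one packed blob per node,
            # so a single setting is a cheap indexed read or write.
            use strict;
            use DBI;

            my $dbh = DBI->connect('dbi:mysql:everything', 'monk', 'secret',
                                   { RaiseError => 1 });  # invented credentials

            $dbh->do(q{
                CREATE TABLE setting (
                    setting_id   INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
                    setting_node INT NOT NULL,
                    name         VARCHAR(64) NOT NULL,
                    value        TEXT,
                    UNIQUE (setting_node, name)
                )
            });

            my ($value) = $dbh->selectrow_array(
                'SELECT value FROM setting WHERE setting_node = ? AND name = ?',
                undef, 42, 'theme');    # placeholder node id and setting name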

        The problem with caching subrefs is that you'll still pay the eval() penalty. I'd prefer to cache any calculated field, though, as we do many times more reads than writes. That seems like a web-side enhancement, but if we have an inter-process cache, we can avoid many database hits, which will help.
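
        The eval() cost is easy to see with a toy benchmark (nothing Everything-specific here):

            # Recompile cached source on every call versus calling a
            # coderef compiled once.
            use strict;
            use Benchmark qw(cmpthese);

            my $source   = 'my $x = shift; $x * 2';
            my $compiled = eval "sub { $source }";

            cmpthese(-1, {
                eval_each_time => sub { (eval "sub { $source }")->(21) },
                cached_coderef => sub { $compiled->(21) },
            });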

        My managed-forking approach updates the parent process whenever the cache changes, so the cache always lives in the parent. This includes code. All forked children share that memory. I've not found a way to do that with Apache.
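
        In outline, it works something like this (a bare-bones sketch; the worker routines and child count are stand-ins):

            # The parent builds the cache, then forks children that
            # share those pages copy-on-write. When the cache changes,
            # the parent updates it and recycles the children so fresh
            # forks inherit the current copy.
            use strict;

            sub load_all_nodes { (42 => 'node body') }        # stand-in data
            sub serve_requests { my ($cache) = @_; sleep 1 }  # stand-in worker

            my %cache = load_all_nodes();

            my @kids;
            for (1 .. 4) {
                my $pid = fork;
                die "fork failed: $!" unless defined $pid;
                if ($pid == 0) {            # child: read-only view of %cache
                    serve_requests(\%cache);
                    exit 0;
                }
                push @kids, $pid;
            }

            waitpid $_, 0 for @kids;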

        Finally, I agree about fundamental changes. That's my plan. I'm just changing the existing code, not starting over.

      I would love to help out on the code (as might more people on PerlMonks), but I don't want to download and install Apache, mod_perl, MySQL, and Everything.

      Could it be a good idea if you, chromatic, posted pieces of Everything code for us monks to review? I'm sure we could come up with some improvements, which you could then decide whether or not to implement.

      Careful readers might understand by now that I would do anything to make this site faster, except really delve into the Everything engine ;-)

        Most of the performance problems are architectural. The act of finding appropriate snippets to post means finding bottlenecks. I'm not sure that'll help, because once I find a bottleneck, I can usually fix it. The code's reasonably well-factored now.

        Besides that, we're working from a substantially newer version of the code than the one running Perl Monks. nate added a stricter nodeversion caching system, I added code caching, and there are other improvements Perl Monks doesn't have.

        On the other hand, I almost have a DBD::SQLite backend ready to go, so, if you munge the install process just a bit, you can install the core system without Apache, mod_perl, or MySQL.
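
        A taste of what that buys you (a sketch; the filename and schema are invented):

            # DBD::SQLite keeps the whole database in one file, so a
            # test install needs no server at all.
            use strict;
            use DBI;

            my $dbh = DBI->connect('dbi:SQLite:dbname=everything.db', '', '',
                                   { RaiseError => 1 });

            $dbh->do('CREATE TABLE node (node_id INTEGER PRIMARY KEY, title TEXT)');
            $dbh->do('INSERT INTO node (title) VALUES (?)', undef, 'root node');

            my ($count) = $dbh->selectrow_array('SELECT COUNT(*) FROM node');
            print "nodes: $count\n";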

Re^3: redesign everything engine?
by tye (Sage) on Jan 28, 2003 at 23:07 UTC

    No, I said that making the node cache shared between processes would be bad (it would prevent improvements that could reduce the impact of its race conditions and reduce database server load).

    I mentioned some types of caching besides nodelets and noted that they wouldn't likely be a big win. I didn't say anything about "anywhere other than".

    I also note that chromatic appears to mostly be talking about load that we'd see on the web servers, which, last I checked, wasn't where the main problem is.

                    - tye
      Sorry, I did bowdlerize your comments a little. No harm intended.

        No problem. I just wanted to correct that for the sake of clarity. (:

                        - tye