in reply to XML and file size

Keep in mind that appending to a text file, and "appending" to an XML file, are not exactly the same, because the XML file will (should?) have some internal structure. For example, suppose you choose one file per month. Then a journal file might look something like this:

    <entries month="2003-01">
      <entry date="2003-01-01">
        <p>Hung over.</p>
      </entry>
      <entry date="2003-01-02">
        <p>Still hung over.</p>
      </entry>
      <entry date="2003-01-03">
        <p>Better today. Phew.</p>
        <p>Wish I didn't have to go to work, though.</p>
      </entry>
    </entries>

When you create your next entry element, you'll be sticking it inside the entries element, rather than at the end of the file. So instead of opening the file for appending, you can either:

  1. Do some creative file scanning and rewriting (bad!); or
  2. Slurp the XML into a DOM-like structure, modify that tree, and write it all back out.
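For choice 2, here's a minimal sketch in Perl, assuming the XML::LibXML CPAN module and the <entries>/<entry> markup from the example above (the new entry's date and text are made up):

```perl
use strict;
use warnings;
use XML::LibXML;    # CPAN module, not core -- an assumption here

# A month's worth of entries, as in the example above.
my $xml = <<'XML';
<entries month="2003-01">
  <entry date="2003-01-01"><p>Hung over.</p></entry>
</entries>
XML

# Slurp the XML into a DOM...
my $doc  = XML::LibXML->load_xml(string => $xml);
my $root = $doc->documentElement;    # the <entries> element

# ...modify the tree: build a new <entry> and hang it inside
# <entries>, rather than tacking it onto the end of the file...
my $entry = $doc->createElement('entry');
$entry->setAttribute(date => '2003-01-02');
my $p = $doc->createElement('p');
$p->appendText('Still hung over.');
$entry->appendChild($p);
$root->appendChild($entry);

# ...and write it all back out.
print $doc->toString(1);
```

The cost is exactly what's described below: the whole document is parsed and reserialized for every append.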

There's nothing inherently wrong with choice 2, but it'll get slower and more expensive as you add more entries. Choice 1 will get increasingly difficult and crufty as soon as you try to do anything other than append entries.

As chromatic (I believe it was) mentioned, you also need to think about what you want to do with these entries. Search them? Display arbitrary sets based on date/subject/keywords/etc? If you ever want groupings other than the one you're thinking about for storage, then you may prefer to store each entry separately.

HTH!

--roundboy

Re: Re: XML and file size
by gjb (Vicar) on Jan 07, 2003 at 02:23 UTC

    Rather than slurping the whole XML into a DOM for appending information, a SAX approach can be used. Simply pass through everything but the closing root tag. On encountering it, emit the new node and then the closing root tag.
    This is much faster and much more memory friendly.
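    The same idea can be sketched without a full SAX stack: stream everything through untouched and emit the new node just before the closing root tag. Here's a toy version on an in-memory string, assuming the <entries> markup from above (a real SAX filter does this with parse events instead, and wouldn't be fooled by, say, a stray </entries> inside a comment or CDATA section):

```perl
use strict;
use warnings;

# A month's worth of entries, as in the example above.
my $xml = <<'XML';
<entries month="2003-01">
  <entry date="2003-01-01"><p>Hung over.</p></entry>
</entries>
XML

# Pass everything through; splice the new entry in just before the
# closing root tag -- the same move a SAX pass-through filter makes.
my $new = qq{  <entry date="2003-01-02"><p>Still hung over.</p></entry>\n};
$xml =~ s{(?=</entries>)}{$new};

print $xml;
```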

    But yes, appending to XML is expensive.

    Just my 2 cents, -gjb-

      Thanks, a very good point. A SAX parser is the robust way to implement "creative file scanning", and I just didn't think of it. But the point about this alternative still stands: as you start doing tasks beyond reading and appending, it gets progressively harder.

      Regardless, since the goal of the project is to learn new technologies, maybe the best approach would be this: do a little reading, and a lot of thinking, about how XML document types can be used to represent various structures, and then consider what kinds of structural relationships will exist within the journal data. Then choose a data representation, and write a schema or DTD (even if no validation is needed, it's good practice). Finally, play with the various tools, including both kinds of parsers. I'd even suggest poking around with a q&d "parser" that builds on something like

      my ($tag, $attrs, $body) = m{<(\w+)\s*(.*?)>(.*?)</\1>}s;

      to see why so many people recommend against it.
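      To make the failure concrete, here's a small demonstration (the <entry> markup is borrowed from the journal example above; the input strings are made up):

```perl
use strict;
use warnings;

# A toy "parser" built on the regex above, to show where it breaks.
my $re = qr{<(\w+)\s*(.*?)>(.*?)</\1>}s;

# Nested elements: the non-greedy body stops at the FIRST closing tag.
my $nested = '<entry date="a"><entry date="b">inner</entry>outer</entry>';
my ($t1, $a1, $b1) = $nested =~ $re;
print "$b1\n";    # <entry date="b">inner  -- the outer entry is truncated

# A '>' inside an attribute value splits the tag in the wrong place.
my $gt = '<entry note="a > b">x</entry>';
my ($t2, $a2, $b2) = $gt =~ $re;
print "$a2\n";    # note="a   -- the attribute list is mangled
```

      A real parser tokenizes the markup properly and has none of these problems.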

      --roundboy