
XML and file size

by silent11 (Vicar)
on Jan 07, 2003 at 00:20 UTC ( [id://224801]=perlquestion )

silent11 has asked for the wisdom of the Perl Monks concerning the following question:

I'm wanting to get some practice with perl/XML/XSL so I'm writing a journal program for myself. I once wrote a journal program with each entry stored in a txt file with the date as the file's name. It worked just fine, but I'd like to implement some other technologies for the sake of gaining experience and practice.

As I contemplate how to store my data in this new project I ask you, my monk friends, what is the most efficient way to store these journal entries in XML files?
  • one file per entry?
  • one file per month?
  • one file per year?
  • one file?
I don't expect each entry to be that long, 10k tops. Truth be told, I'm really only using Perl to combine the XML with the XSL, but I'm curious as to what your thoughts are on how to store my XML data.


Replies are listed 'Best First'.
Re: XML and file size
by roundboy (Sexton) on Jan 07, 2003 at 02:12 UTC

    Keep in mind that appending to a text file, and "appending" to an XML file, are not exactly the same, because the XML file will (should?) have some internal structure. For example, suppose you choose one file per month. Then a journal file might look something like this:

    <entries month="2003-01">
      <entry date="2003-01-01">
        <p>Hung over.</p>
      </entry>
      <entry date="2003-01-02">
        <p>Still hung over.</p>
      </entry>
      <entry date="2003-01-03">
        <p>Better today. Phew.</p>
        <p>Wish I didn't have to go to work, though.</p>
      </entry>
    </entries>

    When you create your next entry element, you'll be sticking it inside the entries element, rather than at the end of the file. So instead of opening the file for appending, you can either:

    1. Do some creative file scanning and rewriting (bad!); or
    2. Slurp the XML into a DOM-like structure, modify that tree, and write it all back out.

    There's nothing inherently wrong with choice 2, but it'll get slower and more expensive as you add more entries. Choice 1 will get increasingly difficult and crufty as soon as you try to do something other than append entries.
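Choice 2 can be sketched roughly as follows. This is only an illustration, assuming the CPAN module XML::LibXML is installed (the thread does not name a DOM library); `add_entry` is a hypothetical helper, and the document is held in a string rather than a file to keep the example short.

```perl
use strict;
use warnings;
use XML::LibXML;    # assumption: any DOM-style parser would do; this one is from CPAN

my $xml = <<'XML';
<entries month="2003-01">
  <entry date="2003-01-01"><p>Hung over.</p></entry>
</entries>
XML

# Hypothetical helper: parse the whole month, add one entry inside the
# root element, and return the complete document as text again.
sub add_entry {
    my ($xml_text, $date, @paragraphs) = @_;

    my $doc  = XML::LibXML->load_xml(string => $xml_text);
    my $root = $doc->documentElement;

    my $entry = $doc->createElement('entry');
    $entry->setAttribute(date => $date);
    $entry->appendTextChild(p => $_) for @paragraphs;

    # The new node goes inside <entries>, not at the end of the file.
    $root->appendChild($entry);

    return $doc->toString;
}

my $updated = add_entry($xml, '2003-01-02', 'Still hung over.');
print $updated;
```

Every call re-parses and re-serializes the whole month, which is where the growing cost mentioned above comes from.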

    As chromatic (I believe it was) mentioned, you also need to think about what you want to do with these entries. Search them? Display arbitrary sets based on date/subject/keywords/etc? If you ever want groupings other than the one you're thinking about for storage, then you may prefer to store each entry separately.



      Rather than slurping the whole XML into a DOM for appending information, a SAX approach can be used. Simply pass through everything but the closing root tag. On encountering it, emit the new node and then the closing root tag.
      This is much faster and much more memory friendly.

      But yes, appending to XML is expensive.

      Just my 2 cents, -gjb-

        Thanks, a very good point. A SAX parser is the robust way to implement "creative file scanning", and I just didn't think of it. But the point regarding this alternative remains true: namely, that as you start doing additional tasks beyond reading and appending, it'll get progressively harder to get it done.

        Regardless, since the goal of the project is to learn new technologies, maybe the best approach would be this: do a little reading, and a lot of thinking, about how XML document types can be used to represent various structures, and then consider what kinds of structural relationships will exist within the journal data. Then choose a data representation, and write a schema or DTD (even if no validation is needed, it's good practice). Finally, play with the various tools including both kinds of parsers. I'd even suggest poking around with a q&d "parser" that builds on something like

        my ($tag, $attrs, $body) = /<(\w+)\s+(.*?)>(.*?)<\/\1>/s;
        to see why it is disrecommended by so many people.
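A few inputs are enough to show where such a pattern goes wrong. The sketch below wraps the quick-and-dirty pattern in a hypothetical helper, `qd_parse`, written with `m{...}` delimiters so the slash in the closing tag needs no escaping. The sample inputs are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper around the quick-and-dirty pattern.
# In list context it returns (tag, attributes, body), or an empty list.
sub qd_parse {
    my ($xml) = @_;
    return $xml =~ m{<(\w+)\s+(.*?)>(.*?)</\1>}s;
}

# Case 1: a flat element with one attribute works as hoped.
my ($tag, $attrs, $body) = qd_parse('<entry date="2003-01-01">Hung over.</entry>');
print "tag=$tag attrs=$attrs body=$body\n";

# Case 2: an element without attributes never matches, because \s+
# demands whitespace after the tag name.
my @no_attrs = qd_parse('<p>Hung over.</p>');
print "captures for <p>: ", scalar(@no_attrs), "\n";

# Case 3: nested elements of the same name. The non-greedy body stops
# at the first closing tag, so the outer element is cut short.
my (undef, undef, $nested) = qd_parse('<div class="a">x<div class="b">y</div>z</div>');
print "nested body=$nested\n";

# Case 4: a '>' inside an attribute value ends the attribute capture early.
my (undef, $cut) = qd_parse('<entry title="a > b">text</entry>');
print "attrs=[$cut]\n";
```

Comments, CDATA sections, entities, and self-closing tags break it in similar ways, which a real parser handles for you.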


Re: XML and file size
by chromatic (Archbishop) on Jan 07, 2003 at 01:35 UTC

    It depends on what kind of data you want to store. What is a journal entry? What kind of operations do you need to perform on your entries? It's hard to tell you what your best option is when you haven't yet decided what you want to do.

Re: XML and file size
by PodMaster (Abbot) on Jan 07, 2003 at 01:51 UTC
    chromatic gives excellent advice, but here's my spin on it: use a single file per entry. If you update an entry, you can even add a revision history if you want. Now you (a user) have a directory full of XML files, I mean journal entries, and you can manipulate them in any way, shape, or form you want.

    I'd use a date/time string to name the files, and maybe stuff a month's worth into its own subdirectory.
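One way such a layout could look, using only core modules. The helper names `entry_path` and `save_entry`, the `YYYY-MM` directory name, and the `YYYYMMDD-HHMMSS.xml` file name are all assumptions made for this sketch, not something the post prescribes.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Path     qw(make_path);
use File::Spec;
use File::Temp     qw(tempdir);
use POSIX          qw(strftime);

# Hypothetical helper: map a timestamp to <root>/YYYY-MM/YYYYMMDD-HHMMSS.xml (UTC).
sub entry_path {
    my ($root, $epoch) = @_;
    my @t = gmtime $epoch;
    return File::Spec->catfile(
        $root,
        strftime('%Y-%m', @t),
        strftime('%Y%m%d-%H%M%S.xml', @t),
    );
}

# Hypothetical helper: create the month directory if needed and write the entry.
sub save_entry {
    my ($root, $epoch, $xml) = @_;
    my $path = entry_path($root, $epoch);
    make_path(dirname($path));
    open my $fh, '>:encoding(UTF-8)', $path or die "Cannot write $path: $!";
    print {$fh} $xml;
    close $fh or die "Cannot close $path: $!";
    return $path;
}

# Demo in a throwaway directory; 1041898800 is Jan 07, 2003 at 00:20 UTC.
my $root  = tempdir(CLEANUP => 1);
my $saved = save_entry($root, 1041898800,
    qq{<entry date="2003-01-07"><p>First post.</p></entry>\n});
print "$saved\n";
```

Because the names sort chronologically, a plain directory listing already gives the entries in order.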

    As for XSL, check out ;)


Re: XML and file size
by dws (Chancellor) on Jan 07, 2003 at 06:37 UTC
    I'd like to implement some other technologies for the sake of gaining experience and practice. ... I ask you, my monk friends, what is the most efficient way to store these journal entries in XML files?

    If your primary goal is to gain experience, don't worry about efficiency just yet. Try things. Measure. Get a feeling for the strengths and weaknesses of various approaches. Play around.

Re: XML and file size
by Matts (Deacon) on Jan 07, 2003 at 08:36 UTC
    My personal preference would be either to have one XML file per entry and generate indexes offline (either in a cron job, or every time you create/edit an entry), or do the whole thing in a database.
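An offline index pass over per-entry files might look something like this. It is a rough sketch with core modules only; the file-naming convention (`YYYYMMDD-HHMMSS.xml`), the `<index>` format, and the helper `build_index` are assumptions for illustration. Paths are written into the index unescaped, so it assumes file names without XML-special characters.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Find;
use File::Path     qw(make_path);
use File::Spec;
use File::Temp     qw(tempdir);

# Hypothetical helper: walk $root and list every entry file in one index document.
sub build_index {
    my ($root) = @_;

    my @files;
    find({
        no_chdir => 1,    # keep $_ as the full path
        wanted   => sub {
            push @files, $File::Find::name
                if -f $_ && /\d{8}-\d{6}\.xml\z/;
        },
    }, $root);

    my $index = "<index>\n";
    for my $file (sort @files) {
        my ($y, $m, $d) = $file =~ /(\d{4})(\d{2})(\d{2})-\d{6}\.xml\z/;
        my $href = File::Spec->abs2rel($file, $root);
        $index .= qq{  <entry date="$y-$m-$d" href="$href"/>\n};
    }
    return $index . "</index>\n";
}

# Demo: two empty entries in a throwaway directory.
my $root = tempdir(CLEANUP => 1);
for my $name ('20030101-090000.xml', '20030102-093000.xml') {
    my $path = File::Spec->catfile($root, '2003-01', $name);
    make_path(dirname($path));
    open my $fh, '>', $path or die "Cannot write $path: $!";
    print {$fh} "<entry/>\n";
    close $fh or die "Cannot close $path: $!";
}

my $index = build_index($root);
print $index;
```

Run from cron or after each edit, a script like this keeps listing pages cheap without touching the entries themselves.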

    Generally I tend to go with the database route, because at the end of the day they're great for storing lots of bits of similar data, and I always turn stuff into XML anyway for output generation via AxKit. That way I get the best of both worlds.

Re: XML and file size
by osama (Scribe) on Jan 07, 2003 at 05:18 UTC

    If your aim is only to practice XML/XSL then by all means go ahead and use either "one file per entry" or "one file"... I don't see any advantages for the rest.

    Remember: You don't have to use XML just because it's a HOT Buzzword. I see many people using new technologies just for the sake of using them. Appending to XML is expensive, as a poster said, but searching in XML is much more expensive!!!!

      searching in XML is much more expensive!!!

      I'm not sure that I agree with that. If "searching" means checking whether a word or phrase occurs in the file then the time required to search would be almost identical for XML versus plain text - assuming you use the same code for each (save for the fact that the XML file will be a bit more verbose so extra I/O might be required in some cases).

      On the other hand, if you want to do semantic searches (e.g., does this word or phrase occur within <title> ... </title> tags?) then sure, that will take more CPU cycles than a plain text match, but that is merely extra cost for extra power.
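The two kinds of search can be put side by side. This is a small sketch on made-up sample data, again assuming the CPAN module XML::LibXML for the structured half; the `<title>` element inside each entry is an assumption about the journal format.

```perl
use strict;
use warnings;
use XML::LibXML;    # assumption: used here only for the XPath query

my $xml = <<'XML';
<entries>
  <entry date="2003-01-01"><title>New year</title><p>Started a Perl journal.</p></entry>
  <entry date="2003-01-02"><title>Perl and XSL</title><p>Nothing to report.</p></entry>
</entries>
XML

# Plain-text search: the same regex you would run over a .txt file.
# It cannot tell a title from a paragraph.
my $plain_hit = ($xml =~ /Perl/) ? 1 : 0;

# Semantic search: only entries whose <title> contains the word.
my $doc   = XML::LibXML->load_xml(string => $xml);
my @dates = map { $_->getAttribute('date') }
            $doc->findnodes('//entry[contains(title, "Perl")]');

print "plain text match: $plain_hit\n";
print "title matches: @dates\n";    # prints "title matches: 2003-01-02"
```

The parse step is the extra CPU cost; the ability to restrict the match to titles is what it buys.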

        I have nothing against XML, and it can be used to store your data in some cases, but I think it's better suited for data interchange, SOAP, or having different formats for the same data. I'm actually comparing XML files to a database, to which they are frequently offered as an alternative; storing XML in a database is another thing.

        I never heard of anybody saying "I'll use XML files instead of text files"... It's mostly "Use XML and you don't need a database". I just cannot imagine a search in 200,000 XML files looking for text in <title> tags, but imagining "select body from pages where title like '%text%'" is easy.

        I think storing your data in any type of files (XML/text/CSV...) is a waste of time if you have lots of data (>1000 records? less? more?)
