in reply to More than one way to skin an architecture

One thing to keep in mind is that “memory” is, at least potentially, “a disk file.”

The actual degree to which that is so depends entirely upon the amount of RAM and on whatever other workload the computer is carrying, but as a matter of general principle, if you're dealing with a large amount of information you need to be mindful of just how large it is.

“Large in-memory hashes” can be problematic because of the nature of the hashing algorithm: it scatters entries widely through memory, which can make your “working set” large and lead to excessive paging. On the other hand, if you know that the target machine has gobs of chips and not much else to do, it might be a non-issue. (“Just throw silicon at it.”   “If you've got it, flaunt it.”)

As previously mentioned here, Perl offers the tie mechanism, which lets you specify how a “hash” is actually stored:   for instance, you can tie it to a Berkeley DB file. The syntax is that of a hash, but the implementation is disk-file access. The key issue here is to be aware of what your chosen implementation is going to do, and how it's going to behave on your hardware.
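By way of illustration, here's a minimal sketch of tying a hash to a Berkeley DB file with DB_File (the filename and keys are made up, and it assumes Berkeley DB is available on the box):

    use strict;
    use warnings;

    use Fcntl;     # O_RDWR, O_CREAT
    use DB_File;   # Berkeley DB bindings; assumed to be installed

    # The syntax stays that of a hash, but reads and writes go to 'quakes.db' on disk.
    tie my %quakes, 'DB_File', 'quakes.db', O_RDWR | O_CREAT, 0644, $DB_HASH
        or die "Cannot tie quakes.db: $!";

    $quakes{'2008-03-19T00:11'} = 'M5.1,36.85,-76.29';   # persists across runs
    print scalar( keys %quakes ), " records on disk\n";

    untie %quakes;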

Re^2: More than one way to skin an architecture
by BrowserUk (Patriarch) on Mar 19, 2008 at 00:11 UTC

    Swap files are not persistent, so they don't really help the OP with his stated need: "A day's data is about 10KB so this isn't going to get very large, even after a month; which is about all I want to save."


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Yeah. I generally agree with that statement. I also like Jethro's comment about maintaining the data in a human-readable form (the YAML comment above). I'm now thinking that persistence in a human-readable form is the real need.

      The value of all this, at least to me, is that I get to sound out the questions against a body of people who can give me some answers and act as a sounding board. My wife just can't answer these questions.

      For that, I thank you all.
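      If human-readable persistence does turn out to be the need, here is a small sketch of one way to do it with the YAML module's DumpFile/LoadFile (the filename and structure are hypothetical, just for illustration):

          use strict;
          use warnings;

          use YAML qw(DumpFile LoadFile);   # assumes YAML.pm is installed

          my $file  = 'ships.yml';          # hypothetical data file
          my $ships = {
              Oriana => { lat => 25.77, lon => -80.19, seen => '2008-03-19' },
          };

          DumpFile( $file, $ships );        # write as human-readable YAML
          my $loaded = LoadFile( $file );   # read it back into a Perl structure

          print "$_ last seen $loaded->{$_}{seen}\n" for keys %$loaded;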

        If you go the human readable route, take a close look at your options before committing. Personally, I don't find YAML easy to follow or maintain.

        By way of example, here are the outputs of a moderately complex randomly generated structure from 3 contenders: Data::Dump, Data::Dumper and YAML. See which you understand best, and prefer to maintain.
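        A rough sketch of how such a comparison might be produced (the structure here is only a stand-in, not the randomly generated one behind those outputs):

            use strict;
            use warnings;

            use Data::Dump qw(pp);
            use Data::Dumper;
            use YAML qw(Dump);

            # Stand-in nested structure, purely illustrative.
            my $data = {
                ships   => [ { name => 'Oriana', lat => 25.77, lon => -80.19 } ],
                updated => '2008-03-19',
            };

            print "--- Data::Dump ---\n",   pp($data), "\n";
            print "--- Data::Dumper ---\n", Dumper($data);
            print "--- YAML ---\n",         Dump($data);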

        I also think that human-readable is a double-edged sword that should only be wielded if there is a definite need.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re^2: More than one way to skin an architecture
by mcoblentz (Scribe) on Mar 19, 2008 at 05:01 UTC
    Frankly, that is the very reason I asked this question. Even though the dataset might not get very large, and therefore I could probably afford a "sloppy" approach to this problem, it seems prudent to really think this one through. Who knows? I might decide to keep a year's worth and then where would I be?

    I'm running this on my "work" laptop. It's got all the usual Office apps running - Outlook, Word, PPT, etc. - plus Firefox (which seems to be a memory hog, if you ask me). This planet program is grabbing USGS data every 20 minutes, airplane data every 5 (if I turn that on), and updating the day/night terminator line every 5. Since this script is going to run once a day (well, ships actually update about 4x/day, but I don't know if I care that much), I don't want to drag this thing down just because I was sloppy about hunting down cruise ships.

    The resources for this problem are finite; I don't have all the silicon I would like (I wish I did!), and I work in an enterprise software company, so bad architecture just offends me.

    I didn't know that large hashes can cause a memory problem. It would be great to hear more about that kind of thing. And your comment,

    "The key issue here is to be aware of what your chosen implementation is going to do, and how it's going to behave on your hardware"
    is spot-on, except that I don't know what a given implementation might do. I'm new to Perl, hence the question.
      You could always use a DBM file via the standard AnyDBM_File module. A DBM file behaves just like a hash, is dead easy to use, and is fast: it's a simple key->value database stored on disk. I used one with 250,000 price references once and it was dead fast; the in-memory hash version was very slow to load. Updating it with new CSVs would be easy too.
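      A minimal sketch of that approach, using the standard AnyDBM_File module (the filename and keys are just illustrative):

          use strict;
          use warnings;

          use Fcntl;         # O_RDWR, O_CREAT
          use AnyDBM_File;   # standard module; picks whichever DBM is available

          # Behaves just like a hash, but the data lives in a file on disk.
          tie my %ships, 'AnyDBM_File', 'ships.db', O_RDWR | O_CREAT, 0644
              or die "Cannot open ships.db: $!";

          $ships{'Oriana|2008-03-19'} = '25.77,-80.19';   # survives between runs
          print "$_ => $ships{$_}\n" for keys %ships;

          untie %ships;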