mcoblentz has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

As part of my personal journey to enlightenment, I have started a personal project that just seems to keep on growing. So, knowing that in good software "Architecture is not an Afterthought", I think I need to take this opportunity to solicit some advice.

The project is to create a near-real time picture of the events on the earth and to display that as the screen background on my laptop (Windows XP SP2). Currently this stage of the project is trying to plot the known locations of cruise ships and other maritime vessels on the seas (and possibly their track - previous locations over x period of time).

We may assume that the current day's data is available as a .csv file. Recently other monks have helped me work through the whole page-scraping, table-extracting thing. I save that data as a .csv file for subsequent querying via DB-style select statements in the same script, so I already have two I/Os to marshal/unmarshal the data. (Not very efficient, I think.) I want to persist the day's data for subsequent recall (for example, if asked for a ship's track) in another session at a later date, so I use a .csv file.

My question is: would it be more performant over time to merge the day's data into a large hash (keyed on ship's name, for example)? Can I save a hash back to the file system and recall it later? (I might not run this script every day.) Or would it be better to just save the daily .csv files and (re)assemble the hash in memory? A day's data is about 10KB, so this isn't going to get very large, even after a month -- which is about all I want to save. It doesn't seem like I could trim a hash of old data by day when the key is the ship name, though.

It also seems to me that treating this like a database (of .csv data) would make it an easy thing to query (just build the select statement and go). DBD::CSV would work well here. I'm not quite sure how to fetch multiple .csv files and append them together to get one large db, however. Can I use DBI on a hash like that, or should I use something different? I know there are many ways to do this and the (holy) documentation tells me so, but which one would work well? Since I'm not a programmer by trade or training, my approaches to this project tend to be somewhat of a "hack". I'm learning Perl as I go. As a hobby, at least my wife knows that I'm not just surfing for "inappropriate content".
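For what it's worth, appending several daily .csv files into one in-memory table can be done with core Perl alone. A minimal sketch -- the `load_csv_files` name and the naive split-on-comma parsing are illustrative assumptions; quoted fields would need Text::CSV from CPAN:

```perl
use strict;
use warnings;

# Gather rows from several daily CSV files into one in-memory table.
# Assumes rows like: SHIP1,20080318-0730,10.33,-05.45
# NOTE: split /,/ is a naive parser and breaks on quoted fields;
# Text::CSV from CPAN is safer for real-world data.
sub load_csv_files {
    my @files = @_;
    my @rows;
    for my $file (@files) {
        open my $fh, '<', $file or die "Can't open $file: $!";
        while (my $line = <$fh>) {
            chomp $line;
            push @rows, [ split /,/, $line ];
        }
        close $fh;
    }
    return @rows;
}

# e.g. my @table = load_csv_files(glob 'ships-*.csv');
```

Each daily file then just becomes more rows in the same table, ready for querying or for building the hash.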

Thanks, Matt

Note: This "earth" drawing program takes input in the form of: a simple marker file: <lat>, <long>, <ship_name>. I can also create an "arc file" for the ship's historical track (<lat1>, <long1>, <lat2>, <long2>). I already plot earthquakes (NRT from the USGS); volcanoes; satellites; clouds and storms; and even some airplanes in flight. So to construct these input files, I just query and print by row. Voila!
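A sketch of how those marker and arc files might be generated from a hash of tracks -- the function names and the data layout (ship name mapped to a list of [lat, long] pairs) are illustrative assumptions, not the actual program:

```perl
use strict;
use warnings;

# Emit one marker line per ship (<lat>, <long>, <ship_name>) from its
# most recent fix, and arc lines (<lat1>, <long1>, <lat2>, <long2>)
# joining consecutive track points.
sub marker_lines {
    my %track = @_;
    my @out;
    for my $ship (sort keys %track) {
        my $last = $track{$ship}[-1];          # most recent position
        push @out, sprintf "%s, %s, %s", @$last, $ship;
    }
    return @out;
}

sub arc_lines {
    my %track = @_;
    my @out;
    for my $ship (sort keys %track) {
        my @pts = @{ $track{$ship} };
        for my $i (1 .. $#pts) {               # join successive fixes
            push @out, sprintf "%s, %s, %s, %s", @{ $pts[$i-1] }, @{ $pts[$i] };
        }
    }
    return @out;
}
```

Printing the results, one line per row, gives the two input files the drawing program expects.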

  • Comment on More than one way to skin an architecture

Replies are listed 'Best First'.
Re: More than one way to skin an architecture
by plobsing (Friar) on Mar 17, 2008 at 22:08 UTC
    First, let me say that that idea is supremely cool.

    Questions about performance without code to benchmark are (almost if not entirely) meaningless.

    My advice would be to implement your program in the easiest way possible while maintaining the ability to plug in alternatives later without much difficulty. For example, if you're thinking of storing a hash to disk using DBI, why not use a Tie module to do it automagically for you? It may be suboptimal (maybe not), but it's really easy.

    Once you've got it working right, then worry about getting it working efficiently.
      Thanks for the sentiment. The project has started to take on a life of its own. My friends and co-workers are the encouragement here.

      You make a very good point about performance being something to optimize after you have at least a prototype - I guess that's the nature of iterative development after all - but I was trying to think my way through this one first. (Not my usual technique, I admit.) Since I'm not too skilled yet at hashes and managing them, I'm a tad leery of what the query/select statements will look like and whatnot. I know to bind columns to a csv table and query from that; I'm not sure what the equivalent construct would be for a hash. Any thoughts there?

      I like the Tie idea. The module seems reasonable enough for a novice like me.

      As for optimization after the prototyping, well, there are the Monks, aren't there? ;)

        Do you understand how to use a hash in Perl? They aren't exactly like databases (although the underlying idea is essentially the same). In any case there are no query/select statements in the SQL sense.

        I'm not a database guy, so I don't know the voodoo to set up your data model optimally for a database, but when I look at your problem, I see a hash mapping boat identifiers (names probably) to arrays of coordinate pairs in chronological order (HoAoA).
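That HoAoA might be sketched like this (ship names and coordinates invented for illustration):

```perl
use strict;
use warnings;

# Hash mapping boat names to arrays of [lat, long] pairs in
# chronological order -- a hash of arrays of arrays (HoAoA).
my %track = (
    'QE2'    => [ [ 10.33, -5.45 ], [ 10.90, -5.10 ] ],
    'Oriana' => [ [ 43.00, 16.20 ] ],
);

# append today's fix for a ship (a new ship's array autovivifies)
push @{ $track{'QE2'} }, [ 11.45, -4.80 ];

# the latest position is simply the last element
my ($lat, $long) = @{ $track{'QE2'}[-1] };
```

Appending a fix and reading back the latest position are both one-liners, which is most of what the plotting step needs.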

        But then I think that that kind of structure maps perfectly to netcdf, on which I cut my teeth. So I'm probably horribly biased on this.

        As for optimization, super search, profiling, and maybe low-level formats (for numeric data) are your friends.
Re: More than one way to skin an architecture
by BrowserUk (Patriarch) on Mar 18, 2008 at 04:47 UTC

    For keeping arbitrary datastructures on disk, DBM::Deep takes some beating. If you know how to use Perl's hashes and arrays, then use the tie interface. Add a single line to the top of your program, and whatever you do to the hash ends up on disk, ready to be retrieved next time.

    No need to re-parse your data every time. No obscure interfaces to learn. No need to force fit your data onto the relational model. And it's surprisingly fast. One line, do whatever Perl lets you do with your data, and it just works.
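A minimal sketch of that one-line setup, assuming DBM::Deep is installed from CPAN; the file name is arbitrary:

```perl
use strict;
use warnings;
use DBM::Deep;    # CPAN module

# the single line: tie the hash to a file on disk
tie my %SHIPS, 'DBM::Deep', 'ships.deep';

# from here on, use it exactly like an ordinary hash of arrayrefs;
# everything stored in %SHIPS survives across runs
$SHIPS{'QE2'} = [] unless $SHIPS{'QE2'};
push @{ $SHIPS{'QE2'} },
    { datetime => '20080318-0730', latitude => '10.33', longitude => '-05.45' };
```

The next run ties the same file and finds the data already there -- no explicit save or load step.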


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: More than one way to skin an architecture
by okram (Monk) on Mar 18, 2008 at 07:34 UTC
    I assume that you have the following:
    • Current day's data as CSV
    • Historical data as CSV or other input
    In order to display the data, I assume that you're going to do something like this, Foreach Ship:
    • Display current day's position as RED dot or similar
    • Foreach historical data, display that position as BLUE dot or similar -- if you have it
    IMHO the most simple thing to do would be:
    • Prepare a HASH from the historical data -- if you have it -- like this:
      • Keys are ships names
      • Values are an ARRAY REF of the historical position, each unit like:
        • {datetime => '...', position => '...'}
    Once you've plotted today's data, all you need to do is go through all of today's data (foreach ship name) and:
    • Push today's data to the historical hash of arrayrefs -- if you have no historical data, simply create the hash the first time
    • save for later reference
    Then, whenever you start the program up, you'll only have to:
    • Load historical data
    • see above for displaying
    • see above for merging
    • see above for saving
    Now, for the hash part. The hash I have in mind would be like:
    # today's data -- your starting point
    my %SHIPS = (
        SHIP1 => [
            { datetime => '20080318-0730', latitude => '10.33', longitude => '-05.45' },
        ],
        # other ships
    );
    You can then save that via Data::Dumper:
    use Data::Dumper;
    print Data::Dumper->Dump( [ \%SHIPS ], [ '*SHIPS' ] );   # save to a file, don't print
    This specific invocation will give you exactly what you have in the hash, dumped under the name %SHIPS.
    The cool thing is that at the subsequent start of the program all you have to do is to "do" the file, and Perl will get the contents of the %SHIPS hash you had at the previous run:
    our %SHIPS;          # the dumped file assigns the package variable
    do 'dumped_file';
    As above, all you'd then have to do is display the data, merge it with the historical, and save again.

    The "merge" (assuming you are reading through a CSV on <>) is something like:
    while (<$CSV>) {
        chomp;
        # assume csv is SHIP1,DATETIME,LATITUDE,LONGITUDE -- modify accordingly
        my ($shipname, $datetime, $latitude, $longitude) = split /,/, $_;
        # this is what will get pushed on the arrayref
        my %today_details = (
            datetime  => $datetime,
            latitude  => $latitude,
            longitude => $longitude,
        );
        my @arr = @{ $SHIPS{$shipname} || [] };   # get current contents
        push @arr, \%today_details;               # push today's data
        $SHIPS{$shipname} = \@arr;                # put it back
    }
    # that's you done with the merge
    Really, even if it does get VERY long... it's not like your PC can't handle a 10MB file after some time..
    Or maybe you'd want to implement a "dump older than one month details".. you'll simply go through the hash, then through the arrayref, and discard all the entries which are older than a specified time.. Exercise left to the reader ;)
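That trimming exercise might look something like the following sketch; it assumes the datetime strings are in a format like 20080318-0730, which sorts chronologically under plain string comparison:

```perl
use strict;
use warnings;

# Drop every fix older than a cutoff. YYYYMMDD-HHMM strings compare
# correctly with 'ge', so no date arithmetic is needed.
sub trim_history {
    my ($ships, $cutoff) = @_;    # hashref of arrayrefs, e.g. '20080218-0000'
    for my $name (keys %$ships) {
        @{ $ships->{$name} } =
            grep { $_->{datetime} ge $cutoff } @{ $ships->{$name} };
        # a ship with no recent fixes disappears entirely
        delete $ships->{$name} unless @{ $ships->{$name} };
    }
}
```

Because each ship is trimmed independently, different ships can keep different amounts of history.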
    Hope this has helped :)
      I think that this is the general direction I will try first. One of the things that occurred to me during this conversation is that the most likely historical query would be to pick a ship and plot its track over time. I don't think I would ever query by a geographical location (lat/long box or circle search). That doesn't make any sense to me (at least right now). If I have to do it later, I could always just dump the file into a DB and query it that way.

      I like the idea of being able to trim the historical data by ship/date, because I don't have to trim all ships to the same period.

Re: More than one way to skin an architecture
by roboticus (Chancellor) on Mar 17, 2008 at 22:57 UTC
    mcoblentz:

    That sounds like a really cool project.

    Just one thought: If you're wanting to have persistence, and you don't mind using DBI, why not just use a database (even a small one like SQLite)?
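A minimal sketch of that route, assuming DBI and DBD::SQLite are installed from CPAN; one flat table is enough here, with no keys to design (table and column names invented for illustration):

```perl
use strict;
use warnings;
use DBI;

# SQLite keeps the whole database in a single file -- nothing to install
# or administer beyond the CPAN modules.
my $dbh = DBI->connect('dbi:SQLite:dbname=ships.sqlite', '', '',
                       { RaiseError => 1 });
$dbh->do(q{
    CREATE TABLE IF NOT EXISTS sightings (
        ship TEXT, datetime TEXT, lat REAL, long REAL
    )
});

my $ins = $dbh->prepare('INSERT INTO sightings VALUES (?,?,?,?)');
$ins->execute('QE2', '20080318-0730', 10.33, -5.45);

# a ship's track, oldest fix first
my $track = $dbh->selectall_arrayref(
    'SELECT lat, long FROM sightings WHERE ship = ? ORDER BY datetime',
    undef, 'QE2');

$dbh->disconnect;
```

Dropping old data is then one DELETE with a datetime condition, and the same file ports straight to a web host.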

    ...roboticus

      Hi roboticus,

      Fair question. My initial, instinctive response was: 1) I don't have anything installed on the laptop and didn't really want to learn another program (I'm okay at SQL but not great by any means), and setting up primary and foreign keys, etc., really is out of my league at the moment; 2) I was thinking that I would move all of this to my website eventually and run it there, so text processing and Perl would be better -- but hey, I have MySQL on the site already, so...

      ... it's not a bad idea and I really should think about it. Queries would be just as simple; porting to a web host would be fairly easy; and dropping old data wouldn't be that hard.

      Food for thought.

Re: More than one way to skin an architecture
by jethro (Monsignor) on Mar 18, 2008 at 04:27 UTC
    If you are looking for simple hack-friendly solutions, you can serialize (i.e. store) arbitrary data structures to disk and read them again the next time with something like YAML::LibYAML.
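A minimal sketch of that round trip, assuming the YAML-LibYAML distribution (which provides YAML::XS) is installed from CPAN:

```perl
use strict;
use warnings;
use YAML::XS qw(DumpFile LoadFile);   # from the YAML-LibYAML distribution

# serialize the whole hash of arrays to a human-readable file...
my %SHIPS = (
    QE2 => [ { datetime => '20080318-0730', lat => 10.33, long => -5.45 } ],
);
DumpFile('ships.yml', \%SHIPS);

# ...and slurp it back next run
my $restored = LoadFile('ships.yml');
```

The file on disk is plain indented text, so it can be inspected or hand-edited between runs.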

    As long as your program needs to access the whole database anytime it wants to draw the earth you might as well slurp the entire hash in one go from file to memory (or use one file per day if you want a nice way to segment the data or invalidate or overwrite old data) instead of using a database.

    The best thing: the data on disk is human-readable, so there is always the possibility of using emacs, grep, and less to edit, view, or search your data.

    Performance should be a bit slower than Tie, but you don't have to think about how to store more complicated data structures like hashes of arrays. YAML does that for you.

    Also this solution is a bit low on scalability, but as long as your data doesn't go into tens of megabytes you are on the safe side. And you can always upgrade to a database if you need to scale up or need more 'random access' to the data.

Re: More than one way to skin an architecture
by chrism01 (Friar) on Mar 18, 2008 at 06:36 UTC
    Given that you are talking about multiple data types eg boats, "earthquakes (NRT from the USGS); volcanoes; satellites; clouds and storms; and even some airplanes", and that you say you've got MySQL already, I'd go with that.
    It'll payoff and it's not difficult to do a simple database.
    Don't use foreign keys for now, just use some indexes.

    Cheers
    Chris

Re: More than one way to skin an architecture
by sundialsvc4 (Abbot) on Mar 18, 2008 at 23:08 UTC

    One thing to keep in mind is that “memory” is, at least potentially, “a disk file.”

    The actual degree to which it is so depends entirely upon the amount of RAM and the other workload that the computer is doing, but as a matter of general principles if you're dealing with a large amount of information you need to be mindful of just how large it is.

    “Large in-memory hashes” can be problematic because of the nature of the hashing-algorithm. It can cause widely-scattered memory references and this can make your “working set” large, in other words, excessive paging. On the other hand, if you know that the target machine has gobs of chips and not much else to do, it might be a non-issue. (“Just throw silicon at it.”   “If you've got it, flaunt it.”)

    As previously mentioned here, Perl offers the tie mechanism which actually allows you to specify how a “hash” is actually stored:   for instance, you can tie it to a Berkeley-DB file. So the syntax is that of a hash, but the implementation is disk file-access. The key issue here is to be aware of what your chosen implementation is going to do, and how it's going to behave on your hardware.

      Swap files are not persistent, so they don't really help the OP in persisting his data: "A day's data is about 10KB so this isn't going to get very large, even after a month; which is about all I want to save."


        Yeah. I generally agree with that statement. I also like Jethro's comment about maintaining the data in a human readable form (YAML comment above). I'm now thinking that persistence in a human readable form is the real need.

        The value of all this, at least to me, is that I get to sound out the questions against a body of people who can give me some answers and act as a sounding board. My wife just can't answer these questions.

        For that, I thank you all.

      Frankly, that question is the very reason I asked this question. Even though the dataset might not get very large, and therefore I could probably afford a "sloppy" approach to this problem, it seems prudent to really think this one through. Who knows? I might decide to keep a year's worth and then where would I be?

      I'm running this on my "work" laptop. It's got all the usual Office apps running - Outlook, Word, PPT, etc. - plus Firefox (which seems to be a memory hog, if you ask me). This planet program is grabbing the USGS data every 20 minutes, airplane data every 5 (if I turn that on), and updating the day/night terminator line every 5. Since this script is going to run once a day (well, ships actually update about 4x/day, but I don't know if I care that much), I don't want to drag this thing down just because I was sloppy about hunting down cruise ships.

      The resources for this problem are finite; I don't have all the silicon I would like (I wish I did!); I work in an enterprise software company and bad architecture just offends me.

      I didn't know that large hashes can cause a memory problem. It would be great to hear more about that kind of thing. And your comment,

      "The key issue here is to be aware of what your chosen implementation is going to do, and how it's going to behave on your hardware"
      is spot-on; except that I don't know what a given implementation might do - I'm new to Perl and therefore the question.
        You could always use a DBM file via the standard AnyDBM_File module. DBM files behave just like a hash, are dead easy to use, and are fast: simple key->value databases stored on disk. I used one with 250,000 price references once and it was dead fast; the in-memory hash version was very slow to load. Updating with new CSVs would be easy too.
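A minimal sketch of that approach using only core modules. AnyDBM_File picks whichever DBM library is available; note that DBM values must be plain strings, so a nested structure would need to be flattened (here, one joined string per ship):

```perl
use strict;
use warnings;
use Fcntl;         # for the O_RDWR / O_CREAT flags
use AnyDBM_File;   # uses whichever DBM implementation is installed

# ties a hash to an on-disk key/value database; the 'positions'
# file name is arbitrary
tie my %position, 'AnyDBM_File', 'positions', O_RDWR | O_CREAT, 0640
    or die "Can't tie DBM file: $!";

# values are plain strings, so pack the fields into one string...
$position{'QE2'} = '20080318-0730,10.33,-05.45';

# ...and split them back apart when reading
my ($datetime, $lat, $long) = split /,/, $position{'QE2'};

untie %position;
```

The data survives between runs in the tied file, so updating from a new day's CSV is just a loop of assignments.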
Re: More than one way to skin an architecture
by Anonymous Monk on Mar 20, 2008 at 21:57 UTC
    This seems to me to be more a question of program architecture than of implementation. Being about two years of full-time Perl programming in, I try to think about my data structures first, before I design the code.

    Given that you are taking input from planes, ships, earthquakes, weather, etc., the DATA MODEL that will be the core of your design -- and your persistent data store -- should be a single table of all of the common factors (lat, long, time) plus some "type" that uniquely specifies the special rendering of that item. KISS!

    Two things come up here: this is a single table, probably 2-D, and you can implement the VIEW as a set of queries on this table (row by row?), with specific routines to render the different types. I am struggling to think of a better conceptual way to do this than:

        SELECT name, type
        FROM global_data
        WHERE lat = 45%          -- i.e. iterate row by row
          AND long = 15%
          AND time < 28 days
        ORDER BY time DESC

    into two arrays:

        my @event_name = ();
        my @event_type = ();

    ... err, and just display item '0' from the array (you can only render the top / most recent). Icons, colours, etc. can be chosen to reflect your desired view. Then you can build filters that get / scrape data to load the table, and that display data according to its type.

    So, even if you did not have any SQL experience, I think that this would be the first port of call to help Structure your Query. And then I would bear in mind that this needs (i) one well-designed table, and (ii) no transactions, nothing performance limiting, no indexes. And, of course, there are great DBI implementations and MySQL for PCs, Linux, etc.