
Complex Data Structures

by random (Monk)
on Jun 18, 2002 at 16:11 UTC ( #175409=perlmeditation )

Recently, while doing summer research in Bioinformatics (I'm an undergrad dual-majoring in Computer Science and Biology) I had cause to explore Perl's references. Up until now, I had never needed an especially complex data structure. Perl's basic data structures have always served my needs adequately, and to be honest, I found Perl's referencing and dereferencing a bit confusing. Even the structure posted below isn't exceedingly complex, but it did spark my curiosity. What was the most complex data structure you ever tried constructing, and why? Another thing I've noticed is that it's often fun to use complex structures, but possible (and even simpler) to use several simpler ones. Anyone else noticed this? How much time / effort have you expended creating a huge structure, only to realize that it makes much more sense to break it up?

use strict;
my ($key, $value, %massive, $element, $x, $y);

open(CONSERVE, "");
while (<CONSERVE>) {
    my @params;
    chomp($_);
    @params = split /\t/, $_;
    $massive{$params[0]}->[int($params[1] / 1000000)] = \@params;
}
close(CONSERVE);

open(PARAM, "");
while (<PARAM>) {
    my @params;
    chomp($_);
    ($key, $value, undef, @params) = split /\t/, $_;
    push(@{$massive{$key}->[int($value / 1000000)]}, @params);
}
close(PARAM);

open(TSC, "");
while (<TSC>) {
    chomp($_);
    ($key, $value, undef, $element) = split /\t/, $_;
    push(@{$massive{$key}->[int($value / 1000000)]}, $element);
}
close(TSC);

open(NIH, "");
while (<NIH>) {
    chomp($_);
    ($key, $value, undef, $element) = split /\t/, $_;
    push(@{$massive{$key}->[int($value / 1000000)]}, $element);
}
close(NIH);

open(FILEOUT, "");
foreach $x (values(%massive)) {
    foreach $y (@{$x}) {
        print FILEOUT join("\t", @{$y}) . "\n";
    }
}
close(FILEOUT);

I have, of course, stripped system-specific information from this (hence the empty filenames in the open() calls), but it was a script I wrote recently for combining multiple data sets into one coherent file. Though I probably won't be revising it, as its usefulness has passed, comments on the code (in addition to comments about the questions above) are welcome.


Replies are listed 'Best First'.
Re: Complex Data Structures
by thelenm (Vicar) on Jun 18, 2002 at 17:00 UTC
    When my data structures start getting too complex, I usually see if there's a way I can pull some of the data manipulation (and even the data itself) into a module. Many times I've come back to a program after a few months and said, "Well, it looks like a hash of arrays of hashes of hashes, but what the heck does that mean?" I find it helpful to use functions or object methods to manipulate those data so that I can say something like
    $purchase_table->add_purchase($customer, $basket, $purchase_id, $value);
    instead of
    $purchases->{$customer}->[$basket]->{$purchase_id} = $value;
    That way, I not only understand exactly what's going on, but I'm also less likely to screw myself up by manipulating the large structure differently in different places in the code. And I can change the actual implementation without changing the function/method calls.
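
    A minimal sketch of that idea, assuming a made-up PurchaseTable class (the method names come from the example above; the internals are illustrative, not anyone's real module):

    ```perl
    package PurchaseTable;
    use strict;
    use warnings;

    sub new { bless { purchases => {} }, shift }

    # Hide the hash-of-arrays-of-hashes layout behind one named method.
    sub add_purchase {
        my ($self, $customer, $basket, $purchase_id, $value) = @_;
        $self->{purchases}{$customer}[$basket]{$purchase_id} = $value;
    }

    sub get_purchase {
        my ($self, $customer, $basket, $purchase_id) = @_;
        return $self->{purchases}{$customer}[$basket]{$purchase_id};
    }

    package main;
    my $purchase_table = PurchaseTable->new;
    $purchase_table->add_purchase('alice', 0, 'p42', 19.95);
    print $purchase_table->get_purchase('alice', 0, 'p42'), "\n";   # 19.95
    ```

    The nesting can later change (say, to a database lookup) without touching any caller.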

    And as jeffa mentioned, Data::Dumper is a lifesaver. :-)

    -- Mike


Re: Complex Data Structures
by FoxtrotUniform (Prior) on Jun 18, 2002 at 16:20 UTC

    One of the in-house modules I work with (and wrote) has arrayrefs and hashrefs nested about six deep in places. It's daunting when you look at it as a whole, but if you deal with it as an aggregate of dependent parts, rather than as a monolithic whole, it's quite manageable.

    I tend to find that keeping track of several parallel "simple" data structures is more complex and error-prone than using a single, "complex" one. Inevitably, I modify one of the related structures without modifying the others in sympathy, and everything falls apart. (Maybe I'm misunderstanding your point?)

    The hell with paco, vote for Erudil!

      but if you deal with it as an aggregate of dependent parts, rather than as a monolithic whole, it's quite manageable.

      I tend to find that keeping track of several parallel "simple" data structures is more complex and error-prone than using a single, "complex" one.

      I concur. Parallel data structures are hard to maintain (a maintenance coder won't know everywhere to change) and are usually confusing. So long as there is a consistent and logical structure to the data, its complexity is usually not an issue. Although I have to admit that I did write Data::BFDump to dump complex and self-referential data structures in a more intuitive way, since when viewed with Data::Dumper even a fairly simple data structure can end up looking like a plate of spaghetti.

      Yves / DeMerphq
      Writing a good benchmark isn't as easy as it might look.

(jeffa) Re: Complex Data Structures
by jeffa (Bishop) on Jun 18, 2002 at 16:39 UTC
    I really can't recall the most complicated data structure that i have worked with off hand, i just would like to comment about Data::Dumper. I always use it when dealing with data structures. Always! :)
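
    For instance, a minimal sketch (the hash shape mirrors the %massive structure in the root node; the sample data is made up):

    ```perl
    use strict;
    use warnings;
    use Data::Dumper;

    # Made-up sample data, shaped like the %massive hash in the root
    # node: key => array of 1 Mb buckets, each holding split fields.
    my %massive = (
        chr1 => [ [ 'chr1', 500_000, 'scoreA' ] ],
    );

    # One call shows the whole nested structure at a glance.
    print Dumper(\%massive);
    ```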


    (the triplet paradiddle with high-hat)
Re: Complex Data Structures
by samtregar (Abbot) on Jun 18, 2002 at 18:00 UTC
    I think the most complex data structure I've ever created was for HTML::Template. The module has two main structures: an array of ops including variables, text, loops and conditionals (@stack), and a hash mapping parameter names to the corresponding variable or loop (%map). The trick is that both the map and the stack point to the same underlying storage. This allows me to do something like this to set the value for a variable:

    ${$map{var_name}} = "text for var_name";

    Then when I get to the place where I want to use the variable on the stack I just do:

    $output .= ${$stack[$i]};

    Since the same scalar is referenced from both @stack and %map I can get access to them both quickly and with no copying required.
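
    The shared-storage trick can be sketched in a few lines (the variable names echo the description above; this is illustrative, not HTML::Template's actual internals):

    ```perl
    use strict;
    use warnings;

    # One scalar, referenced from both the op stack and the name map.
    my $storage;
    my @stack = ( \$storage );               # ops in template order
    my %map   = ( var_name => \$storage );   # lookup by parameter name

    # Setting through the map...
    ${ $map{var_name} } = "text for var_name";

    # ...is visible through the stack, with no copying.
    my $output = '';
    $output .= ${ $stack[0] };
    print "$output\n";   # text for var_name
    ```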

    Now, setting up this structure is indeed a royal pain in the ass but the payoff in output speed makes it all worthwhile. (Of course, these days all the real speed demons have moved on to HTML::Template::JIT!)


Re: Complex Data Structures
by hossman (Prior) on Jun 18, 2002 at 21:05 UTC
    Another thing I've noticed is that it's often fun to use complex structures, but possible (and even simpler) to use several simpler ones. Anyone else noticed this? How much time / effort have you expended creating a huge structure, only to realize that it makes much more sense to break it up?

    Let's consider a hypothetical scenario: I'm going to implement a Genealogy application in Java. I'm going to have things like "Person" and "Place", and relationships like "child of" and "married to" and "born in". I decide to implement this application using new classes called "Person" and "Place". To handle the parent/child relationship, I'm going to have getParentId() and getListOfChildIds() methods in my Person class, and you can use those Ids to look up the Object in a global hash table of all people -- which must constantly be kept up to date. This is kind of a pain in the ass, but I'm going to do it this way, because I don't want to have a lot of references. It's easier if all of my objects just store Ids, which I use as keys in simple global lookup tables.

    Almost any Java programmer should realize this is ridiculous.

    Perl is the same way. Regardless of whether you are using true "blessed" objects, or just using hashes or arrays to represent your entities, using references to connect those entities together and show relationships makes a lot of sense. Is it easy to read code like this?

    $yak = $${$foo{'bar'}}{'baz'}[0];

    No, but it's not much worse than this:

    Integer yak = (foo.get("bar").get("baz"))[0];

    Which is why you don't write code like that if you want people to read it. Instead, you write code that dereferences things into temporary variables with names that indicate what they are and how they are used. And if that's still not readable enough, you start to bless your hashes/arrays of nested references, and write some methods that treat them like objects so your code is even more readable.

    Updated: stupid typo.
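
    That advice about named temporaries can be sketched like so (%foo here is made up, with a simpler shape than the one-liner above):

    ```perl
    use strict;
    use warnings;

    # Made-up nested structure: entity => hashref => arrayref of values.
    my %foo = ( bar => { baz => [ 42 ] } );

    # Instead of one dense dereference, pull intermediate refs into
    # temporaries whose names say what they are.
    my $bar_record = $foo{bar};            # hashref for the 'bar' entity
    my $baz_list   = $bar_record->{baz};   # arrayref of baz values
    my $yak        = $baz_list->[0];
    print "$yak\n";   # 42
    ```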

Re: Complex Data Structures
by Chmrr (Vicar) on Jun 18, 2002 at 20:24 UTC

    I think it's either seven or eight nested hashes and arrays -- t'was for the project mentioned here, as well as here. I was loading the arbitrarily-deep XML file into memory once at startup, and thereafter running over it, munging it appropriately. For me, the huge data structure was really the most convenient layout -- though, by the end, I was getting a funny feeling it might have been a tiny bit easier in parts (and more buzzword compliant) if I'd used some OO. Oh, well.

    perl -pe '"I lo*`+$^X$\"$]!$/"=~m%(.*)%s;$_=$1;y^`+*^e v^#$&V"+@( NO CARRIER'

Re: Complex Data Structures
by rattusillegitimus (Friar) on Jun 19, 2002 at 13:28 UTC

    Lately, I've been dealing with large, complex data structures using XML. Mostly I'm trying to teach myself the mysteries of XML and XSLT and determine which of the various XML-related Perl modules I like the best ;) And because most of my programming provides a back-end for dynamic web pages, the XML/XSLT combination has proven ideal.

    One of my current projects involves a monstrous database of authors and books, including just about any information relating to them I can imagine. I'm pulling the data from several different tables and smashing it together into a single XML structure that can be relatively easily made into a web page using XSLT. For me, using XPath to pull out pieces of the data structure to manipulate feels more intuitive than a complex set of hashes and/or references.


Re: Complex Data Structures
by snafu (Chaplain) on Jun 19, 2002 at 16:22 UTC
    I sooo agree with everyone that stated the use of Data::Dumper. Don't leave vim without it!

    For me? Well, complex data structures have been something I've been getting to know intimately for this last project I've been working on. Mine haven't been nearly as daunting as others' here probably, but they have been challenging.

    I've been developing a script whose purpose is to stage literally hundreds of thousands of files to a temporary location to be archived off to CD. The tough part was that these files are stored in locations that describe where we got them (I work for a telco, so by this I mean which city they came from, what kind of carrier they are, what kind of billing we did on the file, etc.). Additionally, there are files in those directories that were used to process the files we really need to archive off. Now, the most important part was that these files usually have datestamps embedded in the file names. Oh, but we can't have a universal naming scheme...nooo, so the date formats differ from file to file, and indeed there are even some files that have no datestamps in them at all. We stage the files based on where the files come from (or file type) and what the date of the file is.

    Alright, so my complex data structures come from two things. The first structure comes from the config file I invented for the script to read in order to know where to find the files, what kind of files they are, and where the files should go. That structure is a hash ref of a hash of hashes of arrays. The next structure I used is the actual file data such as filename, source, dest, date, and type which is only a hash ref of a hash of arrays.

    The data structure was not nearly as daunting as the code I had to write to keep the date formats for the destinations uniform. Since the dates in the filenames were not uniform, I had to devise a way to read the dates in and match them up to a date format specified in the config file on a per-RE basis. I then took that format and that date and reformatted the date to conform to a date mask we use when we create the directories for archiving. Geez, that took me a long time to work out. It works great in the context of the script but is fragile (easy to break) if used from the command line.
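
    The general shape of that mask-driven reformatting might look something like this (the masks, the sub name, and the default year are all hypothetical, not the actual script):

    ```perl
    use strict;
    use warnings;

    my $default_year = '2002';   # assumed when a mask carries no year

    # Take a date string and the mask it was captured under, and
    # rewrite it to the default destination mask, "mmddyyyy".
    sub reformat_date {
        my ($date, $mask) = @_;
        my %part;
        if ($mask eq 'mmdd') {
            @part{qw(mm dd)} = $date =~ /^(\d\d)(\d\d)$/;
            $part{yyyy} = $default_year;
        } elsif ($mask eq 'yyyymmdd') {
            @part{qw(yyyy mm dd)} = $date =~ /^(\d{4})(\d\d)(\d\d)$/;
        } else {
            return;   # unknown mask
        }
        return "$part{mm}$part{dd}$part{yyyy}";
    }

    print reformat_date('0618', 'mmdd'), "\n";         # 06182002
    print reformat_date('20020618', 'yyyymmdd'), "\n"; # 06182002
    ```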

    Here is an example of the config file:

    ## Set up default environment
    ## allowed options for define{} blocks
    #   source
    #   destination
    #   unkdest
    #   rule_once
    define {
        source      = "/u20/home/gvc/gvc_dtfr"  # Location of source ports.
        destination = "/u90/gvc_archive/new"    # Destination for all ports.
        unkdest     = "/u90/gvc_archive/tmp"    # Where should files go if
                                                # a destination cannot be
                                                # determined or created for
                                                # a port.
        default_mask= "mmddyyyy"                # Default date mask for the
                                                # dest directories.

        # If this rule is not NULL then only the rules
        # specified in this variable will be run and
        # the rest of the rules will be ignored.
        # See the conf.doc for more on this.
        # This next should be a rule or a
        # list of rules delimited by commas.
        rule_once   = "5"
    }

    #####*******#####
    ## default macros
    #####*******#####
    ## allowed options for rules!
    #   source
    #   destination
    #   unkdest
    #   test_only
    #   regex
    #   port
    #   macro

    # Arbor AMA files.
    # matches F*-P*.####.ama
    # datefield is the number between P*. and .ama
    macro arbor_ama {
        regex = "F.*?-P.*?\.(\d+)\.ama:::$1:::mmdd"
    }

    rule 5 {
        port  = "150,152"
        macro = "arbor_ama,usl1,usl2,uslnull,rpt,arbor1_1";
    }
    Then part of the code where I am working through one of these structures is:
    # Walk the hash of complex data structures called $macros
    # and $rules.
    # starting the traversal of %$rules
    while ( my ($rule_key, $rule_val) = each(%$rules) ) {
        # Now, breaking out the hash references from $rule_val
        while ( my ($nkey, $nval) = each(%$rule_val) ) {
            # Simple enough, if the key is "macro" then we have found
            # our macros in the complex data structure.
            if ( $nkey eq "macro" ) {
                # Now, we need to start to traverse the %$macros structure
                # and we will do the merge.
                while ( my ($macro_key, $macro_val) = each(%$macros) ) {
                    # Now, walk the array reference that was contained in
                    # the reference $nval (a reference to an array)
                    for ( @$nval ) {
                        # Now, if the key from the macro hash matches the
                        # rule that is referencing a macro then...
                        if ( $macro_key eq $_ ) {
                            # Replace the macro name with the actual regex
                            # from the macro.
                            push(@{$rules->{$rule_key}{regex}}, $_)
                                for @{$macros->{$_}{regex}};
                            # Now that we have the macros mapped to the
                            # rules we can drop the macros from the rules
                            # hashes since they are dead weight now anyway..
                            delete($rule_val->{macro});
                        }
                    }
                }
            }
        }
    }
    Well, this is probably way more than you asked for but this project is way fresh on my mind and so I couldn't stop myself. :)

    To sum up I have come to really respect complex data structures. I have found that they can seriously shorten a task if they are used properly. I really can't come up with words to express how much I appreciate complex data structures. They are great!

    _ _ _ _ _ _ _ _ _ _
    - Jim
    Insert clever comment here...

Node Type: perlmeditation [id://175409]
Approved by VSarkiss
Front-paged by jarich