anonymized user 468275 has asked for the wisdom of the Perl Monks concerning the following question:

At a unix operations department which serves development projects, work arrives as a word document. When the work is done, the word document needs to be updated to reflect the actual solution applied. I had a pile of 76 documents with sufficiently similar requirements that I could create a list of 76 lines of CSV and write a Perl script to do the 76 pieces of work. Afterwards there remains about 10 days of paperwork to be done (about 1 hour per request). So I was given the choice: either sit and type for ten days or write more Perl scripts to do it for me.

One of the requirements is to add a row to the change history table in the document. Having found out that .docx files are a zipped set of mainly xml files, but that Archive::Zip cannot extract to Windows, I used Archive::Zip::Member to open each file as a filehandle and for the file containing the document body:- Document.xml, (the others just get written into a new directory for rearchiving later into a new docx file), I fed its contents to XML::Parser, getting an array of deeply nested arrays and hashes back. Several arrays deep in the structure comes a "w:tbl" tag followed by an array reference to where the table starts. But I need to match a string several further levels deeper to know whether I have found the table I want to update.

So I have a fledgeling docx package with a trivial new method and two recursive traversal routines that check the ref() for how to traverse each element found. I devised an addressing mechanism as : { treetop => $arrayref, immedparent => $someref, ixork => $ArrayIndexOrHashKey } and the recursive routines aim to return the addresses for any matches to a search string.

It didn't work first time (although it does compile and do some kind of traversal already) and owing to the ridiculous amount of empty air in these structures, it takes ages to step my way through with the debugger to find out why my search routines don't do anything useful.

Meanwhile, while I headbang my way through this for another day or two, I wondered if anyone has a better idea (e.g. module(s) suggestion) for how to traverse an arbitrarily deep and baggy array of arrays and hashes looking for two matches and recording a) the reference of the next array element after the first match, b) the reference of the parent of the second match and the c) index or key within parent of the second match.

Thanks in advance!

-S

One world, one people

  • Comment on Traversing arbitrarily deep and baggy structures

Replies are listed 'Best First'.
Re: Traversing arbitrarily deep and baggy structures
by ikegami (Patriarch) on Feb 16, 2011 at 18:17 UTC
    There's already a query language that's well suited for searching XML: XPath. I use XML::LibXML for XML parsing, and it supports XPaths via findnodes and the like.
      Although this has plenty of bells and whistles, in my opinion it doesn't add enough crucial stuff to justify the slow performance. Larry Wall's XML::Parser already creates a perfectly good tree. The chosen task is to traverse that tree.

      One world, one people

        Slow? XML::Parser is a turtle in comparison! XML::LibXML is 20x faster than XML::Parser via XML::Simple. (XML::Parser alone doesn't create a tree.)

        Either way, there is a generic XPath module that can be adapted to any tree structure.

Re: Traversing arbitrarily deep and baggy structures
by ELISHEVA (Prior) on Feb 16, 2011 at 18:47 UTC

    I'd like to second ikegami's suggestion and recommend that you take a serious look at Xpath and XSLT. There isn't much point in writing your own pattern matcher when there is a really good one out there already. XSLT can be used to transform the document once you've found the nodes of interest. However, in order to use them, you'll need to understand your document's structure, so perhaps that is putting the cart before the horse.

    I devised an addressing mechanism as : { treetop => $arrayref, immedparent => $someref, ixork => $ArrayIndexOrHashKey }

    I'm not sure I understand your addressing scheme - are you pairing elements (node, immediate parent)? When I traverse arbitrarily deep structures I usually use a simple array to store the descent path, like this:

    sub recurse { my ($param1, $param2, ...., $aPath) = @_; $id = getIdOfCurentNode(); .... do some stuff ... if ($bNeedToRecurse) { push @$aPath, $id; recurse($param1, $param2, .... $aPath); pop @$aPath; } elsif ($bSomeProblemFound) { print "Error Blah found at node=<@$aPath>\n"; } }

    it takes ages to step my way through with the debugger to find out why my search routines don't do anything useful.

    Although I feel a bit awkward about pushing my own modules, I'd also like to suggest a module that I recently released on CPAN, Exception::Lite. I wrote it because I do a lot of recursive programming and needed a better way to see the stack when there is an error in my code. I too found walking through a debugger tedious and way too time consuming.

      The full path to the match isnt needed - but I like your module and will give it serious thought.

      One world, one people

Re: Traversing arbitrarily deep and baggy structures
by SuicideJunkie (Vicar) on Feb 16, 2011 at 18:09 UTC

    Some sample data would come in handy.

    I would try a search function that takes a ref and uses ref to see if it is a hash, array or scalar. Then use that info to decide if it is what you're looking for, and if not, recurse on the contents. Eventually, return either undef, or a reference to the element in the structure you wanted and go from there.

    For debugging, I'd suggest Dumpering the structure to a file, and building a string in the recursion to show where you are at the moment for printing useful comments.

    It can also be handy to indent your prints to follow the recursion.

    findthing($root, '$root->', ''); sub findthing { my $node = shift; my $debugString = shift; my $indent = shift; if (ref $node eq 'ARRAY') { for (0..$#$node) { my $result = findthing($node->[$_], $debugString . "\[$_\]", $in +dent .' '); return $result if defined $result; } print $indent . "No luck in array at $debugString\n"; }elsif (ref $node eq 'HASH') { for (keys %$node) { my $result = findthing($node->{$_}, $debugString . "\{$_\}", $in +dent . ' '); return $result if defined $result; } print $indent . "No luck in hash at $debugString\n"; } ... return undef; }
      I think much more than this suggestion is already outlined in the OP. As for sample data - the OP also explains how to get it - it really is too nasty to post!

      One world, one people

Re: Traversing arbitrarily deep and baggy structures
by locked_user sundialsvc4 (Abbot) on Feb 16, 2011 at 21:55 UTC

    ++ on the idea of using XSLT in this case.

    You said yourself that this was an “arbitrary and baggy structure.”   You said that you are looking for a match “anywhere in the tree.” Well, that problem has already been solved by XSLT, in the form of a search-path that begins with '//'.   For example.   So, in my humble, the “inefficiency” (for the already blisteringly-fast electronic computer) of using the generic solution, handily defeats its alternative:   Perl code that you have to write and to constantly maintain.   The time that we seek to save here is not the computer’s ... it is yours.   JM2CW.   Your Mileage May Vary.™

Re: Traversing arbitrarily deep and baggy structures
by anonymized user 468275 (Curate) on Feb 17, 2011 at 12:52 UTC
    I looked at the XSLT idea that three people suggested. I even went to various XSLT websites. It seems to represent an extraordinary deviation from my requirement with no clear way back to it. Also, I just found XML::Traverse::ParseTree, which might well be the answer.

    One world, one people

      XSLT is a template system. It's not for data extraction. The useful part of XSLT is its use of XPaths, but there's no reason to use XSLT to use XPaths.