gmagklaras has asked for the wisdom of the Perl Monks concerning the following question:

The story

I am working on a prototype DSL and I have crafted a small-ish converter that reads us validated XML and throws up transformed SQL code based on the source markup.

You can find the entire code for reference at the LUARM/ITPSL SVN repository


A typical XML input markup is given below:
<?xml version="1.0"?> <itpslsig xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <itpslheader> <signid> 5938724b6b41a834ac695529dd104ed0 </signid> <signdate> <year>2010</year> <month>12</month> <day>20</day> </signdate> <ontology> <reason>intentional</reason> <revision>1.0</revision> <user_role>ordinary_users</user_role> <detectby>multi</detectby> <multihost>no</multihost> <hostlist>proteas,dionisos,slart,cn1,panoptis</hostl +ist> <weightmatrix>3,10,20,70</weightmatrix> <os>linux</os> <osver>2.6</osver> <keywords>DoS software install DoS loiq </keywords> <synopsis> This signature predicts the usage of the +Low Orbit Ion Cannon tool for DDoS attacks. </synopsis> </ontology> </itpslheader> <itpslbody> <mainblock> <mainop>as_a_result_of</mainop> <subblock> <subop>AND</subop> <fileexists> <filename>loiq</filename> <type>executable</type> <location>OR (#userhome#/*,/site/*,/tmp/*,/tem +p/*)</location> <singlefile>yes</singlefile> <ownedbyuser>johnc</ownedbyuser> </fileexists> <fileexists> <filename>loiq.pro</filename> <type>textdata</type> <location>OR(#userhome#/*,/site/*,/tmp/*,/temp +/*)</location> <ownedbyuser>johnc</ownedbyuser> <singlefile>yes</singlefile> </fileexists> <fileexists> <filename>loiq.qrc</filename> <type>textdata</type> <location>OR(#userhome#/*,site/*,/tmp/*,/temp/ +*)</location> <singlefile>yes</singlefile> <ownedbyuser>johnc</ownedbyuser> </fileexists> </subblock> <subblock> <subop>single</subop> <userexec> <username>johnc</username> <name>OR (file-roller,tar,bunzip2)</name> <path>OR(/usr/bin/,/usr/local/bin)</path> <singleprocess>yes</singleprocess> <argumentlist>loiq*.bz2</argumentlist> <pattern>any</pattern> </userexec> </subblock> <subblock> <subop>single</subop> <fileexists> <filename>*</filename> <type>any</type> <location>OR (#userhome#/.mozilla/*,#userhome# +/.opera)</location> <singlefile>yes</singlefile> <withcontents> <stringsearch>"http://sourceforge.net/proj +ects/loiq"</stringsearch> </withcontents> <ownedbyuser>johnc</ownedbyuser> </fileexists> </subblock> </mainblock> </itpslbody> </itpslsig>

In short, the program creates an XML::Twig::XPath-ed structure with handlers. The problems start with calling the 'parsesubs' subroutine:
my $twig = new XML::Twig::XPath( TwigHandlers => { #ITPSL header parsing data "/itpslsig/itpslheader/ontology/weightmatrix" => \&getwm, "/itpslsig/itpslheader/ontology/detectby" => \&getdetectmethods, "/itpslsig/itpslheader/ontology/os" => \&getos, "/itpslsig/itpslheader/ontology/osver" => \&getosver, #ITPSL body parsing data "/itpslsig/itpslbody/mainblock/mainop" => \&getmainop, "/itpslsig/itpslbody/mainblock" => \&getnoofsubblocks, "/itpslsig/itpslbody/mainblock/subblock" => \&parsesubs, }); # parse, handling nodes on the way $twig->parsefile( shift @ARGV );

Part of the 'parsesubs' subroutine calls via if statements other subroutines to localize the parsing of specific markup directives. The anonymous array reference ($sblockstack) is my crude way of having a stack of parsed directives (an array of arrays, each row representing the parsed directives per sub-block).
The problem
When I execute the code on a Linux based Perl 5.12 system, what I get (excluding harmless warnings and with the help of Data::Dumper) is the following:
Called parsesubs Directive is fileexists Pushing to sblockstack [0] [0] Directive is fileexists Pushing to sblockstack [0] [1] Directive is fileexists Pushing to sblockstack [0] [2] Called parsesubs Directive is userexec Pushing to sblockstack [1] [0] Called parsesubs Directive is fileexists Pushing to sblockstack [2] [0] $VAR1 = [ [ '##STARTOP:fileexists', '##operand:type:executable', '##operand:location:OR (#userhome#/*,/site/*,/tmp/*,/temp/ +*)', '##operand:singlefile:yes', '##operand:ownedbyuser:johnc', '##operand:filename:loiq.pro', '##operand:type:textdata', '##operand:location:OR(#userhome#/*,/site/*,/tmp/*,/temp/* +)', '##operand:ownedbyuser:johnc', '##operand:singlefile:yes', '##operand:filename:loiq.qrc', '##operand:type:textdata', '##operand:location:OR(#userhome#/*,site/*,/tmp/*,/temp/*) +', '##operand:singlefile:yes', '##operand:ownedbyuser:johnc', '##ENDOFOP##', '##operand:filename:loiq', '##operand:type:executable', '##operand:location:OR (#userhome#/*,/site/*,/tmp/*,/temp/ +*)', '##operand:singlefile:yes', '##operand:ownedbyuser:johnc', '##operand:filename:loiq.pro', '##operand:type:textdata', '##operand:location:OR(#userhome#/*,/site/*,/tmp/*,/temp/* +)', '##operand:ownedbyuser:johnc', '##operand:singlefile:yes', '##operand:filename:loiq.qrc', '##operand:type:textdata', '##operand:location:OR(#userhome#/*,site/*,/tmp/*,/temp/*) +', '##operand:singlefile:yes', '##operand:ownedbyuser:johnc', '##operand:filename:loiq', '##operand:type:executable', '##operand:location:OR (#userhome#/*,/site/*,/tmp/*,/temp/ +*)', '##operand:singlefile:yes', '##operand:ownedbyuser:johnc', '##operand:filename:loiq.pro', '##operand:type:textdata', '##operand:location:OR(#userhome#/*,/site/*,/tmp/*,/temp/* +)', '##operand:ownedbyuser:johnc', '##operand:singlefile:yes', '##operand:filename:loiq.qrc', '##operand:type:textdata', '##operand:location:OR(#userhome#/*,site/*,/tmp/*,/temp/*) +', '##operand:singlefile:yes', '##operand:ownedbyuser:johnc', '##operand:filename:loiq', '##operand:type:executable', '##operand:location:OR (#userhome#/*,/site/*,/tmp/*,/temp/ +*)', '##operand:singlefile:yes', '##operand:ownedbyuser:johnc', '##operand:filename:loiq.pro', '##operand:type:textdata', '##operand:location:OR(#userhome#/*,/site/*,/tmp/*,/temp/* +)', '##operand:ownedbyuser:johnc', '##operand:singlefile:yes', '##operand:filename:loiq.qrc', '##operand:type:textdata', '##operand:location:OR(#userhome#/*,site/*,/tmp/*,/temp/*) +', '##operand:singlefile:yes', '##operand:ownedbyuser:johnc', '##operand:filename:*', '##operand:type:any', '##operand:location:OR (#userhome#/.mozilla/*,#userhome#/. +opera)', '##operand:singlefile:yes', '##operand:ownedbyuser:johnc' ], [ '##STARTOP:fileexists', '##ENDOFOP##' ], [ '##STARTOP:fileexists', '##ENDOFOP##' ] ]; $VAR2 = [ [ '##STARTOP:userexec', '##operand:name:OR (file-roller,tar,bunzip2)', '##operand:path:OR(/usr/bin/,/usr/local/bin)', '##operand:singleprocess:yes', '##operand:argumentlist:loiq*.bz2', '##operandpattern:any', '##ENDOFOP##' ] ]; $VAR3 = [ [ '##STARTOP:fileexists', '##ENDOFOP##' ] ];

Obviously the Data::Dumper output tells me that I have a problem in parsing the whole thing properly. Each directive is not terminated properly ( '##ENDOFOP##' string) and certain directives are not entered into their proper place on the array of arrays. A proper output I would expect/want to have by Data::Dumper would be like the following:
... Directive is fileexists Pushing to sblockstack [2] [0] $VAR1 = [ [ '##STARTOP:fileexists', '##operand:type:executable', '##operand:location:OR (#userhome#/*,/site/*,/tmp/*,/temp/ +*)', '##operand:singlefile:yes', '##operand:ownedbyuser:johnc', '##operand:filename:loiq', '##operand:type:textdata', '##ENDOFOP##' ], [ '##STARTOP:fileexists', '##operand:type:executable', '##operand:location:OR (#userhome#/*,/site/*,/tmp/*,/temp/ +*)', '##operand:singlefile:yes', '##operand:ownedbyuser:johnc', '##operand:filename:loiq.pro', '##operand:type:textdata', '##ENDOFOP##' ], [ '##STARTOP:fileexists', '##operand:type:executable', '##operand:location:OR (#userhome#/*,/site/*,/tmp/*,/temp/ +*)', '##operand:singlefile:yes', '##operand:ownedbyuser:johnc', '##operand:filename:loiq.qrc', '##operand:type:textdata', '##ENDOFOP##' ] ]; $VAR2 = [ [ '##STARTOP:userexec', '##operand:name:OR (file-roller,tar,bunzip2)', '##operand:path:OR(/usr/bin/,/usr/local/bin)', '##operand:singleprocess:yes', '##operand:argumentlist:loiq*.bz2', '##operandpattern:any', '##ENDOFOP##' ] ]; $VAR3 = [ [ '##STARTOP:fileexists', '##operand:singlefile:yes', '##operand:ownedbyuser:johnc', '##operand:filename:*', '##operand:type:any', '##operand:location:OR (#userhome#/.mozilla/*,#userhome#/. +opera)', '##operand:singlefile:yes', '##operand:ownedbyuser:johnc' ] ];

The Question
What am I doing wrong with this program structure? Can somebody suggest a structure that works?
Many thanks!
GM

Replies are listed 'Best First'.
Re: XML::Twig blues
by graff (Chancellor) on Jun 28, 2011 at 03:46 UTC
    I followed the link to your svn repository. It's good that you decided not to post all 1300+ lines of code here, but it's sad that you also decided not to boil things down to a minimal snippet of code to demonstrate the problem.

    I was able to run the code on the sample data provided in the OP, and got the same actual output as shown in the OP, but I think that your presentation of the expected/desired output is not quite consistent with the sample data. At least, if there is a way to get from the OP xml sample to the expected output as posted, it's quite unclear how that could happen.

    Probably the core of the problem is that you assume an array called @fexassembly gets freshly created every time "sub fileexists" is invoked, but since you have this array being used via closure in several other subroutines, I think those subs are using just the first instance of this array every time they get called. This is related to the "harmless warnings" you mentioned, which say things like "Variable "@fexassembly" will not stay shared at itpslc.pl line ..."

    In general, looking over the 1300+ lines of code, I think you're making this much harder than it needs to be. If the goal is to shove xml data into a DB table, you just need to decide which xml elements will constitute distinct rows, and set up your parser handlers so that you do a single insert every time you hit the end tags of appropriate elements.

    XML::Twig is already creating a (rather massive, bulky) data structure for you from the xml, so you shouldn't need to build additional structures containing the same data. You should also be able to do at least some amount of abstraction or generalization regarding the transform from xml elements to DB table rows. You shouldn't need a separate subroutine for every distinct low-level tag, since you end up doing basically the same operations for all of them.

    Actually, the fact that you have my $twig = new XML::Twig::XPath( TwigHandlers => ... in nine different places in your code probably indicates that there is a basic misunderstanding about how the task as a whole should be addressed.

    Bottom line: you seem to be doing so many things wrong here that you're probably better off ditching this attempt and starting over from scratch. Start with a description of the task that's as simple and direct as possible -- something like 'for every element of type ..., insert a row into table ...' If you lay out the plan for mapping container elements to rows, and their contained elements to columns, you'll find ways to generalize across the details. (Whether you actually connect to a DB and do inserts, or simply print SQL statements as output is your choice.)

    XPath might not be the best tool; a basic parser that fills slots in a data structure and inserts each completed structure to the database is what you want, I'm guessing.

    (updated to fix typo in next-to-last paragraph)

    Another update: Here's a simple approach using XML::LibXML; the same can presumably be done with other parsers, but maybe not so compactly (both in lines of code and in memory footprint)... The output might not be exactly what you were aiming for, but I think it's close enough that the differences are trivial.

    #!/usr/bin/perl use strict; use warnings; use XML::LibXML; use Data::Dumper 'Dumper'; my $xml = XML::LibXML->new(); my $doc = $xml->parse_file( "j.xml" ); my $pth = XML::LibXML::XPathContext->new( $doc ); my @subblocks; for my $sbnode ( $pth->findnodes( "/itpslsig/itpslbody/mainblock/subbl +ock" )) { for my $sbchild ( $sbnode->childNodes ) { next unless ( $sbchild->nodeName =~ /fileexists|userexec/ ); my %feitem = ( STARTOP => $sbchild->nodeName ); for my $fechild ( $sbchild->childNodes ) { $feitem{$fechild->nodeName} = $fechild->textContent; } my @features = map { "$_ => $feitem{$_}" } sort grep /^\w/, ke +ys %feitem; push @subblocks, [ @features ]; } } print Dumper( \@subblocks );
    Obviously you'll want to add stuff to cover lots of other issues, but if you follow the basic approach, you should end up with far fewer than 1300 lines of code, and it'll be a lot easier to maintain. (updated one last time to fix code tags)
      Many thanks for your answer and the time you spent to take a look at the code.

      I pasted quite a lot of code simply because Sourceforge (or Sourceforge(t) I should say) had, an outage in my area yesterday and I was not sure if people would get to the repo, being afraid that this will put people off for an answer.

      In terms of the problem, I have located the issue in 'parsesubs'. However, I do like your simpler approach, which does populate the 'subblocks' array with everything, which is not really what I want. No (R)DBMS access is required either. So, on the XML::LibXML code example you posted, how can I modify the loop so I can obtain three scalars, each containing separately the XML string of each subblock? This is what my main problem boils down to. (I know I should have done my reading, but I must admit I am not an accomplished LibXML-er.)

      Many thanks for any response!
      GM
        I know I should have done my reading,

        That's right, and (assuming you have XML::LibXML installed), you can still do it -- it's never too late to learn.

        but I must admit I am not an accomplished LibXML-er.

        Neither am I. I just spend some time looking at the docs, and I try stuff out till I get it working.

        However, I do like your simpler approach, which does populate the 'subblocks' array with everything, which is not really what I want.

        In a case like this, it's not really a problem to have more structural elements in a hash than you really need. There's less effort and more efficiency in treating all elements the same way, and then just using whichever ones you really need.

        how can I modify the loop so I can obtain three scalars, each containing separately the XML string of each subblock?

        If that's really what you want to do, eliminate the inner "for" loop, which goes over the nested child nodes, and use the value returned by $sbchild->textContent (or something to that effect -- I don't quite understand what you're asking for, really).

        Just RTFM, try some different stuff, see what happens, lather/rinse/repeat till you figure out what you really want. Or figure out what you really want first, then try different stuff till you get that. Whatever. Good luck.