A friend has a script that parses an XML file using XML::Twig. The 10MB input file contains 2000 structures that could be parsed in parallel. He has even bigger files, up to 100MB, and the script takes as long as 30 hours to parse them. I already advised him to check why it takes so long, but he wants to add threads to the script anyway to speed it up. He ended up with a script producing the error: Free to wrong pool 3080610 not 589260 at C:/Perl64/lib/XML/Parser/Expat.pm line 432.

We have minimized the script to the following:

#!perl -l
use XML::Twig;
use threads;
use Thread;

$t = XML::Twig->new(twig_roots => {managedObject => \&handle_fasade});
$t->parsefile('inputFiles/wcel3g.xml');

sub handle_fasade {
    my $currentTh = Thread->new( \&thrsub );
    $currentTh->join;
}

sub thrsub {
}

If I comment out the join or the parsefile call, or even replace my $currentTh = Thread->new( \&thrsub ); with my $currentTh = Thread->new( { return 0 } ); the error does not occur. What is wrong with this code?
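
One possible reading of the last observation, assuming the replacement was typed exactly as shown (braces without the sub keyword): in argument position Perl treats { return 0 } as an anonymous hash constructor, not as a code reference, so the return runs while the arguments are being evaluated and leaves the handler before any thread is created. The sketch below uses a hypothetical stand-in, fake_new, in place of Thread->new, so it needs neither threads nor XML::Twig.

```perl
use strict;
use warnings;

my @log;

# Hypothetical stand-in for Thread->new: it only records that it was reached.
sub fake_new {
    push @log, 'constructor reached';
    return 1;
}

# Braces without "sub": an anonymous hash whose content is "return 0".
# Evaluating the argument list executes the return, so this sub exits
# before fake_new is ever called.
sub with_braces {
    my $th = fake_new( { return 0 } );
    push @log, 'with_braces: after constructor';
    return 1;
}

# With "sub", a code reference is passed and the constructor is called.
sub with_sub {
    my $th = fake_new( sub { return 0 } );
    push @log, 'with_sub: after constructor';
    return 1;
}

my $r1 = with_braces();
my $r2 = with_sub();

print "with_braces returned $r1\n";
print "with_sub returned $r2\n";
print "log: $_\n" for @log;
```

If that is what happened, the variant without the error never started a thread at all, so it says nothing about whether an empty thread body is safe.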

Actually, the longer I spend preparing this node, the clearer it becomes to me that this approach makes no sense. Is it even possible to parse XML in parallel? I would say that XML parsing must be done in one thread, and that processing the data afterwards can be done in parallel. Am I right?
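
A minimal sketch of that split, under some assumptions: the worker threads are started before any parser object exists, so no XML::Parser state is cloned into them; the main thread alone runs XML::Twig and hands each managedObject over as a plain string through Thread::Queue. The worker count, the inline sample document and the toy processing step (rewriting the first distName component, as the real script does with $netType) are illustrative placeholders, not the original code.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $WORKERS = 3;                    # hypothetical worker count
my $in  = Thread::Queue->new;       # serialized elements, main -> workers
my $out = Thread::Queue->new;       # results, workers -> main

# Toy processing step: works on the serialized string only, so the
# workers never touch an XML parser.
sub worker {
    while (defined(my $chunk = $in->dequeue)) {
        my ($dist) = $chunk =~ /distName="([^"]*)"/;
        next unless defined $dist;
        $dist =~ s{^[^/]+}{MNE-1v1};
        $out->enqueue($dist);
    }
    return;
}

# Start the workers first, while no parser object exists yet.
my @workers = map { threads->create(\&worker) } 1 .. $WORKERS;

# Assumption: loading the parser only now keeps it out of the worker clones.
require XML::Twig;

my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<blah version="2.1" xmlns="blah.xsd">
  <someData type="actual" name="ActualConfiguration" id="1">
    <managedObject class="NOKFLF:FLF" distName="MNE-PET/FLF-1000" id="1"><p name="name">A</p></managedObject>
    <managedObject class="NOKFLF:FLF" distName="MNE-PET/FLF-1001" id="2"><p name="name">B</p></managedObject>
    <managedObject class="NOKFLF:FLF" distName="MNE-PET/FLF-1002" id="3"><p name="name">C</p></managedObject>
  </someData>
</blah>
XML

# Parsing stays in the main thread; no thread is created inside the handler.
my $twig = XML::Twig->new(
    twig_roots => {
        managedObject => sub {
            my ($t, $elt) = @_;
            $in->enqueue($elt->sprint);   # pass a string, not the object
            $t->purge;                    # release what has been handed over
        },
    },
);
$twig->parse($xml);

# One undef per worker tells it to stop, then wait for all of them.
$in->enqueue(undef) for @workers;
$_->join for @workers;

# Collect the results; arrival order is not guaranteed, so sort them.
my @results;
while (defined(my $r = $out->dequeue_nb)) {
    push @results, $r;
}
@results = sort @results;
print "$_\n" for @results;
```

Results come back in arbitrary order, so output that must keep document order would need a sequence number attached to each chunk. Whether this is any faster depends on how expensive the per-element processing is compared with the parsing itself.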

Update: fake input XML:

<?xml version="1.0" encoding="UTF-8"?>
<blah version="2.1" xmlns="blah.xsd">
  <someData type="actual" name="ActualConfiguration" id="1">
    <header>
      <log dateTime="2012-05-08T10:10:10" action="export"/>
    </header>
    <managedObject class="NOKFLF:FLF" distName="MNE-PET/FLF-1000" id="6666666000000093362" timeStamp="2012-04-16T18:17:50" vendor="XXX" version="S14">
      <extension name="system_parameters">
        <p name="$modifier">UNAUTHENTICATED</p>
        <p name="$state">operational</p>
      </extension>
      <list name="FLFOptions">
        <p>0</p> <p>1</p> <p>2</p> <p>3</p> <p>4</p> <p>5</p> <p>7</p> <p>8</p> <p>10</p> <p>12</p>
        <p>13</p> <p>16</p> <p>17</p> <p>20</p> <p>24</p> <p>25</p> <p>29</p> <p>31</p> <p>32</p> <p>34</p>
        <p>35</p> <p>36</p> <p>37</p> <p>41</p> <p>42</p> <p>45</p> <p>46</p> <p>47</p> <p>48</p> <p>50</p>
        <p>51</p> <p>54</p> <p>56</p> <p>61</p> <p>62</p> <p>68</p> <p>69</p> <p>72</p> <p>73</p> <p>74</p>
        <p>88</p> <p>96</p> <p>107</p> <p>108</p> <p>109</p> <p>117</p> <p>118</p> <p>120</p> <p>123</p>
      </list>
      <p name="name1">31</p>
      <p name="name2">31</p>
      <p name="name">BRLE8</p>
      <p name="name4">25</p>
      <p name="name5">50</p>
      <p name="name6">10</p>
      <p name="name7">80</p>
      <p name="name8">20</p>
      <p name="name9">100</p>
      <p name="nameA">20</p>
      <p name="nameB">2</p>
      <p name="xyz">1</p>
      <p name="dbf">0</p>
      <p name="battery1">30</p>
      <p name="cpu2">150</p>
      <p name="FLFType">10</p>
      <p name="lower">40</p>
      <p name="upper">60</p>
      <p name="releaseLimit">4</p>
      <p name="delay">5</p>
      <p name="connection1">14</p>
      <p name="connection2">7</p>
      <p name="connection3">12</p>
      <p name="connection4">12</p>
      <p name="connection5">14</p>
      <p name="disableExt">0</p>
      <p name="disableInt">0</p>
      <p name="frPenalty">3</p>
      <p name="emerC">1</p>
      <p name="extraXLSNumber">6</p>
      <p name="extraBSW">64</p>
      <p name="RelPri">1</p>
      <p name="epHoUse">0</p>
      <p name="frTchim">30</p>
      <p name="freeDowngrade">95</p>
      <p name="freeUpgrade">4</p>
      <p name="freqMeas">30</p>
      <p name="xCalc">0</p>
      <p name="param1">4</p>
      <p name="param2">5</p>
      <p name="param3">0</p>
      <p name="param4">0</p>
      <p name="param5">30</p>
      <p name="param6">0</p>
      <p name="param7">10</p>
      <p name="param8">127</p>
      <p name="param9">1</p>
      <p name="param10">0</p>
      <p name="param20">255</p>
      <p name="param30">0</p>
      <p name="dparam1">150</p>
      <p name="dparam4">100</p>
      <p name="dparam6">186</p>
      <p name="dparam8">512</p>
      <p name="dparam10">30</p>
      <p name="cparam3">120</p>
      <p name="cparam5">50</p>
      <p name="cparam7">50</p>
      <p name="cparam9">384</p>
      <p name="cparam11">384</p>
      <p name="sparam1">21</p>
      <p name="sparam2">26</p>
      <p name="sparam3">30</p>
      <p name="sparam4">20</p>
      <p name="sparam5">25</p>
      <p name="sparam6">30</p>
      <p name="sparam7">24</p>
      <p name="sparam8">29</p>
      <p name="sparam9">120</p>
      <p name="sparam0">60</p>
      <p name="sparama">60</p>
      <p name="sparams">240</p>
      <p name="sparamd">4</p>
      <p name="sparamgf">1</p>
      <p name="sparamh">255</p>
      <p name="sparamh">10</p>
      <p name="sparamj">30</p>
      <p name="sparamk">3</p>
      <p name="sparami">18</p>
      <p name="sparamu">0</p>
      <p name="sparamy">8</p>
      <p name="sparamt">0</p>
      <p name="sparamr">1</p>
      <p name="sparamer">1</p>
      <p name="sparame">9</p>
      <p name="sparamw">7</p>
      <p name="somanyparams1">10</p>
      <p name="somanyparams2">90</p>
      <p name="somanyparams3">10</p>
      <p name="somanyparams4">70</p>
      <p name="somanyparams5">90</p>
      <p name="somanyparams5">20</p>
      <p name="somanyparams6">20</p>
      <p name="somanyparams7">1</p>
      <p name="somanyparams0">1</p>
      <p name="somanyparams8">1</p>
      <p name="somanyparams9">20</p>
      <p name="somanyparamsa">120</p>
      <p name="somanyparamss">120</p>
      <p name="somanyparamsd">1</p>
      <p name="somanyparamsf">400</p>
      <p name="somanyparamsg">100</p>
      <p name="somanyparamsh">200</p>
      <p name="somanyparamsj">25</p>
      <p name="somanyparamsk">1</p>
      <p name="somanyparamsl">66947</p>
      <p name="somanyparamso">66947</p>
      <p name="somanyparamsi">66947</p>
      <p name="somanyparamsu">8</p>
      <p name="somanyparamsy">0</p>
      <p name="somanyparamst">65535</p>
      <p name="somanyparamsr">5</p>
      <p name="anotherparam1">0</p>
      <p name="anotherparam2">5</p>
      <p name="anotherparam3">3</p>
      <p name="anotherparam4">1</p>
      <p name="anotherparam5">5</p>
      <p name="anotherparam6">3</p>
      <p name="anotherparam7">1</p>
      <p name="anotherparam8">3</p>
      <p name="anotherparam9">2</p>
      <p name="anotherparam0">4</p>
      <p name="anotherparamq">3</p>
      <p name="anotherparamw">12</p>
      <p name="anotherparame">6</p>
      <p name="anotherparamr">3</p>
      <p name="anotherparamt">6</p>
      <p name="anotherparamy">9</p>
      <p name="anotherparamu">12</p>
      <p name="anotherparami">20</p>
      <p name="anotherparamo">10</p>
      <p name="anotherparamp">5</p>
      <p name="anotherparama">20</p>
      <p name="anotherparams">10</p>
      <p name="anotherparamd">5</p>
      <p name="anotherparamf">20</p>
      <p name="anotherparamg">10</p>
      <p name="anotherparamh">5</p>
      <p name="anotherparamj">20</p>
      <p name="anotherparamk">10</p>
      <p name="anotherparaml">5</p>
      <p name="anotherparamz">20</p>
      <p name="anotherparamx">10</p>
      <p name="anotherparamc">5</p>
      <p name="anotherparqamb">30</p>
      <p name="anotherparqamn">0</p>
      <p name="anotherparqamm">0</p>
      <p name="anotherparqamas">4152</p>
      <p name="anotherparqams">0</p>
      <p name="anotherparqamd">15</p>
      <p name="anotherparqamf">1</p>
      <p name="anotherparqamg">10</p>
      <p name="anotherparqamh">5</p>
      <p name="anotherparqamj">128</p>
      <p name="anotherparqamk">0</p>
      <p name="anotherparqaml">127</p>
    </managedObject>
  </someData>
</blah>

A simple one-liner to prepare a huge fake input XML (replace the number 100 with an appropriate value):

perl -p0e "s/<managedObject.*?>.*<\/managedObject>/$& x 100/se" testinsmall.xml > testin.xml

The script doing the job (to be optimized):

use XML::Twig;

$inputFile = 'testin.xml';
$outputFile = 'testout.xml';
$loop = 563;
$netType = "MNE-1v1";
$mx2G = "002";
$my2G = "02";
$objID = 1;
$bID = $firstElementID = 1;
$managedObjectsAmount = 0;
$someID = 0;
$segmentID = 0;
$header = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE raml SYSTEM 'blah.dtd'>\n<blah version=\"2.1\" xmlns=\"blah.xsd\">\n<someData type=\"actual\" name=\"ActualConfiguration\" id=\"1\">\n<header>\n<log dateTime=\"2012-05-08T10:10:10\" action=\"export\"/>\n<log dateTime=\"2012-05-08T10:10:10\" action=\"ConfigurationHeaderBackup.id\">1</log>\n<log dateTime=\"2012-05-08T10:10:10\" action=\"ConfigurationHeaderBackup.name\">ActualConfiguration</log>\n</header>\n";
$root = "<managedObject class=\"CommonStuff:ABCD\" version=\"1.0\" distName=\"$netType\" id=\"12341234\" vendor=\"XXX\" timeStamp=\"2012-04-26T15:18:07\">\n<defaults name=\"System\" id=\"2\"/>\n<extension name=\"system_parameters\">\n<p name=\"\$state\">operational</p>\n</extension>\n</managedObject>";
$ending = "\n</someData>\n</blah>";

my ($sec,$min,$hour,$day,$month,$yr19,@rest) = localtime(time);

open(OUT, ">", $outputFile) or die "cannot open dataOut.txt: $!";
print OUT $header;
print OUT $root;

for $i(1 .. $loop) {
    $t= XML::Twig->new( twig_roots => { managedObject => \&handle_managedObject});
    $t->parsefile($inputFile);
    print "\nIteracja: $i / $loop \t-> OK\n";
    $bID++;
    $someID = 0;
}

print OUT $ending;
close (OUT);

print "\n----------------\nObjects managed: $managedObjectsAmount \n\n";
my ($sec2,$min2,$hour2,$day2,$month2,$yr192,@rest2) = localtime(time);
printStartTime();
printEndTime();

sub handle_managedObject {
    my ($t, $element) = @_;
    @fields = split(/\//, $element->{'att'}->{'distName'});

    # distName="MNE-PET/*" - OK
    if ($fields[0] ne $netType) {
        $fields[0] = $netType;
    }

    # distName="MNE-PET/FLF-1000..1064" - OK
    if ($fields[1] =~ /^FLF/) {
        $fields[1] = "FLF-".$bID;
        if (!$fields[2]) {
            $element->first_child('p[@name="name"]')->set_text($fields[1]);
        }
    }

    # distName="MNE-PET/FLF-*/WTF-1..65" -> / FLF
    if ($fields[2] =~ /^WTF-\w+/) {
        $fields[2] = "WTF-".$someID;
        if (!$fields[3]) {
            $fields[2] = "WTF-".++$someID;
            $element->first_child('p[@name="name"]')->set_text($fields[2]);
        }
    }

    # distName="MNE-PET/FLF-*/WTF-*/XLS-1..6" -> /WTF
    if (($fields[3] =~ /^XLS-\w+/) && (!$fields[4])) {
        @fieldsFLF = split(/-/, $fields[1]);
        @fieldsWTF = split(/-/, $fields[2]);
        @fieldsXLS = split(/-/, $fields[3]);
        $cId = $fieldsWTF[1].$fieldsXLS[1];
        $element->first_child('p[@name="name"]')->set_text($fields[3]);
        $element->first_child('p[@name="cId"]')->set_text($cId);
        $element->first_child('p[@name="locAreaId1"]')->set_text($fieldsFLF[1]);
        $element->first_child('p[@name="locAreaId2"]')->set_text($mx2G);
        $element->first_child('p[@name="locAreaId3"]')->set_text($my2G);
        if ($fieldsXLS[1] == 1) {
            $element->first_child('p[@name="masterWTF"]')->set_text(1);
            $element->first_child('p[@name="segmentId"]')->set_text(++$segmentID);
        } else {
            $element->first_child('p[@name="masterWTF"]')->set_text(0);
            $element->first_child('p[@name="segmentId"]')->set_text($segmentID);
        }
    }

    $element->{'att'}->{'distName'} = join ('/',@fields);
    $element->{'att'}->{'id'} = $objID++;
    $element->set_pretty_print( 'indented');
    $element->print(\*OUT) or die "Failed to write managedObject to output XML file:$!\n";
    $managedObjectsAmount++;
}

sub printToFile {
    $element->set_pretty_print( 'indented');
    $element->flush(\*OUT) or die "Failed to write element output XML file:$!\n";
}

sub printStartTime {
    print "START Time:\t".sprintf("%02d",$hour).":".sprintf("%02d",$min).":".sprintf("%02d",$sec);###To print the current time
    print "\t$day-".++$month. "-".($yr19+1900)."\n"; ####To print date format as expected
}

sub printEndTime {
    print "END Time:\t".sprintf("%02d",$hour2).":".sprintf("%02d",$min2).":".sprintf("%02d",$sec2);###To print the current time
    print "\t$day2-".++$month2. "-".($yr192+1900)."\n"; ####To print date format as expected
}
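
One thing that may be worth checking before adding threads, offered as a guess and not a diagnosis: the handler above prints each managedObject but never releases it, so the elements already written may stay in memory until the parse ends. XML::Twig provides purge for exactly that. The sketch below is a cut-down handler under that assumption; the inline sample document, the string buffer standing in for the OUT filehandle and the variable names are illustrative only.

```perl
use strict;
use warnings;
use XML::Twig;

# Tiny stand-in for the real input file.
my $xml = <<'XML';
<blah>
  <someData>
    <managedObject distName="MNE-PET/FLF-1000" id="10"><p name="name">x</p></managedObject>
    <managedObject distName="MNE-PET/FLF-1001" id="11"><p name="name">y</p></managedObject>
  </someData>
</blah>
XML

my $obj_id  = 1;    # plays the role of $objID in the script above
my $written = 0;
my $output  = '';   # stands in for the OUT filehandle

my $twig = XML::Twig->new(
    twig_roots => {
        managedObject => sub {
            my ($t, $elt) = @_;
            $elt->set_att(id => $obj_id++);   # renumber, as the script does
            $output .= $elt->sprint . "\n";   # write the element out
            $written++;
            $t->purge;                        # drop everything parsed so far
        },
    },
);
$twig->parse($xml);

print $output;
print "Objects written: $written\n";
```

If memory use on the 100MB input stays flat with purge in place and the run is still slow, the time is going somewhere else, and a profiler run would say more than guessing.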

In reply to XML::Twig and threads [solved] by grizzley
