manunamu has asked for the wisdom of the Perl Monks concerning the following question:

Monks:

Salutations! Here's my question.

I have an xml file which is as follows:


<?xml version="1.0" encoding="UTF-8" ?>
<CustomerInvoice>
  <XSDVersion>1.5.7</XSDVersion>
  <RecordType></RecordType>
  <OutFileName></OutFileName>
  <DateTime></DateTime>
  <InvoiceDetails>
    <CompanyInfo>
      <CompanyName>XXX Ltd</CompanyName>
      <RegistrationID>REG</RegistrationID>
      <TaxInvoiceGSTRegistrationNo>DDDD</TaxInvoiceGSTRegistrationNo>
      <Comment>Timbuctu</Comment>
      <TermsAndCondition></TermsAndCondition>
    </CompanyInfo>
    <InvHeader>
      <CustId>000</CustId>
      <CustCode>1111</CustCode>
      <CuInvRefNum>abc</CuInvRefNum>
      <CuInternalRefNum></CuInternalRefNum>
      <CuOriginalInvoiceNo></CuOriginalInvoiceNo>
      <CuInvoiceDate>XXX</CuInvoiceDate>
      <CuBillStartDate>XXX</CuBillStartDate>
      <CuBillEndDate>XXX</CuBillEndDate>
      <CuTotalDeposit></CuTotalDeposit>
      <CuNonTaxableAmount>0.00</CuNonTaxableAmount>
      <CuTotalTax>5.05</CuTotalTax>
      <CuTaxArray>
        <CuTaxDescription>YYY</CuTaxDescription>
        <CuTaxPercentage>7.0000</CuTaxPercentage>
        <CuTaxableAmount>72.1168</CuTaxableAmount>
        <CuTaxAmount>5.05</CuTaxAmount>
      </CuTaxArray>
      <CuTotalAdjustments></CuTotalAdjustments>
      <CuInvoiceAmount>77.17</CuInvoiceAmount>
      <CuInvoiceCurrency>XXX</CuInvoiceCurrency>
      <CuTaxableAmountBase></CuTaxableAmountBase>
      <CuTotalTaxBase></CuTotalTaxBase>
      <CuBaseCurrency>XXX</CuBaseCurrency>
      <CuPreviousBalance></CuPreviousBalance>
      <CuPaymentAmount></CuPaymentAmount>
      <CuBalance>77.17</CuBalance>
      <CuDueDate>XXXX</CuDueDate>
      <CuOverdueAmount></CuOverdueAmount>
      <CuPaymentMode>Cash</CuPaymentMode>
      <CuPaymentTerm>Payment received on or after YYY</CuPaymentTerm>
      <CuDeliveryNo></CuDeliveryNo>
      <CuDODate></CuDODate>
      <CuDeliveryDate></CuDeliveryDate>
      <CuPOReference></CuPOReference>
      <CustomerInfoField1>WWWW</CustomerInfoField1>
      <CustomerInfoField2>ZZZZ</CustomerInfoField2>
      <CustomerInfoField3></CustomerInfoField3>
      <CustomerInfoField4></CustomerInfoField4>
    </InvHeader>
    <AddressInfo>
      <AddressLine1></AddressLine1>
      <AddressLine2>MR SAMPLE_CUST</AddressLine2>
      <AddressLine3> GROSS AVENUE 1</AddressLine3>
      <AddressLine4>#9999-9999</AddressLine4>
      <AddressLine5>Timbuctu XXXX</AddressLine5>
      <AddressLine6></AddressLine6>
    </AddressInfo>
    <ExecSummary>
      <SummaryDesc>Market Summary</SummaryDesc>
      <DivCostFlatTotalCharges>72.1168</DivCostFlatTotalCharges>
      <SubscriberLevel>
        <SubscriberName>MR SAMPLE_CUST_</SubscriberName>
        <Market>
          <Name>XXX</Name>
          <OneTimeCharge>10.0000</OneTimeCharge>
          <RecCharge>42.0968</RecCharge>
          <UsageCharge>20.0200</UsageCharge>
          <OCCCharge>0.0000</OCCCharge>
          <SubsidyCharge></SubsidyCharge>
          <TotalDiscounts>0.0000</TotalDiscounts>
          <TotChargeBeforeTax>72.1168</TotChargeBeforeTax>
        </Market>
      </SubscriberLevel>
    </ExecSummary>
    <ServiceInfo>
      <CoServiceLabel>0</CoServiceLabel>
      <CoPackageDescription>GGG</CoPackageDescription>
      <CoServiceNumber>999999</CoServiceNumber>
      <CoServiceAddress>MR SAMPLE_CUST</CoServiceAddress>
      <CoServiceName></CoServiceName>
      <CoTotOccCharges></CoTotOccCharges>
      <CoServiceTotal>72.1168</CoServiceTotal>
      <ServiceInfoField1></ServiceInfoField1>
      <ServiceInfoField2></ServiceInfoField2>
      <ServiceInfoField3></ServiceInfoField3>
      <ServiceInfoField4></ServiceInfoField4>
      <ServiceInfoField5>XXX</ServiceInfoField5>
      <ServiceInfoField6></ServiceInfoField6>
      <ServiceInfoField7>GGG</ServiceInfoField7>
      <ChargeType>
        <ServicesDetails>
          <SDPackageName></SDPackageName>
          <SDTaxIndicator>G</SDTaxIndicator>
          <SDServiceName>GGG800</SDServiceName>
          <SDQuantity></SDQuantity>
          <SDPrice></SDPrice>
          <UOM></UOM>
          <SDChargedAmount>10.0000</SDChargedAmount>
          <Comment></Comment>
        </ServicesDetails>
        <SDChargeType>GGG</SDChargeType>
        <SDChargeDesc></SDChargeDesc>
        <SDChargeTypeSubTotal>10.0000</SDChargeTypeSubTotal>
      </ChargeType>
      <ChargeType>
        <ServicesDetails>
          <SDPackageName></SDPackageName>
          <SDTaxIndicator>G</SDTaxIndicator>
          <SDServiceName>GGG800</SDServiceName>
          <SDQuantity></SDQuantity>
          <SDPrice></SDPrice>
          <UOM></UOM>
          <SDChargedAmount>13.0968</SDChargedAmount>
          <Comment></Comment>
          <SDChargedPeriod>
            <ChargeStartDate>9999-10-01</ChargeStartDate>
            <ChargeEndDate>9999-10-14</ChargeEndDate>
          </SDChargedPeriod>
        </ServicesDetails>
        <ServicesDetails>
          <SDPackageName></SDPackageName>
          <SDTaxIndicator>G</SDTaxIndicator>
          <SDServiceName>GGG800</SDServiceName>
          <SDQuantity></SDQuantity>
          <SDPrice></SDPrice>
          <UOM></UOM>
          <SDChargedAmount>29.0000</SDChargedAmount>
          <Comment></Comment>
          <SDChargedPeriod>
            <ChargeStartDate>9999-10-15</ChargeStartDate>
            <ChargeEndDate>9999-11-14</ChargeEndDate>
          </SDChargedPeriod>
        </ServicesDetails>
        <SDChargeType>Charges</SDChargeType>
        <SDChargeDesc></SDChargeDesc>
        <SDChargeTypeSubTotal>42.0968</SDChargeTypeSubTotal>
      </ChargeType>
      <ChargeType>
        <ServicesDetails>
          <SDPackageName>purchase1</SDPackageName>
          <SDTaxIndicator>G</SDTaxIndicator>
          <SDServiceName>purchase1</SDServiceName>
          <SDChannelSelection></SDChannelSelection>
          <SDQuantity></SDQuantity>
          <SDInstalmentCounter></SDInstalmentCounter>
          <SDPrice></SDPrice>
          <UOM></UOM>
          <SDChargedAmount>10.020</SDChargedAmount>
          <Comment></Comment>
        </ServicesDetails>
        <ServicesDetails>
          <SDPackageName>Purchase2</SDPackageName>
          <SDTaxIndicator>G</SDTaxIndicator>
          <SDServiceName>Purchase2</SDServiceName>
          <SDChannelSelection></SDChannelSelection>
          <SDQuantity></SDQuantity>
          <SDInstalmentCounter></SDInstalmentCounter>
          <SDPrice></SDPrice>
          <UOM></UOM>
          <SDChargedAmount>10.000</SDChargedAmount>
          <Comment></Comment>
          <SDUsageDetails>
            <UsageTaxIndicator>G</UsageTaxIndicator>
            <UsageDesc_2>1633</UsageDesc_2>
            <UsageDesc_1>XXX XXX</UsageDesc_1>
            <UsageDesc>purchase1</UsageDesc>
            <PrimQuantity>2</PrimQuantity>
            <PrimUOM>Unit</PrimUOM>
            <SecQuantity></SecQuantity>
            <SecUOM></SecUOM>
            <TerTairyQuantity></TerTairyQuantity>
            <TerTairyUOM></TerTairyUOM>
            <ChargedAmount>10.0200</ChargedAmount>
          </SDUsageDetails>
          <SDUsageDetails>
            <UsageTaxIndicator>G</UsageTaxIndicator>
            <UsageDesc_2>1633</UsageDesc_2>
            <UsageDesc_1>XXX XXX</UsageDesc_1>
            <UsageDesc>Purchase2</UsageDesc>
            <PrimQuantity>1</PrimQuantity>
            <PrimUOM>Unit(s)</PrimUOM>
            <SecQuantity></SecQuantity>
            <SecUOM></SecUOM>
            <TerTairyQuantity></TerTairyQuantity>
            <TerTairyUOM></TerTairyUOM>
            <ChargedAmount>10.0000</ChargedAmount>
          </SDUsageDetails>
        </ServicesDetails>
        <SDChargeType>Charges</SDChargeType>
        <SDChargeDesc>9999-01-01 to 9999-01-31</SDChargeDesc>
        <SDChargeTypeSubTotal>20.0200</SDChargeTypeSubTotal>
      </ChargeType>
      <CallGroupInfo>
        <CallGroupDescription>purchase1</CallGroupDescription>
        <CallGroupTotalAmount></CallGroupTotalAmount>
        <CallGroupRemarks></CallGroupRemarks>
      </CallGroupInfo>
      <CallGroupInfo>
        <CallGroupDescription>purchase1</CallGroupDescription>
        <CallGroupTotalAmount></CallGroupTotalAmount>
        <CallGroupRemarks></CallGroupRemarks>
        <CallDetailByService>
        <CallDetailByService>
          <CallDetailSubTotals>
            <Description>Sub-Total for Purchase Details</Description>
            <Amount>20.0200</Amount>
          </CallDetailSubTotals>
          <CallDetailInfo>
            <TaxIndicator>G</TaxIndicator>
            <RoamingPartner></RoamingPartner>
            <Country></Country>
            <Date>9999-10-04</Date>
            <Time>17:23:41</Time>
            <TelNumber></TelNumber>
            <Duration1></Duration1>
            <Unit1></Unit1>
            <Duration2></Duration2>
            <Unit2></Unit2>
            <Amount>10.0000</Amount>
          </CallDetailInfo>
          <CallDetailInfo>
            <TaxIndicator>G</TaxIndicator>
            <RoamingPartner></RoamingPartner>
            <Country></Country>
            <Date>9999-10-04</Date>
            <Time>18:00:30</Time>
            <TelNumber></TelNumber>
            <Duration1></Duration1>
            <Unit1></Unit1>
            <Duration2></Duration2>
            <Unit2></Unit2>
            <Amount>0.0200</Amount>
          </CallDetailInfo>
          <CallDetailInfo>
            <TaxIndicator>G</TaxIndicator>
            <RoamingPartner></RoamingPartner>
            <Country></Country>
            <Date>9999-10-02</Date>
            <Time>10:24:43</Time>
            <TelNumber></TelNumber>
            <Duration1></Duration1>
            <Unit1></Unit1>
            <Duration2></Duration2>
            <Unit2></Unit2>
            <Amount>10.0000</Amount>
          </CallDetailInfo>
          <Description></Description>
          <TelNumber></TelNumber>
          <Amount></Amount>
          <Description></Description>
          <TelNumber></TelNumber>
          <Amount></Amount>
        </CallDetailByService>
      </CallGroupInfo>
    </ServiceInfo>
  </InvoiceDetails>
  <NumInvoice>1</NumInvoice>
  <TotAmount>77.17</TotAmount>
  <TotCust>1</TotCust>
</CustomerInvoice>

The perl program I am running is as follows:


use strict;
use warnings;
use XML::Parser;
use Data::Dumper;
use List::Util qw(max);

my $file_name;
my @file_name1;
my $fp1;
my $xml_string;
my $pkg_txt;
my $got_ebook=0;
my $call_grp_info = 0;
my $CallDetailByService_cntr=0;
my $output_file_fp;

sub handle_end_xcd {
    my ( $expat, $element, %attrs ) = @_;
    my $line = $expat->current_line;
    my @parent_item_array=$expat->context;
    my $parent_tag=$parent_item_array[-1];
    print $output_file_fp " IN END Element Found $element\n";
    if ($element eq 'CallGroupInfo') {
        if ($got_ebook == 1) {
            print STDOUT "$xml_string\n";
        }
    }
    else {
        print STDOUT "$xml_string\n";
    }
}

sub handle_start_xcd {
    my ( $expat, $element, %attrs ) = @_;
    my $line = $expat->current_line;
    my @parent_item_array=$expat->context;
    my $parent_tag=$parent_item_array[-1];
    if ( ($element eq 'CallGroupInfo') && ($got_ebook > 1 ) ) {
        $call_grp_info++;
        print $output_file_fp "in IF $call_grp_info\n";
        if ($call_grp_info == 1) {
            print STDOUT "$xml_string\n";
        }
    }
    elsif ($element eq "CallDetailByService") {
        $CallDetailByService_cntr++;
        if ($CallDetailByService_cntr > 1) {
            print STDOUT "$xml_string\n";
        }
    }
    elsif ($element eq "CallGroupDescription") {
        print STDERR "XXX\n";
    }
    else {
        print STDOUT "$xml_string\n";
    }
}

sub char_handler {
    my ($p, $data) = @_;
    print $output_file_fp "Char handler", $p->current_element, "\n";
    if ($p->current_element eq "SDPackageName") {
        chomp $data;
        # print STDERR "Package Data is:$data\n";
        $got_ebook++ if ( ($data eq "eMagazine Purchase") || ($data eq "eBook Purchase") ) ;
        print STDOUT "$xml_string\n";
    }
    else {
        print STDOUT "$xml_string\n";
    }
} # End char_handler

@file_name1=glob("org1.xml");
my $output_file_name="test.xml";
open ($output_file_fp,">", $output_file_name);

my $p1 = XML::Parser->new (
    Handlers => {
        Start => \&handle_start_xcd,
        Char  => \&char_handler,
        End   => \&handle_end_xcd
    }
) ;

foreach (@file_name1) {
    $file_name = $_;
    chomp $file_name;
    open($fp1,"<",$file_name);
    while(<$fp1>) {
        chomp;
        $xml_string=$_;
        #print $output_file_fp "$xml_string";
        chomp $xml_string;
        if ( grep ( /SDPackageName/, $xml_string)
          || grep ( /CallGroupInfo/ ,$xml_string)
          || grep ( /CallDetailByService/ ,$xml_string )
          || grep ( /CallGroupDescription/ ,$xml_string )) {
            $p1->parse($xml_string);
        }
        else {
            print STDOUT "$xml_string\n";
        }
    }
}

When I run the program, it gives me the following error:

no element found at line 1, column 51, byte 51 at D:/Dwimperl/perl/vendor/lib/XML/Parser.pm line 187

Have I encountered a bug in XML::Parser?

Appreciate all help!

Thanks, Manu

Replies are listed 'Best First'.
Re: Bug in XML::Parser (maxims)
by davido (Cardinal) on Oct 22, 2013 at 16:49 UTC

    From Good Advice and Maxims for Programmers:

    • #11907 Looking for a compiler bug is the strategy of LAST resort. LAST resort.
    • #11943 Ah yes, and you are the first person to have noticed this bug since 1987. Sure.
    • #11958 The bug is in you, not in Perl.

    And in case you think I'm just being rude, I'll share this secret with you; remembering these (and the other) maxims has actually helped me to avoid looking in the wrong place for problems... many times. The presentation may be gruff, but keeping it in mind will often shorten the path to finding the root cause.

    Bugs in Perl, and in CPAN modules, do turn up from time to time. But they're the issue far less frequently than code we've touched, or data input we're accepting.


    Dave

      Perlmonks: Salutations! All points that you have made about me being extremely shabby etc. are valid. Appreciate the answers as well. Many thanks for the same. I should have done the basic check of the XML for validity. Work pressure dominated - it shouldn't have, but it did. I did not mean to be pompous when I used that subject line in the mail, nor was it an attempt to show off that I found a bug. It was just my reaction to being vexed. Anyway, will certainly be careful next time and yes, will ensure that any questions that I post will not be in any way presumptuous. Cheers, Manu

        Well I hope you found the humor in those Maxims, because (I believe) they're intended to be humorous, and it was certainly my intent to enjoy them for that reason, while at the same time acknowledging the sound advice that they provide. ;)


        Dave

        I think "shabby" might be a bit overly critical; thus, more so for "extremely shabby". As if we haven't all taken a drink from the "Oops" jar from time to time.  :-)

        That said, you see the light on the error, which is what you came here for, and perhaps will take away a small bit of wisdom to not let work pressure interfere with basic diagnostic process -- another sin we have all surely committed from time to time. I know I have.

Re: Bug in XML::Parser
by McA (Priest) on Oct 22, 2013 at 16:22 UTC

    Hi,

    now, after you've formatted the xml-file, someone lazy like me can press the download link to get the xml source. I took the xml file and did the following:

    xmllint --format file.xml

    This gives me the following:

    xmllint --format file.xml
    file.xml:259: parser error : Opening and ending tag mismatch: CallDetailByService line 207 and CallGroupInfo
    </CallGroupInfo>
    ^
    file.xml:260: parser error : Opening and ending tag mismatch: CallGroupInfo line 203 and ServiceInfo
    </ServiceInfo>
    ^
    file.xml:261: parser error : Opening and ending tag mismatch: ServiceInfo line 80 and InvoiceDetails
    </InvoiceDetails>
    ^
    file.xml:265: parser error : Opening and ending tag mismatch: InvoiceDetails line 7 and CustomerInvoice
    </CustomerInvoice>
    ^
    file.xml:266: parser error : Premature end of data in tag CustomerInvoice line 2
    ^

    That proves what Anonymous Monk said earlier with his X-ray view: the xml file is not well-formed.

    So, my advice: On Linux, the program xmllint is a valuable tool, which you can usually find in the libxml2 package.
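
    On a system without xmllint, roughly the same check can be made from Perl. The following is only a sketch, assuming XML::LibXML is installed; wellformedness_error is a hypothetical helper name, not part of any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: returns '' if the XML is well-formed,
# otherwise the error text reported by libxml2.
# Pass either ( location => $path ) or ( string => $xml ).
sub wellformedness_error {
    my (%source) = @_;
    my $ok = eval { XML::LibXML->load_xml(%source); 1 };
    return $ok ? '' : "$@";    # stringify the error object
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    my $err = wellformedness_error( location => $ARGV[0] );
    print $err eq '' ? "$ARGV[0]: well-formed\n" : $err;
}
```

    Run against the file from the question, this should report a tag mismatch much like the xmllint output above, since both use libxml2 underneath.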

    Regards
    McA

Re: Bug in XML::Parser
by aaron_baugher (Curate) on Oct 22, 2013 at 12:53 UTC

    When you pressed the preview button and looked at your post to preview it, didn't you think, "Wow, that's an unreadable mess, I must have missed something. I'd better figure out how to fix that, or no one will be able to help me"? See Markup in the Monastery for the basics of how to separate your text paragraphs and code/data sections, or click the Writeup Formatting Tips link on the posting page for more details.

    Aaron B.
    Available for small or large Perl jobs; see my home node.

Re: Bug in XML::Parser
by zork42 (Monk) on Oct 22, 2013 at 13:27 UTC
    Hi manunamu, welcome to perlmonks!

    As McA said, but in a little more detail:
    1. put one <code> tag before the start of your XML
    2. put one </code> tag after the end of your XML
    3. put one <code> tag before the start of your code
    4. put one </code> tag after the end of your code
    5. remove all "&lt;Code>" from your code
    6. replace all "&lt;" with "<" (inside <code>...</code> tags you do not need to make any conversions like this)
    7. One line of your XML has <tab> characters. Replace them with spaces to ensure it is indented correctly
Re: Bug in XML::Parser
by graff (Chancellor) on Oct 23, 2013 at 04:24 UTC
    As pointed out previously, your xml input file is bad - there's either an "extra" <CallDetailByService> open tag at line 208, or else you're missing a second </CallDetailByService> close tag at line 258. One way or the other, it's an easy thing to fix.

    Apart from that, your main "foreach" loop indicates that you don't have a proper understanding yet of how to use XML::Parser. You should not be reading an xml file one line at a time and passing certain lines to the parser. That is absolutely the wrong way.
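
    A small sketch of why the line-at-a-time approach dies, assuming XML::Parser is installed: each call to parse() treats its argument as one complete document, so a lone opening tag is rejected even though it is perfectly fine inside the full file. The helper name parse_error_for is made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: returns the parser's error message,
# or '' if the string parsed cleanly as a whole document.
sub parse_error_for {
    my ($xml) = @_;
    my $parser = XML::Parser->new;    # no handlers needed for this check
    return eval { $parser->parse($xml); 1 } ? '' : "$@";
}

for my $input ( '<CallGroupInfo>', '<CallGroupInfo></CallGroupInfo>' ) {
    my $err = parse_error_for($input);
    $err =~ s/\s+\z//;                # trim the trailing newline
    printf "%-35s => %s\n", $input, $err eq '' ? 'ok' : $err;
}
```

    The first input should fail with an error of the same kind as the one in the original post, while the second parses, because it is a complete (if tiny) document.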

    Use the parser to read (and parse) the entire file, and use the various handler subroutines to do what needs to be done as you encounter the elements of interest in the data. For example, if you want to print the contents of <Amount> elements to STDOUT, you could do something like this (after you fix your xml file):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Parser;

    my $current_element = my $current_amount = "";

    my $p = XML::Parser->new( Handlers => { Start => \&handle_start,
                                            Char  => \&handle_text,
                                            End   => \&handle_end } );
    $p->parsefile( "org1.xml" );

    sub handle_start {
        my ( $xp, $element, %attr ) = @_;
        $current_element = $element;  # keep track of where we are
    }

    sub handle_end {
        my ( $xp, $element ) = @_;
        if ( $element eq 'Amount' ) {  # did we just close an "Amount" tag?
            print "$current_amount\n";
            $current_amount = "";
        }
        $current_element = "";
    }

    sub handle_text {
        my ( $xp, $string ) = @_;
        # do stuff here depending on where we are now:
        $current_amount .= $string if ( $current_element eq 'Amount' );
    }

    Now, isn't that a lot simpler? That's the whole point of using an XML parser - to make things simpler.
      Thanks graff. Appreciate the help. However, (and I should have explained this in my earlier post itself) I am deliberately trying to parse one XML line after another rather than the whole xml file at once. The reason was to find if there are any mismatched tags and then replace the tags and do other corrections. I am still stumped at the ability of XML parser to detect badly-formed XML in spite of the fact that I am not parsing the whole file at once. My assumption is that since I am parsing line by line, XML::Parser has no knowledge of what is coming next and therefore, it should not be able to detect a badly formed XML. The fact that it does is indeed fantastic albeit completely confounding.
        If you are trying "to find if there are any mismatched tags", that sounds like you are looking for errors that would cause an XML parser to fail (and it appears that the sample xml data you posted has this kind of problem, so I understand your goal now).

        But what that really means is that you can't really use an XML parser at all to solve this problem. As pointed out above, it's easy enough to check for xml errors using xmllint, although the error reports you get can sometimes be difficult to interpret, and the actual problem can still be hard to spot.

        I would be inclined to use a regex-based diagnosis - something like this:

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $infile = shift;  # get input file name from @ARGV
        open( my $fh, "<:utf8", $infile ) or die $!;
        local $/;  # slurp the whole file in the next line
        $_ = <$fh>;
        s/^<\?.*?\?>\s*//;  # ditch the "<?xml...?>" declaration, if any (non-greedy, so a one-line file survives)

        my %open_tags;
        my %close_tags;
        for my $tkn (split/(?<=>)|(?=<)/) {  # split on look-behind | look-ahead for brackets
            if ( $tkn =~ m{^<(\/?)(\w+)} ) {
                if ( $1 eq '' ) {
                    $open_tags{$2}++;
                }
                else {
                    $close_tags{$2}++;
                }
            }
        }
        for my $tag ( sort keys %open_tags ) {
            if ( ! exists( $close_tags{$tag} )) {
                warn sprintf( "%s: open tag %s is never closed in %d occurrence(s)\n",
                              $infile, $tag, $open_tags{$tag} );
            }
            else {
                if ( $close_tags{$tag} != $open_tags{$tag} ) {
                    warn sprintf( "%s: element %s has %d open tags but %d close tag(s)\n",
                                  $infile, $tag, $open_tags{$tag}, $close_tags{$tag} );
                }
                delete $close_tags{$tag};
            }
        }
        for my $tag ( keys %close_tags ) {
            warn sprintf( "%s: close tag %s has no open tags in %d occurrence(s)\n",
                          $infile, $tag, $close_tags{$tag} );
        }

        That will at least give you a clear tally of imbalances (if any) in the open/close tag inventory for a given xml file. You should be able to use this information, together with the line numbers from the xmllint reports, to locate the problems.
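
        A tally says which element is unbalanced but not where. As a rough complement, here is a stack-based sketch in core Perl that reports line numbers. It is an illustration only: find_tag_mismatches is a made-up name, and the regex tokenizer does not understand comments, CDATA sections, or ">" inside attribute values.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns a list of messages describing tag mismatches.
sub find_tag_mismatches {
    my ($xml) = @_;
    my @stack;        # each entry: [ tag name, line where it was opened ]
    my @problems;
    my $line = 1;

    # Split into tags and text, keeping the tags themselves.
    for my $tkn ( split /(<[^>]*>)/, $xml ) {
        # Matches <name ...>, </name> and <name/>; skips <?xml ...?> and <!...>
        if ( $tkn =~ m{^<(/?)(\w[\w.:-]*)[^>]*?(/?)>$} ) {
            my ( $closing, $name, $self_closing ) = ( $1, $2, $3 );
            if ($closing) {
                my $top = pop @stack;
                if ( !$top ) {
                    push @problems, "line $line: </$name> has no open tag";
                }
                elsif ( $top->[0] ne $name ) {
                    push @problems,
                        "line $line: </$name> closes <$top->[0]> opened on line $top->[1]";
                }
            }
            elsif ( !$self_closing ) {
                push @stack, [ $name, $line ];
            }
        }
        $line += ( $tkn =~ tr/\n// );    # count the newlines in this token
    }
    push @problems, "line $_->[1]: <$_->[0]> is never closed" for @stack;
    return @problems;
}

# Only runs when a file name is given on the command line.
if ( my $infile = shift @ARGV ) {
    open( my $fh, '<:encoding(UTF-8)', $infile ) or die "$infile: $!";
    my $xml = do { local $/; <$fh> };
    print "$infile: $_\n" for find_tag_mismatches($xml);
}
```

        As with xmllint, one stray tag causes a cascade of follow-on messages, so the first message is usually the one worth reading.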

        So, when you find these mismatched tags, isn't the next step to look at the process that is creating the xml files, and fix that? (These xml files aren't being created by manual editing, are they??)

        (Update: BTW, I forgot to mention... this new information in your reply makes your OP even more egregiously obtuse. If you had said at the beginning, "I have this xml file that has an error in the tags, and I need to figure out how to find the problem," then the discussion would have been more effective. I know, you already feel bad about the OP, and I shouldn't pile it on, but it needs to be said.)

Re: Bug in XML::Parser ( my file )
by Anonymous Monk on Oct 22, 2013 at 11:48 UTC

    Have I encountered a bug in XML::Parser?

    No, the bug is in your xml file

Re: Bug in XML::Parser
by McA (Priest) on Oct 22, 2013 at 11:50 UTC

    Hi,

    please reformat your post.

    One <code>-Tag at the beginning and one </code>-Tag at the end.

    Regards
    McA

Re: Bug in XML::Parser
by graff (Chancellor) on Oct 23, 2013 at 04:43 UTC
    Actually, depending on what you're actually trying to accomplish, something other than XML::Parser might be a better fit. In particular, XML::LibXML is vastly superior (IMHO). There's more documentation to look at, but in my experience, it's worth the effort, especially when you get to the XPath stuff.

    Here's how I'd use XML::LibXML for that same simple example (to print out just the contents of "Amount" tags):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::LibXML;

    my $parser = XML::LibXML->new;
    my $doc = $parser->parse_file( "org1.xml" );
    my $xpath = XML::LibXML::XPathContext->new( $doc );
    for my $node ( $xpath->findnodes( '//Amount' )) {
        print $node->textContent, "\n";
    }

    It doesn't get much simpler than that!