Paulux has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone, I have just read the XML::Parser guide by mirod, but I'm having a problem with handle_char, which cuts off some text.
I'm buffering as shown in the code, but when I empty the variable ($Number='') in handle_start and handle_end, the variable is still empty at the next processing step. But when I don't empty the variable, the parser SOMETIMES pastes two or more values together.
    #!/usr/bin/perl
    use XML::Parser;

    @files = <$plrepository/*.xml>;
    foreach $xmlfile (@files) {
        #something is omitted
        $p2 = new XML::Parser(Handlers => {Start => \&handle_start,
                                           End   => \&handle_end,
                                           Char  => \&handle_char});
        $p2->parsefile($xmlfile);
    }

    sub handle_start {
        my ($pkg,$element,%attr) = @_;
        $current_element = $element;
        if ( $element =~ /Header/i ) {
            $Number=$attr{Number};
            open (OUT, ">$outputfile") or die "No file";
        }
        $Number=''; # I empty the variable for the next Number in the file
        #something is omitted
    }

    sub handle_end {
        my ($pkg,$element,%attr) = @_;
        if ( $element =~ /Header/i ) {
            print OUT $Number,"$separator\n";
            print "\tNumber ". $Number . "\n";
            close (OUT);
        }
        #something is omitted
        $Number=''; # I empty the variable for the next Number in the file
    }

    sub handle_char {
        my $text = $_[1];
        if ( $current_element =~ /^Number$/i ) {
            ($text !~ /^\s*$/) && ($Number .= $text); #|-> buffer text
        }
        #something is omitted
    }
At the state of the art, I can't change my parsing module, so I have to solve this problem. B/R and thanx a lot.

Replies are listed 'Best First'.
Re: Another problem with XML parser
by toolic (Bishop) on Nov 10, 2009 at 13:50 UTC
    Since you didn't provide any input data, it is impossible for anyone to re-create your problem. Thus, I can only offer generic advice.

    Firstly, use strict and warnings. Perhaps some of your variables are not scoped properly.

    Secondly, sprinkle print statements liberally throughout your code to make sure your variables contain what you expect them to contain. See also Basic debugging checklist.

    At the state of the art, I can't change my parsing module
    It would be a good idea to have a plan in place to retire XML::Parser in favor of a more user-friendly (in my opinion) and better supported parser, such as XML::Twig.
      Thanx a lot for the suggestions. I'd like to provide a test example, but the text gets cut at random, so I don't have a reproducible case... it just happens...
Re: Another problem with XML parser
by gmargo (Hermit) on Nov 10, 2009 at 17:18 UTC

    Do you have multiple elements with a substring of "Header"? You could add anchors to the element match: $element =~ /^Header$/i.

    Can you have a "Number" element that resides outside of a "Header" element? That could be why you see double numbers. Try adding a flag so that the "Char" routine only checks for "Number" while inside a "Header".

    Can you have nested "Header" elements?

    You are opening your output file in one subroutine, and then writing to it and closing it in another. What is the purpose of splitting this up? I would keep the open/write/close together.

    And, purely for entertainment purposes, here is my version of your code, with most of the above ideas, reformatted a bit while I was trying to understand it. It compiles but is untested.

    #!/usr/bin/perl -w
    use strict;
    use warnings;
    use diagnostics;
    use XML::Parser;

    my $plrepository = ".";
    my @files = <$plrepository/*.xml>;

    # Declared before the parse loop so that they are already
    # initialized when the handlers run.
    my $current_element;   # global, shared with start,char
    my $Number;            # global, shared with start,end,char
    my $inHeader = 0;      # global, shared with start,end,char
    my $separator = ",";
    my $outputfile = "numbers.txt";

    foreach my $xmlfile (@files) {
        #something is omitted
        my $p2 = new XML::Parser(Handlers => { Start => \&handle_start,
                                               End   => \&handle_end,
                                               Char  => \&handle_char });
        $p2->parsefile($xmlfile);
    }

    sub handle_start {
        my ($pkg,$element,%attr) = @_;
        $current_element = $element;
        if ( $element =~ /^Header$/i ) {
            $Number=$attr{Number};
            $inHeader = 1;
        }
    }

    sub handle_end {
        my ($pkg,$element,%attr) = @_;
        if ( $element =~ /^Header$/i ) {
            # Are we overwriting the same file for every Header?
            open (OUT, ">", $outputfile) or die "No file";
            print OUT $Number,"$separator\n";
            print "\tNumber ". $Number . "\n";
            close (OUT);
            $inHeader = 0;
        }
    }

    sub handle_char {
        my ($pkg,$text) = @_;
        if ( $inHeader && $current_element =~ /^Number$/i && $text !~ /^\s*$/ ) {
            $Number .= $text; #|-> buffer text
        }
    }
      Here is an example of my xml file (there are over 29000 like this):
      <Header>
        <IpNumber>AC_1234</IpNumber>
      </Header>
      <ContentElement>
        <IdNumber>yyyyyyyy-yy</IdNumber>
        <InstanceNumber>001463010000016</InstanceNumber>
      </ContentElement>
      <ContentElement>
        <IdNumber>zzzzzzzz-zz</IdNumber>
        <InstanceNumber>0000000000000000</InstanceNumber>
      </ContentElement>
      <ContentElement>
        <IdNumber>xxxxxxxx-xx</IdNumber>
        <InstanceNumber>111111111111111</InstanceNumber>
      </ContentElement>
      <ContentElement>
        <IdNumber>aaaaaaaaa-aa</IdNumber>
        <InstanceNumber>222222222222222</InstanceNumber>
      </ContentElement>
      The code I wrote was just a small part, but I have multiple instances of ContentElement and I have to solve the problem for all the tags in the XML. But I'll try to modify my code along the lines of yours. I split the open from the write/close because when I started to implement the code I was a newbie (maybe I'm still a newbie).
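For buffering text on every tag rather than just one, a common XML::Parser pattern is to clear the buffer when an element starts, append in the Char handler (which may be called several times for a single run of text), and read the buffer in the End handler before clearing it. The following is only a minimal sketch under some assumptions: the `<Root>` wrapper, the `$buffer` and `@values` names, and the choice to ignore whitespace-only text are illustrative and not taken from the original code, and elements with mixed content would need a stack of buffers instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Assumption: the posted fragment sits inside some root element;
# <Root> is a hypothetical name used only for this sketch.
my $xml = <<'XML';
<Root>
  <Header>
    <IpNumber>AC_1234</IpNumber>
  </Header>
  <ContentElement>
    <IdNumber>yyyyyyyy-yy</IdNumber>
    <InstanceNumber>001463010000016</InstanceNumber>
  </ContentElement>
</Root>
XML

my $buffer = '';   # text collected since the most recent start or end tag
my @values;        # [element name, text] pairs, in document order

sub handle_start {
    my ($expat, $element) = @_;
    $buffer = '';                  # a new element opens: drop stale text
}

sub handle_char {
    my ($expat, $text) = @_;
    $buffer .= $text;              # append; one text run may arrive in pieces
}

sub handle_end {
    my ($expat, $element) = @_;
    # Read the buffer first, then clear it for whatever follows.
    push @values, [ $element, $buffer ] if $buffer =~ /\S/;
    $buffer = '';
}

my $parser = XML::Parser->new(
    Handlers => {
        Start => \&handle_start,
        End   => \&handle_end,
        Char  => \&handle_char,
    },
);
$parser->parse($xml);

print "$_->[0]: $_->[1]\n" for @values;
```

With the sample above this prints one line each for IpNumber, IdNumber and InstanceNumber. The key difference from the posted handlers is that the buffer is consumed in the End handler of the element that owns the text, before it is cleared, instead of being cleared on every start and end tag before the enclosing element gets a chance to print it.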

        toolic had a really good piece of advice that might have been glossed over. XML::Parser is not newbie friendly. XML::Twig or XML::LibXML are likely what you want to work with.

        I'm not sure I followed the example code in your question. Now that you've given some sample data, could you give a description of what the desired output/outcome is? You might well get an example solution in Twig and libxml.
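As an illustration of what an XML::Twig version might look like, here is a rough sketch. Since the desired output has not been stated, it simply assumes one comma-separated line per ContentElement; the `<Root>` wrapper and the `@rows` name are also assumptions made for the example. A twig handler is called only once its element has been parsed completely, so the text arrives whole and no manual buffering is needed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Assumption: the posted fragment sits inside some root element;
# <Root> is a hypothetical name used only for this sketch.
my $xml = <<'XML';
<Root>
  <Header>
    <IpNumber>AC_1234</IpNumber>
  </Header>
  <ContentElement>
    <IdNumber>yyyyyyyy-yy</IdNumber>
    <InstanceNumber>001463010000016</InstanceNumber>
  </ContentElement>
  <ContentElement>
    <IdNumber>zzzzzzzz-zz</IdNumber>
    <InstanceNumber>0000000000000000</InstanceNumber>
  </ContentElement>
</Root>
XML

my @rows;   # one "IdNumber,InstanceNumber" string per ContentElement

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called after each ContentElement has been fully parsed.
        ContentElement => sub {
            my ($t, $elt) = @_;
            push @rows, join ',',
                map { $elt->first_child_text($_) } qw(IdNumber InstanceNumber);
            $t->purge;   # release what has been parsed so far
        },
    },
);
$twig->parse($xml);

print "$_\n" for @rows;
```

For files on disk, `parsefile` can be used in place of `parse`. The `purge` call keeps memory use flat, which may matter when looping over many files of this kind.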