Hena has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, while using the XML::Parser module I've run into a problem: I get a result that confuses me. Is there any way to find out which row (line) of the input I'm currently on? I would have thought that the expat string I get (XML::Parser::Expat=HASH(0x819a53c)) might tell me that, but I didn't see any reference to it in perldoc.

I used the parser like this. The file I'm parsing is 161M gzipped so no snippets of data here.
# field we want
my $FIELD = 'comment';
my $TEXT = 'text';
# flag && stuff
my $com = "";
my $text = 0;
my %UNIQ;
my $i = 0;
my $str = "";

sub start (@) {
    shift @_;
    if ($_[0] eq $FIELD) {
        $com = $_[2];
    }
    elsif ($com ne "" && $_[0] eq $TEXT) {
        $text = 1;
    }
}

sub end (@) {
    shift @_;
    if ($_[0] eq $FIELD) {
        $com = "";
        # $i=0;
        $str = "";
    }
    elsif ($_[0] eq $TEXT) {
        $text = 0;
    }
}

sub text (@) {
    # shift @_;
    if ($text && $_[1]=~/\S/) {
        # $UNIQ{$com}{$_[1]}++;
        # $i++;
        if ($str) { print "$_[0]\n'$str','$_[1]'\n$com\n"; exit; }
        $str .= $_[1];
    }
}

my $parser = new XML::Parser(Handlers => {
    Start => \&start,
    End   => \&end,
    Char  => \&text,
});
$parser->parse(*STDIN);

Replies are listed 'Best First'.
Re: XML::Parser problems
by PodMaster (Abbot) on Jul 01, 2005 at 06:35 UTC
    I would think that the expat string which I get (XML::Parser::Expat=HASH(0x819a53c)) might tell that, but I didn't see any reference to it in perldoc.
    String? That's the default string representation of an object (my $obj = XML::Parser::Expat->new(...); print "$obj";). What you're looking for is the XML::Parser::Expat methods current_line, current_column, and current_byte.
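As a minimal sketch of the methods named above: every handler receives the XML::Parser::Expat object as its first argument, so the position methods can be called on it directly. The sample document and the names $xml and @positions are made up for illustration and are not the poster's data.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample document, standing in for the real input.
my $xml = <<'XML';
<entry>
  <comment type="function">
    <text>Example text</text>
  </comment>
</entry>
XML

my @positions;

sub start {
    # The first argument to every handler is the Expat object.
    my ($expat, $element, %attr) = @_;
    return unless $element eq 'text';

    # Ask the Expat object where the parser currently is in the input.
    push @positions, [
        $expat->current_line,
        $expat->current_column,
        $expat->current_byte,
    ];
}

my $parser = XML::Parser->new(Handlers => { Start => \&start });
$parser->parse($xml);

printf "<text> starts at line %d, column %d, byte %d\n", @$_ for @positions;
```

Calling the methods as $expat->current_line rather than XML::Parser::Expat::current_line($expat) does the same job and reads more naturally.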

      I was guessing that it would've been a hash. Well, not enough experience with Perl's objects :).

      But I knew the functions had to be somewhere. Thanks. Well, this gets more complicated. My text function now looks like this.
      sub text (@) {
          # shift @_;
          if ($text && $_[1]=~/\S/) {
              # $UNIQ{$com}{$_[1]}++;
              # $i++;
              if ($str) {
                  print XML::Parser::Expat::current_line($_[0]),",",XML::Parser::Expat::current_column($_[0]),"\n";
                  print "'$str','$_[1]'\n$com\n";
                  exit;
              }
              $str .= $_[1];
          }
      }
      And the result of the run is this.
      $ zcat data/uniprot_sprot.xml.gz | ./get_sp_fields.pl
      26745,17
      'Involve','d in the presentation of foreign antigens to the immune system'
      function
      
      And lines 26744-26746 of the XML file are:
        <comment type="function">
          <text>Involved in the presentation of foreign antigens to the immune system</text>
        </comment>
      
      So is there a bug in XML::Parser, since the text section is split across two calls to the text sub? Or am I missing something here...

        Sorry to reply with a RTFM, but this is what the FM reads (emphasis added):

        Char (Expat, String)
        This event is generated when non-markup is recognized. The non-markup sequence of characters is in String. A single non-markup sequence of characters may generate multiple calls to this handler. Whatever the encoding of the string in the original document, this is given to the handler in UTF-8.

        Note that AFAIK all XML parsers behave like this, so that you can parse documents even if they contain chunks of text that are bigger than the available memory.

        Also, the XML::Parser review mentions this and gives you a way to get all the data.

        Update: the Perl XML FAQ also mentions this.
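One common way to cope with split Char events is to collect the pieces in a buffer and only use the text once the End handler fires for the element. The following is a rough sketch of that idea, not necessarily the approach the review describes. The sample document is invented (the &amp; is added because entity references are a typical place for the text to be split), and $buffer, $in_text and @texts are hypothetical names.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample document, loosely modelled on the lines quoted above.
my $xml = <<'XML';
<entry>
  <comment type="function">
    <text>Involved in the presentation of foreign antigens &amp; more</text>
  </comment>
</entry>
XML

my $in_text = 0;    # true while we are inside a <text> element
my $buffer  = '';   # collects every Char chunk for the current <text>
my @texts;          # complete text of each <text> element seen

my $parser = XML::Parser->new(Handlers => {
    Start => sub {
        my ($expat, $element) = @_;
        if ($element eq 'text') {
            $in_text = 1;
            $buffer  = '';
        }
    },
    Char => sub {
        my ($expat, $string) = @_;
        # May be called several times per element, so append, never assign.
        $buffer .= $string if $in_text;
    },
    End => sub {
        my ($expat, $element) = @_;
        if ($element eq 'text') {
            # Only here is the text of the element known to be complete.
            push @texts, $buffer;
            $in_text = 0;
        }
    },
});

$parser->parse($xml);
print "'$_'\n" for @texts;
```

The buffer holds only one element's text at a time, so this stays cheap on a large file as long as individual text elements are reasonably small.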

Re: XML::Parser problems
by murugu (Curate) on Jul 01, 2005 at 07:08 UTC
    Hi Hena,

    You are printing the object as a string. That is why you are getting 'XML::Parser::Expat=HASH(0x819a53c)'.

    If you want the current row (i.e. line), use the current_line method.

    Regards,
    Murugesan Kandasamy