in reply to Re: XML::Parser problems
in thread XML::Parser problems

I was guessing that it would've been a hash. Well, not enough experience with Perl's objects :).

But the functions were there somewhere. Thanks. Well, this gets more complicated. My text function now looks like this:
  sub text (@) {
      # shift @_;
      if ($text && $_[1] =~ /\S/) {
          # $UNIQ{$com}{$_[1]}++;
          # $i++;
          if ($str) {
              print XML::Parser::Expat::current_line($_[0]), ",",
                    XML::Parser::Expat::current_column($_[0]), "\n";
              print "'$str','$_[1]'\n$com\n";
              exit;
          }
          $str .= $_[1];
      }
  }
And the result of the run is this.
$ zcat data/uniprot_sprot.xml.gz | ./get_sp_fields.pl
26745,17
'Involve','d in the presentation of foreign antigens to the immune system'
function
And lines 26744-26746 of the XML are:
  <comment type="function">
    <text>Involved in the presentation of foreign antigens to the immune system</text>
  </comment>
So is there a bug in XML::Parser, since the text section is split across two calls to my text handler? Or am I missing something here...

Re^3: XML::Parser problems
by mirod (Canon) on Jul 01, 2005 at 08:14 UTC

    Sorry to reply with a RTFM, but this is what the FM reads (emphasis added):

    Char (Expat, String)
    This event is generated when non-markup is recognized. The non-markup sequence of characters is in String. A single non-markup sequence of characters may generate multiple calls to this handler. Whatever the encoding of the string in the original document, this is given to the handler in UTF-8.

    Note that AFAIK all XML parsers behave like this, to allow you to parse documents even if they contain chunks of text that are bigger than the available memory.

    Also, the XML::Parser review mentions this, and gives you a way to get all the data.

    Update: the Perl XML FAQ also mentions this.
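
A minimal sketch of the usual workaround, for illustration only: buffer whatever the Char handler receives and use the buffer only when the End event for the element arrives. The sub name collect_texts and the variables $in_text and $buffer are made up for this example; the only XML::Parser calls used are new with Handlers (Start, Char, End) and parse. It assumes `<text>` elements are not nested.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: returns the complete content of every <text> element.
# Char events are appended to a buffer; the buffer is only used once the
# matching End event fires, so split Char calls no longer matter.
sub collect_texts {
    my ($xml) = @_;
    my @texts;          # completed <text> contents
    my $in_text = 0;    # true while inside a <text> element
    my $buffer  = '';   # accumulates chunks from the Char handler

    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ($expat, $element) = @_;
                if ($element eq 'text') {
                    $in_text = 1;
                    $buffer  = '';
                }
            },
            Char => sub {
                my ($expat, $string) = @_;
                # May be called several times for one run of text.
                $buffer .= $string if $in_text;
            },
            End => sub {
                my ($expat, $element) = @_;
                if ($element eq 'text') {
                    push @texts, $buffer;
                    $in_text = 0;
                }
            },
        },
    );
    $parser->parse($xml);
    return @texts;
}

my $xml = <<'XML';
<entry>
  <comment type="function">
    <text>Involved in the presentation of foreign antigens to the immune system</text>
  </comment>
</entry>
XML

print "$_\n" for collect_texts($xml);
```

With this pattern it does not matter whether the parser delivers "Involve" and "d in the presentation..." in one call or two; the string is only looked at once it is complete.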

      No need to say sorry. If it is RTFM, then it is RTFM, as I do seem to have been missing something :). In fact, by now I was already combining the strings myself in a similar way (as a workaround for the problem), which is what that review link describes.

      I guess this gets filed under "we live and learn".