XML::Parser

Item Description: Low level module for parsing XML in Perl

Review Synopsis: The base of most XML modules, use it for performance or to roll your own module

Description

XML::Parser provides ways to parse XML documents.

built on top of XML::Parser::Expat, a lower level interface to James Clark's expat library
most of the other XML modules are built on top of XML::Parser
stream-oriented
for each event found while parsing a document a user-defined handler can be called
events are start and end tags, text, but also comments, processing instructions, CDATA, entities, element or attribute declarations in the DTD...
handlers receive the parser object and context information
sets of pre-defined handlers can be used as Styles
A companion module, XML::Encoding, allows XML::Parser to parse XML documents in various encodings besides the native UTF-8, UTF-16 and ISO-8859-1 (latin 1).
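
The handler mechanism described above can be sketched as follows. This is a minimal example, not the only way to set things up: the sample document and the @events log are made up for illustration, while Handlers, Start, End and Char are the names XML::Parser itself uses.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample document, only here to trigger a few events.
my $xml = '<doc><title id="t1">Hello</title><p>World</p></doc>';

my @events;   # illustrative log of the events the parser reports

my $parser = XML::Parser->new(
    Handlers => {
        # Start receives the parser (Expat) object, the element name,
        # then the attributes as a flat name/value list.
        Start => sub {
            my ($expat, $gi, %atts) = @_;
            my $atts = join ' ', map { "$_=$atts{$_}" } sort keys %atts;
            push @events, length $atts ? "start $gi $atts" : "start $gi";
        },
        # End receives the parser object and the element name.
        End => sub {
            my ($expat, $gi) = @_;
            push @events, "end $gi";
        },
        # Char receives the parser object and a chunk of text; it may be
        # called more than once for a single run of text.
        Char => sub {
            my ($expat, $string) = @_;
            push @events, "text $string";
        },
    },
);

$parser->parse($xml);
print "$_\n" for @events;
```

Events for which no handler is registered (comments, processing instructions and so on) are simply skipped.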

Why use XML::Parser

widely used, the first XML module, hence it is very robust
you need performance: it is low level, and modules based on it are necessarily slower
you need access to some parsing events that are masked by higher-level modules
one of the Styles does exactly what you want
you want to write your own module based on XML::Parser

Why NOT use XML::Parser

you'd rather use a higher level module: XML::DOM, XML::Twig, XML::XPath...
you'd rather use a simpler module: XML::PYX or XML::Simple

Related Modules

Besides the modules already mentioned:

XML::UM can translate characters between various encodings
XML::Checker provides a validating parser, XML::Checker::Parser, that can be used as a drop-in replacement for XML::Parser

Personal comments

XML::Parser is the basis of most XML processing in Perl. Even if you don't plan to use it directly, you should at least know how to use it if you are working with XML.

That said, I think it is usually a good idea to have a look at the various modules that sub-class XML::Parser, as they are usually easier to use.

There are some compatibility problems between XML::Parser version 2.28 and higher and a lot of other modules, most notably XML::DOM. Plus it seems to be doing some funky stuff with UTF-8 strings. Hence I would stick to version 2.27 at the moment.

Update: Activestate distribution currently includes XML::Parser 2.27

Things to know about XML::Parser

Characters are converted to UTF-8

XML::Parser will gladly parse latin-1 (ISO 8859-1) documents provided the XML declaration mentions that encoding. It will convert all characters to UTF-8 though, so outputting latin-1 is tricky. You will need to use Perl's Unicode functions, which have changed recently, so I will postpone detailed instructions until I catch up with them ;--(
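
In the meantime, here is a hedged sketch of one way to get latin-1 back out. It assumes perl 5.8 or later (for the core Encode module) and a version of XML::Parser that hands character strings to the handlers; with older combinations the strings may be raw UTF-8 bytes and this will not do the right thing. The sample document and variable names are made up for illustration.

```perl
use strict;
use warnings;
use XML::Parser;
use Encode qw(encode);

# Hypothetical latin-1 document: \xe9 is "e acute" in ISO-8859-1.
my $xml = qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n<p>caf\xe9</p>};

my $text = '';
my $parser = XML::Parser->new(
    Handlers => { Char => sub { $text .= $_[1] } },   # collect all text
);
$parser->parse($xml);

# The parser gave us the text as Unicode; encode it back to latin-1 bytes.
my $latin1 = encode('iso-8859-1', $text);

binmode STDOUT;            # write the bytes as-is
print $latin1, "\n";
```

Characters that do not exist in latin-1 cannot survive this conversion, so it is only safe when the document really fits in that encoding.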

Catching exceptions

The XML recommendation mandates that when an error is found in the XML the parser stop processing immediately. XML::Parser goes even further: it displays an error message and then dies.

To avoid dying wrap the parse in an eval block:

  eval { $parser->parse( $xml) };   # $xml holds the document to parse
  if( $@)
    { my $error= $@;
      #cleanup
    }
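
A slightly fuller sketch of the same idea, with a deliberately broken document so the error path actually runs. The document and variable names are made up for illustration; ErrorContext is an XML::Parser option that adds a few lines of the offending XML to the error message.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical document that is not well-formed: <p> is never closed.
my $bad_xml = '<doc><p>unclosed paragraph</doc>';

# ErrorContext => 2 asks for 2 lines of context around the error.
my $parser = XML::Parser->new( ErrorContext => 2 );

# Returning 1 from the eval lets us tell success from failure
# without relying on $@ alone.
my $ok    = eval { $parser->parse($bad_xml); 1 };
my $error = $ok ? '' : $@;

print "parse failed: $error\n" unless $ok;
```

Copying $@ right away matters because any later eval, including one hidden in cleanup code, can overwrite it.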

Getting all the character data

The Char handler can be called several times within a single text element. This happens when the text includes new lines or entities, or even seemingly at random, depending on expat's buffering mechanism. So the real content should be built by appending each string passed to Char to a buffer, and that buffer should only be used in the Start and End handlers.

my $stored_content='';             # global

sub Start
  { my( $expat, $gi, %atts)= @_;
    process( $stored_content);     # needed for mixed content such as 
                                   # <p>text <b>bold</b> more text</p>
    $stored_content='';            # needs to be reset 
  }

sub Char
  { my( $expat, $string)= @_;
    $stored_content .= $string;    # can't do much with it
  }

sub End
  { my( $expat, $gi)= @_;
    
    process( $stored_content);     # now it's full
    $stored_content='';            # reset here too
  }

XML::Parser Styles

Styles are handler bundles. 5 styles are defined in XML::Parser, others can be created by users.

Subs

Each time an element starts, a sub by that name is called with the same parameters that the Start handler gets called with.

Each time an element ends, a sub with that name appended with an underscore ("_"), is called with the same parameters that the End handler gets called with.
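
A minimal sketch of the Subs style. The package name MyHandlers, the @log array and the sample document are made up for illustration; Style and Pkg are XML::Parser options, and Pkg tells the style which package to look in for the subs.

```perl
use strict;
use warnings;
use XML::Parser;

package MyHandlers;

our @log;   # illustrative record of which subs were called

# Called when a <title> element starts, with the Start handler's arguments.
sub title  { my ($expat, $gi, %atts) = @_; push @log, "open $gi"; }

# Called when a <title> element ends, with the End handler's arguments.
sub title_ { my ($expat, $gi) = @_; push @log, "close $gi"; }

package main;

my $parser = XML::Parser->new( Style => 'Subs', Pkg => 'MyHandlers' );

# <doc> and <p> have no matching subs, so they are silently skipped.
$parser->parse('<doc><title>Hello</title><p>ignored</p></doc>');

print "$_\n" for @MyHandlers::log;
```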

Tree

Parse will return a parse tree for the document. Each node in the tree takes the form of a tag, content pair. Text nodes are represented with a pseudo-tag of "0" and the string that is their content. For elements, the content is an array reference. The first item in the array is a (possibly empty) hash reference containing attributes.

The remainder of the array is a sequence of tag-content pairs representing the content of the element.
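
A minimal sketch of walking the top level of such a tree. The sample document and variable names are made up for illustration.

```perl
use strict;
use warnings;
use XML::Parser;

my $parser = XML::Parser->new( Style => 'Tree' );
my $tree   = $parser->parse('<p class="x">text <b>bold</b> more</p>');

# The tree is a (tag, content) pair for the root element:
# [ 'p', [ { class => 'x' }, 0, 'text ', 'b', [ {}, 0, 'bold' ], 0, ' more' ] ]
my ($root_tag, $content) = @$tree;

# First item of the content is the attribute hash, the rest are pairs.
my ($atts, @kids) = @$content;

print "root: $root_tag, class: $atts->{class}\n";

# Consume the children two items at a time: tag, then content.
while ( my ($tag, $value) = splice @kids, 0, 2 ) {
    if ( $tag eq '0' ) { print "text: '$value'\n" }      # pseudo-tag 0
    else               { print "element: $tag\n"   }     # $value is an array ref
}
```

Recursing into $value for element children follows the same pattern, since each one is again an attribute hash followed by tag-content pairs.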

Objects

This is similar to the Tree style, except that a hash object is created for each element. The corresponding object will be in the class whose name is created by appending "::" and the element name to the package set with the Pkg option. Non-markup text will be in the ::Characters class, with the string stored in the Text property. The contents of the corresponding object will be in an anonymous array that is the value of the Kids property for that object.
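
A minimal sketch of the Objects style. The package name Doc and the sample document are made up for illustration; the attributes of each element end up as ordinary keys of its hash object, next to Kids.

```perl
use strict;
use warnings;
use XML::Parser;

my $parser = XML::Parser->new( Style => 'Objects', Pkg => 'Doc' );
my $tree   = $parser->parse('<p class="x">text <b>bold</b></p>');

# parse returns an array ref whose single item is the root object.
my $root = $tree->[0];
print "root class: ", ref($root), ", class attribute: $root->{class}\n";

for my $kid ( @{ $root->{Kids} } ) {
    if ( ref($kid) eq 'Doc::Characters' ) {
        print "text: '$kid->{Text}'\n";        # non-markup text
    }
    else {
        print "element: ", ref($kid), "\n";    # e.g. Doc::b
    }
}
```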

Stream

If none of the subs that this style looks for is there, then the effect of parsing with this style is to print a canonical copy of the document without comments or declarations. All the subs receive as their 1st parameter the Expat instance for the document they're parsing.

It looks for the following routines:

StartDocument: called at the start of the parse.
StartTag: called for every start tag with a second parameter of the element type. The $_ variable will contain a copy of the tag and the %_ variable will contain attribute values supplied for that element.
EndTag: called for every end tag with a second parameter of the element type. The $_ variable will contain a copy of the end tag.
Text: called just before start or end tags with accumulated non-markup text in the $_ variable.
PI: called for processing instructions. The $_ variable will contain a copy of the PI and the target and data are sent as 2nd and 3rd parameters respectively.
EndDocument: called at conclusion of the parse.
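
A minimal sketch of the Stream style that copies the markup unchanged and upper-cases the text. The package name Upcase, the $out buffer and the sample document are made up for illustration. StartDocument, EndDocument and PI are left undefined here, which is fine for a document with no processing instructions.

```perl
use strict;
use warnings;
use XML::Parser;

package Upcase;

our $out = '';   # illustrative output buffer

sub StartTag { $out .= $_ }       # $_ is a copy of the start tag
sub EndTag   { $out .= $_ }       # $_ is a copy of the end tag
sub Text     { $out .= uc $_ }    # $_ is the accumulated text

package main;

my $parser = XML::Parser->new( Style => 'Stream', Pkg => 'Upcase' );
$parser->parse('<p>text <b>bold</b></p>');

print "$Upcase::out\n";
```

Because Text receives the accumulated text, there is no need to glue Char chunks together by hand in this style.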

Debug

This just prints out the document in outline form.

Re: XML::Parser by mirod (Canon) on Aug 11, 2003 at 16:24 UTC
Note that you should NOT use XML::Parser any longer. The reason is best described by matts, the current maintainer, in a post on the perl-xml mailing list: in short, if you want low-level XML parsing you should use SAX, not XML::Parser.
That UTF pain... by yosefm (Friar) on Aug 11, 2003 at 16:04 UTC
I encountered this problem (before I knew about it :-( ) too. I'd like to point out that if you only work with one language (in the XML input and in your output) it's very easy to bypass this using Text::Iconv. I did it as a simple conversion sub that I called for outputting data from the XML. I guess this could be done with a handler too, but I haven't tried it yet. Here's my sub:

  use Text::Iconv;

  sub unUTF8
    { my $conv = Text::Iconv->new("UTF-8", "iso-8859-8");   # that's Hebrew
      return $conv->convert(shift);
    }
Re: That UTF pain... by Aristotle (Chancellor) on Aug 11, 2003 at 21:11 UTC
If you need this frequently, it should probably be

  { my $conv;                      # created once, reused on every call
    sub unUTF8
      { $conv ||= Text::Iconv->new("UTF-8", "iso-8859-8");
        return $conv->convert(shift);
      }
  }

instead. Makeshifts last the longest.
Re: That UTF pain... by mirod (Canon) on Aug 11, 2003 at 16:34 UTC
In perl 5.8.* you can also use Encode, which provides encoding/decoding methods. You can also have a look at Converting character encodings for additional ways of doing this (the regexp method might not work with recent versions of perl and/or XML::Parser). XML::Twig also lets you work in the original encoding for the document, by using the keep_encoding option. Finally, if there is any way for you to work in UTF-8, it is probably a good idea. Note that most Web browsers, databases and mail agents now support it, most editors and terminals too, not to mention perl 5.8.*