Description
XML::Parser provides ways to parse XML documents.
- built on top of XML::Parser::Expat, a lower level
interface to James Clark's expat library
- most of the other XML modules are built on top of
XML::Parser
- stream-oriented
- for each event found while parsing a document a
user-defined handler can be called
- events are start and end tags, text, but also
comments, processing instructions, CDATA, entities,
element or attribute declarations in the DTD...
- handlers receive the parser object and context
information
- sets of pre-defined handlers can be used as
Styles
- A companion module, XML::Encoding, allows
XML::Parser to parse XML documents in various
encodings, besides the native UTF-8, UTF-16 and
ISO-8859-1 (latin 1)
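To make the handler model concrete, here is a minimal sketch that registers Start, End and Char handlers and records each event as it arrives. The sample document and the @events array are made up for illustration.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample document, for illustration only.
my $xml = '<doc><title id="t1">Hello</title><p>World</p></doc>';

my @events;    # every event, in the order the parser reports it

my $parser = XML::Parser->new(
    Handlers => {
        # each handler receives the expat object first, then the event data
        Start => sub { my( $expat, $gi, %atts)= @_; push @events, "start $gi"; },
        End   => sub { my( $expat, $gi)= @_;        push @events, "end $gi"; },
        Char  => sub { my( $expat, $string)= @_;    push @events, "text $string"; },
    },
);

$parser->parse( $xml);
print "$_\n" for @events;
```

Handlers for the other events (comments, processing instructions, DTD declarations...) are registered the same way, under their own names in the Handlers hash.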
Why use XML::Parser
- it is widely used and, as the first XML module,
very robust
- you need performance: it is low level, so modules
built on top of it are necessarily slower
- you need access to some parsing events that are
masked by higher-level modules
- one of the Styles does exactly what you want
- if you want to write your own module based on
XML::Parser
Why NOT use XML::Parser
Related Modules
Besides the modules already mentioned:
- XML::UM can translate characters between various
encodings,
- XML::Checker is a validating parser that just
replaces XML::Parser
Personal comments
XML::Parser is the basis of most XML processing in Perl.
Even if you don't plan to use it directly, you should at
least know how to use it if you are working with XML.
That said, I think that it is usually a good idea to have a look at the various modules that sub-class XML::Parser, as they are usually easier to use.
There are some compatibility problems between XML::Parser version 2.28 and higher and a lot of other modules, most notably XML::DOM.
Plus it seems to be doing some funky stuff with UTF-8 strings.
Hence I would stick to version 2.27 at the moment.
Update: the ActiveState distribution currently includes XML::Parser 2.27.
Things to know about XML::Parser
Characters are converted to UTF-8
XML::Parser will gladly parse latin-1 (ISO 8859-1) documents, provided the XML declaration mentions that encoding. It will convert all characters to UTF-8 though, so
outputting latin-1 is tricky. You will need to use Perl's
Unicode functions, which have changed recently, so I will postpone detailed instructions until I catch up with them ;--(
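On Perl 5.8 and later, the core Encode module is one way to get latin-1 back out. This is only a sketch, assuming a Perl and an XML::Parser recent enough that the strings passed to handlers are proper character strings. The sample document is made up.

```perl
use strict;
use warnings;
use XML::Parser;
use Encode qw(encode);

# Hypothetical latin-1 document: \xe9 is "e acute" as a single latin-1 byte.
my $xml = qq{<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\xe9</p>};

my $text = '';
my $parser = XML::Parser->new(
    Handlers => { Char => sub { my( $expat, $string)= @_; $text .= $string; } },
);
$parser->parse( $xml);

# $text now holds characters; encode() turns them into bytes in a given encoding
my $latin1 = encode( 'ISO-8859-1', $text);   # one byte for the accented char
my $utf8   = encode( 'UTF-8',      $text);   # two bytes for the accented char
```

Characters that do not exist in latin-1 cannot be encoded this way, so check the Encode documentation for how you want those handled before relying on this.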
Catching exceptions
The XML recommendation mandates that when an error is
found in the XML the parser stops processing
immediately. XML::Parser goes even further: it
displays an error message and then dies.
To avoid dying, wrap the parse in an
eval block:
eval { $parser->parse( $xml) };   # $xml holds the document to parse
if( $@)
  { my $error= $@;
    # cleanup
  }
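Here is a complete version you can run, as a sketch: the malformed document is made up, and the `1` at the end of the eval block is just a common idiom to tell success from failure without testing $@ directly.

```perl
use strict;
use warnings;
use XML::Parser;

my $parser = XML::Parser->new;

# Deliberately not well-formed: <b> is never closed before </a>.
my $bad_xml = '<a><b></a>';

# eval returns 1 only if parse() did not die
my $ok    = eval { $parser->parse( $bad_xml); 1 };
my $error = $ok ? '' : $@;

if( !$ok)
  { warn "parse failed: $error";
    # cleanup goes here
  }
```

The message in $error comes from expat and includes the position of the error in the document.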
Getting all the character data
The Char handler can be called several times
within a single text element. This happens when the
text includes new lines, entities or even at random,
depending on expat's buffering mechanism. So
the real content should actually be built by appending
each string passed to Char, and by using it only in the
Start and End handlers.
my $stored_content='';                # global

sub Start
  { my( $expat, $gi, %atts)= @_;
    process( $stored_content);        # needed for mixed content such as
                                      # <p>text <b>bold</b> more text</p>
    $stored_content='';               # needs to be reset
  }

sub Char
  { my( $expat, $string)= @_;
    $stored_content .= $string;       # can't do much with it
  }

sub End
  { my( $expat, $gi)= @_;
    process( $stored_content);        # now it's full
    $stored_content='';               # reset here too
  }
XML::Parser Styles
Styles are handler bundles. Five styles are defined in
XML::Parser; others can be created by users.
Subs
Each time an element starts, a sub by that name is
called with the same parameters that the Start handler
gets called with.
Each time an element ends, a sub with that name
appended with an underscore ("_"), is
called with the same parameters that the
End handler gets called with.
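A small sketch of the Subs style. The package name My::Subs and the sample document are made up, and the Pkg option tells XML::Parser where to look for the subs. Elements with no matching sub (doc here) are simply skipped.

```perl
use strict;
use warnings;
use XML::Parser;

my @log;

# Hypothetical handler package: sub names match element names.
package My::Subs;
sub title  { my( $expat, $gi, %atts)= @_; push @log, "open $gi id=$atts{id}"; }
sub title_ { my( $expat, $gi)= @_;        push @log, "close $gi"; }

package main;

my $parser = XML::Parser->new( Style => 'Subs', Pkg => 'My::Subs');
$parser->parse( '<doc><title id="t1">Hello</title></doc>');

print "$_\n" for @log;
```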
Tree
Parse will return a parse tree for the document. Each
node in the tree takes the form of a tag, content
pair. Text nodes are represented with a pseudo-tag of
"0" and the string that is their content.
For elements, the content is an array reference. The
first item in the array is a (possibly empty) hash
reference containing attributes.
The remainder of the array is a sequence of
tag-content pairs representing the content of the
element.
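The description is easier to follow with a tree in front of you. A minimal sketch, with a made-up document:

```perl
use strict;
use warnings;
use XML::Parser;

my $parser = XML::Parser->new( Style => 'Tree');
my $tree   = $parser->parse( '<p class="x">text <b>bold</b></p>');

# $tree has this shape:
#   [ 'p', [ { class => 'x' },        # attributes of <p>
#            0, 'text ',              # text node: pseudo-tag 0, then the string
#            'b', [ {}, 0, 'bold' ],  # <b>: empty attribute hash, one text node
#          ] ]
my( $tag, $content)= @$tree;      # root tag and its content
my( $atts, @kids)= @$content;     # attribute hash, then tag/content pairs

print "root: $tag, class: $atts->{class}\n";
```

Walking the tree means stepping through @kids two items at a time and recursing whenever the content is an array reference.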
Objects
This is similar to the Tree style, except that a hash
object is created for each element. The corresponding
object will be in the class whose name is created by
appending "::" and the element name to the package set
with the Pkg option.
Non-markup text will be in the ::Characters class.
The contents of the corresponding object will be in
an anonymous array that is the value of the Kids
property for that object.
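A sketch of what comes back from the Objects style. The package name My::Doc and the document are made up. Attributes end up as keys of the element's hash, and the text of a Characters object sits under its Text key.

```perl
use strict;
use warnings;
use XML::Parser;

my $parser = XML::Parser->new( Style => 'Objects', Pkg => 'My::Doc');
my $tree   = $parser->parse( '<doc><title id="t1">Hello</title></doc>');

my $doc   = $tree->[0];            # parse returns an array ref holding the root object
my $title = $doc->{Kids}[0];       # first child of <doc>
my $text  = $title->{Kids}[0];     # the text inside <title>

print ref( $doc),   "\n";          # class of the root object
print ref( $title), " id=$title->{id}\n";
print ref( $text),  " '$text->{Text}'\n";
```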
Stream
If none of the subs that this style looks for is
there, then the effect of parsing with this style is
to print a canonical copy of the document without
comments or declarations. All the subs receive as
their 1st parameter the Expat instance for the
document they're parsing.
It looks for the following routines:
- StartDocument: called at the start of
the parse.
- StartTag: called for every start tag
with a second parameter of the element type.
The $_ variable will contain a copy of
the tag and the %_ variable will contain
attribute values supplied for that element.
- EndTag: called for every end tag with a
second parameter of the element type. The $_
variable will contain a copy of the end tag.
- Text: called just before start or end
tags with accumulated non-markup text in the $_
variable.
- PI: called for processing instructions.
The $_ variable will contain a copy of the PI and
the target and data are sent as 2nd and
3rd parameters respectively.
- EndDocument: called at conclusion of the
parse.
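A sketch of the Stream style with three of those subs defined. The package name My::Stream and the document are made up. Any sub you leave out falls back to the default behaviour of printing the corresponding markup or text.

```perl
use strict;
use warnings;
use XML::Parser;

my @seen;

# Hypothetical package holding the Stream subs.
package My::Stream;

sub StartTag
  { my( $expat, $gi)= @_;
    my $class= exists $_{class} ? $_{class} : '-';   # %_ holds the attributes
    push @seen, "StartTag $gi class=$class tag=$_";  # $_ holds the tag itself
  }

sub EndTag
  { my( $expat, $gi)= @_;
    push @seen, "EndTag $gi tag=$_";
  }

sub Text
  { push @seen, "Text $_"; }                         # accumulated text is in $_

package main;

my $parser = XML::Parser->new( Style => 'Stream', Pkg => 'My::Stream');
$parser->parse( '<p class="x">Hi <b>there</b></p>');

print "$_\n" for @seen;
```

Note that Text gets the whole accumulated string in one call, so you do not have to do the buffering described above for the Char handler.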
Debug
This just prints out the document in outline form.
In reply to XML::Parser
by mirod