PerlMonks  

XML::Parser question

by primus (Scribe)
on Feb 07, 2003 at 05:35 UTC ( [id://233368]=perlquestion )

primus has asked for the wisdom of the Perl Monks concerning the following question:

hail monks,

i am using XML::Parser and in the xml file there is some html, for example:
<item>
  <title>test</title>
  <description><b><font color=#"dd0000">temp</a></font></b> try</description>
  <link>http://www.nowhere.com</link>
</item>
how can i use XML::Parser and just strip out the data between the <title>*</title> tag and the <link>*</link> while ignoring everything in the description tag?

thanks for the help monks

Replies are listed 'Best First'.
Re: XML::Parser question
by Coruscate (Sexton) on Feb 07, 2003 at 06:17 UTC

    For starters, the XML::XXXXX modules probably won't parse that "xml" document in the first place. You have a </a> tag in there with no opening tag. Deleting that will fix the xml-format problem, but as for the html side of it, the <font> is formatted incorrectly. The pound sign (#) should be inside the quotes. XML::XXXXX won't complain about the latter one though.

    If you vaporize that </a> tag, then you might find XML::Simple to be enough for this task:

    #!/usr/bin/perl -w
    use strict;
    use XML::Simple;

    my $data = XMLin(qq{
    <item>
      <title>test</title>
      <description><b><font color="#dd0000">temp</font></b> try</description>
      <link>http://www.nowhere.com</link>
    </item>
    });

    print $data->{'title'}, " | ", $data->{'link'}, "\n";


    If the above content is missing any vital points or you feel that any of the information is misleading, incorrect or irrelevant, please feel free to downvote the post. At the same time, reply to this node or /msg me to tell me what is wrong with the post, so that I may update the node to the best of my ability. If you do not inform me as to why the post deserved a downvote, your vote does not have any significance and will be disregarded.

Re: XML::Parser question
by AcidHawk (Vicar) on Feb 07, 2003 at 06:16 UTC

    Have a look at the almost duplicate question XML::parser question.

    I just typed 'XML tag only' into Super Search, checked only SOPW and "Don't include replies"... there are a lot of hits.

    -----
    Of all the things I've lost in my life, its my mind I miss the most.
Re: XML::Parser question
by mirod (Canon) on Feb 07, 2003 at 11:12 UTC

    The proper way to include non-xml data (that make the XML not-well-formed and thus the parser die) is to escape it.

    There are 2 ways to do this: one is to use entities to replace all '<' and '&'. This is quite easy to generate but is a pain and makes it hard to get back the HTML as markup. The other way is to use CDATA sections. A CDATA section is a fragment of XML that is pretty much skipped by the parser, so it can include markup, as long as it does not include the end-of-cdata-section marker. It makes it quite easy to output the HTML fragment back as markup if you only use CDATA sections for this specific purpose.
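A minimal sketch of both options in plain Perl, with no modules. The helper names `escape_xml` and `wrap_cdata` are made up for illustration. The CDATA helper also splits any `]]>` found in the text across two sections, which is the usual workaround for the end-of-section marker restriction.

```perl
use strict;
use warnings;

# Option 1 (hypothetical helper): replace markup-significant characters
# with entities. '&' has to be handled first, otherwise the '&' of the
# entities produced for '<' and '>' would be escaped a second time.
sub escape_xml {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Option 2 (hypothetical helper): wrap the text in a CDATA section.
# A literal "]]>" in the text would end the section early, so it is
# split across two adjacent CDATA sections.
sub wrap_cdata {
    my ($text) = @_;
    $text =~ s/\]\]>/]]]]><![CDATA[>/g;
    return '<![CDATA[' . $text . ']]>';
}

# The HTML fragment from the question, left exactly as broken as it was.
my $html = q{<b><font color=#"dd0000">temp</a></font></b> try};

print '<description>', escape_xml($html), "</description>\n";
print '<description>', wrap_cdata($html), "</description>\n";
```

Either line of output can sit inside a well-formed document; the CDATA form is the easier one to read and to turn back into HTML.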

    Your file would then be:

    <item>
      <title>test</title>
      <description><![CDATA[<b><font color=#"dd0000">temp</a></font></b> try]]></description>
      <link>http://www.nowhere.com</link>
    </item>
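With the description protected like this, the document parses, and the original question can be answered with XML::Parser itself. The following is only a sketch: it assumes a single item, and the `%wanted` and `%found` hashes are illustrative names. The Char handler asks the parser which element it is currently inside and keeps text only for title and link, so everything in the description is dropped.

```perl
use strict;
use warnings;
use XML::Parser;

my $xml = <<'XML';
<item>
  <title>test</title>
  <description><![CDATA[<b><font color=#"dd0000">temp</a></font></b> try]]></description>
  <link>http://www.nowhere.com</link>
</item>
XML

# Elements whose text we want to keep (illustrative names).
my %wanted = map { $_ => 1 } qw(title link);
my %found;

my $parser = XML::Parser->new(
    Handlers => {
        # Called for every chunk of character data, CDATA content included.
        Char => sub {
            my ($expat, $text) = @_;
            my $element = $expat->current_element;
            return unless defined $element && $wanted{$element};
            $found{$element} .= $text;
        },
    },
);

$parser->parse($xml);

print "$found{title} | $found{link}\n";
```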

    Note that this does not mean that the parser will completely ignore the section though, it will just consider it as non-markup. One often overlooked consequence is that you need the encoding of characters in the section to be the same as in the rest of the document, and the same as defined in the encoding attribute of the XML declaration. By default this is UTF-8 or UTF-16, so if the HTML is likely to be in another encoding you will have to either convert it prior to including it in the XML document, or to have the entire document be in this encoding.
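As a sketch of the "convert it first" option, assuming the HTML fragment arrives as ISO-8859-1 bytes (the sample string and variable names are invented for the example), the core Encode module can re-encode it to UTF-8 before it goes into the CDATA section:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: an HTML fragment in ISO-8859-1.
# "\xE9" is the single Latin-1 byte for e-acute.
my $latin1_html = "<b>caf\xE9</b>";

# Bytes -> Perl characters -> UTF-8 bytes.
my $chars     = decode('ISO-8859-1', $latin1_html);
my $utf8_html = encode('UTF-8', $chars);

# The declaration and the CDATA content now agree on the encoding.
my $xml = qq{<?xml version="1.0" encoding="UTF-8"?>\n}
        . '<description><![CDATA[' . $utf8_html . "]]></description>\n";

print $xml;
```

The alternative mentioned above is to leave the bytes alone and declare encoding="ISO-8859-1" for the whole document, provided the parser in use supports that encoding.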

    Finally, you really should not use XML::Parser, but rather either a higher-level module, based on XML::Parser, a libxml2-based module such as XML::LibXML or a SAX module, that will let you choose your parser (I must confess I do not know how CDATA sections are supported in SAX2 though).

    Oh, and do I really need to mention that XML::Twig has a method that would work quite well in this case? ;--) $elt->remove_cdata turns all CDATA sections in the element into regular mark-up (actually you cannot access individual elements within the CDATA section, but when you output it it skips the CDATA markers, and you should get the result you want).


      thank you monks for the help, the only thing which i suppose i should have stated earlier, is that i do not have control over the formatting of the xml... i am pulling the xml from an outside source, and i kinda get what they give me... i hope i can apply some of this to that. thanks again.

        Oh my! Not again!

        If what you get is really what you describe, then do yourself (and your text-in-pointy-brackets provider) a favor: don't call it XML. And write (or have your provider write) a hundred times "If it does not parse, then it is NOT XML". Whether the reason is messed up tags, an encoding problem or anything else, they have no business calling it XML if an XML parser doesn't say that it is well-formed.

        Once you have realized this, it then makes sense that, as you are not processing XML, you cannot use XML tools. At least not directly. You need first to convert the data you get into real XML, or even better, have the source provide real XML, make sure it is OK by parsing it, and then you can use an XML module.
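One possible shape for that conversion step, as a sketch only: it assumes the only broken markup lives inside description elements, that those tags carry no attributes, and that they are never nested. The helper names are hypothetical. The well-formedness check runs only if XML::Parser happens to be installed.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap text in a CDATA section, splitting any "]]>".
sub wrap_cdata {
    my ($text) = @_;
    $text =~ s/\]\]>/]]]]><![CDATA[>/g;
    return '<![CDATA[' . $text . ']]>';
}

# Hypothetical helper: protect the body of every <description> element.
# Assumes plain <description>...</description> tags, never nested.
sub protect_descriptions {
    my ($raw) = @_;
    $raw =~ s{<description>(.*?)</description>}
             {'<description>' . wrap_cdata($1) . '</description>'}gse;
    return $raw;
}

# What the outside source sends (the sample from the question).
my $raw = q{<item> <title>test</title> <description><b><font color=#"dd0000">temp</a></font></b> try</description> <link>http://www.nowhere.com</link> </item>};

my $fixed = protect_descriptions($raw);
print "$fixed\n";

# Make sure the result really is XML before handing it to an XML module.
if (eval { require XML::Parser; 1 }) {
    my $ok = eval { XML::Parser->new->parse($fixed); 1 };
    print $ok ? "well-formed\n" : "still not well-formed: $@";
}
```

If the check fails, the feed is broken in some other way as well, and it is better to find that out here than halfway through processing.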

Re: XML::Parser question
by jammin (Novice) on Feb 07, 2003 at 10:23 UTC

    Yeah, it's not well formed and the parser won't like it; once you've fixed that it should work fine. If you're planning on doing a lot of HTML parsing you'll find that most HTML doesn't conform to XML standards, so you will have this issue a lot.

    I would recommend that you use either a simple regular expression like /\<title\>([^>]+)</ instead.
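A small sketch of the regular-expression route against the sample from the question. It uses `[^<]+` rather than `[^>]+` so the capture stops at the next tag without relying on backtracking, and it assumes one item per string and no markup, CDATA or entities inside title or link.

```perl
use strict;
use warnings;

# The data exactly as received, stray </a> and misplaced '#' included.
my $raw = q{<item> <title>test</title> <description><b><font color=#"dd0000">temp</a></font></b> try</description> <link>http://www.nowhere.com</link> </item>};

# In list context a successful match returns the captured text.
my ($title) = $raw =~ m{<title>([^<]+)</title>};
my ($link)  = $raw =~ m{<link>([^<]+)</link>};

if (defined $title && defined $link) {
    print "$title | $link\n";
} else {
    print "title or link not found\n";
}
```

This works on text that is not XML at all, which is both its appeal and its weakness: nothing checks that the tags found are the ones you think they are.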

    Or you could 'use HTML::TokeParser;'. Great tool for looking through tags of HTML (or XML).
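A sketch of the HTML::TokeParser approach on the same sample, assuming a single item in the string. The tokenizer does not care that the description is malformed; it just walks forward to the tags asked for.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# The data exactly as received, stray </a> and misplaced '#' included.
my $raw = q{<item> <title>test</title> <description><b><font color=#"dd0000">temp</a></font></b> try</description> <link>http://www.nowhere.com</link> </item>};

# Passing a scalar reference makes the parser read from the string.
my $p = HTML::TokeParser->new(\$raw) or die "cannot set up parser";

my ($title, $link);

# get_tag skips ahead to the next start tag with that name;
# get_trimmed_text then collects the text up to the given end tag.
$title = $p->get_trimmed_text('/title') if $p->get_tag('title');
$link  = $p->get_trimmed_text('/link')  if $p->get_tag('link');

print "$title | $link\n" if defined $title && defined $link;
```

The tags are read in document order here, so title has to come before link; for feeds where the order varies, a loop over get_tag('title', 'link') would be the safer shape.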

    Good luck!

Node Type: perlquestion [id://233368]
Approved by Paladin