PerlMonks  

How to convert HTML to XML w/ Perl?

by lacika (Initiate)
on Feb 07, 2004 at 01:18 UTC

lacika has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone, I'm having trouble converting an HTML webpage to XML. I know there is a Perl module out there capable of doing this, I just don't know which one. Fetching the HTML from the web is pretty straightforward; the only thing I need is to output the fetched results as XML. I'm an initiate and a little lost, and I just can't do this by myself. I humbly ask anyone to help with anything that might solve this problem. Many thanks in advance. P.S. Great service you guys run here :-)

Replies are listed 'Best First'.
Re: How to convert HTML to XML w/ Perl?
by Zaxo (Archbishop) on Feb 07, 2004 at 01:29 UTC

    Probably the simplest way is to use the -asxml flag of tidy. Tidy itself is written in C rather than Perl, but there is a Perl wrapper for TidyLib called HTML::Tidy.

        $ tidy -asxml foo.html > foo.xml
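
    For staying inside Perl, the HTML::Tidy wrapper mentioned above can do roughly the same job as the command line. The following is only a sketch: it assumes HTML::Tidy is installed from CPAN, that the output_xhtml option (tidy's "output-xhtml" setting) is the counterpart of -asxml, and the sub name html_to_xhtml and the sample markup are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Tidy;

# Hypothetical helper: turn a (possibly sloppy) HTML string into XHTML,
# which is well-formed XML.
sub html_to_xhtml {
    my ($html) = @_;

    # Options other than config_file are passed through to tidy itself.
    my $tidy = HTML::Tidy->new({
        output_xhtml => 1,   # ask tidy for XHTML output
        tidy_mark    => 0,   # do not add a "generator" meta tag
    });

    # clean() returns the repaired document as a single string.
    return $tidy->clean($html);
}

# Sample input with an unclosed <p> and a bare <br>.
my $xhtml = html_to_xhtml(
    '<html><head><title>Demo</title></head>'
  . '<body><p>Hello<br>world</body></html>'
);
print $xhtml;
```

    In a real script the string passed in would be the page you already fetched, and the result could be written to a file instead of printed.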

    After Compline,
    Zaxo

Re: How to convert HTML to XML w/ Perl?
by arturo (Vicar) on Feb 07, 2004 at 03:24 UTC

    While we're suggesting modules, I'd also point to libxml2 and its associated utilities, which are probably already installed if you have a recentish Linux installation, and are available for download if not. It also has an associated Perl module, XML::LibXML. The bonus is that, once that is installed, you can also process the resulting XML with Perl. The drawback compared to the tidy-based approach is that the libxml2 code is more generic, so you'd have to do some work to get DOCTYPE lines to come out correctly; on the other hand, libxml2 has a wider area of application.
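
    A minimal sketch of the XML::LibXML route, under a few assumptions: the module is installed and recent enough to provide load_html, the HTML is already in a string, and the sub name html_to_xml and the sample markup are invented for illustration. The HTML parser is asked to recover from errors, and the resulting document is then serialized as XML with toString rather than as HTML.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: parse HTML with libxml2's forgiving HTML parser
# and serialize the resulting tree as XML.
sub html_to_xml {
    my ($html) = @_;

    my $doc = XML::LibXML->load_html(
        string  => $html,
        recover => 2,        # recover from markup errors without warnings
    );

    # toString() writes XML; toStringHTML() would write HTML instead.
    # The argument 1 asks for indented output.
    return $doc->toString(1);
}

# Sample input with an unclosed <p> and a bare <br>.
my $xml = html_to_xml(
    '<html><head><title>Demo</title></head>'
  . '<body><p>Hello<br>world</body></html>'
);
print $xml;
```

    The DOCTYPE that comes out is whatever libxml2 assigns to the parsed HTML, which is the part that may need extra work, as noted above.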

    If not P, what? Q maybe?
    "Sidney Morgenbesser"

Re: How to convert HTML to XML w/ Perl?
by skillet-thief (Friar) on Feb 07, 2004 at 14:14 UTC

    The HTML::Tree suite seems to have some XML capabilities. HTML::Element has an XML dump method: $h->as_XML(), which might be a first step, depending on what you want to do.
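
    As a rough sketch of that first step, assuming HTML::Tree is installed and the HTML is already in a string (the sub name tree_to_xml and the sample markup are made up for illustration):

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: build an HTML::Element tree from a string and
# dump it with as_XML.
sub tree_to_xml {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $xml  = $tree->as_XML;

    # Trees hold circular references, so free them explicitly.
    $tree->delete;

    return $xml;
}

# Sample input with an unclosed <p> and a bare <br>.
my $tree_xml = tree_to_xml(
    '<html><head><title>Demo</title></head>'
  . '<body><p>Hello<br>world</body></html>'
);
print $tree_xml;
```

    Note that as_XML only dumps the element tree; it does not add an XML declaration or a DOCTYPE on its own.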

    There is also an HTML::DOMbo module, which turns your HTML tree into an XML tree and, AFAICS, lets you use all of the DOM tools you want on it.

    While I have been using HTML::Tree a lot recently (and I highly recommend it for doing most anything with HTML), I haven't experimented with the XML stuff yet. But it seems promising.

      First of all, I would like to thank you guys, Zaxo, Arturo, Anonymous Monk and Skillet Thief for the prompt response. I will try the modules you suggested and hopefully come back with a big smile on my face. Your help was very much appreciated! See you soon!
