convert whole html-files to xml

dschinn1001 has asked for the wisdom of the Perl Monks concerning the following question:

Hi dear monks!

I am a bit dazzled by the code world of Perl (after scripting
HTML ...). I now want to convert whole HTML files
to XML files with the Perl module HTML::Tiny.
I have read the explanations in HTML::Tiny (the short doc there),
but I don't quite understand what I have to type with the
perl command and the module HTML::Tiny to convert
test01.html to test01.xml. Thanks for any answer, and no
offense meant for my somewhat newbie question.

So far I have understood that I need to install cpan (as sudo) and then HTML::Tiny (I also installed XML::Simple).

Now I only need a converter for this. I found one on IBM's public site;
here it is (modified by me, but it is not correct):

Should HTMLin be substituted by 'test001.html'?

The problem is that this script cannot find 'HTMLin'.

There is only the file test001.html.
What should I write now?

Thanks very much!

#!/usr/bin/perl -w
use strict;
use HTML::Tiny;
use Data::Dumper;
my $Tiny = HTML::Tiny->new();
my $data   = $Tiny->HTMLin('test001.xml');
# DEBUG
print Dumper($data) . "\n";
# END

Not to forget, here is the link from IntelligentBigMamah (IBM):

http://www.ibm.com/developerworks/xml/library/x-xmlperl1/index.html

I saw tobyink's answer only after I wrote the above ... hm.


Replies are listed 'Best First'.
Re: convert whole html-files to xml by tobyink (Canon) on Sep 15, 2013 at 08:09 UTC
With HTML::Tiny? HTML::Tiny doesn't seem like an appropriate choice. To begin with, it has no support for parsing HTML!

Here's an example using HTML::HTML5::Parser and HTML::HTML5::Writer, two modules that I wrote, which are available on the CPAN.

#!/usr/bin/env perl

use strict;
use warnings;
use HTML::HTML5::Parser;
use HTML::HTML5::Writer qw(DOCTYPE_XHTML1);

my $parser = 'HTML::HTML5::Parser'->new;
my $writer = 'HTML::HTML5::Writer'->new(markup => 'xhtml', doctype => DOCTYPE_XHTML1);

print $writer->document(
    $parser->load_html(IO => \*DATA)
);

__DATA__
<!doctype html>
<HTML LANG="en">
<title>Some HTML</TITLE>
<P>Here is some HTML</body>

Adding XML::LibXML::PrettyPrint into the mix allows you to tidy up the XHTML output, nicely indenting the tags:

#!/usr/bin/env perl

use strict;
use warnings;
use HTML::HTML5::Parser;
use HTML::HTML5::Writer qw(DOCTYPE_XHTML1);
use XML::LibXML::PrettyPrint;

my $parser = 'HTML::HTML5::Parser'->new;
my $writer = 'HTML::HTML5::Writer'->new(markup => 'xhtml', doctype => DOCTYPE_XHTML1);
my $pp     = 'XML::LibXML::PrettyPrint'->new_for_html;

print $writer->document(
    $pp->pretty_print(
        $parser->load_html(IO => \*DATA),
    ),
);

__DATA__
<!doctype html>
<HTML LANG="en">
<title>Some HTML</TITLE>
<P>Here is some HTML</body>

Sample output:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html lang="en" xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>Some HTML</title>
    </head>
    <body>
        <p>Here is some HTML</p>
    </body>
</html>

(I really need to look at adding a line break after the doctype.)

`use Moops; class Cow :rw { has name => (default => 'Ermintrude') }; say Cow->new->name`
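The examples above read their HTML from the `__DATA__` section. For the file-to-file case in the original question, a minimal sketch might look like the following. It assumes HTML::HTML5::Parser's `parse_file` method is used to read the input from disk. The helper names `xml_name_for` and `convert_file`, and the rule of swapping the `.html` extension for `.xml`, are made up for illustration and are not part of either module.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::HTML5::Parser;
use HTML::HTML5::Writer qw(DOCTYPE_XHTML1);

# Hypothetical helper: derive the output name, e.g. test001.html -> test001.xml.
sub xml_name_for {
    my ($html_name) = @_;
    my $xml_name = $html_name;
    # Swap a trailing .htm/.html for .xml; otherwise just append .xml.
    $xml_name =~ s/\.html?\z/.xml/i or $xml_name .= '.xml';
    return $xml_name;
}

# Hypothetical helper: parse one HTML file and write it back out as XHTML.
sub convert_file {
    my ($in) = @_;
    my $out = xml_name_for($in);

    my $parser = HTML::HTML5::Parser->new;
    my $writer = HTML::HTML5::Writer->new(
        markup  => 'xhtml',
        doctype => DOCTYPE_XHTML1,
    );

    # parse_file takes a file name and returns an XML::LibXML document.
    my $dom = $parser->parse_file($in);

    # Assumption: the writer returns a character string, so encode on output.
    open my $fh, '>:encoding(UTF-8)', $out
        or die "Cannot write $out: $!";
    print {$fh} $writer->document($dom);
    close $fh or die "Cannot close $out: $!";

    return $out;
}

# Convert every file named on the command line.
print "wrote ", convert_file($_), "\n" for @ARGV;
```

Saved under a name such as `html2xml.pl` (also made up), it could be run as `perl html2xml.pl test001.html` and would write `test001.xml` next to the input. Passing several file names covers the batch case as well.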
Re^2: convert whole html-files to xml by dschinn1001 (Initiate) on Sep 15, 2013 at 09:23 UTC
I have updated my q but had not seen your answer. Sometimes I am blind ... Your answer hits the mark! I will test it today ... thanks a lot.
Re^3: convert whole html-files to xml by tobyink (Canon) on Sep 15, 2013 at 10:17 UTC
"have updated my q"

Your question seems to make the assumption that HTML::Tiny works identically to XML::Simple but processes HTML instead of XML. It does not. The two modules are completely unrelated. Applying knowledge from an XML::Simple article to HTML::Tiny will not do what you want.
Re^4: convert whole html-files to xml by dschinn1001 (Initiate) on Sep 15, 2013 at 15:54 UTC
Re^5: convert whole html-files to xml by Anonymous Monk on Sep 15, 2013 at 21:28 UTC
Some notes below your chosen depth have not been shown here
Re: convert whole html-files to xml by kcott (Archbishop) on Sep 15, 2013 at 04:09 UTC
G'day dschinn1001,

Welcome to the monastery.

While I haven't used HTML::Tiny, looking at its documentation, I see it's for generating HTML and XML, not for converting HTML to XML. I don't think this is the tool for the task you describe.

Have a look at the guidelines in "How do I post a question effectively?" and then post a minimal extract of `test01.html` along with the corresponding conversion you want in `test01.xml`. When we have that, we'll be in a better position to advise what the appropriate tools are.

It would also be useful to know if this is a one-off exercise, if you have thousands of HTML documents that need converting and you're looking to automate the process, or something in between.

-- Ken
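The point that HTML::Tiny generates markup rather than parsing it can be shown with a small sketch, assuming HTML::Tiny and XML::Simple are both installed from CPAN. XML::Simple's `XMLin` reads XML into a Perl data structure, while HTML::Tiny only builds markup strings from Perl calls. The sample XML string and the variable names are made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Simple qw(XMLin);
use HTML::Tiny;
use Data::Dumper;

# XML::Simple parses XML text (or a file) into nested Perl data.
my $data = XMLin('<doc><title>Some XML</title></doc>');
print Dumper($data);

# HTML::Tiny goes the other way: Perl calls in, markup string out.
my $h      = HTML::Tiny->new;
my $markup = $h->p('Here is some HTML');
print $markup, "\n";

# HTML::Tiny has no parsing method, which is why the script in the
# question fails on its call to HTMLin.
print "HTML::Tiny can HTMLin? ",
    ( HTML::Tiny->can('HTMLin') ? 'yes' : 'no' ), "\n";
```

Nothing in HTML::Tiny reads an existing document, so an HTML-to-XML conversion needs a module with a real HTML parser.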
Re: convert whole html-files to xml by Anonymous Monk on Sep 15, 2013 at 03:55 UTC
Well, first thing you do is forget about doing that, and you're done -- html isn't xml and vice versa

Oh look, HTML::Tidy::libXML
Re^2: convert whole html-files to xml by CountZero (Bishop) on Sep 15, 2013 at 07:10 UTC
Not at all! HTML is far less strict than XML. Perhaps you were thinking of XHTML?

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

My blog: Imperial Deltronics
Re^3: convert whole html-files to xml by Anonymous Monk on Sep 15, 2013 at 07:33 UTC
"Not at all! HTML is far less strict than XML. Perhaps you were thinking of XHTML?"

I was thinking <small. na-nana-boo-boo

Oh look, HTML::Tidy::libXML $xml/$xhtml