| Category: | HTML Utility |
| Author/Contact Info | Briac Pilpré - briac@cpan.org |
| Description: | Pyxie (PYX) is an alternative way of representing XML data. The data is represented in a very simple, line-oriented form, with one piece of information per line. Now, I know the module XML::PYX exists, and it even comes with a script called pyxhtml, which does pretty much what this code does. Still, this code should be easy to customize to suit your needs, provided you know how to use HTML::Parser (which is really fun to use, especially version 3). And the really cool thing is that your HTML doesn't have to be a valid XML file! (I wouldn't try to feed it Word 2000 pseudo-HTML though...) |
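#!/usr/bin/perl -w
use strict;
use HTML::Parser ();

# See the PYX format description:
# http://www.xml.com/pub/a/2000/03/15/feature/index.html

my $parser = HTML::Parser->new(
    xml_mode        => 1,
    unbroken_text   => 1,
    ignore_elements => ['style', 'script'],    # CDATA isn't supported

    # Start tag: a "(" line, then for each attribute an "A" line with
    # its name followed by a "-" line with its value.
    start_h => [
        sub {
            my ($tag, $attr) = @_;
            print "($tag\n";
            print "A$_\n-$attr->{$_}\n" foreach keys %{$attr};
        }, "tagname, attr"],

    # End tag: a ")" line.
    end_h => [
        sub {
            print ")" . shift() . "\n";
        }, "tagname"],

    # Text: a "-" line. "dtext" delivers the text with entities decoded.
    text_h => [
        sub {
            my $text = shift;
            $text =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
            $text =~ s/\n/\\n/g;        # keep each text item on one line
            print "-$text\n";
        }, "dtext"],
);

die "usage: $0 file1.html > file1.pyx\n" unless @ARGV;

foreach (@ARGV) {
    # parse_file returns undef when the file cannot be opened
    $parser->parse_file($_) or warn "$0: cannot open $_: $!\n";
    $parser->eof();
}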
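Because every line of the output carries its type in the first character, downstream tools can be very small line filters. Here is a minimal sketch of one, assuming PYX lines shaped like the ones the script above prints. The `pyx_outline` helper and the sample data are hypothetical, made up for illustration; they are not part of XML::PYX or HTML::Parser. It looks only at the "(" and ")" lines and prints an indented outline of the element tree.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build an indented outline of element names from
# a list of PYX lines. Only "(" and ")" lines are examined.
sub pyx_outline {
    my @lines = @_;
    my $depth = 0;
    my @out;
    for my $line (@lines) {
        chomp $line;
        if ($line =~ /^\((.+)/) {       # "(" opens an element
            push @out, ('  ' x $depth) . $1;
            $depth++;
        }
        elsif ($line =~ /^\)/) {        # ")" closes an element
            $depth-- if $depth > 0;
        }
        # "A" and "-" lines (attributes and text) are ignored here.
    }
    return @out;
}

# Made-up sample in the same shape as the script's output.
my @sample = (
    "(html", "(body", "(p", "Aclass", "-intro", "-Hello",
    ")p", ")body", ")html",
);

print "$_\n" for pyx_outline(@sample);
```

Run as is, this prints `html`, then `body` indented by two spaces, then `p` indented by four. To use it on real output, replace `@sample` with lines read from a `.pyx` file.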