wfsp has asked for the wisdom of the Perl Monks concerning the following question:

This snippet, based on the synopsis in the docs, runs ok. Add a doctype declaration and it dies.
#!/usr/bin/perl

use strict;
use warnings;

use HTML::Tidy;

my $contents = do{local $/; <DATA>};
my $tidy = HTML::Tidy->new(
  {config_file => q{c:/perm/tidy/cnf/tidy.cnf}}
) or die qq{new failed: $!\n};

$tidy->ignore(type => TIDY_WARNING);
$tidy->parse(q{foo.html}, $contents)
  or die qq{cant parse};;

for my $message ( $tidy->messages ) {
  print $message->as_string;
}

my $clean = $tidy->clean($contents)
  or die qq{cant clean};

print $clean;

__DATA__
<html>
  <head>
    <title>title</title>
  </head>
  <body>
    <p>body</p>
  </body>
</html>
Adding a doctype:

__DATA__
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd">
<html>
  <head>
    <title>title</title>
  </head>
  <body>
    <p>body</p>
  </body>
</html>

produces:
cant parse at C:\perm\tidy\html_tidy_02.pl line 14, <DATA> line 1.
Has anyone experienced this? Have I made a mistake? Is there a workaround?
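
One thing that could be tried while the cause is unknown is to keep the doctype away from the parser and put it back afterwards. The following is only a sketch of that idea in plain Perl: `split_doctype` is a made-up helper name, the regex assumes the declaration contains no `>` inside its quoted strings, and nothing here has been checked against the failing HTML::Tidy/libtidy combination.

```perl
use strict;
use warnings;

# Split a leading doctype declaration off an HTML string.
# Returns (doctype, rest); doctype is '' when none is found.
sub split_doctype {
    my ($html) = @_;
    if ( $html =~ s/\A(\s*<!DOCTYPE\b[^>]*>\s*)//i ) {
        return ( $1, $html );
    }
    return ( q{}, $html );
}

my $contents = <<'HTML';
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head><title>title</title></head>
<body><p>body</p></body>
</html>
HTML

my ( $doctype, $rest ) = split_doctype($contents);
print "doctype found\n" if length $doctype;
print $rest;

# With HTML::Tidy loaded, $rest would be passed to $tidy->parse and
# $tidy->clean in place of $contents, and $doctype prepended afterwards.
```

Depending on the tidy configuration, the cleaned output may already carry a doctype of its own, so prepending the saved one blindly could duplicate it.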

Re: HTML::Tidy crashes with doctype declaration
by shmem (Chancellor) on Mar 16, 2008 at 00:30 UTC
    Is there a workaround?

    Yes. Use HTML::Parser or such to tidy your HTML.

    Perl is very good at such tasks, and there really is no need to interface a C library to tidy up HTML. On my platform HTML::Tidy doesn't pass the tests due to bugs in libtidy.

    qwurx [shmem] ~/rpms/perl/src/HTML-Tidy-1.08 > make test
    PERL_DL_NONLAZY=1 /usr/bin/perl "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
    t/00-load..............ok
    t/cfg-for-parse........ok
    t/clean-crash..........ok
    t/extra-quote..........ok
    t/ignore-text..........ok
    t/ignore...............ok
    t/levels...............ok
    t/message..............ok
    t/opt-00...............ok
    t/perfect.............. Failed 1/3 subtests
    t/pod-coverage.........ok
    t/pod..................ok
    t/roundtrip............ok
    t/segfault-form........ok
    t/simple...............1/4 Unknown error type: line 2 column 5 - Info: <body> previously mentioned at t/simple.t line 17
    Unknown error type: line 2 column 5 - Info: <body> previously mentioned at t/simple.t line 17
    Unknown error type: line 2 column 5 - Info: <body> previously mentioned at t/simple.t line 17
    t/simple...............ok
    t/too-many-titles......1/3 Unknown error type: line 4 column 9 - Info: <head> previously mentioned at t/too-many-titles.t line 22
    t/too-many-titles......ok
    t/unicode.............. Failed 1/7 subtests
    t/venus................1/3 Unknown error type: line 8 column 2 - Info: <h1> previously mentioned at t/venus.t line 21
    Unknown error type: line 10 column 2 - Info: <h1> previously mentioned at t/venus.t line 21
    Unknown error type: line 11 column 2 - Info: <h1> previously mentioned at t/venus.t line 21
    Unknown error type: line 12 column 2 - Info: <h1> previously mentioned at t/venus.t line 21
    Unknown error type: line 15 column 2 - Info: <h2> previously mentioned at t/venus.t line 21
    Unknown error type: line 17 column 2 - Info: <h4> previously mentioned at t/venus.t line 21
    Unknown error type: line 18 column 2 - Info: <h4> previously mentioned at t/venus.t line 21
    Unknown error type: line 20 column 2 - Info: <h4> previously mentioned at t/venus.t line 21
    Unknown error type: line 25 column 3 - Info: <h4> previously mentioned at t/venus.t line 21
    t/venus................ok
    t/version..............ok
    t/wordwrap.............1/2 Unknown error type: line 1 column 1 - Info: <head> previously mentioned at t/wordwrap.t line 35
    t/wordwrap.............ok
    Test Summary Report
    -------------------
    t/perfect.t (Wstat: 11 Tests: 2 Failed: 0)
      Parse errors: Bad plan. You planned 3 tests but ran 2.
    t/unicode.t (Wstat: 11 Tests: 6 Failed: 0)
      Parse errors: Bad plan. You planned 7 tests but ran 6.
    Files=20, Tests=78, 2 wallclock secs ( 0.13 usr 0.04 sys + 1.19 cusr 0.25 csys = 1.61 CPU)
    Result: FAIL
    Failed 2/20 test programs. 0/78 subtests failed.
    make: *** [test_dynamic] Error 255

    Nice test output. Let's grab the first reported failure.

    qwurx [shmem] ~/rpms/perl/src/HTML-Tidy-1.08 > PERL_DL_NONLAZY=1 /usr/bin/perl "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/perfect.t
    t/perfect...... Failed 1/3 subtests
    Test Summary Report
    -------------------
    t/perfect.t (Wstat: 11 Tests: 2 Failed: 0)
      Parse errors: Bad plan. You planned 3 tests but ran 2.
    Files=1, Tests=2, 0 wallclock secs ( 0.03 usr 0.01 sys + 0.10 cusr 0.02 csys = 0.16 CPU)
    Result: FAIL
    Failed 1/1 test programs. 0/2 subtests failed.
    qwurx [shmem] ~/rpms/perl/src/HTML-Tidy-1.08 >

    Bad plan? 3 tests planned but ran only two? Let's see. Ah, in t/perfect.t I see

    use Test::More tests => 3;

    and then only two tests

    ...
    isa_ok( $tidy, 'HTML::Tidy' );
    ...
    is( scalar @returned, 0, 'Should have no messages' );
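
    A plan that is larger than the number of tests run does not have to be a miscount; it can also mean the script stopped before reaching its last test. The comparison the harness reports can be sketched roughly like this. `plan_vs_run` is a hypothetical helper, the TAP text is made up for illustration, and real TAP has more cases (skips, trailing plans) than this handles.

```perl
use strict;
use warnings;

# Compare the planned test count in TAP output with the tests actually run.
sub plan_vs_run {
    my ($tap) = @_;
    my ($planned) = $tap =~ /^1\.\.(\d+)\s*$/m;    # plan line, e.g. "1..3"
    my $ran = () = $tap =~ /^(?:not )?ok\b/mg;     # count "ok" / "not ok" lines
    return ( $planned, $ran );
}

# Hypothetical TAP from a script that planned 3 tests and stopped after 2.
my $tap = <<'TAP';
1..3
ok 1 - first
ok 2 - second
TAP

my ( $planned, $ran ) = plan_vs_run($tap);
print "planned $planned, ran $ran\n";
print "script may have exited early\n" if $ran < $planned;
```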

    OK, an off-by-one, a common typo. Let's change the 3 to a 2 and run again. Output:

    qwurx [shmem] ~/rpms/perl/src/HTML-Tidy-1.08 > PERL_DL_NONLAZY=1 /usr/bin/perl "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/perfect.t
    t/perfect...... All 2 subtests passed
    Test Summary Report
    -------------------
    Files=1, Tests=2, 0 wallclock secs ( 0.01 usr 0.01 sys + 0.04 cusr 0.00 csys = 0.06 CPU)
    Result: FAIL
    Failed 1/1 test programs. 0/2 subtests failed.
    qwurx [shmem] ~/rpms/perl/src/HTML-Tidy-1.08 >

    Huh? "All 2 subtests passed", yet "Result: FAIL"? What's going on here? Let's try to run the test script without the harness.

    qwurx [shmem] ~/rpms/perl/src/HTML-Tidy-1.08 > perl -Mblib t/perfect.t
    "-T" is on the #! line, it must also be used on the command line at t/perfect.t line 1.

    Ah, OK. I have to pass the -T switch on the command line. Let's do that.

    qwurx [shmem] ~/rpms/perl/src/HTML-Tidy-1.08 > perl -Mblib -T t/perfect.t
    Insecure dependency in require while running with -T switch at t/perfect.t line 5.
    BEGIN failed--compilation aborted at t/perfect.t line 5.
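
    A likely explanation for this message, not verified on this setup, is that -Mblib works out the build directories at run time from the current directory, so the paths it adds to @INC count as tainted under -T, whereas paths given with -I on the command line are accepted. The sketch below builds such a command line. `command_for_test` is a hypothetical helper, and the `#!perl -Tw` shebang passed to it is an assumption about what t/perfect.t contains.

```perl
use strict;
use warnings;

# Hypothetical helper: build a perl command line for a test script,
# copying a taint switch from its #! line and adding build dirs via -I.
sub command_for_test {
    my ( $script, $shebang, @libs ) = @_;
    my @cmd = ('perl');
    push @cmd, '-T' if $shebang =~ /^#!.*perl.*\s-\w*T/;    # e.g. "#!perl -Tw"
    push @cmd, "-I$_" for @libs;
    return ( @cmd, $script );
}

my @cmd = command_for_test( 't/perfect.t', '#!perl -Tw', 'blib/lib', 'blib/arch' );
print join( ' ', @cmd ), "\n";    # prints: perl -T -Iblib/lib -Iblib/arch t/perfect.t
```

    Running the printed command by hand should keep taint mode on while still picking up the freshly built module, without editing the shebang line.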

    WTF? So ExtUtils::Command::MM turns the "insecure dependencies" into secure ones? I won't dig into that any further; I'm interested in what happens with that dratted test script. I remove the -T switch from the shebang line in t/perfect.t. Next run:

    qwurx [shmem] ~/rpms/perl/src/HTML-Tidy-1.08 > perl -Mblib t/perfect.t
    1..2
    ok 1 - use HTML::Tidy;
    ok 2 - The object isa HTML::Tidy
    Segmentation fault

    Segfault? Let's fire up the debugger:

    qwurx [shmem] ~/rpms/perl/src/HTML-Tidy-1.08 > gdb perl
    GNU gdb Red Hat Linux (6.6-16.fc7rh)
    Copyright (C) 2006 Free Software Foundation, Inc.
    ...
    (no debugging symbols found)
    Using host libthread_db library "/lib/libthread_db.so.1".
    (gdb) run -Mblib t/perfect.t
    Starting program: /usr/bin/perl -Mblib t/perfect.t
    (no debugging symbols found)
    ...
    [Thread debugging using libthread_db enabled]
    [New Thread -1208416576 (LWP 29388)]
    (no debugging symbols found)
    (no debugging symbols found)
    1..2
    ok 1 - use HTML::Tidy;
    ok 2 - The object isa HTML::Tidy
    Program received signal SIGSEGV, Segmentation fault.
    [Switching to Thread -1208416576 (LWP 29388)]
    0x00147473 in tidyBufFree () from /usr/lib/libtidy-0.99.so.0
    (gdb) bt
    #0 0x00147473 in tidyBufFree () from /usr/lib/libtidy-0.99.so.0
    #1 0x001162e0 in XS_HTML__Tidy__tidy_messages (my_perl=0x9d9b008, cv=0x9efa564) at Tidy.xs:99
    #2 0x0208833d in Perl_pp_entersub () from /usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/libperl.so
    #3 0x0208179f in Perl_runops_standard () from /usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/libperl.so
    #4 0x0202710e in perl_run () from /usr/lib/perl5/5.8.8/i386-linux-thread-multi/CORE/libperl.so
    #5 0x0804921e in main ()
    (gdb) q
    The program is running. Exit anyway? (y or n) y
    qwurx [shmem] ~/rpms/perl/src/HTML-Tidy-1.08 >

    So the problem is in libtidy, and I won't debug that. But as an interesting side note: did you notice all the steps necessary to arrive at that conclusion? Now tell me that the Perl testing interface is great and doesn't suck. Expect a rant of mine about that any time soon.
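
    In hindsight the "Wstat: 11" in the harness summary already pointed at a crash: Wstat is the raw wait status of the test process, and its low seven bits hold the number of the signal that ended it. Signal 11 is SIGSEGV on Linux. A minimal sketch of decoding such a status, with `decode_wstat` as a made-up helper name:

```perl
use strict;
use warnings;

# Decode a raw wait status such as the harness "Wstat" field
# (same layout as Perl's $? after system() or wait()).
sub decode_wstat {
    my ($wstat) = @_;
    return {
        signal => $wstat & 127,              # signal that ended the process, 0 if none
        core   => ( $wstat & 128 ) ? 1 : 0,  # whether a core dump was flagged
        exit   => $wstat >> 8,               # exit code on a normal exit
    };
}

my $s = decode_wstat(11);
printf "signal=%d core=%d exit=%d\n", @{$s}{qw(signal core exit)};
# prints: signal=11 core=0 exit=0
```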

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: HTML::Tidy crashes with doctype declaration
by rhesa (Vicar) on Mar 15, 2008 at 15:49 UTC
    It works for me with HTML::Tidy v1.06 and an older libtidy "vers 1st December 2004" on CentOS 4.

    It fails with the same error as you with HTML::Tidy v1.08 and libtidy "vers 1 September 2005" on Ubuntu 6.06.

    I haven't been able to convince 1.08 to accept the doctype yet, so I guess the quickest workaround would be a downgrade :-(