elef has asked for the wisdom of the Perl Monks concerning the following question:
The task seems simple, but it's a lot trickier than I expected.
I have a working solution, but it involves decoding character entities with HTML::Entities before running HTML::Strip. As a result, if the text in the HTML file contains something like &lt;this is a tag quoted inside html&gt;, it gets stripped along with the real HTML tags.

Update: I've found a good workaround: just insert

    s/\&gt\;/\&amp\;gt\;/g; s/\&lt\;/\&amp\;lt\;/g;

before print OUT decode_entities($_); to make lt and gt stay character references. Still, I'm interested in your comments/improvements.
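To make the workaround concrete, here is a minimal sketch of the idea in isolation: double-escape the lt and gt references so that decode_entities() turns them back into references rather than literal angle brackets. The helper name protect_and_decode and the sample string are made up for illustration and are not part of the script in this post.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities;

# Hypothetical helper, for illustration only: protect &lt; and &gt;
# by escaping their ampersand, then decode everything else.
sub protect_and_decode {
    my ($line) = @_;
    $line =~ s/&gt;/&amp;gt;/g;    # &gt; becomes &amp;gt;
    $line =~ s/&lt;/&amp;lt;/g;    # &lt; becomes &amp;lt;
    # decode_entities() makes a single pass, so &amp;lt; ends up as
    # the reference &lt; and not as a literal "<".
    return decode_entities($line);
}

# Assumed sample input: one real tag pair, one quoted tag, one other entity.
my $sample = '<B>kontra&lt;this should be left in&gt;</B> A &amp; B';
print protect_and_decode($sample), "\n";
# prints: <B>kontra&lt;this should be left in&gt;</B> A & B
```

After this step the quoted tag is still a pair of character references, so a tag stripper that only removes real markup should have nothing to remove there.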
    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Copy;
    use HTML::Strip;
    use HTML::Entities;

    sub convert_html;

    convert_html("path/to/test.html");

    sub convert_html($){
        # NOTE: $pf contains the path as well as the filename excluding the extension.

        # parse filename
        $_[0] =~ /(.*)\.(.*)/;
        my $pf = $1;
        my $ext = $2;

        # PREPARE FILES BEFORE RUNNING THE TAG STRIPPER
        open (IN, "<:encoding(UTF-8)", "${pf}.${ext}");
        open (OUT, ">:encoding(UTF-8)", "${pf}_htmlmod.${ext}");
        while (<IN>) {
            s/\x{A0}/ /g;          # remove non-breaking spaces
            s/\n//g;               # remove literal line breaks
            s/<\/?p>/\n/ig;        # conserve line breaks ("\/?" because "<p style =...> blabla</p>" is not caught by the normal regex
            s/<br( \/)?>/\n/ig;    # yet more line breaks
            s/\&\#8209;/-/g;
            print OUT decode_entities($_);
            # print OUT $_;    # alternative attempt
        }
        close IN;
        close OUT;
        print "\nline break and nbsp preparation done\n";
        <STDIN>;

        # STRIP TAGS
        # using :encoding(UTF-8) breaks this
        open (IN, "<", "${pf}_htmlmod.${ext}");
        open (OUT, ">", "${pf}.txt");
        {
            my $hs = HTML::Strip->new();
            # my $hs = HTML::Strip->new( decode_entities => 1 );    # alternative attempt
            while (<IN>) {
                my $clean_text = $hs->parse($_);
                print OUT $clean_text;
            }
            close IN;
            close OUT;
            unlink "${pf}_htmlmod.${ext}";
        }
        print "\nhtml conversion done\n";
        <STDIN>;
    }
The test file with a couple of BRK tags in the text:
    <HTML>
    <HEAD>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <!--Filename : PISZ@TRA-DOC-HU-CONCL-C-0371-2003-200506500-06_00-->
    <!-- Feuille de style -->
    <LINK HREF="lex/css/Style_CNC_C_FR.css" REL="stylesheet" TYPE="text/css">
    <LINK HREF="lex/css/Style_CNC_C_HU.css" REL="stylesheet" TYPE="text/css">
    <!-- Titre du document -->
    <TITLE></TITLE>
    </HEAD>
    <BODY>
    <P class="C36Centre">JACOBS</P>
    <P class="C36Centre">FŐTANÁCSNOK INDÍTVÁNYA&lt;BRK&gt;</P>
    <P class="C36Centre">Az ismertetés napja: 2005. november 17.<SUP>1</SUP>(<A HREF="#Footnote1" NAME="Footref1">1</A>) </P>
    <P class="C38Centregrasgrandespacement"><B>C‑371/03. sz. ügy</B></P>
    <P class="C37Centregras"><B>Siegfried Aulinger&lt;BRK&gt;</B></P>
    <P class="C37Centregras"><B>kontra&lt;this should be left in&gt;</B></P>
    <P class="C37Centregras"><B>Bundesrepublik Deutschland</B></P>
    <P class="C71Indicateur"><br></P><BR><BR><BR><BR><P class="C01PointAltN">1.&lt;BRK&gt;   Ebben az ügyben az ‘Oberlandesgericht Köln’ (kölni fellebbviteli bíróság) a Szerb és a Montenegrói Köztársaság, valamint az Európai Gazdasági Közösség közötti kereskedelem megtiltásáról szóló, 1992. június 1‑jei 1432/92/EGK tanácsi rendelet (a továbbiakban: az embargóról szóló rendelet)(<A HREF="#Footnote2" NAME="Footref2">2</A>) értelmezésére vonatkozóan két kérdést terjesztett a Bíróság elé előzetes döntéshozatalra.
    </BODY>
    </HTML>
Re: Converting HTML to txt with HTML::Strip
by wfsp (Abbot) on Oct 03, 2010 at 13:43 UTC
by elef (Friar) on Oct 04, 2010 at 16:08 UTC