Converting HTML to txt with HTML::Strip

elef has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I am asking for your help in converting HTML files to UTF-8 txt.

The task seems simple, but it's a lot trickier than I expected.

I have a working solution, but it involves decoding character entities with HTML::Entities before running HTML::Strip. As a result of that, if the text in the HTML file contains something like <this is a tag quoted inside html>, it gets stripped along with the real HTML tags.
I tried decoding the character entities later, when the stripper is run (see lines I commented out). In that case, I get incorrect character conversions (eacute and uuml) and "wide character in print" error messages. I could fix the problem by introducing some sort of a workaround into my original solution (say, tell HTML::Entities to ignore < and >, although I can't find an easy way to do it), but I'm more interested in what the "proper" solution is.
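For context on the "wide character in print" message: Perl emits it when a string containing a code point above 255 (such as the Ő that &#336; decodes to) is printed to a handle with no encoding layer. The sketch below uses only core modules and an in-memory handle; the variable names are illustrative and it is not meant as a diagnosis of the exact failure in the script.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A decoded character string: "F" followed by U+0150,
# the character that the entity &#336; stands for.
my $chars = "F\x{150}";

# Encoding explicitly turns the characters into UTF-8 bytes.
my $bytes = encode('UTF-8', $chars);

# Printing through an :encoding(UTF-8) layer does the same conversion
# on output, so no "wide character" warning is raised.
my $buf = '';
open my $fh, '>:encoding(UTF-8)', \$buf or die "cannot open in-memory handle: $!";
print {$fh} $chars;
close $fh;

printf "%d characters, %d bytes\n", length $chars, length $bytes;
```

The practical consequence is that decoded text should only be printed to handles opened with an encoding layer, and a file should be read back with the same layer it was written with.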

Update: I've found a good workaround: just insert

        s/\&gt\;/\&amp\;gt\;/g;
        s/\&lt\;/\&amp\;lt\;/g;

before print OUT decode_entities($_); to make lt and gt stay character references. Still, I'm interested in your comments/improvements.
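The same workaround can be written as a small helper. This is only a sketch: the name decode_keep_angle_brackets is made up for illustration, and it assumes HTML::Entities is installed. It re-escapes the ampersand of &lt; and &gt; so that decode_entities, which makes a single pass, leaves them as character references.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Hypothetical helper: decode every entity except &lt; and &gt;.
sub decode_keep_angle_brackets {
    my ($html) = @_;

    # Turn "&lt;" into "&amp;lt;" (same for gt); after one decoding
    # pass this comes back out as the literal text "&lt;".
    $html =~ s/&(lt|gt);/&amp;$1;/g;

    return decode_entities($html);
}

my $line = 'kontra&lt;this should be left in&gt; &uuml;gy';
my $out  = decode_keep_angle_brackets($line);

binmode STDOUT, ':encoding(UTF-8)';
print "$out\n";
```

Since the references survive this step, they still have to be decoded once more after the tags have been stripped.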

Here's my code, it's in a sub as it's part of a larger project (obviously, fill in path/to/test.html if you want to run the script):

#!/usr/bin/perl
use strict;
use warnings;
use File::Copy;

use HTML::Strip;
use HTML::Entities;
sub convert_html;

convert_html("path/to/test.html");


sub convert_html($){
    # NOTE: $pf contains the path as well as the filename excluding the extension.

# parse filename
    $_[0] =~ /(.*)\.(.*)/;
    my $pf = $1;
    my $ext = $2;

# PREPARE FILES BEFORE RUNNING THE TAG STRIPPER
    open (IN, "<:encoding(UTF-8)", "${pf}.${ext}");
    open (OUT, ">:encoding(UTF-8)", "${pf}_htmlmod.${ext}");

    while (<IN>) {
        s/\x{A0}/ /g;                # remove non-breaking spaces
        s/\n//g;                    # remove literal line breaks
        s/<\/?p>/\n/ig;                # conserve line breaks ("\/?" because "<p style =...> blabla</p>" is not caught by the normal regex)
        s/<br( \/)?>/\n/ig;            # yet more line breaks
        s/\&\#8209;/-/g;
        print OUT decode_entities($_);
        # print OUT $_;                # alternative attempt
    }
    close IN;
    close OUT;
print "\nline break and nbsp preparation done\n";
<STDIN>;


# STRIP TAGS

    # using :encoding(UTF-8) breaks this
    open (IN, "<", "${pf}_htmlmod.${ext}");
    open (OUT, ">", "${pf}.txt");
    {
        my $hs = HTML::Strip->new();
        # my $hs = HTML::Strip->new( decode_entities => 1 );    # alternative attempt

        while (<IN>) {
            my $clean_text = $hs->parse($_);
            print OUT $clean_text;
        }

        close IN;
        close OUT;
        unlink "${pf}_htmlmod.${ext}";
    }
print "\nhtml conversion done\n";
<STDIN>;
}

The test file with a couple of BRK tags in the text:

<HTML>
   <HEAD>
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

      <!--Filename : PISZ@TRA-DOC-HU-CONCL-C-0371-2003-200506500-06_00-->
      <!-- Feuille de style -->
      <LINK HREF="lex/css/Style_CNC_C_FR.css" REL="stylesheet" TYPE="text/css">
      <LINK HREF="lex/css/Style_CNC_C_HU.css" REL="stylesheet" TYPE="text/css">
      <!-- Titre du document -->
      <TITLE></TITLE>
   </HEAD>
   <BODY>
      <P class="C36Centre">JACOBS</P>
      <P class="C36Centre">F&#336;TAN&Aacute;CSNOK IND&Iacute;TV&Aacut
+e;NYA&lt;BRK&gt;</P>
      <P class="C36Centre">Az ismertet&eacute;s napja: 2005.&nbsp;nove
+mber&nbsp;17.<SUP>1</SUP>(<A HREF="#Footnote1" NAME="Footref1">1</A>)
      </P>
      <P class="C38Centregrasgrandespacement"><B>C&#8209;371/03.&nbsp;
+sz.&nbsp;&uuml;gy</B></P>
      <P class="C37Centregras"><B>Siegfried Aulinger&lt;BRK&gt;</B></P
+>
      <P class="C37Centregras"><B>kontra&lt;this should be left in&gt;
+</B></P>
      <P class="C37Centregras"><B>Bundesrepublik Deutschland</B></P>
      <P class="C71Indicateur"><br></P><BR><BR><BR><BR><P class="C01PointAltN">1.&lt;BRK&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Ebben az &uuml;gyben az &#8216;Oberlandesgericht K&ouml;ln&#8217; (k&ouml;lni fellebbviteli b&iacute;r&oacute;s&aacute;g) a Szerb &eacute;s a Montenegr&oacute;i K&ouml;zt&aacute;rsas&aacute;g, valamint az
         Eur&oacute;pai Gazdas&aacute;gi K&ouml;z&ouml;ss&eacute;g k&ouml;z&ouml;tti kereskedelem megtilt&aacute;s&aacute;r&oacute;l sz&oacute;l&oacute;, 1992. j&uacute;nius 1&#8209;jei 1432/92/EGK tan&aacute;csi rendelet (a tov&aacute;bbiakban:
         az embarg&oacute;r&oacute;l sz&oacute;l&oacute; rendelet)(<A HREF="#Footnote2" NAME="Footref2">2</A>) &eacute;rtelmez&eacute;s&eacute;re vonatkoz&oacute;an k&eacute;t k&eacute;rd&eacute;st terjesztett a B&iacute;r&oacute;s&aacute;g el&eacute; el&#337;zetes d&ouml;nt&eacute;shozatalra.

   </BODY>
</HTML>

Re: Converting HTML to txt with HTML::Strip by wfsp (Abbot) on Oct 03, 2010 at 13:43 UTC
This uses HTML::TokeParser::Simple (there are many other parsers) and may help get you started. It preserves your `<BRK>` 'tags'; is that what you were after?

#! /usr/bin/perl
use warnings;
use strict;
use HTML::Entities;
use HTML::TokeParser::Simple;

my $p = HTML::TokeParser::Simple->new(
  q{monk.html},
) or die qq{cant parse HTML};

open my $fh_out, q{>:utf8}, q{out.txt}
  or die qq{cant open file to write};

while (my $t = $p->get_token){
  if ($t->is_end_tag(q{p}) or $t->is_tag(q{br})){
    print $fh_out qq{\n};
  }
  elsif ($t->is_text){
    my $out = $t->as_is;
    for ($out){
      s/^\s+//;
      s/\s+$//;
    }
    next unless $out;
    print $fh_out decode_entities($out);
  }
}

Output (long lines snipped):

JACOBS
FŐTANÁCSNOK INDÍTVÁNYA<BRK>
Az ismertetés napja: 2005. november 17.1(1)
C‑371/03. sz. ügy
Siegfried Aulinger<BRK>
kontra<this should be left in>
Bundesrepublik Deutschland
1.<BRK> Ebben az ügyben az...
Európai Gazdasági Közösség közötti...
az embargóról szóló rendelet)(2)...

Some numeric entities appear here (in the browser), e.g. `Ő`; these aren't in the file.
Re^2: Converting HTML to txt with HTML::Strip by elef (Friar) on Oct 04, 2010 at 16:08 UTC
Well, yes, the BRK tags should be conserved, with the lt and gt character references converted to < and > (everything that's "in the text", i.e. everything that isn't part of the HTML markup, should stay in). Frankly, most of your actual code went right over my head. I'm pretty new to Perl and programming in general.

I'm not sure what you mean about the numerical entities not being in the file. They are in the original HTML file and should be converted to the appropriate characters, e.g. 336 is the accented letter Ő.

Either way, I now have a solution I'm happy with (the workaround I posted). It's not elegant, but it does everything I want it to, so I think I'll stick with it.

By the way, it's pretty surprising that there seems to be no foolproof HTML->txt converter module that would let you just provide a path to an HTML file and spit out a UTF-8 txt with the right line breaks, all the character entities decoded, etc. I.e. instead of the 20 or so lines you and I posted, it should be

#! /usr/bin/perl
use warnings;
use strict;
use HTML::Convert;
HTML::Convert(file.html);

... and you'd get file.txt created in the same folder.
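As a rough illustration of the one-call interface wished for here, the token loop from the reply above can be wrapped in a sub. This is a sketch under assumptions: html_to_text is a made-up name, not an existing module function; it takes already-decoded HTML text rather than a file path; and it relies on HTML::TokeParser::Simple and HTML::Entities being installed. Unlike the reply's version, it does not trim whitespace around text chunks.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);
use HTML::TokeParser::Simple;

# Hypothetical helper: HTML text in, plain text out.
sub html_to_text {
    my ($html) = @_;

    my $p = HTML::TokeParser::Simple->new(string => $html)
        or die "cannot parse HTML";

    my $text = '';
    while (my $t = $p->get_token) {
        if ($t->is_end_tag('p') or $t->is_tag('br')) {
            $text .= "\n";                         # paragraph or line break
        }
        elsif ($t->is_text) {
            $text .= decode_entities($t->as_is);   # decode only real text
        }
    }
    return $text;
}

my $html = '<P><B>kontra&lt;this should be left in&gt;</B></P>'
         . '<P>&uuml;gy<br>v&eacute;ge</P>';
my $text = html_to_text($html);

binmode STDOUT, ':encoding(UTF-8)';
print $text;
```

Because entities are decoded only after the parser has separated markup from text, a quoted tag such as &lt;BRK&gt; ends up as literal text and is never mistaken for markup.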