
I've got zillions of lines of stuff that should be HTML but, you know, isn't very clean.

Every line needs to be cleaned up. The problem I'm having is HTML that contains exotic characters like

’

What the hell is that anyway? I don't know, and I don't care. It seems to only have meaning under UTF-8, and the team I am delivering the data to hasn't switched to UTF-8 yet. So the agreed workaround is that we skip formatting that is "UTF-8 only". However, we'd like to quick-convert HTML to text using HTML::Strip for everything else. Is there a way to do this? Or is there a better way to quick-convert HTML to text than HTML::Strip?

Below are tests and code that demonstrate the problem.

The meat is in two functions, stripUtf8Entities and stripUtf8EntitiesBetter, which I call before converting my "html" to text. stripUtf8Entities lets me pass my tests, but only for that one "ugly" special character; I guess it won't work in general. stripUtf8EntitiesBetter doesn't pass the tests, because it's just a stub, but this is the code to change if you have a better idea of how to do this. Test output:

ok 1 - stripUtf8Entities
# before:blah
# after: blah
ok 2 - stripUtf8Entities
# before:&Uuml --
# after: Ü --
ok 3 - stripUtf8Entities
# before:blah -- &rsquo; -- blah
# after: blah --  -- blah
ok 4 - stripUtf8Entities
# before:&Uuml; -- &rsquo; -- blah
# after: Ü --  -- blah
ok 5 - stripUtf8EntitiesBetter
# before:blah
# after: blah
ok 6 - stripUtf8EntitiesBetter
# before:&Uuml --
# after: Ü --
not ok 7 - stripUtf8EntitiesBetter
# before:blah -- &rsquo; -- blah
# after: blah --  -- blah
#   Failed test 'stripUtf8EntitiesBetter
# before:blah -- &rsquo; -- blah
# after: blah --  -- blah'
#   at shopImporter-test.pl line 49.
Wide character in print at /home/hartman/idealo_external_dependencies/current/localperl/lib/5.8.8/Test/Builder.pm line 1192.
#          got: 'blah -- â -- blah'
#     expected: 'blah --  -- blah'
not ok 8 - stripUtf8EntitiesBetter
# before:&Uuml; -- &rsquo; -- blah
# after: Ü --  -- blah
#   Failed test 'stripUtf8EntitiesBetter
# before:&Uuml; -- &rsquo; -- blah
# after: Ü --  -- blah'
#   at shopImporter-test.pl line 49.
Wide character in print at /home/hartman/idealo_external_dependencies/current/localperl/lib/5.8.8/Test/Builder.pm line 1192.
#          got: 'Ã -- â -- blah'
#     expected: 'Ã --  -- blah'
1..8
# Looks like you failed 2 tests of 8.

Code:

$ cat utf8-and-html-entities.pl
#!/usr/angebote/perlroot/bin/perl
use strict;
use warnings;

# use strict;
# use IO::File;
# use Text::CSV_XS;
# use DBI;
# use Time::Local;
# use Time::HiRes;
# use Compress::Zlib;
# use LWP::UserAgent;
#use POSIX qw(locale_h);
use HTML::Strip;
use Test::More qw(no_plan);
use Data::Dumper;

#setlocale(LC_CTYPE, "de_DE.ISO8859-1");

require "../../perl/agentFunc.pl";

my $stringsBeforeAfter = [
               [ 'blah', 'blah' ],
               [ '&Uuml --', 'Ü --'],
               ["blah -- &rsquo; -- blah", "blah --  -- blah"],
               ["&Uuml; -- &rsquo; -- blah", "Ü --  -- blah"],
              ];


foreach my $beforeAfter ( @$stringsBeforeAfter ) {
  my ( $before, $after )  = @$beforeAfter;
  my $transformed =HTML2Text(  stripUtf8Entities( $before ) );
  my $strings = [ [ "before", $before ],
                  [ "after", $after ],
                  [ "transformed", $transformed ]
                ];
  #print "strings: " . Dumper($strings);
  is($transformed, $after, "stripUtf8Entities");
}

foreach my $beforeAfter ( @$stringsBeforeAfter ) {
  my ( $before, $after )  = @$beforeAfter;
  my $transformed =HTML2Text(  stripUtf8EntitiesBetter( $before ) );
  my $strings = [ [ "before", $before ],
                  [ "after", $after ],
                  [ "transformed", $transformed ]
                ];
  #print "strings: " . Dumper($strings);
  is($transformed, $after, "stripUtf8EntitiesBetter");
}

sub HTML2Text {
    my ($changeText) = @_;

    my $htmlStripObject = HTML::Strip->new();

    $changeText = $htmlStripObject->parse($changeText);

    return $changeText;
}

# works, but only for one special character: &rsquo;
# what happens when I hit another char that doesn't translate well out of utf8?
sub stripUtf8Entities {
   my $string = shift || "";

   my $utf8Entities = ["&rsquo;"];

   foreach my $utf8Entity ( @$utf8Entities ) {
     # \Q...\E quotes regex metacharacters so the entity is matched literally
     $string =~ s/\Q$utf8Entity\E//g;
   }

   return $string;
}

#just a stub -- is there a better, more general way to do this?
sub stripUtf8EntitiesBetter {
   my $string = shift || "";
   return $string;

}
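
A minimal sketch of one direction the stub might take, assuming HTML::Entities is installed and that "UTF-8 only" means "decodes to a code point above 0xFF", i.e. outside ISO-8859-1. The function name strip_non_latin1_entities is hypothetical and not part of the code above. It decodes each entity on its own with decode_entities and deletes it only when the result is a single character beyond Latin-1; everything else is left for HTML::Strip to handle.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Hypothetical helper (an assumption, not from the original post):
# remove every character reference that decodes to a code point
# above 0xFF, and keep all other text and entities unchanged.
sub strip_non_latin1_entities {
    my ($string) = @_;
    return '' unless defined $string;

    # Matches named (&rsquo;), decimal (&#8217;) and hex (&#x2019;)
    # references that are terminated by a semicolon.
    $string =~ s{(&#?\w+;)}{
        my $entity  = $1;
        my $decoded = $entity;
        decode_entities($decoded);    # void context: decodes in place
        # Unknown entities stay undecoded (length > 1) and are kept.
        (length($decoded) == 1 && ord($decoded) > 0xFF) ? '' : $entity;
    }ge;

    return $string;
}

print strip_non_latin1_entities("&Uuml; -- &rsquo; -- blah"), "\n";
```

With the inputs from the tests above, "blah -- &rsquo; -- blah" comes back as "blah --  -- blah", while &Uuml; is left in place because Ü (0xDC) fits in ISO-8859-1. References without a trailing semicolon, such as the '&Uuml --' test case, are not touched by this sketch.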

In reply to HTML::Strip and UTF8 -- is there some way I can just skip all the "UTF8 only" entities? by tphyahoo
