elef has asked for the wisdom of the Perl Monks concerning the following question:

Update: original issue solved, now chatting about integrating modules into perl scripts to allow users to run the script without installing modules separately.


Dear fellow monks, this should be a quick one.

I'm trying to use an HTML stripper and character reference converter written by none other than Tom Christiansen, and the HTML stripper portion seems to work fine, but HTML character references are not being converted.
I'm afraid I know too little about hashes and chr to find the bug myself. This thing was written in 1996 so perhaps the problem is caused by some changes that have been made in perl itself since then.
(Note: I know modules are better for this purpose, but I need a solution that will work on other people's computers without installing a module.)

Here's the entire original script.
And here's the bit that doesn't work for me. It looks for an input file named file.html and creates file.txt. I removed all character references but one to keep the snippet short. Please test on files that contain numerical character references and &Aacute; (Á).
#!/usr/bin/perl
use strict;
use warnings;

open (IN, "<:encoding(UTF-8)", "file.html") or die "Can't open file: $!";
open (OUT, ">:encoding(UTF-8)", "file.txt") or die "Can't open file: $!";

while (<IN>) {
    my %entity;
    my $chr;

    #########################################################
    # translate HTML 2.0 entities
    #########################################################
    s{ (
            &              # an entity starts with a semicolon
            (
                \x23\d+    # and is either a pound (#) and numbers
                 |         #   or else
                \w+        # has alphanumunders up to a semi
            )
            ;?             # a semi terminates AS DOES ANYTHING ELSE (XXX)
        )
    } {

        $entity{$2}        # if it's a known entity use that
            ||             #   but otherwise
            $1             # leave what we'd found; NO WARNINGS (XXX)

    }gex;                  # execute replacement -- that's code not a string

    #########################################################
    # but wait! load up the %entity mappings enwrapped in
    # a BEGIN that the last might be first, and only execute
    # once, since we're in a -p "loop"; awk is kinda nice after all.
    #########################################################
    BEGIN {
        %entity = (
            Aacute => chr 193, #capital A, acute accent
        );

        for $chr ( 0 .. 255 ) {
            $entity{ '#' . $chr } = chr $chr;
        }
    }

    print OUT $_;
}

close IN;
close OUT;

Replies are listed 'Best First'.
Re: Troubleshooting a character reference converter script
by Anonymous Monk on Oct 27, 2010 at 18:05 UTC
    while (<IN>) { my %entity; # <--- ...
    I think your problem is that you're clearing the lookup table %entity for every line (within the loop). The BEGIN block runs only once however...

    Set it up outside of the loop instead.
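
    A minimal sketch of what "outside of the loop" could look like, assuming the table is built once at file scope and the substitution is wrapped in a helper. The name decode_line is hypothetical and not part of the original script, and the sample strings stand in for lines read from file.html. One deliberate deviation from the original: it tests with exists rather than ||, so a reference such as &#48; (which maps to the false value "0") is still converted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the lookup table once, at file scope, before any input is processed.
my %entity = (
    Aacute => chr 193,    # capital A, acute accent
);
$entity{ '#' . $_ } = chr $_ for 0 .. 255;

# decode_line() is a hypothetical helper name, used here only for illustration.
sub decode_line {
    my ($line) = @_;
    # Same pattern as the original: & followed by #digits or a name, optional ;
    $line =~ s{ ( & ( \x23\d+ | \w+ ) ;? ) }
              { exists $entity{$2} ? $entity{$2} : $1 }gex;
    return $line;
}

binmode STDOUT, ':encoding(UTF-8)';

# Every sample line is converted, not just the first one.
print decode_line($_), "\n" for 'caf&#233;', '&Aacute;gua', 'caf&#233; again';
```

    In the real script the same two declarations would simply move above the while (<IN>) loop, and the BEGIN block would no longer be needed.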

      *hangs head in shame*

      That's it, thank you.

      The backstory is that the original script didn't have the "my" lines in it - I added them because with strict on, perl complained. I somehow dropped them inside the while loop without thinking about the consequences of redefining the variables in every iteration of the loop.
Re: Troubleshooting a character reference converter script
by Corion (Patriarch) on Oct 27, 2010 at 17:39 UTC

    How does it fail to work for you?

      I was about to write "completely", but I did a bit more testing before I posted and it turns out that it does work on the file's very first line. It converts all the character references that are on the first line, but the rest of the file stays unchanged. If I insert a blank line before the first line, the whole file remains unchanged.
      I added an s/a/b/g; line inside the while loop right after my $chr; as a test, and that replacement works fine on all lines of the file. Odd.

      I'm on Windows XP with ActivePerl 5.10 by the way.
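
      The "first line only" symptom can be reproduced in isolation. The sketch below assumes nothing beyond core Perl. It declares a lexical hash inside a loop body, fills it from a BEGIN block, and records how many keys are visible on each pass. The names @key_counts and $pass are made up for the demonstration.

```perl
use strict;
use warnings;

my @key_counts;    # number of keys visible on each pass through the loop

for my $pass (1 .. 3) {
    my %entity;

    # BEGIN runs once, at compile time, and fills the hash that exists then.
    BEGIN { %entity = ( Aacute => chr 193 ) }

    # The compile-time contents survive only until the first pass ends,
    # because a "my" variable is cleared whenever its scope is left.
    push @key_counts, scalar keys %entity;
}

print "keys per pass: @key_counts\n";    # prints "keys per pass: 1 0 0"
```

      This matches what was observed: references on the first line are converted, and inserting a blank first line uses up the only pass in which the table is populated.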
Re: Troubleshooting a character reference converter script
by ikegami (Patriarch) on Oct 27, 2010 at 19:32 UTC
      From my original post:
      Note: I know modules are better for this purpose, but I need a solution that will work on other people's computers without installing a module.

        I don't believe that because you posted a module you intend to install.

        What is the relevant difference that allows you to install the module you posted, but not HTML::Entities?