cory2070 has asked for the wisdom of the Perl Monks concerning the following question:

I'm having trouble with some characters in an HTML file I'm trying to extract information from. There are some French characters and whatnot, but they work fine; the problem lies in the "em dash" aka & #8212; aka — aka "the long dash".

Here's a sample: here

"Mr. Bernard Patry (Pierrefonds—Dollard, Lib.)"

After saving the file, and opening it again (the saved file seems to have the character stored correctly), perl spits that line back out as:

"Mr. Bernard Patry (PierrefondsDollard, Lib.)"

I'm using:
use locale;
use POSIX qw(locale_h);
setlocale(LC_CTYPE, "fr_CA.ISO8859-1");

but that doesn't seem to help. So my problem is that my em dashes are disappearing, and I'd very much like to preserve them, or replace them with the HTML numeric entity & #8212;. Any thoughts?

I'm using perl v5.8.5 on Gentoo. Thanks for any hints, premonitions, or Wall forbid: solutions!

Cheers,
Cory.

Re: Ye mighty "em dash"
by dave0 (Friar) on Jun 02, 2005 at 04:35 UTC
    When you say "perl spits that line back out as", do you mean "prints to your terminal as"? Perhaps Perl's printing the character, but your terminal cannot display it.

    If I take this page and save it as emdash.html, and run:

    open(FOO, '<emdash.html') or die $!;
    while (<FOO>) {
        if (/Pierrefonds/) {
            print;
            print join ' ', map { ord } split //;
            print "\n";
        }
    }

    both lines containing Bernard Patry's riding name appear in my xterm as "PierrefondsDollard". However, looking at the values of each character printed below each line, I can see that the first one contains an extra unprinted character, decimal value 151. That's the em dash.

    The fun part is that the em dash character of 151 (0x97) isn't actually in ISO-8859-1. It's from the Windows Latin 1 character set (Windows-1252, also known as CP1252), which isn't directly compatible with ISO-8859-1: the two differ in the 0x80-0x9F range, where Windows-1252 places printable characters such as the em dash. This could explain why it doesn't display correctly in your (or at least, my) terminal. See http://www.cs.tut.fi/~jkorpela/www/windows-chars.html for more details.
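
To make the character-set mismatch concrete, here is a minimal sketch using the core Encode module. It assumes the file really is Windows-1252, as the value 151 suggests; the variable names are illustrative only. It decodes the same byte under both encodings and prints the resulting code points.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The single byte found in the saved HTML file (decimal 151).
my $byte = "\x97";

# Interpreted as Windows-1252, the byte is U+2014 (EM DASH).
my $as_cp1252 = decode('cp1252', $byte);

# Interpreted as ISO-8859-1, the same byte is U+0097, a C1 control
# character with no visible glyph, which is why a terminal shows nothing.
my $as_latin1 = decode('iso-8859-1', $byte);

printf "cp1252:     U+%04X\n", ord $as_cp1252;
printf "iso-8859-1: U+%04X\n", ord $as_latin1;
```

If the data is meant to be processed as characters rather than bytes, one option would be to decode the whole input as cp1252 once on the way in, instead of treating it as ISO-8859-1.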

      Thanks Dave, very helpful.

      You're right, my terminal can't display char 151. It looks like my best (err... easiest) workaround is to write a regex to replace all em dashes with & #8212;. Here's what I hacked together:

      $html=~s/\x97/\& #8212;/g;

      Note: there shouldn't be a space between & and #, but I added it so it would display correctly. Here are some more encodings if anyone is looking to translate any other odd characters.

      Cheers!
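
For anyone wanting to handle more than the em dash, the substitution above could be extended along these lines. This is only a sketch: the helper name cp1252_to_entities and the choice of characters in the table are assumptions, and the table covers just a few common Windows-1252 punctuation bytes rather than the whole 0x80-0x9F range.

```perl
use strict;
use warnings;

# Hypothetical lookup table: a few Windows-1252 bytes and the HTML
# numeric entities of the characters they stand for.
my %cp1252_entity = (
    "\x85" => '&#8230;',   # horizontal ellipsis
    "\x91" => '&#8216;',   # left single quotation mark
    "\x92" => '&#8217;',   # right single quotation mark
    "\x93" => '&#8220;',   # left double quotation mark
    "\x94" => '&#8221;',   # right double quotation mark
    "\x96" => '&#8211;',   # en dash
    "\x97" => '&#8212;',   # em dash
);

# Hypothetical helper: expects raw bytes (not a decoded character string)
# and replaces each byte listed in the table with its entity. Bytes in
# 0x80-0x9F that are not in the table are left untouched.
sub cp1252_to_entities {
    my ($html) = @_;
    $html =~ s/([\x80-\x9F])/exists $cp1252_entity{$1} ? $cp1252_entity{$1} : $1/ge;
    return $html;
}

my $line = "Mr. Bernard Patry (Pierrefonds\x97Dollard, Lib.)";
print cp1252_to_entities($line), "\n";
```

A table lookup keeps the mapping explicit and easy to extend, at the cost of having to list each byte by hand.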
Re: Ye mighty "em dash"
by tlm (Prior) on Jun 02, 2005 at 06:13 UTC

    As a minor (and Perl-free) side note, &mdash; works in place of &#8212;, and is far more readable.

    the lowliest monk