I'm having trouble with some characters in an HTML file I'm trying to extract information from. There are some French characters and whatnot, but they work fine; the problem lies in the "em dash", aka &#8212;, aka —, aka "the long dash".
Here's a sample:
"Mr. Bernard Patry (Pierrefonds—Dollard, Lib.)"
After saving the file and opening it again (the saved file seems to have the character stored correctly), Perl spits that line back out as:
"Mr. Bernard Patry (PierrefondsDollard, Lib.)"
I'm using:
use locale;
use POSIX qw(locale_h);
setlocale(LC_CTYPE, "fr_CA.ISO8859-1");
but that doesn't seem to help. So my problem is that my em dashes are disappearing, and I'd very much like to preserve them, or replace them with the HTML numeric entity &#8212;. Any thoughts?
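To make the goal concrete, here is a minimal sketch of the replacement I'm after. It rests on an unconfirmed assumption: that the file is really Windows-1252 rather than ISO-8859-1. ISO-8859-1 has no em dash at all, while Windows-1252 stores it as byte 0x97, so that would at least be consistent with the character vanishing. The helper name em_dash_to_entity is made up for illustration; Encode is a core module in Perl 5.8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode raw bytes using the named encoding, then
# replace every em dash (U+2014) with the numeric entity &#8212;.
sub em_dash_to_entity {
    my ($bytes, $encoding) = @_;
    my $text = decode($encoding, $bytes);   # bytes -> Perl characters
    $text =~ s/\x{2014}/&#8212;/g;          # em dash -> HTML entity
    return $text;
}

# Assumption: the input is Windows-1252, where the em dash is byte 0x97.
my $line = "Mr. Bernard Patry (Pierrefonds\x{97}Dollard, Lib.)";
print em_dash_to_entity($line, 'cp1252'), "\n";
# prints: Mr. Bernard Patry (Pierrefonds&#8212;Dollard, Lib.)
```

If the file turns out to be UTF-8 instead, the same helper should work with 'UTF-8' as the encoding name, since the substitution runs on decoded characters rather than raw bytes.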
I'm using Perl v5.8.5 on Gentoo. Thanks for any hints, premonitions, or Wall forbid: solutions!
Cheers,
Cory.