I'm having trouble with some characters in an HTML file I'm trying to extract information from. There are some French characters and whatnot, but they work fine; the problem lies in the "em dash", aka &#8212;, aka —, aka "the long dash".

Here's a sample:

"Mr. Bernard Patry (Pierrefonds—Dollard, Lib.)"

After saving the file and opening it again (the saved file seems to have the character stored correctly), Perl spits that line back out as:

"Mr. Bernard Patry (PierrefondsDollard, Lib.)"

I'm using:

use locale;
use POSIX qw(locale_h);
setlocale(LC_CTYPE, "fr_CA.ISO8859-1");

but that doesn't seem to help. So my problem is that my em dashes are disappearing, and I'd very much like to preserve them, or replace them with the numeric HTML entity &#8212;. Any thoughts?

I'm using Perl v5.8.5 on Gentoo. Thanks for any hints, premonitions, or Wall forbid: solutions!

Cheers,
Cory.
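
For what it's worth, here is a minimal sketch of one way to turn em dashes into the &#8212; entity. It rests on an assumption the post doesn't confirm: that the file is really stored as Windows-1252 (where the em dash is the single byte 0x97) or as UTF-8 (bytes E2 80 94), rather than as ISO-8859-1, which has no em dash at all. The helper name em_dash_to_entity and the sample byte strings are made up for illustration; decode comes from the Encode module that ships with Perl 5.8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the original post): decode raw bytes using
# an ASSUMED source encoding, then replace every em dash (U+2014) with the
# numeric HTML entity &#8212;.
sub em_dash_to_entity {
    my ($bytes, $encoding) = @_;
    my $text = decode($encoding, $bytes);   # bytes -> Perl character string
    $text =~ s/\x{2014}/&#8212;/g;          # U+2014 is the em dash
    return $text;
}

# Illustrative inputs only: the same text as raw bytes in two encodings.
my $cp1252_line = "Pierrefonds\x97Dollard";          # 0x97 = em dash in cp1252
my $utf8_line   = "Pierrefonds\xE2\x80\x94Dollard";  # E2 80 94 = em dash in UTF-8

print em_dash_to_entity($cp1252_line, 'cp1252'), "\n";
print em_dash_to_entity($utf8_line,   'UTF-8'),  "\n";
```

If the guess about the encoding is right, both lines should come out as Pierrefonds&#8212;Dollard. For a real file, the same substitution could be applied line by line after opening the handle with a matching layer such as '<:encoding(cp1252)', again assuming that is what the file actually contains.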

In reply to Ye mighty "em dash" by cory2070
