JHG has asked for the wisdom of the Perl Monks concerning the following question:

The Perl application I wrote with Komodo on Mac OS X 10.6 reads hundreds of plain .txt files, wraps HTML tags around the text, and prints the aggregated lines, which I paste into a single .htm file. That file is rather long, since it combines hundreds of files, so to speak.

The logic of my code, the outcome, everything is as expected, except that Perl spits out junk characters from time to time:

<p><p style="font-size: 20px; font-weight: bold;">Rhombe de iambe</p> <br>&nbsp;<br />Au cinq &#8730;©toiles ou au tripot<br /> Oncques je ne manque de sortir mon tricot<br /> Ainsi j‚Äôappelle ces carnets o&#8730;&#960; je griffonne<br /> Seul &#8730;† &#8730;™tre seul en ces lieux cacophones<br />

instead of the correct accented letters used in the French language:

<p><p style="font-size: 20px; font-weight: bold;">Rhombe de iambe</p> <br>&nbsp;<br />Au cinq étoiles ou au tripot<br /> Oncques je ne manque de sortir mon tricot<br /> Ainsi j'appelle ces carnets où je griffonne<br /> Seul à être seul en ces lieux cacophones<br />

There is no consistency in when this happens. Accented letters are ubiquitous in every .txt file the text comes from, and when I rerun the same code on the same .txt files, the junk can affect the same portion of text, or a different one, or none at all if I reduce the number of .txt files I start with. It does not start at the beginning of a text file; it can start in the middle, after a number of correctly accented letters. It then affects the rest of that text and the next 4 or 5, after which it stops and the letters are correctly accented again.

Extremely obnoxious HELP please!

Re: Unusual output with french accented letters
by ikegami (Patriarch) on May 17, 2012 at 19:08 UTC
    Can you give a minimal, runnable snippet that demonstrates the problem?
       print '<!--HTML poésieBEGIN-->';

      becomes in Perl output window

      <!--HTML po&#8730;©sieBEGIN-->

      Actually the garbage replacing é is different, namely √© but it shows like &#8730;© when placed between the code tags on this page!! To confuse the issues a little further there are other é before and after this one in my code that come out correctly in the output window!!! Thanks for your time anyway

Re: Unusual output with french accented letters
by locked_user sundialsvc4 (Abbot) on May 17, 2012 at 20:56 UTC

    Does the resulting HTML file specify the character-set that is to be used by the browser when displaying the finished file, and the character set in which the data being presented to it is encoded?   It has been my painful experience that you must be absolutely certain of both of these things ... and then test on every browser you can find.

    Actual examination of the file in a hexadecimal editor might also be a good idea... “never assume...”

Re: Unusual output with french accented letters
by choroba (Cardinal) on May 17, 2012 at 22:58 UTC
    Do you tell Perl your input files are UTF-8 encoded (or whatever encoding you use)? Do you tell Perl to encode the output files in UTF-8? See the open function, the open pragma, or binmode.
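    A minimal sketch of what that could look like, assuming the input files really are UTF-8 (the thread has not confirmed this). The helper name wrap_file and the script name wrap.pl are made up for illustration; the original code was never posted.

```perl
use strict;
use warnings;

# Encode everything printed to STDOUT as UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: read one text file (assumed to be UTF-8) and return
# its lines joined with <br /> tags, roughly like the output in the question.
sub wrap_file {
    my ($path) = @_;
    # The :encoding layer decodes bytes into characters as they are read.
    open(my $in, '<:encoding(UTF-8)', $path)
        or die "Cannot open $path: $!";
    my @lines = <$in>;
    close($in);
    chomp(@lines);
    return join('<br /> ', @lines) . '<br />';
}

# Usage: perl wrap.pl poem1.txt poem2.txt > poems.htm
print wrap_file($_), "\n" for @ARGV;
```

    Redirecting the output straight into the .htm file, as in the usage comment, also takes the editor's output window and the clipboard out of the picture, which may help narrow down where the junk comes from.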
Re: Unusual output with french accented letters
by mbethke (Hermit) on May 18, 2012 at 06:31 UTC

    This is a Mac charset problem. My terminal is UTF-8:

    $ echo -n ‚Äô|recode ..macintosh | xxd
    0000000: e280 99    ...
    $ echo `echo -n ‚Äô|recode ..macintosh`
    ’

    This means that when I take the mangled characters from "Ainsi j’appelle" and convert them to the traditional Mac charset, the resulting byte sequence interpreted as UTF-8 is an apostrophe. So the reverse must be the case for your file: you are assuming an 8-bit Mac charset but the file is actually UTF-8. Either convert the files by hand using recode or iconv, or use Encode::Guess or similar to detect the encoding on the fly.
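    The same round trip can be sketched in Perl with the core Encode module. The function names below are made up for illustration. The last helper is not Encode::Guess: it swaps in a simpler technique, trying a strict UTF-8 decode first and falling back to MacRoman, on the assumption that those are the only two encodings among the files.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Encode text to UTF-8 bytes, then (wrongly) decode those bytes as MacRoman.
# This reproduces the kind of junk shown in the question.
sub misread_as_macroman {
    my ($text) = @_;
    return decode('MacRoman', encode('UTF-8', $text));
}

# The reverse: turn the mangled characters back into the intended text.
sub repair_from_macroman {
    my ($mangled) = @_;
    return decode('UTF-8', encode('MacRoman', $mangled));
}

# Assumption: each file is either valid UTF-8 or MacRoman, nothing else.
# Strict UTF-8 decoding dies on malformed input, so eval catches that case.
sub decode_utf8_or_macroman {
    my ($bytes) = @_;
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return defined $text ? $text : decode('MacRoman', $bytes);
}

binmode(STDOUT, ':encoding(UTF-8)');
# \x{E9} is e-acute; the three escapes in the second line are the
# mangled apostrophe from "Ainsi j...appelle".
print misread_as_macroman("po\x{E9}sie"), "\n";
print repair_from_macroman("j\x{201A}\x{C4}\x{F4}appelle"), "\n";
```

    The fallback works only because a lone MacRoman accented byte such as 0x8E is never valid UTF-8 on its own; it is a heuristic, not proof of the encoding.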

      Thank you for your time, to start with. I went a step further and noticed the following. In my Perl code I have the following HTML tag:

      --HTML poésieBEGIN--

      This piece is part of what I tell Perl to print in the output window so I can copy/paste it into an HTML file. Really straightforward, isn't it!

      But in the Perl output window it already becomes --HTML po√©sieBEGIN--

      where √© stands for the character é. The strangest part is that there are other é before and after in the code and they come out correctly as é !!!

      The use of the module you suggest is beyond my grasp I am afraid, thanks anyway

        But in the PERL output windows already it becomes --HTML po√©sieBEGIN-- where √© stands for the character é. The strangest part is that there are other é before and after in the code and they come out correctly as é !!!

        Hm, strange. I think that calls for a hex editor. Apparently your text editor is hiding some differences from you that your shell chokes on, so you'd have to try and find the byte or bytes in your source that represent the é in each case. Or maybe it's something with your code, but you'd have to show it.
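        If no hex editor is at hand, a few lines of Perl can do the same job. This is only a sketch, and the names non_ascii_report and hexcheck.pl are made up: it reads a file as raw bytes and lists, per line, the non-ASCII bytes in hex.

```perl
use strict;
use warnings;

# Return one report line for each input line that contains non-ASCII bytes,
# showing those bytes in hex. The file is deliberately read as raw bytes.
sub non_ascii_report {
    my ($path) = @_;
    open(my $in, '<:raw', $path) or die "Cannot open $path: $!";
    my @report;
    while (my $line = <$in>) {
        # Collect each run of bytes in the range 0x80-0xFF.
        my @runs = $line =~ /([\x80-\xFF]+)/g;
        next unless @runs;
        my $hex = join ' ', map { unpack('H*', $_) } @runs;
        push @report, "line $.: $hex";
    }
    close($in);
    return @report;
}

# Usage: perl hexcheck.pl yourscript.pl
print "$_\n" for map { non_ascii_report($_) } @ARGV;
```

        An é stored as UTF-8 shows up as c3a9, as MacRoman it is 8e, and as Latin-1 it is e9, so running this over the script itself should reveal whether the é that misbehaves is stored differently from the ones that print correctly.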