Before you start any project, please make sure that you are using the UTF-8 encoding. Everyone I talk to says that UTF-8 is close to a de facto standard in coding, so please use it from the very beginning of your project. If the encoding you start with causes problems with your incoming data, you will more than likely find them at the beginning of your project, not when you are bogged down in the middle of it like I am.

I did not know to do this when I first started writing the code and data for my project, and now it seems impossible for me to convert to UTF-8. All of my code is written in and around an unknown encoding. This probably makes my code less portable.

Do yourself the favor and start with UTF-8, so that you do not have to muck around with converting later, 1,500+ files into your project, when you hit a character that will not display correctly no matter how hard you try to fix it.

This was written after a very bad time trying and failing to deal with this, so please be forgiving.

Have a cookie and a very nice day!
Lady Aleena

Re: Save yourself, start all projets with UTF-8 encoding
by grantm (Parson) on Apr 01, 2011 at 07:51 UTC

    While I agree with the sentiment that it seldom makes sense to use an encoding other than UTF-8, I'm struggling with this bit:

    All of my code is written in and around an unknown encoding.

    What does that even mean? Surely 5-10 minutes with a hex dump tool (or Perl's ord() function for that matter) will tell you what encoding you're using.
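
    As a rough illustration of that kind of inspection, here is a minimal sketch in Python (rather than Perl) that lists the distinct non-ASCII byte values in some data and reports which candidate encodings decode it without error. The function name `describe_non_ascii` and the candidate list are assumptions made for this example, not anything from the thread. Note that a clean decode only rules encodings out: Latin-1 accepts every byte sequence, so it will always "succeed".

```python
def describe_non_ascii(data: bytes, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (hex values of distinct non-ASCII bytes, {encoding: decodes cleanly?}).

    Hypothetical helper for illustration; the candidate encodings are a guess
    at what is likely on a Western-European Windows setup.
    """
    # Bytes >= 0x80 are the only ones where these encodings disagree.
    high = sorted({b for b in data if b >= 0x80})

    decodes = {}
    for name in candidates:
        try:
            data.decode(name)
            decodes[name] = True
        except UnicodeDecodeError:
            decodes[name] = False

    return [f"{b:02x}" for b in high], decodes


if __name__ == "__main__":
    sample = "résumé".encode("cp1252")  # stand-in for bytes read from a real file
    print(describe_non_ascii(sample))
```

    If the data is not valid UTF-8 but decodes as Windows-1252, that is a strong hint rather than proof; looking up a few of the reported byte values in each code page and checking that the resulting characters make sense settles it.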

    Using UTF-8 is not a panacea but it's usually an excellent first step.

      Using UTF-8 is not a panacea but it's usually an excellent first step.

      Like a horny horse is a unicorn, using UTF-8 is a panacea

Re: Save yourself, start all projets with UTF-8 encoding
by Anonymous Monk on Apr 01, 2011 at 08:06 UTC
    This was written after a very bad time trying and failing to deal with this, so please be forgiving.

    Eyel ink to tat!

    $ lwp-dump http://perlmonks.org/?node_id=896777
    HTTP/1.1 200 OK
    Connection: Keep-Alive
    Date: Fri, 01 Apr 2011 07:52:09 GMT
    Server: Apache/2.2.17
    Content-Type: text/html; charset=Windows-1252
    Keep-Alive: timeout=5, max=100
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">\r
    <!-- Took this out for IE6ites "http://www.w3.org/TR/REC-html40/loose.dtd" -->\r
    <html lang="en">\r
    <head>
    <title>Save yourself, start all projets with UTF-8 encoding</title>
    \40\40\40\40
    <!-- Theme : Web safe blue PerlMonks Theme -->
    <link rel="stylesheet" href="/css/common.css" type="text/css" />
    <link rel="stylesheet" href="?node_id=204962" type="text/css" />
    <!-- No CSS Link in User Settings -->
    <!-- No CSS Data in User S...
    (+ 22965 more bytes not shown)
Re: Save yourself, start all projets with UTF-8 encoding
by JavaFan (Canon) on Apr 04, 2011 at 20:35 UTC
    Before you start any project, please make sure that you are using the UTF-8 encoding.
    What does that mean? I cannot figure out from your posting whether you mean the source code should be in UTF-8 format, all internal strings should be in UTF-8 format, all external data should be in UTF-8 format, or something else.

      Source code should be printable ASCII. Which is, of course, a subset of UTF-8.

      It should be printable ASCII because not everyone who wants or needs to hack on it will have font support for weirdo characters. YOU might not have support for weirdo characters all the time - if, for example, it all goes wrong and you have to fix stuff using the crappy ssh client on your phone. If you need non-ASCII text in your application, then it should be corralled into language-specific ghettoes: templates or resource files.

      In fact, even if your application only needs to work in ASCII-compatible languages (the only two I can think of are English and Latin) then ideally text strings will still live in templates and resource files, for two reasons. First, separation of concerns. Second, it'll let you more easily add other languages later.
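
      One way to enforce a "printable ASCII only" rule for source files is a small check that reports offending lines. The sketch below is in Python and is only an illustration: the name `non_ascii_lines` is made up for this example, and allowing tabs alongside the printable range 0x20-0x7E is an assumption about what "printable" should mean here.

```python
def non_ascii_lines(text: str):
    """Yield (line_number, line) for lines with characters outside printable ASCII.

    Hypothetical helper; tab is allowed in addition to 0x20-0x7E (an assumption).
    """
    allowed = set(map(chr, range(0x20, 0x7F))) | {"\t"}
    for number, line in enumerate(text.splitlines(), start=1):
        if any(ch not in allowed for ch in line):
            yield number, line


if __name__ == "__main__":
    source = "print 'ok';\nprint 'résumé';\n"  # stand-in for a source file's contents
    for number, line in non_ascii_lines(source):
        print(f"line {number}: {line}")
```

      Run over a source tree, a check like this shows exactly which strings would need to move into templates or resource files.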

        even if your application only needs to work in ASCII-compatible languages (the only two I can think of are English and Latin)
        Neither English nor Latin is ASCII-compatible. You cannot write English correctly in ASCII. Period!

        You don’t have proper quotes “like these” — nor distinct em dashes like those. You can’t write ± a few minutes, or 5¢, or ℅ General Delivery, let alone πr². You can’t even write 5÷2, or ©2011. What about 5 o’clock? And don’t get me started with things like jalapeños or résumés.

        As for Latin, consider this bit of Latin nested in English:

        Etymology: < Latin domināt‑ participial stem of dominārī to bear rule, govern, lord it, < domin‐us lord, master: compare French dominer.

        See what I mean? The only English you can write in ASCII looks like bumpkin‐English, a word which I’ve self‐censored.

      JavaFan, I mean that every file that can be encoded as UTF-8 should be encoded as UTF-8. That way you don't have to rewrite everything because of conversion issues. I have so many non-UTF-8 files that converting them by hand won't fly. However, because my UTF-8 files don't talk well to my non-UTF-8 files, I have to convert everything at once. I also can't figure out why UTF-8 scripts can't find files on my computer. If I had initially written my code and data files in UTF-8, I would have found these problems ages ago. Since I didn't, I am only finding them now, when my code runs very deep. I have tried to trace the problems, and I can't find where they are.
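
      For a bulk conversion like the one described, a script is usually safer than editing by hand. The following is a minimal sketch in Python, under loudly stated assumptions: the source encoding is guessed to be Windows-1252 (the thread never establishes what the files actually use), and the function name `convert_tree`, the suffix list, and the `dry_run` flag are all hypothetical. Files that are already valid UTF-8 (which includes pure ASCII) are left untouched.

```python
from pathlib import Path


def is_utf8(data: bytes) -> bool:
    """True if the bytes are valid UTF-8 (pure ASCII counts as valid)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_tree(root, source_encoding="cp1252",
                 suffixes=(".pl", ".pm", ".txt"), dry_run=True):
    """Re-encode non-UTF-8 files under root as UTF-8; return the affected paths.

    source_encoding is an ASSUMPTION and must be confirmed for the real files.
    With dry_run=True nothing is written, so the list can be reviewed first.
    """
    converted = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        data = path.read_bytes()
        if is_utf8(data):
            continue  # already UTF-8, nothing to do
        # Raises UnicodeDecodeError if the guess is wrong for this file.
        text = data.decode(source_encoding)
        if not dry_run:
            path.write_bytes(text.encode("utf-8"))
        converted.append(path)
    return converted
```

      Working on a copy of the project, or under version control, and doing a dry run first keeps a wrong encoding guess from turning into permanent damage. This only re-encodes file contents; it does not address file names, which may be a separate issue behind scripts failing to find files.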

      Have a cookie and a very nice day!
      Lady Aleena
Re: Save yourself, start all projets with UTF-8 encoding
by motokitn (Sexton) on Apr 06, 2011 at 03:07 UTC
    Agreed - at the very least make sure your encoding is consistent! My current project involves parsing XML files that are pure trash, from random encodings to horribly abused CDATA sections, and the worst bugs seem to be related to encoding. So don't just do yourself a favor; do the poor slob that follows you a favor too!
Re: Save yourself, start all projets with UTF-8 encoding
by planetscape (Chancellor) on Feb 25, 2012 at 22:31 UTC