angelixd has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone. I'm doing some translation of LaTeX for the web, and am writing a small translator to turn a (very small) set of symbols into HTML entities. We want our database to store HTML entities instead of Unicode, so I'm doing this in ASCII. Here's a copy of my translation table:
my %allowed_text_code = (
    '$\\alpha$' => '&#945;',
    '$\\beta$'  => '&#946;',
    '$\\gamma$' => '&#947;',
    '$\\delta$' => '&#948;',
    '$\\theta$' => '&#952;',
    '$\\pi$'    => '&#960;',
    '\\degrees' => '&#176;',
);
and the regular expression I'm using later on:
foreach $tex_key (keys %allowed_text_code) {
    $text =~ s/\Q$tex_key\E/$allowed_text_code{$tex_key}/g;
    print $text . "\n";
}
I've only tested the '$\\pi$' tag, and whenever I look at the resulting file, I'm getting a carriage-return character and a script f ('¶ƒ' on OS X, in TextWrangler). On the web, I'm getting '¦Ä'. Does anyone have any idea what the encoding issue is? Ideally, I would like the output file to be in strict ASCII so that this problem doesn't arise. Or do I need to escape my values in my regular expression? What am I missing here?

Replies are listed 'Best First'.
Re: character encoding ambiguities when performing regexps with html entities
by mwah (Hermit) on Sep 24, 2007 at 17:58 UTC
    You don't have to escape the \\ if you don't put it into ".."-quotes (except \'). Just by removing the escapes, your code will work fine. A workable example:
    use strict;
    use warnings;

    my $text = q'
    \Start
    We have $\alpha$-helical to $\beta$-sheet proteins and stuff.
    The $\beta$-sheet structures relate to $\pi$ by several \degrees.
    ';

    my %allowed_text_code = (
        '$\alpha$' => '&#945;',
        '$\beta$'  => '&#946;',
        '$\gamma$' => '&#947;',
        '$\delta$' => '&#948;',
        '$\theta$' => '&#952;',
        '$\pi$'    => '&#960;',
        '\degrees' => '&#176;',
    );

    foreach my $tex_key (keys %allowed_text_code) {
        $text =~ s/\Q$tex_key\E/$allowed_text_code{$tex_key}/g;
    }
    print $text . "\n";

    This would not hold if your TeX source doesn't look like I expected ;-)

    Regards
    mwa

      You don't have to escape the \\ if you don't put it into ".."'quotes (except \'),

      But there's no harm in doing so. In fact, I always escape \ in single quotes to avoid accidentally writing something like

      $path = '\\server\share'; # XXX WRONG: gives \server\share, not \\server\share

      Your change produces no difference whatsoever.

        ikegami: $path = '\\server\share'; # XXX WRONG

        Correct; I'd better have written that \\ and \' *can* be escaped in single quotes.

        BTW, I'd rather guess that the OP's problem comes entirely from a UTF-x to ISO-y-z (or something else) conversion, but he didn't hint at this or give input data.

        Regards
        mwa
Re: character encoding ambiguities when performing regexps with html entities
by ikegami (Patriarch) on Sep 24, 2007 at 18:34 UTC

    What happens to $text afterwards?

    As you can see if you view the produced HTML doc in your web browser, the bit you provided works fine.

    my %allowed_text_code = (
        '$\\alpha$' => '&#945;',
        '$\\beta$'  => '&#946;',
        '$\\gamma$' => '&#947;',
        '$\\delta$' => '&#948;',
        '$\\theta$' => '&#952;',
        '$\\pi$'    => '&#960;',
        '\\degrees' => '&#176;',
    );

    my $text = join '', keys %allowed_text_code;

    foreach my $tex_key (keys %allowed_text_code) {
        $text =~ s/\Q$tex_key\E/$allowed_text_code{$tex_key}/g;
    }

    open(my $fh, '>', 'temp.html') or die;
    print $fh ("$text\n");

      So, thankfully, it turns out that the problem I was trying to solve was still causing me grief. What happens to $text is that it gets dumped to another text file that describes the LaTeX sans comments, extraneous spacing, etc. That file is then searched as plain text for web display, which is why I wanted to substitute HTML entities for the LaTeX tags. One of my coders was using straight Unicode in the .tex files, and I felt that was a bad idea (mostly since we want to keep it simple), so that's why I'm doing this simple workaround.

      On a related note, is there a regular expression I can write that detects non-ASCII characters? I really don't want to hand-check a few tens of thousands of lines of LaTeX code...

        is there a regular expression I can write that detects non-ascii characters?

        Here are a couple easy ones:

        /[^\x00-\x7f]/
        /[^[:ascii:]]/
        They both work whether or not the string happens to have its "utf8 flag" turned on.
Re: character encoding ambiguities when performing regexps with html entities
by WebDragon (Initiate) on Sep 24, 2007 at 21:09 UTC

    I do something a bit different when working within vim to take pasted text from an MS Word document, and translate the few oddball characters I frequently encounter into html entities.

    It took a good bit of experimentation to work this out, but it works well and consistently in translating on the fly, per line or per selection.

    From my .vimrc.web:

    let myentity = "–—“”‘’«»…ãáçêé¼½¾¿°"
    nmap <buffer> <silent> <localleader>utf :.!perl -MHTML::Entities -Mutf8 -lne 'utf8::decode($_); print encode_entities($_, qq{<C-R>=g:myentity<CR>} );'<CR>
    vmap <buffer> <silent> <localleader>utf :!perl -MHTML::Entities -Mutf8 -lne 'utf8::decode($_); print encode_entities($_, qq{<C-R>=g:myentity<CR>});'<CR>

    Translated out of its vim environment, the line would look something like:

    perl -MHTML::Entities -Mutf8 -lne 'utf8::decode($_); print encode_entities($_, qq{–—“”‘’«»…ãáçêé¼½¾¿°});' <yourfile>

    As always, your mileage may vary, but you should find this useful and consistent. :-)

    update: The above are supposed to be the actual utf-8 literals. In other words, you should see more of these: "ãáçêé¼½¾¿°" and NONE of these: "&#8211;&#8212;&#8220;&#8221;&#8216;&#8217;"

    If I had marked the above as <code>, the literals would all have been escaped, which would have obscured the whole point of the reply.

Re: character encoding ambiguities when performing regexps with html entities
by CountZero (Bishop) on Sep 24, 2007 at 19:23 UTC
    To show the Greek characters you have to set your browser to display at least CP1253, because that code page contains the Greek characters. Displaying full UTF-8 would not hurt, of course.

    Your phrase "I would like the output file to be in strict ASCII so that this problem doesn't arise" doesn't cut any ice. Your LaTeX file is most probably already in "strict" ASCII. (La)TeX, being old and venerable, dates back to times when there was nothing but ASCII, hence all the funny characters were coded as "escaped names" (like \copyright for ©).

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      Actually, the .tex files' encoding is a problem. Were it up to me and my preferred bag of tricks (emacs, pdflatex), I wouldn't be using any funny business. However, pdftex and TeXShop allow Unicode input, which is what the previous coder was using, so the files have a few snippets of Unicode characters in them. It's kind of weird to actually reduce the flexibility of a format, but it's the only way to keep the data consistent across the board.
        I see your problem, but you will not solve it with a regex. Your best bet would be to get rid of such Unicode input in your .tex files and replace those characters with their proper TeX forms.

        CountZero
