angelixd has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone. I'm doing some translation of LaTeX for the web, and am writing a small translator to turn a (very small) set of symbols into HTML entities. We want our database to store HTML entities instead of Unicode, so I'm doing this in ASCII. Here's a copy of my translation table:
my %allowed_text_code = (
    '$\\alpha$' => '&#945;',
    '$\\beta$'  => '&#946;',
    '$\\gamma$' => '&#947;',
    '$\\delta$' => '&#948;',
    '$\\theta$' => '&#952;',
    '$\\pi$'    => '&#960;',
    '\\degrees' => '&#176;',
);
and the regular expression I'm using later on:
foreach $tex_key (keys %allowed_text_code) {
    $text =~ s/\Q$tex_key\E/$allowed_text_code{$tex_key}/g;
    print $text . "\n";
}
I've only tested the '$\\pi$' tag, and whenever I look at the resulting file, I'm getting a carriage-return character and a script f ('¶ƒ' on OS X, in TextWrangler). On the web, I'm getting '¦Ä'. Does anyone have any idea what the encoding issue is? Ideally, I would like the output file to be in strict ASCII so that this problem doesn't arise. Or do I need to escape my values in my regular expression? What am I missing here?

Replies are listed 'Best First'.
Re: character encoding ambiguities when performing regexps with html entities
by mwah (Hermit) on Sep 24, 2007 at 17:58 UTC
    You don't have to escape the \\ if you don't put it into ".."-quotes (except \'). Just by removing the escapes, your code will work fine. A workable example:
    use strict;
    use warnings;

    my $text = q'
    \Start
    We have $\alpha$-helical to $\beta$-sheet proteins and stuff.
    The $\beta$-sheet structures relate to $\pi$ by several \degrees.
    ';

    my %allowed_text_code = (
        '$\alpha$' => '&#945;',
        '$\beta$'  => '&#946;',
        '$\gamma$' => '&#947;',
        '$\delta$' => '&#948;',
        '$\theta$' => '&#952;',
        '$\pi$'    => '&#960;',
        '\degrees' => '&#176;',
    );

    foreach my $tex_key (keys %allowed_text_code) {
        $text =~ s/\Q$tex_key\E/$allowed_text_code{$tex_key}/g;
    }
    print $text . "\n";

    This would not hold if your TeX source doesn't look like I expected ;-)

    Regards
    mwa

      You don't have to escape the \\ if you don't put it into ".."'quotes (except \'),

      But there's no harm in doing so. In fact, I always escape \ in single quotes to avoid accidentally writing something like

      $path = '\\server\share'; # XXX WRONG: gives \server\share, not \\server\share

      Your change produces no difference whatsoever.

        ikegami: $path = '\\server\share'; # XXX WRONG

        Correct; I'd better have written that \\ and \' *can* be escaped in single quotes.

        BTW, I'd rather guess that the OP's problem comes entirely from a UTF-x to ISO-y-z (or something else) conversion, but he didn't hint at this or give input data.

        Regards
        mwa
Re: character encoding ambiguities when performing regexps with html entities
by ikegami (Patriarch) on Sep 24, 2007 at 18:34 UTC

    What happens to $text afterwards?

    As you can see if you view the produced HTML doc in your web browser, the bit you provided works fine.

    my %allowed_text_code = (
        '$\\alpha$' => '&#945;',
        '$\\beta$'  => '&#946;',
        '$\\gamma$' => '&#947;',
        '$\\delta$' => '&#948;',
        '$\\theta$' => '&#952;',
        '$\\pi$'    => '&#960;',
        '\\degrees' => '&#176;',
    );

    my $text = join '', keys %allowed_text_code;

    foreach my $tex_key (keys %allowed_text_code) {
        $text =~ s/\Q$tex_key\E/$allowed_text_code{$tex_key}/g;
    }

    open(my $fh, '>', 'temp.html') or die;
    print $fh ("$text\n");

      So, thankfully, it turns out that the problem I was trying to solve was still causing me grief. What happens to $text is that it gets dumped to another text file that describes the LaTeX sans comments, extraneous spacing, etc. That file is then searched as plain text for web display, which is why I wanted to substitute HTML entities for the LaTeX tags. One of my coders was using straight Unicode in the .tex files, and I felt that was a bad idea (mostly since we want to keep it simple), so that's why I'm doing this simple workaround.

      On a related note, is there a regular expression I can write that detects non-ASCII characters? I really don't want to hand-check a few tens of thousands of lines of LaTeX code...

        is there a regular expression I can write that detects non-ascii characters?

        Here are a couple easy ones:

        /[^\x00-\x7f]/
        /[^[:ascii:]]/
        They both work whether or not the string happens to have its "utf8 flag" turned on.
Re: character encoding ambiguities when performing regexps with html entities
by WebDragon (Initiate) on Sep 24, 2007 at 21:09 UTC

    I do something a bit different when working within vim to take pasted text from an MS Word document, and translate the few oddball characters I frequently encounter into html entities.

    It took a good bit of experimentation to work this out, but it works well and consistently in translating on the fly, per line or per selection.

    From my .vimrc.web:

    let myentity = "–—“”‘’«»…ãáçêé¼½¾¿°"
    nmap <buffer> <silent> <localleader>utf :.!perl -MHTML::Entities -Mutf8 -lne 'utf8::decode($_); print encode_entities($_, qq{<C-R>=g:myentity<CR>} );'<CR>
    vmap <buffer> <silent> <localleader>utf :!perl -MHTML::Entities -Mutf8 -lne 'utf8::decode($_); print encode_entities($_, qq{<C-R>=g:myentity<CR>});'<CR>

    Translated out of its vim environment, the line would look something like:

    perl -MHTML::Entities -Mutf8 -lne 'utf8::decode($_); print encode_entities($_, qq{–—“”‘’«»…ãáçêé¼½¾¿°});' <yourfile>

    As always, your mileage may vary, but you should find this useful and consistent. :-)

    update: The above are supposed to be the actual utf-8 literals. In other words, you should see more of these: "ãáçêé¼½¾¿°" and NONE of these: "&#8211;&#8212;&#8220;&#8221;&#8216;&#8217;"

    If I had marked the above as <code>, the literals would all have been escaped, which would have obscured the whole point of the reply.

Re: character encoding ambiguities when performing regexps with html entities
by CountZero (Bishop) on Sep 24, 2007 at 19:23 UTC
    To show the Greek characters you have to set your browser to display at least CP1253, because that code page contains the Greek characters. Displaying full UTF-8 would not hurt, of course.

    Your phrase "I would like the output file to be in strict ASCII so that this problem doesn't arise" doesn't cut any ice. Your LaTeX file is most probably already in "strict" ASCII. (La)TeX, being old and venerable, dates back to times when there was nothing but ASCII, hence all the funny characters were coded as "escaped names" (like \copyright for ©).

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      Actually, the .tex files' encoding is a problem. Were it up to me and my preferred bag of tricks (emacs, pdflatex), I wouldn't be using any funny business. However, pdftex and TeXShop allow Unicode input, which is what the previous coder was using, so the files have a few snippets of Unicode characters in them. It's kind of weird to actually reduce the flexibility of a format, but it's the only way to keep the data consistent across the board.
        I see your problem, but you will not solve it with a regex. Your best bet would be to get rid of such Unicode input in your .tex files and replace those characters with their proper TeX forms.

        CountZero
