Re: character encoding ambiguities when performing regexps with html entities

To show the Greek characters, your browser has to display the page in an encoding that contains them, such as CP1253 (the Windows Greek code page). Displaying it as UTF-8 would work as well, of course.

Your phrase "I would like the output file to be in strict ASCII so that this problem doesn't arise" doesn't hold water. Your LaTeX file is most probably already in "strict" ASCII. (La)TeX, being old and venerable, dates back to a time when there was nothing but ASCII, hence all the funny characters were coded as "escaped names" (like \copyright for ©).

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re^2: character encoding ambiguities when performing regexps with html entities by angelixd (Novice) on Sep 24, 2007 at 19:37 UTC
Actually, the .tex file's encoding is a problem. Were it up to me and my preferred bag of tricks (emacs, pdflatex), I wouldn't be using any funny business. However, pdftex and TeXShop allow Unicode input, which is what the previous coder was using, so the files have a few snippets of Unicode characters in them. It's kind of weird to actually reduce the flexibility of a format, but it's the only way to keep the data consistent across the board.
Re^3: character encoding ambiguities when performing regexps with html entities by CountZero (Bishop) on Sep 24, 2007 at 20:00 UTC
I see your problem, but you will not solve it with a regex. Your best bet would be to get rid of such Unicode input in your .tex files and replace these characters with their proper TeX forms.
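For illustration, replacing characters with their TeX forms could look roughly like the sketch below. The `%tex_for` table and the `to_tex` name are hypothetical and deliberately tiny; the TeX forms chosen are only examples, and the table would need extending for whatever characters actually occur in the files. It assumes the text has already been decoded into Perl characters (for instance by opening the file with `'<:encoding(UTF-8)'`).

```perl
use strict;
use warnings;

# Hypothetical mapping from Unicode characters to TeX forms.
# Only three entries, for illustration; extend as needed.
my %tex_for = (
    "\x{00A9}" => '\\copyright{}',    # copyright sign
    "\x{00E9}" => "\\'e",             # e with acute accent
    "\x{03B1}" => '$\\alpha$',        # Greek small alpha
);

# Replace every mapped non-ASCII character in a decoded string.
# Characters without an entry in %tex_for are left untouched.
sub to_tex {
    my ($text) = @_;
    $text =~ s/([^[:ascii:]])/exists $tex_for{$1} ? $tex_for{$1} : $1/ge;
    return $text;
}

print to_tex("\x{00A9} 2007 caf\x{00E9}"), "\n";
```

Anything the table does not cover stays in the text, so a non-ASCII scan over the result still shows what is left to fix by hand.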
Re^4: character encoding ambiguities when performing regexps with html entities by angelixd (Novice) on Sep 25, 2007 at 16:06 UTC
Yeah, this is what I ended up doing. I used the regex previously mentioned to find them, though. There was no way I was going to eyeball ~20,000 lines of code... This is what I ended up using to find them; hopefully someone can look at it and use it later:

```perl
#!/usr/bin/env perl
# find_extended_chars.txt
# This script finds any non-ASCII characters in the given input files
# and prints the line and line number so that they can be changed.
use strict;
use warnings;

foreach my $file (@ARGV) {
    my $error = 0;
    my $line  = 1;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    while (<$fh>) {
        if ( $_ =~ m/[^[:ascii:]]/ ) {
            # Print the file name once, before its first offending line.
            if ( !$error ) {
                print $file . "\n";
                $error = 1;
            }
            print $line . "\t" . $_;
        }
        $line++;
    }
    close $fh;
}
```
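A possible refinement, sketched here under the assumption that each line has been decoded into characters first: instead of printing the whole line, report the column and code point of each offending character. The `non_ascii_positions` name is made up for this example and is not part of the script above.

```perl
use strict;
use warnings;

# Return a list of [column, "U+XXXX"] pairs, one per non-ASCII character
# in an already-decoded line. Columns are 1-based.
sub non_ascii_positions {
    my ($line) = @_;
    my @hits;
    while ( $line =~ /([^[:ascii:]])/g ) {
        # After a one-character match, pos() equals that character's
        # 1-based column.
        push @hits, [ pos($line), sprintf( 'U+%04X', ord($1) ) ];
    }
    return @hits;
}

for my $hit ( non_ascii_positions("caf\x{00E9} \x{03B1}") ) {
    printf "column %d: %s\n", @$hit;
}
```

Knowing the code point makes it easier to look up the matching TeX form than squinting at a character the terminal may not render.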