Re^2: character encoding ambiguities when performing regexps with html entities

Actually, the .tex file's encoding is a problem. Were it up to me and my preferred bag of tricks (Emacs, pdflatex), I wouldn't be using any funny business. However, pdftex and TeXShop allow Unicode input, which is what the previous coder was using, so the files have a few snippets of Unicode characters in them. It's kind of weird to actually reduce the flexibility of a format, but it's the only way to keep the data consistent across the board.

Re^3: character encoding ambiguities when performing regexps with html entities by CountZero (Bishop) on Sep 24, 2007 at 20:00 UTC
I see your problem but you will not solve it with a regex. Your best bet would be to get rid of such Unicode input in your tex-files and replace these characters by their proper tex-forms.

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
Re^4: character encoding ambiguities when performing regexps with html entities by angelixd (Novice) on Sep 25, 2007 at 16:06 UTC
Yeah, this is what I ended up doing. I used the regex previously mentioned to find them, though. There was no way I was going to eyeball ~20,000 lines of code... This is what I ended up using to find them; hopefully someone can look at it and use it later:

```perl
#!/usr/bin/env perl
# find_extended_chars.txt
# This script finds any non-ASCII characters in the given input files
# and prints the line & line number so that it can be changed.
use strict;
use warnings;

foreach my $file (@ARGV) {
    my $error = 0;    # set once the file name has been printed

    # Skip unreadable files instead of silently reading nothing.
    open my $fh, '<', $file or do {
        warn "Cannot open $file: $!\n";
        next;
    };

    while (my $text = <$fh>) {
        if ($text =~ m/[^[:ascii:]]/) {
            # Print the file name once, before its first offending line.
            if (!$error) {
                print $file . "\n";
                $error = 1;
            }
            print "$.\t$text";    # $. holds the current line number
        }
    }
    close $fh;    # also resets $. for the next file
}
```
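
For anyone who wants to automate the replacement step CountZero suggested, here is a rough sketch of how it might look. It is not what was actually used in this thread: the `%tex_for` table and the `to_tex` helper are hypothetical names, the mapping is deliberately tiny, and it assumes the text has already been decoded from UTF-8. Anything not in the table is left alone so that the finder script still reports it.

```perl
#!/usr/bin/env perl
# Sketch only: swap a handful of non-ASCII characters for TeX forms.
# %tex_for and to_tex() are hypothetical; extend the table to cover
# whatever the finder script reports for your files.
use strict;
use warnings;

my %tex_for = (
    "\x{E9}"   => q{\'{e}},    # e with acute accent
    "\x{FC}"   => q{\"{u}},    # u with diaeresis
    "\x{2013}" => q{--},       # en dash
    "\x{2014}" => q{---},      # em dash
    "\x{201C}" => q{``},       # left double quotation mark
    "\x{201D}" => q{''},       # right double quotation mark
);

sub to_tex {
    my ($text) = @_;
    # Replace each mapped character; keep unmapped ones unchanged.
    $text =~ s{([^[:ascii:]])}{ exists $tex_for{$1} ? $tex_for{$1} : $1 }ge;
    return $text;
}

# Small in-memory demonstration.
binmode STDOUT, ':encoding(UTF-8)';
my $sample = "caf\x{E9} \x{2013} \x{201C}quoted\x{201D}";
print to_tex($sample), "\n";
```

To run this over real files, the input would need to be opened with a decoding layer such as `'<:encoding(UTF-8)'` (assuming the files really are UTF-8), and the output checked by eye before overwriting anything.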