I see your problem, but you will not solve it with a regex. Your best bet would be to get rid of such Unicode input in your TeX files and replace these characters with their proper TeX forms.
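A minimal sketch of what such a replacement pass could look like, assuming the text has already been decoded into Perl characters. The `%tex_for` table and the `to_tex` helper are hypothetical names made up for illustration, and the table only covers a handful of characters; it would need extending with whatever actually turns up in the files.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;    # the literal characters in %tex_for below are UTF-8

# Hypothetical mapping, for illustration only; extend it with the
# characters that actually occur in your files.
my %tex_for = (
    'é' => q{\'e},
    'è' => q{\`e},
    'ü' => q{\"u},
    'ß' => q{\ss{}},
    'ñ' => q{\~n},
);

# Replace every non-ASCII character that has an entry in %tex_for;
# characters without an entry are left untouched.
sub to_tex {
    my ($text) = @_;
    $text =~ s/([^[:ascii:]])/exists $tex_for{$1} ? $tex_for{$1} : $1/ge;
    return $text;
}

print to_tex("Straße, café"), "\n";    # prints: Stra\ss{}e, caf\'e
```

Anything not in the table passes through unchanged, so a second run of a non-ASCII finder will show what is still left to handle.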

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re^4: character encoding ambiguities when performing regexps with html entities
by angelixd (Novice) on Sep 25, 2007 at 16:06 UTC
    Yeah, this is what I ended up doing. I used the previously mentioned regex to find them, though; there was no way I was going to eyeball ~20,000 lines of code. This is the script I ended up using to find them; hopefully someone can look at it and use it later:
    #!/usr/bin/env perl
    # find_extended_chars.txt
    # Finds any non-ASCII characters in the given input files and prints
    # each offending line with its line number so that it can be changed.
    use strict;
    use warnings;

    foreach my $file (@ARGV) {
        my $error = 0;    # set once the file name has been printed
        my $line  = 1;    # current line number
        open my $fh, '<', $file
            or do { warn "Cannot open $file: $!\n"; next };
        while (<$fh>) {
            if (/[^[:ascii:]]/) {
                # Print the file name once, before its first match.
                if (!$error) {
                    print "$file\n";
                    $error = 1;
                }
                print "$line\t$_";
            }
            $line++;
        }
        close $fh;
    }
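    A possible variation on the script above, in case it helps when deciding which TeX form each character needs: instead of printing the whole line, it lists the code points of the offending characters. This is only a sketch. It assumes the files are UTF-8 (change the encoding layer if they are not), and `non_ascii_codepoints` is a helper name invented here.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Return a "U+XXXX" label for each non-ASCII character in a string
# that has already been decoded into Perl characters.
sub non_ascii_codepoints {
    my ($text) = @_;
    return map { sprintf('U+%04X', ord $_) } $text =~ /([^[:ascii:]])/g;
}

foreach my $file (@ARGV) {
    # Assumption: the input is UTF-8; change the layer if it is not.
    open my $fh, '<:encoding(UTF-8)', $file
        or do { warn "Cannot open $file: $!\n"; next };
    while (my $row = <$fh>) {
        my @found = non_ascii_codepoints($row) or next;
        # $. holds the current line number of the last-read filehandle.
        print "$file:$.\t@found\n";
    }
    close $fh;
}
```

    Run it the same way as the original, with the file names as arguments.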