Greetings, and thank you for your reply.
This is nearly the same output
I received when running the Perl script I posted.
The script merely indicated that Unicode::UCD couldn't properly map "\x99" (0099) | "&#8482;" (in decimal),
to a Unicode symbol/entity. In all likelihood, that was because the document was encoded as Windows-1252|ISO-8859-1 instead of UTF-8|UTF8. I've examined enough of the documents
to know that they aren't "junk", but rather files meant to be UTF-8 that weren't saved accordingly.
So, knowing that Perl is quite Unicode|UTF-8 savvy, I was hoping I could find
a way to let Perl discover a document's current, incorrect encoding (say, ISO-8859-1), and either
convert the embedded symbols to their decimal equivalents or, if it's safe, save the document as UTF-8.
In fact, saving that same document as UTF-8 and running the script on it caused the script to emit that error. Reading the same document, with the embedded symbols/characters in it, as
ISO-8859-1 with that script emitted:
6 U+0009 GC=Cc CHARACTER TABULATION
2564 U+000A GC=Cc LINE FEED (LF)
25209 U+0020 GC=Zs SPACE
8436 U+0021 GC=Po EXCLAMATION MARK
167 U+0022 GC=Po QUOTATION MARK
35 U+0023 GC=Po NUMBER SIGN
7 U+0024 GC=Sc DOLLAR SIGN
1140 U+0025 GC=Po PERCENT SIGN
46 U+0026 GC=Po AMPERSAND
108 U+0027 GC=Po APOSTROPHE
134 U+0028 GC=Ps LEFT PARENTHESIS
134 U+0029 GC=Pe RIGHT PARENTHESIS
14 U+002A GC=Po ASTERISK
2751 U+002C GC=Po COMMA
439 U+002D GC=Pd HYPHEN-MINUS
1655 U+002E GC=Po FULL STOP
518 U+002F GC=Po SOLIDUS
73 U+0030 GC=Nd DIGIT ZERO
91 U+0031 GC=Nd DIGIT ONE
107 U+0032 GC=Nd DIGIT TWO
53 U+0033 GC=Nd DIGIT THREE
30 U+0034 GC=Nd DIGIT FOUR
49 U+0035 GC=Nd DIGIT FIVE
13 U+0036 GC=Nd DIGIT SIX
5 U+0037 GC=Nd DIGIT SEVEN
21 U+0038 GC=Nd DIGIT EIGHT
12 U+0039 GC=Nd DIGIT NINE
331 U+003A GC=Po COLON
43 U+003B GC=Po SEMICOLON
714 U+003C GC=Sm LESS-THAN SIGN
2176 U+003D GC=Sm EQUALS SIGN
2853 U+003E GC=Sm GREATER-THAN SIGN
103 U+003F GC=Po QUESTION MARK
4 U+0040 GC=Po COMMERCIAL AT
665 U+0041 GC=Lu LATIN CAPITAL LETTER A
547 U+0042 GC=Lu LATIN CAPITAL LETTER B
370 U+0043 GC=Lu LATIN CAPITAL LETTER C
331 U+0044 GC=Lu LATIN CAPITAL LETTER D
625 U+0045 GC=Lu LATIN CAPITAL LETTER E
323 U+0046 GC=Lu LATIN CAPITAL LETTER F
104 U+0047 GC=Lu LATIN CAPITAL LETTER G
171 U+0048 GC=Lu LATIN CAPITAL LETTER H
509 U+0049 GC=Lu LATIN CAPITAL LETTER I
32 U+004A GC=Lu LATIN CAPITAL LETTER J
83 U+004B GC=Lu LATIN CAPITAL LETTER K
378 U+004C GC=Lu LATIN CAPITAL LETTER L
594 U+004D GC=Lu LATIN CAPITAL LETTER M
520 U+004E GC=Lu LATIN CAPITAL LETTER N
410 U+004F GC=Lu LATIN CAPITAL LETTER O
653 U+0050 GC=Lu LATIN CAPITAL LETTER P
39 U+0051 GC=Lu LATIN CAPITAL LETTER Q
623 U+0052 GC=Lu LATIN CAPITAL LETTER R
564 U+0053 GC=Lu LATIN CAPITAL LETTER S
912 U+0054 GC=Lu LATIN CAPITAL LETTER T
486 U+0055 GC=Lu LATIN CAPITAL LETTER U
89 U+0056 GC=Lu LATIN CAPITAL LETTER V
196 U+0057 GC=Lu LATIN CAPITAL LETTER W
8 U+0058 GC=Lu LATIN CAPITAL LETTER X
394 U+0059 GC=Lu LATIN CAPITAL LETTER Y
4 U+005A GC=Lu LATIN CAPITAL LETTER Z
21 U+005B GC=Ps LEFT SQUARE BRACKET
21 U+005D GC=Pe RIGHT SQUARE BRACKET
5 U+005E GC=Sk CIRCUMFLEX ACCENT
4766 U+005F GC=Pc LOW LINE
10143 U+0061 GC=Ll LATIN SMALL LETTER A
2570 U+0062 GC=Ll LATIN SMALL LETTER B
4103 U+0063 GC=Ll LATIN SMALL LETTER C
4907 U+0064 GC=Ll LATIN SMALL LETTER D
16937 U+0065 GC=Ll LATIN SMALL LETTER E
2591 U+0066 GC=Ll LATIN SMALL LETTER F
2564 U+0067 GC=Ll LATIN SMALL LETTER G
3859 U+0068 GC=Ll LATIN SMALL LETTER H
9548 U+0069 GC=Ll LATIN SMALL LETTER I
87 U+006A GC=Ll LATIN SMALL LETTER J
502 U+006B GC=Ll LATIN SMALL LETTER K
6444 U+006C GC=Ll LATIN SMALL LETTER L
4640 U+006D GC=Ll LATIN SMALL LETTER M
7574 U+006E GC=Ll LATIN SMALL LETTER N
10936 U+006F GC=Ll LATIN SMALL LETTER O
4417 U+0070 GC=Ll LATIN SMALL LETTER P
4481 U+0071 GC=Ll LATIN SMALL LETTER Q
10310 U+0072 GC=Ll LATIN SMALL LETTER R
10046 U+0073 GC=Ll LATIN SMALL LETTER S
11385 U+0074 GC=Ll LATIN SMALL LETTER T
4523 U+0075 GC=Ll LATIN SMALL LETTER U
1888 U+0076 GC=Ll LATIN SMALL LETTER V
1574 U+0077 GC=Ll LATIN SMALL LETTER W
537 U+0078 GC=Ll LATIN SMALL LETTER X
2773 U+0079 GC=Ll LATIN SMALL LETTER Y
80 U+007A GC=Ll LATIN SMALL LETTER Z
19 U+007B GC=Ps LEFT CURLY BRACKET
10 U+007C GC=Sm VERTICAL LINE
19 U+007D GC=Pe RIGHT CURLY BRACKET
207 U+007E GC=Sm TILDE
55 U+0099 GC=Cc <unnamed code point in Latin-1 Supplement>
3 U+00A0 GC=Zs NO-BREAK SPACE
The line (55 U+0099 GC=Cc <unnamed code point in Latin-1 Supplement>) is the offending symbol/character.
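A minimal sketch of the kind of thing I'm hoping for, assuming the only two candidates are UTF-8 and Windows-1252 (where byte 0x99 is the trade mark sign, U+2122, decimal 8482). The helper names decode_guess and to_decimal_entities are hypothetical, made up for this illustration. The "guess" is only a heuristic: bytes that happen to be valid UTF-8 are treated as UTF-8, and anything else falls back to cp1252.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: try strict UTF-8 first, fall back to Windows-1252.
# Returns the decoded text and the name of the encoding that was used.
sub decode_guess {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed UTF-8;
    # LEAVE_SRC keeps $bytes intact so the fallback can reuse it.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return ($text, 'UTF-8') if defined $text;
    # Assumption: anything that is not valid UTF-8 is Windows-1252.
    return (decode('cp1252', $bytes), 'cp1252');
}

# Replace every non-ASCII character with its decimal character reference.
sub to_decimal_entities {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $text;
}

my $bytes = "Widget\x99 rocks";           # raw bytes, as read from the file
my ($text, $used) = decode_guess($bytes);
print "decoded as $used\n";
print to_decimal_entities($text), "\n";   # option 1: decimal entities
my $utf8 = encode('UTF-8', $text);        # option 2: bytes to save as UTF-8
```

A lone 0x99 byte is never valid UTF-8, so the strict decode fails and the cp1252 fallback turns it into U+2122 rather than the U+0099 control character that an ISO-8859-1 reading produces.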
Anyway, I see you've provided some other possibilities. So I'd probably do well to further investigate them.
Thanks again for taking the time to respond.
--chris
#!/usr/bin/perl -Tw
use perl::always;
my $perl_version = "5.12.4";
print $perl_version;