Greetings everyone,
I'm trying to resurrect some old Perl/CGI scripts -- a Forum/Bulletin board. The problem is,
I lost my "clean" copy, and now I'm stuck
dealing with a copy that's been handled by who-knows-who, and edited with who-knows-what. So it's
been subjected to windows(office|word95|winword) || Macintosh(simple-text|some-other-mac-editor(s)) ||
who knows what else. As a result, the files have probably been opened and had a BOM added, then
saved as Windows-1252, then opened and saved as UTF-8, then saved as ISO-8859-1, then ?? -- well,
you get the picture. I've run them through my handy dos2unix script to at least unify the line endings
that much. I then ran the following:
#!/bin/sh
for i in $(find . -type f)
do
    iconv -f ISO-8859-1 -t UTF-8 "$i" > "$i.tmp"
    mv "$i.tmp" "$i"
done
which, of course, assumes they're all ISO-8859-1 -- which they are not.
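One way around that blanket assumption might be to ask file(1) for each file's best-guess encoding before converting, and skip anything that's already UTF-8 or plain ASCII. A minimal sketch, assuming GNU file's -b --mime-encoding options (keeping in mind that file can only guess -- on C1-range bytes like 0x99 it may report "unknown-8bit", which iconv will then reject loudly):

```sh
# convert_to_utf8 FILE -- recode FILE to UTF-8 in place, but only
# after asking file(1) what the current encoding looks like
convert_to_utf8 () {
    enc=$(file -b --mime-encoding "$1")
    case $enc in
        utf-8|us-ascii)
            ;;  # already fine; leave the file alone
        *)
            # file(1) is guessing; iconv fails loudly on encodings
            # it doesn't recognize (e.g. "unknown-8bit"), in which
            # case the && leaves the original file untouched
            iconv -f "$enc" -t UTF-8 "$1" > "$1.tmp" &&
                mv "$1.tmp" "$1"
            ;;
    esac
}
```

Driving it from find . -type f -exec also keeps filenames with spaces intact, which the for i in $(find ...) loop does not.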
I ran them through another script I cobbled together utilizing file(1), e.g. file -i. That helped, but the results were still less than optimal. So when I finally felt I had managed to
unify them into a UTF-8 state, I began to edit them, only later to discover
some little square boxes showing up in my editor. Closer examination showed they were
0x99 -- which, read as the code point U+0099, bears the Unicode name "SINGLE GRAPHIC CHARACTER INTRODUCER". Not very helpful, to
me anyway. I decided it would have to be "Perl to the rescue", and set out to find a way
to parse these files and get more info (Perl is MUCH smarter than I am). I discovered the following:
#!/usr/bin/env perl
#
# unicount - count code points in input
# Tom Christiansen <tchrist@perl.com>

use v5.12;
use strict;
use sigtrap;
use warnings;
use charnames ();
use Carp qw(carp croak confess cluck);
use List::Util qw(max);
use Unicode::UCD qw(charinfo charblock);

sub fix_extension;
sub process_input (&) ;
sub set_encoding  (*$);
sub yuck          ($) ;

my $total = 0;
my %seen  = ();

# deep magic here
process_input {
    $total += length;
    $seen{$_}++ for split //;
};

my $dec_width = length($total);
my $hex_width = max(4, length sprintf("%x", max map { ord } keys %seen));

for (sort keys %seen) {
    my $count = $seen{$_};
    my $gcat  = charinfo(ord())->{category};
    my $name  = charnames::viacode(ord())
             || "<unnamed code point in @{[charblock(ord())]}>";
    printf "%*d U+%0*X GC=%2s %s\n",
        $dec_width => $count,
        $hex_width => ord(),
        $gcat      => $name;
}
exit;

##################################################

sub yuck($) {
    my $errmsg = $_[0];
    $errmsg =~ s/(?<=[^\n])\z/\n/;
    print STDERR "$0: $errmsg";
}

sub process_input(&) {
    my $function = shift();
    my $enc;

    if (@ARGV == 0 && -t STDIN && -t STDERR) {
        print STDERR "$0: reading from stdin, type ^D to end or ^C to kill.\n";
    }

    unshift(@ARGV, "-") if @ARGV == 0;

FILE:
    for my $file (@ARGV) {
        # don't let magic open make an output handle
        next if -e $file && ! -f _;

        my $quasi_filename = fix_extension($file);
        $file = "standard input" if $file eq q(-);

        $quasi_filename =~ s/^(?=\s*[>|])/< /;

        no strict "refs";
        my $fh = $file;     # is *so* a lexical filehandle!
        unless (open($fh, $quasi_filename)) {
            yuck("couldn't open $quasi_filename: $!");
            next FILE;
        }
        set_encoding($fh, $file) || next FILE;

        my $whole_file = eval {
            # could just do this a line at a time, but not if counting \R's
            use warnings "FATAL" => "all";
            local $/;
            scalar <$fh>;
        };
        if ($@) {
            $@ =~ s/ at \K.*? line \d+.*/$file line $./;
            yuck($@);
            next FILE;
        }

        do {
            # much faster to alias than to copy
            local *_ = \$whole_file;
            &$function;
        };

        unless (close $fh) {
            yuck("couldn't close $quasi_filename at line $.: $!");
            next FILE;
        }
    } # foreach file
}

# Encoding set to (after unzipping):
#   if    file.pod                              => use whatever =encoding says
#   elsif file.ENCODING for legal encoding name => use that one
#   elsif file is binary                        => use bytes
#   else                                        => use utf8
#
# Note that gzipped stuff always shows up as bytes this way, but
# its internal unzipped bytes are still counted after unzipping
#
sub set_encoding(*$) {
    my ($handle, $path) = @_;

    my $enc_name = (-f $path && -B $path) ? "bytes" : "utf8";

    if ($path && $path =~ m{ \. ([^\s.]+) \z }x) {
        my $ext = $1;
        die unless defined $ext;

        if ($ext eq "pod") {
            my $int_enc = qx{
                perl -C0 -lan -00 -e 'next unless /^=encoding/; print \$F[1]; exit' $path
            };
            if ($int_enc) {
                chomp $int_enc;
                $ext = $int_enc;
                ##print STDERR "$0: reset encoding to $ext on $path\n";
            }
        }

        require Encode;
        if (my $enc_obj = Encode::find_encoding($ext)) {
            my $name = $enc_obj->name || $ext;
            $enc_name = "encoding($name)";
        }
    }

    return 1 if eval {
        use warnings FATAL => "all";
        no strict "refs";
        ##print STDERR qq(binmode($handle, ":$enc_name")\n);
        binmode($handle, ":$enc_name") || die "binmode to $enc_name failed";
        1;
    };

    for ($@) {
        s/ at .* line \d+\.//;
        s/$/ for $path/;
    }
    yuck("set_encoding: $@");
    return undef;
}

sub fix_extension {
    my $path = shift();
    my %Compress = (
        Z       => "zcat",
        z       => "gzcat",     # for uncompressing
        gz      => "gzcat",
        bz      => "bzcat",
        bz2     => "bzcat",
        bzip    => "bzcat",
        bzip2   => "bzcat",
        lzma    => "lzcat",
    );

    if ($path =~ m{ \. ([^.\s]+) \z }x) {
        if (my $prog = $Compress{$1}) {
            # HIP HIP HURRAY! for magic open!!!
            return "$prog $path |";
        }
    }
    return $path;
}

END {
    close(STDIN)  || die "couldn't close stdin: $!";
    close(STDOUT) || die "couldn't close stdout: $!";
}

UNITCHECK {
    $SIG{PIPE}     = sub { exit };
    $SIG{__WARN__} = sub {
        confess "trapped uncaught warning" unless $^S;
    };
}
which, while not necessarily its intended use, did shed some further light.
It dumped the following info:
utf8 "\x99" does not map to Unicode at ./word_lets.cgi line 1
Well, after much further research, I discovered that particular byte is
the trademark sign in Windows-1252 -- ™, a.k.a. U+2122 TRADE MARK SIGN (&#8482; in decimal) -- and not a valid character in ISO-8859-1 or UTF-8 at all.
Now, I'd just stop there and send Perl, find, grep, cat, or awk on a seek-and-replace
mission, and be done with it. But I'm sure this isn't the last of them.
It all wouldn't be such a big deal, except I have over one hundred files to deal with.
Surely I'm not the only one who's had to overcome something like this. I did spend
quite some time trying to find a solution, reading all the perldocs. While there was much to be
learned regarding :unicode && :utf-8 | :utf8, the last time I tried to slurp a file in and modify it
within Perl using unicode | utf8 layers, I ended up with MS-DOS/Windows line endings (CR/LF), and I'm
on a BSD-UNIX machine. :(

Any and all help/pointers greatly appreciated.

Thank you for all your consideration.

--chris

#!/usr/bin/perl -Tw
use perl::always;
my $perl_version = "5.12.4";
print $perl_version;

In reply to Can Perl convert ISO-? | WIN-? | MAC-? to UTF-8? by taint
