There are two problems. One is that non-ASCII data arrives URI-escaped. Analog doesn't translate this for us, so if nothing is done the reports mention things like %D7%A8%20 and other equally useless stuff.
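To make the first problem concrete, here is a minimal sketch of what undoing the escaping looks like. The helper name `unescape_uri` is made up for illustration; it is not part of analog or of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: turn %XX escapes back into raw bytes.
sub unescape_uri {
    my ($text) = @_;
    $text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
    return $text;
}

my $raw = unescape_uri("%D7%A8%20");

# The result is a byte string: 0xD7 0xA8 is the UTF-8 encoding of the
# Hebrew letter resh (U+05E8), and 0x20 is a plain space.
print join(' ', map { sprintf '%02X', ord } split //, $raw), "\n";
# prints "D7 A8 20"
```

Note that unescaping only gets you bytes; it says nothing about which encoding those bytes are in.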
The more serious problem is that data can arrive in an unknown encoding: either UTF-8 or one of the less chic 8-bit encodings. None of the data identifies itself, so there's no bulletproof algorithm to fix this.
The snippet below makes a best effort and will work only for sites that feature ONE language other than English. (Or, more precisely, that need only one 8-bit encoding besides UTF-8.) This is a hack; this is only a hack. But it works for me :)
What it does is read the incoming data line by line, undo the URI escaping, and attempt to treat all incoming URLs as UTF-8, falling back on The Preferred Charset when that fails. Then it normalizes everything to that charset and pipes it off to analog. There's an additional step of fixing up the charset declared in the final report, because analog hardcodes that to the charset of the actual *language* the report is in (ew!).
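The "try UTF-8, else fall back" step can be sketched on its own like this. The names `normalize_bytes` and `$PREFERRED` are made up for illustration, and the sample bytes are just the letter resh in both encodings; treat it as a sketch under those assumptions, not as the one true way.

```perl
use strict;
use warnings;
use Encode ();

# Assumption: the site's single non-English charset (CP1255 in this post).
my $PREFERRED = 'cp1255';

# Hypothetical helper: take raw bytes, return bytes in the preferred charset.
sub normalize_bytes {
    my ($bytes) = @_;
    my $text = eval {
        # Work on a copy: decode() may modify its source when CHECK is set.
        my $copy = $bytes;
        Encode::decode('UTF-8', $copy, Encode::FB_CROAK);
    };
    # Not valid UTF-8, so assume the bytes are already in the preferred charset.
    $text = Encode::decode($PREFERRED, $bytes) unless defined $text;
    return Encode::encode($PREFERRED, $text);
}

my $from_utf8   = normalize_bytes("\xD7\xA8");  # resh as UTF-8
my $from_cp1255 = normalize_bytes("\xF8");      # resh as CP1255
print $from_utf8 eq $from_cp1255 ? "same\n" : "different\n";
# prints "same"
```

The weak spot is the same as in the full script: an 8-bit byte sequence that happens to be valid UTF-8 will be taken for UTF-8, which is why this stays a best effort.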
To use this, you will need to edit the paths and the hardcoded preferred charset. Good luck!

    #!/usr/bin/perl -wC0
    use strict;
    use Encode;
    use Tie::File;
    use PerlIO::gzip;

    our $LOGDIR = "/var/log/apache-perl";
    our $REPORT = "/var/www/analog/analog.html";

    # 1. read input and normalize it.
    # 2. pipe it off to analog
    # 3. when analog finishes, fixup the silly encoding.

    # 2 has to actually start before 1 :)
    my $pid = open ANALOG, "| analog -" or die "can't start analog: $!";
    $SIG{PIPE} = sub { die "sigpipe" };

    # read logs
    foreach my $file (glob $LOGDIR . "/access.log*") {
        #print "Preprocessing $file\n";
        # gzip autopop (see PerlIO::gzip) transparently handles gz or non-gz input
        open my $log, "<:gzip(autopop)", $file or die "can't read log $file: $!";
        while(<$log>) {
            # normalize URI-escaped and variously encoded data to CP1255.
            s/(?:%|\\x)([0-9A-Fa-f]{2})/chr(hex($1))/eg;
            eval { $_ = Encode::decode('utf-8', $_, 1); 1 }
                or $_ = Encode::decode('cp1255', $_);
            $_ = Encode::encode('cp1255', $_);
            print ANALOG or die "analog: $!";
        }
    }

    # let analog do its stuff
    close ANALOG or die "analog: $!";
    waitpid $pid, 0; # let it finish

    # now 3 - fixup the silly encoding.
    tie my @report, 'Tie::File', $REPORT or die "Tie::File: $!";
    FIXUP: foreach (@report) {
        # take care not to change the length of the line, so that the editing
        # may be done in-place.
        s{charset=ISO-8859-1"} {charset=CP-1255"   } && last FIXUP;
    }
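The same-length trick in the fixup step can also be done without counting spaces by hand. A minimal sketch, where `swap_charset` is a made-up helper and the sample `<meta>` line is only a guess at what such a line might look like, not copied from a real analog report:

```perl
use strict;
use warnings;

# Hypothetical helper: swap one charset label for another without changing
# the line length, padding with spaces after the closing quote.
sub swap_charset {
    my ($line, $old, $new) = @_;
    return $line if length($new) > length($old);   # cannot fit; leave untouched
    my $pad = ' ' x (length($old) - length($new));
    $line =~ s/charset=\Q$old\E"/charset=$new"$pad/;
    return $line;
}

# Assumed sample line, for illustration only.
my $before = '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">';
my $after  = swap_charset($before, 'ISO-8859-1', 'CP-1255');
print length($before) == length($after) ? "length preserved\n" : "length changed\n";
# prints "length preserved"
```

Since the padding lands inside the tag but outside the attribute value, it should be harmless to browsers.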