I help run a site with non-ASCII data that search engines refer visitors to. We use analog, a great web traffic analyzer, to summarize our traffic, but unfortunately it doesn't cope very well with non-English data.

There are two problems. One is that non-ASCII data comes URI escaped. Analog doesn't translate this for us, so if nothing is done then the reports mention things like %D7%A8%20 and useless stuff like that.
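
To make the first problem concrete, here is a minimal sketch of undoing the escaping. The helper name `unescape_bytes` is made up for this illustration; the regex is the one the script further down uses. The sample `%D7%A8` is the UTF-8 encoding of the Hebrew letter resh (U+05E8).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of analog): turn %XX and \xXX escapes
# back into the raw bytes they stand for.
sub unescape_bytes {
    my ($s) = @_;
    $s =~ s/(?:%|\\x)([0-9A-Fa-f]{2})/chr(hex($1))/eg;
    return $s;
}

my $raw = unescape_bytes('q=%D7%A8%20');

# Print the bytes as hex so the output is safe on any terminal.
print join(' ', map { sprintf '%02X', ord($_) } split //, $raw), "\n";
# prints "71 3D D7 A8 20"
```

Note that the result is still a string of bytes; which characters those bytes mean is the second problem.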

The more serious problem is that data can arrive in an unknown encoding: either UTF-8 or one of the less chic 8-bit encodings. None of the data identifies itself, so there's no bulletproof algorithm to fix this.

The snippet below makes a best effort and will work only for sites that feature ONE language besides English. (Or, more precisely, ones that need just one extra 8-bit encoding.) This is a hack; this is only a hack. But it works for me :)

What it does is read incoming data line by line and try to treat the unescaped data as UTF-8, falling back on The Preferred Charset when that fails. Then it normalizes everything to that charset and pipes it off to analog. There's an additional step of fixing up the charset declared in the final report, because analog hardcodes it according to the *language* the report is written in (ew!).
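
The try-UTF-8-then-fall-back step can be sketched in isolation like this. The helper name `normalize_bytes` and the `$PREFERRED` variable are hypothetical names for this illustration only; CP1255 is assumed as the preferred charset, as in the script below.

```perl
use strict;
use warnings;
use Encode ();

# Assumption for this sketch: the site's one 8-bit charset.
my $PREFERRED = 'cp1255';

# Hypothetical helper: take raw bytes, return bytes in $PREFERRED.
sub normalize_bytes {
    my ($bytes) = @_;

    # Strict decode: FB_CROAK makes decode die on malformed UTF-8,
    # LEAVE_SRC stops it from modifying $bytes in place.
    my $text = eval {
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };

    # Not valid UTF-8, so assume it was already in the preferred charset.
    $text = Encode::decode($PREFERRED, $bytes) unless defined $text;

    return Encode::encode($PREFERRED, $text);
}

# Resh as UTF-8 (D7 A8) and as CP1255 (F8) both normalize to F8.
for my $in ("\xD7\xA8", "\xF8") {
    print uc(unpack('H*', normalize_bytes($in))), "\n";
}
```

This guesses wrong whenever 8-bit data happens to form valid UTF-8, which is why it is only a best effort.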

To use this, you will need to edit the paths and the hardcoded preferred charset. Good luck!

#!/usr/bin/perl -wC0

use strict;
use Encode;
use Tie::File;
use PerlIO::gzip;

our $LOGDIR = "/var/log/apache-perl";
our $REPORT = "/var/www/analog/analog.html";

# 1. read input and normalize it.
# 2. pipe it off to analog
# 3. when analog finishes, fixup the silly encoding.

# 2 has to actually start before 1 :)

my $pid = open ANALOG, "| analog -" or die "can't start analog: $!";
$SIG{PIPE} = sub { die "sigpipe" };

# read logs
foreach my $file (glob $LOGDIR . "/access.log*") {
        #print "Preprocessing $file\n";
        # gzip autopop (see PerlIO::gzip) transparently handles gz or non-gz input
        open my $log, "<:gzip(autopop)", $file or die "can't read log $file: $!";
        while(<$log>) {
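                # normalize URI-escaped and variously encoded data to CP1255.

                # undo %XX (URI) and \xXX (as written by Apache) escapes
                s/(?:%|\\x)([0-9A-Fa-f]{2})/chr(hex($1))/eg;

                # strict UTF-8 first; if the line is malformed, assume CP1255
                eval { $_ = Encode::decode('utf-8', $_, Encode::FB_CROAK | Encode::LEAVE_SRC); 1 } or
                        $_ = Encode::decode('cp1255', $_);
                $_ = Encode::encode('cp1255', $_);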

                print ANALOG or die "analog: $!";
        }
}

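# let analog do its stuff; close on a piped handle waits for the child to exit
# and returns false (with $! set to 0) if analog exited with non-zero status
close ANALOG or die $! ? "analog: $!" : "analog exited with status $?";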

# now 3 - fixup the silly encoding.
tie my @report, 'Tie::File', $REPORT or die "Tie::File: $!";
FIXUP: foreach (@report) {
        # take care not to change the length of the line, so that the editing
        # may be done in-place. ("cp1255" is a label browsers recognize for
        # windows-1255.)
        s{charset=ISO-8859-1"}
         {charset=cp1255"    } && last FIXUP;
}
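
If you adapt the fixup to another charset, counting the padding spaces by hand is easy to get wrong. Here is a minimal sketch that computes the padding instead; the helper name `replace_same_length` and the sample `<meta>` line are made up for this illustration and are not taken from analog's actual output.

```perl
use strict;
use warnings;

# Hypothetical helper: replace $old with $new inside $line, padding $new
# with spaces so the line keeps its original length. Returns undef when
# $old is absent or $new is too long to fit.
sub replace_same_length {
    my ($line, $old, $new) = @_;
    return if length($new) > length($old);
    my $pos = index($line, $old);
    return if $pos < 0;
    my $padded = $new . ' ' x (length($old) - length($new));
    substr($line, $pos, length($old)) = $padded;
    return $line;
}

# Sample line, assumed for the sketch.
my $before = '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">';
my $after  = replace_same_length($before, 'charset=ISO-8859-1"', 'charset=cp1255"');
print length($before) == length($after) ? "same length\n" : "length changed\n";
```

The padding lands after the closing quote, inside the tag, where extra whitespace is harmless.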

In reply to Make analog work with non-ASCII data by gaal
