In Statistician in my garbage..., monk larsen showed code that randomly combines text and images from your browser's cache, giving you a snapshot of your browsing behavior. That code relied on the filename to determine the file type, so it did not work with my Firefox cache: Firefox stores cache entries under mangled names that no longer reveal the type. I updated the code to use File::MMagic, which determines the file type from the file's content instead.
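
To illustrate why looking at the content helps when the filenames are useless, here is a minimal sketch of content sniffing in plain Perl. It is not how File::MMagic is implemented (that module consults a much larger magic database); the sniff_type helper and its handful of signatures are hypothetical and cover only a few common formats.

```perl
use strict;
use warnings;

# Hypothetical helper: guess a MIME type from the leading bytes of some
# content. Illustration only; File::MMagic knows far more signatures.
sub sniff_type {
    my ($bytes) = @_;

    return 'image/png'  if $bytes =~ /\A\x89PNG\r\n\x1a\n/;
    return 'image/jpeg' if $bytes =~ /\A\xFF\xD8\xFF/;
    return 'image/gif'  if $bytes =~ /\AGIF8[79]a/;
    return 'text/html'  if $bytes =~ /\A\s*(?:<!DOCTYPE\s+html|<html\b)/i;

    # Fallback for anything the few rules above do not recognise.
    return 'application/octet-stream';
}

# A cache entry may have a meaningless name, but its content still
# identifies it.
print sniff_type("GIF89a\x01\x00\x01\x00"), "\n";        # image/gif
print sniff_type("<html><body>hi</body></html>"), "\n";  # text/html
```

The real script below leaves this work to File::MMagic's checktype_filename, which reads the file and returns a MIME type string that can be matched against image/ or text/html.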

#!/usr/local/bin/perl -w


use strict;


# Digs in your browser's cache 
# like a statistician in your trashcan...

package Lurker;

use File::Find;
use File::MMagic;


my $cache = {
    IMAGES => [],
    DOCS => [],
};

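sub lurk
{
    my $dir = shift;
    my $mm = File::MMagic->new;
    print STDERR "Reading cache...";

    find(
        {
            # Without chdir, $File::Find::name stays valid even when
            # $dir is a relative path.
            no_chdir => 1,
            wanted   => sub
            {
                my $file = $File::Find::name;

                # Only plain files have content worth sniffing.
                return unless -f $file;

                # Classify by content; cache filenames say nothing useful.
                my $res = $mm->checktype_filename( $file );
                push @{ $cache->{ IMAGES } }, $file if $res =~ m{image/};
                push @{ $cache->{ DOCS } },   $file if $res =~ m{text/html};
            },
        }, $dir
    );

    print STDERR "OK!\n";
}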

sub pick_random
{
    my $what = shift;

    my $items = $cache->{ $what } || [];

    # Nothing of this type was found in the cache.
    return unless @$items;

    return $items->[ rand @$items ];
}




package My_HTML_Parser;

use base 'HTML::Parser';

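sub start
{
    my $self = shift;
    my ($tag, $attr, $attrseq, $origtext) = @_;

    if ($tag eq 'img' && defined $attr->{'src'}) {
        my $orig_src = $attr->{'src'};
        my $new_src  = Lurker::pick_random( 'IMAGES' );

        # \Q...\E keeps URL characters such as '?' and '+' from being
        # treated as regex metacharacters.
        $origtext =~ s/\Q$orig_src\E/$new_src/ if defined $new_src;
    }
    print $origtext;
}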

sub text
{
    my $self = shift;
    my ($text) = @_;
    
    print $text;
}

sub end
{
    my $self = shift;
    my ($tag) = @_;
    
    print "</$tag>";
}



package main;

my $cache_directory = '/home/rshekhar/.mozilla/firefox/jg2e8cd7.default/Cache';

Lurker::lurk( $cache_directory );

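my $doc = Lurker::pick_random('DOCS')
    or die "No HTML documents found in $cache_directory\n";

print STDERR "Now parsing $doc...\n";

# $a is reserved for sort blocks, so use a descriptive name instead.
my $parser = My_HTML_Parser->new;
$parser->parse_file( $doc );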

Tip: To find your Firefox cache location, type about:cache in the location bar and look at the Cache Directory entry.
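
Since the cache path differs from profile to profile, one option is to pass it on the command line instead of editing the script. The following is only a sketch: cache_dir_from_args is a hypothetical helper that is not part of the script above, and the fallback argument is an assumption about how you might want defaults to work.

```perl
use strict;
use warnings;

# Hypothetical helper: take the cache directory from an argument list,
# falling back to a default, and refuse anything that is not a directory.
sub cache_dir_from_args {
    my ($args, $default) = @_;

    my $dir = @$args ? $args->[0] : $default;

    die "Usage: $0 <cache-directory>\n" unless defined $dir;
    die "Not a directory: $dir\n"       unless -d $dir;

    return $dir;
}
```

In the main package you could then write my $cache_directory = cache_dir_from_args( \@ARGV, undef ); and run the script with the path shown by about:cache as its only argument.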

In reply to [Updated] Statistician in my garbage... by lunatech
