larsen has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to write a better version of my small search engine. Here is my first attempt to exploit hashes of hashes. I submit my code to your attention. Thank you in advance for your observations.

#/usr/bin/perl -w
use strict;

my %words_index;
my %file_index;

opendir( DIR, "./");
@files = grep { -f and /\.html$/} readdir DIR;
closedir DIR;

undef $/;

foreach my $file (@files) {
    my $default_description = "No description";
    my $whole_file;

    open( FILE, $file) || die "Can't open file $file - $!\n";
    $whole_file = <FILE>;

    $whole_file =~ /<html>(.*)<\/html>/i;
    $file_index{ $file }{TITLE} = $1;
    $file_index{ $file }{DESCRIPTION} = $default_description;

    $whole_file =~ s/<[^>]*>//g;
    @words = split / /, $whole_file;
    foreach my $word (@words) {
        $words_index{ $word }{ $file } += 1;
    }

    print "$file - $file_index{ $file }{TITLE} - $file_index{ $file }{DESCRIPTION}\n";
    close FILE;
}


Now I'm going to explore Storable Data's mysteries :)

See you
Larsen

Replies are listed 'Best First'.
Re: HTML pages indexer
by ferrency (Deacon) on Aug 07, 2000 at 18:40 UTC
    You said you're looking into the Storable module. Is this to make your Hash of Hashes persistent?

    If so, you might want to check out MLDBM. Using MLDBM, you can tie your hash (of hashes) to a file on disk, and it will Automagically become persistent. One caveat is, with most underlying databases that MLDBM uses, there is an upper limit to the size of each (top-level) hash key; so if your sub-hashes are Very Large, you might run into some mysterious failures.
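
    For comparison, here is a minimal sketch of the plain Storable route mentioned above, using only nstore and retrieve from the core Storable module. The sample data and the file name words_index.sto are made up for illustration; the real index would come from the indexer loop.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);

# Hypothetical sample data shaped like the indexer's %words_index:
# word => { file => count }
my %words_index = (
    perl  => { 'a.html' => 3, 'b.html' => 1 },
    monks => { 'a.html' => 1 },
);

# Hypothetical file name for the saved index.
my $index_path = 'words_index.sto';

# nstore writes the whole structure in network byte order,
# so the file can be read back on a different architecture.
nstore(\%words_index, $index_path);

# retrieve returns a reference to a fresh copy of the structure.
my $restored = retrieve($index_path);

print "perl appears $restored->{perl}{'a.html'} times in a.html\n";
```

    Unlike a tied MLDBM hash, this reads and writes the whole index in one go, so it is probably only reasonable while the index fits comfortably in memory.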

    As for your code above... isn't it pretty slow? It looks like you're reading in a whole file at a time into a scalar, and then running a bunch of regexes on it. You may find it more time- and space-efficient to use the HTML::Parser module to do things like take out the tags, give you the contents of particular tags, and so on. The Perl Journal has an article on HTML::Parser in a recent issue. HTML::Parser will also allow you more flexibility in the future when your requirements change.

    Update: Oh yeah, one more thing... if you want to extend your indexer to recursively analyze subdirectories and their file contents as well, you should check out the File::Find module instead of limiting yourself to opendir/readdir.
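
    A rough sketch of what the File::Find version of the file list could look like. The helper name find_html_files is made up for this example; File::Find itself is a core module.

```perl
use strict;
use warnings;
use File::Find;

# Return every regular file ending in .html under $top, at any depth.
# (find_html_files is a hypothetical helper name.)
sub find_html_files {
    my ($top) = @_;
    my @files;
    find(
        sub {
            # Inside the callback, $_ is the bare file name (find has
            # chdir'ed into its directory) and $File::Find::name is the
            # path including $top.
            push @files, $File::Find::name if -f $_ && /\.html$/;
        },
        $top,
    );
    my @sorted = sort @files;
    return @sorted;
}

print "$_\n" for find_html_files('.');
```

    The returned paths include the starting directory, so they can be passed straight to open() and used as keys in %file_index.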

    Alan

      Thank you very much.
      Larsen
Re: HTML pages indexer
by turnstep (Parson) on Aug 07, 2000 at 23:00 UTC

    Here is what I came up with. Notes follow below it.

    #!/usr/bin/perl -w
    use strict;

    my (%file_index, %words_index);

    opendir( DIR, ".") or die "Could not read from current directory - $!\n";
    my @files = grep { -f and /\.html$/} readdir DIR;
    closedir DIR;

    local $/;
    my $default_description = "No description";
    my $default_title = "No title";

    for my $file (@files) {
      open(FILE, "$file") or die "Can't open file $file - $!\n";
      my $whole_file = <FILE>; close(FILE);

      my $title;
      if ($whole_file =~ /<TITLE>\s*(.*)\s*<\/TITLE>/is) {
        ($title=$1) =~ s/\s+/ /g;
        $title =~ s/ *$//;
        $title =~ s/  +/ /g;
      }
      $file_index{$file}{'TITLE'} = defined $title ? $title : $default_title;

      ## Similar stuff (using META tags perhaps?) goes here
      $file_index{$file}{'DESCRIPTION'} = $default_description;

      for (split(/\W+/, join(" ", split(/<[^>]*>/, $whole_file)))) {
        $words_index{$_}{$file}++;
      }

      print "$file - ($file_index{$file}{'TITLE'}) - " .
            "$file_index{$file}{'DESCRIPTION'}\n";
    }

    Code with notes:

    #!/usr/bin/perl -w

    use strict;
    my (%file_index, %words_index);

    ## Always check the result of opendir:
    opendir( DIR, ".") or die "Could not read from current directory - $!\n";
    my @files = grep { -f and /\.html$/} readdir DIR;
    closedir DIR;

    ## It is better to localize global variables than to just undefine
    ## them. At the very least, store the value before undefining it
    ## so you can restore it later. (which is basically what local does anyway)

    local $/;

    my $default_description = "No description";

    ## Might as well add a default title to go with the other default:

    my $default_title = "No title";

    for my $file (@files) {

    open(FILE, "$file") or die "Can't open file $file - $!\n";

    ## Might as well close the file as soon as we are done with it:

    my $whole_file = <FILE>; close(FILE);

    my $title;

    ## Need the if statement because $1 might hang around
    ## from a previous match and mess us up:

    if ($whole_file =~ /<TITLE>\s*(.*)\s*<\/TITLE>/is) {

    ## This mess removes newlines and extra spaces from the title.
    ## First collapse each run of whitespace (e.g. tabs and newlines) into
    ## a single space, then remove trailing spaces, then compress any
    ## remaining runs of spaces.
    ## (This is also a good argument to consider using an already written HTML parser from CPAN)

      ($title=$1) =~ s/\s+/ /g;
      $title =~ s/ *$//;
      $title =~ s/  +/ /g;
    }


    ## "$foo = $bar || $baz;" looks cooler, but doesn't account
    ## for people who title their page "0" - hence the ternary test :)

    $file_index{$file}{'TITLE'} = defined $title ? $title : $default_title;

    ## Similar stuff (using META tags perhaps?) goes here
    $file_index{$file}{'DESCRIPTION'} = $default_description;

    ## This just splits out the HTML tags, then splits the remaining text
    ## on runs of non-word characters (\W+). Note the join is using a
    ## space, not an empty string, so words on either side of a tag stay apart.
    ## Storing them into temporary arrays would look neater, but be more wasteful

    for (split(/\W+/, join(" ", split(/<[^>]*>/, $whole_file)))) {
      $words_index{$_}{$file}++;
    }


    print "$file - ($file_index{$file}{'TITLE'}) - " .
          "$file_index{$file}{'DESCRIPTION'}\n";

    }
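
    To make the note about "||" versus the defined test concrete, here is a small sketch. The two helper subs are made up for the comparison; only $default_title comes from the code above.

```perl
use strict;
use warnings;

my $default_title = "No title";

# Hypothetical helper: falls back whenever the title is false,
# which includes "0" and the empty string.
sub title_with_or {
    my ($title) = @_;
    return $title || $default_title;
}

# Hypothetical helper: falls back only when no title was found at all.
sub title_with_defined {
    my ($title) = @_;
    return defined $title ? $title : $default_title;
}

for my $title ("0", "My page", undef) {
    my $shown = defined $title ? "'$title'" : 'undef';
    print "$shown: || gives '", title_with_or($title),
          "', defined gives '", title_with_defined($title), "'\n";
}
```

    On Perl 5.10 and later, the defined-or operator ("$title // $default_title") gives the same result as the ternary in a shorter form.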

      Thank you very much.
      As soon as possible, I will vote for your post :)
      I naively burned through my votes this morning :)
      Larsen
RE: HTML pages indexer
by toadi (Chaplain) on Aug 07, 2000 at 17:19 UTC
    Hello,
    What's the problem? Is there something we need to do with this code?
    It might be better to drop this in the Snippets or Code section.

    --
    My opinions may have changed,
    but not the fact that I am right

      I was wondering if someone could read the code, looking for possible pathologies. I'll post my indexer in Snippets after I've completed it.
      Thanks
      Larsen
RE: HTML pages indexer
by DrManhattan (Chaplain) on Aug 07, 2000 at 18:31 UTC

    Here are some small things:

    # I'm guessing you meant "title" and not "html" here :)
    # Also, you'll want a /s after your regex to catch titles
    # that span multiple lines. Another approach is to
    # strip all the \n's and/or \r's out of $whole_file
    # before parsing it.
    $whole_file =~ /<html>(.*)<\/html>/i;
    $file_index{ $file }{TITLE} = $1;

    # Ditto with the /s
    $whole_file =~ s/<[^>]*>//g;

    # This will work better as "split /\s+/, $whole_file"
    # since it will catch more than just single spaces
    @words = split / /, $whole_file;
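
    The difference between the two splits can be seen in a short sketch. The sample string is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical sample text: two spaces, then a tab, then a newline.
my $text = "search  engine\tin\nPerl";

# Splits on each single space only: the double space produces an
# empty field, and the tab and newline stay inside the last field.
my @by_space = split / /, $text;

# Splits on any run of whitespace.
my @by_ws = split /\s+/, $text;

print scalar(@by_space), " fields with / /\n";
print scalar(@by_ws), " fields with /\\s+/\n";
```

    One thing to keep in mind: if the text starts with whitespace, split /\s+/ returns an empty first field, while the special form split ' ' skips leading whitespace.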

    -Matt