HTML pages indexer

larsen has asked for the wisdom of the Perl Monks concerning the following question:

Re: HTML pages indexer by ferrency (Deacon) on Aug 07, 2000 at 18:40 UTC
You said you're looking into the Storable module. Is this to make your hash of hashes persistent? If so, you might want to check out MLDBM. Using MLDBM, you can tie your hash (of hashes) to a file on disk, and it will automagically become persistent. One caveat: with most of the underlying databases that MLDBM uses, there is an upper limit to the size of the record stored under each (top-level) hash key, so if your sub-hashes are very large, you might run into some mysterious failures.
As for your code above... isn't it pretty slow? It looks like you're reading a whole file at a time into a scalar, and then running a bunch of regexes on it. You may find it more time- and space-efficient to use the HTML::Parser module to do things like take out the tags, give you the contents of particular tags, and so on. The Perl Journal has an article on HTML::Parser in a recent issue. HTML::Parser will also give you more flexibility in the future when your requirements change.
Update: Oh yeah, one more thing... if you want to extend your indexer to recursively analyze subdirectories and their file contents as well, you should check out the File::Find module instead of limiting yourself to opendir/readdir.
Alan
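To make the File::Find suggestion concrete, here is a minimal sketch. The helper name find_html_files and the throwaway directory tree are made up for illustration; only find() and $File::Find::name come from the module itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);
use File::Path qw(make_path);

# Hypothetical helper: return every *.html file under $top, recursively.
sub find_html_files {
    my ($top) = @_;
    my @found;
    find(
        sub {
            # Inside the callback, $_ is the base name of the current entry
            # and $File::Find::name is its path including $top.
            push @found, $File::Find::name if -f $_ && /\.html$/;
        },
        $top
    );
    my @sorted = sort @found;
    return @sorted;
}

# Demo on a temporary directory tree (removed again at exit).
my $top = tempdir( CLEANUP => 1 );
make_path("$top/sub/deeper");
for my $path ( "$top/a.html", "$top/sub/b.html",
               "$top/sub/deeper/c.html", "$top/sub/notes.txt" ) {
    open( my $fh, '>', $path ) or die "Can't create $path - $!\n";
    close($fh);
}

my @html = find_html_files($top);
print "$_\n" for @html;
```

The list this returns could be used in place of the result of a flat opendir/readdir, with the rest of an indexer left as it is.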
RE: Re: HTML pages indexer by larsen (Parson) on Aug 07, 2000 at 19:06 UTC
Thank you very much. Larsen
Re: HTML pages indexer by turnstep (Parson) on Aug 07, 2000 at 23:00 UTC
Here is what I came up with. Notes follow below it.

    #!/usr/bin/perl -w
    use strict;

    my (%file_index, %words_index);

    opendir( DIR, ".") or die "Could not read from current directory - $!\n";
    my @files = grep { -f and /\.html$/ } readdir DIR;
    closedir DIR;

    local $/;
    my $default_description = "No description";
    my $default_title = "No title";

    for my $file (@files) {
      open(FILE, "$file") or die "Can't open file $file - $!\n";
      my $whole_file = <FILE>;
      close(FILE);
      my $title;
      if ($whole_file =~ /<TITLE>\s*(.*)\s*<\/TITLE>/is) {
        ($title=$1) =~ s/\s+/ /g;
        $title =~ s/ $//;
        $title =~ s/ +/ /g;
      }
      $file_index{$file}{'TITLE'} = defined $title ? $title : $default_title;

      ## Similar stuff (using META tags perhaps?) goes here
      $file_index{$file}{'DESCRIPTION'} = $default_description;

      for (split(/\W+/, join(" ", split(/<[^>]*>/, $whole_file)))) {
        $words_index{$_}{$file}++;
      }
      print "$file - ($file_index{$file}{'TITLE'}) - " .
            "$file_index{$file}{'DESCRIPTION'}\n";
    }

Code with notes:

    #!/usr/bin/perl -w
    use strict;

    my (%file_index, %words_index);

    ## Always check the result of opendir:
    opendir( DIR, ".") or die "Could not read from current directory - $!\n";
    my @files = grep { -f and /\.html$/ } readdir DIR;
    closedir DIR;

    ## It is better to localize global variables than to just undefine
    ## them. At the very least, store the value before undefining it
    ## so you can restore it later
    ## (which is basically what local does anyway)
    local $/;
    my $default_description = "No description";

    ## Might as well add a default title to go with the other default:
    my $default_title = "No title";

    for my $file (@files) {
      open(FILE, "$file") or die "Can't open file $file - $!\n";

      ## Might as well close the file as soon as we are done with it:
      my $whole_file = <FILE>;
      close(FILE);
      my $title;

      ## Need the if statement because $1 might hang around
      ## from a previous match and mess us up:
      if ($whole_file =~ /<TITLE>\s*(.*)\s*<\/TITLE>/is) {

        ## This mess removes newlines and extra spaces from the title.
        ## First change whitespace (e.g. tabs and newlines) to spaces,
        ## then remove trailing spaces, then compress all whitespace.
        ## (This is also a good argument to consider using an
        ## already written HTML parser from CPAN)
        ($title=$1) =~ s/\s+/ /g;
        $title =~ s/ $//;
        $title =~ s/ +/ /g;
      }

      ## "$foo = $bar || $baz;" looks cooler, but doesn't account
      ## for people who title their page "0" - hence the ternary test :)
      $file_index{$file}{'TITLE'} = defined $title ? $title : $default_title;

      ## Similar stuff (using META tags perhaps?) goes here
      $file_index{$file}{'DESCRIPTION'} = $default_description;

      ## This just splits out the HTML, then splits the resulting text on
      ## non-word characters. Note the join is using a space, not an
      ## empty string. Storing them into temporary arrays would look
      ## neater, but be more wasteful
      for (split(/\W+/, join(" ", split(/<[^>]*>/, $whole_file)))) {
        $words_index{$_}{$file}++;
      }
      print "$file - ($file_index{$file}{'TITLE'}) - " .
            "$file_index{$file}{'DESCRIPTION'}\n";
    }
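To see the point about pages titled "0" in isolation, here is a small sketch of the title handling pulled out into a sub. The name extract_title is hypothetical, and it uses a non-greedy (.*?) so the separate trailing-space cleanup is not needed; that is a variation for the demo, not the code from the post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pull the title out of a slurped HTML string,
# falling back to $default when there is no <TITLE> element.
sub extract_title {
    my ( $html, $default ) = @_;
    my $title;
    if ( $html =~ /<TITLE>\s*(.*?)\s*<\/TITLE>/is ) {
        # Collapse newlines, tabs and runs of spaces into single spaces.
        ( $title = $1 ) =~ s/\s+/ /g;
    }
    # "defined" rather than "||", so a page titled "0" keeps its title.
    return defined $title ? $title : $default;
}

my $default = "No title";
print extract_title( "<html><title>\n  My\n  Page </title></html>", $default ), "\n";
print extract_title( "<title>0</title>",                            $default ), "\n";
print extract_title( "<p>no title element</p>",                     $default ), "\n";

# For contrast, the "||" form throws the "0" title away:
my $zero = "0";
print( ( $zero || $default ), "\n" );
```

This prints "My Page", "0", "No title", and then "No title" again for the || version, which is exactly the case the ternary guards against.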
RE: Re: HTML pages indexer by larsen (Parson) on Aug 08, 2000 at 01:51 UTC
Thank you very much. As soon as possible, I will vote for your post :) I naively burned through my votes this morning :) Larsen
RE: HTML pages indexer by toadi (Chaplain) on Aug 07, 2000 at 17:19 UTC
Hello, what's the problem? Is there something we need to do with this code? It might be better to post this in the Snippets or Code section. -- My opinions may have changed, but not the fact that I am right
RE: RE: HTML pages indexer by larsen (Parson) on Aug 07, 2000 at 17:38 UTC
I was wondering if someone could read the code, looking for possible pathologies. I'll post my indexer in Snippets after I've completed it. Thanks, Larsen
RE: HTML pages indexer by DrManhattan (Chaplain) on Aug 07, 2000 at 18:31 UTC
Here are some small things:

    # I'm guessing you meant "title" and not "html" here :)
    # Also, you'll want a /s after your regex to catch titles
    # that span multiple lines. Another approach is to
    # strip all the \n's and/or \r's out of $whole_file
    # before parsing it.
    $whole_file =~ /<html>(.*)<\/html>/i;
    $file_index{ $file }{TITLE} = $1;

    # Ditto with the /s
    $whole_file =~ s/<[^>]*>//g;

    # This will work better as "split /\s+/, $whole_file"
    # since it will catch more than just single spaces
    @words = split / /, $whole_file;

-Matt
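A quick way to see the difference between the split variants is to run them side by side on a string with mixed whitespace. This is only a sketch and the sample text is made up. The third form, split ' ', is not mentioned above, but it is standard Perl and also drops the empty leading field that /\s+/ produces when the text starts with whitespace.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample: leading spaces, a tab, a double space, a newline.
my $text = "  indexing\tHTML  pages\nwith Perl ";

my @single  = split / /,   $text;    # splits on each single space only
my @pattern = split /\s+/, $text;    # splits on any run of whitespace
my @awk     = split ' ',   $text;    # like /\s+/, but skips leading whitespace

# Show the fields of each variant, delimited by [ ] so empties are visible.
for my $case ( [ 'split / /' => \@single ],
               [ 'split /\s+/' => \@pattern ],
               [ "split ' '" => \@awk ] ) {
    my ( $label, $fields ) = @$case;
    printf "%-12s %d fields: %s\n", $label, scalar @$fields,
        join( '', map { "[$_]" } @$fields );
}
```

With split / /, the tab and the newline stay glued inside the "words" and the double space yields an empty field, so an index built from it would contain junk keys.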