divinus has asked for the wisdom of the Perl Monks concerning the following question:

I am more of a Perl user than a Perl writer, and I have a problem that I cannot figure out. My Perl experience is very limited, so the solution could be very obvious. I am using a search engine written in Perl to search my site. I really like how it works, but unfortunately it searches the raw HTML document and gives me results based on things such as links and even JavaScript. I would consult the original writer of the code, but I downloaded it from a site a couple of days ago and forgot where I got it. Here is the code, important and unimportant. I'm praying I use the code tags right.
#!/usr/bin/perl
#The following code deals with the form data
if ($ENV{'REQUEST_METHOD'} eq 'POST') {
    read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
    @pairs = split(/&/, $buffer);
    foreach $pair (@pairs) {
        ($name, $value) = split(/=/, $pair);
        $value =~ tr/+/ /;
        $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
        $FORM{$name} = $value;
    }
}
$keyword = $FORM{keyword};
print "Content-type: text/html\n\n";
print "<h2> Here are the files we found</h2>\n\n";
chdir("/usr/local/etc/httpd");
opendir(DIR, ".");
while ($file = readdir(DIR)) {
    next if ($file !~ /.htm/);
    open(FILE, $file);
    $foundone = 0;
    $title = "";
    while (<FILE>) {
        if (/$keyword/i) {
            $foundone = 1;
        }
        if (/<TITLE>/) {
            chop;
            $title = $_;
            $title =~ s/<TITLE>//g;
            $title =~ s/<\TITLE>//g;
        }
    }
    if ($title eq "") {
        $title = $file;
    }
    if ($foundone) {
        print "<A HREF=/$file>$title</A><BR>";
        $listed = 1;
    }
    close(FILE);
After that it just prints out some default stuff and exits. But like I said, it's giving HTML and JavaScript results, and really the only way I can think of changing that is to ignore everything in between the head tags (for the JavaScript) and the <a> tags (for the links, which are my main problem). I hope I have made some sense here. Like I said, I'm more of a reader than a writer. I understand almost every line of this code, I just don't know how to correctly alter it. On a side note, the program also never prints out the title and instead prints the name of the file every time, but that's not my biggest concern right now. If you have any advice on either of them I would appreciate it. Thanks. Divinus

Replies are listed 'Best First'.
Re: Search Engine troubles
by Cine (Friar) on Aug 21, 2001 at 22:45 UTC
    It probably never finds a title because it's looking for the uppercase word TITLE. Add an 'i' to the regex where searing for it.
    To solve your other problem, look into HTML::Parser, especially its htext example.


    Update:

    Ohh well. I might as well post the code

    Update2:

    Updated the code to actually work
    #!/usr/bin/perl -w
    use HTML::Parser;

    #The following code deals with the form data
    if ($ENV{'REQUEST_METHOD'} eq 'POST') {
        read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
        @pairs = split(/&/, $buffer);
        foreach $pair (@pairs) {
            ($name, $value) = split(/=/, $pair);
            $value =~ tr/+/ /;
            $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
            $FORM{$name} = $value;
        }
    }
    #######instead of the above code use the CGI module.

    #HTML::Parser stuff. Stolen from the HTML::Parser package's htext
    #example
    my $keyword;
    my $hit;
    my $title;
    my %inside;

    # Track how deeply we are nested inside each tag
    sub htmltag {
        my($tag, $num) = @_;
        $inside{$tag} += $num;
    }

    # Called for every run of text; skips script and style content
    sub htmltext {
        return if $inside{script} || $inside{style};
        $title = $_[0] if ($inside{title});
        $hit = 1 if ($_[0] =~ /$keyword/);
    }

    my $parser = HTML::Parser->new(
        'api_version' => 3,
        'handlers' => [
            start => [\&htmltag, "tagname, '+1'"],
            end   => [\&htmltag, "tagname, '-1'"],
            text  => [\&htmltext, "dtext"],
        ],
        'marked_sections' => 1);

    $keyword = $FORM{keyword};
    print "Content-type: text/html\n\n";
    print "<h2> Here are the files we found</h2>\n\n";
    chdir("/usr/local/etc/httpd");
    opendir(DIR, ".");
    while ($file = readdir(DIR)) {
        next if ($file !~ /.htm/);
        $hit = 0;
        $title = "";
        %inside = ();
        open(FILE, $file) || die "couldnt open $!";
        # parse_file accepts an open filehandle and signals EOF itself;
        # parse() expects a string or a code ref, not a handle
        $parser->parse_file(\*FILE);
        close(FILE);
        if ($hit) {
            $title = $file unless ($title);
            print "<A HREF=/$file>$title</A><BR>";
            $listed = 1;
        }
    }


    T I M T O W T D I
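
For anyone unsure how the `%inside` counters in the code above keep script text out of the results, here is a minimal sketch that drives the same bookkeeping by hand, without HTML::Parser. The `scan_events` function and the event list format are hypothetical and exist only for this illustration; the `\Q...\E` quoting and the `/i` flag are also additions that the posted code does not have.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same idea as htmltag/htmltext above, wrapped in one function so it can be
# fed a hand-written list of events. scan_events() is a hypothetical helper.
sub scan_events {
    my ($keyword, @events) = @_;
    my %inside;          # tag name => current nesting depth
    my $title = '';
    my $hit   = 0;
    for my $event (@events) {
        my ($type, $value) = @$event;
        if ($type eq 'start') {
            $inside{$value}++;
        }
        elsif ($type eq 'end') {
            $inside{$value}--;
        }
        else {
            # Text event: ignore anything inside <script> or <style>
            next if $inside{script} || $inside{style};
            $title = $value if $inside{title};
            # Assumption: treat the keyword literally and ignore case
            $hit = 1 if $value =~ /\Q$keyword\E/i;
        }
    }
    return ($hit, $title);
}

# A tiny page: a title, a script block, and one paragraph
my @page = (
    [start => 'title'],  [text => 'My Page'],             [end => 'title'],
    [start => 'script'], [text => 'var menu = 1;'],       [end => 'script'],
    [start => 'p'],      [text => 'Welcome to the site'], [end => 'p'],
);

my ($hit, $title) = scan_events('welcome', @page);
print "welcome: hit=$hit title=$title\n";

my ($script_hit) = scan_events('menu', @page);
print "menu: hit=$script_hit\n";
```

The word "menu" only occurs inside the script block, so it is never reported as a hit, while "welcome" in the paragraph is.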
      Thanks, I will get to work with the code you gave me and see if I can get it to work. I tried the advice about the title tag, though, and the search results gave me completely blank lines instead of the file names. Then again, I don't know what a regex is, nor do I know what searing is. haha. But this is what I tried.
      if (/<TITLE>/i)
      Is that what you had in mind? Thanks for responding. Divinus
        searing is just me not able to spell ;) Of course it should have said searching, but you understood me fine, I can see ;)

        T I M T O W T D I
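
A likely reason for the blank lines: with only the match made case-insensitive, the two substitutions in the original script still look for uppercase `<TITLE>`, so a lowercase `<title>...</title>` stays inside `$title` and the browser hides it when it is printed in the link. The second substitution, `s/<\TITLE>//g`, also looks like it was meant to remove `</TITLE>`. The sketch below is one way to pull the title text out in a single step; `extract_title` is a hypothetical helper, and it assumes, as the original script does, that the whole title element sits on one line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# extract_title() is a hypothetical helper, not part of the original script.
# It returns the text between <title> and </title> in any letter case,
# or an empty string when the line holds no complete title element.
sub extract_title {
    my ($line) = @_;
    if ($line =~ m{<title[^>]*>\s*(.*?)\s*</title\s*>}is) {
        return $1;
    }
    return '';
}

for my $line ('<TITLE>Home</TITLE>',
              '<title> Contact us </title>',
              '<p>no title here</p>') {
    my $title = extract_title($line);
    print $title eq '' ? "(no title)\n" : "$title\n";
}
```

In the original loop this would replace the `chop` and the two substitutions: assign the result to `$title` whenever it is not empty.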
Re: Search Engine troubles
by Beatnik (Parson) on Aug 21, 2001 at 22:46 UTC
    If I were you I'd just contact the author (if you can find him) and forcefeed him a hardcopy of the CGI POD for breakfast. After that you might need an HTML tag stripper, which can be done with HTML::Parser, but a rough (and dirty) way of doing that is specified in perlfaq9...
    $string =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

    Greetz
    Beatnik
    ... Quidquid perl dictum sit, altum viditur.
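
As a rough sketch of how the perlfaq9 pattern above could be put to use: the substitution removes the tags themselves but leaves the text between `<script>` and `</script>` in place, so the example first drops whole script and style elements and only then strips the remaining tags. The `visible_text` helper and the sample page are hypothetical, and regex-based stripping like this is only an approximation that malformed HTML can defeat.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# visible_text() is a hypothetical helper name used for this sketch.
sub visible_text {
    my ($html) = @_;
    # Drop whole <script> and <style> elements, contents included
    $html =~ s{<(script|style)\b.*?</\1\s*>}{}gis;
    # Drop the remaining tags with the perlfaq9-style pattern
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

# Made-up page: "search" appears only in the script and in the link target
my $page = '<script>var search = 1;</script><a href="search.html">Find</a> it';
my $text = visible_text($page);

print "text: $text\n";
print "keyword found\n"     if $text =~ /search/i;
print "keyword not found\n" if $text !~ /search/i;
```

Matching the keyword against the stripped text rather than the raw line means link targets and JavaScript no longer produce hits.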
Re: Search Engine troubles
by perrin (Chancellor) on Aug 22, 2001 at 01:36 UTC
    This "search engine" is really just a grep. There are many free search engines which are faster, scale better for large numbers of documents, and solve the problem of ignoring HTML tags. Some common ones include Swish, Glimpse, and ht://Dig. You can read more about them at searchtools.com. There are some pure Perl solutions listed there. You could also look at Search::InvertedIndex.
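
To illustrate why an indexed engine scales better than re-reading every file on each query, here is a minimal sketch of the inverted-index idea in plain Perl. The `build_index` and `lookup` functions and the sample documents are hypothetical; this is not the Search::InvertedIndex API, only the underlying data structure.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# build_index() and lookup() are hypothetical names for this sketch.
# The index maps each lowercased word to the set of files containing it.
sub build_index {
    my (%docs) = @_;               # file name => plain text
    my %index;                     # word => { file name => 1 }
    while (my ($file, $text) = each %docs) {
        $index{lc $_}{$file} = 1 for $text =~ /(\w+)/g;
    }
    return \%index;
}

# Answer a query with one hash lookup instead of scanning every document
sub lookup {
    my ($index, $word) = @_;
    return sort keys %{ $index->{lc $word} || {} };
}

# Made-up documents; in practice the text would come from the stripped pages
my $index = build_index(
    'index.html' => 'Welcome to the Perl pages',
    'faq.html'   => 'Perl questions and answers',
    'about.html' => 'About this site',
);

print join(', ', lookup($index, 'perl')), "\n";
```

The index is built once, when the pages change, and each search afterwards costs a single hash lookup regardless of how many documents there are.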