S_Shrum has asked for the wisdom of the Perl Monks concerning the following question:

I have written a fairly sweet little site robot script that I wanted to use to build a site content index. This way I could do searches on my own file and set it up any way I wanted without having to pay a search engine company to do it for me and without having to put up stupid little banners. Additionally, I have the time to do it ;-)

Anyway, the script works OK but I want to expand upon it. It's been a while since I wrote it and I'm a bit fuzzy with my Perl.

I want to allow the user to specify which tags within the document they want to search and store to the content index db. I'm looking for a way to do 2 things: search and retrieve the text between the open and close tag of any user-specified type (this includes XML tags), and a nice little loop to go through all the user-specified ones.

I figured I would let the user set the fields and their order like this:

...indexer.pl?tag1=title&tag2=author&tag3=body

tag0 is *always* the complete URL to the page.

This way anybody could create their own content index to run searches on. Let me know if somebody else has already made one and I'll stop now; otherwise, let me know if I've got all my marbles in one bag. It seems like a good idea to me, so somebody must have built one before.

TIA

======================
Sean Shrum
http://www.shrum.net

Replies are listed 'Best First'.
Re: Help need with code loop
by Kanji (Parson) on Jul 23, 2002 at 07:50 UTC
    search and retrieve the text between the open and close tag of any user specified type

    HTML::TokeParser or its ::Simple offspring should help here.

    nice little loop to go through all the user specified ones

    It would be a simple foreach (param('tag')) if you dropped the numbered suffixes, but otherwise you could use grep to filter out the unwanted params.

        use CGI qw/ :standard /;
        use Data::Dumper;
        use HTML::TokeParser;

        my $file = param('file');

        # user defined tags.
        my @tags = grep /\Atag\d+\z/, param;

        my %text;

        foreach my $tag_name ( map param($_), @tags ) {

            # Is there a better way to rewind the parser?
            my $html = HTML::TokeParser->new($file)
                or die "Cannot parse $file";

            print "$tag_name\n";

            while ( my $tag = $html->get_tag($tag_name) ) {
                $text{$tag_name} = $html->get_trimmed_text("/$tag_name");
            }
        }

        print Data::Dumper->Dump( [ \%text ], [qw( text )] );
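
    One thing the grep above does not settle is ordering: as far as I know, param hands back names in the order they appear in the query string, so tag10 could come before tag2 if the URL was built that way. Here is a minimal sketch of sorting by the numeric suffix. The %params hash and the ordered_tags name are made up for illustration; in the CGI script the values would come from param.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the request parameters; in the CGI script
# these would come from param().
my %params = (
    tag10 => 'keywords',
    tag2  => 'author',
    tag1  => 'title',
    file  => 'index.html',
);

# Keep only the tagN names, then sort by N numerically so that tag2
# comes before tag10 (a plain string sort would put tag10 first).
sub ordered_tags {
    my (%p) = @_;
    my @names = sort { ( $a =~ /(\d+)/ )[0] <=> ( $b =~ /(\d+)/ )[0] }
                grep { /\Atag\d+\z/ } keys %p;
    return map { $p{$_} } @names;
}

my @tags = ordered_tags(%params);
print join( ',', @tags ), "\n";
```

    Whether the numeric suffix or the query-string order should win is a design choice; sorting just makes the result independent of how the URL was assembled.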

    Something to be wary of, however: if there isn't a closing tag, get_trimmed_text will return everything up to the end of the document, which could suck royally if all you want to search are <meta> tags.
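
    For tags like <meta> that never have a closing tag, one way around this is to skip get_trimmed_text and read the attributes that get_tag returns for the start tag. A rough sketch, assuming the interesting data lives in name/content attribute pairs; the sample document and the meta_content helper are made up for illustration:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample document (made up for illustration).
my $doc = <<'HTML';
<html><head>
<title>Demo page</title>
<meta name="author" content="Sean Shrum">
<meta name="keywords" content="perl, index">
</head><body><p>Hello</p></body></html>
HTML

# Collect name => content pairs from <meta> tags without calling
# get_trimmed_text, since <meta> has no closing tag to stop at.
sub meta_content {
    my ($html) = @_;
    my $p = HTML::TokeParser->new( \$html ) or die "Cannot parse document";
    my %meta;
    while ( my $tag = $p->get_tag('meta') ) {
        my $attr = $tag->[1];    # hash ref of the start tag's attributes
        next unless defined $attr->{name} && defined $attr->{content};
        $meta{ lc $attr->{name} } = $attr->{content};
    }
    return %meta;
}

my %meta = meta_content($doc);
print "$_: $meta{$_}\n" for sort keys %meta;
```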

        --k.


Re: Help need with code loop
by S_Shrum (Pilgrim) on Jul 23, 2002 at 06:28 UTC

    Just a bit more

    The entire contents of a page will be in a variable called $html. I need a loop that goes through every tag# parameter and parses $html for the text between that tag's opening and closing tags:

    --> tag1=body

    ...would look for everything after <body> and before </body>. Remember that there can and will be multiple tag entries:

    --> indexer.pl?tag1=body&tag2=title&tag3=....

    TIA

    ======================
    Sean Shrum
    http://www.shrum.net
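
    Since the page is already in $html here, note that HTML::TokeParser->new also accepts a reference to a scalar holding the document, so no file is needed. The following is only a sketch of the loop being asked for, under a few assumptions: extract_tags is a made-up helper name, every occurrence of a tag is kept (not just one), and tag names are lowercased because the parser reports HTML tag names in lowercase.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Return a hash of tag name => array ref holding the text found inside
# each occurrence of that tag. $html is the whole page as a string.
sub extract_tags {
    my ( $html, @tag_names ) = @_;
    my %text;
    for my $name ( map { lc } @tag_names ) {
        # A scalar reference tells the parser to read from the string.
        # A fresh parser per tag restarts the scan from the top.
        my $p = HTML::TokeParser->new( \$html ) or die "Cannot parse document";
        while ( $p->get_tag($name) ) {
            push @{ $text{$name} }, $p->get_trimmed_text("/$name");
        }
    }
    return %text;
}

my $html = '<html><head><title>My Page</title></head>'
         . '<body><p>First <b>bold</b> paragraph.</p></body></html>';

my %found = extract_tags( $html, qw(title body) );
print "$_: @{ $found{$_} }\n" for sort keys %found;
```

    The caveat about missing closing tags still applies: without a matching close tag, get_trimmed_text reads on to the end of the document.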

•Re: Help need with code loop
by merlyn (Sage) on Jul 23, 2002 at 12:53 UTC
    You might want to consider either providing a user interface that is an XPath directly, or else performing a simple transformation from your input string to an XPath, because then you can use XPath-style modules to assist you in the query.
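
    As a rough illustration of the "simple transformation" idea, the sketch below turns each requested tag name into an XPath expression that matches that element anywhere in the document. The tags_to_xpath name and the validation pattern are assumptions made for this example, not part of any module.

```perl
use strict;
use warnings;

# Turn user-supplied tag names into simple XPath expressions.
# Names that do not look like plain element names are rejected, so
# arbitrary input never ends up inside the expression.
sub tags_to_xpath {
    my @tag_names = @_;
    my @xpaths;
    for my $name (@tag_names) {
        die "Invalid tag name: $name\n"
            unless $name =~ /\A[A-Za-z_][\w.-]*\z/;
        push @xpaths, "//$name";
    }
    return @xpaths;
}

print "$_\n" for tags_to_xpath(qw(title author body));
```

    Each resulting string could then be handed to an XPath-capable module, for example the findnodes method in XML::LibXML, assuming the page parses cleanly there.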

    -- Randal L. Schwartz, Perl hacker