URI::SearchTerms - Collect search terms from the search URLs of common search engines

See the POD in the code below for more information.

Any questions, comments, suggestions, et cetera are extremely welcome.

    package URI::SearchTerms;

    use warnings;
    use strict;

    =head1 NAME

    URI::SearchTerms - Collect search terms from the search URLs of common
    search engines

    =head1 SYNOPSIS

        use URI::SearchTerms;
        my $search_url = "http://www.google.com/search?q=foo+bar+baz";
        my @terms = URI::SearchTerms->terms($search_url);
        print join ":", @terms, "\n";

    =head1 DESCRIPTION

    An early version of this was written with the intention of using it to
    parse webserver log files (specifically the referer) to discover what
    search terms users were using to find the site. The idea later
    transformed into this. Besides parsing referers in log files, this
    could be used dynamically in CGI or mod_perl scripts to detect if users
    are coming from search engine results, and if so, what search terms
    they used.

    Currently the module supports Google, Yahoo, MSN, and AOL search URLs.
    If you would like to suggest another search engine to support, please
    email me (C<tsibley@cpan.org>) with either a few example URLs or, less
    preferably, a place to get my own. Patches are even better. : )

    =head1 METHODS

    =head2 URI::SearchTerms::terms($url), URI::SearchTerms->terms($url)

    This takes one argument: the URL to parse. It returns an array of the
    search terms, which will in most cases only contain one element.
    C<terms()> may be called in the class-style or the fully qualified
    style.

    =cut

    # Try to require CGI::Simple first. If that fails, try to load
    # CGI.pm. If all that fails, die with an error. I don't use
    # URI::QueryParam because it doesn't handle all cases as it should.
    my $CGI = 'CGI::Simple';
    eval { require CGI::Simple; };
    eval { require CGI; $CGI = 'CGI'; } if $@;
    if ($@) {
        die "The CGI::Simple or CGI modules must be installed "
          . "for URI::SearchTerms to work!";
    }

    require URI;

    my %pats = (
        google => { pat => qr<google\.>, keys => ['q', 'as_q'] },
        yahoo  => { pat => qr<yahoo\.>,  keys => ['p'] },
        msn    => { pat => qr<msn\.>,    keys => ['q'] },
        aol    => { pat => qr<aol\.>,    keys => ['query'] },
    );

    sub terms {
        my $url = $_[1] ? $_[1] : $_[0];
        my @terms;
        my $uri   = URI->new($url);
        my $host  = $uri->host;
        my $query = $uri->query;
        for (keys %pats) {
            if ($host =~ /$pats{$_}->{pat}/) {
                my $q = $CGI->new($query);
                for (@{ $pats{$_}->{keys} }) {
                    push @terms, $q->param($_);
                }
            }
        }
        return @terms;
    }

    =head1 REQUIREMENTS

    Currently, this module uses L<URI> and L<CGI::Simple> (or, if that
    isn't available, L<CGI>) to parse the query strings from the URLs and
    extract the appropriate params.

    =head1 BUGS

    This module desperately needs more test cases. There are probably a
    bunch of valid URLs for Yahoo or MSN or AOL that don't work (although I
    think I've covered Google pretty well). If you find one, please email
    me the URL at C<tsibley@cpan.org>.

    =head1 LICENSE

    This module is free software, and may be distributed under the same
    terms as Perl itself.

    =head1 AUTHOR

    Copyright (C) 2003, Thomas R. Sibley C<tsibley@cpan.org>

    =cut

    1;
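Since the DESCRIPTION mentions parsing webserver logs, here is a minimal sketch of that use, assuming the module above is installed; the log line itself is made up for illustration:

```perl
# Sketch: pull the referer field out of a (made-up) combined-format
# access log line and hand it to terms(). Assumes URI::SearchTerms
# above is installed.
use URI::SearchTerms;

my $line = '1.2.3.4 - - [12/Oct/2003:00:00:00 +0000] '
         . '"GET / HTTP/1.0" 200 512 '
         . '"http://www.google.com/search?q=foo+bar" "Mozilla"';

# In combined log format the referer is the second-to-last quoted field.
if ($line =~ /"([^"]*)" "[^"]*"$/) {
    my @terms = URI::SearchTerms->terms($1);
    print "terms: @terms\n" if @terms;
}
```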

Updated the POD as per Corion's suggestion and changed the code to use URI.pm for query string extraction.

Re: RFC: URI::SearchTerms
by Corion (Patriarch) on Oct 12, 2003 at 09:33 UTC

    A short thing on the synopsis - drop the alternatives. A good synopsis should have as much ready-to-copy code as possible.

    I would rewrite the synopsis as follows:

        use URI::SearchTerms;
        my $search_url = "http://www.google.com/search?q=foo+bar+baz";
        my @terms = URI::SearchTerms->terms($search_url);
        print join ":", @terms, "\n";

    All the finer things, like calling alternatives should go in the detailed discussion of every function (that is, into the discussion of terms()).

    Whether you use the class-style call or the fully qualified style call is a matter of taste, and I'm not exactly sure which one I prefer - I guess both are relatively valid, as there is not much use to pass more than one search url to the function.
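    The module supports both styles with a one-line dispatch on @_; a stripped-down sketch (the Demo package name is made up for illustration):

    ```perl
    # Stripped-down sketch of the dual calling convention.
    package Demo;

    sub terms {
        # Demo->terms($url) puts the class name in $_[0] and the URL in
        # $_[1]; Demo::terms($url) puts the URL in $_[0]. Checking
        # defined $_[1] (rather than its truth) avoids surprises if the
        # argument happens to be a false value.
        my $url = defined $_[1] ? $_[1] : $_[0];
        return $url;
    }

    package main;
    print Demo->terms("http://example.com/?q=foo"), "\n";
    print Demo::terms("http://example.com/?q=foo"), "\n";
    # Both lines print: http://example.com/?q=foo
    ```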

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider ($c = $d->accept())->get_request(); $c->send_response( new #in the HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
Re: RFC: URI::SearchTerms
by Juerd (Abbot) on Oct 12, 2003 at 14:07 UTC
      Arrrrgh. I wish I had seen that beforehand. Oh well. Any idea why the name Sequin was used?

        Any idea why the name Sequin was used?

        S earch
        E ngine
        Q uery
        U RL
        I nformation
        N oter

        Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

Re: RFC: URI::SearchTerms
by Jaap (Curate) on Oct 12, 2003 at 11:51 UTC
      Yes, at some point.
Re: RFC: URI::SearchTerms
by Zaxo (Archbishop) on Oct 12, 2003 at 04:21 UTC
      Yes, I have. I assume you mean as a namespace since WWW::Search provides a different function. The reason I wasn't sure about it being a good namespace is that it is used for modules that actually do the search. I guess it depends whether you'd classify my module as mainly dealing with URLs or web searches.
Re: RFC: URI::SearchTerms
by Jaap (Curate) on Oct 12, 2003 at 11:35 UTC
    Should I make use of CGI::Simple/CGI for the query string parsing? Or should I just pull the relevant code from the module and plop it into my own?

    If all the search engines your module can work with use the default ?aap=blaat&boat=bloat way of encoding the URLs, you should use the CGI module as a module. The advantage is that when someone fixes a bug in CGI.pm, you don't need to manually fix that bug in your module too.
      If he sticks to URI.pm, he doesn't have to worry about bugs in CGI.pm ;)
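      A sketch of that URI-only approach, assuming the URI distribution (which bundles URI::QueryParam) is installed; note the OP's POD says URI::QueryParam misses some cases, so this only illustrates the suggestion:

      ```perl
      # Sketch (assumes the URI distribution is installed): extract the
      # q parameter with URI.pm alone, no CGI.pm involved.
      use URI;
      use URI::QueryParam;

      my $uri = URI->new("http://www.google.com/search?q=foo+bar+baz");
      my @terms = $uri->query_param('q');  # '+' decodes to spaces
      print join(":", @terms), "\n";       # prints: foo bar baz
      ```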

      MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
      I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
      ** The third rule of perl club is a statement of fact: pod is sexy.