URI::SearchTerms - Collect search terms from the search URLs of common search engines

See the POD in the code below for more information.

Any questions, comments, suggestions, et cetera are extremely welcome.

    package URI::SearchTerms;

    use warnings;
    use strict;

    =head1 NAME

    URI::SearchTerms - Collect search terms from the search URLs of common
    search engines

    =head1 SYNOPSIS

        use URI::SearchTerms;
        my $search_url = "http://www.google.com/search?q=foo+bar+baz";
        my @terms = URI::SearchTerms->terms($search_url);
        print join ":", @terms, "\n";

    =head1 DESCRIPTION

    An early version of this was written with the intention of using it to
    parse webserver log files (specifically the referer) to discover what
    search terms users were using to find the site. The idea later
    transformed into this. Besides parsing referers in log files, this
    could be used dynamically in CGI or mod_perl scripts to detect if users
    are coming from search engine results, and if so, what search terms
    they used.

    Currently the module supports Google, Yahoo, MSN, and AOL search URLs.
    If you would like to suggest another search engine to support, please
    email me (C<tsibley@cpan.org>) with either a few example URLs or, less
    preferably, a place to get my own. Patches are even better. : )

    =head1 METHODS

    =head2 URI::SearchTerms::terms($url), URI::SearchTerms->terms($url)

    This takes one argument: the URL to parse. It returns an array of the
    search terms, which will in most cases only contain one element.
    C<terms()> may be called in the class-style or the fully qualified
    style.

    =cut

    # Try to require CGI::Simple first. If that fails, try to load
    # CGI.pm. If all that fails, die with an error. I don't use
    # URI::QueryParam because it doesn't handle all cases as it should.
    my $CGI = 'CGI::Simple';
    eval { require CGI::Simple; };
    eval { require CGI; $CGI = 'CGI'; } if $@;
    if ($@) {
        die "The CGI::Simple or CGI modules must be installed "
          . "for URI::SearchTerms to work!";
    }

    require URI;

    my %pats = (
        google => { pat => qr<google\.>, keys => ['q', 'as_q'] },
        yahoo  => { pat => qr<yahoo\.>,  keys => ['p'] },
        msn    => { pat => qr<msn\.>,    keys => ['q'] },
        aol    => { pat => qr<aol\.>,    keys => ['query'] },
    );

    sub terms {
        my $url = $_[1] ? $_[1] : $_[0];
        my @terms;
        my $uri   = URI->new($url);
        my $host  = $uri->host;
        my $query = $uri->query;
        for (keys %pats) {
            if ($host =~ /$pats{$_}->{pat}/) {
                my $q = $CGI->new($query);
                for (@{ $pats{$_}->{keys} }) {
                    push @terms, $q->param($_);
                }
            }
        }
        return @terms;
    }

    =head1 REQUIREMENTS

    Currently, this module uses L<URI> and L<CGI::Simple> (or, if that
    isn't available, L<CGI>) to parse the query strings from the URLs and
    extract the appropriate params.

    =head1 BUGS

    This module desperately needs more test cases. There are probably a
    bunch of valid URLs for Yahoo or MSN or AOL that don't work (although I
    think I've covered Google pretty well). If you find one, please email
    me the URL at C<tsibley@cpan.org>.

    =head1 LICENSE

    This module is free software, and may be distributed under the same
    terms as Perl itself.

    =head1 AUTHOR

    Copyright (C) 2003, Thomas R. Sibley C<tsibley@cpan.org>

    =cut

    1;
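Since the DESCRIPTION mentions parsing webserver logs, here is a minimal sketch of that use, assuming the module above is installed; the log line itself is made up for illustration:

```perl
# Sketch: pull the referer field out of a (made-up) combined-format
# access log line and hand it to terms(). Assumes URI::SearchTerms
# above is installed.
use URI::SearchTerms;

my $line = '1.2.3.4 - - [12/Oct/2003:00:00:00 +0000] '
         . '"GET / HTTP/1.0" 200 512 '
         . '"http://www.google.com/search?q=foo+bar" "Mozilla"';

# In combined log format the referer is the second-to-last quoted field.
if ($line =~ /"([^"]*)" "[^"]*"$/) {
    my @terms = URI::SearchTerms->terms($1);
    print "terms: @terms\n" if @terms;
}
```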

Updated the POD as per Corion's suggestion and changed the code to use URI.pm for query string extraction.

Re: RFC: URI::SearchTerms
by Corion (Patriarch) on Oct 12, 2003 at 09:33 UTC

    A short thing on the synopsis - drop the alternatives. A good synopsis should have as much ready-to-copy code as possible.

    I would rewrite the synopsis as follows:

        use URI::SearchTerms;
        my $search_url = "http://www.google.com/search?q=foo+bar+baz";
        my @terms = URI::SearchTerms->terms($search_url);
        print join ":", @terms, "\n";

    All the finer things, like calling alternatives should go in the detailed discussion of every function (that is, into the discussion of terms()).

    Whether you use the class-style call or the fully qualified style call is a matter of taste, and I'm not exactly sure which one I prefer - I guess both are relatively valid, as there is not much use to pass more than one search url to the function.
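    The module supports both styles with a one-line dispatch on @_; a stripped-down sketch (the Demo package name is made up for illustration):

    ```perl
    # Stripped-down sketch of the dual calling convention.
    package Demo;

    sub terms {
        # Demo->terms($url) puts the class name in $_[0] and the URL in
        # $_[1]; Demo::terms($url) puts the URL in $_[0]. Checking
        # defined $_[1] (rather than its truth) avoids surprises if the
        # argument happens to be a false value.
        my $url = defined $_[1] ? $_[1] : $_[0];
        return $url;
    }

    package main;
    print Demo->terms("http://example.com/?q=foo"), "\n";
    print Demo::terms("http://example.com/?q=foo"), "\n";
    # Both lines print: http://example.com/?q=foo
    ```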

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider ($c = $d->accept())->get_request(); $c->send_response( new #in the HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
Re: RFC: URI::SearchTerms
by Juerd (Abbot) on Oct 12, 2003 at 14:07 UTC
      Arrrrgh. I wish I had seen that beforehand. Oh well. Any idea why the name Sequin was used?

        Any idea why the name Sequin was used?

        S earch
        E ngine
        Q uery
        U RL
        I nformation
        N oter

        Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

Re: RFC: URI::SearchTerms
by Jaap (Curate) on Oct 12, 2003 at 11:51 UTC
      Yes, at some point.
Re: RFC: URI::SearchTerms
by Zaxo (Archbishop) on Oct 12, 2003 at 04:21 UTC
      Yes, I have. I assume you mean as a namespace since WWW::Search provides a different function. The reason I wasn't sure about it being a good namespace is that it is used for modules that actually do the search. I guess it depends whether you'd classify my module as mainly dealing with URLs or web searches.
Re: RFC: URI::SearchTerms
by Jaap (Curate) on Oct 12, 2003 at 11:35 UTC
    Should I make use of CGI::Simple/CGI for the query string parsing? Or should I just pull the relevant code from the module and plop it into my own?

    If all the search engines your module can work with use the default ?aap=blaat&boat=bloat way of encoding the URLs, you should use the CGI module as a module. The advantage is that when someone fixes a bug in CGI.pm, you don't need to manually fix that bug in your module too.
      If he sticks to URI.pm, he doesn't have to worry about bugs in CGI.pm ;)
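      A sketch of that URI-only approach, assuming the URI distribution (which bundles URI::QueryParam) is installed; note the OP's POD says URI::QueryParam misses some cases, so this only illustrates the suggestion:

      ```perl
      # Sketch (assumes the URI distribution is installed): extract the
      # q parameter with URI.pm alone, no CGI.pm involved.
      use URI;
      use URI::QueryParam;

      my $uri = URI->new("http://www.google.com/search?q=foo+bar+baz");
      my @terms = $uri->query_param('q');  # '+' decodes to spaces
      print join(":", @terms), "\n";       # prints: foo bar baz
      ```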

      MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
      I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
      ** The third rule of perl club is a statement of fact: pod is sexy.