cptstarfire has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I am fairly new to Perl, so I could do with some help. I have an array containing a whole lot of web addresses (@sites). I need to search this array: if I find the same site again I need to keep a count of that, and if it's a new site it should be added to another array (@validsites). I then need to list the most popular sites. What would be the best way to do this?

Replies are listed 'Best First'.
Re: Array Searching
by InfiniteSilence (Curate) on Mar 09, 2006 at 19:33 UTC
    How many web addresses is a "whole lot"?

    For instance, how long does it take to run this:

    #!/usr/bin/perl -w
    use strict;

    my @addresses = qw|http://www.foo.com http://www.foo1.com http://www.foo2.com
                       http://www.foo.com http://www.a.com http://www.a.com
                       http://www.a.com http://www.a.com|;
    my %add;
    for (@addresses) { $add{$_}++ };
    print sort { $add{$b} <=> $add{$a} } keys %add;
    1;
    Update: LOL. ikegami rewrote my code, but do you realize that now this routine will sort the same 100K+ list twice just to get the top 10? I mean, heck, neither of our answers is really correct when you think about the volume...perhaps the data should be pushed to a database instead?
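    If only the top 10 matter, one option is to scan the counts once and keep a short running list instead of sorting every key. A minimal sketch, assuming the counts are already in a hash; `top_n` is a hypothetical helper name, not something from the code in this thread:

```perl
use strict;
use warnings;

# Hypothetical helper: return the $n keys of %$counts with the highest
# values, scanning the hash once and keeping at most $n candidates.
sub top_n {
    my ($counts, $n) = @_;
    return () if $n < 1;

    my @top;    # kept ordered, most popular first
    for my $site (keys %$counts) {
        # Skip sites that cannot displace anything in a full list.
        next if @top == $n && $counts->{$site} <= $counts->{ $top[-1] };

        # Walk to the insertion point (cheap, since @top has at most $n items).
        my $i = 0;
        $i++ while $i < @top && $counts->{ $top[$i] } >= $counts->{$site};
        splice @top, $i, 0, $site;

        pop @top if @top > $n;    # drop the one that fell off the end
    }
    return @top;
}

my %sites = (
    'http://www.foo.com'  => 2,
    'http://www.foo1.com' => 1,
    'http://www.foo2.com' => 1,
    'http://www.a.com'    => 4,
);
print "$_\n" for top_n(\%sites, 2);
```

    For a small N this does roughly N comparisons per key. Whether that actually beats a plain sort on 100K keys is something to measure rather than assume.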

    Celebrate Intellectual Diversity

      • Reverted names back to those the OP uses.
      • Switch to pre-increment for speed boost in some perls.
      • Answer both questions, not just one.
      • Made input list more readable.
      • Made printed list more readable.
      • Removed extraneous 1;.
      use strict;
      use warnings;

      my @sites = qw|
          http://www.foo.com
          http://www.foo1.com
          http://www.foo2.com
          http://www.foo.com
          http://www.a.com
          http://www.a.com
          http://www.a.com
          http://www.a.com
      |;

      my %sites;
      ++$sites{$_} foreach @sites;

      # URLs are strings, so use the default string sort rather than <=>.
      my @validsites   = sort keys %sites;
      my @popularsites = sort { $sites{$b} <=> $sites{$a} } keys %sites;

      # Keep only the 10 most popular (the guard avoids a warning on short lists).
      splice(@popularsites, 10) if @popularsites > 10;

      {
          local $, = "\n";
          local $\ = "\n";
          print 'Most Popular Sites',
                '------------------',
                @popularsites,
                '',
                'All Sites',
                '---------',
                @validsites;
      }
      a whole lot is about 100,000.
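      To get a feel for that volume, the counting loop can be timed on synthetic data. A rough sketch, assuming 100,000 made-up addresses spread over 5,000 distinct hosts; both numbers and the `siteN.example` names are placeholders, not the OP's data. Time::HiRes ships with Perl:

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Synthetic input: 100,000 addresses drawn from 5,000 made-up hosts.
my @sites = map { 'http://www.site' . int(rand 5000) . '.example' } 1 .. 100_000;

my $start = time;

my %sites;
++$sites{$_} foreach @sites;
my @popularsites = sort { $sites{$b} <=> $sites{$a} } keys %sites;

my $elapsed = time - $start;

printf "%d addresses, %d unique, %.3f seconds\n",
    scalar @sites, scalar keys %sites, $elapsed;
```

      The timing will vary by machine, so treat the number it prints as a local measurement only.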
Re: Array Searching
by hesco (Deacon) on Mar 09, 2006 at 20:03 UTC
    At the risk of xp--, but in the interest of being helpful, this is untested here, surely bug-filled, off the cuff, and written w/o consultation to any working code or documentation, so use at your own risk:
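    sub sitecounter {
        my @sites = @_;
        my (@sitepopularity, @validsites);

        # Sort so that duplicates sit next to each other, then keep one of each.
        foreach my $site (sort @sites) {
            push @validsites, $site
                unless @validsites && $validsites[-1] eq $site;
        }

        # For each unique site, collect its occurrences and record the count.
        foreach my $site (@validsites) {
            my @tmp;
            foreach my $s (@sites) {
                push @tmp, $s if $site eq $s;
            }
            push @sitepopularity, $site, scalar @tmp;
        }

        # Flat list of (site, count) pairs, e.g. my %hits = sitecounter(@sites);
        return @sitepopularity;
    }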
    Of course, it depends on how you define a site: should you add a subroutine that uses a regex to reduce each address to its root domain, rather than counting individual pages as this one does? But if you debug that, it ought to get you started.
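    As a rough illustration of that kind of normalisation, here is a sketch that reduces each address to its lower-cased host name before counting. `site_key` is a hypothetical helper, and the regex is a simplification: it ignores user:password@ prefixes and does not strip subdomains down to a registered domain.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a URL to its lower-cased host name so that
# different pages on the same host are counted together.
sub site_key {
    my ($url) = @_;
    # scheme://host, stopping at the first '/', ':', '?' or '#'
    if ($url =~ m{^[a-z][a-z0-9+.-]*://([^/:?#]+)}i) {
        return lc $1;
    }
    return lc $url;    # no scheme found: fall back to the raw string
}

my %sites;
++$sites{ site_key($_) } for qw(
    http://www.foo.com/index.html
    http://www.foo.com/about.html
    http://WWW.A.COM
);

print "$_: $sites{$_}\n" for sort keys %sites;
# prints:
#   www.a.com: 1
#   www.foo.com: 2
```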

    Check out the docs for push, pop, sort, and if this was a homework question I helped you with, the caveats offered to new users here who ask others to do their homework for them.

    -- Hugh

      Thanks a lot for your help. Another useful thing would be to show the number of "hits" (occurrences) the most popular sites have received.
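      The hit counts are already sitting in the hash used for counting, so they can be printed next to each site. A minimal sketch along the lines of the hash-based replies above, reusing the same sample list; the alphabetical tie-break and the column width are arbitrary choices added here:

```perl
use strict;
use warnings;

my @sites = qw|
    http://www.foo.com http://www.foo1.com http://www.foo2.com
    http://www.foo.com http://www.a.com http://www.a.com
    http://www.a.com http://www.a.com
|;

my %sites;
++$sites{$_} foreach @sites;

# Most hits first; ties broken alphabetically so the output is stable.
my @popularsites = sort { $sites{$b} <=> $sites{$a} or $a cmp $b } keys %sites;
splice(@popularsites, 10) if @popularsites > 10;

# One line per site: hit count, then the address.
printf "%5d  %s\n", $sites{$_}, $_ foreach @popularsites;
# prints:
#     4  http://www.a.com
#     2  http://www.foo.com
#     1  http://www.foo1.com
#     1  http://www.foo2.com
```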