Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I need to parse my links and extract the server name, i.e. the part between "http://" and the first "/".
I then need a count of how often each server appears. Here is an example of what I have:
http://riverserver/dir1/dir2/index.html
http://riverserver/dir1/dir2/index.html
http://perlmonks.org/index.pl
http://webserver.company.serv/Adir/Bdir/thePage.cfm
http://riverserver/dir1/dir2/index.html
http://webserver.company.serv/Adir/Bdir/thePage.cfm
I want to parse the info and get this:
riverserver = 3
perlmonks.org = 1
webserver.company.serv = 2
I think my problem is how to parse the links and add the results to a hash, but I am not sure I am doing this right:
use strict;

my $link1 = 'http://perlmonks.org/index.pl';
my $link2 = 'http://riverserver/dir1/dir2/index.html';
# more links etc...

$link1 =~ s/http\:\/\///;
$link2 =~ s/http\:\/\///;

my @link1 = split /\//, $link1;
my @link2 = split /\//, $link2;

print "$link1[0]\n";
print "$link2[0]\n";

Replies are listed 'Best First'.
Re: Parsing to get server info
by ctilmes (Vicar) on Jul 15, 2003 at 13:18 UTC
    Use URI.
    use URI;

    my %hostcount;

    while (<DATA>) {
        my $u = URI->new($_);
        $hostcount{$u->host}++;
    }

    foreach my $host (keys %hostcount) {
        print "$host = $hostcount{$host}\n";
    }

    __DATA__
    http://riverserver/dir1/dir2/index.html
    http://riverserver/dir1/dir2/index.html
    http://perlmonks.org/index.pl
    http://webserver.company.serv/Adir/Bdir/thePage.cfm
    http://riverserver/dir1/dir2/index.html
    http://webserver.company.serv/Adir/Bdir/thePage.cfm

    Output:

    webserver.company.serv = 2
    riverserver = 3
    perlmonks.org = 1
      I tried your code exactly as posted and got this message:
      Can't locate object method "host" via package "URI::_generic" (perhaps you forgot to load "URI::_generic"?) at C:\Perl\bin\url1.pl line 11, <DATA> line 7.
      I do have the "URI" module installed on my Windows NT machine. Please advise.
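      A possible explanation, offered as a guess from the message alone: "<DATA> line 7" is one line past the six URLs, which suggests a blank line at the end of the data. URI->new on a string with no scheme returns a generic URI object, and that class has no host method. A minimal sketch that skips such lines, assuming the URI module is installed (count_hosts is a hypothetical helper name, and the can('host') check is just one way to guard):

```perl
use strict;
use warnings;
use URI;

# Count host names in a list of URL strings, ignoring blank lines and
# anything that does not parse to a URI with a host component.
# count_hosts is a hypothetical name used only for this sketch.
sub count_hosts {
    my @lines = @_;
    my %hostcount;
    for my $line (@lines) {
        $line =~ s/^\s+|\s+$//g;        # trim newline and other whitespace
        next if $line eq '';            # skip blank lines
        my $u = URI->new($line);
        next unless $u->can('host');    # URI::_generic has no host()
        $hostcount{ $u->host }++;
    }
    return \%hostcount;
}

my $count = count_hosts(
    "http://riverserver/dir1/dir2/index.html\n",
    "http://perlmonks.org/index.pl\n",
    "http://riverserver/dir1/dir2/index.html\n",
    "\n",    # a blank trailing line, like the suspected line 7
);
print "$_ = $count->{$_}\n" for sort keys %$count;
```

      With a while (<DATA>) loop the same two guards would go at the top of the loop body.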
Re: Parsing to get server info
by gjb (Vicar) on Jul 15, 2003 at 13:14 UTC

    If you're sure that the data you process contains only valid URLs, you can do it a bit more conveniently with:

    $link =~ m{http://([^/]+)};
    $server = $1;
    Hope this helps, -gjb-
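
    One caveat if the data might not be clean: when the match fails, $1 keeps its value from an earlier successful match. A small sketch that returns undef instead, using the same pattern in list context (server_name is a hypothetical helper name, not something from the post above):

```perl
use strict;
use warnings;

# Return the server part of an http URL, or undef if the string does
# not look like one. server_name is a hypothetical helper name.
sub server_name {
    my ($link) = @_;
    # In list context a failed match returns an empty list,
    # so $server stays undef instead of holding a stale $1.
    my ($server) = $link =~ m{http://([^/]+)};
    return $server;
}

for my $link ( 'http://perlmonks.org/index.pl', 'not a link' ) {
    my $server = server_name($link);
    print defined $server ? "$server\n" : "no server found in '$link'\n";
}
```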

      As an add-on here, maybe:

      $line =~ m{^https?://([^/]+)};
      $server{$1}++;

      for ( keys %server ) {
          print "$_ : $server{$_}\n";
      }

      This way it will also catch https URLs, though using a module, as recommended in the other reply, is probably the best way.
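
      Put together as a complete script, the regex approach might look like the sketch below. It counts only when the pattern actually matches and prints in the "name = count" form from the question. The count_servers name, the lowercasing, and the sort order are my own assumptions, not part of the posts above.

```perl
use strict;
use warnings;

# Tally server names from a list of URL strings (http or https).
# count_servers is a hypothetical name used only for this sketch.
sub count_servers {
    my %server;
    for my $line (@_) {
        # Count only when the pattern matches, so a bad line never reuses $1.
        next unless $line =~ m{^https?://([^/\s]+)};
        $server{ lc $1 }++;    # host names are case-insensitive
    }
    return %server;
}

my %server = count_servers(
    'http://riverserver/dir1/dir2/index.html',
    'http://riverserver/dir1/dir2/index.html',
    'http://perlmonks.org/index.pl',
    'http://webserver.company.serv/Adir/Bdir/thePage.cfm',
    'http://riverserver/dir1/dir2/index.html',
    'http://webserver.company.serv/Adir/Bdir/thePage.cfm',
);

# Most frequent first, ties broken alphabetically.
for my $name ( sort { $server{$b} <=> $server{$a} or $a cmp $b } keys %server ) {
    print "$name = $server{$name}\n";
}
```

      Note that the captured text would still include a port or user info if a URL carried one, which is one reason a module is the safer choice.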


      MMMMM... Chocolaty Perl Goodness.....