jkstraw has asked for the wisdom of the Perl Monks concerning the following question:

I am working on my first Perl script (no programming experience at all) and am in well over my head!

Here is what I am trying to accomplish:

1. Grab an index.html page
2. Parse out all of the URLs
3. Create a unique list of the URLs
4. Convert the unique list of URLs to a unique list of IP addresses.

So far I have 3 1/2 of these done. I am able to print out a unique list of URLs (stored in an array), and I can convert URLs to IP addresses, but somehow I am losing the unique portion of the equation when I print out the resolved URLs.

I have never done any programming before, so I suspect it may be syntax related, but I can't seem to find the problem. I have read through the docs on the modules I am using but still had some lingering questions. I posted at the Perl Guru forums but the response I got was: ask the monks!

Line 62 of the script is a comment explaining the closest I have come, and lines 64-67 that follow are what print out the full list of IPs (but with duplicates).

In the comments you can see I made notes of where I can prove the array is returning unique elements. (I tested by printing the array).

I may have made a bad assumption, however. When I start using the Net::DNS module to resolve the URLs, I do the following:

my $res = Net::DNS::Resolver->new;
foreach $urls (@hrefs) {
    my $query = $res->query($urls, "A");
My assumptions are:

a) $urls will inherit only the unique elements of the @hrefs array
b) that I can only query a scalar variable, not an array, using Net::DNS.

Any help would be greatly appreciated!

Cheers!

Here is the complete script:

#!/usr/bin/perl
use LWP::Simple;
use Array::Unique;
tie @hrefs, 'Array::Unique';
tie @ips, 'Array::Unique';
use Net::DNS;

#
# Store the output of the web page (html and all) in the $content variable
#
print "URL to Parse? ";
$a = <STDIN>;
chop $a;
$b = "http://";
$c = $b . $a;
$" = "\n";
my $content = get("$c");

if (defined $content) {
    # $content will contain the html associated with the url mentioned above.
    # print $content . "";
}
else {
    # If an error occurs then $content will not be defined.
    print "Error: Get failed";
}

#
# Parse ALL URLs from the .html page and store all unique entries in the @hrefs array
#
$content =~ s/\'//g;
$content =~ s/\+//g;
@hrefs = ($content =~ m/:\/\/\"*([\w\-]+\.[\w\-\.]+?)[\"|\/|\s+]/ig);

#
# Resolve the URLs stored in the array
#
my $res = Net::DNS::Resolver->new;
foreach $urls (@hrefs) {
    #
    ## things are unique here!!
    #
    my $query = $res->query($urls, "A");
    #
    ## things are unique here!!
    #
    if ($query) {
        foreach $rr (grep { $_->type eq 'A' } $query->answer) {
            #
            ## At this location uniqueness is not preserved
            #
            # print ($rr->address . ' ');  # This prints out many duplicate IP addresses in a scalar variable
            #
            # This was one of my attempts to get things working
            # This is the closest to what I want to do but the list is not unique.
            #
            @ips = ($rr->address);
            foreach $uips (@ips) {
                print $uips . "\n";
            }
        }
    }
    else {
        warn "query failed: ", $res->errorstring, "\n";
    }
}

#
# for troubleshooting - this print statement is unique
#
# print $urls . "\n";
#
## More testing below - didn't work
#
#
# Add unique IPs to array @ips
#
# foreach $unique (@ips) {
#     print "@ips";
#     print $unique . "\n";
# }

Replies are listed 'Best First'.
Re: issues maintaining uniqueness
by BrowserUk (Patriarch) on Apr 27, 2008 at 00:02 UTC

    My assumptions are:

    a) $urls will inherit only the unique elements of the @hrefs array

    That's your mistake. When you iterate over an array using for, the indexing variable ($urls) will take on each value in turn regardless of whether it is unique or not.
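    For example, a minimal sketch (using a made-up test array, not the poster's data) shows that foreach hands the loop variable every element, duplicates included:

        my @hrefs = qw( example.com example.com other.org );
        foreach my $urls ( @hrefs ) {
            print "$urls\n";   # prints example.com twice
        }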

    The simplest way to ensure uniqueness (assuming that ordering is not important to your application) is to use a hash.

    my @hrefs = ...;
    my %uniqHrefs;

    ### This will create one entry in %uniqHrefs for each unique value in @hrefs;
    @uniqHrefs{ @hrefs } = ();

    my $res = Net::DNS::Resolver->new;
    foreach $urls ( keys %uniqHrefs ) {
        my $query = $res->query($urls, "A");
        ...

    If you need to retain ordering, then you'll need something slightly different but that doesn't seem to be the case.
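    If ordering did matter, one common idiom (just a sketch, not something from the reply above) is to filter the list through a %seen hash, keeping the first occurrence of each value:

        my %seen;
        my @uniqHrefs = grep { !$seen{$_}++ } @hrefs;   # first occurrence wins, original order kept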


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: issues maintaining uniqueness
by oko1 (Deacon) on Apr 27, 2008 at 03:40 UTC

    To follow up to BrowserUk's suggestion, you could also use a '%seen' hash (there's nothing magical about the name - it's just an indicator of its function.) This also processes the URLs in order.

    my %seen;

    for $urls (@hrefs){
        next if $seen{$urls}++;
        my $query = $res->query($urls, "A");
        ...

    Each new URL becomes a key in '%seen'; if a key identical to the current one already exists, we return to the top of the loop. (The post-increment hands back the previous count, so the test is false the first time a URL is seen and true on every later occurrence.)

    
    -- 
    Human history becomes more and more a race between education and catastrophe. -- HG Wells
    
Re: issues maintaining uniqueness
by mr_mischief (Monsignor) on Apr 27, 2008 at 06:27 UTC
    One incorrect assumption you've made has nothing to do with the language or the implementation. URLs carry more information than IP addresses do. An IP address is just a destination on a network; a URL also specifies a resource on that machine. Therefore, even if you successfully compile a list of unique URLs, there's every possibility that the list of IP addresses resolved from them will not be unique. Two URLs might be different resources on the same host. Most web sites are hosted in some sort of shared or virtual hosting environment, so even two different hostnames have a good chance of being on the same IP address.
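    To put that in code (a hypothetical sketch, not part of the reply; the two hostnames are placeholders): even with unique hostnames, only a second check on the resolved addresses guarantees unique IPs.

        use Socket;   # inet_ntoa

        my %seen_ip;
        my @ips;
        for my $host ( 'www.example.com', 'example.com' ) {   # placeholder hosts that may share an address
            my $packed = gethostbyname($host) or next;        # skip hosts that do not resolve
            my $ip = inet_ntoa($packed);
            push @ips, $ip unless $seen_ip{$ip}++;            # duplicates collapse here, not at the hostname stage
        }
        print "$_\n" for @ips;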
      Hi mischief,

      I don't think this is correct - I am trying to do things this way for that reason specifically.

      If I didn't make the URLs unique before passing them to the Net::DNS::Resolver module I would be doing even more duplication (some duplication is unavoidable as you correctly pointed out).

      By resolving only unique URLs I am minimizing the amount of DNS resolution that is required. It would be wasteful to resolve the exact same URL multiple times as it would always yield the same result.

      Thanks for the reply mischief,

      I actually didn't make that assumption. I could have easily not worried about things being unique until I got to the IP values - but by doing that I am wasting processing power and bandwidth by doing a DNS lookup on the same URL multiple times.

      This is why I decided to make sure I pass a unique set of hostnames to Net::DNS::Resolver and then run the results (IPs, including duplicates) through the Array::Unique module again.

      In fact, if you try the script as is and compare the regex results before and after passing through the Array::Unique module, you will see I am saving a large amount of duplication.

        That's better than what I thought I read, though you could cut the resolver overhead just as much by checking each hostname before you look it up to see whether it has already been looked up. This sort of task is just begging for a hash. That would use one loop instead of two, and would probably be simpler to follow.

        use Socket;   # gethostbyname is a builtin, but inet_ntoa comes from Socket

        my %looked_up;
        my @urls = qw( list of URLS however you got them );
        my @ips;

        foreach ( @urls ) {
            my $hostname = extract_hostname_from_url( $_ );
            unless ( exists $looked_up{ $hostname } ) {
                my $packed_ip = gethostbyname( $hostname );
                if (defined $packed_ip) {
                    my $ip_address = inet_ntoa($packed_ip);
                    push @ips, $ip_address;
                    $looked_up{ $hostname } = 1;
                }
            }
        }

        # do with @ips whatever you were going to do with them

        This retries hostname lookups that fail. You could change that easily by moving the autovivifying hash element assignment outside the if block for the packed IP address being defined.
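        That change would look something like this (a sketch of the variant described above, reusing the names from the example):

            foreach ( @urls ) {
                my $hostname = extract_hostname_from_url( $_ );
                next if exists $looked_up{ $hostname };
                $looked_up{ $hostname } = 1;                  # recorded even if the lookup below fails, so it is never retried
                my $packed_ip = gethostbyname( $hostname );
                push @ips, inet_ntoa( $packed_ip ) if defined $packed_ip;
            }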

        The sub extract_hostname_from_url is left as an exercise.
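        One way it might look (only a sketch, assuming the URI module from CPAN and keeping the name used above):

            use URI;

            sub extract_hostname_from_url {
                my ($url) = @_;
                my $uri = URI->new($url);
                return $uri->can('host') ? $uri->host : undef;   # schemes such as mailto: have no host part
            }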

Re: issues maintaining uniqueness
by sundialsvc4 (Abbot) on Apr 28, 2008 at 21:22 UTC

    One thing that you should always be very aware of is CPAN: the rather vast library of Perl software. Without exception, the first thing that you should do, when contemplating a “new” task, is to check CPAN to see if it's already been done. It probably has.
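    For the task in this thread, for instance, something along these lines would cover the fetch/parse/dedupe steps (a sketch only, assuming the CPAN modules LWP::Simple, HTML::LinkExtor, and URI; the URL is a placeholder):

        use LWP::Simple qw(get);
        use HTML::LinkExtor;
        use URI;

        my $page    = 'http://www.example.com/index.html';   # placeholder URL
        my $content = get($page) or die "Error: Get failed\n";

        my (%seen, @hosts);
        my $parser = HTML::LinkExtor->new(sub {
            my ($tag, %attr) = @_;
            return unless $attr{href};                        # only links that carry an href
            my $uri = URI->new_abs($attr{href}, $page);       # resolve relative links against the page
            return unless $uri->scheme && $uri->scheme =~ /^https?$/;
            push @hosts, $uri->host unless $seen{ $uri->host }++;
        });
        $parser->parse($content);
        $parser->eof;

        print "$_\n" for @hosts;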