in reply to issues maintaining uniqueness

One incorrect assumption you've made has nothing to do with the language or the implementation. URLs carry more information than IP addresses do. An IP address is just a destination on a network; a URL also specifies a resource on that machine. So even if you successfully compile a list of unique URLs, there's every possibility that the list of IP addresses resolved from them will not be unique: two different URLs may simply be different resources on the same host. Most web sites are also hosted in some sort of shared or virtual hosting environment, so even two different hostnames have a fair chance of sitting on the same IP address.
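For example (a minimal sketch; the URLs, and the use of the URI and Socket modules, are my own illustration rather than anything from your script):

    use strict;
    use warnings;
    use URI;
    use Socket;   # provides inet_ntoa

    # Two different URLs that happen to live on the same host.
    my @urls = (
        'http://example.com/page/one',
        'http://example.com/page/two',
    );

    my %seen_ip;
    for my $url (@urls) {
        my $host      = URI->new($url)->host;          # drop the resource part, keep the hostname
        my $packed_ip = gethostbyname($host) or next;  # skip hosts that don't resolve
        $seen_ip{ inet_ntoa($packed_ip) }++;
    }

    # The URL list contains no duplicates, yet %seen_ip will normally
    # end up holding a single address.
    print "$_\n" for keys %seen_ip;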

Re^2: issues maintaining uniqueness
by jkstraw (Novice) on Apr 30, 2008 at 19:20 UTC
    Hi mischief,

    I don't think this is correct; I'm doing things this way for exactly that reason.

    If I didn't make the URLs unique before passing them to the Net::DNS::Resolver module I would be doing even more duplication (some duplication is unavoidable as you correctly pointed out).

    By resolving only unique URLs I am minimizing the amount of DNS resolution that is required. It would be wasteful to resolve the exact same URL multiple times as it would always yield the same result.

Re^2: issues maintaining uniqueness
by jkstraw (Novice) on Apr 30, 2008 at 18:46 UTC
    Thanks for the reply mischief,

    I actually didn't make that assumption. I could easily have put off worrying about uniqueness until I got to the IP values, but then I would be wasting processing power and bandwidth by doing a DNS lookup on the same URL multiple times.

    This is why I decided to make sure I pass a unique set of hostnames to Net::DNS::Resolver and then run the results (the IPs, including duplicates) through the Array::Unique module again.

    In fact, if you try the script as is and compare the regex results before and after passing them through the Array::Unique module, you will see this removes a large amount of duplication.
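    Roughly, the flow looks like this (a trimmed-down sketch rather than my full script; @hostnames_from_regex just stands in for whatever the regex extraction produces):

    use strict;
    use warnings;
    use Array::Unique;
    use Net::DNS;

    my @hostnames_from_regex;             # filled by the regex extraction step

    my @hostnames;
    tie @hostnames, 'Array::Unique';      # duplicate hostnames are silently dropped
    push @hostnames, @hostnames_from_regex;

    my $res = Net::DNS::Resolver->new;

    my @ips;
    tie @ips, 'Array::Unique';            # second pass: duplicate IPs dropped too
    for my $host (@hostnames) {
        my $reply = $res->search($host, 'A') or next;   # skip hostnames that don't resolve
        for my $rr ($reply->answer) {
            push @ips, $rr->address if $rr->type eq 'A';
        }
    }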

      That's better than what I thought I had read, though you could cut the resolver overhead just as much by checking each hostname before you look it up to see whether it has already been resolved. This sort of task is just begging for a hash. That would use one loop instead of two, and would probably be simpler to follow.

      use Socket;   # provides inet_ntoa
      my %looked_up;
      my @urls = qw( list of URLS however you got them );
      my @ips;
      foreach ( @urls ) {
          my $hostname = extract_hostname_from_url( $_ );
          unless ( exists $looked_up{ $hostname } ) {
              my $packed_ip = gethostbyname( $hostname );
              if ( defined $packed_ip ) {
                  my $ip_address = inet_ntoa( $packed_ip );
                  push @ips, $ip_address;
                  $looked_up{ $hostname } = 1;
              }
          }
      }
      # do with @ips whatever you were going to do with them

      As written, this retries hostname lookups that fail. You could change that easily by moving the hash-element assignment outside the if block that checks whether the packed IP address is defined, so a hostname that fails to resolve is only tried once.
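      Roughly (a sketch only, reusing the variables from the snippet above):

      unless ( exists $looked_up{ $hostname } ) {
          $looked_up{ $hostname } = 1;                  # record the attempt whether or not it resolves
          my $packed_ip = gethostbyname( $hostname );
          if ( defined $packed_ip ) {
              push @ips, inet_ntoa( $packed_ip );
          }
      }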

      The sub extract_hostname_from_url is left as an exercise.

        Thank you everyone!

        I can't say it wasn't painful, but the hashing worked like a charm. Shout out to mischief for getting me to think outside the box with regard to the order of operations.

        This was a pretty crazy first attempt at scripting for me but the community here really came through with stellar advice!

        Cheers!