ghettofinger has asked for the wisdom of the Perl Monks concerning the following question:

Wise monks,

I am trying to wrap my head around a couple of problems, so I thought I would see if anyone here in the monastery could give me some advice.

I am trying to scrape a webpage to extract usernames and the IP addresses associated with them. There are multiple username/IP address pairings on the page. This information will then be inserted into a database. The problem I will run into, though, is that the same username and IP will undoubtedly appear on the page more than once. In order to make sure that only one copy of each pairing is inserted into the database, wouldn't I want to put the scraped information into a hash or array before doing the database insert, so I could somehow remove duplicates? I am not sure how to go about this.

Do I use a hash, or an array? Can I use a regular expression to add each username/IP address pair to a hash and then remove duplicates? Is this how it should be done? I am really confused.

I appreciate everyone's help and thanks in advance.

--ghettofinger

Re: Using hashes or arrays to remove duplicate entries
by rev_1318 (Chaplain) on Apr 13, 2005 at 07:57 UTC
    It all depends:
    1. Can the same username appear more than once, but with different IP addresses? If not, use a hash with the username as the key and an arrayref of IP addresses as the value.
    2. Can the same IP address appear more than once, but with different usernames? If not, do the opposite of 1.
    If both can appear multiple times, you are best off using the combination of username and IP address as the hash key. Then either store an arrayref holding the username and IP address as the value and use that array as the data for your database insert, or split the hash key back apart and use that; see the sketch below.
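    For that last case, a minimal sketch (get_next_pair() and insert_row() are made-up placeholders for however you pull a pair off the page and do the actual database insert):

    my %pairs;
    while ( my ( $user, $ip ) = get_next_pair() ) {
        $pairs{"$user\0$ip"} = [ $user, $ip ];    # duplicates simply overwrite the same key
    }
    insert_row( @$_ ) for values %pairs;          # one insert per unique pair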

    HTH,
    Paul

Re: Using hashes or arrays to remove duplicate entries
by tweetiepooh (Hermit) on Apr 13, 2005 at 10:08 UTC
    Why not create a unique key at the database level? Then, when you try to insert an existing value, the database will disallow it. Trap this in your program.

    This method will let you keep distinct values between runs as well as inside runs.

    Better still, also create a hash as suggested above to prevent the attempted inserts within a run in the first place.
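
    A rough sketch of the trap-the-error idea, assuming your scraped pairs are in @pairs as [$user, $ip] arrayrefs and a made-up user_ips table with a unique constraint on (username, ip):

    use DBI;

    my $dbh = DBI->connect( 'dbi:mysql:mydb', 'user', 'pass', { RaiseError => 1 } );
    my $sth = $dbh->prepare('INSERT INTO user_ips (username, ip) VALUES (?, ?)');

    for my $pair (@pairs) {
        eval { $sth->execute(@$pair) };            # the unique constraint rejects duplicates
        warn "skipped duplicate: @$pair\n" if $@;  # trap the error and carry on
    }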
Re: Using hashes or arrays to remove duplicate entries
by tlm (Prior) on Apr 13, 2005 at 11:07 UTC

    To avoid duplicate user/IP pairs, you could use something of this general form

    my %seen;
    while ( my ( $user, $ip ) = next_pair() ) {
        next if $seen{ $user }{ $ip }++;
        insert( $user, $ip );
    }
    Now you need to come up with sensible definitions for next_pair() and insert().
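
    Purely as an illustration (the regex assumes each interesting line of the page looks like "somename 12.345.678.901", and insert() assumes a DBI handle in $dbh and a table called user_ips; none of that comes from your post):

    my @lines = split /\n/, $page;                 # $page holds the fetched HTML

    sub next_pair {
        while ( defined( my $line = shift @lines ) ) {
            return ( $1, $2 ) if $line =~ /(\w+)\s+(\d+\.\d+\.\d+\.\d+)/;
        }
        return;                                    # no more pairs
    }

    sub insert {
        my ( $user, $ip ) = @_;
        $dbh->do( 'INSERT INTO user_ips (username, ip) VALUES (?, ?)',
                  undef, $user, $ip );
    }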

    Just to be clear, the above will prevent any pairing from being inserted more than once, but it is still possible for users and IPs to be inserted multiple times, as long as each insertion associates them with a different IP or user, respectively. If you want to make sure the users are inserted only once, irrespective of IP address, then the first line in the loop above would become

    next if $seen{ $user }++;
    Similarly, if you want to make sure IPs are inserted only once, that line would instead be
    next if $seen{ $ip }++;

    One common gotcha when trying to avoid duplicates comes from not having a sufficiently clear specification of which items should be regarded as equivalent. For example, how should your program deal with the pairs (john doe|12.345.678.901) and (John Doe|12.345.678.901)? The code above, as written, would result in two insertions, but maybe you want to ignore case distinctions in the name (and thus avoid the second insertion). If so, you'd need to change the first line in the loop to something like:

    next if $seen{ uc $user }{ $ip }++;
    This ensures that your duplicate control scheme detects user names case-insensitively.

    This small example illustrates the need to specify exactly what one means by "duplicates", and, from that specification, to design a normalization procedure to apply before testing for repeats. In the example above, the normalization is very simple: just convert everything to uppercase. (Converting everything to lowercase would be entirely equivalent.) But you may have more elaborate normalization requirements; e.g., you may want to treat the pairs (Edward Estlin Cummings|12.345.678.901) and (e.e.cummings|12.345.678.901) as equivalent.
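
    In that spirit, you could funnel everything through a single (hypothetical) normalization routine and key %seen on its output. The simple version below only handles case and punctuation, so equating full names with initials, as in the Cummings example, would still take extra rules of your own:

    sub normalize_user {
        my $user = lc shift;       # ignore case
        $user =~ s/[^a-z0-9]//g;   # drop spaces, dots and other punctuation
        return $user;
    }

    # then, in the loop:
    next if $seen{ normalize_user($user) }{ $ip }++;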

    the lowliest monk