
To avoid duplicate user/IP pairs, you could use something of this general form:

my %seen;
while ( my ( $user, $ip ) = next_pair() ) {
  next if $seen{ $user }{ $ip }++;
  insert( $user, $ip );
}

Now you need to come up with sensible definitions for next_pair() and insert().
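
As a rough illustration only, here is one way those two subs might look. Everything outside the loop is an assumption made for the demo: the in-memory @records list stands in for whatever your real input is, the "user|ip" record format is hypothetical, and insert() just pushes onto an array where a real version would write to your actual destination.

```perl
use strict;
use warnings;

# Hypothetical input: "user|ip" records, standing in for a log file or query result.
my @records = (
    'alice|10.0.0.1',
    'bob|10.0.0.2',
    'alice|10.0.0.1',   # exact repeat, should be skipped
    'alice|10.0.0.3',   # same user, new IP, should be kept
);

my @inserted;   # stands in for the real destination (database table, file, ...)

# Return the next ( $user, $ip ) pair, or an empty list when the input is exhausted.
sub next_pair {
    my $record = shift @records;
    return unless defined $record;
    return split /\|/, $record, 2;
}

# Record a pair; a real version might run an SQL INSERT instead.
sub insert {
    my ( $user, $ip ) = @_;
    push @inserted, "$user|$ip";
}

my %seen;
while ( my ( $user, $ip ) = next_pair() ) {
    next if $seen{ $user }{ $ip }++;
    insert( $user, $ip );
}

print "$_\n" for @inserted;
```

The one thing the loop relies on is that next_pair() returns an empty list when there is nothing left, since that is what makes the while condition false.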

Just to be clear, the above will prevent any pairing from being inserted more than once, but it is still possible for a user or an IP to be inserted multiple times, as long as each insertion pairs it with a different IP or user, respectively. If you want to make sure that users are inserted only once, irrespective of IP address, then the first line in the loop above would become

  next if $seen{ $user }++;

Similarly, if you want to make sure IPs are inserted only once, that line would instead be

  next if $seen{ $ip }++;
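
To see how the three choices of key differ, here is a small sketch that runs all of them over the same made-up data. The sample pairs and the three separate hashes are my own assumptions for the demo; I keep the hashes separate because user names and IPs sharing one hash could in principle collide.

```perl
use strict;
use warnings;

# Hypothetical sample data: [ user, ip ] pairs.
my @pairs = (
    [ 'alice', '10.0.0.1' ],
    [ 'alice', '10.0.0.2' ],   # same user, different IP
    [ 'bob',   '10.0.0.1' ],   # different user, same IP
    [ 'alice', '10.0.0.1' ],   # exact repeat of the first pair
);

my ( %seen_pair, %seen_user, %seen_ip );

# Keep the first occurrence under each definition of "duplicate".
my @by_pair = grep { !$seen_pair{ $_->[0] }{ $_->[1] }++ } @pairs;
my @by_user = grep { !$seen_user{ $_->[0] }++ } @pairs;
my @by_ip   = grep { !$seen_ip{ $_->[1] }++ } @pairs;

printf "unique pairs: %d, unique users: %d, unique IPs: %d\n",
    scalar @by_pair, scalar @by_user, scalar @by_ip;
# prints "unique pairs: 3, unique users: 2, unique IPs: 2"
```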

One common gotcha when trying to avoid duplicates is not having a sufficiently clear specification of which items should be regarded as equivalent. For example, how should your program deal with the pairs (john doe|12.345.678.901) and (John Doe|12.345.678.901)? The code above, as written, would result in two insertions, but maybe you want to ignore case distinctions in the name (and thus avoid the second insertion). If so, you'd need to change the first line in the loop to something like:

  next if $seen{ uc $user }{ $ip }++;

This ensures that your duplicate check compares user names case-insensitively.

This small example illustrates the need to specify exactly what one means by "duplicates" and, from this specification, to design a normalization procedure that must be applied before testing for repeats. In the example above, this normalization procedure is very simple: just convert the user name to uppercase. (An entirely equivalent procedure would be to convert it to lowercase.) But you may require a more elaborate normalization; e.g. you may want to treat the pairs (Edward Estlin Cummings|12.345.678.901) and (e.e.cummings|12.345.678.901) as equivalent.
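
For what it's worth, a more elaborate normalization could be factored into its own sub, along these lines. This is only a sketch: normalize_user() and the %alias table are names I made up, the rules (lowercase, trim, collapse whitespace, look up known aliases) are just one plausible specification, and the sample addresses are arbitrary.

```perl
use strict;
use warnings;

# Hypothetical alias table mapping known variants to one canonical name.
my %alias = (
    'e.e.cummings' => 'edward estlin cummings',
);

# Reduce a user name to the canonical form used as the hash key.
sub normalize_user {
    my ($user) = @_;
    $user = lc $user;            # ignore case
    $user =~ s/^\s+|\s+$//g;     # trim leading and trailing whitespace
    $user =~ s/\s+/ /g;          # collapse runs of whitespace
    return exists $alias{$user} ? $alias{$user} : $user;
}

my %seen;
my @kept;
for my $pair (
    [ 'Edward Estlin Cummings', '192.0.2.1' ],
    [ 'e.e.cummings',           '192.0.2.1' ],   # alias of the first
    [ '  john   DOE ',          '192.0.2.7' ],
    [ 'John Doe',               '192.0.2.7' ],   # same after normalization
) {
    my ( $user, $ip ) = @$pair;
    next if $seen{ normalize_user($user) }{ $ip }++;
    push @kept, "$user|$ip";
}

print "$_\n" for @kept;
```

Only the first pair of each equivalent group is kept. Note that the normalized form is used only as the hash key; what gets kept is the name as originally given, which may or may not be what you want.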

the lowliest monk

In reply to Re: Using hashes or arrays to remove duplicate entries by tlm
in thread Using hashes or arrays to remove duplicate entries by ghettofinger
