in reply to Re^2: Character Length Requirement & String Conversion
in thread Character Length Requirement & String Conversion

I wanted to post an update: I have found the solution I was looking for, using String::CRC32 or String::CRC.

Thank you again to everyone for your comments!

-BP-

Re^4: Character Length Requirement & String Conversion
by BrowserUk (Patriarch) on Mar 13, 2012 at 04:09 UTC
    I have found the solution I was looking for, using String::CRC32 or String::CRC.

    That is a very bad idea! CRC32 is designed for detecting bit corruptions in single strings, not hashing many strings.

    The following code checks for duplicates among just the 5-character strings 'aaaaa' .. 'zzzzz', and finds thousands, the first after just 18026 tries:

    use String::CRC32;

    # 256 bit vectors of 2 MiB (2**24 bits) each: one bit per possible 32-bit CRC
    @v = ( chr(0) ) x 256;
    $_ x= 2*1024*1024 for @v;

    # Return true if this CRC has been seen before; otherwise mark it as seen
    sub testAndSet{
        my( $hi, $lo ) = ( $_[0] >> 24, $_[0] & 0x00ffffff );
        return 1 if vec( $v[$hi], $lo, 1 );
        vec( $v[$hi], $lo, 1 ) = 1;
        return;
    }

    # NB: crc32() treats an optional second argument as the initial CRC value,
    # so ++$n here both counts the strings and seeds each checksum
    $n = 0;
    testAndSet( crc32( $_, ++$n ) ) and warn "Dup after $n strings" for 'aaaaa'..'zzzzz';

    Dup after 18026 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 18027 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 18042 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 18043 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 18728 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 18729 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 18744 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 18745 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 19378 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 19379 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 116559 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 116574 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 117261 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 117276 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 126026 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 126027 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 126030 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 126031 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 126042 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 126043 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 126046 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 126047 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 126728 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 126729 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 126732 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 126733 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 126744 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 126745 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 126748 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 126749 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 176385 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 176388 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 176389 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 176400 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 176404 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 176405 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 250001 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 250512 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 250513 strings at (eval 12) line 1, <STDIN> line 4.
    Dup after 250516 strings at (eval 12) line 1, <STDIN> line 4.

    Frankly, you'd be better off just truncating the URLs to 4 or 5 characters. (That is not a recommendation!)

    And it's not much better with long strings:

    use String::CRC32;

    @v = ( chr(0) ) x 256;
    $_ x= 2*1024*1024 for @v;

    sub testAndSet{
        my( $hi, $lo ) = ( $_[0] >> 24, $_[0] & 0x00ffffff );
        return 1 if vec( $v[$hi], $lo, 1 );
        vec( $v[$hi], $lo, 1 ) = 1;
        return;
    }

    $n = 0;
    testAndSet( crc32( $_, ++$n ) ) and warn "Dup after $n strings"
        for 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'..'zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz';

    Dup after 142376 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 551424 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 551425 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 551426 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 551427 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 551428 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 551429 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 551430 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 551431 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 551684 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 551685 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 551686 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 551687 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 587768 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 587769 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 587770 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 587771 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 832410 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 832411 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 832472 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 832473 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 833434 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 833435 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 833502 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 833503 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 903490 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 903491 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 903494 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 903495 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 903498 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 903501 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 903516 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 903517 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 994476 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 994477 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 994788 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 994789 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 1019528 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 1019529 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 1019532 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 1019533 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 1019536 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 1019537 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 1019560 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 1019565 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 1019840 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 1019841 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 1019844 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 1019845 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 1019848 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 1019849 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 1019872 strings at (eval 13) line 1, <STDIN> line 5.
    Dup after 1019877 strings at (eval 13) line 1, <STDIN> line 5.
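
For anyone who wants to try this kind of duplicate check on other ID functions, here is a more readable sketch that uses a plain Perl hash instead of bit vectors. It is not the code from the post: the name count_collisions and the two-character truncation used as a demo are made up for illustration, and a hash costs far more memory per entry than one bit, so it only suits modest key counts.

```perl
use strict;
use warnings;

# Count how many keys produce an ID that an earlier key already produced.
# $make_id is any code ref that maps a string to a short ID
# (hypothetical helper, not part of String::CRC32 or any other module).
sub count_collisions {
    my ( $make_id, @keys ) = @_;
    my %seen;
    my $dups = 0;
    for my $key (@keys) {
        my $id = $make_id->($key);
        $dups++ if $seen{$id}++;    # true from the second sighting onwards
    }
    return $dups;
}

# Demo ID function (an assumption for illustration only):
# keep just the first two characters of each key.
my $first_two = sub { substr( $_[0], 0, 2 ) };

my @keys = ( 'aaa' .. 'zzz' );      # 26**3 = 17576 test strings
printf "%d keys, %d collisions\n",
    scalar @keys, count_collisions( $first_two, @keys );
```

To test a real candidate, pass it in the same way, for example sub { crc32( $_[0] ) } once String::CRC32 is loaded, and feed it a sample of the actual URLs rather than synthetic strings.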


Re^4: Character Length Requirement & String Conversion
by ikegami (Patriarch) on Mar 13, 2012 at 05:09 UTC

    Don't!!!

    CRC isn't a hashing algorithm. It's not designed for your purpose at all.

    But let's assume it's somehow as good as MD5 (bit for bit) despite that. Even then, the first 19 hex characters of an MD5 digest would be 18 trillion times better than CRC32.

                         Num of possible  Times better  Times better
                         hashes           than CRC32    than md5_hex
    ===================  ===============  ============  ============
    CRC32                4.3E09
    MD5 (19 of base 16)  7.6E22           1.8E13
    MD5 (19 of base 62)  1.1E34           2.6E24        1.5E11
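
The sizes and ratios above can be re-derived with a few lines of Perl, and the same snippet shows one way to get a 19-character ID from MD5. This is only a sketch: the sub name short_id and the example URL are made up, and the base 62 variant is not implemented here, since it needs a big-integer conversion of the digest.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Number of distinct values each scheme can produce
my @schemes = (
    [ 'CRC32',               2**32  ],    # 32-bit checksum
    [ 'MD5 (19 of base 16)', 16**19 ],    # 19 hex digits = 76 bits
    [ 'MD5 (19 of base 62)', 62**19 ],    # 19 chars from [0-9A-Za-z]
);
my ( $crc_space, $hex_space ) = ( $schemes[0][1], $schemes[1][1] );

# Print each size, then its ratio to CRC32 and to the 19-hex-digit MD5
for my $scheme (@schemes) {
    my ( $name, $size ) = @$scheme;
    printf "%-20s %9.1E %9.1E %9.1E\n",
        $name, $size, $size / $crc_space, $size / $hex_space;
}

# Hypothetical helper: first 19 hex characters of the MD5 digest
sub short_id {
    my ($string) = @_;
    return substr( md5_hex($string), 0, 19 );
}

print short_id('http://example.com/some/long/path'), "\n";
```

Truncating a digest keeps it evenly distributed, but collisions are still possible, so the code that stores these IDs should check for an existing entry before assuming a new ID is unique.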