in reply to Hash Search is VERY slow
The general idea is presorting, for instance by iterating over the file multiple times and processing only one IP range per pass.
This is expensive in IO, but it keeps the data structures small.
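A minimal sketch of that multi-pass idea, assuming a hypothetical input.txt with one tab-separated "IP url" record per line (both the file name and the record layout are guesses, adjust them to your data):

    use strict;
    use warnings;

    # one pass per /8 range keeps only a small slice of the data in memory
    for my $first_octet (0 .. 255) {
        my %count;
        open my $fh, '<', 'input.txt' or die "input.txt: $!";
        while (my $line = <$fh>) {
            chomp $line;
            my ($ip, $url) = split /\t/, $line;
            defined $url or next;                        # skip malformed lines
            next unless $ip =~ /^\Q$first_octet\E\./;    # only this range
            $count{$ip}{$url}++;
        }
        close $fh;
        # ... report or write out the counts for this range here,
        # %count is freed before the next pass starts
    }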
If you really need all the data present in memory at once, consider breaking up the ranges into a tree of nested data structures and processing them in linear order,
like $hash->{'192'}{'168'}{'101'}{'208'} or $hash->{'192.168'}{'101.208'} instead of $hash->{'192.168.101.208'} °
If you now process all IPs in order, then Perl (well, the OS) will be able to swap out all memory pages holding unrelated sub-hashes. This will be cheap because the number of swaps is minimized by the sorting. (see also Re: Small Hash a Gateway to Large Hash?)
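A sketch of that nested layout and the sorted walk; the "IP<TAB>url" record layout and the file name are again only assumptions:

    use strict;
    use warnings;

    my %tree;
    open my $fh, '<', 'input.txt' or die "input.txt: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($ip, $url) = split /\t/, $line;
        defined $url or next;
        my ($o1, $o2, $o3, $o4) = split /\./, $ip;
        $tree{$o1}{$o2}{$o3}{$o4}{$url}++;    # one hash level per octet
    }
    close $fh;

    # walk the tree in numeric order, one subtree at a time, so pages
    # holding unrelated subtrees can stay swapped out
    for my $o1 (sort { $a <=> $b } keys %tree) {
        for my $o2 (sort { $a <=> $b } keys %{ $tree{$o1} }) {
            # ... descend into $tree{$o1}{$o2} and report counts here
        }
    }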
An additional approach is using more compact data structures: hashes are efficient for sparse data, but if your keys cover the whole 0-255 octet range, an array is certainly more efficient.
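A tiny sketch of the array variant; the IP and URL here are just placeholder values:

    use strict;
    use warnings;

    my @tree;    # nested arrays instead of nested hashes
    my ($ip, $url) = ('192.168.101.208', 'logmeinrescue.com');
    my ($o1, $o2, $o3, $o4) = split /\./, $ip;
    $tree[$o1][$o2][$o3][$o4]{$url}++;    # octets 0-255 become array indices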
Furthermore, there is no point in repeating URLs like "logmeinrescue.com" in your array; counting them is more memory efficient.
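For instance, a counting hash instead of a growing array, here with inline sample data:

    use strict;
    use warnings;

    my %seen;
    while (my $url = <DATA>) {
        chomp $url;
        $seen{$url}++;    # one key per distinct URL, plus a counter
    }
    printf "%-20s %d\n", $_, $seen{$_} for sort keys %seen;

    __DATA__
    logmeinrescue.com
    logmeinrescue.com
    example.com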
Anyway, 800k lines of input doesn't sound heavy; I'm not sure we have the full picture (???)
Like choroba already said, preloading the input completely into memory sounds like a waste of resources; you should check how much that actually costs.
OTOH, if you decide to implement my initial idea of processing one IP range after the other, preloading will reduce IO if (and only if) everything fits into memory.
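For comparison, reading line by line instead of slurping; input.txt is again a hypothetical file name:

    use strict;
    use warnings;

    open my $fh, '<', 'input.txt' or die "input.txt: $!";
    while (my $line = <$fh>) {    # only one line in memory at a time
        chomp $line;
        # ... do the lookup / counting here
    }
    close $fh;

    # versus slurping, which keeps all 800k lines in RAM at once:
    # my @all_lines = <$fh>;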
Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery
°) I'm aware that 192.168.*.* is very common
Replies are listed 'Best First'.

Re^2: Hash Search is VERY slow
- by Tux (Canon) on Sep 30, 2021 at 07:59 UTC
- by LanX (Saint) on Sep 30, 2021 at 11:19 UTC
- by Tux (Canon) on Sep 30, 2021 at 11:54 UTC
- by LanX (Saint) on Sep 30, 2021 at 11:59 UTC
- by Tux (Canon) on Sep 30, 2021 at 13:07 UTC
- by soonix (Chancellor) on Sep 30, 2021 at 13:02 UTC