Re: Working with large amount of data
by moritz (Cardinal) on Sep 20, 2009 at 19:35 UTC
Do you need to count the number of different IPs, or also how often each of these IPs occurred?
If it's the former, you can create a bit vector, each bit corresponding to one IP address. An IPv4 address has 4 bytes or 32 bits, so you need 2**32/8 bytes = 0.5GB. See vec for a function that manipulates bit vectors in Perl.
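A minimal sketch of that bit-vector idea (assumes IPv4 and one address per log line; untested at this scale):

use strict;
use warnings;
use Socket qw(inet_aton);

my $seen   = "\0" x (2**32 / 8);   # one bit per possible IPv4 address = 512MB
my $unique = 0;

while (my $line = <>) {
    my ($ip) = $line =~ /(\d+\.\d+\.\d+\.\d+)/ or next;
    my $n = unpack 'N', inet_aton($ip);   # address as a 32-bit integer
    next if vec($seen, $n, 1);            # bit already set: seen before
    vec($seen, $n, 1) = 1;
    $unique++;
}
print "$unique unique IPs\n";

If your perl limits vec() offsets to 2**31 - 1, split the address space into two such vectors, one per half of the 32-bit range.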
You can also use on-disk storage such as GDBM_File, but that will probably slow down your program.
Perl 6 - links to (nearly) everything that is Perl 6.
| [reply] |
Re: Working with large amount of data
by merlyn (Sage) on Sep 20, 2009 at 21:55 UTC
Does this question strike anyone else as a bit odd? I mean, an organization dealing with more unique IPs than the entire world has in use, and yet has only a single machine with 1GB of memory to count these?
So I'm wondering which of those aren't accurate? Do they actually have Oracle sitting around, and the poster doesn't know how to get to it? Or is it more like a million IPs? Or is this homework?
It just doesn't fit. It doesn't.
-- Randal L. Schwartz, Perl hacker
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
| [reply] |
I think I read about this big file some time ago (can't find the link right now). It was a project by two guys: they ran a crawler that saved every GET response to a log file, trying to collect the default page (with HTTP headers) of every domain. As I recall, they talked about a 1TB file... but that file mostly contained page content, not IP address info.
Update: I have found the link: http://www.dotnetdotcom.org/ but it's only a 16GB file :-( So I was wrong, it's not the Internet index file.
| [reply] |
I'd suspect it's an interview question. A couple years back, Google and I talked about some open jobs and the (IIRC) second-round tech interview included a couple questions very similar to this one.
| [reply] |
Re: Working with large amount of data
by jwkrahn (Abbot) on Sep 20, 2009 at 19:32 UTC
Use a hash tied to a database and keep a running count when new keys are added.
Or perhaps convert IPs to integers and use a bit vector to keep track of unique IPs.
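For instance, a rough sketch of the tied-hash idea using DB_File (the file name and log format are assumptions):

use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);
use DB_File;

# On-disk hash: keys are dotted-quad IPs, values are running counts.
tie my %count, 'DB_File', 'ip_counts.db', O_RDWR|O_CREAT, 0644, $DB_BTREE
    or die "Cannot tie ip_counts.db: $!";

while (my $line = <>) {
    my ($ip) = $line =~ /(\d+\.\d+\.\d+\.\d+)/ or next;
    $count{$ip}++;
}
untie %count;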
| [reply] |
I meant that I need to count how many times each unique IP occurs in the log. But thank you, I think I should look in the direction of tying a hash to DB files.
| [reply] |
Actually, using an on-disk database will be quite inefficient!
Every entry will require, at minimum, 4 bytes to store the IP address plus 4 bytes for the counter, so 8 bytes times 1 billion entries comes to 8GB. Statistically, that means that for completely random access with a 1GB memory cache you get >85% cache misses... and that is an ideal scenario; in practice it will probably be one or two orders of magnitude worse!!!
Another (IMO better) approach:
- Divide the IPs into ranges that fit in the available memory. For instance, 512MB holds 128M 32-bit counters, so divide the 2**32 IP address space into 32 ranges (32 * 128M = 2**32).
- Read the log file and append each IP, as you find it, to one of 32 files, one per range.
- Process the range files one at a time, using a packed array (check Tie::Array::Packed, or use vec) to count the IP occurrences (rough sketch below).
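Something like this (untested sketch; assumes one IPv4 address per log line, and the temp-file names are made up):

use strict;
use warnings;
use Socket qw(inet_aton inet_ntoa);

my $RANGES     = 32;
my $RANGE_SIZE = 2**32 / $RANGES;             # 128M addresses per range

# Pass 1: scatter the IPs, as 32-bit integers, into one file per range.
my @fh;
open $fh[$_], '>', "range.$_.tmp" or die "open: $!" for 0 .. $RANGES - 1;
while (my $line = <>) {
    my ($ip) = $line =~ /(\d+\.\d+\.\d+\.\d+)/ or next;
    my $n = unpack 'N', inet_aton($ip);
    print { $fh[int($n / $RANGE_SIZE)] } "$n\n";
}
close $_ for @fh;

# Pass 2: one range at a time, count in a string of 128M 32-bit counters (512MB).
for my $r (0 .. $RANGES - 1) {
    my $counts = "\0" x (4 * $RANGE_SIZE);
    open my $in, '<', "range.$r.tmp" or die "open: $!";
    while (my $n = <$in>) {
        chomp $n;
        my $o = $n % $RANGE_SIZE;
        vec($counts, $o, 32) = vec($counts, $o, 32) + 1;
    }
    close $in;
    for my $o (0 .. $RANGE_SIZE - 1) {
        my $c = vec($counts, $o, 32) or next;
        printf "%s => %d\n", inet_ntoa(pack 'N', $r * $RANGE_SIZE + $o), $c;
    }
    unlink "range.$r.tmp";
}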
| [reply] [d/l] |
Re: Working with large amount of data
by salva (Canon) on Sep 20, 2009 at 19:40 UTC
1 billion is roughly 25% of the total number of possible IPs (2**32), so your best option would probably be to use a bit vector that will take 1/2 GB of memory (2 ** 32 / 8).
It seems that the biggest index that can be handled by Perl's vec() built-in is 2**31 - 1, so as a workaround you could use two vectors: one for IPs in the range 0 to 0x7fffffff, and another for 0x80000000 to 0xffffffff.
| [reply] [d/l] |
Re: Working with large amount of data
by BrowserUk (Patriarch) on Sep 21, 2009 at 08:43 UTC
There is an implicit assumption within most of the responses in this thread that your talk of "IP addresses" is confined to IPv4 addresses. Is this a correct assumption?
I ask because, if it is correct, it is eminently feasible to consider using direct addressing into a 16GB file to hold 32-bit counts for each of the 4G possible IPv4 addresses. This would require minimal memory, but reading-incrementing-writing 4 bytes within a 16GB file for each of 1 billion--essentially random--IPs discovered would be too costly to contemplate. And far more so if the reads and writes go through any kind of DB.
The secret to making this work efficiently would be to
- Memory map the 16GB file.
- Cache the IPs in memory as a sparse array (a hash in Perl's terms) as they are discovered, and only write them to disk when a given number of unique IPs have been counted.
Once the cache limit is reached, you sort the unique IPs numerically and then read-update-write largish (say 64K) lumps of the file. You then reset (undef) the hash and continue.
By using memory mapped files, you let the virtual/physical address mapping capabilities of the OS and hardware take care of accessing the appropriate chunk of the file efficiently.
By caching the counts, you can control the amount of memory the process uses, whilst avoiding the one-seek-per-IP scenario, which is the killer for disk-based solutions.
By sorting the accumulated IPs before reading and writing large chunks, you make the best possible use of the system's file caching and L1/L2/L3 caching, writing to real memory (or disk) only when necessary.
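A minimal sketch of that scheme, assuming File::Map is available, a 64-bit perl, and a 16GB zero-filled counts file created beforehand (e.g. with truncate); the file name, cache limit and log format are made up:

use strict;
use warnings;
use Socket qw(inet_aton);
use File::Map qw(map_file);

my $CACHE_LIMIT = 10_000_000;                # unique IPs to hold before flushing

map_file my $counts, 'ip_counts.bin', '+<';  # 2**32 * 4 bytes, memory mapped

my %cache;
while (my $line = <>) {
    my ($ip) = $line =~ /(\d+\.\d+\.\d+\.\d+)/ or next;
    $cache{ unpack 'N', inet_aton($ip) }++;
    flush() if keys %cache >= $CACHE_LIMIT;
}
flush();

sub flush {
    # Sorted, batched read-update-write keeps the page faults largely sequential.
    for my $n (sort { $a <=> $b } keys %cache) {
        my $c = unpack 'N', substr($counts, $n * 4, 4);
        substr($counts, $n * 4, 4) = pack 'N', $c + $cache{$n};
    }
    %cache = ();
}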
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
Re: Working with large amount of data
by graff (Chancellor) on Sep 20, 2009 at 19:37 UTC
Use a good database that enforces unique-key constraints, and use the IP addresses as keys. You may want to divide the problem into separate tables according to the first portion of the address: it's easy to manage 100 or so tables, and checking uniqueness will be easier/faster when there are fewer entries in a given table.
(And with the database solution, you have the option of actually storing something informative about each IP address, and more flexibility in pulling stuff out of the list once you're done with the log file, in case that's helpful to you.)
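A rough sketch of the single-table variant with DBI and DBD::SQLite (3.24+ for the UPSERT syntax); splitting into per-octet tables would just parameterize the table name, and all names here are illustrative:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=ips.db', '', '',
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do('CREATE TABLE IF NOT EXISTS ip_counts
          (ip TEXT PRIMARY KEY, n INTEGER NOT NULL)');

my $sth = $dbh->prepare(
    'INSERT INTO ip_counts (ip, n) VALUES (?, 1)
     ON CONFLICT(ip) DO UPDATE SET n = n + 1');

while (my $line = <>) {
    my ($ip) = $line =~ /(\d+\.\d+\.\d+\.\d+)/ or next;
    $sth->execute($ip);
    $dbh->commit unless $. % 100_000;        # batch commits for speed
}
$dbh->commit;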
UPDATE: CountZero's skepticism about my suggestions aside, I'll say now that I think my suggestions can be discarded in favor of the proposal described by BrowserUk elsewhere in this thread -- if his assumptions about the OP's task are valid, his plan is profoundly more effective, efficient and satisfying.
| [reply] |
"because it's easy to manage 100 or so tables, and handling the hashing (checking uniqueness) will be easier/faster when there's a smaller number of entries in a given table" I do not agree. Running queries over data spread across 100 tables is far more difficult (and slow) than running the same query on a single table. And isn't hashing supposed to work just as fast for large as for small address spaces? The hashing function should be O(1), if I remember well.
CountZero "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
| [reply] |
Hashing is O(1), but the constant involves a seek to memory. When the hash is too big to fit in RAM, that turns into a seek to disk. On a typical hard drive you should plan on an average of 0.01 seconds for the drive head to be correctly positioned for your read.
If you do a few billion lookups, each taking 0.01 seconds, you're talking about a year (3 billion * 0.01s is roughly 3*10**7 seconds, or about 350 days). That's usually not acceptable.
Splitting the problem up into a series of subproblems that fit in RAM is a huge performance win, despite the added complexity.
| [reply] |
Re: Working with large amount of data
by massa (Hermit) on Sep 21, 2009 at 10:56 UTC
#!/usr/bin/perl
use strict;
use warnings;
use Socket;

unlink for </tmp/prefix.*.ips>;

# Pass 1: scatter the IPs into one temp file per first octet, stored as
# 32-bit integers (decimal) so they read back and sort reliably.
my %prefixes;
while( <> ) {
    while( /((\d+)\.\d+\.\d+\.\d+)/g ) {
        open my $f, '>>', "/tmp/prefix.$2.ips" or die "open: $!";
        $prefixes{$2}++;
        print $f unpack('N', inet_aton($1)), "\n";
    }
}

# Pass 2: count each prefix file on its own, so only one first-octet's
# worth of addresses is in memory at a time.
for ( sort { $a <=> $b } keys %prefixes ) {
    my %addresses;
    open my $f, '<', "/tmp/prefix.$_.ips" or die "open: $!";
    while( <$f> ) {
        chomp;
        $addresses{$_}++;
    }
    printf "%-20.20s => %d\n", inet_ntoa(pack 'N', $_), $addresses{$_}
        for sort { $a <=> $b } keys %addresses;
}
unlink "/tmp/prefix.$_.ips" for keys %prefixes;
__END__
64.233.169.17 => 1
64.233.169.18 => 1
75.101.152.211 => 2
85.17.189.130 => 4
127.0.0.1 => 8
174.129.112.136 => 1
174.129.233.130 => 1
should work OK (tested here with far less than a billion IPs). If you think it's still hogging memory, you can just replace
    while( /((\d+)\.\d+\.\d+\.\d+)/g ) {
with
    while( /((\d+\.\d+)\.\d+\.\d+)/g ) {
and be done with it.
[]s, HTH, Massa (κς,πμ,πλ)
| [reply] [d/l] [select] |
Re: Working with large amount of data
by talexb (Chancellor) on Sep 21, 2009 at 14:28 UTC
You've received a bunch of good suggestions. Myself, I'd go for the database approach, but perhaps that's because I'm a database guy. But don't take my word for it.
Grab a small chunk of that terabyte file and push it through each of the suggested approaches, keeping track of speed and memory usage. Slowly increase the chunk size -- ideally you'll be able to see what kind of resource curve each approach will require.
Go with the approach that seems that it will work 'best' for your definition of 'best'. (If you have time, try each approach on the whole file and see what performance you get -- we'd love to hear the results.)
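A rough timing harness along those lines, assuming each approach has been wrapped in its own script and the sample files have already been cut from the big log (all names are made up); memory can be watched separately, e.g. with /usr/bin/time -v:

use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my @scripts = qw(bitvec.pl tied_hash.pl sort_count.pl);
my @samples = qw(sample_1G.log sample_10G.log);

for my $sample (@samples) {
    for my $script (@scripts) {
        my $t0 = [gettimeofday];
        system($^X, $script, $sample) == 0 or warn "$script failed: $?";
        printf "%-15s %-15s %8.1f s\n", $script, $sample, tv_interval($t0);
    }
}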
Final word: it's OK to be in love with a particular approach, but you have to be scientific about this kind of thing. Abandon emotion, and apply logic and measurement to the problem. The numbers don't lie. Much. :)
Alex / talexb / Toronto
"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds
| [reply] |
Re: Working with large amount of data
by ikegami (Patriarch) on Sep 22, 2009 at 15:30 UTC
Three steps:
- Convert the log into a sortable format, resolving domain names if necessary.
use Socket qw( inet_aton );
while (<>) {
    my $host = extract_host($_);                  # however you pull the host out of a log line
    print(unpack('H8', inet_aton($host)), $_);    # prepend the IP as fixed-width hex, keep the original line
}
- Sort the modified log using the Unix command-line utility sort.
- Count the dups in the sorted file:
my $last;
my $count;
while (<>) {
    my $ip = extract_ip($_);
    if (defined($last) && $last ne $ip) {
        print("$last: $count\n");
        $count = 0;
    }
    $last = $ip;
    ++$count;
}
if (defined($last)) {
    print("$last: $count\n");
}
The Perl bits use O(1) memory. The Unix sort utility will efficiently spill to disk when the data doesn't fit in memory.
Bonus: You are given the log entries that correspond to each ip address at basically no cost. If you don't need that, you can leave out that information from the file to be sorted.
| [reply] [d/l] [select] |