Re: Working with large amount of data
by moritz (Cardinal) on Sep 20, 2009 at 19:35 UTC
Do you need to count the number of different IPs, or also how often each of these IPs occurred?
If it's the former, you can create a bit vector, each bit corresponding to one IP address. An IPv4 address has 4 bytes or 32 bits, so you need 2**32/8 bytes = 0.5GB. See vec for a function that manipulates bit vectors in Perl.
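A minimal sketch of that bit-vector idea (assumes IPv4 and one address per log line; untested at this scale):

use strict;
use warnings;
use Socket qw(inet_aton);

my $seen   = "\0" x (2**32 / 8);   # one bit per possible IPv4 address = 512MB
my $unique = 0;

while (my $line = <>) {
    my ($ip) = $line =~ /(\d+\.\d+\.\d+\.\d+)/ or next;
    my $n = unpack 'N', inet_aton($ip);   # address as a 32-bit integer
    next if vec($seen, $n, 1);            # bit already set: seen before
    vec($seen, $n, 1) = 1;
    $unique++;
}
print "$unique unique IPs\n";

If your perl limits vec() offsets to 2**31 - 1, split the address space into two such vectors, one per half of the 32-bit range.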
You can also use on-disk storage such as GDBM_File, but that will probably slow down your program.
Perl 6 - links to (nearly) everything that is Perl 6.
| [reply] |
Re: Working with large amount of data
by merlyn (Sage) on Sep 20, 2009 at 21:55 UTC
Does this question strike anyone else as a bit odd? I mean, an organization dealing with more unique IPs than the entire world has in use, and yet has only a single machine with 1GB of memory to count these?
So I'm wondering which of those aren't accurate? Do they actually have Oracle sitting around, and the poster doesn't know how to get to it? Or is it more like a million IPs? Or is this homework?
It just doesn't fit. It doesn't.
-- Randal L. Schwartz, Perl hacker
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
| [reply] |
I think I read about this big file some time ago (can't find the link right now). It was a project by two guys: they ran a crawler that saved every GET response to a log file, trying to collect the default page (with HTTP headers) of every domain. As I recall, they talked about a 1TB file... but that file mostly contained page content, not IP address info.
Update: I have found the link: http://www.dotnetdotcom.org/ but it's only a 16GB file :-( So I was wrong, it's not the Internet index file.
| [reply] |
I'd suspect it's an interview question. A couple years back, Google and I talked about some open jobs and the (IIRC) second-round tech interview included a couple questions very similar to this one.
| [reply] |
Re: Working with large amount of data
by jwkrahn (Abbot) on Sep 20, 2009 at 19:32 UTC
Use a hash tied to a database and keep a running count when new keys are added.
Or perhaps convert IPs to integers and use a bit vector to keep track of unique IPs.
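For instance, a rough sketch of the tied-hash idea using DB_File (the file name and log format are assumptions):

use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);
use DB_File;

# On-disk hash: keys are dotted-quad IPs, values are running counts.
tie my %count, 'DB_File', 'ip_counts.db', O_RDWR|O_CREAT, 0644, $DB_BTREE
    or die "Cannot tie ip_counts.db: $!";

while (my $line = <>) {
    my ($ip) = $line =~ /(\d+\.\d+\.\d+\.\d+)/ or next;
    $count{$ip}++;
}
untie %count;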
| [reply] |
I meant that I need to count how many times each unique IP occurs in the log. But thank you, I think I should look in the direction of tying a hash to DB files.
| [reply] |
Actually, using an on-disk database will be quite inefficient!
Every entry will require, at minimum, 4 bytes to store the IP address plus 4 bytes for the counter, so 8 bytes times 1 billion entries comes to 8GB. Statistically, that means that for completely random access with a 1GB memory cache you get >85% cache misses... and that is an ideal scenario; in practice it will probably be one or two orders of magnitude worse!!!
Another (IMO better) approach:
- Divide the IPs into ranges that fit in the available memory. For instance, 512MB holds 128M 32-bit counters, so divide the 2**32 IP address space into 32 ranges (32 * 128M = 2**32).
- Read the log file and append each IP, as you find it, to one of 32 files, one per range.
- Process the range files one at a time, using a packed array (check Tie::Array::Packed, or use vec) to count the IP occurrences (rough sketch below).
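Something like this (untested sketch; assumes one IPv4 address per log line, and the temp-file names are made up):

use strict;
use warnings;
use Socket qw(inet_aton inet_ntoa);

my $RANGES     = 32;
my $RANGE_SIZE = 2**32 / $RANGES;             # 128M addresses per range

# Pass 1: scatter the IPs, as 32-bit integers, into one file per range.
my @fh;
open $fh[$_], '>', "range.$_.tmp" or die "open: $!" for 0 .. $RANGES - 1;
while (my $line = <>) {
    my ($ip) = $line =~ /(\d+\.\d+\.\d+\.\d+)/ or next;
    my $n = unpack 'N', inet_aton($ip);
    print { $fh[int($n / $RANGE_SIZE)] } "$n\n";
}
close $_ for @fh;

# Pass 2: one range at a time, count in a string of 128M 32-bit counters (512MB).
for my $r (0 .. $RANGES - 1) {
    my $counts = "\0" x (4 * $RANGE_SIZE);
    open my $in, '<', "range.$r.tmp" or die "open: $!";
    while (my $n = <$in>) {
        chomp $n;
        my $o = $n % $RANGE_SIZE;
        vec($counts, $o, 32) = vec($counts, $o, 32) + 1;
    }
    close $in;
    for my $o (0 .. $RANGE_SIZE - 1) {
        my $c = vec($counts, $o, 32) or next;
        printf "%s => %d\n", inet_ntoa(pack 'N', $r * $RANGE_SIZE + $o), $c;
    }
    unlink "range.$r.tmp";
}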
| [reply] [d/l] |
Re: Working with large amount of data
by salva (Canon) on Sep 20, 2009 at 19:40 UTC
1 billion is roughly 25% of the total number of possible IPs (2**32), so your best option would probably be to use a bit vector that will take 1/2 GB of memory (2 ** 32 / 8).
It seems that the biggest index that can be handled by Perl's vec() built-in is 2**31 - 1, so as a workaround you could use two vectors: one for IPs in the range 0 to 0x7fffffff, and another for 0x80000000 to 0xffffffff.
| [reply] [d/l] |
Re: Working with large amount of data
by BrowserUk (Patriarch) on Sep 21, 2009 at 08:43 UTC
There is an implicit assumption within most of the responses in this thread that your talk of "IP addresses" is confined to IPv4 addresses. Is this a correct assumption?
I ask because, if it is correct, it is eminently feasible to consider using direct addressing into a 16GB file to hold 32-bit counts for each of the 4G possible IPv4 addresses. This would require minimal memory, but reading-incrementing-writing 4 bytes within a 16GB file for each of 1 billion--essentially random--IPs discovered would be too costly to contemplate. And far more so if the reads and writes go through any kind of DB.
The secret to making this work efficiently would be to
- Memory map the 16GB file.
- Cache the IPs in memory as a sparse array (a hash in Perl's terms) as they are discovered, and only write them to disk when a given number of unique IPs have been counted.
Once the cache limit is reached, you sort the unique IPs numerically and then read-update-write largish (say 64K) lumps of the file. You then reset (undef) the hash and continue.
By using memory mapped files, you let the virtual/physical address mapping capabilities of the OS and hardware take care of accessing the appropriate chunk of the file efficiently.
By caching the counts, you can control the amount of memory the process uses, whilst avoiding the one-seek-per-IP scenario, which is the killer for disk-based solutions.
By sorting the accumulated IPs before reading and writing large chunks, you make the best possible use of the system's file caching and L1/L2/L3 caching, writing to real memory (or disk) only when necessary.
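A minimal sketch of that scheme, assuming File::Map is available, a 64-bit perl, and a 16GB zero-filled counts file created beforehand (e.g. with truncate); the file name, cache limit and log format are made up:

use strict;
use warnings;
use Socket qw(inet_aton);
use File::Map qw(map_file);

my $CACHE_LIMIT = 10_000_000;                # unique IPs to hold before flushing

map_file my $counts, 'ip_counts.bin', '+<';  # 2**32 * 4 bytes, memory mapped

my %cache;
while (my $line = <>) {
    my ($ip) = $line =~ /(\d+\.\d+\.\d+\.\d+)/ or next;
    $cache{ unpack 'N', inet_aton($ip) }++;
    flush() if keys %cache >= $CACHE_LIMIT;
}
flush();

sub flush {
    # Sorted, batched read-update-write keeps the page faults largely sequential.
    for my $n (sort { $a <=> $b } keys %cache) {
        my $c = unpack 'N', substr($counts, $n * 4, 4);
        substr($counts, $n * 4, 4) = pack 'N', $c + $cache{$n};
    }
    %cache = ();
}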
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
Re: Working with large amount of data
by graff (Chancellor) on Sep 20, 2009 at 19:37 UTC
Use a good database that enforces unique-key constraints, and use the IP addresses as keys. You may want to divide the problem into separate tables according to the first portion of the address: it's easy to manage 100 or so tables, and checking uniqueness will be easier/faster when there are fewer entries in a given table.
(And with the database solution, you have the option of actually storing something informative about each IP address, and more flexibility in pulling stuff out of the list once you're done with the log file, in case that's helpful to you.)
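A rough sketch of the single-table variant with DBI and DBD::SQLite (3.24+ for the UPSERT syntax); splitting into per-octet tables would just parameterize the table name, and all names here are illustrative:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=ips.db', '', '',
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do('CREATE TABLE IF NOT EXISTS ip_counts
          (ip TEXT PRIMARY KEY, n INTEGER NOT NULL)');

my $sth = $dbh->prepare(
    'INSERT INTO ip_counts (ip, n) VALUES (?, 1)
     ON CONFLICT(ip) DO UPDATE SET n = n + 1');

while (my $line = <>) {
    my ($ip) = $line =~ /(\d+\.\d+\.\d+\.\d+)/ or next;
    $sth->execute($ip);
    $dbh->commit unless $. % 100_000;        # batch commits for speed
}
$dbh->commit;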
UPDATE: CountZero's skepticism about my suggestions aside, I'll say now that I think my suggestions can be discarded in favor of the proposal described by BrowserUk elsewhere in this thread -- if his assumptions about the OP's task are valid, his plan is profoundly more effective, efficient and satisfying.
| [reply] |
"because it's easy to manage 100 or so tables, and handling the hashing (checking uniqueness) will be easier/faster when there's a smaller number of entries in a given table" I do not agree. Running queries over data spread across 100 tables is far more difficult (and slow) than running the same query on a single table. And isn't hashing supposed to work just as fast for large as for small address spaces? The hashing function should be O(1), if I remember well.
CountZero "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
| [reply] |
Hashing is O(1), but the constant involves a seek to memory. When the hash is too big to fit in RAM, that turns into a seek to disk. On a typical hard drive you should plan on an average of 0.01 seconds for the drive head to be correctly positioned for your read.
If you do a few billion lookups, each taking 0.01 seconds, you're talking about a year (3 billion * 0.01s is roughly 3*10**7 seconds, or about 350 days). That's usually not acceptable.
Splitting the problem up into a series of subproblems that fit in RAM is a huge performance win, despite the added complexity.
| [reply] |
Re: Working with large amount of data
by massa (Hermit) on Sep 21, 2009 at 10:56 UTC
#!/usr/bin/perl
use strict;
use warnings;
use Socket;

unlink for </tmp/prefix.*.ips>;

# Pass 1: scatter the IPs into one temp file per first octet, stored as
# 32-bit integers (decimal) so they read back and sort reliably.
my %prefixes;
while( <> ) {
    while( /((\d+)\.\d+\.\d+\.\d+)/g ) {
        open my $f, '>>', "/tmp/prefix.$2.ips" or die "open: $!";
        $prefixes{$2}++;
        print $f unpack('N', inet_aton($1)), "\n";
    }
}

# Pass 2: count each prefix file on its own, so only one first-octet's
# worth of addresses is in memory at a time.
for ( sort { $a <=> $b } keys %prefixes ) {
    my %addresses;
    open my $f, '<', "/tmp/prefix.$_.ips" or die "open: $!";
    while( <$f> ) {
        chomp;
        $addresses{$_}++;
    }
    printf "%-20.20s => %d\n", inet_ntoa(pack 'N', $_), $addresses{$_}
        for sort { $a <=> $b } keys %addresses;
}
unlink "/tmp/prefix.$_.ips" for keys %prefixes;
__END__
64.233.169.17 => 1
64.233.169.18 => 1
75.101.152.211 => 2
85.17.189.130 => 4
127.0.0.1 => 8
174.129.112.136 => 1
174.129.233.130 => 1
should work OK (tested here with far less than a billion IPs). If you think it's still hogging memory, you can just replace
    while( /((\d+)\.\d+\.\d+\.\d+)/g ) {
with
    while( /((\d+\.\d+)\.\d+\.\d+)/g ) {
and be done with it.
[]s, HTH, Massa (κς,πμ,πλ)
| [reply] [d/l] [select] |
Re: Working with large amount of data
by talexb (Chancellor) on Sep 21, 2009 at 14:28 UTC
You've received a bunch of good suggestions. Myself, I'd go for the database approach, but perhaps that's because I'm a database guy. But don't take my word for it.
Grab a small chunk of that terabyte file and push it through each of the suggested approaches, keeping track of speed and memory usage. Slowly increase the chunk size -- ideally you'll be able to see what kind of resource curve each approach will require.
Go with the approach that seems that it will work 'best' for your definition of 'best'. (If you have time, try each approach on the whole file and see what performance you get -- we'd love to hear the results.)
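A rough timing harness along those lines, assuming each approach has been wrapped in its own script and the sample files have already been cut from the big log (all names are made up); memory can be watched separately, e.g. with /usr/bin/time -v:

use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my @scripts = qw(bitvec.pl tied_hash.pl sort_count.pl);
my @samples = qw(sample_1G.log sample_10G.log);

for my $sample (@samples) {
    for my $script (@scripts) {
        my $t0 = [gettimeofday];
        system($^X, $script, $sample) == 0 or warn "$script failed: $?";
        printf "%-15s %-15s %8.1f s\n", $script, $sample, tv_interval($t0);
    }
}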
Final word: it's OK to be in love with a particular approach, but you have to be scientific about this kind of thing. Abandon emotion, and apply logic and measurement to the problem. The numbers don't lie. Much. :)
Alex / talexb / Toronto
"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds
| [reply] |
Re: Working with large amount of data
by ikegami (Patriarch) on Sep 22, 2009 at 15:30 UTC
Three steps:
- Convert the log into a sortable format, resolving domain names if necessary.
use Socket qw( inet_aton );
while (<>) {
    my $host = extract_host($_);                  # however you pull the host out of a log line
    print(unpack('H8', inet_aton($host)), $_);    # prepend the IP as fixed-width hex, keep the original line
}
- Sort the modified log using the Unix command-line utility sort.
- Count the dups in the sorted file:
my $last;
my $count;
while (<>) {
    my $ip = extract_ip($_);
    if (defined($last) && $last ne $ip) {
        print("$last: $count\n");
        $count = 0;
    }
    $last = $ip;
    ++$count;
}
if (defined($last)) {
    print("$last: $count\n");
}
The Perl bits use O(1) memory. The Unix sort utility will efficiently spill to disk when the data doesn't fit in memory.
Bonus: You are given the log entries that correspond to each ip address at basically no cost. If you don't need that, you can leave out that information from the file to be sorted.
| [reply] [d/l] [select] |