comment on

After writing everything below it occurred to me that your original solution might work just fine if you used a DB_File tied hash rather than a Perl in-memory hash. Roughly:

use DB_File;
my $hash;
tie %hash, DB_File, "first_file.hash";
while (<FIRST_FILE>)
{
  $hash{$_} = undef;
}

while (<SECOND_FILE>)
{
  print $_ if exists $hash{$_};
}
[download]

Below is what I wrote first of all, it's still interesting as it explains a bit about how to make your own hash.

Your idea of a hash is correct but as you say, Perl's internal hash is not suitable. Instead, you could build your own hash.

First you need a hash function. This could be fairly simple. For example just add up the ascii values of the characters in each line and then find the remaineder mod 256

my $h = 0;
foreach (split("", $line)) {
  $h+=ord($_)
}
$h = $h % 256;
[download]

with this you can assign each line to a particular bucket and hopefully there should be a fairly even spread over the 256 possible buckets (if not you can use something more complicated for your hash function). Another good one to use is part of a time value - if each line contains a timestamp then taking the seconds value or the minutes and seconds together might give you a fairly even spread.

The thing now is that 2 lines which have differing hash values must be different, there is no need to compare them character by character. If the hash values are the same you do have to compare the lines character by character . So comparing the hash value first eliminates a large amount of work.

What you have now is exactly the same as a Perl internal hash except that Perl's hash buckets are stored in memory, yours are stored on disk. Also, Perl uses a more complicated function to make sure that things are spread evenly around all the buckets.

There are a couple of different things you can do now.

You could make 2 directories, one for each big file. Each directory will have 256 files - 1 for each bucket. Then you go through the big files, writing each line out into it's correct bucket file. You'll be left with 256 pairs of files, something like

file1/0
file1/1
...
file1/255
file2/0
file2/1
...
file2/255
[download]

now you need to compare file1/0 and file2/0 to find common lines and then file1/1 and file2/1 and so on. Each file should be 256 times smaller than the orignal files so hopefully they're small enough that they can be compared easily using Perl's internal hash function. If dividing by 256 isn't enough then you might need a different hash function.

Another approach is to split just 1 of the files into buckets and then go through the other file line by line, figuring out which bucket each line would be in and then checking to see if it is. This will be pretty slow. To speed it up, instead of each bucket just being a flat file, you could make each bucket a DB_File file, then you can use DB_File's hash lookup to see if a particular line is there. Basically you're using a 2-layered hash.

In reply to Re: Efficient search through a huge dataset by fergal
in thread Efficient search through a huge dataset by johnnywang

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.