
The %seen hash is growing to the point that your memory is saturated.

My initial impression is that this is probably a good task for a database -- database engines know how to manage, index, and search large data sets. But I also had another thought. A file-based approach (discussed below) is going to have many performance drawbacks, but one drawback it will not have is running out of RAM.

What if, instead of storing the $key followed by a list of $value entries in each hash element, you wrote the $key into its own file named by $key1, and then appended each $value to that file? Here's an example:

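$/ = "";    # paragraph mode: records are separated by blank lines
while (<>) {
    chomp;
    my ($k, $v) = split /\t/;
    my $k1 = (split /\n/, $k, 3)[1];    # second line of the key names the file
    # Write the key only when the file is first created; afterwards append values only.
    my $out = -e $k1 ? "$v\n" : "$k:$v\n";
    open my $ofh, '>>', $k1 or die $!;
    print $ofh $out;
    close $ofh or die $!;
}

# Assumes the current directory holds only the per-key files.
foreach my $file (sort glob('./*')) {
    open my $ifh, '<', $file or die $!;
    # ... and so on...
}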

This approach uses the storage device as the %seen hash, with each file acting as one element. Each file stores exactly what you were pushing into your %seen elements, so you won't get bogged down by RAM. On the other hand, you will become IO bound, and opening and closing a file on every pass through the loop is sadly inefficient. But at least you won't be spending all your time in swap. In other words, you were IO bound anyway; this way you control what gets committed to storage and when it is read back.
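One way to soften the open/close cost is to keep a bounded pool of append handles open and reuse them. What follows is only a sketch under assumptions: the helper names append_to_file and close_all_files, and the limit in $MAX_OPEN, are my own inventions for illustration, and the eviction rule is simply "close whichever file was opened first", not a true least-recently-used cache.

```perl
use strict;
use warnings;

# Hypothetical helpers (not part of the original code): keep a bounded
# number of append handles open so repeated writes to the same key file
# do not reopen it on every record.
my $MAX_OPEN = 100;   # assumed limit; keep it below the OS per-process cap
my %handle_for;       # filename => open filehandle
my @open_order;       # filenames in the order they were opened

sub append_to_file {
    my ($file, $text) = @_;
    my $fh = $handle_for{$file};
    if (!$fh) {
        # Cache is full: close the handle that was opened first.
        if (@open_order >= $MAX_OPEN) {
            my $oldest = shift @open_order;
            my $old_fh = delete $handle_for{$oldest};
            close $old_fh or die "close $oldest: $!";
        }
        open $fh, '>>', $file or die "open $file: $!";
        $handle_for{$file} = $fh;
        push @open_order, $file;
    }
    print {$fh} $text or die "print $file: $!";
    return;
}

sub close_all_files {
    # Flush and close everything still cached; call this before reading back.
    for my $file (@open_order) {
        close $handle_for{$file} or die "close $file: $!";
    }
    %handle_for = ();
    @open_order = ();
    return;
}
```

Inside the loop you would call append_to_file($k1, $out) in place of the open/print/close trio, and call close_all_files() once before the read-back pass. Whether this helps much depends on how often the same key file recurs close together in the input.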

If this is too complex to work out (it does seem like a lot of effort, after all), you might just decide it's better to commit your data to an SQL-based database, where you can set up indices and perform queries, letting the database implementation worry about how to manage your memory and storage resources. If the job is only being run once, committing everything to a database is probably not efficient either. But if you anticipate running queries on the data set (even an ever-expanding data set) more than once, the database approach is likely optimal.
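For a rough idea of what the database route could look like, here is a minimal sketch using DBI with SQLite. It assumes DBI and DBD::SQLite are installed (neither ships with core Perl), and the table name, column names, and helper subs are all hypothetical; your real schema would depend on the data.

```perl
use strict;
use warnings;
use DBI;

# Assumption: DBI and DBD::SQLite are installed. The table "seen", its
# columns k and v, and the helper names are invented for illustration.
sub open_seen_db {
    my ($path) = @_;
    my $dbh = DBI->connect("dbi:SQLite:dbname=$path", '', '',
        { RaiseError => 1, AutoCommit => 1 });
    $dbh->do('CREATE TABLE IF NOT EXISTS seen (k TEXT NOT NULL, v TEXT NOT NULL)');
    # The index is what makes per-key lookups cheap on a large table.
    $dbh->do('CREATE INDEX IF NOT EXISTS seen_k ON seen (k)');
    return $dbh;
}

sub store_pairs {
    my ($dbh, @pairs) = @_;    # each pair is [ $key, $value ]
    my $sth = $dbh->prepare('INSERT INTO seen (k, v) VALUES (?, ?)');
    # One transaction per batch avoids a disk sync for every row.
    $dbh->begin_work;
    $sth->execute(@$_) for @pairs;
    $dbh->commit;
    return;
}

sub values_for {
    my ($dbh, $key) = @_;
    my $vals = $dbh->selectcol_arrayref(
        'SELECT v FROM seen WHERE k = ? ORDER BY rowid', undef, $key);
    return @$vals;
}
```

You would call open_seen_db once with a file name of your choosing, feed store_pairs batches of [$k, $v] pairs from the read loop, and later pull everything recorded for one key with values_for($dbh, $k).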

Dave

In reply to Re: comparing multiple files by davido
in thread comparing multiple files by umaykulsum
