Morning everyone! I need some help indexing two large text files. Thanks in advance!

What I am trying to do: I have two large text files (file1 is 150M and file2 is 350M), and I want to use the key in file1 to find the associated value in file2. The two files look like this (fields are delimited by "*"):

file1 (150M):
key1*field2*field3
key2*field2*field3
...
keyn*field2*field3

file2 (350M):
key1*field2
key2*field2
...
keyn*field2

For each line of file1, I use the key as the index and look up the associated value in file2. If the key is found in file2, I update field2 in file1 with that value. My current solution is:

open my $if1, '<', $input_f1 or die "Can't open $input_f1: $!\n";
open my $if2, '<', $input_f2 or die "Can't open $input_f2: $!\n";
while (my $line = <$if1>) {  # Read each line of file1
    chomp($line);
    my ($key1, $vf1, $vf2) = split(/\*/, $line);
    $vf1 = ' ';        # Default when the key is not found in file2
    seek($if2, 0, 0);  # Rewind file2 to the beginning of the file
    while (my $line2 = <$if2>) {  # Read each line of file2
        chomp($line2);
        my ($key2, $value) = split(/\*/, $line2);
        if ($key1 eq $key2) {
            $vf1 = $value;
            last;      # Stop scanning once the key is found
        }
    }
}

Due to the size of the two files, I cannot load the contents of either file1 or file2 into a hash, so I have to process each file line by line. Because file2 is rescanned from the start for every line of file1, this takes far too long to run.
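
One direction I have wondered about, as a rough sketch only: if both files were first sorted by key (an assumption, since neither file is guaranteed to be sorted, and the sort order would have to match Perl's string comparison), a single merge pass over the two files could replace the nested loop. The helper name merge_update and the tiny sample data below are made up for illustration, and the sketch also assumes each key appears at most once in file2.

```perl
use strict;
use warnings;

# Hypothetical helper: merge two filehandles whose lines are ALREADY sorted
# by key in string order. Writes the updated file1 lines to $out.
sub merge_update {
    my ($fh1, $fh2, $out) = @_;
    my ($key2, $value);

    # Advance file2 by one line; $key2 becomes undef at end of file.
    my $next2 = sub {
        my $line2 = <$fh2>;
        if (defined $line2) {
            chomp $line2;
            ($key2, $value) = split /\*/, $line2, 2;
        } else {
            undef $key2;
        }
    };

    $next2->();
    while (my $line = <$fh1>) {
        chomp $line;
        my ($key1, $vf1, $vf2) = split /\*/, $line, 3;
        # Skip file2 keys that sort before the current file1 key.
        $next2->() while defined $key2 && $key2 lt $key1;
        # Matched value if the keys agree, otherwise a single space.
        $vf1 = (defined $key2 && $key2 eq $key1) ? ($value // '') : ' ';
        print {$out} join('*', $key1, $vf1, $vf2 // ''), "\n";
    }
}

# Made-up sample data, read through in-memory filehandles.
my $data1 = "a*x*1\nb*y*2\nd*z*3\n";
my $data2 = "a*A\nc*C\nd*D\n";
open my $in1, '<', \$data1 or die "Can't open in-memory file1: $!\n";
open my $in2, '<', \$data2 or die "Can't open in-memory file2: $!\n";
my $result = '';
open my $out, '>', \$result or die "Can't open in-memory output: $!\n";
merge_update($in1, $in2, $out);
close $out;
print $result;
```

Each file would be read only once this way, but I am not sure whether sorting files of this size first is practical.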

Any suggestions?

In reply to Indexing two large text files by never_more
