never_more has asked for the wisdom of the Perl Monks concerning the following question:

Morning everyone! I need some help with indexing two large text files, thanks in advance!

What I am trying to do: I have two large text files (file1 is 150M and file2 is 350M), and I want to use the key from file1 to find the associated value in file2. The formats of file1 and file2 look like this (fields are delimited by "*"):

file1(150M):
key1*field2*field3
key2*field2*field3
...
keyn*field2*field3

file2 (350M):
key1*field2
key2*field2
...
keyn*field2

For each line of file1, use the key as the index to find the associated value in file2. If the key is found in file2, update field2 in file1. My current solution is:

open my $if1, '<', $input_f1 or die "Can't open $input_f1: $!\n";
open my $if2, '<', $input_f2 or die "Can't open $input_f2: $!\n";
while (<$if1>) {            # Read each line of file1
    my $line = $_;
    chomp($line);
    my ($key1, $vf1, $vf2) = split(/\*/, $line);
    seek($if2, 0, 0);       # Make sure file handle points to the beginning of the file
    while (<$if2>) {        # Read each line of file2
        my $line2 = $_;
        chomp($line2);
        my ($key2, $value) = split(/\*/, $line2);
        if ($key1 eq $key2) {
            $vf1 = $value;
        } else {
            $vf1 = ' ';
        }
    }
}

Due to the size of the two files, I cannot load either file1's or file2's contents into a hash, so I have to process each file line by line. And this is taking too much time to run.

Any suggestions?

Replies are listed 'Best First'.
Re: Indexing two large text files
by MidLifeXis (Monsignor) on Apr 09, 2012 at 13:21 UTC

    A couple of questions on the structures of your files:

    • Are the source files sorted by the key value? If so, you may be able to use some sort of binary search algorithm.
    • Is the data static? Perhaps some other storage format for the file (database, DBM::Deep, etc) might be a 'better' way of storing the data. Additionally, if a text format is the 'correct' way of storing the data, perhaps pre-sorting it would provide the ability to use a more efficient search algorithm.
    • Are the records fixed size? If so, it makes the binary search (above) easier to implement, otherwise you will need to possibly index your text files. You would do that by reading through one of the files and recording the index (tell) of where each key is found in the file, and use that to read only the single line needed back into memory (seek). If the number of lines in the data files is significantly large, it is possible that even your indexes will exhaust your available memory.
    • If both files are sorted by the keys, you can just step through them in tandem, skipping records that are missing from one or the other until you reach the end of the data. This would enable you to keep only one line from each file in memory at a time, and make only a single pass through the data files (see the sketch after this list).
    • If you are on unix, this could be accomplished with sort and join without the memory constraints, given sufficient disk space.
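
    To make the tandem idea concrete, here is a minimal sketch, assuming both files are already sorted on the first '*'-delimited field in plain string order and use the layout from the question; unmatched file1 records are passed through unchanged, and updated records go to STDOUT. This is an illustration of the approach, not code tested against the real data:

    #!/usr/bin/perl
    use strict;
    use warnings;

    open my $f1, '<', 'file1' or die "Can't open file1: $!";
    open my $f2, '<', 'file2' or die "Can't open file2: $!";

    my $line2 = <$f2>;                     # current record from file2

    while ( my $line1 = <$f1> ) {
        chomp $line1;
        my ( $key1, $vf1, $vf2 ) = split /\*/, $line1;

        # Advance file2 until its key catches up with the current file1 key.
        while ( defined $line2 ) {
            chomp $line2;
            my ( $key2, $value ) = split /\*/, $line2;
            if    ( $key2 lt $key1 ) { $line2 = <$f2>; next; }   # file2 key too small: skip it
            elsif ( $key2 eq $key1 ) { $vf1 = $value; }          # match: take file2's value
            last;                                                # either matched or overshot
        }

        print join( '*', $key1, $vf1, $vf2 ), "\n";
    }

    This makes a single pass over each file with only one line from each in memory at a time, which is the O(N) behaviour described below.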

    The way you currently have this implemented is approximately O(N**2) (assuming that the numbers of lines in the two files are approximately equal). For data files on disk, this is not a good situation. Sorted data files can reduce this to O(N), which is about as good as you are going to get.

    Update: Reread the original code, saw what it was actually doing rather than the apparent intent of what it should do:

    open my $if1, '<', $input_f1 or die "Can't open $input_f1: $!\n";
    open my $if2, '<', $input_f2 or die "Can't open $input_f2: $!\n";
    while (<$if1>) {            # Read each line of file1
        my $line = $_;
        chomp($line);
        my ($key1, $vf1, $vf2) = split(/\*/, $line);
        seek($if2, 0, 0);       # Make sure file handle points to the beginning of the file
        while (<$if2>) {        # Read each line of file2
            my $line2 = $_;
            chomp($line2);
            my ($key2, $value) = split(/\*/, $line2);
            if ($key1 eq $key2) {
                $vf1 = $value;
            ############ <strike>
            # } else {
            #     $vf1 = ' ';
            ############ </strike>
            }
        }
        ############ <add>
        print join( '*', $key1, $vf1, $vf2 ), "\n";
        ############ </add>
    }

    The inner loop does not quite do what you state you want to do. A value only survives the inner loop when the key happens to match the last record of file2; on every other file2 record you are clearing $vf1 back to ' ', and nothing is ever output. Strike out the marked section and add the marked print, and I think your script's logic will be correct, although it still may not run quickly on larger data sets.

    --MidLifeXis

      Thanks MidLifeXis for your quick response. Here are the answers to your questions:

      • file2 is sorted by the key value, but I think I can sort file1 before processing it.
      • Depending on the date, the contents of file1 and file2 might change. As for DBM::Deep: because it is an old Unix environment, the module is not available, and I don't have permission to install it. :(
      • Yes, these records are fixed size, so I am gonna try binary search (see the sketch below).
      • This sounds like a good idea, trying...
      • sort and join would be my last choice :)
      Thanks again
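
      A minimal sketch of the binary-search idea over the sorted, fixed-size records of file2, offered as an illustration rather than code from the thread: the record length $RECLEN, and the assumption that every record (newline included) is padded to exactly that many bytes, are hypothetical.

      use strict;
      use warnings;

      # Hypothetical fixed record length for file2, newline included.
      my $RECLEN = 32;

      open my $f2, '<', 'file2' or die "Can't open file2: $!";
      my $nrecs = int( ( -s $f2 ) / $RECLEN );   # number of fixed-size records

      # Return field2 for $key, or undef if the key is not present.
      sub lookup {
          my ($key) = @_;
          my ( $lo, $hi ) = ( 0, $nrecs - 1 );
          while ( $lo <= $hi ) {
              my $mid = int( ( $lo + $hi ) / 2 );
              seek $f2, $mid * $RECLEN, 0 or die "seek: $!";
              read $f2, my $rec, $RECLEN;
              chomp $rec;
              my ( $k, $value ) = split /\*/, $rec;
              $value =~ s/\s+$// if defined $value;   # strip any padding
              if    ( $k lt $key ) { $lo = $mid + 1 }
              elsif ( $k gt $key ) { $hi = $mid - 1 }
              else                 { return $value }
          }
          return;                                     # not found
      }

      Each lookup then costs O(log N) seeks into file2 instead of a full scan per file1 line.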
Re: Indexing two large text files
by BrowserUk (Patriarch) on Apr 09, 2012 at 16:56 UTC

    You can do this in multiple passes by reading a batch of the records from file2 into a hash and then processing file1 against them. Any records from file1 that are matched get written to the new file, modified. Any that don't match get written to a spill file.

    When you've processed all the records from file1 against the first batch, you then close it; read the next batch from file2; and process the spill file against the new batch. Rinse and repeat until there are no records left in file2. At that point, any records left in the spill file have no matches and can be output to the new file unmodified.

    This sounds complex, but actually codes fairly simply. One thing to be aware of is that file1 will be overwritten in the processing. If you need to retain it, copy it first.

    UPDATE{ 18:06 GMT }: DO NOT USE! Has a bug! .................

    UPDATE{ 18:26 GMT }: BUG FIXED!

    Here's the code:

    #! perl -slw
    use strict;

    our $N //= 1e6;

    my $file1 = 'file1';
    my $spill = 'temp';

    open FILE2, '<', 'file2' or die $!;

    my $done = 0;
    until( $done ) {
        ## Index a batch of records from file 2
        my %idx;
        for( 1 .. $N ) {
            chomp( my @bits = split /\*/, <FILE2> );
            $idx{ $bits[ 0 ] } = $bits[ 1 ];
            $done = 1, last if eof FILE2;
        }

        open FILE1, '<', $file1 or die $!;
        open SPILL, '>', $spill or die $!;

        ## process records from file1 (or the spill file)
        while( <FILE1> ) {
            chomp( my @bits = split /\*/ );

            ## If it exists in this batch, write the modified record to STDOUT
            if( exists $idx{ $bits[ 0 ] } ) {
                print join '*', $bits[ 0 ], $idx{ $bits[ 0 ] }, $bits[ 2 ];
            }
            else {
                ## otherwise put it in the spill file for next pass
                local $\;
                print SPILL;
            }
        }
        close SPILL;
        close FILE1;

        ## switch the input(file1) and spill file names
        ( $file1, $spill ) = ( $spill, $file1 ) unless $done;
    }

    ## Anything left in the spill file goes through unmodified
    local $\;
    open SPILL, '<', $spill;
    print while <SPILL>;
    close SPILL;

    ## Remove the temporaries
    unlink $spill;
    unlink $file1;

    To use it: theScript -N=1e6 > theNewFile. You can adjust -N=1e6 to increase the number of file2 records processed in each pass, thus reducing the number of passes. A 32-bit Perl will probably be able to handle at least 5e6 per pass.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      Thanks BrowserUk, your solution is very useful. Processing records in batches speeds up the execution tremendously. My old program ran for 30+ minutes, but with your method it takes less than 2 minutes! Thanks again for the help!
Re: Indexing two large text files
by Anonymous Monk on Apr 09, 2012 at 13:23 UTC
      Thanks, I will give it a try! :)
Re: Indexing two large text files
by bimleshsharma (Beadle) on Apr 09, 2012 at 13:29 UTC

    OK, you can put those lines into a temp file and, after the indexing is over, replace file1 with the temp file. For example:

    open my $f1, '<', 'file1.txt' or die $!;
    open my $f2, '<', 'file2.txt' or die $!;
    open my $tf, '>', 'temp.txt'  or die $!;
    while ( my $line1 = <$f1> ) {
        chomp $line1;
        my $key1 = ( split /\*/, $line1 )[0];
        while ( my $line2 = <$f2> ) {
            chomp $line2;
            my $key2 = ( split /\*/, $line2 )[0];
            print $tf "$line2\n" if $key1 eq $key2;   # write the matching file2 record
        }
        print $tf "$line1\n";                         # keep the file1 record as well
    }
    close $tf;
    close $f1;
    close $f2;
    rename 'temp.txt', 'file1.txt';                   # replace file1 with the temp file
      Thanks bimleshsharma, but even writing to a file, the complexity is still O(N**2), isn't it? And when I reach the end of file2, I think I still need to move file2's handle back to the beginning of file2?
Re: Indexing two large text files
by sundialsvc4 (Abbot) on Apr 09, 2012 at 18:16 UTC

    Also note that you can prepare the hash so that it contains, as the “value” that is associated with the key, the position of the start of the record in the file and maybe also the length of that record.   Thus, even if the file itself is “large” (and by modern standards, 350M really isn’t), you are only keeping track of the keys and the location of the rest of it ... which you can retrieve as needed.
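
    A minimal sketch of that, assuming file2 uses the key*field2 layout from the question; the hash holds each key's byte offset (from tell), and seek pulls the record back only when it is needed:

    use strict;
    use warnings;

    open my $f2, '<', 'file2' or die "Can't open file2: $!";

    # Build the index: key => byte offset of that record within file2.
    my %offset;
    while ( 1 ) {
        my $pos  = tell $f2;               # position before reading the line
        my $line = <$f2>;
        last unless defined $line;
        my ( $key ) = split /\*/, $line;
        $offset{ $key } = $pos;
    }

    # Later: fetch field2 for a given key without rescanning the file.
    sub fetch {
        my ( $key ) = @_;
        return unless exists $offset{ $key };
        seek $f2, $offset{ $key }, 0 or die "seek: $!";
        my $rec = <$f2>;
        chomp $rec;
        return ( split /\*/, $rec )[ 1 ];
    }

    Whether that actually saves much memory over storing the records themselves is taken up in the reply below.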

    If you intend to do this frequently, consider also using, say, an SQLite database (file...) for all or part of the task, which gives you (for example) the easy option of indexing the data in several different ways at the same time, all of them ultimately using the “key + starting location + size of record” strategy.   A single sequential pass through the file is enough to (re-)load these indexes, and now you can use SQL JOINs to start asking questions without doing any further “programming, per se.”   The file being indexed, and the records therein, could be huge, and the index-databases would remain small and of course, persistent.
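
    As a rough sketch of that route, assuming DBI and DBD::SQLite are actually available on the box (which, given the constraints mentioned earlier in the thread, they may not be), and with illustrative table and column names:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=index.db', '', '',
                            { RaiseError => 1, AutoCommit => 0 } );

    $dbh->do( 'CREATE TABLE IF NOT EXISTS f2_idx ( rec_key TEXT PRIMARY KEY, pos INTEGER, len INTEGER )' );

    # One sequential pass through file2 (re-)loads the index.
    open my $f2, '<', 'file2' or die "Can't open file2: $!";
    my $ins = $dbh->prepare( 'INSERT OR REPLACE INTO f2_idx ( rec_key, pos, len ) VALUES ( ?, ?, ? )' );
    while ( 1 ) {
        my $pos  = tell $f2;
        my $line = <$f2>;
        last unless defined $line;
        my ( $key ) = split /\*/, $line;
        $ins->execute( $key, $pos, length $line );
    }
    $dbh->commit;

    # Later: look up where a record lives, then seek/read it from file2.
    my ( $pos, $len ) = $dbh->selectrow_array(
        'SELECT pos, len FROM f2_idx WHERE rec_key = ?', undef, 'key1' );

    Load a similar index table for file1 and the matching itself becomes a single SQL JOIN, as described above.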

    Naturally, it depends entirely on what you intend to do ... on what part of the technical trade-off you intend to emphasize, vs. what price you can afford to pay to get it.

      you can prepare the hash so that it contains, as the “value” that is associated with the key, the position of the start of the record in the file and maybe also the length of that record.

      The problem with that is, very little of the size of a hash in memory is occupied by its values. And the internal overhead associated with each value is such that until the value gets to be over something like 16 bytes, there is no saving in storing a shorter value.

      C:\test>p1
      $h{ $_ } = 'xxxxxxxx' for 1 .. 1e6;;
      print total_size( \%h );;
      112277576

      C:\test>p1
      $h{ $_ } = 1 for 1 .. 1e6;;
      print total_size( \%h );;
      72277576

      So unless the records are particularly long, there is little savings to be made by storing the file offset over storing the record itself.

      I'm using a 64-bit perl, which means that integers occupy 8 bytes, which flatters my point somewhat, but even on a 32-bit perl, the savings are far from what you might at first expect. For example, the same 1-million-key hash with "no values" at all still occupies substantially the same space as one with values:

      C:\test>p1
      undef $h{ $_ } for 1 .. 1e6;;
      print total_size( \%h );;
      72277576

      In summary: You can gain some space by storing record positions instead of the records themselves, but it doesn't buy you as much as you at first might think it would.

        Granted... particularly in this case, I agree completely.   A 350MB file can simply be pulled into memory, and be done with it.   (I know that you have recently dealt with files that are several orders of magnitude larger.)

        The notion of using an SQLite file, literally as a persistent index covering as many keys as may be necessary, is actually the one that I tend to come back to, over and over again, when dealing with files like these.   I need to know where to find, via direct access, whatever it is I am looking for.   One pass through the file locates everything.   The “interesting stuff” now gets done with JOINs, often in a command-line ad hoc fashion.   Not in the “gigantic” case you recently spoke of, but maybe a very useful idea in this one.

        Not kosher to SQL tricks?   And, (a mere...) 350 megs?   “If you’ve got the RAM, then by all means use it and be done.”   Perl won’t blink an eye.
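
        For completeness, a minimal sketch of that all-in-RAM route, assuming the hash built from file2 really does fit in the available memory (the total_size figures above give a feel for the per-key overhead); file names and field layout are as in the original question:

        use strict;
        use warnings;

        # Slurp file2 into a hash: key => field2.
        open my $f2, '<', 'file2' or die "Can't open file2: $!";
        my %value;
        while ( <$f2> ) {
            chomp;
            my ( $key, $field2 ) = split /\*/;
            $value{ $key } = $field2;
        }
        close $f2;

        # Single pass over file1, updating field2 wherever the key is known.
        open my $f1, '<', 'file1' or die "Can't open file1: $!";
        while ( <$f1> ) {
            chomp;
            my ( $key, $vf1, $vf2 ) = split /\*/;
            $vf1 = $value{ $key } if exists $value{ $key };
            print join( '*', $key, $vf1, $vf2 ), "\n";
        }
        close $f1;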