in reply to Read two files and print

Trying to process the two files in parallel is a bad idea. Assuming that text2 is not huge, read it first and create a hash entry for each file entry.

Then read text1 and use the hash to check to see if there is a matching entry in text2:

use strict;
use warnings;

my $text1 = <<'END_TEXT';
2343/45/45/cal/ca-1.xml 2343/45/45/ca-1
6534/534/34/car/ca-5.xml 6534/534/34/ca-5
END_TEXT

my $text2 = <<'END_TEXT';
6534/534/34/ca-5
5676/435/734/da-1
END_TEXT

my %text2Entries;

# Build a hash table of text2 entries
open my $inFile, '<', \$text2 or die "Unable to read text2: $!\n";
while (<$inFile>) {
    chomp;
    ++$text2Entries{lc $_};
}
close $inFile;

# Read the text1 entries and print any that have a matching text2 entry
open $inFile, '<', \$text1 or die "Unable to read text1: $!\n";
while (my $line = <$inFile>) {
    my ($part1, $part2) = split /\s+/, $line;
    chomp $part2;
    next if ! exists $text2Entries{lc $part2};
    print $line;
}
close $inFile;

Prints:

6534/534/34/car/ca-5.xml 6534/534/34/ca-5

Don't get hung up on the two $text strings - that's just so I can provide a runnable test script without requiring external files.


True laziness is hard work

Replies are listed 'Best First'.
Re^2: Read two files and print
by Marshall (Canon) on Feb 26, 2009 at 15:50 UTC
    This is a great idea. I would just add a few comments that might clarify a few things for previous posts.

    1. split /\s+/, $line; splits on any whitespace character; this includes space, \f, \r, \n and \t. Since \n is in this set, you don't need to chomp($part2); it doesn't hurt, but it is not necessary here. The reason "\t" didn't work in the previous post is that the first argument to split needs to be a regex. /\t/ would have worked, but /\s+/ is usually better. The \t idea would leave a \n in $part2, and of course, since you can't see these non-printing characters, it is possible that there are some plain spaces in there as well!
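    A quick way to convince yourself that splitting on /\s+/ consumes the trailing newline (the sample line is taken from the test data above; the /\t/ comparison line is made up for illustration):

```perl
use strict;
use warnings;

my $line = "6534/534/34/car/ca-5.xml 6534/534/34/ca-5\n";

# Splitting on /\t/ would leave the newline attached to the last field...
my (undef, $tab_part2) = split /\t/, "a\t6534/534/34/ca-5\n";

# ...but splitting on /\s+/ treats the trailing \n as a delimiter too,
# so $part2 comes back already free of the newline - no chomp needed
my (undef, $part2) = split /\s+/, $line;

print $tab_part2 =~ /\n/ ? "tab split: newline kept\n" : "tab split: no newline\n";
print $part2    =~ /\n/ ? "ws split: newline kept\n"  : "ws split: no newline\n";
```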

    2. The best way to get the 2nd thing from the split is with a list slice: my $part2 = (split /\s+/, $line)[1];. Since you don't use $part1, there is no need to assign it. It often happens that you are working with a line with a bunch of things on it and you just want a couple of them. Using a list slice lets you assign meaningful names to those things, like maybe: my ($temperature, $city) = (split /\s+/, $line)[3,8];. This is a lot better than, say, $line[3], because you don't need any comments to explain that thing 3 means temperature.

    Of course here the op probably has some other name in mind for $part2 that would make the code even more clear.
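    The list-slice idea can be sketched as a tiny runnable example (the data line and the field positions are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical weather-report line: year month day temp wind ... city
my $line = "2009 02 26 41.5 NE 10 62 7 Boston";

# Pull out only fields 3 and 8 and give them meaningful names
my ($temperature, $city) = (split /\s+/, $line)[3, 8];

print "$city: $temperature\n";    # prints "Boston: 41.5"
```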

      The files are very huge. I tried something like
      open FH, '<file1.txt';
      @data = <FH>;
      open FH1, '<file2.txt';
      @data1 = <FH1>;
      my $text1 = <<END_TEXT;
      @data
      END_TEXT
      my $text2 = <<END_TEXT1;
      @data1;
      END_TEXT1
      @data inside <<END_TEXT prints only one row. How can I print the entire array inside <<END_TEXT?

        Define 'huge'. For any value of huge over a few hundred megabytes you really don't want to slurp the files into memory! In fact, at that size you are getting into file sizes where you should be using a database. Perhaps you'd better give us a little more information about the size and true nature of the files you are dealing with, and the task you actually need to perform.


        True laziness is hard work
        1. Since you are replying to my comments, I will comment: to print @data, just use print @data;
        When you "slurped" file1 into memory, that would have included the "\n"s. @data = <FH>; reads all lines from <FH> and puts them into the @data list.
        The <<END_TEXT sort of idea will have no place in your code. That was just a way that grandfather embedded a short test file into the code.
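        A minimal sketch of the slurp-and-print idea. To keep it runnable without external files, it reads from an in-memory filehandle (the same trick grandfather used); with a real file you would open 'file1.txt' instead:

```perl
use strict;
use warnings;

# In-memory "file" so this sketch runs standalone
my $contents = "line one\nline two\nline three\n";
open my $fh, '<', \$contents or die "Unable to open: $!\n";

my @data = <$fh>;   # list context: reads every line, newlines included
close $fh;

print @data;                 # prints all three lines
print scalar @data, "\n";    # prints 3 - one element per line
```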

        2. Having said that about printing @data, this is NOT what you want to do! grandfather's code reads the text2 file one line at a time and creates a hash table. It does NOT save a verbatim copy of either the text2 or text1 input files into an array!

        3. Create 2 small files, say 100 lines each and get grandfather's code running on your machine. The code will run in a few seconds. Then turn it loose on the full size files that you have. The FIRST STEP before optimizing is to get running code!

        From looking at the code, I doubt that you will see much difference between 100 lines and 10,000 lines in file2. I suspect that this thing will run in much less than 10 seconds. If the program runs within an acceptable time frame for you, there is probably no need to optimize it.

        4. HUGE is relative! This algorithm will not slow down appreciably until the hash of file2 (the smaller file) exceeds what you can keep memory resident. I just opened one of my apps that creates a hash table of about 120K entries and sorts/displays it in a Tk GUI; it takes less than 0.5 seconds, and the processing being done is FAR more than in your application.

        5. So get working code with a small set of data, and then report back about problems and size issues when you scale it up.