Re: Dealing with large files in Perl

I think I understand this part of your description:

if value exist in file1 and found in file2 than take both lines containing that value and merge into one single file

But there's nothing in your code to support this kind of operation. What is the value that you're looking for in the two input files?

It looks like both input files are lists of table-like data, with three fields per line ("id up down"). If the value you're looking for is unique to one "cell" in each table (that is, it would occur only once per input file, if at all), then you're really talking about doing a "grep" operation. In fact, if you're using a unix-like OS, just use the "grep" command-line utility; if you're using MS-Windows, there are versions of grep available for free.

But if you want to see how it's done in perl, here's one way:

#!/usr/bin/perl

use strict;

my $Usage = "Usage: $0 value file1 file2\n";
die $Usage unless ( @ARGV == 3 and -f $ARGV[1] and -f $ARGV[2] );

my $value = shift;  # removes first element from @ARGV

my @match;  # will hold matching line from each file

for my $file ( @ARGV ) {  # loop over remaining two ARG's
    open( my $in, '<', $file ) or die "$file: $!";  # 3-arg open, lexical handle
    while (<$in>) {
        if ( /$value/ ) {
            chomp;
            push @match, $_;
            last;
        }
    }
}

print join( " ", @match ), "\n";

Now, if you were to try using the unix "grep" command, it would be:

 grep value file1 file2

Note that both the perl script and the grep command shown above will output the matching lines to STDOUT (the grep command will not join the two into a single line -- it will also include the file name at the beginning of each line, to show where the line came from).

Also, your output might not be what you expect if the value you're searching for contains characters that have special meanings in a regex (period, plus sign, asterisk, question mark, brackets, braces, parens, "^", "$", "|", backslash, and some others, depending on context). For such things, put "\Q" and "\E" around $value in the perl script, i.e. /\Q$value\E/.
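To illustrate the difference that "\Q" and "\E" make, here is a small sketch. The sample strings are made up for illustration and are not taken from the original data.

```perl
use strict;
use warnings;

# Hypothetical search value containing a regex metacharacter (the period).
my $value = "1.5";

# Unquoted: "." matches any character, so "125" counts as a match.
my $loose = ( "id 125 up" =~ /$value/ ) ? 1 : 0;

# Quoted: \Q...\E makes every character in $value literal.
my $strict = ( "id 125 up" =~ /\Q$value\E/ ) ? 1 : 0;
my $exact  = ( "id 1.5 up" =~ /\Q$value\E/ ) ? 1 : 0;

print "loose=$loose strict=$strict exact=$exact\n";
```

Only the unquoted pattern accepts "125"; the quoted one matches only the literal text "1.5".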

If you want the matches to be saved in a separate file, just use redirection on the command line:

  perlscript value file1 file2 > matched.lines
# or
  grep value file1 file2 > matched.lines


Re^2: Dealing with large files in Perl by tester786 (Initiate) on May 16, 2005 at 05:38 UTC
You're absolutely right. Your code matches exactly what I'm looking for; however, I'm not getting the result I expect. Here's the output after executing what you listed:
<snip> 00e06f16b25 41000 306000 00112f9486bf 412 1696 </snip>
What I'm looking for is to search for the value 00e06f16b25 in file1 and match it in file2, then take both matching lines from file1 and file2 and merge them into file3. So the result should be:
00e06f16b25 41000 306000 00e06f16b25 389 5000
Re^3: Dealing with large files in Perl by graff (Chancellor) on May 16, 2005 at 21:43 UTC
If you really ran the code exactly as I posted it, and your first command-line arg (assigned to $value) was really "00e06f16b25", then I just don't see how you could come up with the output that you cited inside your "snip" tags. Please double-check that you didn't alter the code, and that you ran it as intended.

But now that you have provided more information about your data -- that the value you want to match is the first token on each data line, and this consists of a long hex number -- you can speed things up and make it more trustworthy by using "substr" and "eq" instead of a regex match:

use strict;

my $Usage = "Usage: $0 value file1 file2\n";
die $Usage unless ( @ARGV == 3 and -f $ARGV[1] and -f $ARGV[2] );

my $value = shift;  # removes first element from @ARGV
my $chklen = length( $value );

my @match;  # will hold matching line from each file

for my $file ( @ARGV ) {  # loop over remaining two ARG's
    open( my $in, '<', $file ) or die "$file: $!";
    while (<$in>) {
        if ( substr( $_, 0, $chklen ) eq $value ) {
            chomp;
            push @match, $_;
            last;
        }
    }
    close $in;  # (this was implicit in the earlier version)
}

print join( " ", @match ), "\n";

Note that in either version, if the value you provide on the command line turns out to be shorter than the initial hex number on each line of the input files, there's a chance that you'll get a "false alarm" match. For example, in the initial regex version, if the search value on the command line was just "6b" or "00", this could explain why the record from the second file was not right -- "6b" and "00" are found in both records.
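One way to rule out that kind of "false alarm" is to compare the complete first field rather than a prefix of the line. This is a minimal sketch, assuming the fields are whitespace-separated as in the "id up down" samples. The name first_field_matches is a hypothetical helper, not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: true only when the whole first whitespace-separated
# field of $line equals $value, so "00" will not match "00e06f16b25 ...".
sub first_field_matches {
    my ( $line, $value ) = @_;
    my ($id) = split ' ', $line;    # first field; leading whitespace is skipped
    return defined $id && $id eq $value;
}
```

It could replace the substr test inside the while loop, e.g. if ( first_field_matches( $_, $value ) ) { ... }, at the cost of splitting each line.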
finding highest and lowest number by tester786 (Initiate) on May 23, 2005 at 23:24 UTC
Good evening all, No need to respond to this, as I got my script to work. Once it's completed I'll post the code. Regards,
Re^5: Dealing with large files in Perl by jZed (Prior) on May 23, 2005 at 23:33 UTC
Re^6: Dealing with large files in Perl by tester786 (Initiate) on May 26, 2005 at 06:12 UTC
Re^4: Dealing with large files in Perl by tester786 (Initiate) on May 17, 2005 at 20:02 UTC
This code didn't return anything. However, the last sample output I provided was from the grep I did against the file, which I just copied along with your generated output. I agree with what you indicated about using substr as opposed to a regex. What I need to know is: if I need to implement this within the script as another sub, how can I do that? Please get back to me.
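A sketch of how the substr/eq lookup could be wrapped in a sub, as asked above. The sub names first_match and merge_matches are hypothetical, and this is only one possible arrangement, not code posted by anyone in the thread.

```perl
use strict;
use warnings;

# Hypothetical sub: returns the first line of $file that starts with $value
# (chomped). Returns nothing (undef in scalar context) when no line matches.
sub first_match {
    my ( $value, $file ) = @_;
    my $chklen = length $value;
    open( my $in, '<', $file ) or die "$file: $!";
    while ( my $line = <$in> ) {
        if ( substr( $line, 0, $chklen ) eq $value ) {
            chomp $line;
            close $in;
            return $line;
        }
    }
    close $in;
    return;    # no match in this file
}

# Hypothetical sub: joins the first match from each file into one line.
sub merge_matches {
    my ( $value, @files ) = @_;
    return join " ", grep { defined } map { first_match( $value, $_ ) } @files;
}
```

The main script would then reduce to something like print merge_matches( $value, @ARGV ), "\n"; with the output redirected to the third file on the command line, as shown earlier in the thread.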