to continue on: (untested)
open( my $FH1 , '<', $file1 ) or die "Couldn't open file \"$file1\": $!";
open( my $FH2 , '<', $file2 ) or die "Couldn't open file \"$file2\": $!";
my %ids;
while (my $id = <$FH2>)
{
chomp $id; #remove line ending
$ids{$id} = 1;
}
my $line = <$FH1>; #throw away first header line
while ($line = <$FH1>)
{
#get 'rs2342349' from: "chr1 11223 11224 rs2342349\n"
my ($id) = (split /\s+/,$line)[3]; #whitespace chars also include tabs
print $line if exists $ids{$id};
}
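Update: You will notice that I removed the "\n" from the "die" statement. When the message does not end in a newline, die appends the script name and line number (plus a newline) by itself. If you explicitly put in an \n, that suppresses the location info, which changes what the "die" prints! Whoa! Here is a short demo: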
#open IN, '<', 'somename' or die "xxx $!\n";
#prints xxx No such file or directory
open IN, '<', 'somename' or die "xxx $!";
#prints xxx No such file or directory at C:\Projects_Perl\testing\junk.pl line 4.
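If you want to see the difference without needing a missing file, here is a small sketch (also only lightly tested) that traps both messages with eval. The helper name die_message and the "xxx failed" text are made up for the illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: return the text that die would have printed.
sub die_message {
    my ($msg) = @_;
    eval { die $msg };   # trap the exception instead of exiting
    return $@;
}

# Message ends in "\n": printed exactly as given.
my $with_newline    = die_message("xxx failed\n");

# No trailing "\n": die appends " at SCRIPT line N." plus a newline.
my $without_newline = die_message("xxx failed");

print "with newline:    $with_newline";
print "without newline: $without_newline";
```

The second message is usually the one you want while debugging, since it tells you where the script died.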
Update 2: RE: the statistics
If "Number_pos_1st_file" is just the line count, then that is easy. If these
pos values are not unique, then I see problems because the file is so large that
it is likely that a hash to count them won't fit into memory. In that case,
I would do a system sort on the file and then read through it to find the
unique pos values.
Is "Number_pos_2nd_file" just the number of keys in %ids, i.e. scalar(keys %ids)? Or perhaps it is the line count?
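For the sort-then-scan idea, a rough sketch of the scan step might look like this. It assumes the pos values have already been extracted one per line and sorted by the system sort; the sub name count_unique_sorted and the sample values are invented for the demo, and an in-memory filehandle stands in for the real sorted file:

```perl
use strict;
use warnings;

# Count distinct values in already-sorted input.
# Only the previous value is remembered, so no big hash is needed.
sub count_unique_sorted {
    my ($fh) = @_;
    my $count = 0;
    my $prev;
    while (my $pos = <$fh>) {
        chomp $pos;
        # In sorted input duplicates are adjacent, so skip repeats
        next if defined $prev && $pos eq $prev;
        $count++;
        $prev = $pos;
    }
    return $count;
}

# Made-up sample data standing in for the output of a system sort
my $sorted = "11223\n11223\n11300\n11450\n11450\n";
open( my $fh, '<', \$sorted ) or die "Couldn't open in-memory file: $!";
print count_unique_sorted($fh), "\n";
close $fh;
```

Memory use stays constant no matter how large the file is, which is the whole point of sorting first.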