File 1 is probably too large to hold all the ids in memory, but you haven't told us how large file 2 is. I guess it's much smaller, so you could try hashing the ids from it, then reading the large file line by line and checking whether each id was mentioned in file 2:
#!/usr/bin/perl
use warnings;
use strict;
use IO::Handle;    # for input_line_number on lexical filehandles

# Hash the ids from the small file.
my %ids;
open my $IDS, '<', 'file2' or die $!;
while (<$IDS>) {
    chomp;
    $ids{$_} = 1;
}

# Stream the large file, printing lines whose id was seen.
open my $OUT1, '>', 'output1' or die $!;
open my $LARGE, '<', 'file1' or die $!;
my $count_matches = 0;
while (<$LARGE>) {
    my ($pos, $start, $end, $id) = split;
    if ($ids{$id}) {
        ++$count_matches;
        print {$OUT1} $_;
    }
}
close $OUT1 or die $!;

open my $OUT2, '>', 'output2' or die $!;
print {$OUT2} "Number_pos_1st_file\t", $LARGE->input_line_number, "\n";
print {$OUT2} "Number_pos_2nd_file\t", $IDS->input_line_number, "\n";
print {$OUT2} "Nr_Matching\t", $count_matches, "\n";
print {$OUT2} "Nr_Non_matching\t",
    $IDS->input_line_number - $count_matches,
    "\n";
close $OUT2 or die $!;
How often does this sort of task need to be done? If the answer is more than once with similar data then you should seriously consider using a database to do the heavy lifting. To get your eye in take a look at Databases made easy.
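To make the database route concrete, here is a minimal sketch, assuming DBD::SQLite is available. The table and column names, and the inlined sample rows, are invented for illustration; in real use the rows would be loaded from file 1 and file 2, and the id table lets a plain SQL join do the matching instead of a Perl hash lookup.

```perl
#!/usr/bin/perl
# Hypothetical sketch of the database approach: load the ids into
# one table, the positions into another, and join on id.
# Assumes DBD::SQLite is installed; names and data are made up.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 1 } );

$dbh->do('CREATE TABLE ids (id TEXT PRIMARY KEY)');
$dbh->do('CREATE TABLE positions
          (chrom TEXT, pos_start INT, pos_end INT, id TEXT)');

# In real use these inserts would loop over file1 and file2.
my $ins = $dbh->prepare('INSERT INTO positions VALUES (?,?,?,?)');
$ins->execute(@$_) for
    [ 'chr1', 11223, 11224, 'rs2342349' ],
    [ 'chr3', 64564, 64565, 'rs3432456' ];
$dbh->do( 'INSERT INTO ids VALUES (?)', undef, $_ )
    for 'rs2342349', 'rs3274234';

# A simple join yields the matching records.
my $matches = $dbh->selectall_arrayref(
    'SELECT p.* FROM positions p JOIN ids i ON p.id = i.id');
print "@$_\n" for @$matches;
```

Once the data is in the database, the second output (counts of matching and non-matching records) is a couple of aggregate queries rather than hand-kept counters.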
#! perl -slw
use strict;
use Inline::Files;
my $tally = '';
m[rs(\d{7})] and vec( $tally, $1 - 1e6, 1 ) = 1 while <FILE2>;
m[rs(\d{7})] and vec( $tally, $1 - 1e6, 1 ) and print while <FILE1>;
__DATA__
__FILE1__
chr1 11223 11224 rs2342349
chr2 23423 23424 rs6345435
chr3 64564 64565 rs3432456
chr4 56456 56457 rs7979979
__FILE2__
rs2342349
rs3274234
rs2342344
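The vec() trick above treats the packed string $tally as a giant bit vector: each numeric rs id (minus the 1e6 offset) indexes one bit, so membership costs a single bit per possible id rather than a hash entry per seen id. A stripped-down sketch of just that idea, with ids invented for illustration:

```perl
#!/usr/bin/perl
# Minimal demonstration of using vec() as a bit set for numeric ids.
# The ids and the 1e6 offset mirror the example above.
use strict;
use warnings;

my $tally = '';

# Mark two ids as seen: set the bit at (id - 1e6) to 1.
vec( $tally, $_ - 1e6, 1 ) = 1 for 2342349, 3432456;

# Membership test is just reading that bit back.
print vec( $tally, 2342349 - 1e6, 1 ) ? "seen\n" : "not seen\n";
print vec( $tally, 9999999 - 1e6, 1 ) ? "seen\n" : "not seen\n";
```

For seven-digit rs numbers this caps memory at roughly 9e6 bits (about 1.1 MB) no matter how many ids file 2 contains, which is why it scales so well.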
The second output file makes no sense. The position of what in the first file? The position of what in the second? Number of matching what? Number of non-matching what?
Is that 4 lines (including all the boilerplate text) for every line in file 1, or in file 2? Or for every line in file 2 that was matched? Or...
$ grep -F -f file2 file1 > outputfile
This should construct your first output file and likely do it faster than perl (benchmarking left as an exercise). If you can come up with an algorithm for your second output file then we can look at that in more detail.
Hi Elninh05! It would be of enormous help if you used <code>...</code> tags for the column formats of these files. As written, your post is hard to understand. If you format it better, you will certainly get better answers.
Hi Marshall,
thanks for the tip. I'm new to Perl and the community, but I hope to improve my Perl skills more and more.
I see that file 1 is humongous (6 GB). How big is file 2? I guess output 1 is "extract records from file 1 that match an id in file 2"? I am unclear as to the algorithm for output 2.
Have you tried any code yet? If so, post it along with your thoughts on algorithms.
Update: The size of file 2 matters in terms of whether it can be kept in memory or not. If so, output 1 is relatively easy. If not, then some pre-sorting or a DB approach would be necessary.
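If file 2 turns out not to fit in memory, one alternative to a database is to sort both files on the id column first and then merge them in a single pass. A rough sketch of the merge step, with the sorted "files" inlined as arrays and the ids invented for brevity (in practice each would be read line by line from disk):

```perl
#!/usr/bin/perl
# Rough sketch of a merge join over two inputs pre-sorted on id.
# Arrays stand in for the sorted files; the data is illustrative.
use strict;
use warnings;

my @file1 = (    # sorted on the 4th column (the id)
    "chr1 11223 11224 rs1000001",
    "chr2 23423 23424 rs2000002",
    "chr3 64564 64565 rs3000003",
);
my @file2 = ( "rs1000001", "rs3000003" );    # sorted ids

my @matches;
my $i = 0;
for my $line (@file1) {
    my $id = ( split ' ', $line )[3];

    # Advance the file2 cursor past ids that sort before this one.
    ++$i while $i < @file2 && $file2[$i] lt $id;

    push @matches, $line if $i < @file2 && $file2[$i] eq $id;
}
print "$_\n" for @matches;
```

Because each input is consumed exactly once in order, memory use stays constant regardless of file size; the cost moves into the external sort (e.g. Unix sort) done beforehand.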
Hi pme,
the second file would be approx. 200 MB.
Dear Monks,
I'm very grateful for your help! With choroba's script I could solve the problem in a computationally efficient way. When I simply grepped both files, I ran out of memory and the command died. I thank everyone again for your help!