Elninh05 has asked for the wisdom of the Perl Monks concerning the following question:

I have a really large tab-delimited file (about 6 GB) with 4 columns (pos, start, end, id) and another file of about 200 MB containing only ids in one column. I want to output a new file with the matching ids and their corresponding positions. It should also output a second file which contains the number of lines in the first and second input files, and the number of matching and non-matching positions by id. Thanks a lot in advance. I appreciate your help very much.

first file Format
    pos start end id
    chr1 11223 11224 rs2342349
    chr2 23423 23424 rs6345435
    chr3 64564 64565 rs3432456
    chr4 56456 56457 rs7979979

second file Format
    id
    rs2342349 #only match
    rs3274234
    rs2342344

Output1
    chr1 11223 11224 rs2342349

Output2
    Number_pos_1st_file 4
    Number_pos_2nd_file 3
    Nr_Matching 1
    Nr_Non_matching 2

Re: Query large tab delimited file by a list
by choroba (Cardinal) on Jul 03, 2016 at 16:39 UTC
    file1 is probably too large to hold all of its ids in memory, but you haven't told us how large file2 is. I guess it's much smaller, so you could try hashing its ids and then reading the large file line by line, checking whether each id was mentioned in file2:

    #!/usr/bin/perl
    use warnings;
    use strict;

    my %ids;
    open my $IDS, '<', 'file2' or die $!;
    while (<$IDS>) {
        chomp;
        $ids{$_} = 1;
    }

    open my $OUT1, '>', 'output1' or die $!;
    open my $LARGE, '<', 'file1' or die $!;
    my $count_matches = 0;
    while (<$LARGE>) {
        my ($pos, $start, $end, $id) = split;
        ++$count_matches, print {$OUT1} $_ if $ids{$id};
    }
    close $OUT1 or die $!;

    open my $OUT2, '>', 'output2' or die $!;
    print {$OUT2} "Number_pos_1st_file\t", $LARGE->input_line_number, "\n";
    print {$OUT2} "Number_pos_2nd_file\t", $IDS->input_line_number, "\n";
    print {$OUT2} "Nr_Matching\t", $count_matches, "\n";
    print {$OUT2} "Nr_Non_matching\t", $IDS->input_line_number - $count_matches, "\n";
    close $OUT2 or die $!;
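The script above counts matching lines of file1, so Nr_Non_matching can come out wrong if an id occurs more than once in file1. A minimal sketch of the same hash-then-stream idea that also tracks distinct matched ids might look like this. The function name `filter_by_ids` and the keys of the returned hash are made up for illustration, and the sketch assumes the id is the fourth whitespace-separated field and that file2 holds one bare id per line.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the thread.
#   $ids_fh   - handle on the id list (assumed: one bare id per line)
#   $large_fh - handle on the large file (assumed: id is the 4th field)
#   $out_fh   - handle that receives the matching lines
sub filter_by_ids {
    my ($ids_fh, $large_fh, $out_fh) = @_;

    # Value 0 means "listed in file2 but not yet seen in file1".
    my %seen;
    my $id_lines = 0;
    while (my $id = <$ids_fh>) {
        chomp $id;
        ++$id_lines;
        $seen{$id} = 0;
    }

    my ($large_lines, $matched_lines) = (0, 0);
    while (my $line = <$large_fh>) {
        ++$large_lines;
        my $id = (split ' ', $line)[3];
        next unless defined $id && exists $seen{$id};
        ++$seen{$id};
        ++$matched_lines;
        print {$out_fh} $line;
    }

    # grep in scalar context returns the number of ids seen at least once.
    my $matched_ids = grep { $_ > 0 } values %seen;

    return {
        lines_file1   => $large_lines,
        lines_file2   => $id_lines,
        matched_lines => $matched_lines,
        matched_ids   => $matched_ids,
        unmatched_ids => scalar(keys %seen) - $matched_ids,
    };
}
```

The caller would open 'file2', 'file1' and 'output1' as in the script above, pass the three handles in, and print the returned counts to 'output2'. Only the ids of the smaller file are held in memory; the large file is still read one line at a time.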

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: Query large tab delimited file by a list
by GrandFather (Saint) on Jul 03, 2016 at 21:37 UTC

    How often does this sort of task need to be done? If the answer is more than once with similar data then you should seriously consider using a database to do the heavy lifting. To get your eye in take a look at Databases made easy.

    Premature optimization is the root of all job security
Re: Query large tab delimited file by a list
by BrowserUk (Patriarch) on Jul 03, 2016 at 17:17 UTC

    This will take care of the first output file quickly and easily:

    #! perl -slw
    use strict;
    use Inline::Files;

    my $tally = '';
    m[rs(\d{7})] and vec( $tally, $1 - 1e6, 1 ) = 1 while <FILE2>;
    m[rs(\d{7})] and vec( $tally, $1 - 1e6, 1 ) and print while <FILE1>;

    __DATA__
    __FILE1__
    chr1 11223 11224 rs2342349
    chr2 23423 23424 rs6345435
    chr3 64564 64565 rs3432456
    chr4 56456 56457 rs7979979
    __FILE2__
    rs2342349
    rs3274234
    rs2342344

    The second output file makes no sense. The position of what in the first file? The position of what in the second? Number of matching what? Number of non-matching what?

    Is that 4 lines (including all the boiler-plate text) for every line in file 1; or file 2? Or every line in file 2 that was matched? Or...
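The bit-vector code above stores one bit per possible id number instead of one hash entry per id. As a rough sketch, the same idea can be written as two small functions. The names `build_bitmap` and `in_bitmap` are hypothetical, and the sketch assumes every id is "rs" followed by digits. It uses the numeric part directly as the bit offset rather than subtracting 1e6, so the string grows to about (largest id number / 8) bytes; seven-digit ids stay under 1.25 MB.

```perl
use strict;
use warnings;

# Hypothetical helpers, not from the thread.
# Assumption: ids look like "rs" followed by digits only.

# Read the id list and set one bit per id number.
sub build_bitmap {
    my ($ids_fh) = @_;
    my $bitmap = '';
    while (my $line = <$ids_fh>) {
        vec($bitmap, $1, 1) = 1 if $line =~ /\brs(\d+)\b/;
    }
    return $bitmap;
}

# Return 1 if the line carries an id whose bit is set, else 0.
# The bitmap is passed by reference to avoid copying it on every call.
sub in_bitmap {
    my ($bitmap_ref, $line) = @_;
    return $line =~ /\brs(\d+)\b/ && vec($$bitmap_ref, $1, 1) ? 1 : 0;
}
```

A caller would build the bitmap once from file2 and then print each line of file1 for which `in_bitmap` returns 1. Reading a bit beyond the end of the string yields 0, so ids larger than any id in file2 are rejected without extra checks.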


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice. Not understood.
Re: Query large tab delimited file by a list
by hippo (Archbishop) on Jul 04, 2016 at 09:06 UTC

    I see no reason to avoid the obvious tool: grep.

    $ grep -F -f file2 file1 > outputfile

    This should construct your first output file and likely do it faster than perl (benchmarking left as an exercise). If you can come up with an algorithm for your second output file then we can look at that in more detail.

Re: Query large tab delimited file by a list
by Marshall (Canon) on Jul 03, 2016 at 16:00 UTC
    Hi Elninh05! It would be of enormous help if you used <code>...</code> tags for the column formats of these files. As written, your post is hard to understand. If you format it better, you will certainly get better answers.
      Hi Marshall, thanks for the tip. I'm new to Perl and the community, but I hope I can improve my Perl skills more and more.
        I see that file 1 is humongous (6 GB). How big is file 2? I guess output 1 is "extract records from file 1 that match an id in file2?". I am unclear as to the algorithm for output 2.

        Have you tried any code yet? If so post it and your thoughts on algorithms.

        Update: The size of file 2 matters in terms of whether this can be kept in memory or not. If so, this output 1 is relatively easy. If not, then some pre-sorting or a DB approach would be necessary.
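If the id list did not fit in memory, the pre-sorting route mentioned above could be sketched as a merge of two sorted streams. This is only an illustration under stated assumptions: `merge_matches` is a made-up name, both inputs must already be sorted by id in plain byte order (for example with `LC_ALL=C sort -k4,4` for file1 and `LC_ALL=C sort` for file2), and the id is taken to be the fourth whitespace-separated field.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the thread.
# Assumption: both handles deliver lines sorted by id in byte order,
# which is the order Perl's cmp uses when "use locale" is not in effect.
sub merge_matches {
    my ($ids_fh, $large_fh, $out_fh) = @_;
    my $matched = 0;

    my $id = <$ids_fh>;
    chomp $id if defined $id;
    my $line = <$large_fh>;

    while (defined $id && defined $line) {
        my $line_id = (split ' ', $line)[3];
        $line_id = '' unless defined $line_id;   # short lines are skipped

        my $order = $line_id cmp $id;
        if ($order < 0) {
            # The large file is behind the id list: advance it.
            $line = <$large_fh>;
        }
        elsif ($order > 0) {
            # The id list is behind the large file: advance it.
            $id = <$ids_fh>;
            chomp $id if defined $id;
        }
        else {
            print {$out_fh} $line;
            ++$matched;
            # Keep the current id, since the large file may repeat it.
            $line = <$large_fh>;
        }
    }
    return $matched;
}
```

Only one line from each file is held at a time, so memory use stays constant regardless of file size. The cost moves into sorting both files first.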

Re: Query large tab delimited file by a list
by pme (Monsignor) on Jul 03, 2016 at 16:33 UTC
    Hi Elninh05,

    How big is the other file, the one containing only ids in one column?

      Hi pme, the second file would be approx. 200 MB.
Re: Query large tab delimited file by a list
by Elninh05 (Novice) on Jul 04, 2016 at 19:33 UTC
    Dear Monks, I'm very grateful for your help! With choroba's script I could solve the problem in a computationally efficient way. With grep alone I ran out of memory and the command died. I thank everyone again for your help!