in reply to General program and related problems

You can certainly iterate over the other file, but that might be quite slow.

I haven't quite understood what your input format looks like, but it might be possible to throw your data into a database and let it do a JOIN operation.

Or, if you have enough memory, you could read the second file into a hash and then look things up in it. But without knowing more about the input and the desired output it's hard to give good advice.
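As a rough illustration of the hash idea, here is a minimal sketch. It assumes the second file has one record per line with the ID as the first whitespace-separated field; the sub names `load_rows_by_id` and `select_rows` are made up for this example and are not from the original poster's code.

```perl
use strict;
use warnings;

# Build a hash: first whitespace-separated field of each line => whole line.
# $path is the file holding the rows to look up (assumed layout, see above).
sub load_rows_by_id {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    my %row_for;
    while (my $line = <$fh>) {
        my ($id) = split ' ', $line, 2;   # leading field only
        next unless defined $id;          # skip blank lines
        $row_for{$id} = $line;
    }
    close($fh);
    return \%row_for;
}

# Return the stored lines for the wanted IDs, in the order given,
# skipping IDs that are not present in the hash.
sub select_rows {
    my ($row_for, @wanted) = @_;
    return map { $row_for->{$_} } grep { exists $row_for->{$_} } @wanted;
}
```

With the IDs from the first file in an array, something like `print select_rows(load_rows_by_id($file2), @ids);` would print the matching lines. Note that this keeps the whole second file in memory, which is only reasonable if it fits.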

Re^2: General program and related problems
by Anonymous Monk on Aug 03, 2009 at 13:24 UTC
    Thanks for the reply. Basically, the problem is that I do not need most of the fields contained in file 1, nor most of the fields in file 2. A few lines of file 1:
    169: rs60465173 has merged into rs8057341 Homo sapiensCAGCTGACTGAGGCAGCGGGAGTTGAA/GAAGAAACGATATTAGTTCATGGTGA ABI, AFFY, ILLUMINA-UK, ILLUMINA, ILLUMINA, ILLUMINA, ILLUMINA, ILLUMINA, ILLUMINA
    170: rs17312781 has merged into rs8057341 Homo sapiensCAGCTGACTGAGGCAGCGGGAGTTGAA/GAAGAAACGATATTAGTTCATGGTGA ABI, AFFY, ILLUMINA-UK, ILLUMINA, ILLUMINA, ILLUMINA, ILLUMINA, ILLUMINA, ILLUMINA
    171: rs8057341 Homo sapiensCAGCTGACTGAGGCAGCGGGAGTTGAA/GAAGAAACGATATTAGTTCATGGTGA ABI, AFFY, ILLUMINA-UK, ILLUMINA, ILLUMINA, ILLUMINA, ILLUMINA, ILLUMINA, ILLUMINA
    172: rs60162986 has merged into rs8046608 Homo sapiensCCCTACTTACTTGTGGCCTGTCCCCTC/TGTGAATGTGTCTCATGTCCCCAGTG AFFY
    173: rs8046608 Homo sapiensCCCTACTTACTTGTGGCCTGTCCCCTC/TGTGAATGTGTCTCATGTCCCCAGTG
    From there I need the rs value, and for this I made the code I wrote about. Now I have the rs values in an array, and I need to grab only the lines which contain those rs numbers from a second, huge txt file (1 GB). The second file looks like


    First row XXX XXX XXX XXX XXX XXX XXX XXX (1050 cells)

    rsnumber AA AG AG AG AA AG AG AG (1050 times)

    rsnumber TT AT AA AT AT .....

    500 times more
    From this file I need the lines for the rs numbers stored in the array from file 1, together with the 1050 values on each line.
      Maybe I was not clear enough. From file 1 (you can see a few lines above) I need only the rs field with 5-8 digits. The same field is the first column of file 2.
      Thanks everybody for the help. Basically, my output from file 1 at the moment is a file with one column of rs values, like


      rs3547689

      rs325678912

      rs36789012

      etc

      I now need to find these values in file 2 and print out the matching lines, either to the screen or to a separate file.
      File 2 looks like
      XXX XXX XXX XXX XXX XXX (1050 times)

      rs3507865 AA AT AT AT TT AA (1050 values)

      rs3456189 GG GC GG CC CC .....

      more than 700 rows
      Can you give me a suggestion for the hash keys? Could it be the row number, given that I cannot write to file 2? Cheers again

        Only 700 rows (even if they are long)? Then you don't need any disk-based hash. Just create a hash with the rs value of a row in file2 (for example 'rs3507865') as the key and the position of that row in the file as the value. You can find the position in the file with tell(), called before reading the line.

        Then just read the numbers from file1, look up their positions in the hash, and use seek() on file2 to go there.

        open(FILE2,...
        my %rs;
        my $position=tell(FILE2);    # offset of the line about to be read
        while ($line=<FILE2>) {
            my ($key)= $line=~/^(rs\d{5,})\b/;
            if (defined $key) {
                $rs{$key}= $position;
            }
            $position=tell(FILE2);   # offset of the next line
        }
        ...
        while (defined ($line= <FILE>)) {
            ...
        }
        foreach (@output) {
            if (exists $rs{$_}) {
                seek(FILE2,$rs{$_},0);   # jump back to the start of that row
                my $line= <FILE2>;
                print FD $line;
            }
        }

        UPDATE: Added a '^' to the regex in the script
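For reference, the tell()/seek() idea can be written as a self-contained sketch along the following lines. It assumes lexical filehandles and the same `/^(rs\d{5,})\b/` pattern as above; the sub names `index_offsets` and `fetch_line` are hypothetical and only serve this example.

```perl
use strict;
use warnings;

# Scan an open filehandle once and remember, for each rs ID found at the
# start of a line, the byte offset at which that line begins.
sub index_offsets {
    my ($fh) = @_;
    my %offset_for;
    my $pos = tell($fh);              # offset of the line about to be read
    while (my $line = <$fh>) {
        if ($line =~ /^(rs\d{5,})\b/) {
            $offset_for{$1} = $pos;
        }
        $pos = tell($fh);             # offset of the next line
    }
    return \%offset_for;
}

# Jump to the remembered offset for $id and return that line,
# or undef if the ID was never indexed.
sub fetch_line {
    my ($fh, $offset_for, $id) = @_;
    return unless exists $offset_for->{$id};
    seek($fh, $offset_for->{$id}, 0) or die "seek failed: $!";
    return scalar <$fh>;
}
```

The index holds only one number per row, so it stays small even when the rows themselves are very long; the price is one seek and one read per lookup.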