in reply to compare a list from one file with another text file

It'd be nice to see some attempts you've made, or at least to get some parameters like the size of the lists. In the absence of details, here are a few hints:

  1. Split the second file on the ">" at the beginning of a line. The ">" can easily be restored on output.
  2. Turn this array into a "hash of arrays" (HoA). The key will be the sequence tag after the "R.fna." and the value will be either the raw sequence or the whole record (sans ">") if the "R.fna." isn't consistent. You can check the docs on HoA, but the general idea is to use push when loading the hash values.
  3. Run through your first file and pull out all matches to the hash keys.
  4. Pick an output format, and remember to add the ">" back to the start of each record.
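
The steps above could be sketched roughly as follows. This is only a sketch under assumptions: the exact header layout isn't known here, so the key is simply taken to be the first whitespace-delimited word of each header line, and the sub names load_records and select_records are made up for illustration. Adjust the regex if your tag sits elsewhere in the header.

```perl
use strict;
use warnings;

# Steps 1 and 2: read FASTA records and load them into a hash of arrays.
# ASSUMPTION: the key is the first word of the header line.
sub load_records {
    my ($fh) = @_;
    local $/ = "\n>";              # a record ends where the next ">" starts a line
    my %records;
    while (my $rec = <$fh>) {
        chomp $rec;                # strips the trailing "\n>" separator, if present
        $rec =~ s/^>//;            # the very first record still carries its ">"
        $rec =~ s/\s+\z//;         # the last record keeps a trailing newline
        next unless $rec =~ /^(\S+)/;
        push @{ $records{$1} }, $rec;   # push, so duplicate tags are all kept
    }
    return \%records;
}

# Steps 3 and 4: look up every ID from the list and restore the ">".
sub select_records {
    my ($list_fh, $records) = @_;
    my @out;
    while (my $id = <$list_fh>) {
        $id =~ s/^\s+|\s+$//g;     # trim the newline and stray whitespace
        next unless length $id and exists $records->{$id};
        push @out, map { ">$_\n" } @{ $records->{$id} };
    }
    return @out;
}
```

To use it, open the sequence file and the list with three-argument open, pass the handles to the two subs, and print the returned list to your output file.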

Re^2: compare a list from one file with another text file
by sm2004 (Acolyte) on Apr 05, 2008 at 23:39 UTC
    The sequence file could be as large as 40Mb and the list could contain as many as 35000 sequences. I'm new to Perl and have been struggling with this for a couple of days. Here's the part I wrote, but it doesn't give me the full list I want :(
    my $nohit_list = "data.txt";
    open (NOHIT, "<$nohit_list") or die "can't open file: $!";
    foreach $line (<NOHIT>) {
        print $line;
        my $all = "all.txt";
        open (ALL, "<$all") or die "can't open file: $!";
        {
            local $/ = '>';
            @fasta = <ALL>;
        }
        my $requery = "requery.txt";
        open (FASTA, ">$requery") or die "can't open file: $!";
        my @nohit_fasta = ();
        @nohit_fasta = grep /$line/,@fasta;
        print FASTA @nohit_fasta;
      Arunbear already provided an efficient solution, but if you are interested, here is the reason why your version didn't work:

      1) First of all, there is a '}' missing at the end, but I guess that is the fault of the copy and paste.

      2) For every line in data.txt you reread all.txt, grep for the result and then write the data into requery.txt (highly inefficient, but a working idea). Sadly, every time you reopen requery.txt with '>' you truncate it, so eventually only the last result is left in there. Your program would work by simply changing the '>' in
      open (FASTA, ">requery.txt" ...
      to '>>'.

      Even better would be to open requery.txt before the foreach loop (the same place where you open data.txt). That way it gets opened only once.
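
      A rough sketch of that restructuring, in case it helps. It keeps your substring grep idea but reads the sequence file only once and writes to a single output handle that the caller opens once. The sub name write_matches is made up for illustration, and passing filehandles in (rather than opening data.txt, all.txt and requery.txt inside the sub) is just one way to arrange it.

```perl
use strict;
use warnings;

# Hypothetical helper: the caller opens each file once and passes the handles.
sub write_matches {
    my ($list_fh, $fasta_fh, $out_fh) = @_;

    # Read the sequence file a single time, one record per '>'.
    my @fasta;
    {
        local $/ = '>';
        while (my $rec = <$fasta_fh>) {
            chomp $rec;                      # drop the trailing '>'
            push @fasta, $rec if length $rec;
        }
    }

    # One pass over the list; the output handle stays open throughout.
    while (my $id = <$list_fh>) {
        chomp $id;
        next unless length $id;
        # Substring match, as in the original grep; \Q...\E protects regex
        # metacharacters. Note that "seq1" would also match "seq10".
        print {$out_fh} map { ">$_" } grep { /\Q$id\E/ } @fasta;
    }
    return;
}
```

      With lexical handles this would be called after something like open my $out, '>', 'requery.txt' placed before the loop, so plain '>' is fine again because the file is only opened once.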

        Thanks a lot. That was very helpful. It works now. I really appreciate all the feedback.
Re^2: compare a list from one file with another text file
by jethro (Monsignor) on Apr 05, 2008 at 23:37 UTC
    A hash of arrays shouldn't be necessary: one whole multiline sequence can be stored in a single hash value, which should be easier to program.

    igelkott already mentioned that the size of the files is important. His algorithm works best if the first file is really big.

    If the second file is really big, his algorithm can be turned around so that the first file is put into a hash and then the second file is read line by line to check for a match.
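
    A minimal sketch of that turned-around version, under the assumption that the ID in the list equals the first word after the ">" in each header (adjust the regex otherwise). The sub name filter_fasta is made up for illustration. Only the list is held in memory; the sequence file is streamed line by line, and a flag remembers whether the current record is wanted.

```perl
use strict;
use warnings;

# Hypothetical helper: list goes into a hash, FASTA is streamed.
sub filter_fasta {
    my ($list_fh, $fasta_fh, $out_fh) = @_;

    # Load the wanted IDs into a hash for constant-time lookups.
    my %wanted;
    while (my $id = <$list_fh>) {
        $id =~ s/^\s+|\s+$//g;            # trim newline and stray whitespace
        $wanted{$id} = 1 if length $id;
    }

    # Stream the sequence file; each header line decides the fate of the
    # sequence lines that follow it.
    my $keep = 0;
    while (my $line = <$fasta_fh>) {
        if ($line =~ /^>/) {
            # ASSUMPTION: the ID is the first word of the header.
            $keep = ($line =~ /^>(\S+)/ && exists $wanted{$1}) ? 1 : 0;
        }
        print {$out_fh} $line if $keep;
    }
    return;
}
```

    The output comes out in the order of the sequence file rather than the order of the list, which may or may not matter for your purposes.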