sm2004 has asked for the wisdom of the Perl Monks concerning the following question:

On one file (data.txt) I have a list like the following:
Contig25381
Contig25396
Contig25469
On another file (all.txt) I have a list of sequences like:
>R.fna.Contig25353
AAGCAGTGGTATCAACGCAGTAGTTGGTTCCATTACGGCCGGGCTCTGTT TCAGAATTTTAGATCCACCATCCGAGTTATTGAAACGCCAAGACCATAGA
>R.fna.Contig25381
GGACGAGATTTAACGACATCCATAAGCAACTCTGCTAATCATTCGATCTG CTTGGAGGTGTTTTTCCCCCATTTCCCTTAACCATGTCTCAGACTGTGGT
>R.fna.Contig25396
GGGATCTTTGGACGAAGGGGGGAAAAAGATGTCAACTTTAAGCATTCCAC CAATGCTTACTTCCCCTAGAGATGATGCCATTCAACTGTACAAGGCTTTC AAGGGATTTGGATGTGACACTTCTGCAGTAATCCATATCTTAGCTCGTCG
>R.fna.Contig25356
GGGAAGCAACCTGCCCTTCTCAGGCTTGCTCTAGATGATGTGCTTGAGGT GCCTTGATTAGTAGAGGTAGAAGAAGCAGAACAAAGGATTCACCTCGTTT
>R.fna.Contig25469
GGCTCTCTACCTATCTGTCTCTCTCTACCTCTCTCTCCTTTCACGCACAC
>R.fna.Contig25358
GGGAAGACGACGTCGTCACAACAAACCTCTCTTGAGGTTGGCAGCTTCCG
I would like to extract the sequences from the second file whose contig numbers are listed in the first file. I tried reading the lines of the first file into $line, and reading the second file into records separated by ">". Then I tried to use grep with $line to get the matching records from the second file. This approach only gives me the last match.
How can I get the full matched list? Each match on the list should have the line starting with '>' and then the full sequence until the line starting with the next '>'. Any tips would be appreciated. Thanks.

Re: compare a list from one file with another text file
by Arunbear (Prior) on Apr 05, 2008 at 23:30 UTC
    A slightly different approach than that suggested by igelkott: first read in your first file and store the keys in a hash (of arrays). Then read the second file and store the sequence in the hash if the hash has a matching key.
    use strict;
    use warnings;
    use Data::Dumper;
    use Fatal qw[open];

    my %key;
    open my $data, '<', 'data.txt';
    while(<$data>) {
        chomp;
        $key{$_} = [];
    }

    my %line;
    open my $all, '<', 'all.txt';
    while(<$all>) {
        chomp;
        $line{key} = $_;
        $line{seq} = <$all>; # read the sequence
        chomp $line{seq};
        my ($key) = ($line{key} =~ /(Contig\d*)/);
        if(exists $key{$key}) {
            push @{$key{$key}}, $line{seq};
        }
    }
    print Dumper(\%key) ."\n";
    with data.txt and all.txt as given this produces:
    $VAR1 = {
        'Contig25396' => [
            'GGGATCTTTGGACGAAGGGGGGAAAAAGATGTCAACTTTAAGCATTCCAC CAATGCTTACTTCCCCTAGAGATGATGCCATTCAACTGTACAAGGCTTTC AAGGGATTTGGATGTGACACTTCTGCAGTAATCCATATCTTAGCTCGTCG'
        ],
        'Contig25381' => [
            'GGACGAGATTTAACGACATCCATAAGCAACTCTGCTAATCATTCGATCTG CTTGGAGGTGTTTTTCCCCCATTTCCCTTAACCATGTCTCAGACTGTGGT'
        ],
        'Contig25469' => [
            'GGCTCTCTACCTATCTGTCTCTCTCTACCTCTCTCTCCTTTCACGCACAC'
        ]
    };
    Update: amended version per Re^2: compare a list from one file with another text file

      Thanks a lot for taking the time to write the code. I'm new to perl and do not quite understand statements like 'use Data::Dumper', but thanks so much for giving a solution.
Re: compare a list from one file with another text file
by igelkott (Priest) on Apr 05, 2008 at 22:50 UTC

    It'd be nice to see some attempts you've made or to at least get some parameters like the size of the lists. In the absence of details, here's a few hints:

    1. Split the second file on the ">" at the beginning of a line. The ">" can easily be put back on output.
    2. Turn this array into a "hash of arrays" (HoA). The key will be the sequence tag after the "R.fna." and the value will either be the raw sequence or the whole record (sans ">") if the "R.fna." isn't consistent. You can check the docs on HoA but the general idea is to use push when loading the hash values.
    3. Run through your first file and pull out all matches to the hash keys.
    4. Pick an output format and remember to add the ">" to the start of each header line.
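The four steps above could be sketched roughly as follows. This is only an illustration, not igelkott's own code: the sub names index_records and extract are made up here, the demo sequences are shortened stand-ins for the real data, and in-memory filehandles replace all.txt and data.txt so the snippet runs on its own.

```perl
use strict;
use warnings;

# Steps 1-2: split the sequence file into records at each "\n>" and
# load them into a hash of arrays keyed by the Contig tag in the header.
sub index_records {
    my ($fh) = @_;
    my %records_for;
    local $/ = "\n>";              # a record ends right before the next header
    while (my $rec = <$fh>) {
        chomp $rec;                # removes the trailing "\n>"
        $rec =~ s/^>//;            # the first record still carries its ">"
        $rec =~ s/\s+\z//;         # the last record ends in a plain newline
        my ($header) = split /\n/, $rec, 2;
        next unless defined $header && $header =~ /(Contig\d+)/;
        push @{ $records_for{$1} }, $rec;
    }
    return \%records_for;
}

# Steps 3-4: walk the id list and emit every matching record,
# putting the ">" back in front of the header line.
sub extract {
    my ($ids_fh, $records_for) = @_;
    my @out;
    while (my $id = <$ids_fh>) {
        chomp $id;
        next unless length $id && exists $records_for->{$id};
        push @out, ">$_\n" for @{ $records_for->{$id} };
    }
    return @out;
}

# Demo with shortened, made-up sequences held in memory.
my $all = ">R.fna.Contig25353\nAAGC\n>R.fna.Contig25381\nGGAC\nCTTG\n>R.fna.Contig25469\nGGCT\n";
my $ids = "Contig25381\nContig25469\n";
open my $all_fh, '<', \$all or die "can't open in-memory file: $!";
open my $ids_fh, '<', \$ids or die "can't open in-memory file: $!";
print extract($ids_fh, index_records($all_fh));
```

With real files, the two in-memory opens would be replaced by opens on all.txt and data.txt. Because each record keeps every line up to the next header, multi-line sequences come through intact.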
      The sequence file could be as large as 40Mb and the list could contain as many as 35000 sequences. I'm new to perl and have been struggling with this for a couple of days. Here's the part I wrote, but it doesn't give me the full list I want:(
      my $nohit_list = "data.txt";
      open (NOHIT, "<$nohit_list")or die "can't open file: $!";
      foreach $line (<NOHIT>) {
          print $line;
          my $all = "all.txt";
          open (ALL, "<$all") or die "can't open file: $!";
          {
              local $/ = '>';
              @fasta = <ALL>;
          }
          my $requery = "requery.txt";
          open (FASTA, ">$requery")or die "can't open file: $!";
          my @nohit_fasta = ();
          @nohit_fasta = grep /$line/,@fasta;
          print FASTA @nohit_fasta;
        Arunbear already provided an efficient solution, but if you are interested, here is the reason why your version didn't work:

        1) First of all there is a '}' missing at the end, but I guess this is the fault of the copy and paste

        2) For every line in data.txt you reread all.txt, grep for the result and then write the data into requery.txt (highly inefficient but a working idea). Sadly every time you open requery.txt again you delete the previous version of requery.txt, so that eventually only the last result is in there. Your program would work by simply changing the '>' in
        open (FASTA, ">requery.txt" ...
        to '>>'

        Even better would be to open requery.txt before the foreach loop (same place where you open data.txt). That way it gets opened only once
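To make the difference between the open modes concrete, here is a small sketch. The helper names write_in_loop, write_once and slurp are invented for this illustration, and temporary files from File::Temp stand in for requery.txt.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Re-open the output file on every iteration, as the original loop does.
# With '>' each open truncates the file; with '>>' each open appends.
sub write_in_loop {
    my ($path, $mode, @items) = @_;
    for my $item (@items) {
        open my $out, $mode, $path or die "can't open $path: $!";
        print {$out} "$item\n";
        close $out or die "can't close $path: $!";
    }
}

# The tidier variant: open once before the loop, write everything, close.
sub write_once {
    my ($path, @items) = @_;
    open my $out, '>', $path or die "can't open $path: $!";
    print {$out} "$_\n" for @items;
    close $out or die "can't close $path: $!";
}

# Read a whole file back as one string.
sub slurp {
    my ($path) = @_;
    open my $in, '<', $path or die "can't open $path: $!";
    local $/;
    return scalar <$in>;
}

my @ids = qw(Contig25381 Contig25396 Contig25469);

my (undef, $truncated) = tempfile(UNLINK => 1);
write_in_loop($truncated, '>', @ids);
print "re-opened with '>':\n", slurp($truncated);     # only the last id survives

my (undef, $appended) = tempfile(UNLINK => 1);
write_in_loop($appended, '>>', @ids);
print "re-opened with '>>':\n", slurp($appended);     # all three ids

my (undef, $once) = tempfile(UNLINK => 1);
write_once($once, @ids);
print "opened once with '>':\n", slurp($once);        # all three ids
```

One caveat with '>>': the file also keeps whatever was written by earlier runs of the script, which is another reason to prefer opening it once with '>' before the loop.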

      Hash of Arrays shouldn't be necessary, one whole multiline sequence can be stored in the value of the hash. Should be easier to program

      igelkott already mentioned that the size of the files is important. His algorithm works best if the first file is really big.

      If the second file is really big, his algorithm can be turned around so that the first file is put into a hash and then the second file is read line by line to check for a match.
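A minimal sketch of that turned-around version, assuming the headers always contain a tag of the form Contig followed by digits. The sub names load_ids and filter_fasta are hypothetical, and the demo uses shortened, made-up sequences in memory instead of the real files.

```perl
use strict;
use warnings;

# Put the (small) id list into a hash for constant-time lookups.
sub load_ids {
    my ($fh) = @_;
    my %wanted;
    while (my $id = <$fh>) {
        $id =~ s/\s+\z//;                  # strip newline and stray whitespace
        $wanted{$id} = 1 if length $id;
    }
    return \%wanted;
}

# Stream the (large) sequence file line by line. A header line decides
# whether the lines up to the next header are copied to the output.
sub filter_fasta {
    my ($in, $out, $wanted) = @_;
    my $keep = 0;
    while (my $line = <$in>) {
        if ($line =~ /^>/) {
            my ($id) = $line =~ /(Contig\d+)/;
            $keep = defined $id && exists $wanted->{$id};
        }
        print {$out} $line if $keep;
    }
}

# Demo with in-memory data standing in for data.txt and all.txt.
my $ids = "Contig25381\nContig25469\n";
my $all = ">R.fna.Contig25353\nAAGC\n>R.fna.Contig25381\nGGAC\nCTTG\n>R.fna.Contig25469\nGGCT\n";
open my $ids_fh, '<', \$ids or die "can't open in-memory file: $!";
open my $all_fh, '<', \$all or die "can't open in-memory file: $!";
filter_fasta($all_fh, \*STDOUT, load_ids($ids_fh));
```

Only the id list is held in memory, so the size of the sequence file barely matters, and sequences that span several lines are copied whole.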

Re: compare a list from one file with another text file
by cosmicperl (Chaplain) on Apr 05, 2008 at 22:33 UTC
    Look into DIFFs
    A diff program will show you the difference between two text files. Text::Diff on CPAN seems to cover it.