Re: Comparing Files

As originally posted (with no Perl code of your own, and no sample data), your question isn't quite clear and lacks important details. I assume that you don't intend to append the entire contents of the 2nd file to the 1st one, but only those records of the 2nd file that don't occur in the 1st.

In general, comparing two files and producing a union of the two as output is pretty simple. If there are special conditions involved in comparing FASTA data, then you'd have to know (or make it clear to us) what those are.

The basic approach is to store the contents of the two files into a single hash, then print out the contents of the hash.

Based on googling "FASTA", I gather this is a simple text file format with one or more blocks of data structured as follows:

>SEQUENCE-ID-STRING commentary string
SEQUENCE_LETTERS_AT_MOST_80_PER_LINE...

So if a line of text starts with ">", it contains an ID string plus optional commentary, and defines the start of a new sequence string. Each line that starts with some letter is a successive chunk of a sequence string.

It's not clear whether the ID-strings are of any use in comparing different files. Are they? If you see the same ID-string in two different files, can you be confident that the sequence strings that follow it in each file are identical? If so, you can use the ID-strings as hash keys and sequence strings as hash values when you read the two files; if not, you'll need to use the sequence strings as hash keys, and can save the ID-strings as hash values.
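To make the two keying choices concrete, here is a minimal sketch. The record lists, the ID names and the two sub names (union_by_id, union_by_seq) are made up for illustration; they stand in for whatever you would actually read from your two files.

```perl
use strict;
use warnings;

# Hypothetical in-memory records: [ ID-string, sequence ] pairs standing in
# for what would be read from two FASTA files.
my @file1 = ( [ 'seqA', 'ACGT' ], [ 'seqB', 'GGCC' ] );
my @file2 = ( [ 'seqB', 'GGCC' ], [ 'seqC', 'ACGT' ] );

# Option 1: trust the IDs. Key on the ID-string, value is the sequence.
# The first record seen for an ID wins; later ones are skipped.
sub union_by_id {
    my %seq_hash;
    for my $rec (@_) {
        my ( $id, $seq ) = @$rec;
        $seq_hash{$id} = $seq unless exists $seq_hash{$id};
    }
    return \%seq_hash;
}

# Option 2: do not trust the IDs. Key on the sequence, value is the ID-string.
# Two records with the same letters collapse into one, whatever their IDs.
sub union_by_seq {
    my %seq_hash;
    for my $rec (@_) {
        my ( $id, $seq ) = @$rec;
        $seq_hash{$seq} = $id unless exists $seq_hash{$seq};
    }
    return \%seq_hash;
}

my $by_id  = union_by_id( @file1, @file2 );
my $by_seq = union_by_seq( @file1, @file2 );

printf "keyed by ID:       %d records\n", scalar keys %$by_id;     # 3
printf "keyed by sequence: %d records\n", scalar keys %$by_seq;    # 2
```

Note how the answer differs: seqA and seqC carry the same letters, so they count as two records when keyed by ID but as one when keyed by sequence. Which result is "right" depends on what the IDs mean in your data.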

See whether the following pseudo-code matches what you intend to do -- if so, you should be able to write working Perl code to implement it:

# require that there be three file name args on the command line:
#  -- input file1
#  -- input file2
#  -- output file

# for each input file:
#   open it
#   read one line at a time
#     chomp the line to remove final line-feed
#     if line starts with ">", it's the start of a new sequence, so:
#        if there was a sequence before this one, save it in %seq_hash
#        remember this line's ID-string and set current sequence to an empty string

#     otherwise, line starts with a letter, so:
#        append contents of line to current sequence

# After reaching the end of the second file, save current sequence in %seq_hash

# open the output file
# Loop over the keys of %seq_hash
#    write the ID-string to the output file
#    if necessary, add line-feeds into the sequence string at 80-char intervals
#    write the sequence string to the output file
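For reference, one way the pseudo-code above might be turned into Perl is sketched here. It rests on assumptions that the original question does not confirm: the ID-string is taken to be the first whitespace-delimited token after ">", the same ID is taken to mean the same sequence (so the ID is the hash key), and the first record seen for an ID is the one kept. The sub names (read_fasta, wrap_seq, write_fasta) and the @order array, which only preserves input order on output, are my own inventions. It also saves the pending record at the end of each file rather than only after the second one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read one FASTA file into %$seq_hash, keyed by ID-string (assumed unique
# per sequence). @$order records the order in which new IDs were first seen.
sub read_fasta {
    my ( $path, $seq_hash, $order ) = @_;
    open my $in, '<', $path or die "Cannot open $path: $!\n";
    my ( $header, $seq );

    # Store the record collected so far, unless its ID is already known.
    my $save = sub {
        return unless defined $header;
        my ($id) = $header =~ /^>\s*(\S+)/;
        $id = $header unless defined $id;    # bare ">" line: use it as-is
        return if exists $seq_hash->{$id};
        $seq_hash->{$id} = { header => $header, seq => $seq };
        push @$order, $id;
    };

    while ( my $line = <$in> ) {
        chomp $line;
        $line =~ s/\r$//;                    # tolerate CRLF line endings
        if ( $line =~ /^>/ ) {               # start of a new sequence
            $save->();
            ( $header, $seq ) = ( $line, '' );
        }
        elsif ( defined $header ) {          # a chunk of the current sequence
            $line =~ s/\s+//g;
            $seq .= $line;
        }
    }
    $save->();                               # last record in this file
    close $in;
    return;
}

# Break a sequence string into lines of at most $width characters.
sub wrap_seq {
    my ( $seq, $width ) = @_;
    $width ||= 80;
    return join '', map { "$_\n" } $seq =~ /.{1,$width}/g;
}

# Write every stored record: original header line, then the wrapped sequence.
sub write_fasta {
    my ( $path, $seq_hash, $order ) = @_;
    open my $out, '>', $path or die "Cannot write $path: $!\n";
    for my $id (@$order) {
        print {$out} "$seq_hash->{$id}{header}\n",
            wrap_seq( $seq_hash->{$id}{seq} );
    }
    close $out or die "Cannot close $path: $!\n";
    return;
}

# Run as a script only when file names are given, so the subs above can
# also be loaded and tried out on their own.
if (@ARGV) {
    die "Usage: $0 input1 input2 output\n" unless @ARGV == 3;
    my ( $file1, $file2, $outfile ) = @ARGV;
    my ( %seq_hash, @order );
    read_fasta( $_, \%seq_hash, \@order ) for $file1, $file2;
    write_fasta( $outfile, \%seq_hash, \@order );
}
```

If it turns out that IDs can't be trusted, the same structure should still work with the sequence string as the hash key and the header line as the value; only the $save closure and the output loop would need to change.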

Maybe you'll need to adjust the spec to cover issues that were not explained in your initial question. In any case, try writing some code, and post that when you have trouble with it.


Replies are listed 'Best First'.
Re^2: Comparing Files
by stajich (Chaplain) on Jan 24, 2005 at 19:24 UTC

Sadly it is more fun to use the same word for multiple things in biology... FASTA can be the sequence format (also called Pearson format). FASTA is also a program for searching sequences by aligning them.

As to what the poster is really after, I think it is the output of the latter, in order to generate a list of sequences which are significantly similar to the input query set. This is something you might do if you are trying to build a gene family made up of similar sequences.

But at this point we are just playing guessing games, so the poster will have to clarify what they want. Honestly, this is not the best forum for these questions: consider posting to the Bioperl list if you have bioinformatics+perl questions, or else spend a little more time explaining the algorithm you are trying to write.

As I have already posted in Re: Fasta Using Perl, it is possible to parse the output from FASTA with Bio::SearchIO, to parse the sequence files with Bio::SeqIO, and to work with databases of sequences through local indexes with Bio::DB::Fasta, Bio::Index::Fasta and friends.
