FarTech has asked for the wisdom of the Perl Monks concerning the following question:
Re: Comparing Files
by graff (Chancellor) on Jan 24, 2005 at 04:24 UTC
In general, comparing two files and producing a union of the two as output is pretty simple. If there are special conditions involved in comparing FASTA data, then you'd have to know (or make it clear to us) what those are. The basic approach is to store the contents of the two files into a single hash, then print out the contents of the hash.

Based on googling "FASTA", I gather this is a simple text file format with one or more blocks of data, each structured as follows: if a line of text starts with ">", it contains an ID string plus optional commentary, and defines the start of a new sequence string. Each line that starts with some letter is a successive chunk of that sequence string.

It's not clear whether the ID-strings are of any use in comparing different files. Are they? If you see the same ID-string in two different files, can you be confident that the sequence strings that follow it in each file are identical? If so, you can use the ID-strings as hash keys and sequence strings as hash values when you read the two files; if not, you'll need to use the sequence strings as hash keys, and can save the ID-strings as hash values.

See whether the following pseudo-code matches what you intend to do -- if so, you should be able to write working perl code to implement it:

Maybe you'll need to adjust the spec to cover issues that were not explained in your initial question. In any case, try writing some code, and post that when you have trouble with it.
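The hash-based approach described above could be sketched roughly like this. It is only an illustration under stated assumptions, not the pseudo-code from the post: the sub names `read_fasta` and `fasta_union`, the `'id'`/`'seq'` switch, and the sample records are all made up for the example, and it assumes that the first record seen for a given key wins. It needs Perl 5.10 or later for `//=`.

```perl
use strict;
use warnings;

# Read FASTA records from an open filehandle.
# Returns a list of [id, commentary, sequence] array refs.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    while ( my $line = <$fh> ) {
        $line =~ s/\r?\n\z//;         # strip the line ending
        next if $line =~ /^\s*$/;     # skip blank lines
        if ( $line =~ /^>(\S*)\s*(.*)$/ ) {
            # Header line: ID string plus optional commentary.
            push @records, [ $1, $2, '' ];
        }
        elsif (@records) {
            # Sequence line: append this chunk to the current record.
            $line =~ s/\s+//g;
            $records[-1][2] .= $line;
        }
    }
    return @records;
}

# Build the union of all records from the given filehandles.
# $key_by is 'id'  (hash key = ID string, value = sequence) or
#            'seq' (hash key = sequence, value = ID string).
# Assumption: the first record seen for a key is kept.
sub fasta_union {
    my ( $key_by, @handles ) = @_;
    my %union;
    for my $fh (@handles) {
        for my $rec ( read_fasta($fh) ) {
            my ( $id, undef, $seq ) = @$rec;
            if ( $key_by eq 'id' ) { $union{$id}  //= $seq }
            else                   { $union{$seq} //= $id }
        }
    }
    return \%union;
}

# Demo with made-up data held in memory instead of real files.
my $demo_one = ">seq1 first example\nACGT\nTTGA\n>seq2\nGGCC\n";
my $demo_two = ">seq2 same ID again\nGGCC\n>seq3\nAAAA\n";
open my $demo_fh1, '<', \$demo_one or die "cannot open in-memory file: $!";
open my $demo_fh2, '<', \$demo_two or die "cannot open in-memory file: $!";

my $demo_union = fasta_union( 'id', $demo_fh1, $demo_fh2 );
print ">$_\n$demo_union->{$_}\n" for sort keys %$demo_union;
```

With real data you would open the two FASTA files with the usual three-argument `open` and pass those handles instead. Switching the first argument to `'seq'` gives the other variant, where identical sequences collapse into one entry regardless of their ID-strings.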
by stajich (Chaplain) on Jan 24, 2005 at 19:24 UTC
FASTA can be the sequence format (also called Pearson format). FASTA is also a program for searching sequences by aligning them. As to what the poster is really after, I think it is the output of the latter, in order to generate a list of sequences which are significantly similar to the input query set. This is something you might do if you are trying to build a gene family, which is made up of similar sequences. But at this point we are just playing guessing games, so it will have to be clarified by the poster as to what they want. Honestly, this is not the best forum to ask these questions - consider posting to the Bioperl list if you have bioinformatics+perl questions, or else spend a little more time explaining the algorithm you are trying to write. As I have already posted in Re: Fasta Using Perl, it is possible to parse the output from FASTA with Bio::SearchIO, to parse the sequence files with Bio::SeqIO, and to work with databases of sequences through local indexes with Bio::DB::Fasta, Bio::Index::Fasta and friends.
Re: Comparing Files
by stajich (Chaplain) on Jan 24, 2005 at 00:48 UTC
you are going to have to try a little harder before we help you with your homework. show what you have already tried, pseudocode, anything.