in reply to Comparing Files
In general, comparing two files and producing a union of the two as output is pretty simple. If there are special conditions involved in comparing FASTA data, then you'd have to know (or make clear to us) what those are.
The basic approach is to store the contents of the two files into a single hash, then print out the contents of the hash.
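As a rough illustration of that idea, here is a minimal sketch in which each "file" is already reduced to a hash of records. The sub name `union_of_records` and the sample keys are made up for the example; nothing here is FASTA-specific yet.

```perl
use strict;
use warnings;

# Merge key/value records from several sources into one hash.
# A key seen more than once is stored only once (the later value wins).
sub union_of_records {
    my @sources = @_;    # each source is a hash reference
    my %union;
    for my $source (@sources) {
        @union{ keys %$source } = values %$source;
    }
    return \%union;
}

# Hypothetical sample data standing in for the two input files.
my $merged = union_of_records(
    { seq1 => 'ACGT', seq2 => 'GGCC' },
    { seq2 => 'GGCC', seq3 => 'TTAA' },
);

print "$_\t$merged->{$_}\n" for sort keys %$merged;
```

Because hash keys are unique, the record that appears in both sources ends up in the output only once.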
Based on googling "FASTA", I gather this is a simple text file format with one or more blocks of data structured as follows:

    >SEQUENCE-ID-STRING commentary string
    SEQUENCE_LETTERS_AT_MOST_80_PER_LINE
    ...

So if a line of text starts with ">", it contains an ID string plus optional commentary, and defines the start of a new sequence string. Each line that starts with some letter is a successive chunk of a sequence string.
It's not clear whether the ID-strings are of any use in comparing different files. Are they? If you see the same ID-string in two different files, can you be confident that the sequence strings that follow it in each file are identical? If so, you can use the ID-strings as hash keys and sequence strings as hash values when you read the two files; if not, you'll need to use the sequence strings as hash keys, and can save the ID-strings as hash values.
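If the ID-strings turn out not to be trustworthy, the second option (sequence as key, ID-strings as values) might look something like this sketch. The records and the `%ids_for_seq` name are invented for illustration, and keeping every ID in a list is just one possible choice.

```perl
use strict;
use warnings;

# Hypothetical sample records: [ID-string, sequence] pairs from two files.
my @records = (
    [ 'geneA file1 copy', 'ACGTACGT' ],
    [ 'geneB',            'GGGCCC'   ],
    [ 'geneA_alt',        'ACGTACGT' ],    # same sequence, different ID
);

# Key on the sequence; keep every ID that was seen with it.
my %ids_for_seq;
for my $record (@records) {
    my ($id, $seq) = @$record;
    push @{ $ids_for_seq{$seq} }, $id;
}

for my $seq (sort keys %ids_for_seq) {
    print "$seq: ", join(', ', @{ $ids_for_seq{$seq} }), "\n";
}
```

Identical sequences collapse into one hash entry regardless of how they were labelled, and you can decide later which ID to print for each.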
See whether the following pseudo-code matches what you intend to do -- if so, you should be able to write working perl code to implement it:

    # require that there be three file name args on the command line:
    #  -- input file1
    #  -- input file2
    #  -- output file
    # for each input file:
    #    open it
    #    read one line at a time
    #       chomp the line to remove final line-feed
    #       if line starts with ">", it's the start of a new sequence, so:
    #          if there was a sequence before this one, save it in %seq_hash
    #          set current sequence to an empty string
    #       otherwise, line starts with a letter, so:
    #          append contents of line to current sequence
    # After reaching the end of the second file, save current sequence in %seq_hash
    # open the output file
    # Loop over the keys of %seq_hash
    #    write the ID-string to the output file
    #    if necessary, add line-feeds into the sequence string at 80-char intervals
    #    write the sequence string to the output file

Maybe you'll need to adjust the spec to cover issues that were not explained in your initial question. In any case, try writing some code, and post that when you have trouble with it.
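For comparison with your own attempt, here is one way the pseudo-code might translate into Perl. It is only a sketch under a big assumption: that identical ID lines imply identical sequences, so the full header line is used as the hash key. The sub names (`read_fasta`, `write_fasta`, `merge_fasta`) are made up, and the sketch stores each sequence directly under its ID as it goes rather than tracking a separate "current sequence" variable.

```perl
use strict;
use warnings;

# Read one FASTA file into %$seq_hash, keyed on the full header line
# (assumption: identical headers imply identical sequences).
sub read_fasta {
    my ($path, $seq_hash) = @_;
    open my $in, '<', $path or die "Cannot open $path: $!";
    my $id;
    while (my $line = <$in>) {
        chomp $line;
        $line =~ s/\r\z//;          # tolerate CRLF line endings
        next unless length $line;   # skip blank lines
        if ($line =~ /^>(.*)/) {
            $id = $1;               # start of a new sequence
            $seq_hash->{$id} = '';
        }
        elsif (defined $id) {
            $seq_hash->{$id} .= $line;   # next chunk of the current sequence
        }
    }
    close $in;
    return;
}

# Write every record back out, re-wrapping sequences at $width characters.
sub write_fasta {
    my ($path, $seq_hash, $width) = @_;
    $width ||= 80;
    open my $out, '>', $path or die "Cannot write $path: $!";
    for my $id (sort keys %$seq_hash) {
        print {$out} ">$id\n";
        my $seq = $seq_hash->{$id};
        print {$out} "$1\n" while $seq =~ /(.{1,$width})/g;
    }
    close $out or die "Cannot close $path: $!";
    return;
}

# Read both inputs into one hash, write the union, return the record count.
sub merge_fasta {
    my ($file1, $file2, $outfile) = @_;
    my %seq_hash;
    read_fasta($_, \%seq_hash) for $file1, $file2;
    write_fasta($outfile, \%seq_hash);
    return scalar keys %seq_hash;
}

# Only runs when exactly three file names are given on the command line.
merge_fasta(@ARGV) if @ARGV == 3;
```

Saved under some name of your choosing, it would be run with two input file names and one output file name as arguments. If the same ID can appear with different sequences, this version silently keeps the one from the later file, which is exactly the kind of issue the spec would need to address.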
Re^2: Comparing Files
by stajich (Chaplain) on Jan 24, 2005 at 19:24 UTC