As originally posted (with no Perl code of your own, and no sample data), your question isn't quite clear and lacks important details. I would assume that you don't intend to append all the contents of the 2nd file to the 1st one, but only those records of the 2nd file that don't occur in the 1st.

In general, comparing two files and producing a union of the two as output is pretty simple. If there are special conditions involved in comparing FASTA data, then you'd have to know (or make it clear to us) what those are.

The basic approach is to store the contents of the two files into a single hash, then print out the contents of the hash.
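
To make the idea concrete, here is a minimal sketch of a hash-based union. It uses plain one-line records held in two arrays (`@file1` and `@file2` are made-up stand-ins for file contents, not your data), so it leaves out everything specific to FASTA:

```perl
use strict;
use warnings;

# Hypothetical in-memory "files"; real code would read these from disk.
my @file1 = ( 'alpha', 'beta', 'gamma' );
my @file2 = ( 'beta',  'delta' );

# Every record becomes a hash key, so a record seen twice is stored once.
my %seen;
$seen{$_} = 1 for @file1, @file2;

# Hash order is not predictable, so sort the keys for stable output.
my @union = sort keys %seen;
print "$_\n" for @union;
```

The record 'beta' occurs in both inputs but appears only once in the output, which is the whole trick.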

Based on googling "FASTA", I gather this is a simple text file format with one or more blocks of data structured as follows:

>SEQUENCE-ID-STRING commentary string
SEQUENCE_LETTERS_AT_MOST_80_PER_LINE...

So if a line of text starts with ">", it contains an ID string plus optional commentary, and defines the start of a new sequence string. Each line that starts with some letter is a successive chunk of a sequence string.

It's not clear whether the ID-strings are of any use in comparing different files. Are they? If you see the same ID-string in two different files, can you be confident that the sequence strings that follow it in each file are identical? If so, you can use the ID-strings as hash keys and sequence strings as hash values when you read the two files; if not, you'll need to use the sequence strings as hash keys, and can save the ID-strings as hash values.
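
Here is a small sketch of how the two keying choices behave. The records are invented for illustration, and I'm assuming the ID is the first word after the ">" (which may or may not match your data). It needs perl 5.10 or later for the `//=` operator:

```perl
use strict;
use warnings;

# Hypothetical records as [ header, sequence ] pairs, as if already parsed.
my @records = (
    [ '>seq1 from file1', 'ACGT' ],
    [ '>seq2 from file1', 'GGCC' ],
    [ '>seq1 from file2', 'ACGT' ],    # same ID, same sequence
    [ '>seq3 from file2', 'GGCC' ],    # new ID, sequence already seen
);

my ( %by_id, %by_seq );
for my $rec (@records) {
    my ( $header, $seq ) = @$rec;
    my ($id) = $header =~ /^>(\S+)/;    # assumed: ID = first word after ">"

    # Option 1: trust the IDs. The first occurrence of an ID is kept.
    $by_id{$id} //= $seq;

    # Option 2: trust only the sequences. The first header seen is kept.
    $by_seq{$seq} //= $header;
}

printf "keyed by ID: %d records\n",       scalar keys %by_id;
printf "keyed by sequence: %d records\n", scalar keys %by_seq;
```

Keyed by ID, the four records collapse to three; keyed by sequence, they collapse to two. Which of those counts is "right" depends on what the ID-strings mean in your data.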

See whether the following pseudo-code matches what you intend to do -- if so, you should be able to write working Perl code to implement it:

# require that there be three file name args on the command line:
#  -- input file1
#  -- input file2
#  -- output file

# for each input file:
#   open it
#   read one line at a time
#     chomp the line to remove final line-feed
#     if line starts with ">", it's the start of a new sequence, so:
#        if there was a sequence before this one, save it in %seq_hash
#        set current sequence to an empty string

#     otherwise, line starts with a letter, so:
#        append contents of line to current sequence

# After reaching the end of the second file, save current sequence in %seq_hash

# open the output file
# Loop over the keys of %seq_hash
#    write the ID-string to the output file
#    if necessary, add line-feeds into the sequence string at 80-char intervals
#    write the sequence string to the output file
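
If it helps to see one possible rendering of that pseudo-code, here is a rough sketch, not a finished tool. It rests on assumptions you would need to check: the ID is the first word after ">", equal IDs mean equal sequences (so a repeated ID is skipped), and the sub names `read_fasta` and `write_fasta` are just names I made up. It also stores each record when its header line is read, instead of waiting for the next header, which comes to the same thing. To keep it self-contained it reads from in-memory strings rather than from files named on the command line:

```perl
use strict;
use warnings;

# Read FASTA records from $fh into %$seq_hash, keyed by ID.
# Assumption: the ID is the first word after ">".
sub read_fasta {
    my ( $fh, $seq_hash ) = @_;
    my $current;    # record being filled in, or undef while skipping
    while ( my $line = <$fh> ) {
        chomp $line;
        next unless length $line;
        if ( $line =~ /^>(\S*)/ ) {
            my $id = $1;
            if ( exists $seq_hash->{$id} ) {
                $current = undef;    # ID already stored: skip this record
            }
            else {
                $current = { header => $line, seq => '' };
                $seq_hash->{$id} = $current;
            }
        }
        elsif ( defined $current ) {
            $current->{seq} .= $line;    # next chunk of the sequence
        }
    }
    return;
}

# Write every stored record to $fh, wrapping sequences at 80 characters.
sub write_fasta {
    my ( $fh, $seq_hash ) = @_;
    for my $id ( sort keys %$seq_hash ) {
        my $rec = $seq_hash->{$id};
        ( my $wrapped = $rec->{seq} ) =~ s/(.{1,80})/$1\n/g;
        print {$fh} "$rec->{header}\n$wrapped";
    }
    return;
}

# Made-up sample data standing in for the two input files.
my $file1 = ">seq1 first\nACGT\nTTAA\n>seq2 second\nGGCC\n";
my $file2 = ">seq2 second again\nGGCC\n>seq3 third\nCCGG\n";

my %seq_hash;
for my $text ( $file1, $file2 ) {
    open my $in, '<', \$text or die "Cannot open in-memory input: $!";
    read_fasta( $in, \%seq_hash );
    close $in;
}

my $merged = '';
open my $out, '>', \$merged or die "Cannot open in-memory output: $!";
write_fasta( $out, \%seq_hash );
close $out;
print $merged;
```

To use real files, open each name from @ARGV for reading in place of the in-memory strings, and open the third name for writing in place of `\$merged`. If the ID-strings turn out not to be trustworthy, the hash would need to be keyed on the sequence instead, which means collecting the whole sequence before deciding whether to store it.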

Maybe you'll need to adjust the spec to cover issues that were not explained in your initial question. In any case, try writing some code, and post that when you have trouble with it.

In reply to Re: Comparing Files by graff
in thread Comparing Files by FarTech
