usless_blonde has asked for the wisdom of the Perl Monks concerning the following question:

hello
I need help comparing all of my .seq files to each other. What kind of loop would I need? I want to do a pairwise alignment and discard any sequences that are too closely related. I'll use the bl2seq BioPerl module.
Would it be better if I read each sequence into a hash, so that I'd know the name of each sequence?
ta!

Re: looking at everything
by dragonchild (Archbishop) on Oct 02, 2003 at 13:57 UTC
    You haven't posted any code, so we can't help by critiquing it. However, it sounds like you're having pre-coding issues - namely, design.

    The basic algorithm you're looking for is:

    1. Get a list of the things you want to compare.
    2. Iterate through that list, one at a time.
    3. Within that loop, iterate through the list again. Make sure to skip the one you picked in the outer loop.
    4. Do your compare.

    my @list_of_stuff = get_my_list();

    # Compare every item against every other item.
    foreach my $i (0 .. $#list_of_stuff) {
        foreach my $j (0 .. $#list_of_stuff) {
            next if $i == $j;    # skip comparing an item to itself
            do_compare($list_of_stuff[$i], $list_of_stuff[$j]);
        }
    }

    This algorithm will be very slow, especially if you're comparing more than 15-20 things. Remember, you're doing N * (N - 1) comparisons. So:

    Things    Comparisons
         2              2
         3              6
         4             12
         5             20
        10             90
        15            210
        20            380
        25            600
        50           2450
        75           5550
       100           9900
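    One easy saving before anything else: if your comparison is symmetric - and a pairwise alignment score usually doesn't care which sequence comes first - you only need to visit each unordered pair once, which cuts the count to N * (N - 1) / 2. A minimal sketch, reusing the names from above:

        # If do_compare($a, $b) answers the same question as do_compare($b, $a),
        # start the inner loop just past the outer index so each unordered
        # pair is compared exactly once.
        foreach my $i (0 .. $#list_of_stuff - 1) {
            foreach my $j ($i + 1 .. $#list_of_stuff) {
                do_compare($list_of_stuff[$i], $list_of_stuff[$j]);
            }
        }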

    It might be useful to do this kind of comparison on subsets of your data, then look at comparing typical items from each subset. So, if you could break 100 items down into 10 subsets of 10, compare within each subset, then compare the typical item from each subset against the others, you reduce 9900 comparisons to 990 (10 subsets times 90 within-subset comparisons, plus 90 comparisons among the 10 representatives). That's a 90% savings in time - both for the computer and for you as the user. (Remember, you are the one that has to deal with these comparisons.)
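    Here's a rough, untested sketch of that bucketing idea, building on @list_of_stuff and do_compare from above. Picking the first item of each subset as the "typical" one is just a placeholder - choosing a genuinely representative sequence is the part you'd have to supply:

        my $bucket_size = 10;
        my @buckets;
        push @{ $buckets[ int($_ / $bucket_size) ] }, $list_of_stuff[$_]
            for 0 .. $#list_of_stuff;

        # Compare within each subset...
        for my $bucket (@buckets) {
            for my $i (0 .. $#$bucket - 1) {
                for my $j ($i + 1 .. $#$bucket) {
                    do_compare($bucket->[$i], $bucket->[$j]);
                }
            }
        }

        # ...then compare one representative per subset against the others.
        my @reps = map { $_->[0] } @buckets;    # placeholder "typical" item
        for my $i (0 .. $#reps - 1) {
            for my $j ($i + 1 .. $#reps) {
                do_compare($reps[$i], $reps[$j]);
            }
        }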

    Of course, using another program to wade through the comparisons and discard the uninteresting ones can also be handy. I've done that many times. Where I work right now, we have a process that generates a set of logs. I have several scripts that do analysis on those logs and double-check the process's work. I even have a script that analyzes the results of the log analyzers. :-)

    As to your second question - depending on the size of the things you're working with, you might not have enough memory to read everything in. Often, keeping those things on disk and reading them in when you want to deal with them is the appropriate thing to do. You might have to read things in over and over, but that's ok.
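    For instance, a minimal sketch of the read-on-demand approach, assuming the .seq files sit in the current directory and that do_compare stands in for your real bl2seq call:

        my @files = glob '*.seq';    # the filename doubles as the sequence's name

        # Slurp a single file from disk only when its contents are needed.
        sub read_seq {
            my ($file) = @_;
            open my $fh, '<', $file or die "Can't open $file: $!";
            local $/;                # read the whole file in one go
            my $seq = <$fh>;
            close $fh;
            return $seq;
        }

        for my $i (0 .. $#files - 1) {
            for my $j ($i + 1 .. $#files) {
                do_compare(read_seq($files[$i]), read_seq($files[$j]));
            }
        }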

    ------
    We are the carpenters and bricklayers of the Information Age.

    The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

    Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.