in reply to Re: how to get clusters?
in thread how to get clusters?

Hello. The point is that the dataset is large and is not restricted to the two example clusters that I mentioned. OK, let me be clearer. I have a dataset that is 900 MB in size, so it will contain thousands and thousands of clusters.
When parsing the file, you have to read the first line. For example, let the first line be:
1 800 816 23
We have to concentrate on the second and third columns: 800 is the hit_start and 816 is the hit_stop, and if the next line has hits lying less than 200basepairs then add them to the first and go on unless and until you could not find any hits with in the 200basepairs gap.
So if you then encounter further hits like
1 802 818 24
1 804 820 32
1 804 820 44
then you have to merge all of these into one cluster ranging from 800 to 820.
In this case your cluster_start would be 800 and your cluster_stop would be 820.
You move on like this, and if there are no more hits within this range, you start a new cluster with a different cluster_start and cluster_stop.
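
One possible reading of the procedure above, sketched in Python. Everything here is an assumption rather than a settled definition: the names `cluster_hits` and `max_gap` are hypothetical, "within 200 base pairs" is taken to mean that a hit's start lies less than 200 past the current cluster_stop, the file is assumed to be sorted by the first column and then by hit_start, and a hit whose stop is smaller than its start is simply flipped.

```python
from typing import Iterable, Iterator, Tuple


def cluster_hits(lines: Iterable[str],
                 max_gap: int = 200) -> Iterator[Tuple[str, int, int, int]]:
    """Yield (seq_id, cluster_start, cluster_stop, n_hits) one cluster at a time.

    Assumed columns: id, hit_start, hit_stop, score (whitespace separated).
    Assumed order: sorted by id, then by hit_start.
    """
    current = None  # [seq_id, cluster_start, cluster_stop, n_hits]
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        seq_id = fields[0]
        # Assumption: a reversed hit (stop < start) is treated as the same range.
        start, stop = sorted((int(fields[1]), int(fields[2])))
        if current and seq_id == current[0] and start - current[2] < max_gap:
            # The hit is close enough: extend the current cluster.
            current[2] = max(current[2], stop)
            current[3] += 1
        else:
            # Gap too large (or new id): emit the finished cluster, start a new one.
            if current:
                yield tuple(current)
            current = [seq_id, start, stop, 1]
    if current:
        yield tuple(current)


sample = [
    "1 800 816 23",
    "1 802 818 24",
    "1 804 820 32",
    "1 804 820 44",
    "1 1500 1516 10",  # made-up extra hit, far enough away to start a new cluster
]
for cluster in cluster_hits(sample):
    print(cluster)
```

Because the function only keeps the current cluster in memory, it could be fed an open file handle directly, so a 900 MB file would not need to be loaded at once.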

Replies are listed 'Best First'.
Re^3: how to get clusters?
by Not_a_Number (Prior) on Jan 04, 2008 at 19:42 UTC

    I can only agree with amarquis. You are still far from clear.

    I appreciate that English is not your native language (here's a friendly hint: sentences start with capital letters) but you're not making it easy for us to help you. In particular, you still haven't replied to the very first question that you were asked in an attempt to get you to clarify your problem:

    What is a bp and how are you calculating it?

    Presuming that 'bp' is the same as 'basepair', such information could possibly help us understand what you mean by, for example:

    and if the next line has hits lying less than 200basepairs then add them to the first and go on unless and until you could not find any hits with in the 200basepairs gap.

    Less than 200 'basepairs' from what? 'Go on' doing what? What is a '200 basepairs gap'?

Re^3: how to get clusters?
by amarquis (Curate) on Jan 04, 2008 at 19:04 UTC

    I'm still not exactly sure how you want the matches picked. In the above example, you've started at a hit that goes from 800-816, and you've matched 802-818, and 804-820. What does it mean to lie less than 200 base pairs away?

Re^3: how to get clusters?
by eric256 (Parson) on Jan 04, 2008 at 20:36 UTC

    That makes sense only for those pairs where the second number is higher. What about the pairs where the second number is lower? Is that a negative length?

    I'm picturing your pairs as ranges on a line. So you have one line, and each row of data represents a range on that line. If I understand correctly, you want to group the rows of data together such that the ranges they represent are all within a 200 range. What do you do with rows of data that are longer than 200 all by themselves, and do you want to group by the centers or the end points somehow? If you treat the start and stop as coordinates it becomes easier, but I wonder if it has any meaning then?

    No matter what, you need to define a formula that gives us the distance between two data points. Then clustering is just a matter of applying an algorithm using that distance function. So how far apart are 802,818 and 804,820? I might be inclined to call the distance the average distance between ends, or (abs(802-804) + abs(818-820)) / 2 = 2. So then the distance between 105,1 and 802,818 is 757, but that doesn't really make a ton of sense either, and you still have the confusion of a pair where the end is before the beginning.
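
    As a rough sketch of that first formula in Python (the function name `endpoint_distance` is made up here, and hits are assumed to be plain (start, stop) tuples):

```python
def endpoint_distance(a, b):
    """Average of the absolute start difference and the absolute stop difference."""
    return (abs(a[0] - b[0]) + abs(a[1] - b[1])) / 2


print(endpoint_distance((802, 818), (804, 820)))  # 2.0
print(endpoint_distance((105, 1), (802, 818)))    # 757.0
```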

    Update: The other option might be to take the distance between the centers: abs( (802 + 818)/2 - (804+820)/2) = 2 apart. That would make (105,1) and (802,818) 757 bp apart.
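
    The center-based variant could be sketched the same way (again, `center_distance` is a hypothetical name and hits are assumed to be (start, stop) tuples). Note that the center of a reversed pair such as (105, 1) is the same as that of (1, 105), so this measure sidesteps the question of direction:

```python
def center_distance(a, b):
    """Absolute distance between the midpoints of two (start, stop) ranges."""
    return abs((a[0] + a[1]) / 2 - (b[0] + b[1]) / 2)


print(center_distance((802, 818), (804, 820)))  # 2.0
print(center_distance((105, 1), (802, 818)))    # 757.0
```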


    ___________
    Eric Hodges