sirna has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I am working on a dataset that contains 4 columns: chromosom_id, fstart, fstop, counts. The data looks something like this:
Chromosom_id  fstart  fstop  Count
1             105     1      14.5
1             105     1      14.5
1             105     1      14.5
1             813     797    4
1             813     797    22
1             813     797    4
1             813     797    22
1             800     816    23
1             802     818    24
1             804     820    32
1             804     820    44
1             813     797    4
1             813     797    22
I would like to get a cluster that contains the hits that are less than 200bp in length. So in this case the first three hits should be made into one cluster and the rest should be made into another cluster. Could I get suggestions from you guys? So far I have only succeeded in parsing the file.
Thanks

Replies are listed 'Best First'.
Re: how to get clusters?
by eric256 (Parson) on Jan 04, 2008 at 14:57 UTC

    What is a bp and how are you calculating it? For clustering I would start with one cluster, and add each point to it. Then check to see if the cluster is too scattered, if so use some algorithm to divide it into two clusters, then add the next point to the closest cluster, but I think I'm not understanding your question.


    ___________
    Eric Hodges
Re: how to get clusters?
by apl (Monsignor) on Jan 04, 2008 at 15:18 UTC
    The answer depends on your definitions of cluster, hits and bp.

    Like eric256, I don't believe I understand your question.

      hi, the hits are the matches of small RNAs against the genome (their coordinates, start and stop, are in a file; example: 105, 1). the clusters are groupings of these hits into one big cluster, if only the other cluster is 200 basepairs away from the initial hit. as i mentioned in the previous post, the first three should be made into one cluster, i.e. the coordinates of the first cluster would be 1, 105.

        You still haven't provided enough context free of jargon from your problem domain to let people clearly answer your question. Reading I know what I mean. Why don't you? and How (Not) To Ask A Question would probably be helpful at this point. Don't make people drag everything out of you tooth and nail in order to give you help.

        Update: Not to mention given the apparent problem domain if this is a common file format it's likely BioPerl has an off the shelf module to do this already . . .

        The cake is a lie.
        The cake is a lie.
        The cake is a lie.

        You're still making a lot of assumptions. Is 105,1 == 1,105? How do you determine the distance in basepairs between clusters? How is a cluster represented when it is displayed?

        I suspect that if you could explain how to solve the problem you'd be able to write the program yourself.

Re: how to get clusters?
by Old_Gray_Bear (Bishop) on Jan 04, 2008 at 20:24 UTC
    Caveat: I Am Not A Biologist.

    From the discussion, it sounds like you may be interested in the module Bio::Cluster::UniGene, particularly the Cluster Methods (cluster_score() looked promising). This also sounds like something the BioPerl mailing list (bioperl-l@bioperl.org) might have more information on.

    ----
    I Go Back to Sleep, Now.

    OGB

Re: how to get clusters?
by jdporter (Paladin) on Jan 04, 2008 at 15:35 UTC
Re: how to get clusters?
by starX (Chaplain) on Jan 04, 2008 at 16:44 UTC
    I think I get what you mean. A simple way of doing it would be to group them into two different arrays of arrays. Like so...
    #!/usr/bin/perl
    use strict;
    use warnings;

    my (@cluster1, @cluster2);

    # Three-argument open with an error check, so a missing file is reported.
    open my $fh, '<', 'dataset.txt' or die "Cannot open dataset.txt: $!";
    while (<$fh>) {
        my @cols = split ' ';
        next unless @cols;                 # skip blank lines
        if ($cols[0] =~ m/\d/) {           # skip the header row
            if ($cols[2] < 200) {
                push @cluster1, [@cols];
            }
            else {
                push @cluster2, [@cols];
            }
        }
    }
    close $fh;

    # simple test to make sure it worked.
    foreach my $row (@cluster1) {
        print "@$row\n";
    }
      hello, but the point is that the dataset is large and it's not restricted to the two example clusters that i mentioned. ok, let me be more clear. I have a dataset that is 900MB, so it will have thousands and thousands of clusters.
      when parsing through the file, you have to read the first line; for example, let the first line be:
      1 800 816 23
      and we have to concentrate on the second and third columns. the 800 is the hit_start and 816 is the hit_stop. and if the next line has hits lying less than 200basepairs then add them to the first and go on unless and until you could not find any hits with in the 200basepairs gap.
      so if you then encounter further hits like
      1 802 818 24
      1 804 820 32
      1 804 820 44
      then you have to make all these into one cluster ranging from 800-820.
      and in this case your cluster_start would be 800 and your cluster_stop would be 820
      like this you have to move on and on. and if there aren't any hits within this range then you have to start creating the next cluster, with a different cluster_start and cluster_stop.

        I can only agree with amarquis. You are still far from clear.

        I appreciate that English is not your native language (here's a friendly hint: sentences start with capital letters) but you're not making it easy for us to help you. In particular, you still haven't replied to the very first question that you were asked in an attempt to get you to clarify your problem:

        What is a bp and how are you calculating it?

        Presuming that 'bp' is the same as 'basepair', such information could possibly help us understand what you mean by, for example:

        and if the next line has hits lying less than 200basepairs then add them to the first and go on unless and until you could not find any hits with in the 200basepairs gap.

        Less than 200 'basepairs' from what? 'Go on' doing what? What is a '200 basepairs gap'?

        That makes sense only for those pairs where the second number is higher. What about those pairs where the second number is lower? Is that a negative length?

        I'm picturing your pairs as ranges on a line. So you have one line, and each row represents a range on that line. If I understand correctly, you want to group the rows of data together such that the ranges they represent are all within a 200 range. What do you do with rows of data that are longer than 200 all by themselves, and do you want to group by the centers or the end points somehow? If you treat the start and stop as coordinates it becomes easier, but I wonder if it has any meaning then?

        No matter what, you need to define a formula that gives us the distance between two data points. Then clustering is just a matter of applying an algorithm using that distance function. So how far apart are 802,818 and 804,820? I might be inclined to call the distance the average distance between ends, or (abs(802-804) + abs(818-820)) / 2 = 2. So then the distance between 105,1 and 802,818 is 757, but that doesn't really make a ton of sense either, and you still have the confusion of a pair where the end is before the beginning.

        Update: The other option might be to take the distance between the centers: abs( (802 + 818)/2 - (804+820)/2 ) = 2 apart. That would make (105,1) and (802,818) 757 bp apart.
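
        The two candidate distance functions can be written out as a small Perl sketch. The names end_distance and center_distance are made up for illustration, and each hit is assumed to be a [start, stop] pair in which either end may be the larger one. Neither function is claimed to be the biologically meaningful measure; they only make the arithmetic easy to check.

```perl
use strict;
use warnings;

# Each hit is a pair [start, stop]; either end may be the larger one.
# Both function names are hypothetical, chosen for this illustration.

# Average of the distances between corresponding ends.
sub end_distance {
    my ($p, $q) = @_;
    return (abs($p->[0] - $q->[0]) + abs($p->[1] - $q->[1])) / 2;
}

# Distance between the midpoints of the two ranges.
sub center_distance {
    my ($p, $q) = @_;
    return abs(($p->[0] + $p->[1]) / 2 - ($q->[0] + $q->[1]) / 2);
}

print end_distance([802, 818], [804, 820]), "\n";      # 2
print center_distance([802, 818], [804, 820]), "\n";   # 2
print end_distance([105, 1], [802, 818]), "\n";        # 757
print center_distance([105, 1], [802, 818]), "\n";     # 757
```

        For these sample pairs both measures happen to agree. They differ once the two ranges have different lengths.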


        ___________
        Eric Hodges

        I'm still not exactly sure how you want the matches picked. In the above example, you've started at a hit that goes from 800-816, and you've matched 802-818, and 804-820. What does it mean to lie less than 200 base pairs away?
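
        One possible reading, as a minimal sketch only: assume 105,1 means the same span as 1,105, and assume "less than 200 base pairs away" means that the next hit starts less than 200 positions past the current cluster's stop. Under those assumptions, sorting the hits by start and merging them in one pass yields the clusters. The function name cluster_hits and the output layout are invented for this example; the original poster has not confirmed either assumption.

```perl
use strict;
use warnings;

# Assumption: a hit joins the current cluster when its start lies less than
# $max_gap positions past the cluster's current stop.
# Takes rows of [chrom, start, stop, count]; returns a list of hashes with
# the keys chrom, start, stop and hits (number of rows merged).
sub cluster_hits {
    my ($max_gap, @rows) = @_;

    # Normalize so that start <= stop (assumes 105,1 is the span 1..105).
    my @hits = map {
        my ($chrom, $s, $e, $count) = @$_;
        ($s, $e) = ($e, $s) if $s > $e;
        [$chrom, $s, $e, $count];
    } @rows;

    # Sort by chromosome id (as a string), then by start coordinate.
    @hits = sort { $a->[0] cmp $b->[0] or $a->[1] <=> $b->[1] } @hits;

    my @clusters;
    for my $hit (@hits) {
        my ($chrom, $s, $e) = @$hit;
        my $cur = $clusters[-1];    # undef while no cluster exists yet
        if ($cur and $cur->{chrom} eq $chrom and $s - $cur->{stop} < $max_gap) {
            # Close enough: extend the current cluster.
            $cur->{stop} = $e if $e > $cur->{stop};
            $cur->{hits}++;
        }
        else {
            # Too far away, or a new chromosome: start the next cluster.
            push @clusters, { chrom => $chrom, start => $s, stop => $e, hits => 1 };
        }
    }
    return @clusters;
}

# The sample rows from the original question.
my @rows = (
    [1, 105, 1, 14.5], [1, 105, 1, 14.5], [1, 105, 1, 14.5],
    [1, 813, 797, 4],  [1, 813, 797, 22], [1, 813, 797, 4],
    [1, 813, 797, 22], [1, 800, 816, 23], [1, 802, 818, 24],
    [1, 804, 820, 32], [1, 804, 820, 44], [1, 813, 797, 4],
    [1, 813, 797, 22],
);

for my $c (cluster_hits(200, @rows)) {
    print join("\t", @$c{qw(chrom start stop hits)}), "\n";
}
# Prints:
# 1   1     105   3
# 1   797   820   10
```

        The sketch sorts everything in memory, which is fine for the sample but may not suit a 900MB file. If the file is already sorted by chromosome and start, the same merge step could run line by line while reading, keeping only the current cluster in memory.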