Morten_S has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I run a lot of molecular simulations. I usually rely on native tools, but I have recently started developing my own. I have a system of 32 different molecules and I am interested in determining the number of clusters. So far, my script writes the angle and the distance between each pair of molecules to a txt file, with one value per timestep.

An example angle txt file could be:

143.2

13.4

55.6

.

.

where each new line is the subsequent timestep. The distance files have the same layout. The file names identify the pair of molecules: 1_2.txt holds the relation between molecules 1 and 2, and 1_32.txt the relation between molecules 1 and 32. For two molecules to be part of the same cluster, they need to have an angle of less than 20 degrees and a distance of less than 2 nm. So far I have made a very poor script that determines the molecule relations that pass these conditions and prints all of the relations to a single txt file with the notation:

1_3 1_5 5_7 9_23

1_3 1_5 5_7 9_23 14_23

1_3 1_5 5_7 9_23 14_23 17_20

.

.

Each line represents the subsequent timestep. What I need to do is determine connectivity and cluster sizes. For example, in 1_3 1_5 5_7 9_23 14_23 17_20, molecules 1, 3, 5 and 7 form a single cluster, molecules 9, 14 and 23 form another cluster, and molecules 17 and 20 form a third. What I need is a Perl script that returns something like:

file for number of clusters

3

.

.

file for cluster sizes

4 3 2

.

.

Sorry for the complicated question :) I have no idea how to do this :) Hope you can help.

Best regards

Morten

Replies are listed 'Best First'.
Re: Conditional connectivety
by 1nickt (Canon) on Nov 12, 2017 at 19:12 UTC

    Hi, welcome. It's not a complicated question, it's an unclear one.

    • What is a timestep?
    • Are the files guaranteed to have synchronized timesteps, or are the lines timestamped?
    • What are the rules that define a "cluster"?
    • Why do you want separate output files for clusters and cluster sizes? You'll need another script to read that data.

    Please supply:

    • Data samples with at least 10 - 20 timesteps represented
    • ... inside <code></code> tags as shown on the posting form
    • The code you have now
    • ... and how it does not do what you want (error messages if any)
    • A higher-level description of what you plan to do with the output. There may be a better strategy than the one you've developed so far

    Make your question clearer and you'll get better answers. Thanks!


    The way forward always starts with a minimal test.
Re: Conditional connectivety
by Anonymous Monk on Nov 12, 2017 at 17:58 UTC

    You could use the Graph module for this task. Completely untested:

    use Graph::Undirected;   # also loads Graph
    ...
    sub process_one_line {
        # One line holds the pairs for a single timestep, e.g. "1_3 1_5 5_7"
        my @relations = split /\s+/, shift;
        my $g = Graph::Undirected->new;
        for (@relations) {
            my ($x, $y) = split /_/;
            $g->add_edge($x, $y);
        }
        # Each connected component is one cluster (a list of array refs)
        my @cc = $g->connected_components();
        my $num_clusters = scalar @cc;
        my @cluster_sizes = map { scalar @$_ } @cc;
        return +{ num => $num_clusters, sizes => \@cluster_sizes };
    }
    ...
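If installing the Graph module is not an option, the same connected-components idea can be sketched in core Perl with a union-find structure. This is only a minimal sketch: the names `find_root` and `cluster_sizes` are made up for illustration, and it assumes, as the example in the question does, that molecules which appear in no pair are simply not counted.

```perl
use strict;
use warnings;

# Union-find lookup: return the representative of $x,
# registering $x as its own cluster on first sight.
sub find_root {
    my ($parent, $x) = @_;
    $parent->{$x} = $x unless exists $parent->{$x};
    while ($parent->{$x} ne $x) {
        $parent->{$x} = $parent->{ $parent->{$x} };   # path halving
        $x = $parent->{$x};
    }
    return $x;
}

# Take one line such as "1_3 1_5 5_7" and return the cluster sizes,
# largest first. Call it in list context.
sub cluster_sizes {
    my ($line) = @_;
    my %parent;
    for my $pair (split ' ', $line) {
        my ($x, $y) = split /_/, $pair;
        next unless defined $x && defined $y;
        my $rx = find_root(\%parent, $x);
        my $ry = find_root(\%parent, $y);
        $parent{$rx} = $ry if $rx ne $ry;             # merge the two clusters
    }
    # Count how many molecules share each representative.
    my %size;
    $size{ find_root(\%parent, $_) }++ for keys %parent;
    return sort { $b <=> $a } values %size;
}

my @sizes = cluster_sizes('1_3 1_5 5_7 9_23 14_23 17_20');
print scalar(@sizes), "\n";   # number of clusters
print "@sizes\n";             # cluster sizes
```

For the sample line from the question this prints 3 and then 4 3 2. Calling `cluster_sizes` once per line of the relations file gives both requested outputs for every timestep.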

      Thanks for the suggestion, I will try it out as soon as possible and let you know
Re: Conditional connectivety
by vr (Curate) on Nov 12, 2017 at 23:17 UTC

    I may be reading it wrong, but do "clusters" form and dissolve as they please as time goes?

    So far I have made a very poor script that determines the molecule relations that pass these conditions and prints all of the relations to a single txt file with the notation:
    
    1_3 1_5 5_7 9_23
    
    1_3 1_5 5_7 9_23 14_23
    
    1_3 1_5 5_7 9_23 14_23 17_20
    
    .
    
    .
    
    each line represents the subsequent timestep.
    

    So you mine for clusters per timestep in your 992*2 (?) files, and then you splat this info back into a single line per timestep? Why? Then

    For example, in 1_3 1_5 5_7 9_23 14_23 17_20, molecules 1, 3, 5 and 7 form a single cluster, molecules 9, 14 and 23 form another cluster, and molecules 17 and 20 form a third.
    

    may be true, or as easily not true

      Hi, thanks for the reply.

      At the moment I have calculated the distance and the angle between molecules. This information is stored in 992*2 files. That layout comes from the way I calculated the data, but I could easily store it in a matrix in a single file too. However, my challenge is to translate the analysed data, such as 1_3 1_5 5_7 9_23 14_23 17_20, into the number of clusters and the cluster sizes at each timestep. I don't care whether this is stored in a single file, but these are the two parameters that tell me if a simulation is successful.

      Yep, clusters may form and dissolve over time.

      Hope this helps a bit. I know it is difficult when I have no initial code :)
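Since no initial code was posted, here is a minimal sketch of the filtering step described in the question: a pair counts as connected at a timestep when its angle is below 20 degrees and its distance is below 2 nm. The function name `relation_lines`, the constant names, and the in-memory layout (pair name mapped to an array with one value per timestep) are assumptions made for illustration, and the sample numbers are invented. Reading the per-pair txt files into these hashes is left out.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Thresholds taken from the question; the names are illustrative.
my $MAX_ANGLE_DEG = 20;
my $MAX_DIST_NM   = 2;

# $angle and $dist are hash refs: pair name ("1_3") => array ref of values,
# one element per timestep. Returns one relation line per timestep.
sub relation_lines {
    my ($angle, $dist) = @_;

    # Keep pairs present in both data sets, sorted numerically for stable output.
    my @pairs = sort {
        my ($a1, $a2) = split /_/, $a;
        my ($b1, $b2) = split /_/, $b;
        $a1 <=> $b1 or $a2 <=> $b2
    } grep { exists $dist->{$_} } keys %$angle;

    # One (possibly empty) list of pair names per timestep.
    my $steps = max(0, map { scalar @{ $angle->{$_} } } @pairs);
    my @lines = map { [] } 1 .. $steps;

    for my $p (@pairs) {
        my ($ang, $dst) = ($angle->{$p}, $dist->{$p});
        for my $t (0 .. $#$ang) {
            next unless defined $dst->[$t];   # distance file shorter than angle file
            push @{ $lines[$t] }, $p
                if $ang->[$t] < $MAX_ANGLE_DEG && $dst->[$t] < $MAX_DIST_NM;
        }
    }
    return map { join ' ', @$_ } @lines;
}

# Made-up sample data: two pairs, three timesteps.
my %angle = ('1_3' => [143.2, 13.4, 5.0], '5_7' => [10.0, 12.0, 55.6]);
my %dist  = ('1_3' => [1.5,   1.2,  2.5], '5_7' => [0.8,  3.0,  1.0]);
print "$_\n" for relation_lines(\%angle, \%dist);
```

A timestep in which no pair passes both conditions yields an empty line, so the line number in the output still matches the timestep.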