comment on

UPDATE FIRST:

I reworked this program and significantly improved performance. There were some mysterious discrepancies in the result set between the old version and the new on one run, but I believe I have those 'figured out.'

Partial profiles of new and old follow. I am cautiously considering this a success:

First version of program:

time elapsed (wall):   1473.9343
time running program:  1473.2193  (99.95%)
time profiling (est.): 0.7150  (0.05%)
number of calls:       59722

%Time    Sec.     #calls   sec/call  F  name
92.33 1360.2230     2427   0.560454     DBI::st::execute
 3.64   53.5727     2027   0.026430     main::process_x
 3.58   52.7029     2007   0.026260     main::process_y
 0.15    2.2193        1   2.219282     Term::ReadKey::ReadLine
 0.10    1.4189        0   1.418933  *  <other>
 0.06    0.8885    24294   0.000037     DBI::st::fetchrow_array
[download]

Revised program:

time elapsed (wall):   408.6156
time running program:  408.2747  (99.92%)
time profiling (est.): 0.3409  (0.08%)
number of calls:       32883

%Time    Sec.     #calls   sec/call  F  name
70.21  286.6553      510   0.562069     DBI::st::execute
24.20   98.7912     4034   0.024490     main::process
 4.79   19.5629        1  19.562895     Term::ReadKey::ReadLine
 0.27    1.1126        0   1.112580  *  <other>
 0.16    0.6666    20460   0.000033     DBI::st::fetchrow_array
[download]

NOW ON TO THE ORIGINAL POST ...

Good morning Monks -

The poet Charles Olson once wrote, memorably:

I have had to learn the simplest things
last. Which made for difficulties.

This kind of sums up my situation vis-a-vis Perl, I think. I have been flummoxed for the past few days: my lack of substantive CS background has (once again) been chewing a hole in my ... er, back.

This post is in a sense a followup to my earlier post about profiling, and yet isn't about DBI at all, but more about data structures.

I have found that I can essentially grab ALL the data I need to process (for the task outlined in the previous post) with ONE database call per line of input. What comes down from that series of calls looks like this:

21 DET-2 896.657564735788 678.83860967799
21 DET-3 32.0939023018969 621.656550474314
21 DET-3 42.0741462550974 834.842294892622
21 DET-3 218.814294809857 450.606540154849
21 DET-3 228.88830316475 625.939190221948
21 DET-3 630.472705847461 220.839350101088
21 DET-5 152.988115061449 156.31861287082
21 DET-5 730.997702224652 507.421683707195
21 DET-6 506.364456847517 587.275663167673
21 DET-6 573.109998216762 116.126667780714
21 DET-6 885.306844616344 411.352928714465
21 DET-6 959.150025915228 845.316911114704
21 DET-7 62.7170088137102 593.424801945024
21 DET-7 110.245168119381 788.219885220784
21 DET-7 159.254569896235 386.365906980404
21 DET-7 377.53529067825 163.659365696494
21 DET-7 736.734267414092 129.235251032426
21 DET-7 836.081539763363 401.860540038111
21 DET-8 736.566372536132 247.410290038796
47 DET-7 189.488040387042 500.316501378612
47 DET-7 251.972954527148 519.649226713148
71 DET-7 188.133043154801 499.94217650742
71 DET-7 251.06636137579 519.007465693828
88 DET-0 0.70684189743067 391.883292824418
88 DET-0 114.871177986263 212.959076023136
88 DET-0 219.421725079137 710.314439572696
88 DET-0 257.837516726887 594.376577764894
88 DET-1 119.630462310966 260.433234269099
...
[download]

In each line, the first value is an "observation number," the second a "detector number" and the third and fourth values are the x and y coordinates of actual "hits" on the detectors.

I have edited some in this sample of the roughly 19,000 lines but wanted to leave enough to show that:

There are multiple lines where the first item (the "observation number") is the same;
For each observation number, there are multiple lines where the second item (the "detector number") is the same.

So I have been facing the roaring Godzilla that is my lack of experience with data structures, and trying to figure out what might be the best structure I could put this in for processing ...

My first attempt was a hash of arrays, which yielded something like this ...

21
=>
 DET-2, 896.657564735788, 678.83860967799, DET-3, 32.0939023018969, 62
+1.656550474314, DET-3, 42.0741462550974, 834.842294892622, DET-3, 87.
+5412177704422, 684.850417188863, DET-3, 92.9823463716063, 216.3390205
+94075, DET-3, 175.151394732114, 525.441189179707, DET-3, 218.81429480
+9857, 450.606540154849, DET-3, 228.88830316475, 625.939190221948, DET
+-3, 630.472705847461, 220.839350101088, DET-5, 152.988115061449, 156.
+31861287082, DET-5, 730.997702224652, 507.421683707195, DET-6, 784.60
+8063532865, 688.699410601935, DET-6, 885.306844616344, 411.3529287144
+65, DET-6, 959.150025915228, 845.316911114704, DET-7, 62.717008813710
+2, 593.424801945024, 

47
=>
 DET-7, 189.488040387042, 500.316501378612, DET-7, 251.972954527148, 5
+19.649226713148,

71
=>
 DET-7, 188.133043154801, 499.94217650742, DET-7, 251.06636137579, 519
+.007465693828,
[download]

... note: this data may not quite agree with that above, I am cutting for clarity and this is mostly for illustration purposes.

But it at least looks like this is not processed enough, because of those repeated "DET" values, and that what I really want is to "deepen" the structure one more level, to "pull out" as it were the detector numbers. And it is here that I get stuck, both in terms of "what would be best" and "how do I do that?"

Even perldsc only goes so far in terms of complexity.

At first I thought "it must be a hash of hashes of arrays that I want," and I uncovered this node showing how to create such a thing. BUT to be quite honest, I didn't or couldn't or can't or currently am not able to truly grok the solutions presented at that node. And IS this the best structure for me?

So my questions, I am afraid, are three, which is perhaps a function of the lack of clarity in my thinking:

Given that I want to study the distribution of these x and y points on each detector in each observation, what would be the best structure for this data?
What do I study so I can better see such things (e.g. how do I get there from here)? Are there general books on data structures, or will this knowledge just come with experience?
Is it okay (I know, "according to whom?") to utilize solutions/tools/schemata one does not yet truly understand, and hope for enlightenment to come later?
There is no number 4. Note I am studiedly avoiding asking "so, now do I create the structure suggested by (1)?"

Apologies for the length of this post. I hope there is something of interest in it. I am, once again, feeling stuck and frustrated. I know its no one's responsibility to help me out of my thought ditch, but if anyone has any maps to recommend, I would be grateful.

Regards,

An extremely humble Monk

In reply to structuring data: aka walk first, grok later by chexmix

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.