structuring data: aka walk first, grok later

by chexmix (Hermit)
on May 30, 2008 at 16:09 UTC

chexmix has asked for the wisdom of the Perl Monks concerning the following question:

UPDATE FIRST:

I reworked this program and significantly improved performance. There were some mysterious discrepancies in the result set between the old version and the new on one run, but I believe I have those 'figured out.'

Partial profiles of new and old follow. I am cautiously considering this a success:

First version of program:

time elapsed (wall):   1473.9343
time running program:  1473.2193  (99.95%)
time profiling (est.):    0.7150  (0.05%)
number of calls:       59722

%Time       Sec.  #calls   sec/call  F  name
92.33  1360.2230    2427   0.560454     DBI::st::execute
 3.64    53.5727    2027   0.026430     main::process_x
 3.58    52.7029    2007   0.026260     main::process_y
 0.15     2.2193       1   2.219282     Term::ReadKey::ReadLine
 0.10     1.4189       0   1.418933  *  <other>
 0.06     0.8885   24294   0.000037     DBI::st::fetchrow_array

Revised program:

time elapsed (wall):    408.6156
time running program:   408.2747  (99.92%)
time profiling (est.):    0.3409  (0.08%)
number of calls:       32883

%Time      Sec.  #calls    sec/call  F  name
70.21  286.6553     510    0.562069     DBI::st::execute
24.20   98.7912    4034    0.024490     main::process
 4.79   19.5629       1   19.562895     Term::ReadKey::ReadLine
 0.27    1.1126       0    1.112580  *  <other>
 0.16    0.6666   20460    0.000033     DBI::st::fetchrow_array

NOW ON TO THE ORIGINAL POST ...

Good morning Monks -

The poet Charles Olson once wrote, memorably:

I have had to learn the simplest things
last. Which made for difficulties.

This kind of sums up my situation vis-a-vis Perl, I think. I have been flummoxed for the past few days: my lack of substantive CS background has (once again) been chewing a hole in my ... er, back.

This post is in a sense a followup to my earlier post about profiling, and yet isn't about DBI at all, but more about data structures.

I have found that I can essentially grab ALL the data I need to process (for the task outlined in the previous post) with ONE database call per line of input. What comes down from that series of calls looks like this:

21 DET-2 896.657564735788 678.83860967799
21 DET-3 32.0939023018969 621.656550474314
21 DET-3 42.0741462550974 834.842294892622
21 DET-3 218.814294809857 450.606540154849
21 DET-3 228.88830316475 625.939190221948
21 DET-3 630.472705847461 220.839350101088
21 DET-5 152.988115061449 156.31861287082
21 DET-5 730.997702224652 507.421683707195
21 DET-6 506.364456847517 587.275663167673
21 DET-6 573.109998216762 116.126667780714
21 DET-6 885.306844616344 411.352928714465
21 DET-6 959.150025915228 845.316911114704
21 DET-7 62.7170088137102 593.424801945024
21 DET-7 110.245168119381 788.219885220784
21 DET-7 159.254569896235 386.365906980404
21 DET-7 377.53529067825 163.659365696494
21 DET-7 736.734267414092 129.235251032426
21 DET-7 836.081539763363 401.860540038111
21 DET-8 736.566372536132 247.410290038796
47 DET-7 189.488040387042 500.316501378612
47 DET-7 251.972954527148 519.649226713148
71 DET-7 188.133043154801 499.94217650742
71 DET-7 251.06636137579 519.007465693828
88 DET-0 0.70684189743067 391.883292824418
88 DET-0 114.871177986263 212.959076023136
88 DET-0 219.421725079137 710.314439572696
88 DET-0 257.837516726887 594.376577764894
88 DET-1 119.630462310966 260.433234269099
...

In each line, the first value is an "observation number," the second a "detector number" and the third and fourth values are the x and y coordinates of actual "hits" on the detectors.

I have trimmed this sample down from the roughly 19,000 lines, but left enough to show that:

  1. There are multiple lines where the first item (the "observation number") is the same;
  2. For each observation number, there are multiple lines where the second item (the "detector number") is the same.
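
(For concreteness, each line splits cleanly on whitespace into four fields - e.g.:)

    my ($obs, $det, $x, $y) =
        split ' ', '21 DET-2 896.657564735788 678.83860967799';
    # $obs = 21, $det = 'DET-2', $x and $y are the hit coordinates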

So I have been facing the roaring Godzilla that is my lack of experience with data structures, and trying to figure out what might be the best structure I could put this in for processing ...

My first attempt was a hash of arrays, which yielded something like this ...

21 => DET-2, 896.657564735788, 678.83860967799, DET-3, 32.0939023018969, 621.656550474314,
      DET-3, 42.0741462550974, 834.842294892622, DET-3, 87.5412177704422, 684.850417188863,
      DET-3, 92.9823463716063, 216.339020594075, DET-3, 175.151394732114, 525.441189179707,
      DET-3, 218.814294809857, 450.606540154849, DET-3, 228.88830316475, 625.939190221948,
      DET-3, 630.472705847461, 220.839350101088, DET-5, 152.988115061449, 156.31861287082,
      DET-5, 730.997702224652, 507.421683707195, DET-6, 784.608063532865, 688.699410601935,
      DET-6, 885.306844616344, 411.352928714465, DET-6, 959.150025915228, 845.316911114704,
      DET-7, 62.7170088137102, 593.424801945024,
47 => DET-7, 189.488040387042, 500.316501378612, DET-7, 251.972954527148, 519.649226713148,
71 => DET-7, 188.133043154801, 499.94217650742, DET-7, 251.06636137579, 519.007465693828,

... note: this data may not quite agree with the sample above; I am cutting it for clarity, and it is mostly for illustration purposes.

But it looks like this is not processed enough, because of those repeated "DET" values: what I really want is to "deepen" the structure one more level, pulling out the detector numbers as keys of their own. And it is here that I get stuck, both in terms of "what would be best" and "how do I do that?"

Even perldsc only goes so far in terms of complexity.

At first I thought "it must be a hash of hashes of arrays that I want," and I uncovered this node showing how to create such a thing. BUT to be quite honest, I didn't or couldn't or can't or currently am not able to truly grok the solutions presented at that node. And IS this the best structure for me?
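
To make that concrete, here is roughly the shape I imagine a hash of hashes of arrays would give me, hand-built from a few values in the sample above just to show the nesting:

    my %data = (
        21 => {
            'DET-2' => [ 896.657564735788, 678.83860967799 ],
            'DET-3' => [ 32.0939023018969, 621.656550474314,
                         42.0741462550974, 834.842294892622 ],
        },
        47 => {
            'DET-7' => [ 189.488040387042, 500.316501378612,
                         251.972954527148, 519.649226713148 ],
        },
    );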

So my questions, I am afraid, are three, which is perhaps a function of the lack of clarity in my thinking:

  1. Given that I want to study the distribution of these x and y points on each detector in each observation, what would be the best structure for this data?
  2. What do I study so I can better see such things (e.g. how do I get there from here)? Are there general books on data structures, or will this knowledge just come with experience?
  3. Is it okay (I know, "according to whom?") to utilize solutions/tools/schemata one does not yet truly understand, and hope for enlightenment to come later?
  4. There is no number 4. Note I am studiedly avoiding asking "so, how do I create the structure suggested by (1)?"

Apologies for the length of this post. I hope there is something of interest in it. I am, once again, feeling stuck and frustrated. I know it's no one's responsibility to help me out of my thought ditch, but if anyone has any maps to recommend, I would be grateful.

Regards,

An extremely humble Monk

Replies are listed 'Best First'.
Re: structuring data: aka walk first, grok later
by FunkyMonk (Chancellor) on May 30, 2008 at 16:39 UTC
    1. Given that I want to study the distribution of these x and y points on each detector in each observation, what would be the best structure for this data?
    I'd suggest a HoHoAoH (see the spoiler below) :-)
    2. What do I study so I can better see such things (e.g. how do I get there from here)? Are there general books on data structures, or will this knowledge just come with experience?
    There are books on data structures, but real knowledge only comes from experience.
    3. Is it okay (I know, "according to whom?") to utilize solutions/tools/schemata one does not yet truly understand, and hope for enlightenment to come later?
    That's a decision for you to make. But without the understanding, how do you expect to make even trivial changes to the code?
    4. There is no number 4. Note I am studiedly avoiding asking "so, how do I create the structure suggested by (1)?"

    Unless I state otherwise, all my code runs with strict and warnings
Re: structuring data: aka walk first, grok later
by BrowserUk (Patriarch) on May 30, 2008 at 21:24 UTC

    I've read this post through five times (as I did the previous one) and I get the same feeling each time. You're asking "What's the best data structure?", but until we know how you are going to process the data, that's an impossible question to answer. The right data structure for a given task always depends upon the details of the task.

    In A profiling surprise ... you said:

    The program then crunches over the 'virtual CCDs' looking for anomalous 'hit pileups,' #s of 'hits' that go over a certain predefined threshold and _may_ indicate some sort of technical problem with that detector (but which at least deserve closer scrutiny). It does this by passing a sliding window a certain # of px wide over the virtual CCD, in both the x and y axes, and reporting back the # of hits within the window, comparing the count, then, to the established threshold.

    But that, especially when combined with the sample data above, leaves so many unanswered (and, as yet, unasked) questions about the nature of this data, and the processing you need to perform on it, that it makes me think the replies so far are entirely premature.

    Some questions that I think need answering before any good reply can be made.

    1. What are the datapoints in the sample data? What do they represent?

      For example: As far as I know, CCD means 'charge-coupled device'. These are (basically, very simplified for discussion purposes only) XxY grids of cells that accumulate a charge proportional to the number of photons that hit them during exposure. These (analogue) charges can then be 'read' from the grid, cell by cell, via an n-bit a/d converter to produce an XxY array of values.

      You mention that XxY is 1024x1024 pixels. On a digital device you cannot have partial pixels. So how come your dataset has X:Y pairs like: 896.657564735788 678.83860967799?

      And where are the charge values?

    2. Why are you building a data structure in the first place?

      Might sound like a dumb question, but from your description (and ignoring the decimal pixels anomaly for now), it sounds like you ought to be able to process the data line-by-line to produce the output you need. And thereby avoid having to "build a data structure from the input data" at all.

      If that is the case (and we'd need a far better description of the actual processing required to say for sure), then the more important problem to solve is: what data structure is required to accumulate and present the results of the processing.

      For example (again, ignoring the decimal pixels anomaly), let's assume that each of your data points represents a 'hit' on a particular (integer) pixel of a given detector during a given observation. And the purpose of your task is to count the hits, per pixel, per detector, over the accumulated observations.

      In this case, the data structure needed to accumulate that information is fairly obvious. You need an XxY (1024x1024) array of counts, per device (you say 7 detectors in the earlier post, but show 9 (DET-0 thru DET-8) in the sample data).

      A first pass would suggest an AoAoA (7(or 9) x 1024 x 1024 ), but as the pixel data seems to be quite sparse, you'd probably save space by using hashes instead of arrays for the X & Y components. And as detector IDs are textual, we can save a bit of effort by using a hash at the base level also.

      So that gives us a HoHoH. This structure makes it easy to accumulate the counts:

      my %dets;
      while ( my( $obs, $det, $x, $y ) = getNextRow() ) {
          $dets{ $det }{ int $x }{ int $y }++;
      }

      And that would be pretty much it. You could now perform your 2-dimensional sliding window processing using a few nested loops. But, any example would be pointless, as there is too much speculation in the above as to what you are actually trying to achieve.

    Conclusion: I think you are asking the wrong questions and not supplying enough information for us to guess what answers will really help you. I think you are worrying about how to store the data read in, when you may not need to store the data at all. You probably should be more concerned with how to accumulate and store your results, but it's impossible to make good suggestions on the basis of the limited information about the processing you are trying to do, and the nature of that raw data you are starting with.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      "Might sound like a dumb question, but from your description (and ignoring the decimal pixels anomoly for now), it sounds like you ought to be able to process the data line-by-line to produce the output you need. And thereby avoid having to "build a data structure from the input data" at all."

      And ... yeah. I started to think this today, too. :^| It all could be a case of needless 'complexifying' on my part.

      Understood, and thanks. I think I get frozen in the headlights, get all bunched-up mentally, and then post ... well, semi-incoherently. Then thanks to some Monk-ly patience I will get some advice, walk away, keep hammering, and eventually get there ... and often regret the semi-incoherent posts later.

      As is the case here.

      There are other factors (particulars re: my job), but I'll leave it there for now, and ponder how to be more specific and focused with posts in future. :)

      I'm gonna try to keep this short b/c it seems I get into trouble when I "yammer on".

      In re: the non-integer values for x and y. The full explanation is mathematically forbidding but has to do with the nature of the detecting instrument: coordinate systems are actually transformed somewhat during data processing.

      But to cut it short: in my first version of the program I converted these values to integers anyway (a sanctioned move - I didn't just decide to do that on my own). I muddied the issue here by posting the full precision values in this post.

      In re: what I am trying to do. Essentially I am trying to find places on the detectors where 'hits' represented by the pixel values seem to "bunch up." These 'hits' represent places on the detectors where photons have struck. A number of "hits" in x or y that goes over a predetermined value _may_ indicate something that needs to be looked at more closely (e.g. by human eyes).

      The first version of the program took a list of observation sessions, represented by numbers, as input. For each observation session, it did a database call to find out which of the seven detectors/CCDs were involved.

      THEN, for each detector in that observation, it did a database call to pull in the data for the "hits", populating an array for the x axis and one for the y axis of the detector.

      Then it iterated over those built-up arrays for x and y, kind of doing a histogram in memory (repeat for each detector, then move on to the next observation) ...
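
      To give the flavor (a toy illustration, not my actual code - the values and threshold here are made up):

          my @x_hits    = ( 896, 32, 42, 218, 228, 630, 218, 218 );  # x values for one detector
          my $threshold = 3;                                         # hypothetical cutoff

          my %xcount;
          $xcount{$_}++ for @x_hits;    # bin hits by integer x coordinate

          for my $col ( sort { $a <=> $b } keys %xcount ) {
              print "possible pileup at x = $col ($xcount{$col} hits)\n"
                  if $xcount{$col} >= $threshold;
          }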

      I must emphasize: this approach worked. But it's apparently inefficient, especially in terms of time (total run time: 19 minutes) spent doing db calls. So I figured out how to pull all the data in first. This takes only 2 minutes.

      All the lines of the lump are like this:

      $observation, $detector, $x_coord, $y_coord

      Now I keep getting stuck trying to get the big lump to do what I want:

      ... to give me an array of the x values and an array of the y values for a SINGLE detector in a SINGLE observation. And so on, through the lump, until I am done. I need to examine the DISTRIBUTION of values in x and y axes of each detector, in each observation, individually.

      Maybe I should be satisfied with my 19 minute runtime, and leave the data munging / structures alone until I am more experienced ... ? I don't know.

      Do I need a data structure? I don't know that either. It feels like I do, because without one I don't know how to "address" subsets of the lump of data.

      I hope that's clearer, anyway. I don't know why I am so stuck, and I am sorry I am.

        If I understand you correctly, then each datapoint (O,D,X,Y) represents one photon hitting a pixel (X,Y) of a detector (D) during an observation (O). And that pixel may be struck zero, one or many times during a given observation. If it is hit more than once, then there will be multiple, identical (O,D,X,Y) datapoints in the dataset for that detector/observation pairing?

        Is that correct?


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: structuring data: aka walk first, grok later
by oko1 (Deacon) on May 30, 2008 at 20:49 UTC
    > What do I study so I can better see such things (e.g. how do I get there from here)? Are there general books on data structures, or will this knowledge just come with experience?
    

    From my perspective, that's the most important question you've asked. It's not that hard - but it does take a little work. First, look at the data "per level" (as you already have) and consider the most appropriate structure for it - e.g., if it's a simple ordered list, where you don't really care about easy retrieval for specific, individual items, then use a list(ref). Otherwise, use a hash(ref) - or, if you have a single item, terminate that branch with a scalar. E.g., if you have a list of employees with all their info, consider it this way:

    1) Employees are identified by their unique company ID numbers - so the ID# can be a "pointer" to the rest of the data for a given employee. That sounds like a set of keys and values - i.e., a hash. So, we have:

    %empl = (
        100000 => "data for employee 100000 will go here",
        100001 => "data for employee 100001 will go here",
        ...
    );

    2) The employee's name should be easily retrievable - in fact, all the parts of his name should be - so they need individual labels. So, we have a hashref:

    %empl = (
        '100000' => {
            first_name => 'Joe',
            last_name  => 'Smith',
            ...

    3) The employee might have an address that you'd want to retrieve line by line (say, so you could add individual formatting) - but you don't really care about a specific line all by itself. This argues for either a scalar, like 'Smith' above - or a listref, which would allow you to do that formatting. So:

    %empl = (
        '100000' => {
            first_name => 'Joe',
            last_name  => 'Smith',
            address    => [ '99 Main St.', 'Apartment 1209', 'Anytown, USA 999999' ],
            ...

    4) If you felt that a structure needed a little elaboration in depth - i.e., you wanted to have an entry for all the phone numbers, but wanted to denote what kind of numbers they were - then the key ('phones') would become a pointer to another hashref:

    %empl = (
        '100000' => {
            first_name => 'Joe',
            last_name  => 'Smith',
            address    => [ '99 Main St.', 'Apartment 1209', 'Anytown, USA 999999' ],
            phones     => {
                home => '111-111-1111',
                cell => '111-111-1112',
                fax  => '111-111-1113',
            },
            ...

    There aren't any guarantees, and things can get a little more complex than that - but this is a pretty good approach to designing or figuring out the data structure you need. Play with it for a while - then, go re-read the perldocs. I think you'll find that they make a lot more sense than they did before.
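
    Once the structure is built, getting things back out is just a chain of the same subscripts - a quick illustration with the (made-up) employee above:

        print $empl{'100000'}{first_name};      # 'Joe'
        print $empl{'100000'}{address}[0];      # '99 Main St.'
        print $empl{'100000'}{phones}{cell};    # '111-111-1112'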

    
    -- 
    Human history becomes more and more a race between education and catastrophe. -- HG Wells
    
Re: structuring data: aka walk first, grok later
by pc88mxer (Vicar) on May 30, 2008 at 16:25 UTC
    Perl makes this easy with its autovivification feature:
    my $DATA_X;
    my $DATA_Y;
    while (<>) {
        my ($obs, $det, $x, $y) = split(' ', $_);
        push(@{$DATA_X->{$obs}->{$det}}, $x);
        push(@{$DATA_Y->{$obs}->{$det}}, $y);
    }
    This uses two HoHoA - one for the X coordinate data and another for the Y coordinate data. Or you could store the coordinates in a data structure:
    my $DATA;
    while (<>) {
        my ($obs, $det, $x, $y) = split(' ', $_);
        push(@{$DATA->{$obs}->{$det}}, {x => $x, y => $y});
    }
    And here's how to access the data (using the second approach):
    print $DATA->{21}->{'DET-2'}->[0]->{x};   # prints 896.657564735788
    print $DATA->{47}->{'DET-7'}->[1]->{y};   # prints 519.649226713148
    Whether or not this is the best data structure for your data depends on what you want to do with it.
      I'd probably just make the coordinates into an array (and maybe define constants):
      use constant {
          X => 0,
          Y => 1,
      };

      my %DATA;
      while (<>) {
          my ($obs, $det, $x, $y) = split ' ';
          push(@{$DATA{$obs}{$det}}, [ $x, $y ]);
      }

      print "$DATA{21}{'DET-2'}[0][X]\n";
      Thanks. I recall hitting 'autovivification' in the Alpaca book and coming away babbling of sentient Froot Loops smoking Meerschaum pipes, since I could make neither head nor tail of it ... I think I posted something about this some time ago, here on PM.

      I look at it now and I can kinda wrap my mushy brain around it. I don't think this has to do with the Alpaca book being opaque (I need to go back to it, probably), but is perhaps a measure that even _I_ have learned something ... :)

      In this case $DATA is a reference to an HoHoA, right?

        No, in pc88mxer's node it's a reference to a HoHoAoH. In mine, it's (not a reference to) a HoHoA.
      OK. Using either approach, how do I do something with (e.g. treat as a unit or set) all the $x for a given $det in a given $obs?

      That is, how do I get that particular array back out of the data structure that has been built?

      Because that's how I need to process this data. It's the distribution of points on the axes of individual detectors that is of interest.

      My latest stab is this:

      while (<>) {
          ($obs, $det, $x, $y) = split /\ /;
          push @{ $histx{$obs}{$det} }, $x;
          push @{ $histy{$obs}{$det} }, $y;
      }
      for my $i (0..$#{ $histx{$obs}{$det} }) {
          print "${ $histx{$obs}{$det} }[$i]\n";
      }

      I'm doing this bit of printing simply to see if I am catching all the data ... but this only prints out the stuff for the LAST $obs.

      NOTE: I should never have referenced / included the 'full precision' values in my discussion. The reasons for that are technical (beyond the need for this discussion), but it's entirely valid to just convert those to integers. I am sorry for the confusion that caused.
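
      UPDATE: looping over the keys of the finished hashes, rather than the leftover $obs and $det, seems to get me everything - a minimal sketch:

          for my $obs ( sort { $a <=> $b } keys %histx ) {
              for my $det ( sort keys %{ $histx{$obs} } ) {
                  print "$obs $det x: @{ $histx{$obs}{$det} }\n";
                  print "$obs $det y: @{ $histy{$obs}{$det} }\n";
              }
          }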

Re: structuring data: aka walk first, grok later
by donp (Initiate) on May 30, 2008 at 17:06 UTC

    Hi there,

    "Given that I want to study the distribution of these x and y points on each detector in each observation, what would be the best structure for this data?"

    I'd suggest XML. You could end up with a structure like this:

    <observation id="21">
      <detector id="DET-2">
        <hitX>896.657564735788</hitX>
        <hitY>678.83860967799</hitY>
      </detector>
    </observation>
    <observation id="21">
      <detector id="DET-3">
        <hitX>32.0939023018969</hitX>
        <hitY>621.656550474314</hitY>
      </detector>
    </observation>
    <!-- ... and so on ... -->

    Such a structure can then easily be queried using XPath expressions.

    There are powerful and easy to use Perl modules for such tasks, e.g. XML::Simple.
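
    For instance, a minimal XML::Simple sketch (assuming the <observation> elements are wrapped in one root element, and a made-up file name):

        use XML::Simple qw(XMLin);

        # ForceArray keeps nested elements as arrayrefs; KeyAttr => []
        # stops XML::Simple from folding on the 'id' attributes.
        my $doc = XMLin( 'hits.xml', ForceArray => 1, KeyAttr => [] );
        for my $obs ( @{ $doc->{observation} } ) {
            for my $det ( @{ $obs->{detector} } ) {
                print "$obs->{id} $det->{id}: $det->{hitX}[0], $det->{hitY}[0]\n";
            }
        }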
    See also:
    Writing SAX Drivers for Non-XML Data
    Stepping up from XML::Simple to XML::LibXML
    XPath Tutorial

    donp
