
tye: this isn't usenet. It doesn't have to be "perl specific".

(Careful what you ask for ;))

I've got some data, and I'm having some trouble figuring out how to effectively report on it.

The source data itself:

I have a dataset (let's call the file "dataset.dat".) All numbers and data have been falsified. Here's a slice:

IDa,foo1,100,100,0
IDa,foo2,200,301,101
IDa,foo3,300,300,0
IDa,foo4,400,501,101

IDb,foo1,100,100,0
IDb,foo2,200,301,101
IDb,foo3,300,300,0
IDb,foo4,400,500,100

IDc,foo1,200,200,0
IDc,foo2,200,301,101
IDc,foo3,300,300,0
IDc,foo4,400,500,100

IDd,foo1,900,900,0
IDd,foo2,200,301,101
IDd,foo3,300,301,1
IDd,foo4,400,400,0

What this represents is two giant test analytics runs. The first and second columns represent the test inputs. The third and fourth are the outputs of the baseline run and the "test" run. The fifth is the simple difference between them.

In the real data there are 15,000 foos permuted into 550 IDs. For our purposes the list of Foos is precisely the same between runs (i.e. differences have been slurped out of the file.)

The problem I'm trying to solve:

The first question was: Which Foos show impact between the two runs?

That's trivial:

grep -v ",0$" dataset.dat | awk -F',' '{print $2}' | sort | uniq

foo2
foo3
foo4

That's all well and good.

What is misleading about this result is that, at this level, there's no way to distinguish "foo3", which has a one-shot impact in "IDd", from "foo2", which has an impact everywhere it appears.
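One way to make that distinction visible is to count, per Foo, how many IDs show a nonzero difference out of how many IDs the Foo appears in. A minimal awk sketch, assuming the five-column layout shown in the slice; the function name impact_counts is made up for illustration, and any line that does not have exactly five fields (blank lines, for instance) is ignored:

```shell
# Hypothetical helper: for each Foo, print "<foo> <impacted IDs>/<IDs it appears in>".
# Assumes comma-separated rows of ID,foo,baseline,test,diff; other lines are skipped.
impact_counts() {
    awk -F',' '
        NF == 5 {
            total[$2]++                  # every row is one appearance of the Foo
            if ($5 != 0) hit[$2]++       # a nonzero diff counts as one impact
        }
        END {
            for (f in total)
                printf "%s %d/%d\n", f, hit[f], total[f]
        }
    ' "$@" | sort
}
```

Called as `impact_counts dataset.dat` on the slice above, this should report foo3 as 1/4 and foo2 as 4/4, which is exactly the difference the plain list of names hides.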

So there are two additional dimensions of analysis which are important.

What is the distribution of the number of impacts of Foos across IDs?
What is the distribution of the impacts themselves (by percentage buckets) across Foos?
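
For the second of these questions, one possible reading of "percent impact" is the absolute difference as a percentage of the baseline value, floored into 5% buckets. That definition is an assumption, not something the data dictates, and the helper name impact_buckets is hypothetical. Rows with a zero baseline or a zero difference are skipped in this sketch:

```shell
# Hypothetical helper: print "<foo> <bucket floor in %> <number of IDs>" per Foo and bucket.
# Assumption: percent impact = 100 * |diff| / baseline, floored to a multiple of 5.
impact_buckets() {
    awk -F',' '
        NF == 5 && $3 != 0 && $5 != 0 {
            d = ($5 < 0) ? -$5 : $5          # absolute difference
            b = int(100 * d / $3 / 5) * 5    # floor to a 5% bucket
            n[$2 " " b]++                    # IDs of this Foo landing in this bucket
        }
        END {
            for (k in n) print k, n[k]
        }
    ' "$@" | sort -k1,1 -k2,2n
}
```

Each output line reads as "this Foo had this many IDs in the bucket starting at this percentage", so a Foo that is hit consistently shows one large count while an erratic one spreads over several lines.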

I can see a grid with buckets of percent impact across (say, 20 columns of 5% slices), then percent buckets of "percentages of IDs thusly impacted." But my concern is that it then becomes too abstract to be useful.
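
To judge whether such a grid is too abstract, it may help to build it on a small slice first. The sketch below is only one interpretation of the idea: columns are 5% buckets of percent impact (assumed here to be 100 * |diff| / baseline), rows are 5% buckets of the share of a Foo's IDs that land in that column, and each cell counts Foos. The name impact_grid and the sparse "row column count" output format are illustrative choices.

```shell
# Hypothetical helper: print "<row %> <column %> <number of Foos>" for each non-empty cell.
# Column: 5% bucket of percent impact (assumed 100 * |diff| / baseline).
# Row: 5% bucket of the share of the IDs of a Foo that fall into that column.
impact_grid() {
    awk -F',' '
        NF == 5 {
            ids[$2]++                              # IDs in which this Foo appears
            if ($3 != 0 && $5 != 0) {
                d = ($5 < 0) ? -$5 : $5            # absolute difference
                col = int(100 * d / $3 / 5) * 5    # percent-impact bucket
                n[$2 SUBSEP col]++                 # IDs of this Foo in this bucket
            }
        }
        END {
            for (k in n) {
                split(k, part, SUBSEP)             # part[1] is the Foo, part[2] the column
                row = int(100 * n[k] / ids[part[1]] / 5) * 5
                cell[row " " part[2]]++            # one more Foo in this cell
            }
            for (c in cell) print c, cell[c]
        }
    ' "$@" | sort -k1,1n -k2,2n
}
```

Printing only the non-empty cells keeps the output readable when most of the grid is empty. Impacts of 100% or more fall into columns beyond the 20 described above, so they would need their own overflow handling in a real report.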

I'm just lost in the mire of this stuff.

Any ideas?

EPILOGUE: I did end up going with a heavily permuted version of blue_cowdog's solution (thanks again o/ ) since the data itself isn't really continuous enough for 'clustering' that would be revealed by a graphic solution to make much sense. (Though I'm morally obligated as a nerd to noodle around with roboticus and pvaldes' ideas. Thanks for those too o/.)

What ended up happening is this: Friday night at 5:30 I was working from home, running one more cross-section required for audit verification of the release that was already under way, when I suddenly couldn't find the data. (I had been working in a local workspace and went back to the server for a couple more gigs of data to sift through.)

Turns out, a n00b in Houston decided that the errors he was getting while running his reports were due to disc space. So he, without so much as a peep, deleted everything... just nuked the whole tree.

The upshot of this is I actually have to start from scratch.

*twitch*

In reply to Bucketing,Slicing and Reporting data across multiple dimensions by Voronich
