I'm hoping not to get flogged here, but I wanted to post this question tonight so that I might have a starting point when I come back in tomorrow.
I'm an absolute newb when it comes to programming, but I think Perl will be a good fit for what I'm trying to do, and I have the books, so I'm hoping to fumble through this.

I'm trying to load data from two files. The first file has the category ID for each item I am interested in (e.g. item.1.ext = square, item.2.ext = circle, etc.).
The second file contains all the attributes for these items. Each attribute is a binary yes/no flag, represented by 1 or 0. My files can have a couple hundred items, with a million attributes for each item.
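One possible starting point for the first file, as a minimal sketch: read it into a hash that maps each item to its category, and count the items per category along the way. The sketch assumes the file is tab-separated with a single header row, as in the sample data; the names `%category_of` and `%items_in` are made up for illustration, and the data is read from an in-memory string only so the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the ID file (tab-separated, one header row).
my $id_data = join "\n",
    "File\tID",
    "1.file.ext\tSquare",
    "2.file.ext\tTriangle",
    "3.file.ext\tCircle",
    "4.file.ext\tSquare",
    "";

# Read from an in-memory filehandle so the sketch is self-contained;
# a real script would open the file by name instead.
open my $id_fh, '<', \$id_data or die "Cannot open ID data: $!";

my %category_of;    # item (file name) => category
my %items_in;       # category         => number of items in it

my $header = <$id_fh>;    # skip the "File / ID" header row
while ( my $line = <$id_fh> ) {
    chomp $line;
    next unless length $line;    # ignore blank lines
    my ( $file, $category ) = split /\t/, $line;
    $category_of{$file} = $category;
    $items_in{$category}++;
}
close $id_fh;

# Report how many items fall into each category.
printf "%-10s %d\n", $_, $items_in{$_} for sort keys %items_in;
```

With only a couple hundred items this hash is tiny, so it can stay in memory while the much larger attribute file is processed.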

What I am looking to do is find a good way to process the attributes by category and score them. I was thinking that I would read in the files and build a count for each category. Then, for each attribute, I would divide the number of times it is present in a category by the number of items in that category, and use those ratios as scoring criteria. (For example: which attributes occur in one category more than 75% of the time, but less than 25% of the time in every other category? Basically, I'm looking for category-unique attributes.)
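The scoring idea described above could be sketched roughly as follows. This is only one way to do it, under stated assumptions: the attribute file is tab-separated with the item names in the header row, the thresholds are the 75% and 25% from the example, and the data below is just the first six columns of a few rows from the sample. The variable names (`@col_cat`, `%unique_to`, and so on) are invented for the sketch.

```perl
use strict;
use warnings;

# Assumed thresholds, taken from the 75% / 25% example in the post.
my ( $MIN_IN, $MAX_OUT ) = ( 0.75, 0.25 );

# Item => category, as it would be loaded from the ID file.
my %category_of = (
    '1.file.ext' => 'Square',   '2.file.ext' => 'Triangle',
    '3.file.ext' => 'Circle',   '4.file.ext' => 'Square',
    '5.file.ext' => 'Triangle', '6.file.ext' => 'Circle',
);

# First six columns of sample rows 1, 5, 13, 16 and 19 (tab-separated).
my $attr_data = join "\n",
    join( "\t", 'attribute',
        qw(1.file.ext 2.file.ext 3.file.ext 4.file.ext 5.file.ext 6.file.ext) ),
    join( "\t", 1,  qw(1 0 1 1 0 1) ),
    join( "\t", 5,  qw(0 1 0 0 1 0) ),
    join( "\t", 13, qw(0 0 1 0 0 1) ),
    join( "\t", 16, qw(1 0 0 1 0 0) ),
    join( "\t", 19, qw(1 1 1 1 1 1) ),
    "";

open my $fh, '<', \$attr_data or die "Cannot open attribute data: $!";

# Resolve every column to its category once, up front.
chomp( my $header = <$fh> );
my ( undef, @files ) = split /\t/, $header;
my @col_cat = map { $category_of{$_} // die "No category for $_\n" } @files;

my %items_in;
$items_in{$_}++ for @col_cat;
my @categories = sort keys %items_in;

my %unique_to;    # attribute => the category it is unique to
while ( my $line = <$fh> ) {    # one attribute per line, never the whole file
    chomp $line;
    next unless length $line;
    my ( $attr, @bits ) = split /\t/, $line;

    # Count how many items in each category have this attribute set.
    my %hits;
    for my $i ( 0 .. $#bits ) {
        $hits{ $col_cat[$i] }++ if $bits[$i];
    }

    # Fraction of each category's items that carry the attribute.
    my %frac = map { $_ => ( $hits{$_} // 0 ) / $items_in{$_} } @categories;

    my @high = grep { $frac{$_} > $MIN_IN } @categories;
    my @low  = grep { $frac{$_} < $MAX_OUT } @categories;

    # Unique: exactly one category above 75%, every other one below 25%.
    $unique_to{$attr} = $high[0]
        if @high == 1 && @low == @categories - 1;
}
close $fh;

print "attribute $_ is unique to $unique_to{$_}\n"
    for sort { $a <=> $b } keys %unique_to;
```

On these five truncated rows the sketch reports attributes 5, 13 and 16 as unique to Triangle, Circle and Square respectively. Attributes 1 and 19 are rejected because more than one category is above the upper threshold.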

But as I've learned with Perl, there are 7000 different ways to skin a cat, so I'm open to any suggestions. I'm trying to make this a fairly quick process because it will be repeated OFTEN. (Datasets will be ~200 items, 4-10 categories, and 1 million attributes.)
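Since speed matters here, one of those many ways is to treat each attribute row as a bit vector. The sketch below is a suggestion, not a benchmarked answer: it packs a row's 0/1 flags with `pack 'b*'`, ANDs the result against one precomputed mask per category, and counts the surviving bits with the `unpack '%32b*'` checksum idiom from the Perl documentation. At 200 items a packed row takes 25 bytes. The helper `count_by_category` and the six-column layout are hypothetical, and the code assumes the `bitwise` feature is not enabled, so `&` on two strings is a string-wise AND.

```perl
use strict;
use warnings;

# Category of each column in the attribute file, in column order
# (assumed to be derived from the header row and the ID file).
my @col_cat = qw(Square Triangle Circle Square Triangle Circle);

# Build one bit mask per category: bit i is set when column i belongs to it.
my ( %mask, %items_in );
for my $cat (@col_cat) { $items_in{$cat}++ }
for my $cat ( keys %items_in ) {
    $mask{$cat} = pack 'b*', join '', map { $_ eq $cat ? 1 : 0 } @col_cat;
}

# Hypothetical helper: given a row as a string of 0/1 flags (e.g. "010010"),
# return category => number of items in that category with the flag set.
sub count_by_category {
    my ($row_bits) = @_;
    my $packed = pack 'b*', $row_bits;
    # String-wise AND against each mask, then count the set bits.
    return map { $_ => unpack( '%32b*', $packed & $mask{$_} ) } keys %mask;
}

# Usage on one tab-separated line (sample attribute 5, first six columns).
my $line = join "\t", 5, qw(0 1 0 0 1 0);
my ( $attr, $flags ) = split /\t/, $line, 2;
$flags =~ tr/01//cd;    # drop the separators, keep only the 0/1 flags

my %count = count_by_category($flags);
for my $cat ( sort keys %count ) {
    printf "attribute %s, %-8s %d/%d\n",
        $attr, $cat, $count{$cat}, $items_in{$cat};
}
```

The masks are built once per dataset, so the per-row work is one `pack` plus one AND and one bit count per category, with no per-column Perl loop. Whether that actually beats a plain `split` and loop on a million rows would need to be measured.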

Data example below.

#ID's
File    ID
1.file.ext    Square
2.file.ext    Triangle
3.file.ext    Circle
4.file.ext    Square
5.file.ext    Triangle
6.file.ext    Circle
7.file.ext    Circle
8.file.ext    Rectangle
9.file.ext    Rectangle
10.file.ext    Circle
11.file.ext    Triangle
12.file.ext    Triangle
13.file.ext    Square
14.file.ext    Rectangle
15.file.ext    Rectangle
16.file.et    Square

#Attributes
attribute    1.file.ext    2.file.ext    3.file.ext    4.file.ext    5.file.ext    6.file.ext    7.file.ext    8.file.ext    9.file.ext    10.file.ext    11.file.ext    12.file.ext    13.file.ext    14.file.ext    15.file.ext    16.file.et
1    1    0    1    1    0    1    1    1    1    1    0    0    1    1    1    1
2    1    0    1    1    0    1    1    0    0    1    0    0    1    0    0    1
3    0    1    0    0    1    0    0    1    1    0    1    1    0    1    1    0
4    0    1    1    0    1    1    1    1    1    1    1    1    0    1    1    0
5    0    1    0    0    1    0    0    0    0    0    1    1    0    0    0    0
6    0    0    0    0    0    0    0    1    1    0    0    0    0    1    1    0
7    0    0    1    0    0    1    1    1    1    1    0    0    0    1    1    0
8    1    0    1    1    0    1    1    1    1    1    0    0    1    1    1    1
9    0    0    0    0    0    0    0    1    1    0    0    0    0    1    1    0
10    0    1    0    0    1    0    0    0    0    0    1    1    0    0    0    0
11    0    1    0    0    1    0    0    1    1    0    1    1    0    1    1    0
12    1    1    1    1    1    1    1    0    0    1    1    1    1    0    0    1
13    0    0    1    0    0    1    1    0    0    1    0    0    0    0    0    0
14    0    0    1    0    0    1    1    1    1    1    0    0    0    1    1    0
15    0    0    1    0    0    1    1    0    0    1    0    0    0    0    0    0
16    1    0    0    1    0    0    0    0    0    0    0    0    1    0    0    1
17    1    0    0    1    0    0    0    0    0    0    0    0    1    0    0    1
18    0    0    1    0    0    1    1    0    0    1    0    0    0    0    0    0
19    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
20    0    1    1    0    1    1    1    1    1    1    1    1    0    1    1    0
21    0    0    0    0    0    0    0    1    1    0    0    0    0    1    1    0
22    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
23    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
24    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
25    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
26    1    1    1    1    1    1    1    0    0    1    1    1    1    0    0    1
27    0    1    0    0    1    0    0    0    0    0    1    1    0    0    0    0
28    0    0    0    1    0    0    0    1    1    0    0    0    1    1    1    1
29    0    0    0    0    0    0    0    1    1    0    0    0    0    1    1    0
30    0    0    0    1    0    0    0    1    1    0    0    0    1    1    1    1

In reply to Best way to store/access large dataset? by Speed_Freak