Clustering/classifying recommendations

f77coder has asked for the wisdom of the Perl Monks concerning the following question:

Hello

I'm interested in recommendations for clustering modules, prioritizing speed over being lightweight/small. So I'd prefer loops over one-liners if the loop executes faster. I'm currently looking through the large list of CPAN modules (AI, Bayes, Cluster, etc.) and would like to narrow down the search. I don't mind getting the source code and hacking it if it doesn't quite match what I need, rather than expecting something to work as is.

The input data is a mixture of integers and strings, all of it categorical. I'd like to treat each data line as an array and do vector processing on it. Think of it as a 1D image processing problem: how many pixels are different?

For example,

line1=> cat1=123, cat2=92, cat3=5, cat4='0xffa411', cat5='0x221133', cat6='0xa291f1'

line2=> cat1=3, cat2=92, cat3=5, cat4='0xaf1401', cat5='0xaaffcc', cat6='0xa23af1'

I'd like to create a distance measurement based only on the number of categories that are different. In this case the distance map would be (cat2,cat3,4): cat2 and cat3 match and the other four categories differ, giving a distance of 4. There will probably be a weighting function applied to this metric as well.
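
The mismatch count described above is essentially a Hamming distance over categorical fields. Below is a minimal sketch of it, written in Python purely for illustration (the same loop ports directly to Perl hashes). It assumes each line has already been parsed into a dict with the same keys. The names `mismatch_distance` and `weights` are hypothetical and do not come from any CPAN module.

```python
def mismatch_distance(a, b, weights=None):
    """Compare two records that share the same category names.

    Returns (matching, distance): the list of categories whose values
    are equal, and the sum of weights over the categories that differ.
    A category with no entry in `weights` counts as 1.0 (assumption).
    """
    weights = weights or {}
    matching = []
    distance = 0.0
    for key in a:
        if a[key] == b[key]:
            matching.append(key)
        else:
            distance += weights.get(key, 1.0)
    return matching, distance


# The two example lines from the question.
line1 = {"cat1": 123, "cat2": 92, "cat3": 5,
         "cat4": "0xffa411", "cat5": "0x221133", "cat6": "0xa291f1"}
line2 = {"cat1": 3, "cat2": 92, "cat3": 5,
         "cat4": "0xaf1401", "cat5": "0xaaffcc", "cat6": "0xa23af1"}

matching, distance = mismatch_distance(line1, line2)
print(matching, distance)
```

Passing something like `weights={"cat1": 2.0}` would make a mismatch in cat1 count double, which is one possible form the weighting function could take.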

Once training is complete, I'd like to make a prediction for each new line using the resulting classifier/clusters.
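
One simple way to get a prediction from a mismatch-count distance is a k-nearest-neighbour vote over labelled training lines. This is a technique swapped in for illustration, not something the question prescribes. The sketch below is in Python (it ports readily to Perl) and assumes each line is a fixed-length tuple of category values. The labels "A" and "B", the extra training rows, and the names `hamming` and `predict` are all made up for the example.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length records differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def predict(training, new_line, k=3):
    """Majority label among the k training records closest to new_line.

    `training` is a list of (record, label) pairs. Ties in distance keep
    their original order because Python's sort is stable.
    """
    ranked = sorted(training, key=lambda item: hamming(item[0], new_line))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labelled data; only the first four categories are used.
training = [
    ((123, 92, 5, "0xffa411"), "A"),
    ((3, 92, 5, "0xaf1401"), "B"),
    ((123, 92, 7, "0xffa411"), "A"),
    ((3, 90, 5, "0xaf1401"), "B"),
]

result = predict(training, (123, 92, 5, "0xaf1401"), k=3)
print(result)
```

A brute-force scan like this costs one distance computation per training line for every prediction, so for large training sets an index or a clustering step first (for example, grouping lines around representative modes) may be worth considering.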

Thanks

Re: Clustering/classifying recommendations by Laurent_R (Canon) on Aug 19, 2014 at 20:24 UTC
Hmm, your requirement is not very clear to me (and probably to other monks as well, judging from the answers you've got so far), but if you want to compare lists, it seems to me that the List::Util and List::MoreUtils CPAN modules might be the first place to go.
Re^2: Clustering/classifying recommendations by f77coder (Beadle) on Aug 20, 2014 at 03:51 UTC
Essentially, think of each input line as a point in an N-dimensional space. I want to classify/cluster these points and need a metric to measure the separation/distance between them. One point = (category 1, category 2, category 3, ..., category N). BioPerl tied with Bayes might do the trick.