artist has asked for the wisdom of the Perl Monks concerning the following question:

I have a database table 'Sales', with 2 fields Product and Customer. It shows the product purchased by customers. I like to find subsets of this table. Each subset has Product and Customers. Each customer must have purchased all the product in a given subset. It's very trivial to create subsets (Product X Customer) like 1x1 or NPx1 or 1 x NC. (NP = Number of Products, NC = Number of Customers) I am looking forward to find out subsets where NP and NC have higher values. Let's say top 10 values for NP in NP x 2 or NP x 3, or even NP x NC ie.. top 10 values for NP for each of top 10 values of NC.

Pointers, directions, code :-) ?
Thanks..

Update: I am not necessarily looking for SQL solution.

Replies are listed 'Best First'.
Re: Popular Products
by bart (Canon) on Mar 09, 2005 at 20:11 UTC
    So you want to find popular combinations of products?

    I'd try the following approach, and see where it leads you. Make a join from this table to itself, linked by customer. Require that the id for the product in the second copy of the table is higher than the one in the first copy. That way you'll eliminate links of products to themselves, as well as removing duplicates a->b and b->a. See what products are still very popular, and what combinations come out a lot.

    From there, it's not to hard to identify those customers that bought both (or all three...) products.

    Roughly tested SQL:

    SELECT a.product as product_a, a.customer, b.product as product_b FROM Sales a INNER JOIN Sales b ON a.Customer = b.Customer WHERE a.product < b.product
    and this variation:
    SELECT a.product as product_a, b.product as product_b, count(*) as pop +ularity FROM Sales a INNER JOIN Sales b ON a.Customer = b.Customer WHERE a.pro +duct < b.product GROUP BY a.product, b.product
Re: Popular Products
by osunderdog (Deacon) on Mar 09, 2005 at 19:34 UTC

    I would probably approach this with Graph::Base, more specifically, Graph::Undirected. Create a vertex for each customer and product. Add Edges between customer and products that they use. Then you could determine the favorite products by looking at the degree for each product node.

    On the other hand, you could probably get what you want in the database world by using a 'group by' clause.

    Just some thoughts.


    "Look, Shiny Things!" is not a better business strategy than compatibility and reuse.