Talk About Network

Google


Register and Login
Nick
Password
Register create new account Sign up is FREE and you can post replies, new topics, bookmark posts and more!
Recover lost password


Data Bases > Pgsql Sql > Re: Removing re...
Latest [ Topics | Posts ] Archive Post A New Topic Post a Reply
<< Topic < Post Post 3 of 5 Topic 3375 of 3799
Post > Topic >>

Re: Removing redundant itemsets

by craig@[EMAIL PROTECTED] (Craig Ringer) Mar 31, 2008 at 07:29 PM

Craig Ringer wrote:
> Allan Kamau wrote:
>> Hi all,
>> I have a list of purchases (market basket) and I would like to select
>> non redundant longest possible patterns by eliminating
>> (creating/populating other table to contain only non redandant
itemsets)
>> purchases having item lists which are fully included in at least one
>> other purchase.
> 
> Here's a possibly slow and surely ugly solution (I think it's right,
> though I haven't done more than passing testing):
> 
> 
> 
> CREATE VIEW togo_as_arr AS
>   SELECT a.tid,
>     ARRAY(SELECT item FROM togo b WHERE b.tid = a.tid ORDER BY item)
>     AS items
>   FROM togo a GROUP BY tid;
> 
> SELECT arr_a.tid AS redundant_tid, arr_b.tid AS contained_by
> FROM togo_as_arr arr_a CROSS JOIN togo_as_arr arr_b
> WHERE arr_a.tid <> arr_b.tid AND arr_a.items <@[EMAIL PROTECTED]
 arr_b.items;

Alternately:

-- Helps with the massively repeated subquery below
CREATE INDEX togo_by_tid_and_item ON togo(tid,item);

-- Find any `a' for which `item_from_a_is_in_b' is
-- true for all items in `a'
SELECT a_tid AS is_redundant, b_tid AS contained_by
FROM (
  -- For every item in every pair of purchases,
  -- determine whether the item in purchase `a'
  -- was also in purchase `b'.
  SELECT
    a.tid AS a_tid,
    b.tid AS b_tid,
    a.item AS item,
    EXISTS(
      -- Was this item from `a' also in the `b' purchase?
      SELECT 1 FROM togo x WHERE x.tid = b.tid AND x.item = a.item
    ) AS item_from_a_is_in_b
  FROM togo a INNER JOIN togo b ON (a.tid <> b.tid)
  GROUP BY a.tid, b.tid, a.item) AS item_containment
GROUP BY a_tid, b_tid
HAVING every(item_from_a_is_in_b);


.... which avoids the array building, but is actually considerably slower
on the trivial test data. That's not too surprising given that this
approach requires a subquery:

   SELECT 1 FROM togo x WHERE x.tid = b.tid AND x.item = a.item

for EVERY item to be tested. Twice, actually, as each record appears in
both `a' and `b' positions.

I'd be very interested to see what happened on real world test data,
especially compared to doing the array ac***ulation based query off a
tem****ary table instead of a view.

I suspect it'll depend on the average number of items per purchase -
lots of items per purchase and the array building cost will dominate.
That's really just a guess, though.

I'm sure there's a properly smart way to do this that I just can't
figure out, but this is the best I've come up with so far.

--
Craig Ringer

-- 
Sent via pgsql-sql mailing list (pgsql-sql@[EMAIL PROTECTED]
)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql
 




 5 Posts in Topic:
Removing redundant itemsets
allank@[EMAIL PROTECTED]   2008-03-31 10:16:17 
Re: Removing redundant itemsets
craig@[EMAIL PROTECTED]   2008-03-31 18:53:28 
Re: Removing redundant itemsets
craig@[EMAIL PROTECTED]   2008-03-31 19:29:22 
Re: Removing redundant itemsets
craig@[EMAIL PROTECTED]   2008-03-31 19:48:34 
Re: Removing redundant itemsets
allank@[EMAIL PROTECTED]   2008-03-31 11:58:50 

Post A Reply:
  Go here to Signup

AddThis Feed Button


About - Advertising - Contact - Frequently Asked Questions - Privacy Policy - Terms of Use - Signup

Contact
tan12V112 Mon Dec 1 21:21:05 CST 2008.