
Re: gsoc, text search selectivity and dllist enhancments

by j.urbanski@[EMAIL PROTECTED] (Jan Urbański), Jul 9, 2008 at 12:33 AM

Jan Urbański wrote:
> If you think the Lossy Counting method has potential, I could test it
> somehow. Using my current work I could extract a stream of lexemes as
> ANALYZE sees it and run it through a python implementation of the
> algorithm to see if the result makes sense.

I hacked together a simplistic Python implementation and ran it on a
table with 244901 tsvectors, 45624891 lexemes total. I compared the
results from my current approach with the results I'd get from a Lossy
Counting algorithm.
I experimented with statistics_target set to 10 and 100, and ran pruning
in the LC algorithm every 3, 10 or 100 tsvectors.
The sample size with statistics_target set to 100 was 30000 rows, and
that's what the input to the script was: the lexemes from these 30000
tsvectors.
I found that with pruning happening every 10 tsvectors I got
precisely the same results as with the original algorithm (same most
common lexemes, same frequencies). When I tried pruning after every 100
tsvectors the results changed very slightly (they were a tiny bit more
distant from the ones from the original algorithm, and I think a tiny
bit more precise, but I didn't give it much attention).
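
For readers who want to experiment, here is a minimal sketch of Lossy
Counting with pruning at tsvector boundaries. It is not the script used
for the numbers above. The names lossy_count and prune_every are
hypothetical, and measuring the bucket width in tsvectors (rather than
in individual lexemes) is an assumption based on the "pruning every N
tsvectors" description.

```python
def lossy_count(tsvectors, prune_every):
    """Approximate lexeme counts with Lossy Counting-style pruning.

    tsvectors: iterable of iterables of lexemes, one per sampled row.
    prune_every: number of tsvectors per bucket (hypothetical parameter).
    Returns {lexeme: (count, delta)}. The true number of occurrences
    lies between count and count + delta.
    """
    entries = {}  # lexeme -> [count, delta]
    bucket = 1    # id of the bucket currently being filled
    for n, tsvector in enumerate(tsvectors, start=1):
        for lexeme in tsvector:
            if lexeme in entries:
                entries[lexeme][0] += 1
            else:
                # delta bounds the occurrences that may have been
                # discarded by earlier pruning passes
                entries[lexeme] = [1, bucket - 1]
        if n % prune_every == 0:
            # bucket boundary: drop entries too rare to matter
            entries = {lex: e for lex, e in entries.items()
                       if e[0] + e[1] > bucket}
            bucket += 1
    return {lex: (count, delta) for lex, (count, delta) in entries.items()}
```

A smaller prune_every keeps the tracking table smaller but prunes more
aggressively. Entries that survive with delta equal to 0 were never
pruned, so their counts are exact.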

Bottom line seems to be: the Lossy Counting algorithm gives roughly the
same results as the algorithm used currently and is also possibly faster
(and more scalable wrt. statistics_target).
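
One way such a comparison could be automated is sketched below. The
helper compare_top and its inputs (two plain dicts mapping lexeme to
count) are hypothetical. It reports how many of the k most common
lexemes the two methods agree on and the largest count difference among
those shared lexemes.

```python
from collections import Counter


def compare_top(exact, approx, k):
    """Compare the k most common lexemes of two count dictionaries.

    Returns (number of shared top-k lexemes, largest absolute count
    difference among the shared lexemes).
    """
    top_exact = {lex for lex, _ in Counter(exact).most_common(k)}
    top_approx = {lex for lex, _ in Counter(approx).most_common(k)}
    shared = top_exact & top_approx
    max_diff = max((abs(exact[lex] - approx[lex]) for lex in shared),
                   default=0)
    return len(shared), max_diff
```

If the two methods agree completely, the first value equals k and the
second is 0.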

This should probably get more testing than just running some script 5
times over a fixed set of data, but I already had trouble sucking ~300
MB of tsvectors from one of my production sites, putting it on my laptop
and so on.
Do you think it's worthwhile to implement the LC algorithm in C and send
it out, so others could try it out? Heck, maybe it's worthwhile to
replace the current compute_minimal_stats() algorithm with LC and see
how that compares?

Anyway, I can share the Python script if someone would like to do some
more tests (I suppose no-one would, 'cause you first need to apply my
ts_typanalyze patch and then change it some more to extract lexemes from
the sample).

Cheers,
Jan

-- 
Jan Urbanski
GPG key ID: E583D7D2

ouden estin



