Re: gsoc, text search selectivity and dllist enhancments

by tgl@[EMAIL PROTECTED] (Tom Lane) Jul 7, 2008 at 11:58 AM

Jan Urbański <j.urbanski@[EMAIL PROTECTED]> writes:
> Once again: the whole algorithm is a ripoff from analyze.c, with the 
> dllist instead of an array because you don't know how big tracking list 
> will you need and with a hashtable, because the tracking list will 
> probably be large and looking up things linearly wouldn't be fun.

Hmmm ... I had forgotten how compute_minimal_stats() works, and looking
at it now I'm just itching to rewrite it.  You have to realize that that
code was written with the assumptions that (1) it was only going to be
used for a few weird second-class-citizen data types, and (2) the stats
target would typically be around 10 or so.  It really wasn't designed to
be industrial-strength code.  (It was still a big improvement over what
we'd had before :-(.)  So I'm not very comfortable with taking it as the
design base for something that does need to be industrial-strength.

Your point about not wanting the most-recently-added lexeme to drop off
first is a good one, but the approach is awfully fragile.  Consider what
happens if (or when) all the lexemes in the list reach count 2 or more.
All other lexemes seen later will fight over the last list slot, and
have no chance of getting higher in the list; so the algorithm will
completely fail to adapt to any changes in the input statistics that
occur after that point is reached.  I'm thinking that the idea of
periodically cutting off a bunch of low-scoring items, instead of trying
to do it "exactly", would actually be better because it'd still have a
chance of tracking new data even when the counts get larger.
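
To make the "periodically cut off the low scorers" idea concrete, here is a minimal sketch. It uses the Lossy Counting scheme (Manku and Motwani) as one well-known way to do this kind of pruning; that choice, the function name lossy_count, and the bucket_width parameter are assumptions for illustration, not something proposed in this thread or taken from analyze.c.

```python
def lossy_count(stream, bucket_width):
    """Approximate per-item counts, pruning low-count entries once per bucket.

    Hypothetical illustration: bucket_width (how often to prune) is an
    assumed tuning knob, not a value taken from PostgreSQL.
    """
    if bucket_width <= 0:
        raise ValueError("bucket_width must be positive")

    counts = {}   # item -> [frequency seen since tracked, max possible undercount]
    bucket = 1    # index of the bucket currently being filled

    for n, item in enumerate(stream, start=1):
        entry = counts.get(item)
        if entry is not None:
            entry[0] += 1
        else:
            # A new item may have been seen and pruned in earlier buckets,
            # so record how much its true count could exceed what we track.
            counts[item] = [1, bucket - 1]

        if n % bucket_width == 0:
            # Periodic cut: drop every entry that cannot exceed the current
            # bucket index, instead of evicting exactly one entry per insert.
            doomed = [k for k, (f, d) in counts.items() if f + d <= bucket]
            for k in doomed:
                del counts[k]
            bucket += 1

    return {k: f for k, (f, d) in counts.items()}


# A lexeme that only becomes common late in the input is still picked up:
print(lossy_count(["a"] * 4 + ["z"] * 4, bucket_width=4))
```

Because every newcomer gets a whole bucket's worth of input to accumulate counts before the next cut, late-arriving frequent lexemes are not stuck fighting over a single slot.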

I don't recommend doing two passes over the data because it's fairly
likely that tsvectors would be large enough to be toasted, which'd make
fetching them expensive.  One idea is to start out with some reasonable
estimate of the max lexemes per tsvector (say a couple hundred) and
realloc the list bigger in sizable jumps (say 2X) when the estimate is
seen to be exceeded.  Another possibility is to have a prescan that only
determines the width of the physically largest tsvector (you can get the
width without detoasting, so this would be quite cheap), and then
estimate the number of lexemes corresponding to that using some
fudge-factor guess about lexeme sizes, and then stick with the resulting
list size regardless of what the actual lexeme counts are.  I kinda like
the latter since its behavior is order-independent.
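
The two sizing strategies can be sketched as follows. This is only the arithmetic, under stated assumptions: grow_capacity, fixed_capacity, and bytes_per_lexeme are hypothetical names, the value 16 stands in for the "fudge-factor guess", and the list of stored widths stands in for whatever the prescan would collect without detoasting.

```python
def grow_capacity(capacity, needed, factor=2):
    """Option 1: start from an estimate and enlarge in big jumps on demand."""
    if capacity <= 0 or factor <= 1:
        raise ValueError("capacity must be positive and factor greater than 1")
    while capacity < needed:
        capacity *= factor
    return capacity


def fixed_capacity(stored_widths, bytes_per_lexeme):
    """Option 2: size the list once from the physically widest value.

    bytes_per_lexeme is an assumed fudge factor, not a measured constant.
    """
    if bytes_per_lexeme <= 0:
        raise ValueError("bytes_per_lexeme must be positive")
    widest = max(stored_widths, default=0)
    return max(1, widest // bytes_per_lexeme)


# Option 1 depends on which sizes show up, and when:
print(grow_capacity(200, 450))

# Option 2 gives the same answer for any ordering of the same rows:
print(fixed_capacity([120, 4096, 900], bytes_per_lexeme=16))
print(fixed_capacity([900, 120, 4096], bytes_per_lexeme=16))
```

The second function only ever looks at the maximum, which is what makes its result independent of the order in which rows are sampled.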

			regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@[EMAIL PROTECTED])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
 



