Re: [GENERAL] Fragments in tsearch2 headline

by sushant354@[EMAIL PROTECTED] (Sushant Sinha) Jun 3, 2008 at 08:14 PM

My main argument for using Cover instead of hlCover was that Cover will
be faster. I tested the default headline generation that uses hlCover
with the current patch that uses Cover. There was not much difference.
So I think you are right in that we do not need norms and we can just
use hlCover.
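
For readers who have not looked at the ranking code, here is a rough sketch of what finding covers over position lists alone might look like. It is written in Python for brevity and is not the tsrank C code: the function name minimal_covers and the dict-of-positions layout are assumptions made for illustration, and it treats the query as a plain AND of its words.

```python
from collections import Counter


def minimal_covers(positions):
    """Return every minimal (start, end) window that contains all terms.

    positions maps each query term to a sorted list of word positions.
    Hypothetical layout, chosen only for this sketch.
    """
    # Merge all (position, term) pairs into document order.
    events = sorted((p, t) for t, ps in positions.items() for p in ps)
    need = len(positions)
    counts = Counter()
    have = 0
    left = 0
    covers = []
    for pos, term in events:
        counts[term] += 1
        if counts[term] == 1:
            have += 1
        if have < need:
            continue
        # Shrink from the left while the window still holds every term.
        while counts[events[left][1]] > 1:
            counts[events[left][1]] -= 1
            left += 1
        covers.append((events[left][0], pos))
        # Drop the leftmost term so the next cover starts further right.
        counts[events[left][1]] -= 1
        have -= 1
        left += 1
    return covers
```

Only the positions of query words are visited here, which is the reason I expected Cover to be cheaper than a scan over every word of the text.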

I also compared the performance of ts_headline with my first patch for
headline generation (the one that was a separate function and took a
tsvector as input). The difference was dramatic, roughly a factor of
three. For one query ts_headline took roughly 200 ms while
headline_with_fragments took just 70 ms. On another query ts_headline
took 76 ms while headline_with_fragments took 24 ms. You can find
'explain analyze' for the first query at the end of this message.

These queries were run multiple times to ensure that I never hit the
disk. This is a machine with a 2.0 GHz Pentium 4 CPU and 512 MB RAM
running Linux 2.6.22-gentoo-r8.

A couple of caveats: 

1. ts_headline testing was done with current CVS HEAD, whereas
headline_with_fragments was tested with PostgreSQL 8.3.1.

2. For headline_with_fragments, the tsvector for the document was
obtained by joining with another table.

Are these differences expected?

If you think these caveats explain them, or that there is something I am
missing, then I can repeat the experiments under exactly the same
conditions.

-Sushant.


Here is 'explain analyze' for both functions:


ts_headline
------------

lawdb=# explain analyze SELECT ts_headline('english', doc, q, '')
            FROM    docraw, plainto_tsquery('english', 'freedom of speech') as q
            WHERE   docraw.tid = 125596;
                             QUERY PLAN
----------------------
 Nested Loop  (cost=0.00..8.31 rows=1 width=497) (actual time=199.692..200.207 rows=1 loops=1)
   ->  Index Scan using docraw_pkey on docraw  (cost=0.00..8.29 rows=1 width=465) (actual time=0.041..0.065 rows=1 loops=1)
         Index Cond: (tid = 125596)
   ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32) (actual time=0.010..0.014 rows=1 loops=1)
 Total runtime: 200.311 ms


headline_with_fragments
-----------------------

lawdb=# explain analyze SELECT headline_with_fragments('english', docvector, doc, q, 'MaxWords=40')
            FROM    docraw, docmeta, plainto_tsquery('english', 'freedom of speech') as q
            WHERE   docraw.tid = 125596 and docmeta.tid=125596;
                             QUERY PLAN
----------------------
 Nested Loop  (cost=0.00..16.61 rows=1 width=883) (actual time=70.564..70.949 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..16.59 rows=1 width=851) (actual time=0.064..0.094 rows=1 loops=1)
         ->  Index Scan using docraw_pkey on docraw  (cost=0.00..8.29 rows=1 width=454) (actual time=0.040..0.044 rows=1 loops=1)
               Index Cond: (tid = 125596)
         ->  Index Scan using docmeta_pkey on docmeta  (cost=0.00..8.29 rows=1 width=397) (actual time=0.017..0.040 rows=1 loops=1)
               Index Cond: (docmeta.tid = 125596)
   ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32) (actual time=0.012..0.016 rows=1 loops=1)
 Total runtime: 71.076 ms
(8 rows)


On Tue, 2008-06-03 at 22:53 +0400, Teodor Sigaev wrote:
> > Why we need norms?
> 
> We don't need norms at all - all matched HeadlineWordEntry already marked by
> HeadlineWordEntry->item! If it equals to NULL then this word isn't contained in
> tsquery.
> 
> > hlCover does the exact thing that Cover in tsrank does which is to find
> > the cover that contains the query. However hlCover has to go through
> > words that do not match the query. Cover on the other hand operates on
> > position indexes for just the query words and so it should be faster.
> Cover, by definition, is a minimal continuous piece of text matched by the
> query. There may be several covers in a text and hlCover will find all of them.
> Next, prsd_headline() (for now) tries to pick the best one. "Best" means: the
> cover contains a lot of words from the query, not fewer than MinWords, not more
> than MaxWords, has no words shorter than ShortWord at its beginning and end, etc.
> > 
> > The main reason why I would like it to be fast is that I want to
> > generate all covers for a given query. Then choose covers with smallest
> hlCover generates all covers.
> 
> > Let me know what you think of this patch and I will update the patch to
> > respect other options like MinWords and ShortWord.
>
> As I understand it, you very much want to call the Cover() function instead of
> hlCover(). By design they should be identical, but they accept different
> representations of the document. So the best way is to generalize them: develop
> a new function which can be called with some kind of callback and/or opaque
> structure, so that it can be used in both rank and headline.
> 
> >
> > NumFragments < 2:
> > I wanted people to use the new headline marker if they specify
> > NumFragments >= 1. If they do not specify the NumFragments or put it to
> Ok, but if you unify cover generation and NumFragments == 1 then result for old
> and new algorithms should be the same...
> 
> 
> > On an another note I found that make_tsvector crashes if it receives a
> > ParsedText with curwords = 0. Specifically uniqueWORD returns curwords
> > as 1 even when it gets 0 words. I am not sure if this is the desired
> > behavior.
> In all places there is a check before call of make_tsvector.
> 
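
The suggestion quoted above, to generalize Cover() and hlCover() behind a callback or opaque structure, could be sketched roughly as follows. This is Python rather than the C used in the tree, and every name here (find_covers, term_at, covers_from_words, covers_from_positions) is hypothetical: it only illustrates one search routine serving both the headline-style word array and the rank-style position list, assuming an AND query and already-normalized words.

```python
def find_covers(n_items, term_at, n_terms):
    """Yield (first, last) item indexes of each minimal span holding all terms.

    term_at(i) is the callback: it returns the query term found at item i,
    or None when item i does not match the query.
    """
    counts = {}
    left = 0
    for right in range(n_items):
        term = term_at(right)
        if term is None:
            continue
        counts[term] = counts.get(term, 0) + 1
        if len(counts) < n_terms:
            continue
        # Advance left past non-matching items and surplus occurrences.
        while True:
            left_term = term_at(left)
            if left_term is None:
                left += 1
            elif counts[left_term] > 1:
                counts[left_term] -= 1
                left += 1
            else:
                break
        yield left, right
        # Release the leftmost term so the search continues to the right.
        del counts[term_at(left)]
        left += 1


def covers_from_words(words, query_terms):
    """Headline-style input: every word of the text, matching or not."""
    terms = set(query_terms)

    def term_at(i):
        return words[i] if words[i] in terms else None

    return list(find_covers(len(words), term_at, len(terms)))


def covers_from_positions(postings, n_terms):
    """Rank-style input: sorted (position, term) pairs for query words only."""

    def term_at(i):
        return postings[i][1]

    return [(postings[a][0], postings[b][0])
            for a, b in find_covers(len(postings), term_at, n_terms)]
```

With the words "the freedom of speech and freedom" and the query terms freedom and speech, both adapters report the same two covers, word positions 1 to 3 and 3 to 5. The headline adapter gets there by stepping over non-matching words, while the rank adapter only ever sees the three postings.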


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@[EMAIL PROTECTED])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
 



