Talk About Network

Google


Register and Login
Nick
Password
Register create new account Sign up is FREE and you can post replies, new topics, bookmark posts and more!
Recover lost password


Data Bases > Pgsql Sql > Re: in() VS exi...
Latest [ Topics | Posts ] Archive Post A New Topic Post a Reply
<< Topic < Post Post 1 of 1 Topic 3347 of 3799
Post > Topic >>

Re: in() VS exists()

by Bart.Degryse@[EMAIL PROTECTED] ("Bart Degryse") Mar 13, 2008 at 03:50 PM

--=__PartC7E1540E.0__=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

So it all comes back to what I wrote in the beginning:
"Which solution performs best depends on the data, the database
version, the available indexes, ..."
Tips:
  - be aware that statements can be written in different ways
  - test them on realistic data
  - use explain to tune your statements


>>> Julien <jcigar@[EMAIL PROTECTED]
> 2008-03-13 17:50 >>>
>From my experience I tend to avoid SELECT DISTINCT queries because
it's
usually very slow with many rows ...

For my specific case the result is the same:

muridae=> select count(distinct s.specimen_id) from specimens sp INNER
JOIN sequences s on s.specimen_id = sp.id;
count 
-------
  1431
(1 row)

Time: 65.351 ms
muridae=> select count(sp.id) from specimens sp where sp.id in (select
specimen_id from sequences group by specimen_id);
count 
-------
  1431
(1 row)

Time: 66.371 ms

But to give an example, I have a table with ~1 000 000 rows where the
DISTINCT solution is more than 10 times slower :

muridae=> select count(distinct sp.id) from specimens sp INNER JOIN
specimen_measurements m ON m.specimen_id = sp.id;
count 
-------
75241
(1 row)

Time: 15970.668 ms

muridae=> select count(sp.id) from specimens sp INNER JOIN (select
specimen_id from specimen_measurements GROUP BY specimen_id) as foo on
foo.specimen_id = sp.id;
count 
-------
75241
(1 row)

Time: 1165.487 ms

Regards,
Julien

On Thu, 2008-03-13 at 15:28 +0100, Bart Degryse wrote:
> how about
> select count(distinct s.specimen_id) from specimens sp INNER JOIN
> sequences s 
> on s.specimen_id = sp.id;
> 
> 
> >>> Julien <jcigar@[EMAIL PROTECTED]
> 2008-03-13 17:27 >>>
> mmh no because it's a one to many relation (a specimen can have more
> than one sequence) :
> 
> muridae=> select count(sp.id) from specimens sp INNER JOIN sequences
s
> on s.specimen_id = sp.id;
> count 
> -------
>   1536
> (1 row)
> 
> Time: 81.242 ms
> muridae=> select count(sp.id) from specimens sp where sp.id in
(select
> specimen_id from sequences group by specimen_id);
> count 
> -------
>   1431
> (1 row)
> 
> Time: 81.736 ms
> muridae=> 
> 
> (of course this is a bad example, because I could just do: select
> count(specimen_id) from sequences group by specimen_id;, but in my
> application I have more fields coming from specimens of course)
> 
> Julien
> 
> On Thu, 2008-03-13 at 15:12 +0100, Bart Degryse wrote:
> > I think that just
> > select count(sp.id) from specimens sp INNER JOIN sequences s on
> > s.specimen_id = sp.id;
> > should be enough
> > 
> > >>> Julien <jcigar@[EMAIL PROTECTED]
> 2008-03-13 17:10 >>>
> > If I understood well the query plan, the planner optimize the
> > IN(SELECT ...) version with a JOIN (line 19-20 of the first paste)
:
> > 
> > ->  Hash IN Join  (cost=240.95..4011.20 rows=1436 width=4) (actual
> > time=93.971..201.908 rows=1431 loops=1)
> >          Hash Cond: ("outer".id = "inner".specimen_id)
> > 
> > so I guess that :
> > 
> > select count(sp.id) from specimens sp where sp.id in (select
> > specimen_id
> > from sequences);
> > 
> > is almost the same as :
> > 
> > select count(sp.id) from specimens sp INNER JOIN (select
specimen_id
> > from sequences GROUP BY specimen_id) as foo on foo.specimen_id =
> > sp.id;
> > 
> > ?
> > 
> > Thanks,
> > Julien
> > 
> > On Thu, 2008-03-13 at 14:46 +0100, Bart Degryse wrote:
> > > The chapter on indexes in the manual should give you a pretty
good
> > > idea on the why.
> > > IN and EXISTS are not the only possibilities, you can also use
> inner
> > > or outer joins.
> > > Which solution performs best depends on the data, the database
> > > version, the available indexes, ...
> > > 
> > > >>> Julien <jcigar@[EMAIL PROTECTED]
> 2008-03-13 15:47 >>>
> > > Hello,
> > > 
> > > Does anyone has an idea why sometimes:
> > > - select ... where ... in (select ...)
> > > is faster than :
> > > - select ... where ... exists(select ...)
> > > and sometimes it's the opposite ?
> > > 
> > > I had such a situation, I've pasted the queries on:
> > > http://rafb.net/p/KXNZ6892.html
and
> http://rafb.net/p/jvo5DO38.html

> > > 
> > > It's running PostgreSQL 8.1 with an effective_cache_size of
30000.
> > > 
> > > specimens.id is the primary key and there are indexes on
> > > sequences(specimen_id) and specimen_measurements(specimen_id)
> > > 
> > > Is there a general "rule" to know when to use the in() version
and
> > > when
> > > to use the exists() version ? Is it true to say that the
exists()
> > > version is more scalable (with many rows) than the in() version
> > (from
> > > the little tests I made it seems the case) ?
> > > 
> > > Thanks,
> > > Julien
> > > 
> > > -- 
> > > Julien Cigar
> > > Belgian Biodiversity Platform
> > > http://www.biodiversity.be
( http://www.biodiversity.be/
)
> > > Université Libre de Bruxelles (ULB)
> > > Campus de la Plaine CP 257
> > > Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> > > Boulevard du Triomphe, entrée ULB 2
> > > B-1050 Bruxelles
> > > Mail: jcigar@[EMAIL PROTECTED]
 
> > > @[EMAIL PROTECTED]
 http://biobel.biodiversity.be/person/show/471

> > > Tel : 02 650 57 52
> > > 
> > > 
> > > -- 
> > > Sent via pgsql-sql mailing list (pgsql-sql@[EMAIL PROTECTED]
)
> > > To make changes to your subscription:
> > > http://www.postgresql.org/mailpref/pgsql-sql

> > > 
> > -- 
> > Julien Cigar
> > Belgian Biodiversity Platform
> > http://www.biodiversity.be
( http://www.biodiversity.be/
)
> > Université Libre de Bruxelles (ULB)
> > Campus de la Plaine CP 257
> > Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> > Boulevard du Triomphe, entrée ULB 2
> > B-1050 Bruxelles
> > Mail: jcigar@[EMAIL PROTECTED]
 
> > @[EMAIL PROTECTED]
 http://biobel.biodiversity.be/person/show/471

> > Tel : 02 650 57 52
> > 
> > 
> > -- 
> > Sent via pgsql-sql mailing list (pgsql-sql@[EMAIL PROTECTED]
)
> > To make changes to your subscription:
> > http://www.postgresql.org/mailpref/pgsql-sql

> > 
> -- 
> Julien Cigar
> Belgian Biodiversity Platform
> http://www.biodiversity.be
( http://www.biodiversity.be/
)
> Université Libre de Bruxelles (ULB)
> Campus de la Plaine CP 257
> Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> Boulevard du Triomphe, entrée ULB 2
> B-1050 Bruxelles
> Mail: jcigar@[EMAIL PROTECTED]
 
> @[EMAIL PROTECTED]
 http://biobel.biodiversity.be/person/show/471

> Tel : 02 650 57 52
> 
> 
> -- 
> Sent via pgsql-sql mailing list (pgsql-sql@[EMAIL PROTECTED]
)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql

> 
-- 
Julien Cigar
Belgian Biodiversity Platform
http://www.biodiversity.be
( http://www.biodiversity.be/
)
Université Libre de Bruxelles (ULB)
Campus de la Plaine CP 257
Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
Boulevard du Triomphe, entrée ULB 2
B-1050 Bruxelles
Mail: jcigar@[EMAIL PROTECTED]
 
@[EMAIL PROTECTED]
 http://biobel.biodiversity.be/person/show/471

Tel : 02 650 57 52


--=__PartC7E1540E.0__=
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<HTML><HEAD>
<META http-equiv=3DContent-Type content=3D"text/html; charset=3Dutf-8">
<META content=3D"MSHTML 6.00.2800.1264" name=3DGENERATOR></HEAD>
<BODY style=3D"MARGIN: 4px 4px 1px; FONT: 10pt Tahoma">
<DIV>So it all comes back to what I wrote in the beginning:</DIV>
<DIV>"Which solution performs best depends on the data, the
database&nbsp;v=
ersion, the available indexes, ..."</DIV>
<DIV>Tips:</DIV>
<DIV>&nbsp;&nbsp;- be aware that statements can be written in different =
ways</DIV>
<DIV>&nbsp; - test them on realistic data</DIV>
<DIV>&nbsp; - use explain to tune your statements<BR><BR><BR>&gt;&gt;&gt;
=
Julien &lt;jcigar@[EMAIL PROTECTED]
 2008-03-13 17:50 &gt;&gt;&gt;<BR>&gt;From
=
my experience I tend to avoid SELECT DISTINCT queries because
it's<BR>usual=
ly very slow with many rows ...<BR><BR>For my specific case the result is
=
the same:<BR><BR>muridae=3D&gt; select count(distinct s.specimen_id) from
=
specimens sp INNER<BR>JOIN sequences s on s.specimen_id =3D
sp.id;<BR>count=
 <BR>-------<BR>&nbsp; 1431<BR>(1 row)<BR><BR>Time: 65.351
ms<BR>muridae=3D=
&gt; select count(sp.id) from specimens sp where sp.id in
(select<BR>specim=
en_id from sequences group by specimen_id);<BR>count <BR>-------<BR>&nbsp;
=
1431<BR>(1 row)<BR><BR>Time: 66.371 ms<BR><BR>But to give an example, I =
have a table with ~1 000 000 rows where the<BR>DISTINCT solution is more =
than 10 times slower :<BR><BR>muridae=3D&gt; select count(distinct sp.id)
=
from specimens sp INNER JOIN<BR>specimen_measurements m ON m.specimen_id =
=3D sp.id;<BR>count <BR>-------<BR>75241<BR>(1 row)<BR><BR>Time: 15970.668
=
ms<BR><BR>muridae=3D&gt; select count(sp.id) from specimens sp INNER JOIN
=
(select<BR>specimen_id from specimen_measurements GROUP BY specimen_id) as
=
foo on<BR>foo.specimen_id =3D sp.id;<BR>count <BR>-------<BR>75241<BR>(1 =
row)<BR><BR>Time: 1165.487 ms<BR><BR>Regards,<BR>Julien<BR><BR>On Thu, =
2008-03-13 at 15:28 +0100, Bart Degryse wrote:<BR>&gt; how about<BR>&gt; =
select count(distinct s.specimen_id) from specimens sp INNER JOIN<BR>&gt;
=
sequences s <BR>&gt; on s.specimen_id =3D sp.id;<BR>&gt; <BR>&gt; <BR>&gt;
=
&gt;&gt;&gt; Julien &lt;jcigar@[EMAIL PROTECTED]
 2008-03-13 17:27
&gt;&gt;&gt;<=
BR>&gt; mmh no because it's a one to many relation (a specimen can have =
more<BR>&gt; than one sequence) :<BR>&gt; <BR>&gt; muridae=3D&gt; select =
count(sp.id) from specimens sp INNER JOIN sequences s<BR>&gt; on
s.specimen=
_id =3D sp.id;<BR>&gt; count <BR>&gt; -------<BR>&gt;&nbsp;&nbsp; =
1536<BR>&gt; (1 row)<BR>&gt; <BR>&gt; Time: 81.242 ms<BR>&gt;
muridae=3D&gt=
; select count(sp.id) from specimens sp where sp.id in (select<BR>&gt; =
specimen_id from sequences group by specimen_id);<BR>&gt; count <BR>&gt; =
-------<BR>&gt;&nbsp;&nbsp; 1431<BR>&gt; (1 row)<BR>&gt; <BR>&gt; Time: =
81.736 ms<BR>&gt; muridae=3D&gt; <BR>&gt; <BR>&gt; (of course this is a =
bad example, because I could just do: select<BR>&gt; count(specimen_id) =
from sequences group by specimen_id;, but in my<BR>&gt; application I have
=
more fields coming from specimens of course)<BR>&gt; <BR>&gt;
Julien<BR>&gt=
; <BR>&gt; On Thu, 2008-03-13 at 15:12 +0100, Bart Degryse wrote:<BR>&gt;
=
&gt; I think that just<BR>&gt; &gt; select count(sp.id) from specimens sp
=
INNER JOIN sequences s on<BR>&gt; &gt; s.specimen_id =3D sp.id;<BR>&gt; =
&gt; should be enough<BR>&gt; &gt; <BR>&gt; &gt; &gt;&gt;&gt; Julien =
&lt;jcigar@[EMAIL PROTECTED]
 2008-03-13 17:10 &gt;&gt;&gt;<BR>&gt; &gt; If I =
understood well the query plan, the planner optimize the<BR>&gt; &gt; =
IN(SELECT ...) version with a JOIN (line 19-20 of the first paste) =
:<BR>&gt; &gt; <BR>&gt; &gt; -&gt;&nbsp; Hash IN Join&nbsp;
(cost=3D240.95.=
..4011.20 rows=3D1436 width=3D4) (actual<BR>&gt; &gt;
time=3D93.971..201.908=
 rows=3D1431 loops=3D1)<BR>&gt;
&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nb=
sp;&nbsp;&nbsp; Hash Cond: ("outer".id =3D "inner".specimen_id)<BR>&gt; =
&gt; <BR>&gt; &gt; so I guess that :<BR>&gt; &gt; <BR>&gt; &gt; select =
count(sp.id) from specimens sp where sp.id in (select<BR>&gt; &gt; =
specimen_id<BR>&gt; &gt; from sequences);<BR>&gt; &gt; <BR>&gt; &gt; is =
almost the same as :<BR>&gt; &gt; <BR>&gt; &gt; select count(sp.id) from =
specimens sp INNER JOIN (select specimen_id<BR>&gt; &gt; from sequences =
GROUP BY specimen_id) as foo on foo.specimen_id =3D<BR>&gt; &gt;
sp.id;<BR>=
&gt; &gt; <BR>&gt; &gt; ?<BR>&gt; &gt; <BR>&gt; &gt; Thanks,<BR>&gt; &gt;
=
Julien<BR>&gt; &gt; <BR>&gt; &gt; On Thu, 2008-03-13 at 14:46 +0100, Bart
=
Degryse wrote:<BR>&gt; &gt; &gt; The chapter on indexes in the manual =
should give you a pretty good<BR>&gt; &gt; &gt; idea on the why.<BR>&gt; =
&gt; &gt; IN and EXISTS are not the only possibilities, you can also =
use<BR>&gt; inner<BR>&gt; &gt; &gt; or outer joins.<BR>&gt; &gt; &gt; =
Which solution performs best depends on the data, the database<BR>&gt; =
&gt; &gt; version, the available indexes, ...<BR>&gt; &gt; &gt; <BR>&gt; =
&gt; &gt; &gt;&gt;&gt; Julien &lt;jcigar@[EMAIL PROTECTED]
 2008-03-13 15:47 =
&gt;&gt;&gt;<BR>&gt; &gt; &gt; Hello,<BR>&gt; &gt; &gt; <BR>&gt; &gt; &gt;
=
Does anyone has an idea why sometimes:<BR>&gt; &gt; &gt; - select ... =
where ... in (select ...)<BR>&gt; &gt; &gt; is faster than :<BR>&gt; &gt;
=
&gt; - select ... where ... exists(select ...)<BR>&gt; &gt; &gt; and =
sometimes it's the opposite ?<BR>&gt; &gt; &gt; <BR>&gt; &gt; &gt; I had =
such a situation, I've pasted the queries on:<BR>&gt; &gt; &gt; <A =
href=3D"http://rafb.net/p/KXNZ6892.html">http://rafb.net/p/KXNZ6892.html</A=
> and<BR>&gt; <A
href=3D"http://rafb.net/p/jvo5DO38.html">http://rafb.net/p=
/jvo5DO38.html</A><BR>&gt; &gt; &gt; <BR>&gt; &gt; &gt; It's running =
PostgreSQL 8.1 with an effective_cache_size of 30000.<BR>&gt; &gt; &gt; =
<BR>&gt; &gt; &gt; specimens.id is the primary key and there are indexes =
on<BR>&gt; &gt; &gt; sequences(specimen_id) and
specimen_measurements(speci=
men_id)<BR>&gt; &gt; &gt; <BR>&gt; &gt; &gt; Is there a general "rule" to
=
know when to use the in() version and<BR>&gt; &gt; &gt; when<BR>&gt; &gt;
=
&gt; to use the exists() version ? Is it true to say that the
exists()<BR>&=
gt; &gt; &gt; version is more scalable (with many rows) than the in() =
version<BR>&gt; &gt; (from<BR>&gt; &gt; &gt; the little tests I made it =
seems the case) ?<BR>&gt; &gt; &gt; <BR>&gt; &gt; &gt; Thanks,<BR>&gt; =
&gt; &gt; Julien<BR>&gt; &gt; &gt; <BR>&gt; &gt; &gt; -- <BR>&gt; &gt; =
&gt; Julien Cigar<BR>&gt; &gt; &gt; Belgian Biodiversity Platform<BR>&gt;
=
&gt; &gt; <A
href=3D"http://www.biodiversity.be/">http://www.biodiversity.b=
e</A><BR>&gt; &gt; &gt; Universit=C3=A9 Libre de Bruxelles (ULB)<BR>&gt; =
&gt; &gt; Campus de la Plaine CP 257<BR>&gt; &gt; &gt; B=C3=A2timent NO, =
Bureau 4 N4 115C (Niveau 4)<BR>&gt; &gt; &gt; Boulevard du Triomphe, =
entr=C3=A9e ULB 2<BR>&gt; &gt; &gt; B-1050 Bruxelles<BR>&gt; &gt; &gt; =
Mail: jcigar@[EMAIL PROTECTED]
>&gt; &gt; &gt; @[EMAIL PROTECTED]
 <A
href=3D"http://biobel.=
biodiversity.be/person/show/471">http://biobel.biodiversity.be/person/show/=
471</A><BR>&gt; &gt; &gt; Tel : 02 650 57 52<BR>&gt; &gt; &gt; <BR>&gt; =
&gt; &gt; <BR>&gt; &gt; &gt; -- <BR>&gt; &gt; &gt; Sent via pgsql-sql =
mailing list (pgsql-sql@[EMAIL PROTECTED]
)<BR>&gt; &gt; &gt; To make changes
=
to your subscription:<BR>&gt; &gt; &gt; <A
href=3D"http://www.postgresql.or=
g/mailpref/pgsql-sql">http://www.postgresql.org/mailpref/pgsql-sql</A><BR>&=
gt; &gt; &gt; <BR>&gt; &gt; -- <BR>&gt; &gt; Julien Cigar<BR>&gt; &gt; =
Belgian Biodiversity Platform<BR>&gt; &gt; <A
href=3D"http://www.biodiversi=
ty.be/">http://www.biodiversity.be</A><BR>&gt;
&gt; Universit=C3=A9 Libre
=
de Bruxelles (ULB)<BR>&gt; &gt; Campus de la Plaine CP 257<BR>&gt; &gt; =
B=C3=A2timent NO, Bureau 4 N4 115C (Niveau 4)<BR>&gt; &gt; Boulevard du =
Triomphe, entr=C3=A9e ULB 2<BR>&gt; &gt; B-1050 Bruxelles<BR>&gt; &gt; =
Mail: jcigar@[EMAIL PROTECTED]
>&gt; &gt; @[EMAIL PROTECTED]
 <A
href=3D"http://biobel.biodi=
versity.be/person/show/471">http://biobel.biodiversity.be/person/show/471</=
A><BR>&gt; &gt; Tel : 02 650 57 52<BR>&gt; &gt; <BR>&gt; &gt; <BR>&gt; =
&gt; -- <BR>&gt; &gt; Sent via pgsql-sql mailing list
(pgsql-sql@[EMAIL PROTECTED]
)<BR>&gt; &gt; To make changes to your subscription:<BR>&gt; &gt; <A
=
href=3D"http://www.postgresql.org/mailpref/pgsql-sql">http://www.postgresql=
..org/mailpref/pgsql-sql</A><BR>&gt; &gt; <BR>&gt; -- <BR>&gt; Julien =
Cigar<BR>&gt; Belgian Biodiversity Platform<BR>&gt; <A
href=3D"http://www.b=
iodiversity.be/">http://www.biodiversity.be</A><BR>&gt;
Universit=C3=A9 =
Libre de Bruxelles (ULB)<BR>&gt; Campus de la Plaine CP 257<BR>&gt; =
B=C3=A2timent NO, Bureau 4 N4 115C (Niveau 4)<BR>&gt; Boulevard du =
Triomphe, entr=C3=A9e ULB 2<BR>&gt; B-1050 Bruxelles<BR>&gt; Mail: =
jcigar@[EMAIL PROTECTED]
>&gt; @[EMAIL PROTECTED]
 <A
href=3D"http://biobel.biodiversity.be/=
person/show/471">http://biobel.biodiversity.be/person/show/471</A><BR>&gt;
=
Tel : 02 650 57 52<BR>&gt; <BR>&gt; <BR>&gt; -- <BR>&gt; Sent via =
pgsql-sql mailing list (pgsql-sql@[EMAIL PROTECTED]
)<BR>&gt; To make changes
=
to your subscription:<BR>&gt; <A
href=3D"http://www.postgresql.org/mailpref=
/pgsql-sql">http://www.postgresql.org/mailpref/pgsql-sql</A><BR>&gt;
=
<BR>-- <BR>Julien Cigar<BR>Belgian Biodiversity Platform<BR><A
href=3D"http=
://www.biodiversity.be/">http://www.biodiversity.be</A><BR>Universit=C3=A9
=
Libre de Bruxelles (ULB)<BR>Campus de la Plaine CP 257<BR>B=C3=A2timent =
NO, Bureau 4 N4 115C (Niveau 4)<BR>Boulevard du Triomphe, entr=C3=A9e ULB
=
2<BR>B-1050 Bruxelles<BR>Mail: jcigar@[EMAIL PROTECTED]
>@[EMAIL PROTECTED]
 <A
href=3D"http=
://biobel.biodiversity.be/person/show/471">http://biobel.biodiversity.be/pe=
rson/show/471</A><BR>Tel : 02 650 57 52<BR><BR></DIV></BODY></HTML>

--=__PartC7E1540E.0__=--
 




 1 Posts in Topic:
Re: in() VS exists()
Bart.Degryse@[EMAIL PROTE  2008-03-13 15:50:06 

Post A Reply:
  Go here to Signup

AddThis Feed Button


About - Advertising - Contact - Frequently Asked Questions - Privacy Policy - Terms of Use - Signup

Contact
tan12V112 Mon Dec 1 21:29:31 CST 2008.