--=__PartC7E1540E.0__=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
So it all comes back to what I wrote in the beginning:
"Which solution performs best depends on the data, the database
version, the available indexes, ..."
Tips:
- be aware that statements can be written in different ways
- test them on realistic data
- use explain to tune your statements
>>> Julien <jcigar@[EMAIL PROTECTED]
> 2008-03-13 17:50 >>>
>From my experience I tend to avoid SELECT DISTINCT queries because
it's
usually very slow with many rows ...
For my specific case the result is the same:
muridae=> select count(distinct s.specimen_id) from specimens sp INNER
JOIN sequences s on s.specimen_id = sp.id;
count
-------
1431
(1 row)
Time: 65.351 ms
muridae=> select count(sp.id) from specimens sp where sp.id in (select
specimen_id from sequences group by specimen_id);
count
-------
1431
(1 row)
Time: 66.371 ms
But to give an example, I have a table with ~1 000 000 rows where the
DISTINCT solution is more than 10 times slower :
muridae=> select count(distinct sp.id) from specimens sp INNER JOIN
specimen_measurements m ON m.specimen_id = sp.id;
count
-------
75241
(1 row)
Time: 15970.668 ms
muridae=> select count(sp.id) from specimens sp INNER JOIN (select
specimen_id from specimen_measurements GROUP BY specimen_id) as foo on
foo.specimen_id = sp.id;
count
-------
75241
(1 row)
Time: 1165.487 ms
Regards,
Julien
On Thu, 2008-03-13 at 15:28 +0100, Bart Degryse wrote:
> how about
> select count(distinct s.specimen_id) from specimens sp INNER JOIN
> sequences s
> on s.specimen_id = sp.id;
>
>
> >>> Julien <jcigar@[EMAIL PROTECTED]
> 2008-03-13 17:27 >>>
> mmh no because it's a one to many relation (a specimen can have more
> than one sequence) :
>
> muridae=> select count(sp.id) from specimens sp INNER JOIN sequences
s
> on s.specimen_id = sp.id;
> count
> -------
> 1536
> (1 row)
>
> Time: 81.242 ms
> muridae=> select count(sp.id) from specimens sp where sp.id in
(select
> specimen_id from sequences group by specimen_id);
> count
> -------
> 1431
> (1 row)
>
> Time: 81.736 ms
> muridae=>
>
> (of course this is a bad example, because I could just do: select
> count(specimen_id) from sequences group by specimen_id;, but in my
> application I have more fields coming from specimens of course)
>
> Julien
>
> On Thu, 2008-03-13 at 15:12 +0100, Bart Degryse wrote:
> > I think that just
> > select count(sp.id) from specimens sp INNER JOIN sequences s on
> > s.specimen_id = sp.id;
> > should be enough
> >
> > >>> Julien <jcigar@[EMAIL PROTECTED]
> 2008-03-13 17:10 >>>
> > If I understood well the query plan, the planner optimize the
> > IN(SELECT ...) version with a JOIN (line 19-20 of the first paste)
:
> >
> > -> Hash IN Join (cost=240.95..4011.20 rows=1436 width=4) (actual
> > time=93.971..201.908 rows=1431 loops=1)
> > Hash Cond: ("outer".id = "inner".specimen_id)
> >
> > so I guess that :
> >
> > select count(sp.id) from specimens sp where sp.id in (select
> > specimen_id
> > from sequences);
> >
> > is almost the same as :
> >
> > select count(sp.id) from specimens sp INNER JOIN (select
specimen_id
> > from sequences GROUP BY specimen_id) as foo on foo.specimen_id =
> > sp.id;
> >
> > ?
> >
> > Thanks,
> > Julien
> >
> > On Thu, 2008-03-13 at 14:46 +0100, Bart Degryse wrote:
> > > The chapter on indexes in the manual should give you a pretty
good
> > > idea on the why.
> > > IN and EXISTS are not the only possibilities, you can also use
> inner
> > > or outer joins.
> > > Which solution performs best depends on the data, the database
> > > version, the available indexes, ...
> > >
> > > >>> Julien <jcigar@[EMAIL PROTECTED]
> 2008-03-13 15:47 >>>
> > > Hello,
> > >
> > > Does anyone has an idea why sometimes:
> > > - select ... where ... in (select ...)
> > > is faster than :
> > > - select ... where ... exists(select ...)
> > > and sometimes it's the opposite ?
> > >
> > > I had such a situation, I've pasted the queries on:
> > > http://rafb.net/p/KXNZ6892.html
and
> http://rafb.net/p/jvo5DO38.html
> > >
> > > It's running PostgreSQL 8.1 with an effective_cache_size of
30000.
> > >
> > > specimens.id is the primary key and there are indexes on
> > > sequences(specimen_id) and specimen_measurements(specimen_id)
> > >
> > > Is there a general "rule" to know when to use the in() version
and
> > > when
> > > to use the exists() version ? Is it true to say that the
exists()
> > > version is more scalable (with many rows) than the in() version
> > (from
> > > the little tests I made it seems the case) ?
> > >
> > > Thanks,
> > > Julien
> > >
> > > --
> > > Julien Cigar
> > > Belgian Biodiversity Platform
> > > http://www.biodiversity.be
( http://www.biodiversity.be/
)
> > > Université Libre de Bruxelles (ULB)
> > > Campus de la Plaine CP 257
> > > Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> > > Boulevard du Triomphe, entrée ULB 2
> > > B-1050 Bruxelles
> > > Mail: jcigar@[EMAIL PROTECTED]
> > > @[EMAIL PROTECTED]
http://biobel.biodiversity.be/person/show/471
> > > Tel : 02 650 57 52
> > >
> > >
> > > --
> > > Sent via pgsql-sql mailing list (pgsql-sql@[EMAIL PROTECTED]
)
> > > To make changes to your subscription:
> > > http://www.postgresql.org/mailpref/pgsql-sql
> > >
> > --
> > Julien Cigar
> > Belgian Biodiversity Platform
> > http://www.biodiversity.be
( http://www.biodiversity.be/
)
> > Université Libre de Bruxelles (ULB)
> > Campus de la Plaine CP 257
> > Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> > Boulevard du Triomphe, entrée ULB 2
> > B-1050 Bruxelles
> > Mail: jcigar@[EMAIL PROTECTED]
> > @[EMAIL PROTECTED]
http://biobel.biodiversity.be/person/show/471
> > Tel : 02 650 57 52
> >
> >
> > --
> > Sent via pgsql-sql mailing list (pgsql-sql@[EMAIL PROTECTED]
)
> > To make changes to your subscription:
> > http://www.postgresql.org/mailpref/pgsql-sql
> >
> --
> Julien Cigar
> Belgian Biodiversity Platform
> http://www.biodiversity.be
( http://www.biodiversity.be/
)
> Université Libre de Bruxelles (ULB)
> Campus de la Plaine CP 257
> Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> Boulevard du Triomphe, entrée ULB 2
> B-1050 Bruxelles
> Mail: jcigar@[EMAIL PROTECTED]
> @[EMAIL PROTECTED]
http://biobel.biodiversity.be/person/show/471
> Tel : 02 650 57 52
>
>
> --
> Sent via pgsql-sql mailing list (pgsql-sql@[EMAIL PROTECTED]
)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql
>
--
Julien Cigar
Belgian Biodiversity Platform
http://www.biodiversity.be
( http://www.biodiversity.be/
)
Université Libre de Bruxelles (ULB)
Campus de la Plaine CP 257
Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
Boulevard du Triomphe, entrée ULB 2
B-1050 Bruxelles
Mail: jcigar@[EMAIL PROTECTED]
@[EMAIL PROTECTED]
http://biobel.biodiversity.be/person/show/471
Tel : 02 650 57 52
--=__PartC7E1540E.0__=
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
<HTML><HEAD>
<META http-equiv=3DContent-Type content=3D"text/html; charset=3Dutf-8">
<META content=3D"MSHTML 6.00.2800.1264" name=3DGENERATOR></HEAD>
<BODY style=3D"MARGIN: 4px 4px 1px; FONT: 10pt Tahoma">
<DIV>So it all comes back to what I wrote in the beginning:</DIV>
<DIV>"Which solution performs best depends on the data, the
database v=
ersion, the available indexes, ..."</DIV>
<DIV>Tips:</DIV>
<DIV> - be aware that statements can be written in different =
ways</DIV>
<DIV> - test them on realistic data</DIV>
<DIV> - use explain to tune your statements<BR><BR><BR>>>>
=
Julien <jcigar@[EMAIL PROTECTED]
2008-03-13 17:50 >>><BR>>From
=
my experience I tend to avoid SELECT DISTINCT queries because
it's<BR>usual=
ly very slow with many rows ...<BR><BR>For my specific case the result is
=
the same:<BR><BR>muridae=3D> select count(distinct s.specimen_id) from
=
specimens sp INNER<BR>JOIN sequences s on s.specimen_id =3D
sp.id;<BR>count=
<BR>-------<BR> 1431<BR>(1 row)<BR><BR>Time: 65.351
ms<BR>muridae=3D=
> select count(sp.id) from specimens sp where sp.id in
(select<BR>specim=
en_id from sequences group by specimen_id);<BR>count <BR>-------<BR>
=
1431<BR>(1 row)<BR><BR>Time: 66.371 ms<BR><BR>But to give an example, I =
have a table with ~1 000 000 rows where the<BR>DISTINCT solution is more =
than 10 times slower :<BR><BR>muridae=3D> select count(distinct sp.id)
=
from specimens sp INNER JOIN<BR>specimen_measurements m ON m.specimen_id =
=3D sp.id;<BR>count <BR>-------<BR>75241<BR>(1 row)<BR><BR>Time: 15970.668
=
ms<BR><BR>muridae=3D> select count(sp.id) from specimens sp INNER JOIN
=
(select<BR>specimen_id from specimen_measurements GROUP BY specimen_id) as
=
foo on<BR>foo.specimen_id =3D sp.id;<BR>count <BR>-------<BR>75241<BR>(1 =
row)<BR><BR>Time: 1165.487 ms<BR><BR>Regards,<BR>Julien<BR><BR>On Thu, =
2008-03-13 at 15:28 +0100, Bart Degryse wrote:<BR>> how about<BR>> =
select count(distinct s.specimen_id) from specimens sp INNER JOIN<BR>>
=
sequences s <BR>> on s.specimen_id =3D sp.id;<BR>> <BR>> <BR>>
=
>>> Julien <jcigar@[EMAIL PROTECTED]
2008-03-13 17:27
>>><=
BR>> mmh no because it's a one to many relation (a specimen can have =
more<BR>> than one sequence) :<BR>> <BR>> muridae=3D> select =
count(sp.id) from specimens sp INNER JOIN sequences s<BR>> on
s.specimen=
_id =3D sp.id;<BR>> count <BR>> -------<BR>> =
1536<BR>> (1 row)<BR>> <BR>> Time: 81.242 ms<BR>>
muridae=3D>=
; select count(sp.id) from specimens sp where sp.id in (select<BR>> =
specimen_id from sequences group by specimen_id);<BR>> count <BR>> =
-------<BR>> 1431<BR>> (1 row)<BR>> <BR>> Time: =
81.736 ms<BR>> muridae=3D> <BR>> <BR>> (of course this is a =
bad example, because I could just do: select<BR>> count(specimen_id) =
from sequences group by specimen_id;, but in my<BR>> application I have
=
more fields coming from specimens of course)<BR>> <BR>>
Julien<BR>>=
; <BR>> On Thu, 2008-03-13 at 15:12 +0100, Bart Degryse wrote:<BR>>
=
> I think that just<BR>> > select count(sp.id) from specimens sp
=
INNER JOIN sequences s on<BR>> > s.specimen_id =3D sp.id;<BR>> =
> should be enough<BR>> > <BR>> > >>> Julien =
<jcigar@[EMAIL PROTECTED]
2008-03-13 17:10 >>><BR>> > If I =
understood well the query plan, the planner optimize the<BR>> > =
IN(SELECT ...) version with a JOIN (line 19-20 of the first paste) =
:<BR>> > <BR>> > -> Hash IN Join
(cost=3D240.95.=
..4011.20 rows=3D1436 width=3D4) (actual<BR>> >
time=3D93.971..201.908=
rows=3D1431 loops=3D1)<BR>>
> &nb=
sp; Hash Cond: ("outer".id =3D "inner".specimen_id)<BR>> =
> <BR>> > so I guess that :<BR>> > <BR>> > select =
count(sp.id) from specimens sp where sp.id in (select<BR>> > =
specimen_id<BR>> > from sequences);<BR>> > <BR>> > is =
almost the same as :<BR>> > <BR>> > select count(sp.id) from =
specimens sp INNER JOIN (select specimen_id<BR>> > from sequences =
GROUP BY specimen_id) as foo on foo.specimen_id =3D<BR>> >
sp.id;<BR>=
> > <BR>> > ?<BR>> > <BR>> > Thanks,<BR>> >
=
Julien<BR>> > <BR>> > On Thu, 2008-03-13 at 14:46 +0100, Bart
=
Degryse wrote:<BR>> > > The chapter on indexes in the manual =
should give you a pretty good<BR>> > > idea on the why.<BR>> =
> > IN and EXISTS are not the only possibilities, you can also =
use<BR>> inner<BR>> > > or outer joins.<BR>> > > =
Which solution performs best depends on the data, the database<BR>> =
> > version, the available indexes, ...<BR>> > > <BR>> =
> > >>> Julien <jcigar@[EMAIL PROTECTED]
2008-03-13 15:47 =
>>><BR>> > > Hello,<BR>> > > <BR>> > >
=
Does anyone has an idea why sometimes:<BR>> > > - select ... =
where ... in (select ...)<BR>> > > is faster than :<BR>> >
=
> - select ... where ... exists(select ...)<BR>> > > and =
sometimes it's the opposite ?<BR>> > > <BR>> > > I had =
such a situation, I've pasted the queries on:<BR>> > > <A =
href=3D"http://rafb.net/p/KXNZ6892.html">http://rafb.net/p/KXNZ6892.html</A=
> and<BR>> <A
href=3D"http://rafb.net/p/jvo5DO38.html">http://rafb.net/p=
/jvo5DO38.html</A><BR>> > > <BR>> > > It's running =
PostgreSQL 8.1 with an effective_cache_size of 30000.<BR>> > > =
<BR>> > > specimens.id is the primary key and there are indexes =
on<BR>> > > sequences(specimen_id) and
specimen_measurements(speci=
men_id)<BR>> > > <BR>> > > Is there a general "rule" to
=
know when to use the in() version and<BR>> > > when<BR>> >
=
> to use the exists() version ? Is it true to say that the
exists()<BR>&=
gt; > > version is more scalable (with many rows) than the in() =
version<BR>> > (from<BR>> > > the little tests I made it =
seems the case) ?<BR>> > > <BR>> > > Thanks,<BR>> =
> > Julien<BR>> > > <BR>> > > -- <BR>> > =
> Julien Cigar<BR>> > > Belgian Biodiversity Platform<BR>>
=
> > <A
href=3D"http://www.biodiversity.be/">http://www.biodiversity.b=
e</A><BR>> > > Universit=C3=A9 Libre de Bruxelles (ULB)<BR>> =
> > Campus de la Plaine CP 257<BR>> > > B=C3=A2timent NO, =
Bureau 4 N4 115C (Niveau 4)<BR>> > > Boulevard du Triomphe, =
entr=C3=A9e ULB 2<BR>> > > B-1050 Bruxelles<BR>> > > =
Mail: jcigar@[EMAIL PROTECTED]
>> > > @[EMAIL PROTECTED]
<A
href=3D"http://biobel.=
biodiversity.be/person/show/471">http://biobel.biodiversity.be/person/show/=
471</A><BR>> > > Tel : 02 650 57 52<BR>> > > <BR>> =
> > <BR>> > > -- <BR>> > > Sent via pgsql-sql =
mailing list (pgsql-sql@[EMAIL PROTECTED]
)<BR>> > > To make changes
=
to your subscription:<BR>> > > <A
href=3D"http://www.postgresql.or=
g/mailpref/pgsql-sql">http://www.postgresql.org/mailpref/pgsql-sql</A><BR>&=
gt; > > <BR>> > -- <BR>> > Julien Cigar<BR>> > =
Belgian Biodiversity Platform<BR>> > <A
href=3D"http://www.biodiversi=
ty.be/">http://www.biodiversity.be</A><BR>>
> Universit=C3=A9 Libre
=
de Bruxelles (ULB)<BR>> > Campus de la Plaine CP 257<BR>> > =
B=C3=A2timent NO, Bureau 4 N4 115C (Niveau 4)<BR>> > Boulevard du =
Triomphe, entr=C3=A9e ULB 2<BR>> > B-1050 Bruxelles<BR>> > =
Mail: jcigar@[EMAIL PROTECTED]
>> > @[EMAIL PROTECTED]
<A
href=3D"http://biobel.biodi=
versity.be/person/show/471">http://biobel.biodiversity.be/person/show/471</=
A><BR>> > Tel : 02 650 57 52<BR>> > <BR>> > <BR>> =
> -- <BR>> > Sent via pgsql-sql mailing list
(pgsql-sql@[EMAIL PROTECTED]
)<BR>> > To make changes to your subscription:<BR>> > <A
=
href=3D"http://www.postgresql.org/mailpref/pgsql-sql">http://www.postgresql=
..org/mailpref/pgsql-sql</A><BR>> > <BR>> -- <BR>> Julien =
Cigar<BR>> Belgian Biodiversity Platform<BR>> <A
href=3D"http://www.b=
iodiversity.be/">http://www.biodiversity.be</A><BR>>
Universit=C3=A9 =
Libre de Bruxelles (ULB)<BR>> Campus de la Plaine CP 257<BR>> =
B=C3=A2timent NO, Bureau 4 N4 115C (Niveau 4)<BR>> Boulevard du =
Triomphe, entr=C3=A9e ULB 2<BR>> B-1050 Bruxelles<BR>> Mail: =
jcigar@[EMAIL PROTECTED]
>> @[EMAIL PROTECTED]
<A
href=3D"http://biobel.biodiversity.be/=
person/show/471">http://biobel.biodiversity.be/person/show/471</A><BR>>
=
Tel : 02 650 57 52<BR>> <BR>> <BR>> -- <BR>> Sent via =
pgsql-sql mailing list (pgsql-sql@[EMAIL PROTECTED]
)<BR>> To make changes
=
to your subscription:<BR>> <A
href=3D"http://www.postgresql.org/mailpref=
/pgsql-sql">http://www.postgresql.org/mailpref/pgsql-sql</A><BR>>
=
<BR>-- <BR>Julien Cigar<BR>Belgian Biodiversity Platform<BR><A
href=3D"http=
://www.biodiversity.be/">http://www.biodiversity.be</A><BR>Universit=C3=A9
=
Libre de Bruxelles (ULB)<BR>Campus de la Plaine CP 257<BR>B=C3=A2timent =
NO, Bureau 4 N4 115C (Niveau 4)<BR>Boulevard du Triomphe, entr=C3=A9e ULB
=
2<BR>B-1050 Bruxelles<BR>Mail: jcigar@[EMAIL PROTECTED]
>@[EMAIL PROTECTED]
<A
href=3D"http=
://biobel.biodiversity.be/person/show/471">http://biobel.biodiversity.be/pe=
rson/show/471</A><BR>Tel : 02 650 57 52<BR><BR></DIV></BODY></HTML>
--=__PartC7E1540E.0__=--


|