> This message is in MIME format. Since your mail reader does not
understand
this format, some or all of this message may not be legible.
--B_3291575759_709166
Content-type: text/plain;
charset="ISO-8859-1"
Content-transfer-encoding: quoted-printable
Hi Francisco,
Generally, PG sorting is much slower than hash aggregation for performing
the distinct operation. There may be small sizes where this isn=B9t true,
bu=
t
for large amounts of data (in-memory or not), hash agg (used most often,
bu=
t
not always by GROUP BY) is faster.
We=B9ve implemented a special optimization to PG sorting that does the
distinct processing within the sort, instead of afterward, but it=B9s
limited
to some small-ish number (10,000) of distinct values due to it=B9s use of
a
memory and processing intensive heap.
So, you=B9re better off using GROUP BY and making sure that the planner is
using hash agg to do the work.
- Luke=20
On 4/17/08 8:46 PM, "Francisco Reyes" <lists@[EMAIL PROTECTED]
> wrote:
> I am trying to get a distinct set of rows from 2 tables.
> After looking at someone else's query I noticed they were doing a group
b=
y
> to obtain the unique list.
>=20
> After comparing on multiple machines with several tables, it seems using
> group by to obtain a distinct list is substantially faster than using
> select distinct.
>=20
> Is there any dissadvantage of using "group by" to obtain a unique list?
>=20
> On a small dataset the difference was about 20% percent.
>=20
> Group by
> HashAggregate (cost=3D369.61..381.12 rows=3D1151 width=3D8) (actual
> time=3D76.641..85.167 rows=3D2890 loops=3D1)
>=20
> Distinct
> Unique (cost=3D1088.23..1174.53 rows=3D1151 width=3D8) (actual
> time=3D90.516..140.123 rows=3D2890 loops=3D1)
>=20
> Although I don't have the numbers here with me, a simmilar result was
> obtaining against a query that would return 100,000 rows. 20% and more
> speed differnce between "group by" over "select distinct".
>=20
> --
> Sent via pgsql-performance mailing list
(pgsql-performance@[EMAIL PROTECTED]
)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>=20
--B_3291575759_709166
Content-type: text/html;
charset="ISO-8859-1"
Content-transfer-encoding: quoted-printable
<HTML>
<HEAD>
<TITLE>Re: [PERFORM] Group by more efficient than distinct?</TITLE>
</HEAD>
<BODY>
<FONT SIZE=3D"4"><FONT FACE=3D"Verdana, Helvetica, Arial"><SPAN
STYLE=3D'font-siz=
e:14pt'>Hi Francisco,<BR>
<BR>
Generally, PG sorting is much slower than hash aggregation for performing
t=
he distinct operation. There may be small sizes where this
isn’t=
true, but for large amounts of data (in-memory or not), hash agg (used
most=
often, but not always by GROUP BY) is faster.<BR>
<BR>
We’ve implemented a special optimization to PG sorting that does the
=
distinct processing within the sort, instead of afterward, but it’s
li=
mited to some small-ish number (10,000) of distinct values due to
it’s=
use of a memory and processing intensive heap.<BR>
<BR>
So, you’re better off using GROUP BY and making sure that the
planner=
is using hash agg to do the work.<BR>
<BR>
- Luke <BR>
<BR>
<BR>
On 4/17/08 8:46 PM, "Francisco Reyes"
<lists@[EMAIL PROTECTED]
wrote:<BR>
<BR>
</SPAN></FONT></FONT><BLOCKQUOTE><FONT SIZE=3D"4"><FONT FACE=3D"Verdana,
Helvet=
ica, Arial"><SPAN STYLE=3D'font-size:14pt'>I am trying to get a distinct
set o=
f rows from 2 tables.<BR>
After looking at someone else's query I noticed they were doing a group
by<=
BR>
to obtain the unique list.<BR>
<BR>
After comparing on multiple machines with several tables, it seems
using<BR=
>
group by to obtain a distinct list is substantially faster than using<BR>
select distinct.<BR>
<BR>
Is there any dissadvantage of using "group by" to obtain a
unique=
list?<BR>
<BR>
On a small dataset the difference was about 20% percent.<BR>
<BR>
Group by<BR>
HashAggregate (cost=3D369.61..381.12 rows=3D1151 width=3D8)
(actual<B=
R>
time=3D76.641..85.167 rows=3D2890 loops=3D1)<BR>
<BR>
Distinct<BR>
Unique (cost=3D1088.23..1174.53 rows=3D1151 width=3D8)
(actual<BR>
time=3D90.516..140.123 rows=3D2890 loops=3D1)<BR>
<BR>
Although I don't have the numbers here with me, a simmilar result was<BR>
obtaining against a query that would return 100,000 rows. 20% and more<BR>
speed differnce between "group by" over "select
distinct&quo=
t;. <BR>
<BR>
--<BR>
Sent via pgsql-performance mailing list
(pgsql-performance@[EMAIL PROTECTED]
)<=
BR>
To make changes to your subscription:<BR>
<a
href=3D"http://www.postgresql.org/mailpref/pgsql-performance">http://www.p=
ostgresql.org/mailpref/pgsql-performance</a><BR>
<BR>
</SPAN></FONT></FONT></BLOCKQUOTE>
</BODY>
</HTML>
--B_3291575759_709166--


|