> Thanks a lot, all of you - this is excellent advice. With the data=20=20
> clustered and statistics at a more reasonable value of 100, it now=20=20
> reproducibly takes even less time - 20-57 ms per query.
1000x speedup with proper tuning - always impressive, lol.
IO seeks are always your worst enemy.
> After reading the section on "Statistics Used By the Planner" in the=20=
=20
> manual, I was a little concerned that, while the statistics sped up the=
=20=20
> queries that I tried immeasurably, that the most_common_vals array
was=20=
=20
> where the speedup was happening, and that the values which wouldn't fit=
=20=20
> in this array wouldn't be sped up. Though I couldn't offhand find an=20=
=20
> example where this occurred, the clustering approach seems
intuitively=20=
=20
> like a much more complete and scalable solution, at least for a=20=20
> read-only table like this.
Actually, with statistics set to 100, then 100 values will be stored
in=20=
=20
most_common_vals. This would mean that the values not in most_common_vals=
=20=20
will have less than 1% frequency, and probably much less than that.
The=20=
=20
choice of plan for these rare values is pretty simple.
With two columns, "interesting" stuff can happen, like if you have
col1=20=
=20
in [1...10] and col2 in [1...10] and use a condition on col1=3Dconst
and=20=
=20
col2=3Dconst, the selectivity of the result depends not only on the=20=20
distribution of col1 and col2 but also their correlation.
As for the tests you did, it's hard to say without seeing the explain=20=
=20
analyze outputs. If you change the stats and the plan choice (EXPLAIN)=20=
=20
stays the same, and you use the same values in your query, any difference=
=20=20
in timing comes from caching, since postgres is executing the same
plan=20=
=20
and therefore doing the exact same thing. Caching (from PG and from
the=20=
=20
OS) can make the timings vary a lot.
> - Trying the same constant a second time gave an instantaneous
result,=20=
=20
> I'm guessing because of query/result caching.
PG does not cache queries or results. It caches data & index pages in
its=
=20=20
shared buffers, and then the OS adds another layer of the usual disk
cache.
A simple query like selecting one row based on PK takes about 60=20=20
microseconds of CPU time, but if it needs one seek for the index and
one=20=
=20
for the data it may take 20 ms waiting for the moving parts to move...=20=
=20
Hence, CLUSTER is a very useful tool.
Bitmap index scans love clustered tables because all the interesting
rows=
=20=20
end up being grouped together, so much less pages need to be visited.
> - I didn't try decreasing the statistics back to 10 before I ran the=20=
=20
> cluster command, so I can't show the search times going up because
of=20=
=20
> that. But I tried killing the 500 meg process. The new process uses
less=
=20=20
> than 5 megs of ram, and still reproducibly returns a result in less
than=
=20=20
> 60 ms. Again, this is with a statistics value of 100 and the data=20=20
> clustered by gene_prediction_view_gene_ref_key.
Killing it or just restarting postgres ?
If you let postgres run (not idle) for a while, naturally it will
fill=20=
=20
the RAM up to the shared_buffers setting that you specified in the=20=20
configuration file. This is good, since grabbing data from postgres'
own=20=
=20
cache is faster than having to make a syscall to the OS to get it from
the=
=20=20
OS disk cache (or disk). This isn't bloat.
But what those 500 MB versus 6 MB show is that before, postgres had
to=20=
=20
read a lot of data for your query, so it stayed in the cache ; after=20=20
tuning it needs to read much less data (thanks to CLUSTER) so the
cache=20=
=20
stays empty.
--=20
Sent via pgsql-performance mailing list (pgsql-performance@[EMAIL PROTECTED]
)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


|