> Our ~600,000,000
> row table is changed very infrequently and is on a 12 disk
software=20=20
> raid-6
> for historical reasons using an LSI Logic / Symbios Logic SAS1068 PCI-X
> Fusion-MPT SAS Our ~50,000,000 row staging table is on a 12 disk=20=20
> hardware
> raid-10 using a Dell PowerEdge Expandable RAID controller 5.
> So my disk IO and index question. When I issue a query on the big table
> like this:
> SELECT column, count(*)
> FROM bigtable
> GROUP BY column
> ORDER BY count DESC
> When I run dstat to see my disk IO I see the software raid-6
consistently
> holding over 70M/sec. This is fine with me, but I generally don't like=
=20=20
> to
> do queries that table scan 600,000,000 rows. So I do:
Note that RAID5 or 6 is fine when reading, it's the small random
writes=20=
=20
that kill it.
Is the table being inserted to while you run this query, which will=20=20
generate small random writes for the index updates ?
Or is the table only inserted to during the nightly cron job ?
70 MB/s seems to me quite close to what a single SATA disk could do
these=
=20=20
days.
My software RAID 5 saturates the PCI bus in the machine and pushes
more=20=
=20
than 120 MB/s.
You have PCI-X and 12 disks so you should get huuuuge disk
throughput,=20=
=20
really mindboggling figures, not 70 MB/s.
Since this seems a high-budget system perhaps a good fast hardware RAID ?
Or perhaps this test was performed under heavy load and it is actually a=
=20=20
good result.
> All of the
> rows in the staging table are changed at least once and then deleted and
> recreated in the bigger table. All of the staging table's indexes are
on
> the raid-10. The postgres data directory itself is on the raid-6.
I=20=
=20
> think
> all the disks are SATA 10Ks. The setup is kind of a beast.
>
> SELECT column, count(*)
> FROM bigtable
> WHERE date > '4-24-08'
> GROUP BY column
> ORDER BY count DESC
> When I run dstat I see only around 2M/sec and it is not consistent
at=20=
=20
> all.
>
> So my question is, why do I see such low IO load on the index scan=20=20
> version?
First, it is probably choosing a bitmap index scan, which means it needs=
=20=20
to grab lots of pages from the index. If your index is fragmented,
just=20=
=20
scanning the index could take a long time.
Then, i is probably taking lots of random bites in the table data.
If this is an archive table, the dates should be increasing
sequentially.=
=20=20
If this is not the case you will get random IO which is rather bad on
huge=
=20=20
data sets.
So.
If you need the rows to be grouped on-disk by date (or perhaps
another=20=
=20
field if you more frequently run other types of query, like grouping
by=20=
=20
category, or perhaps something else, you decide) :
The painful thing will be to reorder the table, either
- use CLUSTER
- or recreate a table and INSERT INTO it ORDER BY the field you
chose.=20=
=20
This is going to take a while, set sort_mem to a large value. Then create=
=20=20
the indexes.
Then every time you insert data in the archive, be sure to insert it
in=20=
=20
big batches, ORDER BY the field you chose. That way new inserts will
be=20=
=20
still in the order you want.=09
While you're at it you might think about partitioning the monster on
a=20=
=20
useful criterion (this depends on your querying).
> If I could tweak some setting to make more aggressive use of IO, would
it
> actually make the query faster? The field I'm scanning has a .960858
> correlation, but I haven't vacuumed since im****ting any of the data that
You have ANALYZEd at least ?
Cause if you didn't and an index scan (not bitmap) comes up on this kind=
=20=20
of query and it does a million index hits you have a problem.
> I'm
> scanning, though the correlation should remain very high. When I do a
> similar set of queries on the hardware raid I see similar performance
> except the numbers are both more than doubled.
>
> Here is the explain output for the queries:
> SELECT column, count(*)
> FROM bigtable
> GROUP BY column
> ORDER BY count DESC
> "Sort (cost=3D74404440.58..74404444.53 rows=3D1581 width=3D10)"
> " Sort Key: count(*)"
> " -> HashAggregate (cost=3D74404336.81..74404356.58 rows=3D1581
width=
=3D10)"
> " -> Seq Scan on bigtable (cost=3D0.00..71422407.21
rows=3D596385=
921
> width=3D10)"
Plan is OK (nothing else to do really)
> ---------------
> SELECT column, count(*)
> FROM bigtable
> WHERE date > '4-24-08'
> GROUP BY column
> ORDER BY count DESC
> "Sort (cost=3D16948.80..16948.81 rows=3D1 width=3D10)"
> " Sort Key: count(*)"
> " -> HashAggregate (cost=3D16948.78..16948.79 rows=3D1 width=3D10)"
> " -> Index Scan using date_idx on bigtable
(cost=3D0.00..16652.77
> rows=3D59201 width=3D10)"
> " Index Cond: (date > '2008-04-21
00:00:00'::timestamp=20=20
> without
> time zone)"
Argh.
So you got an index scan after all.
Is the 59201 rows estimate right ? If it is 10 times that you really
have=
=20=20
a problem.
Is it ANALYZEd ?
> So now the asking for advice part. I have two questions:
> What is the fastest way to copy data from the smaller table to the
larger
> table?
INSERT INTO SELECT FROM (add ORDER BY to taste)
> We plan to rearrange the setup when we move to Postgres 8.3.
We'll=20=20
> probably
> move all the storage over to a SAN and slice the larger table into=20=20
> monthly
> or weekly tables. Can someone point me to a good page on partitioning?=
=20=20=20
> My
> gut tells me it should be better, but I'd like to learn more about why.
Because in your case, records having the dates you want will be in
1=20=20
partition (or 2), so you get a kind of automatic CLUSTER. For instance if=
=20=20
you do your query on last week's data, it will seq scan last week's=20=20
partition (which will be a much more manageable size) and not even look
at=
=20=20
the others.
Matthew said :
> You could possibly not bother with a staging table, and replacethe mass=
=20=20
> copy with making a new partition. Not sure of the details myself though.
Yes you could do that.
When a partition ceases to become actively updated, though, you
should=20=
=20
CLUSTER it so it is really tight and fast.
CLUSTER on a partition which has a week's worth of data will obviously
be=
=20=20
much faster than CLUSTERing your monster archive.
--=20
Sent via pgsql-performance mailing list (pgsql-performance@[EMAIL PROTECTED]
)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


|