------=_Part_11_9236393.1209045562147
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
I have a question about index us and IO and am seeking advice.
We are running postgres 8.2. We have two big big tables. Our
~600,000,000
row table is changed very infrequently and is on a 12 disk software raid-6
for historical reasons using an LSI Logic / Symbios Logic SAS1068 PCI-X
Fusion-MPT SAS Our ~50,000,000 row staging table is on a 12 disk hardware
raid-10 using a Dell PowerEdge Expandable RAID controller 5. All of the
rows in the staging table are changed at least once and then deleted and
recreated in the bigger table. All of the staging table's indexes are on
the raid-10. The postgres data directory itself is on the raid-6. I
think
all the disks are SATA 10Ks. The setup is kind of a beast.
So my disk IO and index question. When I issue a query on the big table
like this:
SELECT column, count(*)
FROM bigtable
GROUP BY column
ORDER BY count DESC
When I run dstat to see my disk IO I see the software raid-6 consistently
holding over 70M/sec. This is fine with me, but I generally don't like to
do queries that table scan 600,000,000 rows. So I do:
SELECT column, count(*)
FROM bigtable
WHERE date > '4-24-08'
GROUP BY column
ORDER BY count DESC
When I run dstat I see only around 2M/sec and it is not consistent at all.
So my question is, why do I see such low IO load on the index scan
version?
If I could tweak some setting to make more aggressive use of IO, would it
actually make the query faster? The field I'm scanning has a .960858
correlation, but I haven't vacuumed since im****ting any of the data that
I'm
scanning, though the correlation should remain very high. When I do a
similar set of queries on the hardware raid I see similar performance
except the numbers are both more than doubled.
Here is the explain output for the queries:
SELECT column, count(*)
FROM bigtable
GROUP BY column
ORDER BY count DESC
"Sort (cost=74404440.58..74404444.53 rows=1581 width=10)"
" Sort Key: count(*)"
" -> HashAggregate (cost=74404336.81..74404356.58 rows=1581 width=10)"
" -> Seq Scan on bigtable (cost=0.00..71422407.21 rows=596385921
width=10)"
---------------
SELECT column, count(*)
FROM bigtable
WHERE date > '4-24-08'
GROUP BY column
ORDER BY count DESC
"Sort (cost=16948.80..16948.81 rows=1 width=10)"
" Sort Key: count(*)"
" -> HashAggregate (cost=16948.78..16948.79 rows=1 width=10)"
" -> Index Scan using date_idx on bigtable (cost=0.00..16652.77
rows=59201 width=10)"
" Index Cond: (date > '2008-04-21 00:00:00'::timestamp
without
time zone)"
So now the asking for advice part. I have two questions:
What is the fastest way to copy data from the smaller table to the larger
table?
We plan to rearrange the setup when we move to Postgres 8.3. We'll
probably
move all the storage over to a SAN and slice the larger table into monthly
or weekly tables. Can someone point me to a good page on partitioning?
My
gut tells me it should be better, but I'd like to learn more about why.
Does anyone have experience migrating large databases to a SAN? I
understand that it'll give me better fail over capabilities so long as the
SAN itself doesn't go out, but are we going to be sacrificing performance
for this? That doesn't even mention the cost....
Thanks so much for reading through all this,
--Nik
------=_Part_11_9236393.1209045562147
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
I have a question about index us and IO and am seeking advice.<br><br>We
are running postgres 8.2. We have two big big tables. Our
~600,000,000 row table is changed very infrequently and is on a 12 disk
software raid-6 for historical reasons using an LSI Logic / Symbios
Logic SAS1068 PCI-X Fusion-MPT SAS Our ~50,000,000 row staging table
is on a 12 disk hardware raid-10 using a Dell PowerEdge Expandable RAID
controller 5. All of the rows in the staging table are changed at
least once and then deleted and recreated in the bigger table. All
of the staging table's indexes are on the raid-10. The postgres
data directory itself is on the raid-6. I think all the disks are
SATA 10Ks. The setup is kind of a beast.<br>
<br>So my disk IO and index question. When I issue a query on the
big table like this:<br>SELECT column,
count(*)<br>FROM bigtable<br>GROUP BY column<br>ORDER BY
count DESC<br>When I run dstat to see my disk IO I see the software raid-6
consistently holding over 70M/sec. This is fine with me, but I
generally don't like to do queries that table scan 600,000,000
rows. So I do:<br>
SELECT column, count(*)<br>
FROM bigtable<br>WHERE date > '4-24-08'<br>
GROUP BY column<br>
ORDER BY count DESC<br>When I run dstat I see only around 2M/sec and it is
not consistent at all.<br><br>So my question is, why do I see such low IO
load on the index scan version? If I could tweak some setting to
make more aggressive use of IO, would it actually make the query
faster? The field I'm scanning has a .960858 correlation, but I
haven't vacuumed since im****ting any of the data that I'm
scanning, though the correlation should remain very high. When I do
a similar set of queries on the hardware raid I see similar performance
except the numbers are both more than doubled.<br>
<br>Here is the explain output for the
queries:<br>SELECT column,
count(*)<br>FROM bigtable<br>GROUP BY column<br>ORDER BY
count DESC<br>
"Sort (cost=74404440.58..74404444.53 rows=1581
width=10)"<br>" Sort Key: count(*)"<br>"
-> HashAggregate (cost=74404336.81..74404356.58 rows=1581
width=10)"<br>"
-> Seq Scan on bigtable (cost=0.00..71422407.21 rows=596385921
width=10)"<br>
---------------<br>SELECT column, count(*)<br>
FROM bigtable<br>WHERE date > '4-24-08'<br>
GROUP BY column<br>
ORDER BY count DESC<br>"Sort (cost=16948.80..16948.81 rows=1
width=10)"<br>" Sort Key: count(*)"<br>"
-> HashAggregate (cost=16948.78..16948.79 rows=1
width=10)"<br>"
-> Index Scan using date_idx on bigtable (cost=0.00..16652.77
rows=59201 width=10)"<br>
"
Index Cond: (date > '2008-04-21 00:00:00'::timestamp without
time zone)"<br><br>So now the asking for advice part. I have
two questions:<br>What is the fastest way to copy data from the smaller
table to the larger table?<br>
<br>We plan to rearrange the setup when we move to Postgres 8.3.
We'll probably move all the storage over to a SAN and slice the larger
table into monthly or weekly tables. Can someone point me to a good
page on partitioning? My gut tells me it should be better, but
I'd like to learn more about why.<br>
Does anyone have experience migrating large databases to a SAN? I
understand that it'll give me better fail over capabilities so long as
the SAN itself doesn't go out, but are we going to be sacrificing
performance for this? That doesn't even mention the cost....<br>
<br>Thanks so much for reading through all this,<br><br>--Nik<br>
------=_Part_11_9236393.1209045562147--


|