On Thu, Apr 24, 2008 at 12:56 PM, PFC <lists@[EMAIL PROTECTED]> wrote:
>
>> Our ~600,000,000 row table is changed very infrequently and is on a 12 disk
>> software raid-6 for historical reasons using an LSI Logic / Symbios Logic
>> SAS1068 PCI-X Fusion-MPT SAS. Our ~50,000,000 row staging table is on a 12
>> disk hardware raid-10 using a Dell PowerEdge Expandable RAID controller 5.
>>
>
>> So my disk IO and index question. When I issue a query on the big table
>> like this:
>>
>> SELECT column, count(*)
>> FROM bigtable
>> GROUP BY column
>> ORDER BY count DESC
>>
>> When I run dstat to see my disk IO I see the software raid-6 consistently
>> holding over 70M/sec. This is fine with me, but I generally don't like to
>> do queries that table scan 600,000,000 rows. So I do:
>>
>
> Note that RAID5 or 6 is fine when reading; it's the small random writes
> that kill it.
> Is the table being inserted into while you run this query, which will
> generate small random writes for the index updates?
> Or is the table only inserted into during the nightly cron job?
>
> 70 MB/s seems to me quite close to what a single SATA disk could do
> these days.
> My software RAID 5 saturates the PCI bus in the machine and pushes
> more than 120 MB/s.
> You have PCI-X and 12 disks so you should get huuuuge disk
> throughput, really mindboggling figures, not 70 MB/s.
> Since this seems a high-budget system, perhaps a good fast hardware
> RAID?
> Or perhaps this test was performed under heavy load and it is
> actually a good result.
>
>
>> All of the rows in the staging table are changed at least once and then
>> deleted and recreated in the bigger table. All of the staging table's
>> indexes are on the raid-10. The postgres data directory itself is on the
>> raid-6. I think all the disks are SATA 10Ks. The setup is kind of a beast.
>>
>> SELECT column, count(*)
>> FROM bigtable
>> WHERE date > '4-24-08'
>> GROUP BY column
>> ORDER BY count DESC
>>
>> When I run dstat I see only around 2M/sec and it is not consistent at all.
>>
>> So my question is, why do I see such low IO load on the index scan
>> version?
>>
>
> First, it is probably choosing a bitmap index scan, which means it
> needs to grab lots of pages from the index. If your index is fragmented,
> just scanning the index could take a long time.
> Then, it is probably taking lots of random bites in the table data.
> If this is an archive table, the dates should be increasing
> sequentially. If this is not the case you will get random IO, which is
> rather bad on huge data sets.
>
> So.
>
> If you need the rows to be grouped on-disk by date (or perhaps
> another field if you more frequently run other types of query, like
> grouping by category, or perhaps something else, you decide):
>
> The painful thing will be to reorder the table, either
> - use CLUSTER
> - or recreate a table and INSERT INTO it ORDER BY the field you
> chose. This is going to take a while; set work_mem (called sort_mem
> before 8.0) to a large value. Then create the indexes.
>
> Then every time you insert data in the archive, be sure to insert it
> in big batches, ORDER BY the field you chose. That way new inserts will
> still be in the order you want.
>
> While you're at it you might think about partitioning the monster on
> a useful criterion (this depends on your querying).
>
>> If I could tweak some setting to make more aggressive use of IO, would it
>> actually make the query faster? The field I'm scanning has a .960858
>> correlation, but I haven't vacuumed since importing any of the data that
>>
>
> Have you ANALYZEd at least?
> Because if you didn't, and an index scan (not bitmap) comes up on this
> kind of query and it does a million index hits, you have a problem.
>
>> I'm
>> scanning, though the correlation should remain very high. When I do a
>> similar set of queries on the hardware raid I see similar performance
>> except the numbers are both more than doubled.
>>
>> Here is the explain output for the queries:
>>
>> SELECT column, count(*)
>> FROM bigtable
>> GROUP BY column
>> ORDER BY count DESC
>>
>> "Sort (cost=74404440.58..74404444.53 rows=1581 width=10)"
>> "  Sort Key: count(*)"
>> "  -> HashAggregate (cost=74404336.81..74404356.58 rows=1581 width=10)"
>> "        -> Seq Scan on bigtable (cost=0.00..71422407.21 rows=596385921 width=10)"
>>
>
> Plan is OK (nothing else to do really)
>
>> ---------------
>> SELECT column, count(*)
>> FROM bigtable
>> WHERE date > '4-24-08'
>> GROUP BY column
>> ORDER BY count DESC
>>
>> "Sort (cost=16948.80..16948.81 rows=1 width=10)"
>> "  Sort Key: count(*)"
>> "  -> HashAggregate (cost=16948.78..16948.79 rows=1 width=10)"
>> "        -> Index Scan using date_idx on bigtable (cost=0.00..16652.77 rows=59201 width=10)"
>> "              Index Cond: (date > '2008-04-21 00:00:00'::timestamp without time zone)"
>>
>
> Argh.
> So you got an index scan after all.
> Is the 59201 rows estimate right? If the real count is 10 times that,
> you really have a problem.
> Is it ANALYZEd?
>
>> So now the asking for advice part. I have two questions:
>> What is the fastest way to copy data from the smaller table to the larger
>> table?
>>
>
> INSERT INTO ... SELECT ... FROM ... (add ORDER BY to taste)
>
>> We plan to rearrange the setup when we move to Postgres 8.3. We'll
>> probably move all the storage over to a SAN and slice the larger table
>> into monthly or weekly tables. Can someone point me to a good page on
>> partitioning? My gut tells me it should be better, but I'd like to learn
>> more about why.
>>
>
> Because in your case, records having the dates you want will be in 1
> partition (or 2), so you get a kind of automatic CLUSTER. For instance, if
> you do your query on last week's data, it will seq scan last week's
> partition (which will be a much more manageable size) and not even look at
> the others.
>
> Matthew said:
>
>> You could possibly not bother with a staging table, and replace the mass
>> copy with making a new partition. Not sure of the details myself though.
>>
>
> Yes, you could do that.
> When a partition ceases to be actively updated, though, you
> should CLUSTER it so it is really tight and fast.
> CLUSTER on a partition which has a week's worth of data will
> obviously be much faster than CLUSTERing your monster archive.
>

Both Matthew and PFC, thanks for the response.

It turns out that the DB really loves to do index scans when I check new
data because I haven't had a chance to analyze it yet. It should be doing a
bitmap index scan and a bitmap heap scan. I think. Doing a quick "set
enable_indexscan = false" and doing a different date range really helped
things. Here is my understanding of the situation:

An index scan looks through the index and pulls in each page as it sees it.
A bitmap index scan looks through the index and makes a sorted list of all
the pages it needs, and then the bitmap heap scan reads all the pages.
If your data is scattered then you may as well do the index scan, but if
your data is sequential-ish then you should do the bitmap index scan.

Is that right? Where can I learn more? I've read
http://www.postgresql.org/docs/8.2/interactive/using-explain.html
but it didn't really dive deeply enough. I'd like a list of all the options
the query planner has and what they mean.

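For anyone wanting to reproduce the comparison above, something like the following could be run in a single session. This is only a sketch: some_column and date_col are placeholder names (the real column names are not given in this thread) and the date literal is arbitrary.

```sql
-- Placeholder names: some_column and date_col stand in for the real columns.

-- Refresh planner statistics so row estimates reflect the newly loaded data.
ANALYZE bigtable;

-- Plan chosen with the default planner settings.
EXPLAIN
SELECT some_column, count(*)
FROM bigtable
WHERE date_col > '2008-04-21'
GROUP BY some_column
ORDER BY count(*) DESC;

-- Discourage plain index scans for this session only, then compare the plan.
SET enable_indexscan = off;

EXPLAIN
SELECT some_column, count(*)
FROM bigtable
WHERE date_col > '2008-04-21'
GROUP BY some_column
ORDER BY count(*) DESC;

-- Restore the default so later queries in this session are unaffected.
RESET enable_indexscan;
```

Using EXPLAIN ANALYZE instead of EXPLAIN would also run the query and report actual row counts and timings, which makes a bad row estimate easy to spot.
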
About clustering: I know that CLUSTER takes an exclusive lock on the
table. At present, users can query the table at any time, so I'm not
allowed to take an exclusive lock for more than a few seconds. Could I
achieve the same thing by creating a second copy of the table and then
swapping the first copy out for the second? I think something like that
would fit in my time frames.

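One way the copy-and-swap idea might look is sketched below. The names bigtable_sorted, bigtable_old and date_col are made up for illustration, and the sketch assumes nothing writes to bigtable while the copy is being built; rows added in the meantime would have to be copied over separately before the swap.

```sql
-- Hypothetical names: bigtable_sorted, bigtable_old, date_col.

-- Build a physically ordered copy; readers can keep using bigtable meanwhile.
CREATE TABLE bigtable_sorted AS
SELECT * FROM bigtable ORDER BY date_col;

-- CREATE TABLE AS copies only the data, so indexes are rebuilt on the copy.
CREATE INDEX bigtable_sorted_date_idx ON bigtable_sorted (date_col);
ANALYZE bigtable_sorted;

-- The swap itself: both renames become visible together at COMMIT.
-- ALTER TABLE needs an exclusive lock, but only while the renames run.
BEGIN;
ALTER TABLE bigtable RENAME TO bigtable_old;
ALTER TABLE bigtable_sorted RENAME TO bigtable;
COMMIT;
```

The renames still have to wait for queries already running against bigtable to finish. Defaults, constraints and permissions are not carried over by CREATE TABLE AS, and views or foreign keys that referred to the old table would need to be recreated as well.
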
About partitioning: I can definitely see how having the data in more
manageable chunks would allow me to do things like clustering. It will
definitely make vacuuming easier.

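For reference, partitioning in the 8.2/8.3 releases is built from table inheritance plus CHECK constraints. A minimal sketch, assuming a timestamp column called date_col and a monthly split (both are assumptions, as are the child table and index names):

```sql
-- Hypothetical child tables; date_col is a placeholder for the timestamp column.
CREATE TABLE bigtable_2008_04 (
    CHECK (date_col >= TIMESTAMP '2008-04-01' AND date_col < TIMESTAMP '2008-05-01')
) INHERITS (bigtable);

CREATE TABLE bigtable_2008_05 (
    CHECK (date_col >= TIMESTAMP '2008-05-01' AND date_col < TIMESTAMP '2008-06-01')
) INHERITS (bigtable);

-- Indexes are not inherited, so each child needs its own.
CREATE INDEX bigtable_2008_04_date_idx ON bigtable_2008_04 (date_col);
CREATE INDEX bigtable_2008_05_date_idx ON bigtable_2008_05 (date_col);

-- Let the planner skip children whose CHECK constraint rules them out.
SET constraint_exclusion = on;

-- The April child should no longer appear in this plan.
EXPLAIN
SELECT some_column, count(*)
FROM bigtable
WHERE date_col >= TIMESTAMP '2008-05-01'
GROUP BY some_column;
```

New rows have to be inserted into the matching child table (directly, or through a rule or trigger on the parent). Once a child stops receiving data it can be CLUSTERed on its own, which only locks that one partition.
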
About IO speeds: The db is always under some kind of load. I actually get
scared if the load average isn't at least 2. Could I try to run something
like bonnie++ to get some real load numbers? I'm sure that would cripple
the system while it is running, but if it only takes a few seconds that
would be ok.

There were updates running while I was running the test. The WAL log is on
the hardware raid 10. Moving it from the software raid 5 almost doubled our
insert performance.

Thanks again,

--Nik