Talk About Network

Google


Register and Login
Nick
Password
Register create new account Sign up is FREE and you can post replies, new topics, bookmark posts and more!
Recover lost password


Data Bases > Pgsql Hackers > Re: the un-vacu...
Latest [ Topics | Posts ] Archive Post A New Topic Post a Reply
<< Topic < Post Post 4 of 16 Topic 9559 of 11013
Post > Topic >>

Re: the un-vacuumable table

by andrew.george.hammond@[EMAIL PROTECTED] ("Andrew Hammond") Jun 27, 2008 at 04:31 PM

On Wed, Jun 25, 2008 at 9:57 AM, Andrew Hammond
<andrew.george.hammond@[EMAIL PROTECTED]
> wrote:
>
> On Wed, Jun 25, 2008 at 2:58 AM, Heikki Lin****angas
<heikki@[EMAIL PROTECTED]
> wrote:
>>
>> Andrew Hammond wrote:
>>>
>>> I found this error message in my log files repeatedly:
>>>
>>> Error: failed to re-find parent key in "ledgerdetail_2008_03_idx2" for
>>> deletion target page 64767
>>>
>>> I though "hmm, that index looks broken. I'd better re-create it." So,
I
>>> dropped the index and then tried to create a new one to replace it.
Which
>>> completely locked up the backend that was running the CREATE TABLE. I
ran
>>> truss against the backend in question and it didn't register anything
>>> (except signals 2 and 15 when I tried to cancel the query and kill the
>>> backend respectively). I eventually had to restart the database to get
the
>>> CREATE INDEX process to go away (well, to release that big nasty
lock).
>>
>> What kind of an index is it? Does "SELECT COUNT(*) from <table>" work?
>
> After the restart I did a count(*) and it worked. A little under 13m
rows. So, sequential scans seem to work.
>
>>>
>>> posting here in case there's interest in gathering some forensic data
or a
>>> clever suggetion about how I can recover this situation or even some
ideas
>>> about what's causing it.
>>
>> Anyway, the current plan is to drop the table and reload it from
backup. I'm
>>
>> Yes, please take a filesystem-level backup right away to retain the
evidence.
>
> Well, I've already burned our downtime allowance for this month, but we
do a regular PITR type backup which hopefully will be sufficient to
replicate the problem.
>
>>
>> Could you connect to the hung backend with gdb and get a stacktrace?
>
> The backend is no longer hung (two restarts later). I'll try to
reproduce this problem on my workstation (same binary, same OS, libraries
etc) using the PITR dump.
>
> Andrew

I tried to restore the PITR backup and it failed.

Jun 27 15:54:30 qadb2 postgres[92517]: [1-1] DEBUG:  postmaster:
PostmasterMain: initial environ dump:
Jun 27 15:54:30 qadb2 postgres[92517]: [2-1] DEBUG:
-----------------------------------------
Jun 27 15:54:30 qadb2 postgres[92517]: [3-1] DEBUG:     USER=pgsql
Jun 27 15:54:30 qadb2 postgres[92517]: [4-1] DEBUG:    
MAIL=/var/mail/pgsql
Jun 27 15:54:30 qadb2 postgres[92517]: [5-1] DEBUG:
LD_LIBRARY_PATH=:/usr/local/adecn/lib
Jun 27 15:54:30 qadb2 postgres[92517]: [6-1] DEBUG:    
HOME=/usr/local/pgsql
Jun 27 15:54:30 qadb2 postgres[92517]: [7-1] DEBUG:    
PGLIB=/usr/local/lib
Jun 27 15:54:30 qadb2 postgres[92517]: [8-1] DEBUG:     PS1=[QA2]
\u@[EMAIL PROTECTED]
 27 15:54:30 qadb2 postgres[92517]: [9-1] DEBUG:     BLOCKSIZE=K
Jun 27 15:54:30 qadb2 postgres[92517]: [10-1] DEBUG:    TERM=xterm
Jun 27 15:54:30 qadb2 postgres[92517]: [11-1] DEBUG:
PGSYSCONFDIR=/usr/local/etc/postgresql
Jun 27 15:54:30 qadb2 postgres[92517]: [12-1] DEBUG:
PGLOCALEDIR=/usr/local/share/locale
Jun 27 15:54:30 qadb2 postgres[92517]: [13-1] DEBUG:
Jun 27 15:54:30 qadb2 postgres[92517]: [13-2]
PATH=/usr/local/adecn/bin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/games:/usr/local/sbin:/usr/local/bin:/usr/X11R6/bin
Jun 27 15:54:30 qadb2 postgres[92517]: [13-3] :/usr/local/pgsql/bin
Jun 27 15:54:30 qadb2 postgres[92517]: [14-1] DEBUG:
ADECN_HOME=/usr/local/adecn
Jun 27 15:54:30 qadb2 postgres[92517]: [15-1] DEBUG:    SHELL=/bin/sh
Jun 27 15:54:30 qadb2 postgres[92517]: [16-1] DEBUG:
Jun 27 15:54:30 qadb2 postgres[92517]: [16-2]   CLASSPATH=
(deleted a bunch of lines)
Jun 27 15:54:30 qadb2 postgres[92517]: [17-1] DEBUG:
PYTHONPATH=/usr/local/adecn/python:/usr/local/adecn/lib/python:/usr/local/adecn/api/client/python
Jun 27 15:54:30 qadb2 postgres[92517]: [18-1] DEBUG:   
FTP_PASSIVE_MODE=YES
Jun 27 15:54:30 qadb2 postgres[92517]: [19-1] DEBUG:
PGDATA=/var/db/adecn/adecndb
Jun 27 15:54:30 qadb2 postgres[92517]: [20-1] DEBUG:    LC_COLLATE=C
Jun 27 15:54:30 qadb2 postgres[92517]: [21-1] DEBUG:    LC_CTYPE=C
Jun 27 15:54:30 qadb2 postgres[92517]: [22-1] DEBUG:    LC_MESSAGES=C
Jun 27 15:54:30 qadb2 postgres[92517]: [23-1] DEBUG:    LC_MONETARY=C
Jun 27 15:54:30 qadb2 postgres[92517]: [24-1] DEBUG:    LC_NUMERIC=C
Jun 27 15:54:30 qadb2 postgres[92517]: [25-1] DEBUG:    LC_TIME=C
Jun 27 15:54:30 qadb2 postgres[92517]: [26-1] DEBUG:
-----------------------------------------
Jun 27 15:54:30 qadb2 postgres[92518]: [27-1] DEBUG:  invoking
IpcMemoryCreate(size=92938240)
Jun 27 15:54:30 qadb2 postgres[92518]: [28-1] DEBUG:  max_safe_fds =
983, usable_fds = 1000, already_open = 7
Jun 27 15:54:30 qadb2 postgres[92519]: [29-1] LOG:  database system
was interrupted at 2008-06-25 03:01:02 PDT
Jun 27 15:54:30 qadb2 postgres[92519]: [30-1] LOG:  starting archive
recovery
Jun 27 15:54:30 qadb2 postgres[92519]: [31-1] LOG:  restore_command =
"cp -p /usr/tmp/2008-06-25_wals/%f %p"
Jun 27 15:54:30 qadb2 postgres[92519]: [32-1] DEBUG:  executing
restore command "cp -p /usr/tmp/2008-06-25_wals/00000001.history
pg_xlog/RECOVERYHISTORY"
Jun 27 15:54:30 qadb2 postgres[92519]: [33-1] DEBUG:  could not
restore file "00000001.history" from archive: return code 256
Jun 27 15:54:30 qadb2 postgres[92519]: [34-1] DEBUG:  executing
restore command "cp -p
/usr/tmp/2008-06-25_wals/00000001000001D600000078.0055F0B8.backup
Jun 27 15:54:30 qadb2 postgres[92519]: [34-2]  pg_xlog/RECOVERYHISTORY"
Jun 27 15:54:30 qadb2 postgres[92519]: [35-1] LOG:  restored log file
"00000001000001D600000078.0055F0B8.backup" from archive
Jun 27 15:54:30 qadb2 postgres[92519]: [36-1] DEBUG:  executing
restore command "cp -p
/usr/tmp/2008-06-25_wals/00000001000001D600000078
pg_xlog/RECOVERYXLOG"
Jun 27 15:54:31 qadb2 postgres[92519]: [37-1] LOG:  restored log file
"00000001000001D600000078" from archive
Jun 27 15:54:31 qadb2 postgres[92519]: [38-1] LOG:  checkpoint record
is at 1D6/7855F0B8
Jun 27 15:54:31 qadb2 postgres[92519]: [39-1] LOG:  redo record is at
1D6/7855F0B8; undo record is at 0/0; shutdown FALSE
Jun 27 15:54:31 qadb2 postgres[92519]: [40-1] LOG:  next transaction
ID: 397171279; next OID: 679516596
Jun 27 15:54:31 qadb2 postgres[92519]: [41-1] LOG:  next MultiXactId:
857318; next MultiXactOffset: 1718141
Jun 27 15:54:31 qadb2 postgres[92519]: [42-1] LOG:  automatic recovery
in progress
Jun 27 15:54:31 qadb2 postgres[92519]: [43-1] LOG:  redo starts at
1D6/7855F108



(I thought this line was interesting)
Jun 27 15:54:31 qadb2 postgres[92519]: [44-1] PANIC:  could not open
relation 1663/16386/679439393: No such file or directory



Jun 27 15:54:31 qadb2 postgres[92518]: [29-1] DEBUG:  forked new
backend, pid=92526 socket=8
Jun 27 15:54:31 qadb2 postgres[92526]: [29-1] LOG:  connection
received: host=[local]
Jun 27 15:54:31 qadb2 postgres[92526]: [30-1] FATAL:  the database
system is starting up
Jun 27 15:54:31 qadb2 postgres[92526]: [31-1] DEBUG:  proc_exit(0)
Jun 27 15:54:31 qadb2 postgres[92526]: [32-1] DEBUG:  shmem_exit(0)
Jun 27 15:54:31 qadb2 postgres[92526]: [33-1] DEBUG:  exit(0)
Jun 27 15:54:31 qadb2 postgres[92518]: [30-1] DEBUG:  reaping dead
processes
Jun 27 15:54:31 qadb2 postgres[92518]: [31-1] DEBUG:  server process
(PID 92526) exited with exit code 0
Jun 27 15:54:31 qadb2 postgres[92518]: [32-1] DEBUG:  reaping dead
processes
Jun 27 15:54:31 qadb2 postgres[92518]: [33-1] LOG:  startup process
(PID 92519) was terminated by signal 6
Jun 27 15:54:31 qadb2 postgres[92518]: [34-1] LOG:  aborting startup
due to startup process failure
Jun 27 15:54:31 qadb2 postgres[92518]: [35-1] DEBUG:  proc_exit(1)
Jun 27 15:54:31 qadb2 postgres[92518]: [36-1] DEBUG:  shmem_exit(1)
Jun 27 15:54:31 qadb2 postgres[92518]: [37-1] DEBUG:  exit(1)


I googled to find out what the numbers 1663/16386/679439393 from the
PANIC message mean, but no luck. The last looks like an OID. I also
hunted through our production log files and found the following
interesting error messages.

(On Thursday night)

vacuumdb: vacuuming of database "adecndb" failed: ERROR:  could not
write block 209610 of relation 1663/16386/236356665: No space left on
device

CONTEXT:  writing block 209610 of relation 1663/16386/236356665



(Friday night, and every night until the incident on Tuesday).

vacuumdb: vacuuming of database "adecndb" failed: ERROR:  failed to
re-find parent key in "ledgerdetail_2008_03_idx2" for deletion target
page 64767

Now, the first message is very strange since we have monitoring on the
file system used by the database and it's been hovering at about 18%
space used for the last month. So I can't figure out why we'd get "No
space left on device", assuming the device is actually the disk (which
seems reasonable given the context) and not shared memory.

I also checked our bug system for history and discovered we'd seen
similar problems in the past when connecting to a SAN via a FC switch.
Directly attaching the server to the SAN appeared to solve the
problem. So currently I'm wondering if this might be a hardware / os /
some-other-part-of-the-storage-stack issue manifesting as tra****ng the
database in some way?

Interestingly, since I dropped the ledgerdetail_2008_03_idx2 index,
the nightly vacuum has been completing successfully.

I'd also like advice on whether I need to schedule another outage and
do a pg_dump / initdb / load (well, more accurately, do a slony move
to the backup box and rebuild the old origin). Or perhaps I should
ask, how urgently.

Andrew

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@[EMAIL PROTECTED]
)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
 




 16 Posts in Topic:
the un-vacuumable table
andrew.george.hammond@[EM  2008-06-25 00:40:35 
Re: the un-vacuumable table
heikki@[EMAIL PROTECTED]   2008-06-25 12:58:04 
Re: the un-vacuumable table
andrew.george.hammond@[EM  2008-06-25 09:57:55 
Re: the un-vacuumable table
andrew.george.hammond@[EM  2008-06-27 16:31:05 
Re: the un-vacuumable table
tgl@[EMAIL PROTECTED] (T  2008-06-27 23:14:36 
Re: the un-vacuumable table
andrew.george.hammond@[EM  2008-06-30 19:20:50 
Re: the un-vacuumable table
andrew.george.hammond@[EM  2008-07-03 13:42:30 
Re: the un-vacuumable table
tgl@[EMAIL PROTECTED] (T  2008-07-03 17:35:23 
Re: the un-vacuumable table
andrew.george.hammond@[EM  2008-07-03 15:16:32 
Re: the un-vacuumable table
tgl@[EMAIL PROTECTED] (T  2008-07-03 18:47:51 
Re: the un-vacuumable table
andrew.george.hammond@[EM  2008-07-03 22:57:36 
Re: the un-vacuumable table
stark@[EMAIL PROTECTED]   2008-07-04 09:20:59 
Re: the un-vacuumable table
andrew.george.hammond@[EM  2008-07-07 12:00:12 
Re: the un-vacuumable table
tgl@[EMAIL PROTECTED] (T  2008-07-07 15:33:10 
Re: the un-vacuumable table
andrew.george.hammond@[EM  2008-07-07 14:08:03 
Re: the un-vacuumable table
andrew.george.hammond@[EM  2008-07-07 15:04:47 

Post A Reply:
  Go here to Signup

AddThis Feed Button


About - Advertising - Contact - Frequently Asked Questions - Privacy Policy - Terms of Use - Signup

Contact
tan12V112 Fri Dec 5 8:54:06 CST 2008.