一、現象描述 在一次執行gprecoverseg後發現資料庫無法正常連接,現象如下: 執行gprecoverseg日誌如下: 這裡看出gprecoverseg命令是執行失敗了的。 之後嘗試啟動資料庫,執行gpstart: 這時候顯示所有segment都是啟動的,但是庫沒有正常啟動起來 然後執行gps ...
一、現象描述
在一次執行gprecoverseg後發現資料庫無法正常連接,現象如下:
執行gprecoverseg日誌如下:
gprecoverseg:mdw-1:gpadmin-[INFO]:-Starting gprecoverseg with args: -r gprecoverseg:mdw-1:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 4.3.99.00 build dev' gprecoverseg:mdw-1:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.3.23 (Greenplum Database 4.3.99.00 build dev) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4) compiled on Jan 18 2018 15:34:00 (with assert checking)' gprecoverseg:mdw-1:gpadmin-[INFO]:-Checking if segments are ready to connect gprecoverseg:mdw-1:gpadmin-[INFO]:-Obtaining Segment details from master... gprecoverseg:mdw-1:gpadmin-[INFO]:-Unable to connect to database. Retrying 1 gprecoverseg:mdw-1:gpadmin-[INFO]:-Checking if segments are ready to connect gprecoverseg:mdw-1:gpadmin-[INFO]:-Obtaining Segment details from master... gprecoverseg:mdw-1:gpadmin-[INFO]:-Unable to connect to database. Retrying 2 gprecoverseg:mdw-1:gpadmin-[INFO]:-Checking if segments are ready to connect gprecoverseg:mdw-1:gpadmin-[INFO]:-Obtaining Segment details from master... gprecoverseg:mdw-1:gpadmin-[INFO]:-Unable to connect to database. Retrying 3 gprecoverseg:mdw-1:gpadmin-[INFO]:-Checking if segments are ready to connect gprecoverseg:mdw-1:gpadmin-[INFO]:-Obtaining Segment details from master... gprecoverseg:mdw-1:gpadmin-[INFO]:-Unable to connect to database. Retrying 4 gprecoverseg:mdw-1:gpadmin-[INFO]:-Checking if segments are ready to connect gprecoverseg:mdw-1:gpadmin-[INFO]:-Obtaining Segment details from master... gprecoverseg:mdw-1:gpadmin-[INFO]:-Unable to connect to database. Retrying 5 gprecoverseg:mdw-1:gpadmin-[CRITICAL]:-gprecoverseg failed. (Reason='Unable to connect to database and start transaction') exiting...
這裡看出gprecoverseg命令是執行失敗了的。
之後嘗試啟動資料庫,執行gpstart:
gpstart:mdw-1:gpadmin-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait... .. gpstart:mdw-1:gpadmin-[INFO]:-Process results... gpstart:mdw-1:gpadmin-[INFO]:----------------------------------------------------- gpstart:mdw-1:gpadmin-[INFO]:- Successful segment starts = 8 gpstart:mdw-1:gpadmin-[INFO]:- Failed segment starts = 0 gpstart:mdw-1:gpadmin-[INFO]:- Skipped segment starts (segments are marked down in configuration) = 0 gpstart:mdw-1:gpadmin-[INFO]:----------------------------------------------------- gpstart:mdw-1:gpadmin-[INFO]:- gpstart:mdw-1:gpadmin-[INFO]:-Successfully started 8 of 8 segment instances gpstart:mdw-1:gpadmin-[INFO]:----------------------------------------------------- gpstart:mdw-1:gpadmin-[INFO]:-Starting Master instance mdw-1 directory /home/data/master/gpseg-1 gpstart:mdw-1:gpadmin-[INFO]:-Command pg_ctl reports Master mdw-1 instance active gpstart:mdw-1:gpadmin-[WARNING]:-server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. gpstart:mdw-1:gpadmin-[INFO]:-Starting standby master gpstart:mdw-1:gpadmin-[INFO]:-Checking if standby master is running on host: mdw-2 in directory: /home/data/master/gpseg-1 gpstart:mdw-1:gpadmin-[INFO]:-Check status of database with gpstate utility
這時候顯示所有segment都是啟動的,但是庫沒有正常啟動起來
然後執行gpstate,確認狀態:
gpstate:mdw-1:gpadmin-[INFO]:-Starting gpstate with args: gpstate:mdw-1:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 4.3.99.00 build dev' gpstate:mdw-1:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.3.23 (Greenplum Database 4.3.99.00 build dev) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4) compiled on Jan 18 2018 15:34:00 (with assert checking)' gpstate:mdw-1:gpadmin-[INFO]:-Obtaining Segment details from master... gpstate:mdw-1:gpadmin-[INFO]:-Gathering data from segments... . gpstate:mdw-1:gpadmin-[INFO]:-Greenplum instance status summary gpstate:mdw-1:gpadmin-[INFO]:----------------------------------------------------- gpstate:mdw-1:gpadmin-[INFO]:- Master instance = Active gpstate:mdw-1:gpadmin-[INFO]:- Master standby = mdw-2 gpstate:mdw-1:gpadmin-[INFO]:- Standby master state = Standby host passive gpstate:mdw-1:gpadmin-[INFO]:- Total segment instance count from metadata = 8 gpstate:mdw-1:gpadmin-[INFO]:----------------------------------------------------- gpstate:mdw-1:gpadmin-[INFO]:- Primary Segment Status gpstate:mdw-1:gpadmin-[INFO]:----------------------------------------------------- gpstate:mdw-1:gpadmin-[INFO]:- Total primary segments = 4 gpstate:mdw-1:gpadmin-[INFO]:- Total primary segment valid (at master) = 4 gpstate:mdw-1:gpadmin-[INFO]:- Total primary segment failures (at master) = 0 gpstate:mdw-1:gpadmin-[INFO]:- Total number of postmaster.pid files missing = 0 gpstate:mdw-1:gpadmin-[INFO]:- Total number of postmaster.pid files found = 4 gpstate:mdw-1:gpadmin-[INFO]:- Total number of postmaster.pid PIDs missing = 0 gpstate:mdw-1:gpadmin-[INFO]:- Total number of postmaster.pid PIDs found = 4 gpstate:mdw-1:gpadmin-[INFO]:- Total number of /tmp lock files missing = 0 gpstate:mdw-1:gpadmin-[INFO]:- Total number of /tmp lock files found = 4 gpstate:mdw-1:gpadmin-[INFO]:- Total number postmaster processes missing = 0 gpstate:mdw-1:gpadmin-[INFO]:- Total number postmaster processes found = 4 gpstate:mdw-1:gpadmin-[INFO]:----------------------------------------------------- gpstate:mdw-1:gpadmin-[INFO]:- Mirror Segment Status gpstate:mdw-1:gpadmin-[INFO]:----------------------------------------------------- gpstate:mdw-1:gpadmin-[INFO]:- Total mirror segments = 4 gpstate:mdw-1:gpadmin-[INFO]:- Total mirror segment valid (at master) = 4 gpstate:mdw-1:gpadmin-[INFO]:- Total mirror segment failures (at master) = 0 gpstate:mdw-1:gpadmin-[INFO]:- Total number of postmaster.pid files missing = 0 gpstate:mdw-1:gpadmin-[INFO]:- Total number of postmaster.pid files found = 4 gpstate:mdw-1:gpadmin-[INFO]:- Total number of postmaster.pid PIDs missing = 0 gpstate:mdw-1:gpadmin-[INFO]:- Total number of postmaster.pid PIDs found = 4 gpstate:mdw-1:gpadmin-[INFO]:- Total number of /tmp lock files missing = 0 gpstate:mdw-1:gpadmin-[INFO]:- Total number of /tmp lock files found = 4 gpstate:mdw-1:gpadmin-[INFO]:- Total number postmaster processes missing = 0 gpstate:mdw-1:gpadmin-[INFO]:- Total number postmaster processes found = 4 gpstate:mdw-1:gpadmin-[WARNING]:-Total number mirror segments acting as primary segments = 1 <<<<<<<< gpstate:mdw-1:gpadmin-[INFO]:- Total number mirror segments acting as mirror segments = 3 gpstate:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
這裡說明是發生過primary/mirror的角色切換的
嘗試連接資料庫:
psql: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request.
這時候庫已經無法連接上。
二、定位過程
1、查看日誌
首先查看master的日誌:
CST,"gpadmin","postgres",p30743,th1692887104,"[local]",,CST,0,con38,,seg-1,,,,sx1,"LOG","00000","The previous session was reset because its gang was disconnected (session id = 37). The new session id = 38",,,,,,,0,,"cdbgang.c",1638, CST,"gpadmin","postgres",p30743,th1692887104,"[local]",,CST,0,con38,,seg-1,,,,sx1,"LOG","00000","Failed to connect to seg0 xx.xx.xx.xx:50000(xx.xx.xx.xx是segment的ip)",,,,,,,0,,"cdbgang_async.c",174, CST,"gpadmin","postgres",p30743,th1692887104,"[local]",,CST,0,con39,,seg-1,,,,sx1,"LOG","00000","The previous session was reset because its gang was disconnected (session id = 38). The new session id = 39",,,,,,,0,,"cdbgang.c",1638, CST,,,p30743,th0,,,CST,0,con39,,seg-1,,,,,"PANIC","XX000","Unexpected internal error: Master process received signal SIGSEGV",,,,,,,0,,,,"1 0xa0ce4b postgres StandardHandlerForSigillSigsegvSigbus_OnMainThread + 0x1b2 2 0x8d376e postgres CdbProgramErrorHandler + 0xf1 3 0x7fee63a14100 libpthread.so.0 <symbol not found> + 0x63a14100 4 0xaaa718 postgres <symbol not found> + 0xaaa718 5 0xaaa3e4 postgres CdbDispatchCommand + 0x35 6 0xb75719 postgres <symbol not found> + 0xb75719 7 0xb755d4 postgres <symbol not found> + 0xb755d4 8 0xb7524b postgres <symbol not found> + 0xb7524b 9 0xb71c51 postgres <symbol not found> + 0xb71c51 10 0xb720ae postgres initTM + 0xe2 11 0xb78864 postgres cdb_setup + 0x4b 12 0xa1b6e8 postgres InitPostgres + 0x9fa " CST,,,p28893,th1692887104,,,,0,,,seg-1,,,,,"LOG","00000","server process (PID 30743) was terminated by signal 11: Segmentation fault",,,,,,,0,,"postmaster.c",5604, CST,,,p28893,th1692887104,,,,0,,,seg-1,,,,,"LOG","00000","terminating any other active server processes",,,,,,,0,,"postmaster.c",5284, CST,,,p28893,th1692887104,,,,0,,,seg-1,,,,,"LOG","00000","seqserver process (PID 28900) exited with exit code 2",,,,,,,0,,"postmaster.c",5584, CST,,,p28893,th1692887104,,,,0,,,seg-1,,,,,"LOG","00000","ftsprobe process (PID 28901) exited with exit code 2",,,,,,,0,,"postmaster.c",5584, CST,,,p28893,th1692887104,,,,0,,,seg-1,,,,,"LOG","00000","sweeper process (PID 28902) exited with exit code 2",,,,,,,0,,"postmaster.c",5584, CST,,,p28893,th1692887104,,,,0,,,seg-1,,,,,"LOG","00000","BeginResetOfPostmasterAfterChildrenAreShutDown: counter 0",,,,,,,0,,"postmaster.c",2171, CST,,,p28893,th1692887104,,,,0,,,seg-1,,,,,"LOG","00000","gp_session_id high-water mark is 39",,,,,,,0,,"postmaster.c",2197,
這一句說明有個segment是連不上的:
"Failed to connect to seg0 xx.xx.xx.xx:50000(xx.xx.xx.xx是segment的ip)",,,,,,,0,,"cdbgang_async.c",174,
連接到這個segment所在的物理機,查看segment日誌,發現這個時間點segment並沒有相關的日誌信息。
2、查看其它狀態信息
使用gpstate -s 查看各節點的狀態,截取其中有問題的一個segment:
gpstate:mdw-1:gpadmin-[INFO]:- Segment Info gpstate:mdw-1:gpadmin-[INFO]:- Hostname = sdw-2 gpstate:mdw-1:gpadmin-[INFO]:- Address = sdw-2 gpstate:mdw-1:gpadmin-[INFO]:- Datadir = /gpadmin/data/mirror/gpseg0 gpstate:mdw-1:gpadmin-[INFO]:- Port = 50000 gpstate:mdw-1:gpadmin-[INFO]:- Mirroring Info gpstate:mdw-1:gpadmin-[INFO]:- Current role = Primary gpstate:mdw-1:gpadmin-[INFO]:- Preferred role = Mirror gpstate:mdw-1:gpadmin-[INFO]:- Mirror status = Resynchronizing gpstate:mdw-1:gpadmin-[INFO]:- Change Tracking Info gpstate:mdw-1:gpadmin-[INFO]:- Change tracking data size = 32.1 kB gpstate:mdw-1:gpadmin-[INFO]:- Resynchronization Info gpstate:mdw-1:gpadmin-[INFO]:- Resynchronization mode = Incremental gpstate:mdw-1:gpadmin-[INFO]:- Data synchronized = 0 bytes gpstate:mdw-1:gpadmin-[INFO]:- Estimated total data to synchronize = Not Available gpstate:mdw-1:gpadmin-[INFO]:- Estimated resync progress with mirror = Not Available gpstate:mdw-1:gpadmin-[INFO]:- Total resync objects = 0 gpstate:mdw-1:gpadmin-[INFO]:- Objects to resync = 0 gpstate:mdw-1:gpadmin-[INFO]:- Estimated resync end time = Not Available gpstate:mdw-1:gpadmin-[INFO]:- Status gpstate:mdw-1:gpadmin-[INFO]:- PID = 26649 gpstate:mdw-1:gpadmin-[INFO]:- Configuration reports status as = Up gpstate:mdw-1:gpadmin-[INFO]:- Database status = Stopping gpstate:mdw-1:gpadmin-[INFO]:----------------------------------------------------- gpstate:mdw-1:gpadmin-[INFO]:- Segment Info gpstate:mdw-1:gpadmin-[INFO]:- Hostname = sdw-1 gpstate:mdw-1:gpadmin-[INFO]:- Address = sdw-1 gpstate:mdw-1:gpadmin-[INFO]:- Datadir = /gpadmin/data/primary/gpseg0 gpstate:mdw-1:gpadmin-[INFO]:- Port = 40000 gpstate:mdw-1:gpadmin-[INFO]:- Mirroring Info gpstate:mdw-1:gpadmin-[INFO]:- Current role = Mirror gpstate:mdw-1:gpadmin-[INFO]:- Preferred role = Primary gpstate:mdw-1:gpadmin-[INFO]:- Mirror status = Resynchronizing gpstate:mdw-1:gpadmin-[INFO]:- Status gpstate:mdw-1:gpadmin-[INFO]:- PID = 30458 gpstate:mdw-1:gpadmin-[INFO]:- Configuration reports status as = Up gpstate:mdw-1:gpadmin-[WARNING]:- Segment status = Down <<<<<<<<
這裡看到segment是處於Change Tracking狀態的,其中資料庫是Stopping狀態,sdw-1機器上,埠為40000的Mirror處於down的狀態,和配置中的up狀態是衝突的。
由於這個時候庫是不可連接的,所以通過master only模式直接連接到master,查看源資料庫,信息如下:
postgres=# select * from gp_segment_configuration ; dbid | content | role | preferred_role | mode | status | port | hostname | address | replication_port | san_mounts ------+---------+------+----------------+------+--------+-------+---------------+---------------+------------------+------------ (省略正常節點) 2 | 0 | m | p | r | u | 40000 | sdw-1 | sdw-1 | 41000 | 6 | 0 | p | m | r | u | 50000 | sdw-2 | sdw-2 | 51000 |
這裡可以看到源資料庫標記的這個segment處於同步狀態,然而status的狀態都為‘u’,可見,源資料庫標記出現了錯誤!
三、解決方案
通過以上的所有信息,可以分析出如下的一些問題:
- 庫是無法正常啟動的,也無法正常連接
- 無法連接庫則使用gprecoverseg -F 這種全量修複是不可行的
- 元資料庫信息是有問題的,資料庫啟動的時候通過元資料庫認為segment正常,會進行同步,然而有個segment實際是down的,無法反正同步,導致庫無法啟動
- gpcheckcat該命令是用於修複元資料庫錯誤的,會給出報告和修複建議,因為庫無法正常啟動,所以該命令無法正常執行
那麼需要通過修複元數據的方式,讓greenplum資料庫正常啟動,首先我們瞭解下greenplum的primary/mirror的狀態: