I. Problem Symptoms
20180201:15:06:25:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:----------------------------------------
20180201:15:06:25:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Greenplum Primary Segment Configuration
20180201:15:06:25:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:----------------------------------------
20180201:15:06:25:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-sdw1-1 /home/primary/gpseg0 40000 2 0 41000
20180201:15:06:25:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-sdw1-1 /home/primary/gpseg1 40001 3 1 41001
20180201:15:06:25:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-sdw1-2 /home/primary/gpseg2 40000 4 2 41000
20180201:15:06:25:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-sdw1-2 /home/primary/gpseg3 40001 5 3 41001
20180201:15:06:25:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:---------------------------------------
20180201:15:06:25:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Greenplum Mirror Segment Configuration
20180201:15:06:25:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:---------------------------------------
20180201:15:06:25:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-sdw1-2 /home/mirror/gpseg0 50000 6 0 51000
20180201:15:06:25:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-sdw1-2 /home/mirror/gpseg1 50001 7 1 51001
20180201:15:06:25:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-sdw1-1 /home/mirror/gpseg2 50000 8 2 51000
20180201:15:06:25:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-sdw1-1 /home/mirror/gpseg3 50001 9 3 51001
Continue with Greenplum creation Yy/Nn>
y
20180201:15:06:28:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Building the Master instance database, please wait...
20180201:15:06:38:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Starting the Master in admin mode
20180201:15:06:46:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Commencing parallel build of primary segment instances
20180201:15:06:46:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Spawning parallel processes batch [1], please wait...
....
20180201:15:06:46:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Waiting for parallel processes batch [1], please wait...
........................
20180201:15:07:10:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:------------------------------------------------
20180201:15:07:10:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Parallel process exit status
20180201:15:07:10:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:------------------------------------------------
20180201:15:07:10:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Total processes marked as completed = 4
20180201:15:07:10:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Total processes marked as killed = 0
20180201:15:07:10:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Total processes marked as failed = 0
20180201:15:07:10:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:------------------------------------------------
20180201:15:07:10:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Commencing parallel build of mirror segment instances
20180201:15:07:10:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Spawning parallel processes batch [1], please wait...
....
20180201:15:07:11:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Waiting for parallel processes batch [1], please wait...
....
20180201:15:07:15:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:------------------------------------------------
20180201:15:07:15:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Parallel process exit status
20180201:15:07:15:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:------------------------------------------------
20180201:15:07:15:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Total processes marked as completed = 0
20180201:15:07:15:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Total processes marked as killed = 0
20180201:15:07:15:028653 gpinitsystem:sdw1-2:gpadmin-[WARN]:-Total processes marked as failed = 4 <<<<<
20180201:15:07:15:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:------------------------------------------------
20180201:15:07:15:028653 gpinitsystem:sdw1-2:gpadmin-[FATAL]:-Errors generated from parallel processes
20180201:15:07:15:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Dumped contents of status file to the log file
20180201:15:07:15:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Building composite backout file
20180201:15:07:15:gpinitsystem:sdw1-2:gpadmin-[FATAL]:-Failures detected, see log file /home/gpadmin/gpAdminLogs/gpinitsystem_20180201.log for more detail Script Exiting!
20180201:15:07:15:028653 gpinitsystem:sdw1-2:gpadmin-[WARN]:-Script has left Greenplum Database in an incomplete state
20180201:15:07:15:028653 gpinitsystem:sdw1-2:gpadmin-[WARN]:-Run command /bin/bash /home/gpadmin/gpAdminLogs/backout_gpinitsystem_gpadmin_20180201_150615 to remove these changes
20180201:15:07:15:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-Start Function BACKOUT_COMMAND
20180201:15:07:15:028653 gpinitsystem:sdw1-2:gpadmin-[INFO]:-End Function BACKOUT_COMMAND
During installation, every segment failed to start.
Checking the files and logs on all of the compute nodes turned up no obvious errors.
II. Troubleshooting Process
1. Check the master log
Following the hint in the output, inspect /home/gpadmin/gpAdminLogs/gpinitsystem_20180201.log:
20180201:15:07:12:015183 gpcreateseg.sh:sdw1-2:gpadmin-[INFO]:-End Function BACKOUT_COMMAND
20180201:15:07:12:015183 gpcreateseg.sh:sdw1-2:gpadmin-[INFO][3]:-Completed to start segment instance database sdw1-1 /home/mirror/gpseg3
20180201:15:07:12:015183 gpcreateseg.sh:sdw1-2:gpadmin-[INFO]:-Copying data for mirror on sdw1-1 using remote copy from primary sdw1-2 ...
20180201:15:07:12:015183 gpcreateseg.sh:sdw1-2:gpadmin-[INFO]:-Start Function RUN_COMMAND_REMOTE
20180201:15:07:12:015183 gpcreateseg.sh:sdw1-2:gpadmin-[INFO]:-Commencing remote /bin/ssh sdw1-2 export GPHOME=/usr/local/gpdb; . /usr/local/gpdb/greenplum_path.sh; /usr/local/gpdb/bin/lib/pysync.py -x pg_log -x postgresql.conf -x postmaster.pid /home/primary/gpseg3 \[sdw1-1\]:/home/mirror/gpseg3
Killed by signal 1.^M
Killed by signal 1.^M
Killed by signal 1.^M
Traceback (most recent call last):
File "/usr/local/gpdb/bin/lib/pysync.py", line 669, in <module>
sys.exit(LocalPysync(sys.argv, progressTimestamp=True).run())
File "/usr/local/gpdb/bin/lib/pysync.py", line 647, in run
code = self.work()
File "/usr/local/gpdb/bin/lib/pysync.py", line 611, in work
self.socket.connect(self.connectAddress)
File "/usr/lib64/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
socket.error: [Errno 113] No route to host
20180201:15:07:15:014991 gpcreateseg.sh:sdw1-2:gpadmin-[FATAL]:- Command export GPHOME=/usr/local/gpdb; . /usr/local/gpdb/greenplum_path.sh; /usr/local/gpdb/bin/lib/pysync.py -x pg_log -x postgresql.conf -x postmaster.pid /home/primary/gpseg2 \[sdw1-1\]:/home/mirror/gpseg2 on sdw1-2 failed with error status 1
20180201:15:07:15:014991 gpcreateseg.sh:sdw1-2:gpadmin-[INFO]:-End Function RUN_COMMAND_REMOTE
20180201:15:07:15:014991 gpcreateseg.sh:sdw1-2:gpadmin-[FATAL][2]:-Failed remote copy of segment data directory from sdw1-2 to sdw1-1
Killed by signal 1.^M
Traceback (most recent call last):
File "/usr/local/gpdb/bin/lib/pysync.py", line 669, in <module>
sys.exit(LocalPysync(sys.argv, progressTimestamp=True).run())
File "/usr/local/gpdb/bin/lib/pysync.py", line 647, in run
code = self.work()
File "/usr/local/gpdb/bin/lib/pysync.py", line 611, in work
self.socket.connect(self.connectAddress)
File "/usr/lib64/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
socket.error: [Errno 113] No route to host
Traceback (most recent call last):
File "/usr/local/gpdb/bin/lib/pysync.py", line 669, in <module>
sys.exit(LocalPysync(sys.argv, progressTimestamp=True).run())
File "/usr/local/gpdb/bin/lib/pysync.py", line 647, in run
code = self.work()
File "/usr/local/gpdb/bin/lib/pysync.py", line 611, in work
self.socket.connect(self.connectAddress)
File "/usr/lib64/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
socket.error: [Errno 113] No route to host
The key piece of information here is: No route to host
This suggests that some machines in the cluster cannot reach one another, so the following checks were made (sketched as commands below):
- whether all of the physical machines can reach each other
- whether the hosts file on every machine is configured correctly
- whether passwordless SSH is set up between the hosts
After all of these checks, everything turned out to be normal.
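A minimal sketch of those three checks, assuming the hostnames sdw1-1 and sdw1-2 from the log (run from each host against every other host):
ping -c 3 sdw1-1                     # basic network reachability
grep -E 'sdw1-1|sdw1-2' /etc/hosts   # hostname-to-IP mappings are present and consistent
ssh sdw1-1 hostname                  # should print the remote hostname without asking for a password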
2. Check the segment directories
Commencing remote /bin/ssh sdw1-2 export GPHOME=/usr/local/gpdb; . /usr/local/gpdb/greenplum_path.sh; /usr/local/gpdb/bin/lib/pysync.py -x pg_log -x postgresql.conf -x postmaster.pid /home/primary/gpseg3 \[sdw1-1\]:/home/mirror/gpseg3
Based on this error message, locate the primary and mirror directories on sdw1-1 and sdw1-2 and check the following (example commands below):
- whether the directories were created correctly
- whether the files in them are complete
- whether the permissions are correct (the directories should be owned by the database administrator account)
All of these checks also came back normal.
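A quick way to run those checks, assuming the paths from the log and gpadmin as the administrator account:
ls -ld /home/primary/gpseg* /home/mirror/gpseg*   # directories exist and are owned by gpadmin
ls /home/primary/gpseg0                           # data files were actually written
# had the ownership been wrong, it could have been fixed with:
# chown -R gpadmin:gpadmin /home/primary /home/mirror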
3. Check the segment log files
After going through all of the relevant segment log files, the segment startups looked normal; the only noteworthy entries were the following:
2018-02-01 15:07:07.854785 CST,,,p9642,th708450368,,,,0,,,seg-1,,,,,"LOG","00000","database system is ready to accept connections","PostgreSQL 8.3.23 (Greenplum Database 4.3.99.00 build dev) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4) compiled on Jan 18 2018 15:33:53 (with assert checking)",,,,,,0,,"postmaster.c",4337,
2018-02-01 15:07:08.853415 CST,,,p9642,th708450368,,,,0,,,seg-1,,,,,"LOG","00000","received smart shutdown request",,,,,,,0,,"postmaster.c",4075,
2018-02-01 15:07:08.855196 CST,,,p9664,th708450368,,,,0,,,seg-1,,,,,"LOG","00000","shutting down",,,,,,,0,,"xlog.c",8616,
2018-02-01 15:07:08.863175 CST,,,p9664,th708450368,,,,0,,,seg-1,,,,,"LOG","00000","database system is shut down",,,,,,,0,,"xlog.c",8632,
These entries show that each segment instance came up and was then shut down cleanly before the cluster build completed, so the problem was not on the segments themselves.
At this point there was little useful information left to work with. Going back through the official documentation, Chapter 3 of https://www.emc.com/collateral/TechnicalDocument/docu51071.pdf mentions that the firewall must be disabled, so the firewall status was checked with the following command:
systemctl status firewalld.service
The firewall turned out to be running.
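With firewalld active, the pysync failure can be reproduced directly. The test below is a hypothetical illustration using port 50000 from the mirror configuration above; firewalld's default reject rule (icmp-host-prohibited) typically surfaces as the same Errno 113 that pysync reported:
python -c "import socket; socket.create_connection(('sdw1-1', 50000), 5)"
# socket.error: [Errno 113] No route to host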
III. Solution
1. Roll back the installation
The installation log provides the following command to roll back the changes:
/bin/bash /home/gpadmin/gpAdminLogs/backout_gpinitsystem_gpadmin_20180201_150615
2. Configure the firewall
systemctl stop firewalld.service
systemctl disable firewalld.service   # keep the firewall from starting again at boot
Note: the commands above apply to CentOS 7.
3. Re-run the installation steps
Open question: why the firewall must be disabled has not been investigated in depth, and I have not yet managed to deploy a cluster with the firewall left enabled.
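For reference, a hedged sketch of what keeping firewalld enabled might look like on CentOS 7, opening the fixed ports from the segment configuration above. This is untested in this deployment, and the initial pysync copy appears to use dynamically assigned ports as well, which may be why opening only the fixed ports is not sufficient:
firewall-cmd --permanent --add-port=40000-40001/tcp   # primary segment ports
firewall-cmd --permanent --add-port=41000-41001/tcp   # primary replication ports
firewall-cmd --permanent --add-port=50000-50001/tcp   # mirror segment ports
firewall-cmd --permanent --add-port=51000-51001/tcp   # mirror replication ports
firewall-cmd --reload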