gprecoverseg導致的元資料庫問題致使資料庫無法啟動以及修複

来源:https://www.cnblogs.com/qiannianyuan/archive/2018/03/12/greenplum_change_tracking.html
-Advertisement-
Play Games

一、現象描述 在一次執行gprecoverseg後發現資料庫無法正常連接,現象如下: 執行gprecoverseg日誌如下: 這裡看出gprecoverseg命令是執行失敗了的。 之後嘗試啟動資料庫,執行gpstart: 這時候顯示所有segment都是啟動的,但是庫沒有正常啟動起來 然後執行gps ...


一、現象描述

在一次執行gprecoverseg後發現資料庫無法正常連接,現象如下:

執行gprecoverseg日誌如下:

 

gprecoverseg:mdw-1:gpadmin-[INFO]:-Starting gprecoverseg with args: -r
gprecoverseg:mdw-1:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 4.3.99.00 build dev'
gprecoverseg:mdw-1:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.3.23 (Greenplum Database 4.3.99.00 build dev) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4) compiled on Jan 18 2018 15:34:00 (with assert checking)'
gprecoverseg:mdw-1:gpadmin-[INFO]:-Checking if segments are ready to connect
gprecoverseg:mdw-1:gpadmin-[INFO]:-Obtaining Segment details from master...
gprecoverseg:mdw-1:gpadmin-[INFO]:-Unable to connect to database. Retrying 1
gprecoverseg:mdw-1:gpadmin-[INFO]:-Checking if segments are ready to connect
gprecoverseg:mdw-1:gpadmin-[INFO]:-Obtaining Segment details from master...
gprecoverseg:mdw-1:gpadmin-[INFO]:-Unable to connect to database. Retrying 2
gprecoverseg:mdw-1:gpadmin-[INFO]:-Checking if segments are ready to connect
gprecoverseg:mdw-1:gpadmin-[INFO]:-Obtaining Segment details from master...
gprecoverseg:mdw-1:gpadmin-[INFO]:-Unable to connect to database. Retrying 3
gprecoverseg:mdw-1:gpadmin-[INFO]:-Checking if segments are ready to connect
gprecoverseg:mdw-1:gpadmin-[INFO]:-Obtaining Segment details from master...
gprecoverseg:mdw-1:gpadmin-[INFO]:-Unable to connect to database. Retrying 4
gprecoverseg:mdw-1:gpadmin-[INFO]:-Checking if segments are ready to connect
gprecoverseg:mdw-1:gpadmin-[INFO]:-Obtaining Segment details from master...
gprecoverseg:mdw-1:gpadmin-[INFO]:-Unable to connect to database. Retrying 5
gprecoverseg:mdw-1:gpadmin-[CRITICAL]:-gprecoverseg failed. (Reason='Unable to connect to database and start transaction') exiting...

 

 這裡看出gprecoverseg命令是執行失敗了的。

 

之後嘗試啟動資料庫,執行gpstart:

gpstart:mdw-1:gpadmin-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait...
..
gpstart:mdw-1:gpadmin-[INFO]:-Process results...
gpstart:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
gpstart:mdw-1:gpadmin-[INFO]:-   Successful segment starts                                            = 8
gpstart:mdw-1:gpadmin-[INFO]:-   Failed segment starts                                                = 0
gpstart:mdw-1:gpadmin-[INFO]:-   Skipped segment starts (segments are marked down in configuration)   = 0
gpstart:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
gpstart:mdw-1:gpadmin-[INFO]:-
gpstart:mdw-1:gpadmin-[INFO]:-Successfully started 8 of 8 segment instances
gpstart:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
gpstart:mdw-1:gpadmin-[INFO]:-Starting Master instance mdw-1 directory /home/data/master/gpseg-1
gpstart:mdw-1:gpadmin-[INFO]:-Command pg_ctl reports Master mdw-1 instance active
gpstart:mdw-1:gpadmin-[WARNING]:-server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

gpstart:mdw-1:gpadmin-[INFO]:-Starting standby master
gpstart:mdw-1:gpadmin-[INFO]:-Checking if standby master is running on host: mdw-2  in directory: /home/data/master/gpseg-1
gpstart:mdw-1:gpadmin-[INFO]:-Check status of database with gpstate utility

 

這時候顯示所有segment都是啟動的,但是庫沒有正常啟動起來

 

然後執行gpstate,確認狀態:

gpstate:mdw-1:gpadmin-[INFO]:-Starting gpstate with args:
gpstate:mdw-1:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 4.3.99.00 build dev'
gpstate:mdw-1:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.3.23 (Greenplum Database 4.3.99.00 build dev) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4) compiled on Jan 18 2018 15:34:00 (with assert checking)'
gpstate:mdw-1:gpadmin-[INFO]:-Obtaining Segment details from master...
gpstate:mdw-1:gpadmin-[INFO]:-Gathering data from segments...
.
gpstate:mdw-1:gpadmin-[INFO]:-Greenplum instance status summary
gpstate:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
gpstate:mdw-1:gpadmin-[INFO]:-   Master instance                                           = Active
gpstate:mdw-1:gpadmin-[INFO]:-   Master standby                                            = mdw-2
gpstate:mdw-1:gpadmin-[INFO]:-   Standby master state                                      = Standby host passive
gpstate:mdw-1:gpadmin-[INFO]:-   Total segment instance count from metadata                = 8
gpstate:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
gpstate:mdw-1:gpadmin-[INFO]:-   Primary Segment Status
gpstate:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
gpstate:mdw-1:gpadmin-[INFO]:-   Total primary segments                                    = 4
gpstate:mdw-1:gpadmin-[INFO]:-   Total primary segment valid (at master)                   = 4
gpstate:mdw-1:gpadmin-[INFO]:-   Total primary segment failures (at master)                = 0
gpstate:mdw-1:gpadmin-[INFO]:-   Total number of postmaster.pid files missing              = 0
gpstate:mdw-1:gpadmin-[INFO]:-   Total number of postmaster.pid files found                = 4
gpstate:mdw-1:gpadmin-[INFO]:-   Total number of postmaster.pid PIDs missing               = 0
gpstate:mdw-1:gpadmin-[INFO]:-   Total number of postmaster.pid PIDs found                 = 4
gpstate:mdw-1:gpadmin-[INFO]:-   Total number of /tmp lock files missing                   = 0
gpstate:mdw-1:gpadmin-[INFO]:-   Total number of /tmp lock files found                     = 4
gpstate:mdw-1:gpadmin-[INFO]:-   Total number postmaster processes missing                 = 0
gpstate:mdw-1:gpadmin-[INFO]:-   Total number postmaster processes found                   = 4
gpstate:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
gpstate:mdw-1:gpadmin-[INFO]:-   Mirror Segment Status
gpstate:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
gpstate:mdw-1:gpadmin-[INFO]:-   Total mirror segments                                     = 4
gpstate:mdw-1:gpadmin-[INFO]:-   Total mirror segment valid (at master)                    = 4
gpstate:mdw-1:gpadmin-[INFO]:-   Total mirror segment failures (at master)                 = 0
gpstate:mdw-1:gpadmin-[INFO]:-   Total number of postmaster.pid files missing              = 0
gpstate:mdw-1:gpadmin-[INFO]:-   Total number of postmaster.pid files found                = 4
gpstate:mdw-1:gpadmin-[INFO]:-   Total number of postmaster.pid PIDs missing               = 0
gpstate:mdw-1:gpadmin-[INFO]:-   Total number of postmaster.pid PIDs found                 = 4
gpstate:mdw-1:gpadmin-[INFO]:-   Total number of /tmp lock files missing                   = 0
gpstate:mdw-1:gpadmin-[INFO]:-   Total number of /tmp lock files found                     = 4
gpstate:mdw-1:gpadmin-[INFO]:-   Total number postmaster processes missing                 = 0
gpstate:mdw-1:gpadmin-[INFO]:-   Total number postmaster processes found                   = 4
gpstate:mdw-1:gpadmin-[WARNING]:-Total number mirror segments acting as primary segments   = 1                      <<<<<<<<
gpstate:mdw-1:gpadmin-[INFO]:-   Total number mirror segments acting as mirror segments    = 3
gpstate:mdw-1:gpadmin-[INFO]:-----------------------------------------------------

 

 這裡說明是發生過primary/mirror的角色切換的

 

嘗試連接資料庫:

psql: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

 

 這時候庫已經無法連接上。

 

二、定位過程

1、查看日誌

首先查看master的日誌:

CST,"gpadmin","postgres",p30743,th1692887104,"[local]",,CST,0,con38,,seg-1,,,,sx1,"LOG","00000","The previous session was reset because its gang was disconnected (session id = 37). The new session id = 38",,,,,,,0,,"cdbgang.c",1638,
CST,"gpadmin","postgres",p30743,th1692887104,"[local]",,CST,0,con38,,seg-1,,,,sx1,"LOG","00000","Failed to connect to seg0 xx.xx.xx.xx:50000(xx.xx.xx.xx是segment的ip)",,,,,,,0,,"cdbgang_async.c",174,
CST,"gpadmin","postgres",p30743,th1692887104,"[local]",,CST,0,con39,,seg-1,,,,sx1,"LOG","00000","The previous session was reset because its gang was disconnected (session id = 38). The new session id = 39",,,,,,,0,,"cdbgang.c",1638,
CST,,,p30743,th0,,,CST,0,con39,,seg-1,,,,,"PANIC","XX000","Unexpected internal error: Master process received signal SIGSEGV",,,,,,,0,,,,"1    0xa0ce4b postgres StandardHandlerForSigillSigsegvSigbus_OnMainThread + 0x1b2
2    0x8d376e postgres CdbProgramErrorHandler + 0xf1
3    0x7fee63a14100 libpthread.so.0 <symbol not found> + 0x63a14100
4    0xaaa718 postgres <symbol not found> + 0xaaa718
5    0xaaa3e4 postgres CdbDispatchCommand + 0x35
6    0xb75719 postgres <symbol not found> + 0xb75719
7    0xb755d4 postgres <symbol not found> + 0xb755d4
8    0xb7524b postgres <symbol not found> + 0xb7524b
9    0xb71c51 postgres <symbol not found> + 0xb71c51
10   0xb720ae postgres initTM + 0xe2
11   0xb78864 postgres cdb_setup + 0x4b
12   0xa1b6e8 postgres InitPostgres + 0x9fa
"
CST,,,p28893,th1692887104,,,,0,,,seg-1,,,,,"LOG","00000","server process (PID 30743) was terminated by signal 11: Segmentation fault",,,,,,,0,,"postmaster.c",5604,
CST,,,p28893,th1692887104,,,,0,,,seg-1,,,,,"LOG","00000","terminating any other active server processes",,,,,,,0,,"postmaster.c",5284,
CST,,,p28893,th1692887104,,,,0,,,seg-1,,,,,"LOG","00000","seqserver process (PID 28900) exited with exit code 2",,,,,,,0,,"postmaster.c",5584,
CST,,,p28893,th1692887104,,,,0,,,seg-1,,,,,"LOG","00000","ftsprobe process (PID 28901) exited with exit code 2",,,,,,,0,,"postmaster.c",5584,
CST,,,p28893,th1692887104,,,,0,,,seg-1,,,,,"LOG","00000","sweeper process (PID 28902) exited with exit code 2",,,,,,,0,,"postmaster.c",5584,
CST,,,p28893,th1692887104,,,,0,,,seg-1,,,,,"LOG","00000","BeginResetOfPostmasterAfterChildrenAreShutDown: counter 0",,,,,,,0,,"postmaster.c",2171,
CST,,,p28893,th1692887104,,,,0,,,seg-1,,,,,"LOG","00000","gp_session_id high-water mark is 39",,,,,,,0,,"postmaster.c",2197,

 

 這一句說明有個segment是連不上的:

"Failed to connect to seg0 xx.xx.xx.xx:50000(xx.xx.xx.xx是segment的ip)",,,,,,,0,,"cdbgang_async.c",174,

 連接到這個segment所在的物理機,查看segment日誌,發現這個時間點segment並沒有相關的日誌信息。

 

2、查看其它狀態信息

使用gpstate -s 查看各節點的狀態,截取其中有問題的一個segment:

gpstate:mdw-1:gpadmin-[INFO]:-   Segment Info
gpstate:mdw-1:gpadmin-[INFO]:-      Hostname                                = sdw-2
gpstate:mdw-1:gpadmin-[INFO]:-      Address                                 = sdw-2
gpstate:mdw-1:gpadmin-[INFO]:-      Datadir                                 = /gpadmin/data/mirror/gpseg0
gpstate:mdw-1:gpadmin-[INFO]:-      Port                                    = 50000
gpstate:mdw-1:gpadmin-[INFO]:-   Mirroring Info
gpstate:mdw-1:gpadmin-[INFO]:-      Current role                            = Primary
gpstate:mdw-1:gpadmin-[INFO]:-      Preferred role                          = Mirror
gpstate:mdw-1:gpadmin-[INFO]:-      Mirror status                           = Resynchronizing
gpstate:mdw-1:gpadmin-[INFO]:-   Change Tracking Info
gpstate:mdw-1:gpadmin-[INFO]:-      Change tracking data size               = 32.1 kB
gpstate:mdw-1:gpadmin-[INFO]:-   Resynchronization Info
gpstate:mdw-1:gpadmin-[INFO]:-      Resynchronization mode                  = Incremental
gpstate:mdw-1:gpadmin-[INFO]:-      Data synchronized                       = 0 bytes
gpstate:mdw-1:gpadmin-[INFO]:-      Estimated total data to synchronize     = Not Available
gpstate:mdw-1:gpadmin-[INFO]:-      Estimated resync progress with mirror   = Not Available
gpstate:mdw-1:gpadmin-[INFO]:-      Total resync objects                    = 0
gpstate:mdw-1:gpadmin-[INFO]:-      Objects to resync                       = 0
gpstate:mdw-1:gpadmin-[INFO]:-      Estimated resync end time               = Not Available
gpstate:mdw-1:gpadmin-[INFO]:-   Status
gpstate:mdw-1:gpadmin-[INFO]:-      PID                                     = 26649
gpstate:mdw-1:gpadmin-[INFO]:-      Configuration reports status as         = Up
gpstate:mdw-1:gpadmin-[INFO]:-      Database status                         = Stopping
gpstate:mdw-1:gpadmin-[INFO]:-----------------------------------------------------
gpstate:mdw-1:gpadmin-[INFO]:-   Segment Info
gpstate:mdw-1:gpadmin-[INFO]:-      Hostname                                = sdw-1
gpstate:mdw-1:gpadmin-[INFO]:-      Address                                 = sdw-1
gpstate:mdw-1:gpadmin-[INFO]:-      Datadir                                 = /gpadmin/data/primary/gpseg0
gpstate:mdw-1:gpadmin-[INFO]:-      Port                                    = 40000
gpstate:mdw-1:gpadmin-[INFO]:-   Mirroring Info
gpstate:mdw-1:gpadmin-[INFO]:-      Current role                            = Mirror
gpstate:mdw-1:gpadmin-[INFO]:-      Preferred role                          = Primary
gpstate:mdw-1:gpadmin-[INFO]:-      Mirror status                           = Resynchronizing
gpstate:mdw-1:gpadmin-[INFO]:-   Status
gpstate:mdw-1:gpadmin-[INFO]:-      PID                                     = 30458
gpstate:mdw-1:gpadmin-[INFO]:-      Configuration reports status as         = Up
gpstate:mdw-1:gpadmin-[WARNING]:-   Segment status                          = Down                           <<<<<<<<

 

 這裡看到segment是處於Change Tracking狀態的,其中資料庫是Stopping狀態,sdw-1機器上,埠為40000的Mirror處於down的狀態,和配置中的up狀態是衝突的。

由於這個時候庫是不可連接的,所以通過master only模式直接連接到master,查看源資料庫,信息如下:

 

postgres=# select * from gp_segment_configuration ;
 dbid | content | role | preferred_role | mode | status | port  |   hostname    |    address    | replication_port | san_mounts
------+---------+------+----------------+------+--------+-------+---------------+---------------+------------------+------------
(省略正常節點)
    2 |       0 | m    | p              | r    | u      | 40000 | sdw-1 | sdw-1 |            41000 |
    6 |       0 | p    | m              | r    | u      | 50000 | sdw-2 | sdw-2 |            51000 |

 這裡可以看到源資料庫標記的這個segment處於同步狀態,然而status的狀態都為‘u’,可見,源資料庫標記出現了錯誤!

三、解決方案

通過以上的所有信息,可以分析出如下的一些問題:

  • 庫是無法正常啟動的,也無法正常連接
  • 無法連接庫則使用gprecoverseg -F 這種全量修複是不可行的
  • 元資料庫信息是有問題的,資料庫啟動的時候通過元資料庫認為segment正常,會進行同步,然而有個segment實際是down的,無法反正同步,導致庫無法啟動
  • gpcheckcat該命令是用於修複元資料庫錯誤的,會給出報告和修複建議,因為庫無法正常啟動,所以該命令無法正常執行

那麼需要通過修複元數據的方式,讓greenplum資料庫正常啟動,首先我們瞭解下greenplum的primary/mirror的狀態:

您的分享是我們最大的動力!

-Advertisement-
Play Games
更多相關文章
  • MySQL的MyISAM、InnoDB引擎預設均使用B+樹索引(查詢時都顯示為“BTREE”),本文討論兩個問題: 為什麼MySQL等主流資料庫選擇B+樹的索引結構? 如何基於索引結構,理解常見的MySQL索引優化思路? 為什麼索引無法全部裝入記憶體 索引結構的選擇基於這樣一個性質:大數據量時,索引無 ...
  • 前言 這邊文章,筆者要分享的是如何在我們的Linux系統中安裝我們的Mysql資料庫。 一、Mysql安裝 1.1下載資源並上傳到虛擬機上 1.1.1下載資源 官網下載資源:點我跳轉到資料庫下載官網 因為筆者使用的系統是centOS,所以下載的是red hat相關的資料庫系統: 當然了筆者下載的版本 ...
  • HDFS體繫結構 1<!--[if gte mso 9]><xml> <w:data>08D0C9EA79F9BACE118C8200AA004BA90B02000000080000000E0000005F0054006F006300350030003800360033003400350039003 ...
  • 什麼是DBLINK? dblink(Database Link)資料庫鏈接顧名思義就是資料庫的鏈接 ,就像電話線一樣,是一個通道,當我們要跨本地資料庫,訪問另外一個資料庫表中的數據時,本地資料庫中就必須要創建遠程資料庫的dblink,通過dblink本地資料庫可以像訪問本地資料庫一樣訪問遠程資料庫表 ...
  • 目的 創建和管理資料庫用戶帳戶 -驗證用戶 -分配預設存儲區(表空間) 授予和撤銷許可權 創建和管理角色 創建和管理概要文件 -實施標準口令安全功能 -控制用戶的資源使用量 相關術語 資料庫用戶帳號:是一種組織資料庫對象的所有權和訪問許可權的方法 口令:是Oracle DB使用的一種驗證方法 角色:是一 ...
  • 最近在學習Redis,先看看簡介: Redis 是完全開源免費的,遵守BSD協議(可以自由的使用,修改源代碼的協議,當然需要滿足一定的條件),是一個高性能的key-value資料庫。 特點&&優點: 支持數據的持久化,可以將記憶體中的數據保存在磁碟內,重啟的時候可以再次載入進行使用。 不僅支持key- ...
  • 今天使用命令dos 命令 net start mysql 啟動mysql的使用出現以下情況 無法正常啟動mysql服務。 原因是: 啟動dos命令視窗時的用戶許可權太低,無法正常使用 解決辦法: 搜索cmd使用管理員身份打開dos命令 這樣就能夠正常啟動mysql資料庫了 關閉mysql的dos命令 ...
  • 一、MySQL介紹 Mysql是最流行的關係型資料庫管理系統,在WEB應用方面MySQL是最好的RDBMS(Relational Database Management System:關係資料庫管理系統)應用軟體之一。 所謂的關係型資料庫,是建立在關係模型基礎上的資料庫,藉助於集合代數等數學概念和方 ...
一周排行
    -Advertisement-
    Play Games
  • 移動開發(一):使用.NET MAUI開發第一個安卓APP 對於工作多年的C#程式員來說,近來想嘗試開發一款安卓APP,考慮了很久最終選擇使用.NET MAUI這個微軟官方的框架來嘗試體驗開發安卓APP,畢竟是使用Visual Studio開發工具,使用起來也比較的順手,結合微軟官方的教程進行了安卓 ...
  • 前言 QuestPDF 是一個開源 .NET 庫,用於生成 PDF 文檔。使用了C# Fluent API方式可簡化開發、減少錯誤並提高工作效率。利用它可以輕鬆生成 PDF 報告、發票、導出文件等。 項目介紹 QuestPDF 是一個革命性的開源 .NET 庫,它徹底改變了我們生成 PDF 文檔的方 ...
  • 項目地址 項目後端地址: https://github.com/ZyPLJ/ZYTteeHole 項目前端頁面地址: ZyPLJ/TreeHoleVue (github.com) https://github.com/ZyPLJ/TreeHoleVue 目前項目測試訪問地址: http://tree ...
  • 話不多說,直接開乾 一.下載 1.官方鏈接下載: https://www.microsoft.com/zh-cn/sql-server/sql-server-downloads 2.在下載目錄中找到下麵這個小的安裝包 SQL2022-SSEI-Dev.exe,運行開始下載SQL server; 二. ...
  • 前言 隨著物聯網(IoT)技術的迅猛發展,MQTT(消息隊列遙測傳輸)協議憑藉其輕量級和高效性,已成為眾多物聯網應用的首選通信標準。 MQTTnet 作為一個高性能的 .NET 開源庫,為 .NET 平臺上的 MQTT 客戶端與伺服器開發提供了強大的支持。 本文將全面介紹 MQTTnet 的核心功能 ...
  • Serilog支持多種接收器用於日誌存儲,增強器用於添加屬性,LogContext管理動態屬性,支持多種輸出格式包括純文本、JSON及ExpressionTemplate。還提供了自定義格式化選項,適用於不同需求。 ...
  • 目錄簡介獲取 HTML 文檔解析 HTML 文檔測試參考文章 簡介 動態內容網站使用 JavaScript 腳本動態檢索和渲染數據,爬取信息時需要模擬瀏覽器行為,否則獲取到的源碼基本是空的。 本文使用的爬取步驟如下: 使用 Selenium 獲取渲染後的 HTML 文檔 使用 HtmlAgility ...
  • 1.前言 什麼是熱更新 游戲或者軟體更新時,無需重新下載客戶端進行安裝,而是在應用程式啟動的情況下,在內部進行資源或者代碼更新 Unity目前常用熱更新解決方案 HybridCLR,Xlua,ILRuntime等 Unity目前常用資源管理解決方案 AssetBundles,Addressable, ...
  • 本文章主要是在C# ASP.NET Core Web API框架實現向手機發送驗證碼簡訊功能。這裡我選擇是一個互億無線簡訊驗證碼平臺,其實像阿裡雲,騰訊雲上面也可以。 首先我們先去 互億無線 https://www.ihuyi.com/api/sms.html 去註冊一個賬號 註冊完成賬號後,它會送 ...
  • 通過以下方式可以高效,並保證數據同步的可靠性 1.API設計 使用RESTful設計,確保API端點明確,並使用適當的HTTP方法(如POST用於創建,PUT用於更新)。 設計清晰的請求和響應模型,以確保客戶端能夠理解預期格式。 2.數據驗證 在伺服器端進行嚴格的數據驗證,確保接收到的數據符合預期格 ...