A Brief Analysis of Join Query Design and Implementation in Distributed Databases

Compared with queries against a single database instance, distributed data queries raise many technical challenges.

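This article records how Join queries are implemented for sharded MySQL (split databases and split tables) and for Elasticsearch, as a way to understand data-processing designs in distributed scenarios.
It starts with Join analysis for sharded MySQL, the commonly used relational database, then moves on to the non-relational Elasticsearch and its Join strategies, going step by step deeper into how Join is implemented.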

① MySQL sharding: Join query scenarios

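In a sharded setup the questions are how a query statement is dispatched and how the returned data is assembled. Compared with NoSQL databases, MySQL stays within the SQL standard, which makes it relatively easy to adapt to distributed scenarios.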

We use the sharding-jdbc middleware to walk through the overall design.

sharding-jdbc

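sharding-jdbc wraps the original DataSource and implements the JDBC specification to dispatch queries to the shards and assemble the results, so the application layer is unaware of the sharding.
Execution flow: SQL parsing => executor optimization => SQL routing => SQL rewriting => SQL execution => result merging (see io.shardingsphere.core.executor.ExecutorEngine#execute)
Parsing the Join statement determines which instance nodes the SQL is dispatched to. This corresponds to SQL routing.
SQL rewriting replaces the original (logical) table names with the actual shard table names.
In the worst case, the maximum number of executions for a Join query = database instances × shards of table A × shards of table B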
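The worst-case formula above can be sketched as a plain enumeration. This is a minimal illustration, not sharding-jdbc code: the class name CartesianRouteSketch, the route method, and the shard naming scheme (order_0, order_item_0, ...) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CartesianRouteSketch {

    /** Emits one actual SQL per (data source, table A shard, table B shard) combination. */
    static List<String> route(List<String> dataSources, int orderShards, int itemShards) {
        // Logical SQL; the two %s placeholders stand in for the logical table names.
        String logicalSql = "select * from %s o inner join %s oi on o.order_id = oi.order_id";
        List<String> actualSqls = new ArrayList<>();
        for (String ds : dataSources) {
            for (int a = 0; a < orderShards; a++) {
                for (int b = 0; b < itemShards; b++) {
                    // SQL rewrite: logical table name -> actual shard table name.
                    String sql = String.format(logicalSql, "order_" + a, "order_item_" + b);
                    actualSqls.add(ds + " ::: " + sql);
                }
            }
        }
        return actualSqls;
    }

    public static void main(String[] args) {
        List<String> sqls = route(Arrays.asList("db0", "db1"), 2, 2);
        sqls.forEach(System.out::println);
        System.out.println("executions = " + sqls.size()); // 2 x 2 x 2 = 8
    }
}
```

With 2 instances and 2 shards per table, the enumeration yields 8 statements, which is why a Join without a usable sharding key gets expensive quickly as shard counts grow.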

Code Insight

Sample project: [email protected]:cluoHeadon/sharding-jdbc-demo.git

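/**
 * Entry point for executing a query SQL; the full execution flow can be debugged from here.
 * @see ShardingPreparedStatement#execute()
 * @see ParsingSQLRouter#route(String, List, SQLStatement) The actual tables involved in a Join query are determined here by matching the routing rules.
 */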
public boolean execute() throws SQLException {
    try {
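        // Match the relevant actual tables from the parameters (which decide the shard) and the concrete SQL.
        Collection<PreparedStatementUnit> preparedStatementUnits = route();
        // Use a thread pool to dispatch execution and merge the results.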
        return new PreparedStatementExecutor(getConnection().getShardingContext().getExecutorEngine(), routeResult.getSqlStatement().getType(), preparedStatementUnits).execute();
    } finally {
        JDBCShardingRefreshHandler.build(routeResult, connection).execute();
        clearBatch();
    }
}
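The "dispatch with a thread pool, then merge" step in the method above can be sketched with plain java.util.concurrent. This is not sharding-jdbc's ExecutorEngine: ScatterGatherSketch and executeAndMerge are hypothetical names, and the merge here is simple concatenation, whereas real result merging also has to handle ORDER BY, GROUP BY, and LIMIT.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ScatterGatherSketch {

    /** Runs one task per routed SQL unit in parallel and concatenates the partial results. */
    static <T> List<T> executeAndMerge(List<Callable<List<T>>> units) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, units.size()));
        List<T> merged = new ArrayList<>();
        try {
            // Dispatch: invokeAll blocks until every unit has finished and
            // returns the futures in the same order as the submitted units.
            for (Future<List<T>> partial : pool.invokeAll(units)) {
                // Merge: append each shard's rows in routing order.
                merged.addAll(partial.get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for shard results", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a shard query failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return merged;
    }
}
```

Each Callable stands in for executing one rewritten SQL against one shard; in a real setup it would run a PreparedStatement and map the ResultSet to rows.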

SQL routing strategies

Enable SQL logging to see the SQL that is actually dispatched and executed.

# The log output is printed right after route() above has produced the execution units
sharding.jdbc.config.sharding.props.sql.show=true

sharding-jdbc applies different routing strategies to different SQL statements. For the Join queries we care about, the two relevant strategies are the following.

StandardRoutingEngine: binding-tables mode.
ComplexRoutingEngine: the most complex case, where the shards are combined as a Cartesian product.

-- No usable sharding parameter, so the target shard cannot be located
select * from order o inner join order_item oi on o.order_id = oi.order_id 

-- Routing result
-- Actual SQL: db1 ::: select * from order_1 o inner join order_item_1 oi on o.order_id = oi.order_id 
-- Actual SQL: db1 ::: select * from order_1 o inner join order_item_0 oi on o.order_id = oi.order_id 
-- Actual SQL: db1 ::: select * from order_0 o inner join order_item_1 oi on o.order_id = oi.order_id 
-- Actual SQL: db1 ::: select * from order_0 o inner join order_item_0 oi on o.order_id = oi.order_id 
-- Actual SQL: db0 ::: select * from order_1 o inner join order_item_1 oi on o.order_id = oi.order_id 
-- Actual SQL: db0 ::: select * from order_1 o inner join order_item_0 oi on o.order_id = oi.order_id 
-- Actual SQL: db0 ::: select * from order_0 o inner join order_item_1 oi on o.order_id = oi.order_id 
-- Actual SQL: db0 ::: select * from order_0 o inner join order_item_0 oi on o.order_id = oi.order_id
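For contrast with the eight Cartesian statements above, here is a minimal sketch of what binding-tables routing would produce. It assumes that order and order_item are sharded by the same key with the same rule, so that matching rows always share a shard suffix. BindingTableRouteSketch and its route method are hypothetical names for illustration, not sharding-jdbc classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BindingTableRouteSketch {

    /** Pairs only shards with the same suffix: order_i joins order_item_i. */
    static List<String> route(List<String> dataSources, int shards) {
        List<String> actualSqls = new ArrayList<>();
        for (String ds : dataSources) {
            for (int i = 0; i < shards; i++) {
                // Bound tables follow the same sharding rule, so cross-suffix
                // combinations such as order_0 x order_item_1 are never emitted.
                actualSqls.add(ds + " ::: select * from order_" + i
                        + " o inner join order_item_" + i
                        + " oi on o.order_id = oi.order_id");
            }
        }
        return actualSqls;
    }

    public static void main(String[] args) {
        // 2 data sources x 2 shard pairs = 4 statements instead of 8.
        route(Arrays.asList("db0", "db1"), 2).forEach(System.out::println);
    }
}
```

The saving grows with the shard count: the Cartesian case scales with shards of A × shards of B, while bound tables scale with the shard count alone.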

② Elasticsearch Join query scenarios

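First, if you find yourself needing Join queries on a NoSQL database, consider whether the use case or the way the database is being used is itself the problem.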

Still, some scenarios unavoidably need this capability. Here the Join implementation sits much closer to a SQL engine.

We use the elasticsearch-sql component to understand the general approach.

elasticsearch-sql

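This is an Elasticsearch plugin that provides SQL-like queries through an HTTP service. Newer versions of Elasticsearch already include this capability.
Because Elasticsearch has no Join query feature, implementing SQL Join requires lower-level functionality, which involves Join algorithms.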

Code Insight

Source code: [email protected]:NLPchina/elasticsearch-sql.git

/**
 * Execute the ActionRequest and returns the REST response using the channel.
 * @see ElasticDefaultRestExecutor#execute
 * @see ESJoinQueryActionFactory#createJoinAction Join algorithm selection
 */
@Override
public void execute(Client client, Map<String, String> params, QueryAction queryAction, RestChannel channel) throws Exception{
    // sql parse
    SqlElasticRequestBuilder requestBuilder = queryAction.explain();

    // Join query
    if(requestBuilder instanceof JoinRequestBuilder){
        // Join algorithm selection. Options: HashJoinElasticExecutor, NestedLoopsElasticExecutor
        // If the join condition is an equality (Condition.OPEAR.EQ), HashJoinElasticExecutor is used
        ElasticJoinExecutor executor = ElasticJoinExecutor.createJoinExecutor(client,requestBuilder);
        executor.run();
        executor.sendResponse(channel);
    }
    // Other query types ...
}

③ More Than Join

Join algorithms

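The three common Join algorithms are Nested Loop Join (NLJ), Hash Join, and Merge Join.
MySQL originally supported only NLJ and its variants; Hash Join is supported from version 8.0.18 onward.
NLJ amounts to two nested loops: the first table drives the outer loop and the second table the inner loop. Each outer-loop record is compared with the inner-loop records, and every pair that satisfies the join condition is added to the result.
Hash Join has two phases: a build phase and a probe phase.
Use EXPLAIN to see which Join algorithm MySQL chooses. The required syntax keyword is FORMAT=JSON or FORMAT=TREE.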

EXPLAIN FORMAT=JSON  
SELECT * FROM
    sale_line_info u
    JOIN sale_line_manager o ON u.sale_line_code = o.sale_line_code;

{
    "query_block": {
        "select_id": 1,
        // Join algorithm in use: nested_loop
        "nested_loop": [
            // The joined tables and their keys; the remaining fields are similar to regular EXPLAIN output
            {
                "table": {
                    "table_name": "o",
                    "access_type": "ALL"
                }
            },
            {
                "table": {
                    "table_name": "u",
                    "access_type": "ref"
                }
            }
        ]
    }
}
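The NLJ and Hash Join descriptions in this section can be sketched over in-memory lists. This is only an illustration of the two algorithms for an equality join, not how MySQL or elasticsearch-sql implements them; the Row type and the sample data are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class JoinAlgorithmsSketch {

    /** A row reduced to (join key, payload); real rows carry many columns. */
    static final class Row {
        final String key;
        final String value;
        Row(String key, String value) { this.key = key; this.value = value; }
    }

    /** Nested Loop Join: compare every outer row with every inner row, O(n * m). */
    static List<String> nestedLoopJoin(List<Row> outer, List<Row> inner) {
        List<String> out = new ArrayList<>();
        for (Row o : outer) {
            for (Row i : inner) {
                if (o.key.equals(i.key)) {
                    out.add(o.value + "|" + i.value);
                }
            }
        }
        return out;
    }

    /** Hash Join: build a hash table on one input, then probe it with the other, O(n + m). */
    static List<String> hashJoin(List<Row> build, List<Row> probe) {
        // Build phase: group the build-side rows by join key.
        Map<String, List<Row>> table = new HashMap<>();
        for (Row b : build) {
            table.computeIfAbsent(b.key, k -> new ArrayList<>()).add(b);
        }
        // Probe phase: look up each probe-side row's key in the hash table.
        List<String> out = new ArrayList<>();
        for (Row p : probe) {
            for (Row b : table.getOrDefault(p.key, Collections.emptyList())) {
                out.add(b.value + "|" + p.value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> lines = Arrays.asList(new Row("L1", "line-1"), new Row("L2", "line-2"));
        List<Row> managers = Arrays.asList(new Row("L1", "alice"), new Row("L1", "bob"), new Row("L3", "carol"));
        System.out.println(nestedLoopJoin(lines, managers));
        System.out.println(hashJoin(lines, managers));
    }
}
```

Hash Join only works when the condition is an equality on the key, which matches the rule in the elasticsearch-sql code above: equality conditions go to the hash join executor, and other conditions fall back to nested loops.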

Elasticsearch Nested type

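Depending on the business data stored in Elasticsearch and how it is used, another option is to store the related information directly inside the document. Elasticsearch serves queries and retrieval over complete documents, so this avoids Join-related techniques entirely.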

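This raises several questions: whether the related data is owned by the parent record or shared across records, how large the related data is, and how often it is updated. All of these must be weighed before using the Nested type.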
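To make the idea of embedding related data concrete, the sketch below models an order document that carries its items inline and contrasts two matching semantics. In Elasticsearch, a plain array of objects is flattened so that conditions can match across different array elements, while the nested type keeps each element as its own unit. The classes and methods here (NestedDocSketch, OrderDoc, Item, matchesNested, matchesFlattened) are hypothetical and only simulate that behavior in memory.

```java
import java.util.Arrays;
import java.util.List;

final class NestedDocSketch {

    /** One embedded child object, stored inside the parent document instead of a joined table. */
    static final class Item {
        final String sku;
        final int qty;
        Item(String sku, int qty) { this.sku = sku; this.qty = qty; }
    }

    /** Parent document that owns its items. */
    static final class OrderDoc {
        final String orderId;
        final List<Item> items;
        OrderDoc(String orderId, List<Item> items) { this.orderId = orderId; this.items = items; }
    }

    /** Nested semantics: both conditions must hold on the same embedded item. */
    static boolean matchesNested(OrderDoc doc, String sku, int minQty) {
        return doc.items.stream().anyMatch(i -> i.sku.equals(sku) && i.qty >= minQty);
    }

    /** Flattened semantics: each condition may be satisfied by a different item. */
    static boolean matchesFlattened(OrderDoc doc, String sku, int minQty) {
        return doc.items.stream().anyMatch(i -> i.sku.equals(sku))
                && doc.items.stream().anyMatch(i -> i.qty >= minQty);
    }

    public static void main(String[] args) {
        OrderDoc doc = new OrderDoc("o-1", Arrays.asList(new Item("A", 1), new Item("B", 5)));
        // Query: sku = A AND qty >= 5. No single item satisfies both conditions.
        System.out.println("nested:    " + matchesNested(doc, "A", 5));
        System.out.println("flattened: " + matchesFlattened(doc, "A", 5));
    }
}
```

The trade-off follows from the factors listed above: embedding works well for data owned by the parent, while large or frequently updated related data makes every change rewrite the whole parent document.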

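More usage details are available online and in the official documentation, so they are not repeated here.
One of our current business features uses the Nested type, and it resolved a very difficult problem during querying and optimization.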

Summary

Analyzing how these components work gives a clear, in-depth understanding of the execution flow.

It makes middleware optimization and technology selection more purposeful, and it encourages more careful use.

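Explicit filter conditions, a narrower filter range, and limiting the number of rows fetched with LIMIT all reduce computation cost and improve performance.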

References

Author: 楊攀 (Yang Pan), 京東物流 (JD Logistics)

Source: 京東雲開發者社區 (JD Cloud Developer Community)