筆者使用SeaTunnel 2.3.2版本將Doris數據同步到Hive(cdh-6.3.2)首次運行時有如下報錯,並附上報錯的解決方案: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/metastore/api/MetaExcepti ...
8.Nested Join Optimization
The syntax for expressing joins permits nested joins.
The syntax of table_factor is extended in comparison with the SQL Standard. The latter accepts only table_reference, not a list of them inside a pair of parentheses. This is a conservative【kənˈsɜːrvətɪv 保守的;(英國)保守黨的;守舊的;低於實際數量的;】 extension if we consider each comma in a list of table_reference items as equivalent to an inner join. For example:
SELECT * FROM t1 LEFT JOIN (t2, t3, t4) ON (t2.a=t1.a AND t3.b=t1.b AND t4.c=t1.c)
Is equivalent to:
SELECT * FROM t1 LEFT JOIN (t2 CROSS JOIN t3 CROSS JOIN t4) ON (t2.a=t1.a AND t3.b=t1.b AND t4.c=t1.c)
In MySQL, CROSS JOIN is syntactically【句法上;語法上;句法;句法特征;】 equivalent to INNER JOIN; they can replace each other. In standard SQL, they are not equivalent. INNER JOIN is used with an ON clause; CROSS JOIN is used otherwise.
In general, parentheses can be ignored in join expressions containing only inner join operations. Consider this join expression:
t1 LEFT JOIN (t2 LEFT JOIN t3 ON t2.b=t3.b OR t2.b IS NULL) ON t1.a=t2.a
After removing parentheses and grouping operations to the left, that join expression transforms into this expression:
(t1 LEFT JOIN t2 ON t1.a=t2.a) LEFT JOIN t3 ON t2.b=t3.b OR t2.b IS NULL
Yet, the two expressions are not equivalent. To see this, suppose that the tables t1, t2, and t3 have the following state:
• Table t1 contains rows (1), (2)
• Table t2 contains row (1,101)
• Table t3 contains row (101)
In this case, the first expression returns a result set including the rows (1,1,101,101), (2,NULL,NULL,NULL), whereas the second expression returns the rows (1,1,101,101), (2,NULL,NULL,101):
mysql> SELECT * FROM t1 LEFT JOIN (t2 LEFT JOIN t3 ON t2.b=t3.b OR t2.b IS NULL) ON t1.a=t2.a; +------+------+------+------+ | a | a | b | b | +------+------+------+------+ | 1 | 1 | 101 | 101 | | 2 | NULL | NULL | NULL | +------+------+------+------+ mysql> SELECT * FROM (t1 LEFT JOIN t2 ON t1.a=t2.a) LEFT JOIN t3 ON t2.b=t3.b OR t2.b IS NULL; +------+------+------+------+ | a | a | b | b | +------+------+------+------+ | 1 | 1 | 101 | 101 | | 2 | NULL | NULL | 101 | +------+------+------+------+
In the following example, an outer join operation is used together with an inner join operation:
t1 LEFT JOIN (t2, t3) ON t1.a=t2.a
That expression cannot be transformed into the following expression:
t1 LEFT JOIN t2 ON t1.a=t2.a, t3
For the given table states, the two expressions return different sets of rows:
mysql> SELECT * FROM t1 LEFT JOIN (t2, t3) ON t1.a=t2.a; +------+------+------+------+ | a | a | b | b | +------+------+------+------+ | 1 | 1 | 101 | 101 | | 2 | NULL | NULL | NULL | +------+------+------+------+ mysql> SELECT * FROM t1 LEFT JOIN t2 ON t1.a=t2.a, t3; +------+------+------+------+ | a | a | b | b | +------+------+------+------+ | 1 | 1 | 101 | 101 | | 2 | NULL | NULL | 101 | +------+------+------+------+
Therefore, if we omit【əˈmɪt 省略;忽略;遺漏;刪除;漏掉;不做;未能做;】 parentheses【pəˈrɛnθəˌsiz 插入語;】 in a join expression with outer join operators, we might change the result set for the original expression.
More exactly, we cannot ignore parentheses in the right operand of the left outer join operation and in the left operand of a right join operation. In other words, we cannot ignore parentheses for the inner table expressions of outer join operations. Parentheses for the other operand (operand for the outer table) can be ignored.
The following expression:
(t1,t2) LEFT JOIN t3 ON P(t2.b,t3.b)
Is equivalent to this expression for any tables t1,t2,t3 and any condition P over attributes t2.b and t3.b:
t1, t2 LEFT JOIN t3 ON P(t2.b,t3.b)
Whenever the order of execution of join operations in a join expression (joined_table) is not from left to right, we talk about nested joins. Consider the following queries:
SELECT * FROM t1 LEFT JOIN (t2 LEFT JOIN t3 ON t2.b=t3.b) ON t1.a=t2.a WHERE t1.a > 1 SELECT * FROM t1 LEFT JOIN (t2, t3) ON t1.a=t2.a WHERE (t2.b=t3.b OR t2.b IS NULL) AND t1.a > 1
Those queries are considered to contain these nested joins:
t2 LEFT JOIN t3 ON t2.b=t3.b t2, t3
In the first query, the nested join is formed with a left join operation. In the second query, it is formed with an inner join operation.
In the first query, the parentheses can be omitted: The grammatical【ɡrəˈmætɪkl 語法的;文法的;符合語法規則的;】 structure of the join expression dictates the same order of execution for join operations. For the second query, the parentheses cannot be omitted, although the join expression here can be interpreted unambiguously without them. In our extended syntax, the parentheses in (t2, t3) of the second query are required, although theoretically the query could be parsed without them: We still would have unambiguous syntactical structure for the query because LEFT JOIN and ON play the role of the left and right delimiters for the expression (t2,t3).
The preceding examples demonstrate these points:
• For join expressions involving only inner joins (and not outer joins), parentheses can be removed and joins evaluated left to right. In fact, tables can be evaluated in any order.
• The same is not true, in general, for outer joins or for outer joins mixed with inner joins. Removal of parentheses may change the result.
Queries with nested outer joins are executed in the same pipeline manner as queries with inner joins. More exactly, a variation of the nested-loop join algorithm is exploited. Recall the algorithm by which the nested-loop join executes a query. Suppose that a join query over 3 tables T1,T2,T3 has this form:
SELECT * FROM T1 INNER JOIN T2 ON P1(T1,T2) INNER JOIN T3 ON P2(T2,T3) WHERE P(T1,T2,T3)
Here, P1(T1,T2) and P2(T3,T3) are some join conditions (on expressions), whereas P(T1,T2,T3) is a condition over columns of tables T1,T2,T3.
The nested-loop join algorithm would execute this query in the following manner:
FOR each row t1 in T1 { FOR each row t2 in T2 such that P1(t1,t2) { FOR each row t3 in T3 such that P2(t2,t3) { IF P(t1,t2,t3) { t:=t1||t2||t3; OUTPUT t; } } } }
The notation t1||t2||t3 indicates a row constructed by concatenating【kənˈkætəˌneɪtɪŋ 使連接(連續,銜接)起來;連鎖;串級;】 the columns of rows t1, t2, and t3. In some of the following examples, NULL where a table name appears means a row in which NULL is used for each column of that table. For example, t1||t2||NULL indicates a row constructed by concatenating the columns of rows t1 and t2, and NULL for each column of t3. Such a row is said to be NULL-complemented.
Now consider a query with nested outer joins:
SELECT * FROM T1 LEFT JOIN (T2 LEFT JOIN T3 ON P2(T2,T3)) ON P1(T1,T2) WHERE P(T1,T2,T3)
For this query, modify the nested-loop pattern to obtain:
FOR each row t1 in T1 { BOOL f1:=FALSE; FOR each row t2 in T2 such that P1(t1,t2) { BOOL f2:=FALSE; FOR each row t3 in T3 such that P2(t2,t3) { IF P(t1,t2,t3) { t:=t1||t2||t3; OUTPUT t; } f2=TRUE; f1=TRUE; } IF (!f2) { IF P(t1,t2,NULL) { t:=t1||t2||NULL; OUTPUT t; } f1=TRUE; } } IF (!f1) { IF P(t1,NULL,NULL) { t:=t1||NULL||NULL; OUTPUT t; } } }
In general, for any nested loop for the first inner table in an outer join operation, a flag is introduced that is turned off before the loop and is checked after the loop. The flag is turned on when for the current row from the outer table a match from the table representing the inner operand is found. If at the end of the loop cycle the flag is still off, no match has been found for the current row of the outer table. In this case, the row is complemented by NULL values for the columns of the inner tables. The result row is passed to the final check for the output or into the next nested loop, but only if the row satisfies the join condition of all embedded outer joins.
In the example, the outer join table expressed by the following expression is embedded:
(T2 LEFT JOIN T3 ON P2(T2,T3))
For the query with inner joins, the optimizer could choose a different order of nested loops, such as this one:
FOR each row t3 in T3 { FOR each row t2 in T2 such that P2(t2,t3) { FOR each row t1 in T1 such that P1(t1,t2) { IF P(t1,t2,t3) { t:=t1||t2||t3; OUTPUT t; } } } }
For queries with outer joins, the optimizer can choose only such an order where loops for outer tables precede loops for inner tables. Thus, for our query with outer joins, only one nesting order is possible. For the following query, the optimizer evaluates two different nestings. In both nestings, T1 must be processed in the outer loop because it is used in an outer join. T2 and T3 are used in an inner join, so that join must be processed in the inner loop. However, because the join is an inner join, T2 and T3 can be processed in either order.
SELECT * T1 LEFT JOIN (T2,T3) ON P1(T1,T2) AND P2(T1,T3) WHERE P(T1,T2,T3)
One nesting evaluates T2, then T3:
FOR each row t1 in T1 { BOOL f1:=FALSE; FOR each row t2 in T2 such that P1(t1,t2) { FOR each row t3 in T3 such that P2(t1,t3) { IF P(t1,t2,t3) { t:=t1||t2||t3; OUTPUT t; } f1:=TRUE } } IF (!f1) { IF P(t1,NULL,NULL) { t:=t1||NULL||NULL; OUTPUT t; } } }
The other nesting evaluates T3, then T2:
FOR each row t1 in T1 { BOOL f1:=FALSE; FOR each row t3 in T3 such that P2(t1,t3) { FOR each row t2 in T2 such that P1(t1,t2) { IF P(t1,t2,t3) { t:=t1||t2||t3; OUTPUT t; } f1:=TRUE } } IF (!f1) { IF P(t1,NULL,NULL) { t:=t1||NULL||NULL; OUTPUT t; } } }
When discussing the nested-loop algorithm for inner joins, we omitted some details whose impact on the performance of query execution may be huge. We did not mention so-called “pushed-down” conditions. Suppose that our WHERE condition P(T1,T2,T3) can be represented by a conjunctive formula:
P(T1,T2,T2) = C1(T1) AND C2(T2) AND C3(T3).
In this case, MySQL actually uses the following nested-loop algorithm for the execution of the query with inner joins:
FOR each row t1 in T1 such that C1(t1) { FOR each row t2 in T2 such that P1(t1,t2) AND C2(t2) { FOR each row t3 in T3 such that P2(t2,t3) AND C3(t3) { IF P(t1,t2,t3) { t:=t1||t2||t3; OUTPUT t; } } } }
You see that each of the conjuncts C1(T1), C2(T2), C3(T3) are pushed out of the most inner loop to the most outer loop where it can be evaluated. If C1(T1) is a very restrictive condition, this condition pushdown may greatly reduce the number of rows from table T1 passed to the inner loops. As a result, the execution time for the query may improve immensely.
For a query with outer joins, the WHERE condition is to be checked only after it has been found that the current row from the outer table has a match in the inner tables. Thus, the optimization of pushing conditions out of the inner nested loops cannot be applied directly to queries with outer joins. Here we must introduce conditional pushed-down predicates guarded by the flags that are turned on when a match has been encountered.
Recall this example with outer joins:
P(T1,T2,T3)=C1(T1) AND C(T2) AND C3(T3)
For that example, the nested-loop algorithm using guarded pushed-down conditions looks like this:
FOR each row t1 in T1 such that C1(t1) { BOOL f1:=FALSE; FOR each row t2 in T2 such that P1(t1,t2) AND (f1?C2(t2):TRUE) { BOOL f2:=FALSE; FOR each row t3 in T3 such that P2(t2,t3) AND (f1&&f2?C3(t3):TRUE) { IF (f1&&f2?TRUE:(C2(t2) AND C3(t3))) { t:=t1||t2||t3; OUTPUT t; } f2=TRUE; f1=TRUE; } IF (!f2) { IF (f1?TRUE:C2(t2) && P(t1,t2,NULL)) { t:=t1||t2||NULL; OUTPUT t; } f1=TRUE; } } IF (!f1 && P(t1,NULL,NULL)) { t:=t1||NULL||NULL; OUTPUT t; } }
In general, pushed-down predicates can be extracted from join conditions such as P1(T1,T2) and P(T2,T3). In this case, a pushed-down predicate is guarded also by a flag that prevents checking the predicate for the NULL-complemented row generated by the corresponding outer join operation.
Access by key from one inner table to another in the same nested join is prohibited if it is induced by a predicate from the WHERE condition.
9.Outer Join Optimization
Outer joins include LEFT JOIN and RIGHT JOIN.
MySQL implements an A LEFT JOIN B join_specification as follows:
• Table B is set to depend on table A and all tables on which A depends.
• Table A is set to depend on all tables (except B) that are used in the LEFT JOIN condition.
• The LEFT JOIN condition is used to decide how to retrieve rows from table B. (In other words, any condition in the WHERE clause is not used.)
• All standard join optimizations are performed, with the exception that a table is always read after all tables on which it depends. If there is a circular dependency, an error occurs.
• All standard WHERE optimizations are performed.
• If there is a row in A that matches the WHERE clause, but there is no row in B that matches the ON condition, an extra B row is generated with all columns set to NULL.
• If you use LEFT JOIN to find rows that do not exist in some table and you have the following test: col_name IS NULL in the WHERE part, where col_name is a column that is declared as NOT NULL, MySQL stops searching for more rows (for a particular key combination) after it has found one row that matches the LEFT JOIN condition.
The RIGHT JOIN implementation is analogous to that of LEFT JOIN with the table roles reversed.
For a LEFT JOIN, if the WHERE condition is always false for the generated NULL row, the LEFT JOIN is changed to an inner join. For example, the WHERE clause would be false in the following query if t2.column1 were NULL:
SELECT * FROM t1 LEFT JOIN t2 ON (column1) WHERE t2.column2=5;
Therefore, it is safe to convert the query to an inner join:
SELECT * FROM t1, t2 WHERE t2.column2=5 AND t1.column1=t2.column1;
In MySQL 8.0.14 and later, trivial WHERE conditions arising from constant literal expressions are removed during preparation, rather than at a later stage in optimization, by which time joins have already been simplified. Earlier removal of trivial conditions allows the optimizer to convert outer joins to inner joins; this can result in improved plans for queries with outer joins containing trivial conditions in the WHERE clause, such as this one:
SELECT * FROM t1 LEFT JOIN t2 ON condition_1 WHERE condition_2 OR 0 = 1
The optimizer now sees during preparation that 0 = 1 is always false, making OR 0 = 1 redundant, and removes it, leaving this:
SELECT * FROM t1 LEFT JOIN t2 ON condition_1 where condition_2
Now the optimizer can rewrite the query as an inner join, like this:
SELECT * FROM t1 JOIN t2 WHERE condition_1 AND condition_2
Now the optimizer can use table t2 before table t1 if doing so would result in a better query plan. To provide a hint【hɪnt 提示;暗示;跡象;示意;少許;少量;徵兆;秘訣;】 about the table join order, use optimizer hints. Alternatively, use STRAIGHT_JOIN;However, STRAIGHT_JOIN may prevent indexes from being used because it disables semijoin transformations.
10 Outer Join Simplification
Table expressions in the FROM clause of a query are simplified in many cases.
At the parser【ˈpɑrsər 解析器;分析器;解析;語法分析器;剖析器;】 stage, queries with right outer join operations are converted to equivalent queries containing only left join operations. In the general case, the conversion is performed such that this right join:
(T1, ...) RIGHT JOIN (T2, ...) ON P(T1, ..., T2, ...)
Becomes this equivalent【ɪˈkwɪvələnt (價值、數量、意義、重要性等)相同的;相等的;】 left join:
(T2, ...) LEFT JOIN (T1, ...) ON P(T1, ..., T2, ...)
All inner join expressions of the form T1 INNER JOIN T2 ON P(T1,T2) are replaced by the list T1,T2, P(T1,T2) being joined as a conjunct to the WHERE condition (or to the join condition of the embedding join, if there is any).
When the optimizer evaluates plans for outer join operations, it takes into consideration only plans where, for each such operation, the outer tables are accessed before the inner tables. The optimizer choices are limited because only such plans enable outer joins to be executed using the nested-loop algorithm.
Consider a query of this form, where R(T2) greatly narrows the number of matching rows from table T2:
SELECT * T1 FROM T1 LEFT JOIN T2 ON P1(T1,T2) WHERE P(T1,T2) AND R(T2)
If the query is executed as written, the optimizer has no choice but to access the less-restricted table T1 before the more-restricted table T2, which may produce a very inefficient execution plan.
Instead, MySQL converts the query to a query with no outer join operation if the WHERE condition is nullrejected. (That is, it converts the outer join to an inner join.) A condition is said to be null-rejected for an outer join operation if it evaluates to FALSE or UNKNOWN for any NULL-complemented row generated for the operation.
Thus, for this outer join:
T1 LEFT JOIN T2 ON T1.A=T2.A
Conditions such as these are null-rejected because they cannot be true for any NULL-complemented row (with T2 columns set to NULL):
T2.B IS NOT NULL T2.B > 3 T2.C <= T1.C T2.B < 2 OR T2.C > 1
Conditions such as these are not null-rejected because they might be true for a NULL-complemented【ˈkɑːmplɪmentɪd 補充;補足;使完美;使更具吸引力;】 row:
T2.B IS NULL T1.B < 3 OR T2.B IS NOT NULL T1.B < 3 OR T2.B > 3
The general rules for checking whether a condition is null-rejected for an outer join operation are simple:
• It is of the form A IS NOT NULL, where A is an attribute of any of the inner tables
• It is a predicate containing a reference to an inner table that evaluates to UNKNOWN when one of its arguments is NULL
• It is a conjunction【kənˈdʒʌŋkʃn 連詞 (如and、but、or);(引起某種結果的事物等的)結合;(恆星、行星等的)合;同時發生;】 containing a null-rejected condition as a conjunct
• It is a disjunction【dɪsˈdʒʌŋkʃn 分離;分裂;】 of null-rejected conditions
A condition can be null-rejected for one outer join operation in a query and not null-rejected for another. In this query, the WHERE condition is null-rejected for the second outer join operation but is not null-rejected for the first one:
SELECT * FROM T1 LEFT JOIN T2 ON T2.A=T1.A LEFT JOIN T3 ON T3.B=T1.B WHERE T3.C > 0
If the WHERE condition is null-rejected for an outer join operation in a query, the outer join operation is replaced by an inner join operation.
For example, in the preceding query, the second outer join is null-rejected and can be replaced by an inner join:
SELECT * FROM T1 LEFT JOIN T2 ON T2.A=T1.A INNER JOIN T3 ON T3.B=T1.B WHERE T3.C > 0
For the original query, the optimizer evaluates only plans compatible with the single table-access order T1,T2,T3. For the rewritten query, it additionally considers the access order T3,T1,T2.
A conversion of one outer join operation may trigger a conversion of another. Thus, the query:
SELECT * FROM T1 LEFT JOIN T2 ON T2.A=T1.A LEFT JOIN T3 ON T3.B=T2.B WHERE T3.C > 0
Is first converted to the query:
SELECT * FROM T1 LEFT JOIN T2 ON T2.A=T1.A INNER JOIN T3 ON T3.B=T2.B WHERE T3.C > 0
Which is equivalent to the query:
SELECT * FROM (T1 LEFT JOIN T2 ON T2.A=T1.A), T3 WHERE T3.C > 0 AND T3.B=T2.B
The remaining outer join operation can also be replaced by an inner join because the condition T3.B=T2.B is null-rejected. This results in a query with no outer joins at all:
SELECT * FROM (T1 INNER JOIN T2 ON T2.A=T1.A), T3 WHERE T3.C > 0 AND T3.B=T2.B
Sometimes the optimizer succeeds in replacing an embedded outer join operation, but cannot convert the embedding outer join. The following query:
SELECT * FROM T1 LEFT JOIN (T2 LEFT JOIN T3 ON T3.B=T2.B) ON T2.A=T1.A WHERE T3.C > 0
Is converted to:
SELECT * FROM T1 LEFT JOIN (T2 INNER JOIN T3 ON T3.B=T2.B) ON T2.A=T1.A WHERE T3.C > 0
That can be rewritten only to the form still containing the embedding【ɪmˈbedɪŋ 把…牢牢地嵌入(或插入、埋入);派遣】 outer join operation:
SELECT * FROM T1 LEFT JOIN (T2,T3) ON (T2.A=T1.A AND T3.B=T2.B) WHERE T3.C > 0
Any attempt to convert an embedded outer join operation in a query must take into account the join condition for the embedding outer join together with the WHERE condition. In this query, the WHERE condition is not null-rejected for the embedded outer join, but the join condition of the embedding outer join T2.A=T1.A AND T3.C=T1.C is null-rejected:
SELECT * FROM T1 LEFT JOIN (T2 LEFT JOIN T3 ON T3.B=T2.B) ON T2.A=T1.A AND T3.C=T1.C WHERE T3.D > 0 OR T1.D > 0
Consequently【ˈkɑːnsɪkwentli 因此;所以;】, the query can be converted to:
SELECT * FROM T1 LEFT JOIN (T2, T3) ON T2.A=T1.A AND T3.C=T1.C AND T3.B=T2.B WHERE T3.D > 0 OR T1.D > 0
駿馬秋風冀北,
杏花春雨江南,
景若佳時心自然,
心遠樂處景應妍,休與俗人言。