最近在工作中遇到一個很難解析的JSON,他是一個嵌套的JSON數組的JSON,要使用Hive來進行解析,用Presto寫了一次,邏輯就很清晰,因為Presto自帶了JSON數據類型,轉換數組就很方便,而Hive解析完JSON數組後是一個字元串,只能使用split方法來對string類型的數據進行切分 ...
最近在工作中遇到一個很難解析的JSON,他是一個嵌套的JSON數組的JSON,要使用Hive來進行解析,用Presto寫了一次,邏輯就很清晰,因為Presto自帶了JSON數據類型,轉換數組就很方便,而Hive解析完JSON數組後是一個字元串,只能使用split
方法來對string
類型的數據進行切分,所以如果遇到多層嵌套的數組,要註意切分方法,不然就會亂套。
需要解析的JSON
{
"base": {
"code": "xm",
"name": "project"
},
"list": [{
"ACode": "cp1",
"AName": "Product1",
"BList": [{
"BCode": "gn1",
"BName": "Feature1"
}, {
"BCode": "gn2",
"BName": "Feature2"
}]
}, {
"ACode": "cp2",
"AName": "Product2",
"BList": [{
"BCode": "gn1",
"BName": "Feature1"
}]
}]
}
解析出來的結果應該如下表所示
code | name | ACode | Aname | Bcode | Bname |
---|---|---|---|---|---|
xm | project | cp1 | Product1 | gn1 | Feature1 |
xm | project | cp1 | Product1 | gn2 | Feature2 |
xm | project | cp2 | Product2 | gn1 | Feature1 |
解決方案
首先使用get_json_object
方法,把需要解析的數組解析出來,然後使用regexp_replace
將}]},{
替換成}]}||{
,然後再使用split
方法對||
進行分割,分割成數組後,使用lateral view explode
方法對其進行展開成多列即刻。
SELECT
code
, name
, ai.ACode
, ai.AName
, bi.BCode
, bi.BName
FROM
(
SELECT
get_json_object(t.value, '$.base.code') AS code
, get_json_object(t.value, '$.base.name') AS name
, get_json_object(t.value, '$.list') AS list
FROM
(
SELECT
'{"base":{"code":"xm","name":"project"},"list":[{"ACode":"cp1","AName":"Product1","BList":[{"BCode":"gn1","BName":"Feature1"},{"BCode":"gn2","BName":"Feature2"}]},{"ACode":"cp2","AName":"Product2","BList":[{"BCode":"gn1","BName":"Feature1"}]}]}' as value
)
t
)
t lateral view explode(split(regexp_replace(regexp_extract(list,'^\\[(.+)\\]$',1),'\\}\\]\\}\\,\\{', '\\}\\]\\}\\|\\|\\{'),'\\|\\|')) list as a
lateral view json_tuple(a,'ACode','AName','BList') ai as ACode
, AName
, BList lateral view explode(split(regexp_replace(regexp_extract(BList,'^\\[(.+)\\]$',1),'\\}\\,\\{', '\\}\\|\\|\\{'),'\\|\\|')) BList as b
lateral view json_tuple(b,'BCode','BName') bi as BCode
, BName
;
執行完
xm project cp1 Product1 gn1 Feature1
xm project cp1 Product1 gn2 Feature2
xm project cp2 Product2 gn1 Feature1
Time taken: 0.787 seconds, Fetched: 3 row(s)
總結
- 嘗試切分為數組後,使用
lateral view posexplode
方案,逐層解析,但這樣會導致笛卡爾。所以必須一次性全部解析好,而不是套用多個子查詢逐層解析; - 使用
OUTER
欄位,能使LATERAL VIEW
不忽略NULL
include
OUTER
in the query to get rows with NULL valuessomething like,
select * FROM table LATERAL VIEW OUTER explode ( split ( email ,',' ) ) email AS email_id;