Impala好用,但也遍地是坑,这里Mark一下,持续更新。
multi count distinct不支持
sql如下:1
2
3
4
5
6
7
8select
upload_key,
max(my_timestamp) as upload_time,
max(color_key) as max_color_fk,
count(distinct color_key) as color_count,
count(distinct id) as toy_count
from upload_table
group by upload_key
报错内容:AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters as count(DISTINCT color_key); deviating function: count(DISTINCT id)
A workaround would be two Selects and then a join:1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25select
t1.upload_key,
t1.upload_time,
t1.max_color_fk,
t1.color_count,
t2.toy_count
from
(
select
upload_key,
max(my_timestamp) as upload_time,
max(color_key) as max_color_fk,
count(distinct color_key) as color_count
from upload_table
group by upload_key
) as t1
join
(
select
upload_key
count(distinct id) as toy_count
from upload_table
group by upload_key
) as t2
on t1. upload_key = t2.upload_key
Lateral View语法不支持
Impala-2679 not support directly, but Impala 2.3 has support for complex types in Parquet, you should store complex data as complex type in parquet, then you can use it as map/array/set in impala 2.3
复杂的Hive view不支持
对于复杂的Hive view语法, Impala会直接抛出这种报错:Failed to parse view-definition statement of view
LAG等窗口函数取num=1的时候导致sql结果频繁不一致
lag是系统自带的窗口分析函数,hive,impala,spark-sql都是有的,lag函数中order by dt的时候由于dt有大量的重复值,不同的计算的时候可能shuffle到不同的块中,导致计算结果不一致了。 同理Row_number() OVER ( partition BY user_id ORDER BY pay_time desc)也容易造成结果不一致。