
Five Ways to Deduplicate in Hive


1. distinct

Problem: keep only one record per user within each app.
Example:

spark-sql> with test1 as
         > (select 122 as userid, 100024 as apptypeid
         > union all
         > select 123 as userid, 100024 as apptypeid
         > union all
         > select 123 as userid, 100024 as apptypeid)
         > select
         >   distinct userid, apptypeid
         > from test1;
122    100024
123    100024
Time taken: 4.781 seconds, Fetched 2 row(s)

2. group by
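For reference, the distinct dedup in section 1 boils down to keeping one copy of each (userid, apptypeid) pair. A minimal Python sketch (not from the original article) using the same sample rows:

```python
# Hypothetical in-memory version of the sample rows from section 1.
rows = [
    (122, 100024),
    (123, 100024),
    (123, 100024),
]

# DISTINCT keeps one copy of each (userid, apptypeid) pair;
# dict.fromkeys deduplicates while preserving first-seen order.
deduped = list(dict.fromkeys(rows))

print(deduped)  # [(122, 100024), (123, 100024)]
```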

Problem: keep only one record per user within each app.
Example:

spark-sql> with test1 as
         > (select 122 as userid, 100024 as apptypeid
         > union all
         > select 123 as userid, 100024 as apptypeid
         > union all
         > select 123 as userid, 100024 as apptypeid)
         > select
         >   userid,
         >   apptypeid
         > from
         >   (select
         >     userid,
         >     apptypeid
         >   from test1) t1
         > group by userid, apptypeid;
122    100024
123    100024
Time taken: 10.5 seconds, Fetched 2 row(s)

3. row_number()
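The group by dedup in section 2 produces the same result as distinct: grouping on all columns with no aggregates reduces each group to its key. A minimal Python sketch (an illustration, not the article's own code):

```python
from itertools import groupby

# Same hypothetical sample rows as the spark-sql example in section 2.
rows = [(122, 100024), (123, 100024), (123, 100024)]

# GROUP BY userid, apptypeid with no aggregates reduces to the distinct keys.
# groupby needs sorted input, loosely mirroring the shuffle-and-sort that
# distributes groups across reducers.
deduped = [key for key, _ in groupby(sorted(rows))]

print(deduped)  # [(122, 100024), (123, 100024)]
```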

Problem: for each app, take each user's most recent channel (qid), version (ver), and operating system (os) record.
Analysis: distinct is only simple deduplication and cannot solve this; group by is likewise simple grouped deduplication and cannot solve it; order by only sorts the whole result, so it cannot solve it either. This is where row_number() comes in: partition the data, then sort within each partition and keep the top row.
Example:

spark-sql> with test1 as
         > (select 122 as userid, 100024 as apptypeid, 'appstore' as qid, 'ios' as os, '1.0.2' as ver, 1627440618 as dateline
         > union all
         > select 123 as userid, 100024 as apptypeid, 'huawei' as qid, 'android' as os, '1.0.3' as ver, 1627440620 as dateline
         > union all
         > select 123 as userid, 100024 as apptypeid, 'huawei' as qid, 'android' as os, '1.0.4' as ver, 1627440621 as dateline)
         > select
         >   userid,
         >   apptypeid,
         >   qid,
         >   os,
         >   ver
         > from
         >   (select
         >     userid,
         >     apptypeid,
         >     qid,
         >     os,
         >     ver,
         >     row_number() over(distribute by apptypeid, userid sort by dateline desc) as rank
         >   from test1) t1
         > where t1.rank=1;
122    100024    appstore    ios    1.0.2
123    100024    huawei    android    1.0.4
Time taken: 5.286 seconds, Fetched 2 row(s)

4. left join
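The row_number() pattern in section 3 ("partition, sort by dateline descending, keep rank 1") is equivalent to keeping the newest row per partition key. A minimal Python sketch with the same sample rows (an illustration, not the article's own code):

```python
# (userid, apptypeid, qid, os, ver, dateline) rows from the section 3 example.
rows = [
    (122, 100024, 'appstore', 'ios',     '1.0.2', 1627440618),
    (123, 100024, 'huawei',   'android', '1.0.3', 1627440620),
    (123, 100024, 'huawei',   'android', '1.0.4', 1627440621),
]

# row_number() over (distribute by apptypeid, userid sort by dateline desc)
# filtered to rank = 1 keeps the row with the largest dateline per partition.
latest = {}
for r in rows:
    key = (r[1], r[0])  # partition key: (apptypeid, userid)
    if key not in latest or r[5] > latest[key][5]:
        latest[key] = r

result = sorted(latest.values())
print(result)
```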

Problem: compute each day's new users. There is a daily user table test1 and a historical new-user table test2 (new user: within each app, each user has exactly one row).
Analysis:
1. Dedupe the daily user table test1 with group by to get the day's users.
2. Left join the day's users against the historical new-user table; rows with no match in the historical table are the day's new users.
Example:

spark-sql> with test1 as
         > (select 122 as userid, 100024 as apptypeid
         > union all
         > select 123 as userid, 100024 as apptypeid
         > union all
         > select 123 as userid, 100024 as apptypeid),
         > test2 as
         > (select 122 as userid, 100024 as apptypeid
         > union all
         > select 124 as userid, 100024 as apptypeid
         > union all
         > select 125 as userid, 100024 as apptypeid)
         > select
         >   t1.userid,
         >   t1.apptypeid
         > from
         >   (select
         >     userid,
         >     apptypeid
         >   from test1
         >   group by userid, apptypeid) t1
         > left join
         >   (select
         >     userid,
         >     apptypeid
         >   from test2) t2
         > on t1.apptypeid=t2.apptypeid and t1.userid=t2.userid
         > where t2.userid is null;
123    100024
Time taken: 19.816 seconds, Fetched 1 row(s)

5. Bit manipulation: union all + group by
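The left join in section 4 with a `where t2.userid is null` filter is an anti-join: keep daily keys that have no match in the historical table. A minimal Python sketch of the same logic (an illustration, not the article's own code):

```python
# Hypothetical in-memory versions of test1 (daily) and test2 (historical).
daily = [(122, 100024), (123, 100024), (123, 100024)]
history = [(122, 100024), (124, 100024), (125, 100024)]

# Step 1: dedupe the daily table (the group by step).
# Step 2: keep only keys absent from the historical new-user table
# (the left join ... where t2.userid is null step).
seen = set(history)
new_users = [row for row in dict.fromkeys(daily) if row not in seen]

print(new_users)  # [(123, 100024)]
```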

Problem: compute each day's new users. As above, there is a daily user table test1 and a historical new-user table test2 (new user: within each app, each user has exactly one row).
Analysis:
1. Dedupe the daily user table test1 with group by to get the day's users.
2. Tag the day's users with 10 and the historical new users with 1 (bit-style tags).
3. union all the two sets and sum the tags per user. Rows whose summed tag is 10 appear only in the daily table, and are therefore the day's new users (a sum of 11 would mean the user also exists in the historical table).
Example:

spark-sql> with test1 as
         > (select 122 as userid, 100024 as apptypeid
         > union all
         > select 123 as userid, 100024 as apptypeid
         > union all
         > select 123 as userid, 100024 as apptypeid),
         > test2 as
         > (select 122 as userid, 100024 as apptypeid
         > union all
         > select 124 as userid, 100024 as apptypeid
         > union all
         > select 125 as userid, 100024 as apptypeid)
         > select
         >   userid,
         >   apptypeid
         > from
         >   (select
         >     sum(tag) as tag,
         >     userid,
         >     apptypeid
         >   from
         >     (select
         >       10 as tag,
         >       t1.userid,
         >       t1.apptypeid
         >     from
         >       (select
         >         userid,
         >         apptypeid
         >       from test1
         >       group by userid, apptypeid) t1
         >     union all
         >     select
         >       1 as tag,
         >       userid,
         >       apptypeid
         >     from test2) t2
         >   group by userid, apptypeid) t3
         > where t3.tag=10;
123    100024
Time taken: 10.428 seconds, Fetched 1 row(s)
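The bit-tag trick in section 5 can be sketched in a few lines of Python (an illustration, not the article's own code): sum the per-source tags per key, then filter on the sum.

```python
from collections import defaultdict

# Hypothetical in-memory versions of test1 (daily) and test2 (historical).
daily = [(122, 100024), (123, 100024), (123, 100024)]
history = [(122, 100024), (124, 100024), (125, 100024)]

# Tag deduped daily rows with 10 and historical rows with 1, then sum tags
# per key: 11 = in both tables, 10 = daily only (new user), 1 = history only.
tags = defaultdict(int)
for row in set(daily):   # set() plays the role of the inner group by
    tags[row] += 10
for row in history:
    tags[row] += 1

new_users = sorted(k for k, tag in tags.items() if tag == 10)
print(new_users)  # [(123, 100024)]
```

The tags act like bit positions, so a single sum distinguishes "daily only", "history only", and "both" without a join.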

Summary:
1. For simple deduplication, prefer group by over distinct: distinct pulls all the data into a single reducer, which wastes resources, runs slowly, and risks out-of-memory errors.
2. For computing daily new users over large data volumes, the bit-manipulation (union all + group by) approach is recommended.


