hive基础2
Published: 2019-06-15


RDBMS

OLTP: relational database management system. Supports ACID transactions, with low latency and strong safety/consistency guarantees.

V

variety: diversity of data (one of the big-data "V"s).

hive

SQL on MR: far higher development efficiency than hand-written MapReduce. Hive is a data warehouse, not a database:
    OLTP            //online transaction processing
    OLAP            //online analytical processing; many analytic functions
                    //rank | lag | lead | .. | cube | rollup | grouping sets
                    //grouping sets combine what would otherwise be several group by passes (multiple shuffles)
hive metadata       //table schema information is kept in a relational database (derby by default)
1. Initialize hive
    edit hive-site.xml
    add the JDBC driver to hive\lib
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://192.168.231.1:3306/big11_hive</value>
  <description>JDBC connect string for a JDBC metastore. To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL. For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>Username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>root</value>
  <description>password to use against metastore database</description>
</property>
<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value>
  <description>Setting this property to true will have HiveServer2 execute Hive operations as the user making the calls to it.</description>
</property>
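
After pointing the metastore at MySQL as above, the metastore schema is usually initialized once with schematool (a sketch; the -dbType value must match the configured database):
    $>schematool -dbType mysql -initSchema
    $>schematool -dbType mysql -info        # verify the schema version afterwards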
View the logs
------------
    hive logs are written under /tmp/
2. Log in to hive
    $>hive

hive data types

1. Simple types
    int, string, bigint, float, ...
2. Complex types
    array, map, struct, named struct, union

Complex types

Michael|Montreal,Toronto|Male,30|DB:80|Product:Developer^DLead
Will|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead
Shelley|New York|Female,27|Python:80|Test:Lead,COE:Architect
Lucy|Vancouver|Female,57|Sales:89,HR:94|Sales:Lead

CREATE TABLE employee
(
name string ,
arr ARRAY<string> ,
stru STRUCT<sex:string,age:int> ,
map1 MAP<string,int> ,
map2 MAP<string,ARRAY<string>>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
lines terminated by '\n';

Loading data into hive

load data local inpath '/home/centos/emp.txt' into table employee ;
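
A quick check that the complex columns parsed correctly (the field accessors match the column definitions above; stru.age is the second struct field):
select name , arr[0] , stru.sex , stru.age , map1['DB'] , map2['Product'][0] from employee ;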

The hive split function

select split('hello world' , ' ') from employee ;

WordCount in hive

1. Create the table
    CREATE TABLE doc
    (
    line string
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    lines terminated by '\n';
2. Load the data
    load data local inpath '/home/centos/1.txt' into table doc ;
3. Count the words
    select
    t.word ,
    count(t.word) cnt
    from
    (
     select explode(split(line , ' ')) word from doc
    )
    t
    group by
    t.word
    order by
    cnt desc
    limit 3 ;

Managed (internal) tables

A managed table: dropping it removes both the table definition and the data.

External tables

Dropping an external table removes only the table definition; the data files are left in place.
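
For comparison, a minimal sketch of an external table; the location /user/centos/ext_data is an assumed example directory:
CREATE EXTERNAL TABLE t_ext
(
id int ,
name string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/centos/ext_data';       -- dropping t_ext later removes only the metadata; the files under this path remain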

Running the jar generated by hive

--------------------
hadoop jar job.jar org.apache.hadoop.hive.ql.exec.mr.ExecDriver

CTAS

create table t2 as select * from employee ;     //CTAS: copies the data as well
create table t3 like t1 ;                       //copies only the table definition, no data

Connecting to hive through hiveserver2

1. Start the server
    $>hiveserver2 &
2. Connect with beeline
    $>beeline
    $beeline>!connect jdbc:hive2://localhost:10000/default ;
3. Run statements from the beeline prompt
    $beeline>select * from t1 ;
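
beeline can also run a statement non-interactively in one call (a sketch; the user name and password follow the hive-site.xml values above):
    $>beeline -u jdbc:hive2://localhost:10000/default -n root -p root -e "select * from t1;"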

Storage formats

[Column-oriented formats]
rcfile | orc | parquet

select * from t ;       //full-column scan
select id from t ;      //projection query: only the needed columns are read

kafka

Zero-copy + sequential (linear) disk reads and writes.

Creating column-store tables in hive

Sequential disk I/O + projection queries:
create table t_orc     (id int , name string , age int) stored as orc ;
create table t_rcfile  (id int , name string , age int) stored as rcfile ;
create table t_parquet (id int , name string , age int) stored as parquet ;
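
Columnar tables are usually filled from an existing row-oriented table with insert ... select; a sketch reusing the emp table that is created later in this post:
insert into t_orc select id , name , age from emp ;
select id from t_orc ;          //the projection reads only the id column from the ORC files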

hive partitions

A partition is simply a subdirectory under the table's directory.

1. Create a partitioned table

CREATE external TABLE t_ext_par
(
id int ,
name string
)
PARTITIONED BY (prov string, city string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
lines terminated by '\n';

//Show the partitions of a table
$>SHOW PARTITIONS t_ext_par;
//Add partitions
alter table t_ext_par add partition(prov='henan' , city='kaifeng') partition(prov='henan' , city='zhengzhou');
//Drop a partition
alter table t_ext_par DROP IF EXISTS PARTITION (prov='henan' , city='kaifeng');
//Load data into a specific partition
load data local inpath '/home/centos/1.txt' into table t_ext_par partition(prov='henan' , city='kaifeng') ;
//Insert query results into partitions
insert into t_ext_par partition(prov , city) select 1 , 'tom' , 'hebei' , 'shijiazhuang';
insert into t_ext_par partition(prov='hebei' , city) select 1 , 'tom' , 'cc';        //one static partition column, city is dynamic
insert into t_ext_par partition(prov , city) select 1 , 'tom' , 'Liaoning' , 'DL';   //both partition columns are dynamic

[Dynamic partitioning]
Whether partitions may be created dynamically in DML/DDL. Default: true.
set hive.exec.dynamic.partition=true            //enable dynamic partitioning
set hive.exec.dynamic.partition=false           //disable dynamic partitioning
set hive.exec.dynamic.partition                 //show the current setting

[Partition mode: strict vs. nonstrict]
set hive.exec.dynamic.partition.mode=strict     //strict mode: at least one partition column must be static
set hive.exec.dynamic.partition.mode=nonstrict  //nonstrict mode: all partition columns may be dynamic
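
For reference, each partition becomes a nested directory, and a filter on the partition columns prunes everything else (the warehouse path below is the default location and only an assumption for this table):
-- data for the partition lands under something like:
--   /user/hive/warehouse/t_ext_par/prov=henan/city=kaifeng/
select * from t_ext_par where prov = 'henan' and city = 'kaifeng' ;     -- only that directory is scanned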

Bucketed tables

1. Create statement
    CREATE TABLE t_buck
    (
    id int ,
    name string
    )
    CLUSTERED BY (id) INTO 3 BUCKETS
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    lines terminated by '\n';
2. Choosing the number of buckets
    Aim for each bucket to hold roughly twice the block size (128M), i.e. about 256M.
    For an estimated 500M of data that works out to about 2 buckets.
3. Notes
    load data does not bucket the rows; buckets are only populated correctly through insert ... select (see the sketch below).
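
A sketch of filling the bucketed table through insert ... select (hive.enforce.bucketing tells older Hive versions to derive the reducer count from the bucket count; the source table emp is defined later in this post):
set hive.enforce.bucketing = true ;
insert into t_buck select id , name from emp ;      -- rows are hashed on id into 3 bucket files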

Setting the number of reducers in hive

$>set mapreduce.job.reduces=3 ;
$>select * from t order by id desc ;            //order by always uses a single reducer to produce a total order
$>select id , count(*) from t group by id ;     //affected by the configured number of reducers

view

//Create
create view v_emp as select name , arr[0] , stru.sex , map1['DB'] , map2['Product'][0] from employee ;
//Alter
alter view v_emp as select * from t1 ;
//Drop
drop view if exists v_emp ;

Queries

select distinct * from t1 ;     //distinct can use however many reducers are configured

Fetch task conversion: none | minimal | more

SET hive.fetch.task.conversion=more;
select * from t ;           //runs as a simple fetch, no MR job
SET hive.fetch.task.conversion=none;
select * from t ;           //converted into an MR job

Prepare the data

CREATE TABLE custs
(
id int ,
name string,
age int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
lines terminated by '\n' stored as textfile ;

CREATE TABLE orders
(
id int ,
orderno string,
price float ,
cid int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
lines terminated by '\n' stored as textfile ;

Joins

//Cross join (Cartesian product)
select c.* , o.* from custs c cross join orders o ;
//Equi-join
select c.* , o.* from custs c cross join orders o on o.cid = c.id ;
//Non-equi join
select c.* , o.* from custs c cross join orders o on o.cid <> c.id ;
//Map-side join query hint
select /*+mapjoin(custs)*/ c.* , o.* FROM custs c CROSS JOIN orders o WHERE c.id <> o.cid ;
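
Besides the /*+mapjoin*/ hint, Hive can convert joins to map-side joins automatically (a sketch; 25000000 is the usual default threshold and only illustrative here):
set hive.auto.convert.join = true ;                 -- convert joins with a small enough table into map joins
set hive.mapjoin.smalltable.filesize = 25000000 ;   -- tables below about 25MB count as the small side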

Exporting a table

Exports not only the table definition but also the data.
export table mydb.custs to '/user/centos/custs.dat' ;
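
The exported directory can be restored with import; the target table name custs_2 is just an example:
import table custs_2 from '/user/centos/custs.dat' ;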

MR strict mode

set hive.mapred.mode=strict ;
select * from t_ext_par ;               //rejected: in strict mode a query on a partitioned table must filter on a partition column
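
Adding a partition filter makes the same query acceptable under strict mode:
select * from t_ext_par where prov = 'henan' ;      //allowed: filters on the partition column prov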

Sorting

1. order by
    Total ordering through a single reducer (prone to data skew).
2. sort by (partial sort)
    Sorts within each reducer.
    create table tmp2 as select * from orders sort by id desc ;
3. distribute by
    Distributes rows to reducers by the given expression; this is the partitioning step.
    create table tmp3 as select * from orders distribute by (id % 2) sort by id desc ;
4. cluster by
    Shorthand: when distribute by and sort by use the same column, cluster by does both at once.
    select * from orders distribute by id sort by id ;
    select * from orders cluster by id ;

Functions

show functions ;                //list all functions
desc function split;            //describe a specific function
desc function extended split;   //extended information
select case when age < 13 then '0' when age > 13 then '2' else '1' end from custs ;
//Collection functions
select array(1,2,3) ;
select struct(1,2,3).`col1` ;   //access a field of a struct
select map('id' , 1 , 'price' , 100.5) ;
select map('id' , 1 , 'price' , 100.5)['id'] ;

Virtual columns

select id , INPUT__FILE__NAME from t1 ;    //name of the input file each row came from

Transactional tables

hive supports transactions with restrictions. 1. Set the related properties:

SET hive.support.concurrency = true;

SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on = true;
SET hive.compactor.worker.threads = 1;
2. Create an ORC bucketed table that carries the transactional property
create table t_tx(id int ,name string , age int)
clustered by (id) into 2 buckets
stored as orc
TBLPROPERTIES('transactional'='true');
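
With the step 1 properties set in the session, row-level DML works on this table; a minimal sketch:
insert into t_tx values (1 , 'tom' , 22) ;
update t_tx set age = 23 where id = 1 ;     -- update/delete are only allowed on transactional ORC bucketed tables
delete from t_tx where id = 1 ;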

Prepare the data

1. Create emp
    CREATE TABLE emp
    (
    id int ,
    name string ,
    age int,
    dep string,
    salary int
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    lines terminated by '\n'
    stored as textfile;
2. Upload the data
    1,tom,22,001,3000
    2,tomas,22,001,2500
    3,tomasLee,25,002,1200
    4,tomson,24,002,3400
    5,peter,24,001,3800
    6,john,25,001,4200
    7,looser,24,001,5500
    8,alex,25,003,6000
    9,jason,24,003,6000
    10,japser,22,003,3000

Advanced aggregation

select dep , avg(salary) from emp group by dep ;
select age , avg(salary) from emp group by age ;
//Union of the two results: needs 3 jobs
select dep , avg(salary) from emp group by dep
union
select age , avg(salary) from emp group by age ;
//group by + grouping sets()
select avg(salary) from emp group by age,dep grouping sets((age) , (dep)) ;
select avg(salary) from emp group by age,dep grouping sets((age) , (dep) , (age,dep)) ;
select avg(salary) from emp group by age,dep grouping sets((age) , (dep) , (age,dep) , ()) ;
//rollup
select age,dep ,avg(salary) from emp group by age,dep with rollup ;
select age,dep ,avg(salary) from emp group by age,dep grouping sets((age,dep) , (age) , ()) ;
//cube
select age,dep ,avg(salary) from emp group by age,dep with cube ;
//GROUP BY a,b,c WITH CUBE is equivalent to:
//GROUP BY a,b,c GROUPING SETS ((a,b,c),(a,b),(b,c),(a,c),(a),(b),(c),())
select substr(name,1,1) ,age , dep , avg(salary) from emp group by substr(name,1,1) ,age , dep with cube ;

Analytic (window) functions

Function (arg1, ..., argn) OVER ([PARTITION BY <...>] [ORDER BY <...>] [<window clause>])

//order by over the full data set
select id,name,age,salary , avg(salary) over (partition by dep) from emp order by dep;
//first_value / last_value
select id,name,age,salary , first_value(salary) over (order by id desc) from emp order by id asc;
select id,name,age,salary , last_value(salary) over (order by id desc) from emp order by id asc;
select id, name, LAST_VALUE(salary) OVER (PARTITION BY dep ORDER BY salary rows BETWEEN UNBOUNDED PRECEDING AND current row) as t2 from emp;
[Window clause] The default is a row window from unbounded preceding to the current row.
 rows BETWEEN UNBOUNDED PRECEDING AND current row          //row window
 rows BETWEEN UNBOUNDED PRECEDING AND unbounded following  //row window
 range BETWEEN 1 PRECEDING AND 2 following                 //range window
select id,name,dep,salary,MAX(salary) OVER (PARTITION BY dep ORDER BY salary range BETWEEN 1 PRECEDING AND 1 following) as t2 from emp;
[Specific analytic functions]
//Ranking, with gaps
select id,name,dep,salary,rank() OVER (PARTITION BY dep ORDER BY salary desc) as t2 from emp;
//Dense ranking, no gaps
select id,name,dep,salary,dense_rank() OVER (PARTITION BY dep ORDER BY salary desc) as t2 from emp;
select id,name,dep,salary,percent_rank() OVER (PARTITION BY dep ORDER BY salary desc) as t2 from emp;
select id,name,dep,salary,ntile(4) OVER (PARTITION BY dep ORDER BY salary desc) as t2 from emp;
//lag: look back n rows
select id,name,dep,salary,lag(salary,2,-1000) OVER (PARTITION BY dep ORDER BY salary desc) as t2 from emp;
//lead: look ahead n rows
select id,name,dep,salary,lead(salary,2,-1000) OVER (PARTITION BY dep ORDER BY salary desc) as t2 from emp;
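
The ranking functions above are typically wrapped in a subquery, for example to get the top 2 earners per department (a sketch built on the emp table):
select id , name , dep , salary
from (
  select id , name , dep , salary ,
         dense_rank() over (partition by dep order by salary desc) rk
  from emp
) t
where t.rk <= 2 ;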

[Indexes]

CREATE INDEX idx_emp_name ON TABLE emp(name) AS 'COMPACT' WITH DEFERRED REBUILD;
ALTER INDEX idx_emp_name ON emp REBUILD;

[Data files]

1. File formats: hive supports TEXTFILE, SEQUENCEFILE, RCFILE, ORC, and PARQUET.
Specifying the file format:
//when creating a table
CREATE TABLE… STORED AS
//changing an existing table (or partition)
ALTER TABLE… [PARTITION partition_spec] SET FILEFORMAT
//setting the default table format
SET hive.default.fileformat=
2. TextFile
Plain text files; can be compressed with lzo, bzip2, ... Bzip2 gives a high compression ratio but takes long and is splittable; if a compressed file is not
splittable, the whole large file is handled by a single map task.
3. SequenceFile
Sequence files store key/value pairs in a compact format, well suited as MR output; they are compressible and splittable.
Both sequence files and text files are row-oriented; by contrast, RCFile/orc/parquet are column-oriented formats.
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
4. RCFile
Record Columnar File.
Key/value based, similar to sequencefile, but the data file is split horizontally into row groups.
Several groups are stored in one HDFS block; within a group, all values of the first column are stored, then the second column, and so on.
The file is splittable, and irrelevant columns can be skipped, so data is read faster and at lower cost.
5. orc
Optimized Row Columnar.
An enhanced version of RCFile.
Supports large blocks (256M).
Unlike rcfile, it uses type-aware encoders and optimizes compression according to each column's data type.
It also stores simple per-column statistics (MIN, MAX, SUM, and COUNT) plus a lightweight
index.
Its ecosystem support is narrower: mainly hive and pig.
6. parquet
Similar to orc; compared with the orc format, parquet is supported by most projects in the hadoop ecosystem.

[Compression]

1. Compression of intermediate MR output
//whether intermediate results (between the jobs of a multi-job query) are compressed
SET hive.exec.compress.intermediate=true
//codec for the intermediate results
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
//whether the final job output is compressed
SET hive.exec.compress.output=true
//codec for the output
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
Codec     Class                                          Extension   Splittable
Deflate   org.apache.hadoop.io.compress.DefaultCodec     .deflate    N
GZip      org.apache.hadoop.io.compress.GzipCodec        .gz         N
Bzip2     org.apache.hadoop.io.compress.BZip2Codec       .bz2        Y
LZO       com.hadoop.compression.lzo.LzopCodec           .lzo        N (splittable after preprocessing with an lzo index)
LZ4       org.apache.hadoop.io.compress.Lz4Codec         .lz4        N
Snappy    org.apache.hadoop.io.compress.SnappyCodec      .snappy     N

//Output compression
mapreduce.output.fileoutputformat.compress=false
hive.exec.compress.output=true
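
Columnar formats also carry their own compression settings as table properties; for example an ORC table can be created with Snappy compression (orc.compress is an ORC table property; the table name here is just an example):
create table t_orc_snappy (id int , name string , age int)
stored as orc
TBLPROPERTIES('orc.compress'='SNAPPY') ;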

Merging small files

-- maximum input size per map; determines how many files remain after merging
set mapred.max.split.size=256000000;
-- minimum split size on a single node; determines whether files on different data nodes get merged
set mapred.min.split.size.per.node=100000000;
-- minimum split size per rack; determines whether files on different racks get merged
set mapred.min.split.size.per.rack=100000000;
-- merge small files before the map phase
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

Sampling

SELECT * FROM emp TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s;
SELECT * FROM emp TABLESAMPLE(BUCKET 3 OUT OF 16 ON id) ;
SELECT * FROM emp TABLESAMPLE(50 PERCENT) s ;
SELECT * FROM emp TABLESAMPLE(100M) s;
SELECT * FROM emp TABLESAMPLE(10 ROWS);

Execution order of a query

select ... from ... where ... group by ... having ... order by ... limit ... ;
Execution order:
1. from
2. where
3. group by
4. having
5. select
6. order by
7. limit

JOB : M+R?M*            //one or more map phases, an optional reduce phase, then zero or more map-only phases

hive
----------------------
select ... from ... where ... group by ... having ... order by ... limit ... ;
[Statements that generate MR]
group by            //partition (map) + grouping (reduce)
group by + having   //partition (map) + grouping (reduce) + filter
order by            //1 reducer + sort
distinct
join

//partitioning + sort comparator
distribute by + sort by

explain (show the execution plan)

explain select dep,max(salary) mx from emp group by dep having mx > 4000 ;

OK

STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1

STAGE PLANS:

Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: emp
Statistics: Num rows: 1 Data size: 203 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: dep (type: string), salary (type: int)
outputColumnNames: dep, salary
Statistics: Num rows: 1 Data size: 203 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: max(salary)
keys: dep (type: string)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 203 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 203 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: int)
Reduce Operator Tree:
Group By Operator
aggregations: max(VALUE._col0)
keys: KEY._col0 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 203 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (_col1 > 4000) (type: boolean)
Statistics: Num rows: 1 Data size: 203 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 203 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

Stage: Stage-0

Fetch Operator
limit: -1
Processor Tree:
ListSink


1. Exercise: upload the temperature data into the temps table and use distribute by + sort by to achieve a total order (a sketch follows; see also the last section of hive基础1).
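
A sketch of one way to do it, assuming a hypothetical temps(year int, temp int) schema (the actual schema is in hive基础1): give each reducer a disjoint temperature range with distribute by, then sort inside each reducer.
set mapreduce.job.reduces=2 ;
-- temperatures below 50 go to one reducer, the rest to the other (50 is an arbitrary split point for the assumed data)
create table temps_sorted as
select year , temp from temps
distribute by case when temp < 50 then 0 else 1 end
sort by temp ;
-- concatenating the reducer outputs in partition order then yields a total order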

Reposted from: https://www.cnblogs.com/star521/p/9702974.html
