Hive Basics: Creating Tables

2017-12-11

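###What is Hive? It maps a table onto structured data that has already been processed

1 A data warehouse tool built on Hadoop

2 Maps structured data to a database table; in essence, it translates SQL into MapReduce programs

Mapping relationship

3 A Hive table is a mapping onto a set of structured data stored on HDFS

Tables live under /user/hive/warehouse by default

Database -> folder

Table -> subfolder

Notes:

1) The number and types of the table's columns must match the number and types of the fields in the structured data

2) When creating the table, specify the delimiter used by the structured data being mapped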
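The database-to-folder and table-to-subfolder mapping can be checked from the Hive CLI. The following is a minimal sketch, assuming the default warehouse path is in use; `demo_db` and `demo_tbl` are hypothetical names that do not appear in these notes.

```sql
-- demo_db and demo_tbl are hypothetical names used only for illustration
create database if not exists demo_db;

create table if not exists demo_db.demo_tbl(id int, name string)
row format delimited fields terminated by ',';

-- With the default warehouse location, the database is a folder named
-- demo_db.db and the table is a subfolder inside it
dfs -ls /user/hive/warehouse/demo_db.db;

-- Tables in the built-in "default" database sit directly under the warehouse root
dfs -ls /user/hive/warehouse;
```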

####3.1 Internal and external tables

Create internal tables

create table student(Sno int,Sname string,Sex string,Sage int,Sdept string) row format delimited fields terminated by ',';

create table source_table(id int,name string) row format delimited fields terminated by ',';

Create an external table

create external table student_ext(Sno int,Sname string,Sex string,Sage int,Sdept string) row format delimited fields terminated by ',' location '/stu';

Load data into the internal and external tables:

load data local inpath '/root/hivedata/students.txt' overwrite into table student;

load data inpath '/stu' into table student_ext;

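location specifies the HDFS directory that holds an external table's data. Files already in that directory can be queried through the table without a separate load.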
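The practical difference between the two table kinds shows up in the table metadata and when a table is dropped. A short sketch using the `student` and `student_ext` tables defined above; the drop behavior in the comments is standard Hive behavior rather than something stated in these notes.

```sql
-- Show the table type (MANAGED_TABLE vs EXTERNAL_TABLE) and the HDFS location
describe formatted student;
describe formatted student_ext;

-- Files already under /stu are readable through the external table
select * from student_ext limit 5;

-- Dropping an external table removes only the metadata; the files under /stu stay.
-- Dropping an internal table also deletes its directory under the warehouse path.
drop table student_ext;
```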

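####3.2 ROW FORMAT DELIMITED (specifying delimiters)

create table day_table (id int, content string) partitioned by (dt string) row format delimited fields terminated by ',';   -- create a partitioned table with an explicit field delimiter

Specifying delimiters for tables with complex types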

create table complex_array(name string,work_locations array<string>) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY ',';

Sample data (the name and the array are separated by a tab, array items by commas):

zhangsan   beijing,shanghai,tianjin,hangzhou
wangwu   shanghai,chengdu,wuhan,haerbin
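Once the file is loaded, the array column can be indexed, measured, searched, and flattened. A minimal sketch against the `complex_array` table above; the local file path is an assumption made for illustration.

```sql
-- The local path below is an assumption; point it at the tab-separated sample file
load data local inpath '/root/hivedata/complex_array.txt' overwrite into table complex_array;

-- Index into the array (0-based) and count its elements
select name, work_locations[0] as first_city, size(work_locations) as n_cities
from complex_array;

-- Find people whose array contains a given city
select name from complex_array where array_contains(work_locations, 'shanghai');

-- Flatten the array: one output row per (name, city) pair
select name, city
from complex_array lateral view explode(work_locations) t as city;
```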

Table definition with a map column:

create table t_map(id int,name string,hobby map<string,string>)
row format delimited 
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':' ;

Data:

1,zhangsan,singing:loves-dancing:likes-swimming:neutral
2,lisi,gaming:loves-basketball:dislikes
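The three delimiters split each line into fields, then map entries, then key and value. A short sketch of how the resulting map column can be queried, assuming the data sits in a local file whose path is hypothetical.

```sql
-- The local path below is an assumption used for illustration
load data local inpath '/root/hivedata/t_map.txt' overwrite into table t_map;

-- Count the entries and list the keys and values of each map
select id, name, size(hobby) as n_hobbies, map_keys(hobby) as hobbies, map_values(hobby) as levels
from t_map;

-- Flatten the map: one output row per (key, value) entry
select id, name, k, v
from t_map lateral view explode(hobby) t as k, v;
```

A single value can also be read with `hobby['<key>']`, where the key is one of the strings that appears before a `:` in the data.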

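###4 Partitioning (PARTITIONED BY):

Local mode (lets Hive decide automatically to run small jobs locally)
set hive.exec.mode.local.auto=true;

A partition column in Hive cannot be a column that already exists in the table.
A partition column is a virtual column: it stores no data in the files, and its value comes from the directory level specified when data is loaded into the partitioned table.

Partitioned tables come in two forms. A single-partition table has only one level of subdirectories under the table directory. A multi-partition table has nested subdirectories under the table directory.

Single-partition table definition: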

create table day_table (id int, content string) partitioned by (dt string);

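A single-partition table, partitioned by day; the table schema contains three columns: id, content, and dt.

Two-level partition table definition:

create table day_hour_table (id int, content string) partitioned by (dt string, hour string);

A two-level partitioned table, partitioned by day and hour; the table schema gains two additional columns, dt and hour.

Loading data: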

LOAD DATA local INPATH '/root/hivedata/dat_table.txt' INTO TABLE day_table partition(dt='2017-07-07');

LOAD DATA local INPATH '/root/hivedata/dat_table.txt' INTO TABLE day_hour_table PARTITION(dt='2017-07-07', hour='08');
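After these loads, each partition value becomes a directory level under the table directory. A sketch for inspecting the layout from the Hive CLI, assuming both tables are in the default database and the default warehouse path is in use.

```sql
-- Single-partition table: one directory level, e.g. dt=2017-07-07
dfs -ls /user/hive/warehouse/day_table;

-- Two-level partitioned table: hour=... directories nested inside dt=...
dfs -ls /user/hive/warehouse/day_hour_table/dt=2017-07-07;

-- The loaded file sits in the innermost partition directory
dfs -ls /user/hive/warehouse/day_hour_table/dt=2017-07-07/hour=08;
```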

Querying by partition:

SELECT day_table.* FROM day_table WHERE day_table.dt = '2017-07-07';

Listing partitions:

show partitions day_hour_table;

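In short, partitions assist queries: they narrow the range of data scanned, speed up retrieval, and let you organize data according to defined rules and conditions.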
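The narrowing effect is easiest to see on the two-level table, where a filter on both partition columns restricts the scan to one directory. A short sketch using `day_hour_table` from above; the `hour='09'` value is illustrative only.

```sql
-- Filter on both partition columns so only the dt=2017-07-07/hour=08 directory is read
select * from day_hour_table where dt = '2017-07-07' and hour = '08';

-- List only the partitions that belong to one day
show partitions day_hour_table partition(dt='2017-07-07');

-- Register an empty partition ahead of loading (the hour value is illustrative)
alter table day_hour_table add if not exists partition(dt='2017-07-07', hour='09');
```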

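###5 Bucketing (CLUSTERED BY):

A bucketing column in Hive must be a column that already exists in the table.
Data is inserted into a bucketed table with insert + select.

Bucketing always runs a MapReduce job; it corresponds to the partitioner in MapReduce and works at the file level.

Bucketing enforcement is off by default in Hive 1.x; in Hive 2.0 and later the hive.enforce.bucketing setting was removed and bucketing is always enforced.

Enable bucketing: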

set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;
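With those settings in place, a bucketed table can be created and filled through insert + select. A minimal sketch that buckets the `student` table from section 3.1 on `Sno`; `stu_buck` is a hypothetical table name, and the bucket count of 4 is chosen to match the reducer count set above.

```sql
-- stu_buck is a hypothetical name; the bucketing column Sno already exists in the table
create table stu_buck(Sno int,Sname string,Sex string,Sage int,Sdept string)
clustered by(Sno) into 4 buckets
row format delimited fields terminated by ',';

-- Fill the buckets with insert + select; cluster by routes rows to reducers by Sno
insert overwrite table stu_buck
select * from student cluster by(Sno);

-- Read back a single bucket by sampling on the bucketing column
select * from stu_buck tablesample(bucket 1 out of 4 on Sno);
```

Matching the number of reducers to the number of buckets means each reducer writes exactly one bucket file.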

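Dynamic partitioning

Enable dynamic partitioning:

-- Whether dynamic partitioning is enabled; off (false) by default in older Hive releases, on by default since Hive 0.9.0.
set hive.exec.dynamic.partition=true;

-- Dynamic partition mode; the default, strict, requires at least one static partition column, while nonstrict allows every partition column to be dynamic.
set hive.exec.dynamic.partition.mode=nonstrict;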
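With both settings applied, the partition value can come from the query itself instead of being written into the statement. A minimal sketch that fills `day_table` from a staging table; `day_table_src` is a hypothetical table introduced only for this example.

```sql
-- day_table_src is a hypothetical staging table; its dt column is ordinary data here
create table day_table_src (id int, content string, dt string)
row format delimited fields terminated by ',';

-- The dynamic partition column must come last in the select list;
-- Hive creates one dt=... directory per distinct value it sees
insert overwrite table day_table partition(dt)
select id, content, dt from day_table_src;

-- Confirm which partitions were created
show partitions day_table;
```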

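cluster by(col): distribute + sort on the same column, ascending order only

sort by(col asc|desc): sort only, on a column of your choice, within each reducer

distribute by(col): distribute only, sending rows with the same value to the same reducer

order by(col): global sort, using a single reducer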
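The four clauses can be compared side by side on the `student` table from section 3.1. A short sketch, with the reducer count set to 4 purely for illustration.

```sql
-- Use several reducers so the difference between the clauses is visible
set mapreduce.job.reduces=4;

-- distribute by + sort by: rows with the same Sdept go to the same reducer,
-- and each reducer sorts its own rows by Sage in descending order
select * from student distribute by Sdept sort by Sage desc;

-- cluster by Sno is shorthand for "distribute by Sno sort by Sno" (ascending only)
select * from student cluster by Sno;

-- sort by alone: each reducer's output is sorted, but there is no global order
select * from student sort by Sage desc;

-- order by: one total order across all rows, produced by a single reducer
select * from student order by Sage desc;
```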