HiveDDL数据操作

一、 Hive集合数据类型

除了常见的基本数据类型，Hive还支持三种复杂的数据类型ARRAY、MAP和STRUCT。ARRAY和MAP与Java中的Array和Map类似，而STRUCT与C语言中的Struct类似，它封装了一个命名字段集合，复杂数据类型允许任意层次的嵌套。

数据类型	描述	语法示例
STRUCT	和c语言中的struct类似，都可以通过“点”符号访问元素内容。例如，如果某个列的数据类型是STRUCT{first STRING, last STRING},那么第1个元素可以通过字段.first来引用。	struct()
MAP	MAP是一组键-值对元组集合，使用数组表示法可以访问数据。例如，如果某个列的数据类型是MAP，其中键->值对是’first’->’John’和’last’->’Doe’，那么可以通过字段名[‘last’]获取最后一个元素	map()
ARRAY	数组是一组具有相同类型和名称的变量的集合。这些变量称为数组的元素，每个数组元素都有一个编号，编号从零开始。例如，数组值为[‘John’, ‘Doe’]，那么第2个元素可以通过数组名[1]进行引用。	Array()

1. 案例实操

假设某表有如下一行，我们用JSON格式来表示其数据结构。下面这个json字符串存在列表，键值对，以及结构体

{
    "name":"songsong",
    "friends":[ // Array
        "bingbing",
        "lili"
    ],
    "children":{ // Map
        "xiao song":18,
        "xiaoxiao song":19
    },
    "address":{ // Struct
        "street":"huilongguan",
        "city":"beijing"
    }
}

现在我们尝试创建对应的表，并将上述的json中包含的数据导入到这个表中。
首先我们创建本地的测试文件test.txt

1 2	songsong,bingbing_lili,xiaosong:18_xiaoxiaosong:19,huilongguan_beijing yangyang,caicai_susu,xiaoyang:18_xiaoxiaoyang:19,chaoyang_beijing

注意:MAP，STRUCT和ARRAY里的元素间关系都可以用同一个字符表示，这里用“_”。

首先我们创建一个数据表

create table test(
name string,
friend array<string>,
children map<string, int>,
address struct<street:string, city:string>
)
row format delimited fields terminated by ',' -- 列分隔符
collection items terminated by '_' -- MAP，ARRAY，STRUCT的分隔符
map keys terminated by ':' -- MAP中的key与value的分隔符
lines terminated by '\n'; -- 行分隔符

将文本数据导入到测试表中

1	load data local inpath "/filepath/test.txt" into table test;

之后我们尝试访问三种集合列中的数据

1	select friend[1], children['xiaosong'], address.city from test where name = 'songsong';

执行后可以发现返回结果

OK
_c0 _c1 city
lili 18 beijing
Time taken: 0.076 seconds, Fetched: 1 row(s)

2. 类型转换

使用CAST操作显示进行数据类型转换
例如CAST(‘1’ AS INT)将把字符串’1’转换成整数1;如果强制类型转换失败，如执行CAST(‘X’ AS INT)，表达式返回空值NULL。

二、 DDL数据操作

1.创建数据库

为了避免已创建的数据库已存在，建议增加if not exists进行判断

1	create database if not exists db_hive;

创建数据库可以指定HDFS上存放的位置

1	create database db_hive location '/db_hive2.db';

3. 查询数据库

显示数据库

1	show databases;

过滤显示查询的数据库

1	show databases like 'db_hive*';

4. 查看数据库详细

显示数据库信息

1	desc database db_hive;

查询数据库详细信息， extended

1	desc database extended db_hive;

5. 切换当前数据库

1	use db_hive;

6. 修改数据库

用户可以使用ALTER DATABASE命令为某个数据库的DBPROPERTIES设置键-值对属性值，来描述这个数据库的属性信息。数据库的其他元数据信息都是不可更改的，包括数据库名和数据库所在的目录位置。

1	alter database hive set dbproperties('createtime'='20170830');

7. 删除数据库

最好采用if exists判断数据库是否存在

1	drop database if exists db_hive2;

对于数据库不为空的时候，可以使用cascade命令强制删除

1	drop database db_hive cascade;

8. 创建表

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] [CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]

字段解释说明
(1)CREATE TABLE创建一个指定名字的表。如果相同名字的表已经存在，则抛出异常;用户可以用IFNOTEXISTS选项来忽略这个异常。
(2)EXTERNAL 关键字可以让用户创建一个外部表，在建表的同时指定一个指向实际数据的路径(LOCATION)，Hive创建内部表时，会将数据移动到数据仓库指向的路径;若创建外部表，仅记录数据所在的路径，不对数据的位置做任何改变。在删除表的时候，内部表的元数据和数据会被一起删除，而外部表只删除元数据，不删除数据。
(3)COMMENT:为表和列添加注释。
(4)PARTITIONED BY创建分区表
(5)CLUSTERED BY创建分桶表
(6)SORTED BY不常用
(7)ROW FORMAT
DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char] [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char] | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, …)]
用户在建表的时候可以自定义SerDe或者使用自带的 SerDe。如果没有指定ROW FORMAT或者ROW FORMAT DELIMITED，将会使用自带的SerDe。在建表的时候，用户还需要为表指定列，用户在指定表的列的同时也会指定自定义的SerDe，Hive 通过SerDe确定表的具体的列的数据。SerDe是Serialize/Deserilize的简称，目的是用于序列化和反序列化。
(8)STORED AS指定存储文件类型,常用的存储文件类型:SEQUENCEFILE(二进制序列文件)、TEXTFILE(文本)、 RCFILE(列式存储格式文件),如果文件数据是纯文本，可以使用STORED AS TEXTFILE。如果数据需要压缩，使用STORED AS SEQUENCEFILE。
(9)LOCATION:指定表在HDFS上的存储位置。
(10)LIKE允许用户复制现有的表结构，但是不复制数据。

9. 内部表，外部表，分区表

略

三. 分区表常见操作

创建分区表

1
2
3

create table dept_partition(deptno int, dname string, loc string)
partitioned by (month string)
row format delimited fields terminated by '\t';

加载数据到分区表

1
2
3

load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201709');
hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201708');
hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201707’);

创建单个分区

1	alter table dept_partition add partition(month='201706') ;

创建多个分区

1	alter table dept_partition add partition(month='201705') partition(month='201704');

删除单个分区

1	alter table dept_partition drop partition (month='201704');

删除多个分区

1	alter table dept_partition drop partition (month='201705'), partition (month='201706');

记得区分创建多个和删除多个的时候，分隔符一个为空格一个为逗号。

创建二级分区
(1) 首先创建一个表

1
2
3

create table dept_partition2(deptno int, dname string, loc string)
partitioned by (month string, day string)
row format delimited fields terminated by '\t';

(2) 正常的加载数据,加载数据到二级分区表中

1	load data local inpath '/filepath/test.txt' into table default.dept_partition2 partition(moth='201709', day='13');

(3) 查询分区数据

1	select * from dept_partition2 where month='201709' and day='13';

当我们把数据直接上传到分区目录上的时候，我们需要让分区表和数据产生关联，一共有三种方式

(1) 上传数据后修复

1 2	dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201709/day=12; dfs -put /opt/module/datas/dept.txt /user/hive/warehouse/dept_partition2/month=201709/day=12;

此时我们是查询不到数据的，需要执行修复指令

1	msck repair table dept_partition2;

这个时候便可以查询到数据了

(2) 上传数据后添加分区

1 2	dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201709/day=12; dfs -put /opt/module/datas/dept.txt /user/hive/warehouse/dept_partition2/month=201709/day=12;

执行添加分区

1	alter table dept_partition2 add partition(month='201709',day='11');

(3) 创建文件夹后load数据到分区

1 2	dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201709/day=12; dfs -put /opt/module/datas/dept.txt /user/hive/warehouse/dept_partition2/month=201709/day=12;

上传数据

1	load data local inpath '/opt/module/datas/dept.txt' into table dept_partition2 partition(month='201709',day='10');