
Hadoop Learning Notes (11) -- Hive Basics

 

Creating the Data (Tab-Separated Text)

Create the data file:
touch test.txt
vim test.txt

The file contents, tab-separated:

1 zhangsan  20  Beijing
2 lisi  13 changsha
3 wangwu  31  chengdu
4 liuliu  14 kunming
5 mingming  21  guangzhou
6 huahua  51  xian
11  xiaoxia 26  tianjin

Creating a New Table

Create the table schema:
hive> CREATE TABLE test_hive (id int, name String, age int, addr String) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 1.629 seconds

Importing test.txt into the test_hive Table

Load the data:
hive> LOAD DATA LOCAL INPATH '/home/hadoop/test.txt' OVERWRITE INTO TABLE test_hive;
Loading data to table default.test_hive
OK
Time taken: 1.199 seconds
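
A note on the variants: LOCAL tells Hive to copy the file from the local filesystem; without it, the path is resolved on HDFS and the file is moved into the table's storage directory. OVERWRITE replaces any existing data, while a plain INTO TABLE appends. A minimal sketch (the HDFS path here is hypothetical):

#Load from HDFS, appending instead of overwriting
hive> LOAD DATA INPATH '/user/hadoop/test.txt' INTO TABLE test_hive;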

Working with the Table

#List all tables
hive> show tables;
OK
test_hive
Time taken: 0.106 seconds, Fetched: 1 row(s)

#List tables matching a pattern
hive> show tables '*test*';
OK
test_hive

#Query the table
hive> select * from test_hive;
OK
1   zhangsan    20  Beijing
2   lisi    13  changsha
3   wangwu  31  chengdu
4   liuliu  14  kunming
5   mingming    21  guangzhou
6   huahua  51  xian
11  xiaoxia 26  tianjin
Time taken: 0.116 seconds, Fetched: 7 row(s)

#Show the table schema
hive> desc test_hive;
OK
id                      int
name                    string
age                     int
addr                    string
Time taken: 0.068 seconds, Fetched: 4 row(s)

#Add a column
hive> ALTER TABLE test_hive ADD COLUMNS (mobile String);
OK
Time taken: 0.134 seconds

hive> select * from test_hive;
OK
1   zhangsan    20  Beijing NULL
2   lisi    13  changsha    NULL
3   wangwu  31  chengdu NULL
4   liuliu  14  kunming NULL
5   mingming    21  guangzhou   NULL
6   huahua  51  xian    NULL
11  xiaoxia 26  tianjin NULL
Time taken: 0.117 seconds, Fetched: 7 row(s)

Modifying Table Contents (ACID Support)

To support UPDATE and DELETE, some extra configuration is required. These operations are actually quite expensive: each one launches a MapReduce job to apply the change.
A simple example follows.

For details, see the official wiki:

https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-NewConfigurationParametersforTransactions

Modify hive-site.xml, then restart Hive:

hive.support.concurrency – true
hive.enforce.bucketing – true   (no longer required as of Hive 2.0)
hive.exec.dynamic.partition.mode – nonstrict
hive.txn.manager – org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.compactor.initiator.on – true
hive.compactor.worker.threads – 1

<property>
    <name>hive.support.concurrency</name>
    <value>true</value>
    <description>
      Whether Hive supports concurrency control or not.
      A ZooKeeper instance must be up and running when using zookeeper Hive lock manager
    </description>
</property>
<property>
    <name>hive.exec.dynamic.partition.mode</name>
    <value>nonstrict</value>
    <description>
      In strict mode, the user must specify at least one static partition
      in case the user accidentally overwrites all partitions.
      In nonstrict mode all partitions are allowed to be dynamic.
    </description>
</property>
<property>
    <name>hive.txn.manager</name>
    <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
    <description>
      Set to org.apache.hadoop.hive.ql.lockmgr.DbTxnManager as part of turning on Hive
      transactions, which also requires appropriate settings for hive.compactor.initiator.on,
      hive.compactor.worker.threads, hive.support.concurrency (true), hive.enforce.bucketing
      (true), and hive.exec.dynamic.partition.mode (nonstrict).
      The default DummyTxnManager replicates pre-Hive-0.13 behavior and provides
      no transactions.
    </description>
</property>
<property>
    <name>hive.compactor.initiator.on</name>
    <value>true</value>
    <description>
      Whether to run the initiator and cleaner threads on this metastore instance or not.
      Set this to true on one instance of the Thrift metastore service as part of turning
      on Hive transactions. For a complete list of parameters required for turning on
      transactions, see hive.txn.manager.
    </description>
</property>
<property>
    <name>hive.compactor.worker.threads</name>
    <value>1</value>
    <description>
      How many compactor worker threads to run on this metastore instance. Set this to a
      positive number on one or more instances of the Thrift metastore service as part of
      turning on Hive transactions. For a complete list of parameters required for turning
      on transactions, see hive.txn.manager.
      Worker threads spawn MapReduce jobs to do compactions. They do not do the compactions
      themselves. Increasing the number of worker threads will decrease the time it takes
      tables or partitions to be compacted once they are determined to need compaction.
      It will also increase the background load on the Hadoop cluster as more MapReduce jobs
      will be running in the background.
    </description>
</property>
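
After restarting, it is worth verifying that the settings took effect. In the Hive CLI, set <property>; prints the current value (a quick sanity check, assuming no overrides elsewhere):

hive> set hive.txn.manager;
hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive> set hive.support.concurrency;
hive.support.concurrency=true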

hive> create table test_acid
    (id int, name String, age int, addr String)
    clustered by (id) into 2 buckets
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    stored as orc TBLPROPERTIES ('transactional'='true');
OK
Time taken: 2.894 seconds

hive> show tables;
OK
test_acid
test_hive
Time taken: 0.22 seconds, Fetched: 2 row(s)



#Load data from test_hive into test_acid
hive> from test_hive
    insert into table test_acid
    select id,name,age,addr;

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = hadoop_20160527165406_6671a29e-3c04-4e41-8b20-8c1662830433
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1464339368801_0001, Tracking URL = http://localhost:8088/proxy/application_1464339368801_0001/
Kill Command = /opt/hadoop/hadoop-2.7.2/bin/hadoop job  -kill job_1464339368801_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2016-05-27 16:56:26,593 Stage-1 map = 0%,  reduce = 0%
2016-05-27 16:56:35,189 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.53 sec
2016-05-27 16:56:47,356 Stage-1 map = 100%,  reduce = 50%, Cumulative CPU 3.85 sec
2016-05-27 16:56:48,402 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.32 sec
MapReduce Total cumulative CPU time: 6 seconds 320 msec
Ended Job = job_1464339368801_0001
Loading data to table default.test_acid
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 2   Cumulative CPU: 6.32 sec   HDFS Read: 12460 HDFS Write: 1892 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 320 msec
OK
Time taken: 164.028 seconds

#Query the data
hive> select * from test_acid;
OK
6   huahua  51  xian
4   liuliu  14  kunming
2   lisi    13  changsha
11  xiaoxia 26  tianjin
5   mingming    21  guangzhou
3   wangwu  31  chengdu
1   zhangsan    20  Beijing

#Update a row
hive> update test_acid set age=99 where id = 1;

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = hadoop_20160527165857_2a47933b-aede-40d4-87cc-281e04b60fe2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1464339368801_0002, Tracking URL = http://localhost:8088/proxy/application_1464339368801_0002/
Kill Command = /opt/hadoop/hadoop-2.7.2/bin/hadoop job  -kill job_1464339368801_0002
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 2
2016-05-27 16:59:07,223 Stage-1 map = 0%,  reduce = 0%
2016-05-27 16:59:17,235 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 2.07 sec
2016-05-27 16:59:19,348 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.24 sec
2016-05-27 16:59:32,275 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 8.33 sec
MapReduce Total cumulative CPU time: 8 seconds 330 msec
Ended Job = job_1464339368801_0002
Loading data to table default.test_acid
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2  Reduce: 2   Cumulative CPU: 8.33 sec   HDFS Read: 23537 HDFS Write: 935 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 330 msec
OK
Time taken: 37.032 seconds

#Query the data after the update
hive> select * from test_acid;
OK
6   huahua  51  xian
4   liuliu  14  kunming
2   lisi    13  changsha
11  xiaoxia 26  tianjin
5   mingming    21  guangzhou
3   wangwu  31  chengdu
1   zhangsan    99  Beijing
Time taken: 0.159 seconds, Fetched: 7 row(s)

As this shows, even a single-row update takes tens of seconds because it runs as a full MapReduce job; Hive on Hadoop is poorly suited to real-time updates.
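
DELETE, mentioned above, works the same way on an ACID table and likewise launches a MapReduce job (a minimal sketch; the job output is omitted):

#Delete a row
hive> delete from test_acid where id = 11;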

Other Operations

1) Create a table with the same structure as an existing table (LIKE):

This copies only the table's structure, not its contents.

 create table test_like_table like test_bucket;
2) Rename a table (RENAME TO):
 ALTER TABLE table_name RENAME TO new_table_name
3) Add partitions (ADD PARTITION):
 ALTER TABLE table_name ADD partition_spec [ LOCATION 'location1' ] partition_spec [ LOCATION 'location2' ]
4) Change a column's name, data type, position, or comment:
 ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]
5) Add or replace columns (ADD/REPLACE COLUMNS):
 ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
 ADD COLUMNS appends new columns after the existing columns but before the partition columns.
6) Full CREATE TABLE syntax (see the concrete example below):
 CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
 [(col_name data_type [COMMENT col_comment], ...)]
 [COMMENT table_comment]
 [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
 [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
 [ROW FORMAT row_format]
 [STORED AS file_format]
 [LOCATION hdfs_path]
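
To make the template concrete, here is a hypothetical partitioned, bucketed table that exercises most of the optional clauses (the table and column names are made up for illustration):

 CREATE TABLE IF NOT EXISTS access_log (
     ts bigint COMMENT 'event timestamp',
     url string COMMENT 'requested URL')
 COMMENT 'example exercising the full CREATE TABLE syntax'
 PARTITIONED BY (dt string)
 CLUSTERED BY (ts) SORTED BY (ts ASC) INTO 4 BUCKETS
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
 STORED AS TEXTFILE;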

