impala初体验

  |   0 评论   |   2,061 浏览

背景

impala是Cloudera, MapR, Oracle和Amazon推出的基于Apache Hadoop的分析数据库。

架构

src

其中包括四个部分

  • 客户端:Hue, ODBC, JDBC,或者 Impala Shell
  • Hive Metastore:存储数据库表的元数据信息
  • Impala:impalad进程,在DataNodes上运行, 坐标(coordinates)和执行查询
  • HBase和HDFS:impalad进程,存储数据

特点

来自官方文档

  • 支持SQL92语法,包括SELECT,joinsaggregation函数
  • 支持HDFS, HBase和S3存储
  • 支持JDBC, ODBC和Hue
  • 支持Impala shell
  • 支持Kerberos认证

组件

  • impalad: Impala服务, 在HDFS, HBase和S3上计划和执行语句. 集群每个数据节点(DataNode)上都需要安装.
  • statestored: Name服务,监听所有impalad实例的位置和状态. 集群每个节点上都需要安装.
  • catalogd: Metadata坐标服务,向对应的节点广播DDL和DML语句.
  • impala-shell: 命令行接口.

安装

docker安装(未完成)

下载

docker pull cloudera/impala-dev:minimal

或者

docker pull registry.docker-cn.com/cloudera/impala-dev:minimal

或者直接下载, 但是需要注册

QuickStarts for CDH 5.13

由于下载太慢,故直接使用了虚拟机安装。

虚拟机安装

QuickStarts for CDH 5.13

cloudera impala tutorial

虚拟机文件:cloudera-quickstart-vm-5.13.0-0-virtualbox.zip 5.5G

默认安装的服务有:

  • solr 4.10.3
  • Oozie: 4.1.0-cdh5.13.0
  • Spark 1.6.0
  • Impala 2.10.0-cdh5.13.0
  • HBase 1.2.0-cdh5.13.0
  • Hadoop 2.6.0-cdh5.13.0
  • Yarn 2.6.0-cdh5.13.0
  • Hue
  • Kudu? 这个默认没有.

CentOS安装

TBD

使用

> impala-shell

或者

> impala-shell -i localhost --quiet

常用命令

  • select version();
  • show databases;
  • select current_database();
  • show tables;
  • show tables in default;
  • show tables in tpc like 'customer*';

查看数据库

[quickstart.cloudera:21000] > show databases;
Query: show databases
+------------------+----------------------------------------------+
| name             | comment                                      |
+------------------+----------------------------------------------+
| _impala_builtins | System database for Impala builtin functions |
| default          | Default Hive database                        |
+------------------+----------------------------------------------+
Fetched 2 row(s) in 0.11s

查看数据表

[quickstart.cloudera:21000] > show tables;
Query: show tables
Fetched 0 row(s) in 0.00s

查看版本号

[localhost:21000] > select version();
+-------------------------------------------------------------------------------------------+
| version()                                                                                 |
+-------------------------------------------------------------------------------------------+
| impalad version 2.10.0-cdh5.13.0 RELEASE (build 2511805f1eaa991df1460276c7e9f19d819cd4e4) |
| Built on Wed Oct  4 10:55:37 PDT 2017                                                     |
+-------------------------------------------------------------------------------------------+

增删改查

建数据库

[localhost:21000] > create database test;
[localhost:21000] > show databases like 'test';
+------+---------+
| name | comment |
+------+---------+
| test |         |
+------+---------+

建数据表

[localhost:21000] > use test;
[localhost:21000] > create table t1 (x int);
[localhost:21000] > desc t1;
+------+------+---------+
| name | type | comment |
+------+------+---------+
| x    | int  |         |
+------+------+---------+

插入数据

[localhost:21000] > insert into t1 values (1), (3), (2), (4);

查询数据

[localhost:21000] > select * from t1;
+---+
| x |
+---+
| 1 |
| 3 |
| 2 |
| 4 |
+---+
[localhost:21000] > select min(x), max(x), sum(x), avg(x) from t1;
+--------+--------+--------+--------+
| min(x) | max(x) | sum(x) | avg(x) |
+--------+--------+--------+--------+
| 1      | 4      | 10     | 2.5    |
+--------+--------+--------+--------+
[localhost:21000] > select count(distinct word) from t2;
+----------------------+
| count(distinct word) |
+----------------------+
| 3                    |
+----------------------+

join操作

[localhost:21000] > create table t2 (id int, word string);
[localhost:21000] > insert into t2 values (1, "one"), (3, "three"), (5, 'five');
[localhost:21000] > select word from t1 join t2 on (t1.x = t2.id);
+-------+
| word  |
+-------+
| one   |
| three |
+-------+

Load CSV 本地数据

上传一个CSV示例数据

[cloudera@quickstart ~]$ whoami
cloudera
[cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera

建立目录

[cloudera@quickstart ~]$ hdfs dfs -mkdir -p /user/cloudera/sample_data/tab1 /user/cloudera/sample_data/tab2
[cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera
Found 1 items
drwxr-xr-x   - cloudera cloudera          0 2018-04-17 08:15 /user/cloudera/sample_data

生成本地文件

生成本地文件 tab1.csvtab2.csv

[cloudera@quickstart ~]$ cat tab1.csv
1,true,123.123,2012-10-24 08:55:00
2,false,1243.5,2012-10-25 13:40:00
3,false,24453.325,2008-08-22 09:33:21.123
4,false,243423.325,2007-05-12 22:32:21.33454
5,true,243.325,1953-04-22 09:11:33
[cloudera@quickstart ~]$ cat tab2.csv
1,true,12789.123
2,false,1243.5
3,false,24453.325
4,false,2423.3254
5,true,243.325
60,false,243565423.325
70,true,243.325
80,false,243423.325
90,true,243.325

上传文件

[cloudera@quickstart ~]$ hdfs dfs -put tab1.csv /user/cloudera/sample_data/tab1
[cloudera@quickstart ~]$ hdfs dfs -put tab2.csv /user/cloudera/sample_data/tab2
[cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/sample_data
Found 2 items
drwxr-xr-x   - cloudera cloudera          0 2018-04-17 08:17 /user/cloudera/sample_data/tab1
drwxr-xr-x   - cloudera cloudera          0 2018-04-17 08:17 /user/cloudera/sample_data/tab2

impala-shell中建表

DROP TABLE IF EXISTS tab1;
-- The EXTERNAL clause means the data is located outside the central location
-- for Impala data files and is preserved when the associated Impala table is dropped.
-- We expect the data to already exist in the directory specified by the LOCATION clause.
CREATE EXTERNAL TABLE tab1
(
   id INT,
   col_1 BOOLEAN,
   col_2 DOUBLE,
   col_3 TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/sample_data/tab1';
DROP TABLE IF EXISTS tab2;
-- TAB2 is an external table, similar to TAB1.
CREATE EXTERNAL TABLE tab2
(
   id INT,
   col_1 BOOLEAN,
   col_2 DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/sample_data/tab2';
DROP TABLE IF EXISTS tab3;
-- Leaving out the EXTERNAL clause means the data will be managed
-- in the central Impala data directory tree. Rather than reading
-- existing data files when the table is created, we load the
-- data after creating the table.
CREATE TABLE tab3
(
   id INT,
   col_1 BOOLEAN,
   col_2 DOUBLE,
   month INT,
   day INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

查看结果

[quickstart.cloudera:21000] > select * from tab1;
+----+-------+------------+-------------------------------+
| id | col_1 | col_2      | col_3                         |
+----+-------+------------+-------------------------------+
| 1  | true  | 123.123    | 2012-10-24 08:55:00           |
| 2  | false | 1243.5     | 2012-10-25 13:40:00           |
| 3  | false | 24453.325  | 2008-08-22 09:33:21.123000000 |
| 4  | false | 243423.325 | 2007-05-12 22:32:21.334540000 |
| 5  | true  | 243.325    | 1953-04-22 09:11:33           |
+----+-------+------------+-------------------------------+
[quickstart.cloudera:21000] > select * from tab2;
+----+-------+---------------+
| id | col_1 | col_2         |
+----+-------+---------------+
| 1  | true  | 12789.123     |
| 2  | false | 1243.5        |
| 3  | false | 24453.325     |
| 4  | false | 2423.3254     |
| 5  | true  | 243.325       |
| 60 | false | 243565423.325 |
| 70 | true  | 243.325       |
| 80 | false | 243423.325    |
| 90 | true  | 243.325       |
+----+-------+---------------+

Aggregate and Join

[quickstart.cloudera:21000] > SELECT tab1.col_1, MAX(tab2.col_2), MIN(tab2.col_2) FROM tab2 JOIN tab1 USING (id) GROUP BY col_1 ORDER BY 1 LIMIT 5;
+-------+-----------------+-----------------+
| col_1 | max(tab2.col_2) | min(tab2.col_2) |
+-------+-----------------+-----------------+
| false | 24453.325       | 1243.5          |
| true  | 12789.123       | 243.325         |
+-------+-----------------+-----------------+

Subquery, Aggregate and Joins

[quickstart.cloudera:21000] > SELECT tab2.*
                            > FROM tab2,
                            > (SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2
                            >  FROM tab2, tab1
                            >  WHERE tab1.id = tab2.id
                            >  GROUP BY col_1) subquery1
                            > WHERE subquery1.max_col2 = tab2.col_2;
+----+-------+-----------+
| id | col_1 | col_2     |
+----+-------+-----------+
| 1  | true  | 12789.123 |
| 3  | false | 24453.325 |
+----+-------+-----------+

INSERT Query

[quickstart.cloudera:21000] > INSERT OVERWRITE TABLE tab3
                            > SELECT id, col_1, col_2, MONTH(col_3), DAYOFMONTH(col_3)
                            > FROM tab1 WHERE YEAR(col_3) = 2012;
[quickstart.cloudera:21000] > SELECT * FROM tab3;
+----+-------+---------+-------+-----+
| id | col_1 | col_2   | month | day |
+----+-------+---------+-------+-----+
| 1  | true  | 123.123 | 10    | 24  |
| 2  | false | 1243.5  | 10    | 25  |
+----+-------+---------+-------+-----+

参考

背景

搭建demo

搭建生产环境

评论

发表评论

validate