Channel: 碳基体

How to map HDFS web logs to a Hive table


A while ago, modeling the rules on ModSecurity, I wrote a search job in MapReduce; it was so slow it was basically unusable. On reflection, I decided to import the data into Hive and let Hive do the ETL work instead.
Since cluster storage was very tight and adding capacity was not an option, I went with an external table.

The HDFS directory layout is as follows:
/data/xxx/20151216/00/xxx.log 
Step 1: create the table

CREATE EXTERNAL TABLE IF NOT EXISTS xxx_access_log(
http_host STRING,
ip STRING,
time STRING,
http_method STRING,
uri STRING,
http_response_code STRING,
body_bytes_send INT,
referer STRING,
user_agent STRING,
x_forwarded_for STRING,
cookie STRING,
request_time STRING,
content_length INT,
request_body STRING)
COMMENT 'input xxx access log'
PARTITIONED BY(event_date STRING, event_hour STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES(
"input.regex" = "自己的正则")
STORED AS TEXTFILE
LOCATION "/data/xxx";

Step 2: add partitions to bring in the data

ALTER TABLE $table_name ADD IF NOT EXISTS
PARTITION (event_date='20151216', event_hour='00') LOCATION '/data/xxx/20151216/00'
PARTITION (event_date='20151216', event_hour='01') LOCATION '/data/xxx/20151216/01'
...
PARTITION (event_date='20151216', event_hour='23') LOCATION '/data/xxx/20151216/23';


Having to specify every partition by hand looks a bit clumsy, so I also looked up how to add partitions automatically:
http://stackoverflow.com/questions/7544378/hive-dynamic-partition-adding-to-external-table (still to be tried)
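In the meantime, the repetitive statement can at least be generated rather than typed. A minimal sketch; the table name, date, and base path are taken from the post, while the function name is made up:

```python
# Hypothetical helper: emit one ALTER TABLE statement covering all 24 hourly
# partitions of a day, matching the /data/xxx/<date>/<hour> layout above.
def add_partitions_hql(table, date, base='/data/xxx'):
    lines = [f'ALTER TABLE {table} ADD IF NOT EXISTS']
    for hour in range(24):
        h = f'{hour:02d}'  # hours are zero-padded directory names: 00..23
        lines.append(f"PARTITION (event_date='{date}', event_hour='{h}') "
                     f"LOCATION '{base}/{date}/{h}'")
    return '\n'.join(lines) + ';'

print(add_partitions_hql('xxx_access_log', '20151216'))
```

Redirecting the output to a file and running it with `hive -f` avoids pasting the statement by hand.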

Problem 1: when the SerDe uses the following class

ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES

then after the data is loaded, queries fail because Hive cannot find the hive-contrib-xxx.jar package.

Solution: put the location of hive-contrib-xxx.jar into an environment variable. In

hive/conf/hive-env.sh

add:

export HIVE_AUX_JARS_PATH=/home/xxx/hive/lib/hive-contrib-xxx.jar
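If editing hive-env.sh is not convenient, the standard Hive `ADD JAR` command loads the jar for the current session only (path as in the post):

```sql
ADD JAR /home/xxx/hive/lib/hive-contrib-xxx.jar;
```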


References:
http://hadooptutorial.info/processing-logs-in-hive/
