export HADOOP_USER_CLASSPATH_FIRST=true
Error 2:
Exception in thread "main" java.lang.RuntimeException: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:472)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:671)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)
at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:515)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:458)
... 8 more
Caused by: java.net.URISyntaxException: Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D
at java.net.URI.checkPath(URI.java:1804)
at java.net.URI.<init>(URI.java:752)
at org.apache.hadoop.fs.Path.initialize(Path.java:203)
... 11 more
Solution: replace ${system:java.io.tmpdir}/${system:user.name} with absolute paths
vim conf/hive-site.xml
Edit:
<property>
<name>hive.exec.local.scratchdir</name>
<value>/tmp/hive</value>
<description>Local scratch space for Hive jobs</description>
</property>
<property>
<name>hive.downloaded.resources.dir</name>
<value>/tmp/${hive.session.id}_resources</value>
<description>Temporary local directory for added resources in the remote file system.</description>
</property>
<property>
<name>hive.querylog.location</name>
<value>/tmp/hive</value>
<description>Location of Hive run time structured log file</description>
</property>
<property>
<name>hive.server2.logging.operation.log.location</name>
<value>/tmp/hive/operation_logs</value>
<description>Top level directory where operation logs are stored if logging functionality is enabled</description>
</property>
Error 3:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:javax.jdo.JDODataStoreException: An exception was thrown while adding/validating class(es) : Specified key was too long; max key length is 767 bytes
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 767 bytes
Solution:
alter database hive character set latin1;
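If the metastore database has not been created yet, you can also (a sketch, assuming a MySQL metastore database named hive as above) create it with the latin1 character set from the start:
create database hive character set latin1;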
Error 4:
An error caused by a Java version that is too old; use Java 7 or later (Java 6 occupies roughly the same position among JDKs that IE6 does among browsers).
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/hadoop/hdfs/server/namenode/NameNode : Unsupported major.minor version 51.0
III. Common HiveQL and writing UDFs
In one sentence: it is very, very similar to MySQL, so there is essentially no learning curve.
1. Common Hive client commands
hive -e "SQL statement";
hive -f test.hql                        # run a Hive query from a file
hive> ! pwd;                            # run a simple shell command
hive> dfs -ls /user/hive/warehouse;     # run a Hadoop dfs command
2. Data types supported by Hive
Hive has both primitive and collection types; for log analysis, string, bigint, double and map are usually all you need (see the sketch after the next item).
3. Hive's default delimiters for text files
\n separates rows; Ctrl+A (\001) separates fields (columns); in the end you will usually specify your own delimiters.
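To tie items 2 and 3 together, here is a minimal sketch (the table and column names are made up for illustration) that uses the four types mentioned above and specifies its own delimiters instead of the defaults:
create table demo_log (
  ip        string,
  bytes     bigint,
  duration  double,
  params    map<string,string>
)
row format delimited
  fields terminated by '\t'             -- instead of the default Ctrl+A
  collection items terminated by '&'    -- separator between map entries
  map keys terminated by '='            -- separator between a key and its value
lines terminated by '\n'
stored as textfile;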
4. Hive session settings
set hive.cli.print.current.db=true;   # show the current database in the prompt
set hive.cli.print.header=true;       # print column headers in query results
set hive.exec.mode.local.auto=true;   # local mode: skip MapReduce when the data set is small enough
set hive.mapred.mode=strict;          # strict mode: outside local mode, enforce query patterns that protect performance, e.g. where must filter on a partition and order by must be paired with limit
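For example, under hive.mapred.mode=strict a query has to look something like this (a sketch against a hypothetical partitioned table nginxlog_p; the tables in this article are not partitioned):
select remote_addr, count(*) as cnt
from nginxlog_p
where dt = '2015-05-12'   -- strict mode: queries on a partitioned table must filter on the partition column
group by remote_addr
order by cnt desc
limit 10;                 -- strict mode: order by must be paired with limit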
5. HiveQL: we get familiar with HiveQL by analyzing web logs
A sample log line:
127.0.0.1 [12/May/2015:15:16:30 +0800] sqli(194) BAN(226) 403 174 POST "/wp-content/plugins/store-locator-le/downloadcsv.php" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0" "-" "-" "-" "query=addr,,1%26%2339;union(select*from((select%20md5(3.1415))a1%20join(select%202)a2))#" "application/x-www-form-urlencoded"
remote_addr            client IP
time_local             request time
attack_type            attack type (type ID)
ban_type               event handling type (response type ID)
status                 HTTP response code
body_bytes_sent        body bytes sent
request_method         HTTP request method
request_uri            HTTP request URI
http_user_agent        User-Agent request header
http_x_forwarded_for   X-Forwarded-For request header
http_referer           Referer request header
http_cookie            Cookie request header
request_body           request body
http_content_type      Content-Type request header
Step 1: create the database weblog
hive> create database if not exists weblog comment 'holds all web logs' ;
The database is stored at:
hive> dfs -ls /user/hive/warehouse/;
drwxrwxr-x - root supergroup 0 2015-05-12 15:01 /user/hive/warehouse/weblog.db
Step 2: create the table nginxlog to hold the raw logs
hive> use weblog;
hive> create table nginxlog(remote_addr string,time_local string, attack_type string,ban_type string,status string,body_bytes_sent string,request_method string,request_uri string,http_user_agent string,http_x_forwarded_for string,http_referer string,http_cookie string,request_body string,http_content_type string) row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' with serdeproperties("input.regex" = "(\\d+\\.\\d+\\.\\d+\\.\\d+)\\s+(\\[[^\\]]+\\])\\s+(\\w+\\(\\d*\\))\\s+(\\w+\\(\\d*\\))\\s+(\\d{3})\\s+(\\d+)\\s+([A-Z]+)\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"") stored as textfile;
This input.regex drained 99% of my health bar.
Lesson learned: double-escape, double-escape, double-escape. Important things are worth saying three times.
It also taught me how to change a table's SerDe properties.
Background: Hive uses an InputFormat object to split the input stream into records and an OutputFormat object to format records into the output stream; a SerDe (serializer/deserializer) then parses a record into columns when reading and encodes columns into a record when writing.
hive> alter table nginxlog
> set serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
> with serdeproperties("input.regex" = "(\\d+\\.\\d+\\.\\d+\\.\\d+)\\s+(\\[[^\\]]+\\])\\s+(\\w*\\(\\d*\\))\\s+(\\w*\\(\\d*\\))\\s+(\\d{3})\\s+(\\d+)\\s+([A-Z]+)\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"") ;
Once the table is created, we can see where it is stored in HDFS:
hive> dfs -ls /user/hive/warehouse/weblog.db/nginxlog;
Found 1 items
-rwxrwxr-x 1 root supergroup 1861896 2015-05-12 20:22 /user/hive/warehouse/weblog.db/nginxlog/access.log
We can inspect the table's structure:
hive> describe nginxlog;
OK
remote_addr string
time_local string
attack_type string
ban_type string
status string
body_bytes_sent string
request_method string
request_uri string
http_user_agent string
http_x_forwarded_for string
http_referer string
http_cookie string
request_body string
http_content_type string
Time taken: 0.055 seconds, Fetched: 14 row(s)
Step 3: load the raw log file
load data local inpath "/home/tanjiti/nginx/logs/access.log" overwrite into table nginxlog;
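A quick sanity check that both the load and the regex SerDe behave as expected (a simple sketch):
hive> select count(*) from nginxlog;
hive> select remote_addr, status, request_uri from nginxlog limit 3;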
Step 4: create another table to hold the parsed URLs
create table urlparse(
request_uri string,
requestfilename string,
param map<string,string>);
Store the parsed URL data into the urlparse table:
insert overwrite table urlparse select request_uri, case when instr(request_uri,'?') == 0 then substr(request_uri,0,length(request_uri)) else substr(request_uri,0,instr(request_uri,'?')-1) end as requestfilename, case when instr(request_uri,'?') == 0 then NULL else str_to_map(substr(request_uri,instr(request_uri,'?')+1),'&','=') end as param from nginxlog;
We can check the data that was stored:
urlparse.request_uri urlparse.requestfilename urlparse.param (column names)
/forummission.php /forummission.php NULL
/userapp.php?script=notice&view=all&option=deluserapp&action=invite&hash='%20and%20(select%201%20from%20(select%20count(*),concat(md5(3.1415),floor(rand(0)*2))x%20from%20information_schema.tables%20group%20by%20x)a)%23 /userapp.php {"hash":"'%20and%20(select%201%20from%20(select%20count(*),concat(md5(3.1415),floor(rand(0)*2))x%20from%20information_schema.tables%20group%20by%20x)a)%23","action":"invite","option":"deluserapp","view":"all","script":"notice"}
Note: this kind of parsing is very crude; any request that does not follow the url?k1=v1&k2=v2 pattern (including rewritten URLs) will not be parsed correctly. A real deployment needs something better; this is only an example.
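A hedged alternative to the instr/substr approach above is Hive's built-in parse_url, which extracts the PATH and QUERY parts of a URL; since it expects a full URL, a dummy scheme and host have to be prepended to request_uri (a sketch, not a drop-in replacement for the insert above):
select request_uri,
       parse_url(concat('http://dummy', request_uri), 'PATH') as requestfilename,
       str_to_map(parse_url(concat('http://dummy', request_uri), 'QUERY'), '&', '=') as param
from nginxlog
limit 10;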
Step 5: explore URL characteristics
We explore some characteristics of the URLs from a statistical point of view:
- how many distinct URL requests each host receives;
- for those URL requests:
- the distribution of the number of parameters
- the distribution of parameter lengths
- the set of observed parameter names and their classification: Word, ParaArray (e.g. text["fafa"], t[]), Other
- the classification of parameter values: Digits (e.g. -123 +56 123.3 .3 1,123,123), Word, Email, PATH (Windows/Linux), URI, SafeText (-_.,:a-zA-Z0-9\s), Flag (null), DuplicatePara (e.g. a=1&a=2), Base64, Encrypt (md5, sha1), Other
Taken further, this kind of exploration can be used to build a URL whitelist. Before exploring, of course, the log source needs cleaning: drop noise such as attack logs, server error logs (keep only 2xx/3xx), static resources (avi, jpg, etc.) and duplicate entries, and normalize the requests, e.g. unify the PATH (collapse repeated //, normalize slash direction). But I digress.
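As a rough illustration of how the value classification could start in plain HiveQL, here is a sketch using rlike and case (the patterns below are simplified stand-ins for the categories above, not a complete taxonomy):
select pv.param_value,
       case
         when pv.param_value rlike '^[+-]?[0-9][0-9.,]*$'                            then 'Digits'
         when pv.param_value rlike '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]+$'  then 'Email'
         when pv.param_value rlike '^[0-9a-fA-F]{32}$'                               then 'Encrypt(md5)'
         when pv.param_value rlike '^[A-Za-z]+$'                                     then 'Word'
         when pv.param_value rlike '^[-_.,:A-Za-z0-9 ]*$'                            then 'SafeText'
         else 'Other'
       end as value_class
from (select explode(param) as (param_key, param_value)
      from urlparse
      where param is not null) pv
limit 20;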
Length checks can be done simply with Chebyshev's inequality.
For numerical statistics, Hive provides a number of built-in functions, for example:
central tendency: avg (mean);
dispersion: var_pop (variance), stddev_pop (standard deviation), covar_pop (covariance), corr (correlation coefficient).
Some of this can be done with the built-in functions alone; the rest requires writing user-defined functions.
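For example, a Chebyshev-style length check can be sketched directly in HiveQL: compute the mean and standard deviation of parameter value lengths per requestfilename, then flag values more than k standard deviations from the mean (k = 3 below; by Chebyshev's inequality at most 1/9 of values can land there, whatever the distribution). This is only a sketch built on the urlparse table above:
select v.requestfilename, v.param_value, v.len
from (select requestfilename, param_value, length(param_value) as len
      from urlparse
      lateral view explode(param) p as param_key, param_value) v
join (select requestfilename,
             avg(length(param_value))        as avg_len,
             stddev_pop(length(param_value)) as std_len
      from urlparse
      lateral view explode(param) p as param_key, param_value
      group by requestfilename) s
  on v.requestfilename = s.requestfilename
where s.std_len > 0
  and abs(v.len - s.avg_len) > 3 * s.std_len
limit 20;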
Built-in functions: get each request path together with the arrays of parameter names and parameter values seen under it
select requestfilename,map_keys(param),map_values(param) from urlparse where param is not null limit 10;
Partial results:
/bbs/plugin.php ["action","identifier","fmid","module"] ["view","family","1+and+1=2+unIon+selecT+%201,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,group_concat(0x3a,0x3a,md5(3.1415),0x3a,0x3a),25,26,27,28,29,30,31--%20-","family"]
/wp-content/plugins/profiles/library/bio-img.php ["id"] ["-1%27%20AND%201=IF(2%3E1,BENCHMARK(10000000,MD5(CHAR(115,113,108,109,97,112))),0)--%20-"]
Built-in functions: get all query-string key-value pairs seen under /index.php
hive> from (select explode(param) from urlparse where param is not NULL and requestfilename = '/index.php') e select distinct *;
Partial results:
view ../../../../../../../../../../../../../../../../../../boot.ini%00
view ../../../../../../../../../../../../../../../../etc/passwd%00
view c%3A%5CBoot.ini%00
view music
view object
view portfolio
view thread
view timereturns
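The same result can also be written with lateral view explode, which tends to be the more common idiom (a sketch):
hive> select distinct param_key, param_value
    > from urlparse
    > lateral view explode(param) p as param_key, param_value
    > where requestfilename = '/index.php';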
Built-in functions: per-URL statistics of the parameter-count distribution
hive> select s.requestfilename as requestfilename, sum(distinct s.param_num) as sum, avg(s.param_num) as avg, max(s.param_num) as max, min(s.param_num) as min, variance(s.param_num) as variance, var_samp(s.param_num) as var_samp, stddev_pop(s.param_num) as stddev_pop, stddev_samp(s.param_num) as stddev_samp from (select requestfilename as requestfilename,size(param) as param_num from urlparse where param is not null)s group by s.requestfilename limit 10;
Partial results:
requestfilename sum avg max min variance var_samp stddev_pop stddev_samp
/ 21 2.4623655913978495 6 1 0.8077234362354029 0.816503038803179 0.8987343524286823 0.9036055770097808
//m_5_1/govdiropen/que_chooseusers.jsp 1 1.0 1 1 0.0 0.0 0.0 0.0
Step 6: write a user-defined function for IP geolocation
We look up remote_addr in the MaxMind IP database to find its geographic location.
1. Download the MaxMind GeoIP Java API and build it into a jar
git clone https://github.com/maxmind/geoip-api-java.git
cd geoip-api-java/
mvn clean install
This produces target/geoip-api-1.2.15-SNAPSHOT.jar.
2. Get the IP database file
wget http://geolite.maxmind.com/download/geoip/database/GeoLiteCountry/GeoIP.dat.gz
gzip -d GeoIP.dat.gz
3. Write the hive-geoip UDF
The source follows https://raw.githubusercontent.com/edwardcapriolo/hive-geoip/master/src/main/java/com/jointhegrid/udf/geoip/GenericUDFGeoIP.java
It consists of four parts:
1. The usage documentation, which is what describe function shows:
@Description(
  name = "geoip",
  value = "_FUNC_(ip,property,database) - loads database into GEO-IP lookup "+
    "service, then looks up 'property' of ip. ",
  extended = "Example:\n"
    + "> SELECT _FUNC_(ip,'COUNTRY_CODE','/GeoIP.data') from src LIMIT 1;\n "
)
2. The initialize phase, which checks that the arguments are valid and determines their types; in this example the first argument may be either a string or a long:
public ObjectInspector initialize(ObjectInspector[] arguments)
throws UDFArgumentException {
3. The lookup logic itself; the class is instantiated for every place in the query where the function is applied:
public Object evaluate(DeferredObject[] arguments) throws HiveException
4. Used for debugging/display:
public String getDisplayString(String[] children)
Non-aggregate user-defined functions can generally follow the structure of the source above; tweak it as needed.
Note: aggregate functions (UDAFs) are somewhat more involved to write.
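Once the function has been registered (see the hive commands in item 4 below), the @Description text from part 1 can be confirmed with describe function (a sketch, assuming the temporary function name geoip used below):
hive> describe function extended geoip;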
The source is on my GitHub: https://github.com/tanjiti/UDFExample/tree/master
I compiled it into a jar in Eclipse; you can also do it from the command line, see Part IV below if interested.
Extra: installing the Fat Jar plugin in Eclipse
Help -> Install New Software -> in the "Work with" field enter http://kurucz-grafika.de/fatjar
4. The remaining operations are done in Hive
add jar /home/tanjiti/UDFExample/UDFExample.jar;                     # sets the classpath, though it has some quirks
add jar /home/tanjiti/UDFExample/lib/geoip-api-1.2.15-SNAPSHOT.jar;
add jar /home/tanjiti/UDFExample/lib/hive-exec-1.1.0.jar;
add file /tmp/GeoIP.dat;                                             # in effect this uses the Hadoop distributed cache
create temporary function geoip as 'udfExample.GenericUDFGeoIP';
select geoip(remote_addr,"COUNTRY_NAME","/tmp/GeoIP.dat") from nginxlog limit 1;
or
select geoip(3514683273,"COUNTRY_NAME","/tmp/GeoIP.dat");
The result:
xxx.xxx.xxx United States
To avoid unnecessary trouble, write full paths. Full paths. Full paths. Important things are worth saying three times.
That is the end of the introductory part; what follows is the tale of blood and tears, for anyone interested.
-----------------------------------------------------------------------------------------------------------------------------------(the blood-and-tears divider)
As usual, a record of the bugs I ran into.
Error 5:
Exception in thread "main" java.lang.NoClassDefFoundError: com/maxmind/geoip/LookupService
at udfExample.GenericUDFGeoIP.evaluate(GenericUDFGeoIP.java:133)
at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:145)
at org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc.newInstance(ExprNodeGenericFuncDesc.java:232)
at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:958)
at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:1168)
at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:94)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:78)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:132)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:109)
at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:192)
at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:145)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genAllExprNodeDesc(SemanticAnalyzer.java:10530)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:10486)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:3720)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:3499)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPostGroupByBodyPlan(SemanticAnalyzer.java:9011)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:8966)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:9812)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:9705)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genOPTree(SemanticAnalyzer.java:10141)
at org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:286)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10152)
at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:192)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:222)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:421)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:307)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1112)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1160)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1039)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:207)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:159)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:370)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:754)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.ClassNotFoundException: com.maxmind.geoip.LookupService
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 43 more
Anyone with a bit of experience, or just basic Java knowledge, will recognize that java.lang.NoClassDefFoundError usually means a classpath problem.
But I am a newbie. The error actually appeared because I found Fat Jar too slow at packaging and switched to the unfamiliar command-line tools, then spent a lot of time tracking down the cause... Fortunately a solution turned up.
IV. Compiling and packaging a jar from the command line
1. The source layout
├── bin                 # compiled class files go here
│   └── udfExample
│       └── GenericUDFGeoIP.class
├── lib                 # external jars we depend on
│   ├── geoip-api-1.2.15-SNAPSHOT.jar
│   └── hive-exec-1.1.0.jar
├── mymainfest          # the manifest file, very important
├── src                 # Java source files
│   └── udfExample
│       ├── GenericUDFGeoIP.java
│       └── GenericUDFNvl.java
2. Compile the source
javac -d bin/ -sourcepath src/ -cp lib/hive-exec-1.1.0.jar:lib/geoip-api-1.2.15-SNAPSHOT.jar src/udfExample/GenericUDFGeoIP.java
-cp <path>          dependency jars or class files to compile against
-sourcepath <path>  where the source files live
-d <directory>      where to put the compiled class files
3. Edit the manifest file (do not forget this step)
vim mymainfest
Edit:
Main-Class: udfExample.GenericUDFGeoIP
Class-Path: lib/geoip-api-1.2.15-SNAPSHOT.jar lib/hive-exec-1.1.0.jar
4. Package the jar
jar cvfm UDFExample.jar mymainfest lib/* src/* -C bin .
-c  create a new jar
-v  verbose output
-f  specify the path of the jar file
-C  change to the given directory; the trailing . means include everything under that directory
-m  specify the manifest file
The next steps are back in Hive.
Next up, I plan to fill in the Hadoop MapReduce primer I still owe.
Hive reference:
https://github.com/edwardcapriolo/hive-geoip/