
Type | Description |
Table | Managed table. Creating one involves two steps: (1) the table itself is created, and (2) data is loaded: the actual data is moved into the data warehouse directory, and all later access reads directly from that directory. When the table is dropped, both its data and its metadata are deleted. HDFS storage path: /user/hive/warehouse/accesslog |
External Table | External table. Creation is a single step: loading data and creating the table happen together; the actual data stays at the HDFS path given by LOCATION in the create statement and is not moved into the warehouse directory. Dropping an external table deletes only the metadata; the data itself is kept. |
Partition | Partition; works like a coarse-grained index. HDFS storage path: /user/hive/warehouse/accesslog/event_day=20150512/ |
Bucket | Bucket; the specified column is hashed and the data is split by hash value so that each bucket maps to one file. HDFS storage path: /user/hive/warehouse/accesslog/event_day=20150512/part-00010 |
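A minimal HiveQL sketch of the four constructs above (the column names and the LOCATION path are made up for illustration, not taken from the article's data):

-- Managed table: on load, data is moved under /user/hive/warehouse/accesslog;
-- dropping the table removes both data and metadata.
create table accesslog (remote_addr string, request_uri string)
row format delimited fields terminated by '\t';

-- External table: data stays at the LOCATION path; dropping the table keeps the files.
create external table accesslog_ext (remote_addr string, request_uri string)
row format delimited fields terminated by '\t'
location '/data/accesslog';

-- Partitioned table: each event_day value becomes a subdirectory such as event_day=20150512.
create table accesslog_part (remote_addr string, request_uri string)
partitioned by (event_day string);

-- Bucketed table: rows are hashed on remote_addr and split into 16 bucket files.
create table accesslog_bucket (remote_addr string, request_uri string)
clustered by (remote_addr) into 16 buckets;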
wget http://mirror.bit.edu.cn/apache/hive/stable/apache-hive-1.1.0-bin.tar.gz
tar zxvf apache-hive-1.1.0-bin.tar.gz
cp conf/hive-env.sh.template conf/hive-env.sh
vim conf/hive-env.sh
HADOOP_HOME=/home/tanjiti/hadoop-2.6.0    # replace with your Hadoop installation directory
cp conf/hive-default.xml.template conf/hive-site.xml
vim conf/hive-site.xml
<property>
  <name>hive.metastore.warehouse.dir</name>   <!-- Hive's data warehouse directory, a path on HDFS -->
  <value>/user/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>
<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/hive</value>   <!-- Hive's temporary/scratch data directory -->
  <description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/<username> is created, with ${hive.scratch.dir.permission}.</description>
</property>
Create the corresponding directories in Hadoop HDFS and open up their permissions:
bin/hdfs dfs -mkdir /user/hive/
bin/hdfs dfs -mkdir /user/hive/warehouse
bin/hdfs dfs -chmod g+w /user/hive/warehouse
bin/hdfs dfs -chmod g+w /tmp
Default setup: out of the box Hive uses an embedded Derby metastore (the Derby JDBC driver ships in lib/derby-10.11.1.1.jar). Edit conf/hive-site.xml:
vim conf/hive-site.xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>   <!-- the JDBC connection string Hive uses to reach the metastore database -->
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>   <!-- JDBC driver class name -->
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>   <!-- database user name -->
  <value>APP</value>
  <description>Username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>   <!-- database password -->
  <value>mine</value>
  <description>password to use against metastore database</description>
</property>
apt-get install mysql-server
create user 'hive'@'%' identified by 'hive';
grant all privileges on *.* to 'hive'@'%' with grant option;
flush privileges;
create database hive;
alter database hive character set latin1;
To use MySQL as the metastore instead, edit conf/hive-site.xml:
vim conf/hive-site.xml
Also download the MySQL JDBC driver and place it in the lib directory (the jar path is shown after the configuration below).
<property>
<name>hive.metastore.local</name>
<value>true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>Username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
<description>password to use against metastore database</description>
</property>
lib/mysql-connector-java-5.1.7-bin.jar
bin/hive
hive> create table test(id int, name string) row format delimited FIELDS TERMINATED BY ',';
OK
Time taken: 0.201 seconds
hive> load data local inpath '/home/tanjiti/apache-hive-1.1.0-bin/test.data' overwrite into table test;
Loading data to table default.test
Table default.test stats: [numFiles=1, numRows=0, totalSize=25, rawDataSize=0]
OK
Time taken: 0.463 seconds
hive> select * from test;
OK
1 tanjiti
2 kokcc
3 dani
Time taken: 0.218 seconds, Fetched: 3 row(s)
We can also see the data file in Hadoop HDFS, which shows that each Hive table corresponds to a storage directory in Hadoop:
/hadoop-2.6.0/bin/hdfs dfs -cat /user/hive/warehouse/test/test.data
1,tanjiti
2,kokcc
3,dani
Error 1:
[ERROR] Terminal initialization failed; falling back to unsupported
java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
at jline.TerminalFactory.create(TerminalFactory.java:101)
at jline.TerminalFactory.get(TerminalFactory.java:158)
at jline.console.ConsoleReader.<init>(ConsoleReader.java:229)
at jline.console.ConsoleReader.<init>(ConsoleReader.java:221)
at jline.console.ConsoleReader.<init>(ConsoleReader.java:209)
at org.apache.hadoop.hive.cli.CliDriver.getConsoleReader(CliDriver.java:773)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:715)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
at jline.console.ConsoleReader.<init>(ConsoleReader.java:230)
at jline.console.ConsoleReader.<init>(ConsoleReader.java:221)
at jline.console.ConsoleReader.<init>(ConsoleReader.java:209)
at org.apache.hadoop.hive.cli.CliDriver.getConsoleReader(CliDriver.java:773)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:715)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
The cause is the old jline bundled with Hadoop conflicting with the newer jline shipped with Hive:
../hadoop-2.6.0/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/jline-0.9.94.jar
../hadoop-2.6.0/share/hadoop/yarn/lib/jline-0.9.94.jar
../hadoop-2.6.0/share/hadoop/kms/tomcat/webapps/kms/WEB-INF/lib/jline-0.9.94.jar
../apache-hive-1.1.0-bin/lib/jline-2.12.jar
Solution: make the user (Hive) classpath take precedence over Hadoop's:
export HADOOP_USER_CLASSPATH_FIRST=true
Error 2:
Exception in thread "main" java.lang.RuntimeException: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:472)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:671)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)
at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:515)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:458)
... 8 more
Caused by: java.net.URISyntaxException: Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D
at java.net.URI.checkPath(URI.java:1804)
at java.net.URI.<init>(URI.java:752)
at org.apache.hadoop.fs.Path.initialize(Path.java:203)
... 11 more
Solution: replace ${system:java.io.tmpdir}/${system:user.name} with absolute paths. Edit conf/hive-site.xml:
vim conf/hive-site.xml
<property>
<name>hive.exec.local.scratchdir</name>
<value>/tmp/hive</value>
<description>Local scratch space for Hive jobs</description>
</property>
<property>
<name>hive.downloaded.resources.dir</name>
<value>/tmp/${hive.session.id}_resources</value>
<description>Temporary local directory for added resources in the remote file system.</description>
</property>
<property>
<name>hive.querylog.location</name>
<value>/tmp/hive</value>
<description>Location of Hive run time structured log file</description>
</property>
<property>
<name>hive.server2.logging.operation.log.location</name>
<value>/tmp/hive/operation_logs</value>
<description>Top level directory where operation logs are stored if logging functionality is enabled</description>
</property>
Error 3:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:javax.jdo.JDODataStoreException: An exception was thrown while adding/validating class(es) : Specified key was too long; max key length is 767 bytes
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 767 bytes
Solution: switch the hive metastore database to latin1:
alter database hive character set latin1;
Error 4:
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/hadoop/hdfs/server/namenode/NameNode : Unsupported major.minor version 51.0
This error is caused by running on a Java version that is too old; use Java 7 or later (Java 6's place in the Java world is roughly IE6's place among browsers).
Part 3: Common HiveQL and writing UDFs
In one sentence: it is very, very similar to MySQL, so there is almost no learning curve.
1. Common Hive CLI commands
hive -e "SQL statement";                    # run a query from the command line
hive -f test.hql                            # run a Hive query from a file
hive> ! pwd;                                # run a simple shell command
hive> dfs -ls /user/hive/warehouse;         # run a Hadoop dfs command
2. Data types supported by Hive
They include primitive types and collection types; for log analysis, string, bigint, double and map are usually enough (see the sketch after item 3 below).
3. Hive's default delimiters for text files
\n separates rows and Ctrl+A (\001) separates fields (columns); in practice you normally specify your own delimiters.
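A small sketch covering items 2 and 3: a table with collection types and explicitly chosen delimiters (the table and column names here are hypothetical):

-- Hypothetical table illustrating collection types plus explicit delimiters.
-- The default field delimiter would be Ctrl+A (\001); "collection items terminated by"
-- separates the entries of the map, and "map keys terminated by" separates a key
-- from its value inside an entry.
create table weblog_demo (
  remote_addr     string,
  body_bytes_sent bigint,
  response_time   double,
  headers         map<string,string>
)
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
stored as textfile;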
4. Hive session settings
set hive.cli.print.current.db=true;     # show the current database in the prompt
set hive.cli.print.header=true;         # print column headers in query results
set hive.exec.mode.local.auto=true;     # local mode: skip MapReduce when the data set is small
set hive.mapred.mode=strict;            # strict mode: when not running locally, enforce restrictions that protect query performance, e.g. queries on partitioned tables must filter on the partition, and order by must be combined with limit
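For instance, a sketch against a hypothetical partitioned table (accesslog_part is not part of the article's data) of what strict mode expects:

-- Under hive.mapred.mode=strict a full scan such as
--   select * from accesslog_part order by remote_addr;
-- would be rejected (no partition filter, order by without limit),
-- while the query below passes: it filters on the partition column
-- and pairs order by with limit.
select remote_addr, request_uri
from accesslog_part
where event_day = '20150512'
order by remote_addr
limit 100;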
5. HiveQL: we will get familiar with HiveQL by analyzing web logs
A sample log line:
127.0.0.1 [12/May/2015:15:16:30 +0800] sqli(194) BAN(226) 403 174 POST "/wp-content/plugins/store-locator-le/downloadcsv.php" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0" "-" "-" "-" "query=addr,,1%26%2339;union(select*from((select%20md5(3.1415))a1%20join(select%202)a2))#" "application/x-www-form-urlencoded"
The fields (all declared as string in the table below):
remote_addr            client IP
time_local             timestamp
attack_type            attack type (type ID)
ban_type               event handling type (event response time)
status                 HTTP status code
body_bytes_sent        bytes sent in the body
request_method         HTTP request method
request_uri            HTTP request URI
http_user_agent        User-Agent request header
http_x_forwarded_for   X-Forwarded-For request header
http_referer           Referer request header
http_cookie            Cookie request header
request_body           request body
http_content_type      Content-Type request header
Step 1: create the weblog database
hive> create database if not exists weblog comment 'holds all web logs' ;
The database is stored at:
hive> dfs -ls /user/hive/warehouse/;
drwxrwxr-x - root supergroup 0 2015-05-12 15:01 /user/hive/warehouse/weblog.db
Step 2: create the nginxlog table to hold the raw logs
hive> use weblog;
hive> create table nginxlog(remote_addr string,time_local string, attack_type string,ban_type string,status string,body_bytes_sent string,request_method string,request_uri string,http_user_agent string,http_x_forwarded_for string,http_referer string,http_cookie string,request_body string,http_content_type string) row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' with serdeproperties("input.regex" = "(\\d+\\.\\d+\\.\\d+\\.\\d+)\\s+(\\[[^\\]]+\\])\\s+(\\w+\\(\\d*\\))\\s+(\\w+\\(\\d*\\))\\s+(\\d{3})\\s+(\\d+)\\s+([A-Z]+)\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"") stored as textfile;
This input.regex drained 99% of my HP.
Lesson learned: double-escape, double-escape, double-escape. Important things are worth saying three times.
It also taught me how to change a table's SerDe properties.
Background: Hive uses an InputFormat object to split the input stream into records and an OutputFormat object to format records into the output stream; a SerDe (serializer/deserializer) then parses records into columns when reading and encodes columns back into records when writing.
hive> alter table nginxlog
> set serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
> with serdeproperties("input.regex" = "(\\d+\\.\\d+\\.\\d+\\.\\d+)\\s+(\\[[^\\]]+\\])\\s+(\\w*\\(\\d*\\))\\s+(\\w*\\(\\d*\\))\\s+(\\d{3})\\s+(\\d+)\\s+([A-Z]+)\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"\\s+\\\"([^\"]+)\\\"") ;
Once the table is created, we can see where it is stored in Hadoop:
hive> dfs -ls /user/hive/warehouse/weblog.db/nginxlog;
Found 1 items
-rwxrwxr-x 1 root supergroup 1861896 2015-05-12 20:22 /user/hive/warehouse/weblog.db/nginxlog/access.log
We can inspect the table schema:
hive> describe nginxlog;
OK
remote_addr string
time_local string
attack_type string
ban_type string
status string
body_bytes_sent string
request_method string
request_uri string
http_user_agent string
http_x_forwarded_for string
http_referer string
http_cookie string
request_body string
http_content_type string
Time taken: 0.055 seconds, Fetched: 14 row(s)
Step 3: load the raw log file
load data local inpath "/home/tanjiti/nginx/logs/access.log" overwrite into table nginxlog;
Step 4: create another table to hold the parsed URL data
create table urlparse(
request_uri string,
requestfilename string,
param map<string,string>);
Populate the urlparse table with the parsed URL data:
insert overwrite table urlparse
select
  request_uri,
  case when instr(request_uri, '?') == 0
       then substr(request_uri, 0, length(request_uri))
       else substr(request_uri, 0, instr(request_uri, '?') - 1)
  end as requestfilename,
  case when instr(request_uri, '?') == 0
       then NULL
       else str_to_map(substr(request_uri, instr(request_uri, '?') + 1), '&', '=')
  end as param
from nginxlog;
We can check the stored data:
urlparse.request_uri    urlparse.requestfilename    urlparse.param    (column names)
/forummission.php /forummission.php NULL
/userapp.php?script=notice&view=all&option=deluserapp&action=invite&hash='%20and%20(select%201%20from%20(select%20count(*),concat(md5(3.1415),floor(rand(0)*2))x%20from%20information_schema.tables%20group%20by%20x)a)%23 /userapp.php {"hash":"'%20and%20(select%201%20from%20(select%20count(*),concat(md5(3.1415),floor(rand(0)*2))x%20from%20information_schema.tables%20group%20by%20x)a)%23","action":"invite","option":"deluserapp","view":"all","script":"notice"}
Note: this parsing approach is very crude; any request that does not follow the url?k1=v1&k2=v2 pattern (including rewritten URLs) will not be parsed correctly. Real-world use needs something better; this is only an example (one built-in alternative is sketched below).
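A slightly different option, when it fits the data, is Hive's built-in parse_url. It expects a complete URL, so a dummy scheme and host have to be prepended; sketch only, where 'dummy' and the 'hash' key are arbitrary examples, not values from the article:

-- Sketch: parse_url needs a full URL, so prepend a placeholder scheme and host.
select request_uri,
       parse_url(concat('http://dummy', request_uri), 'PATH') as requestfilename,
       parse_url(concat('http://dummy', request_uri), 'QUERY', 'hash') as hash_value
from nginxlog
limit 10;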
Step 5: explore URL characteristics
We explore some characteristics of the URLs from a statistical angle:
- how many distinct URL requests each host receives;
- for those URL requests:
  - the distribution of the number of parameters
  - the distribution of parameter lengths
  - the enumeration of parameter names and their classification: Word, ParaArray (e.g. text["fafa"], t[]), Other
  - the classification of parameter values: Digits (e.g. -123 +56 123.3 .3 1,123,123), Word, Email, PATH (windows/linux), URI, SafeText (-_.,:a-zA-Z0-9\s), Flag (Null), DuplicatePara (e.g. a=1&a=2), Base64, Encrypt (md5, sha1), Other
Taking this further, the same kind of exploration can be used to generate a URL whitelist. Before exploring, of course, the log source has to be cleaned: remove noise such as attack logs, server errors (keep only 2xx/3xx), static resources (avi, jpg, etc.) and duplicate records, and normalize, e.g. canonicalize paths (collapse repeated //, convert / -> \). But I digress; a rough filtering sketch follows.
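A rough version of that cleanup on the nginxlog table from above (the status and extension filters are illustrative, not a complete list):

-- Keep only 2xx/3xx responses and drop obvious static resources.
-- A real whitelist pipeline would also drop attack and duplicate records
-- and normalize the paths first.
select remote_addr, request_uri
from nginxlog
where (status like '2%' or status like '3%')
  and not (request_uri rlike '\\.(jpg|jpeg|png|gif|css|js|ico|avi)(\\?.*)?$')
limit 20;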
For length-based checks, Chebyshev's inequality offers a simple rule of thumb.
For numerical statistics, Hive provides a number of built-in functions, for example:
central tendency: mean avg;
dispersion: variance var_pop; standard deviation stddev_pop; covariance covar_pop; correlation coefficient corr
Some of this can be done with the built-in functions alone; the rest requires writing user-defined functions. A Chebyshev-style sketch follows.
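A rough Chebyshev-style check, using the urlparse table above (sketch only; the threshold k = 3 is an arbitrary choice):

-- Flag requests whose URI length is more than k = 3 standard deviations away from
-- the mean length for the same request file name. By Chebyshev's inequality at most
-- 1/k^2 of values can lie that far out, so these rows deserve a closer look.
select u.requestfilename, u.request_uri, length(u.request_uri) as uri_len
from urlparse u
join (
  select requestfilename,
         avg(length(request_uri)) as len_avg,
         stddev_pop(length(request_uri)) as len_std
  from urlparse
  group by requestfilename
) s
on u.requestfilename = s.requestfilename
where s.len_std > 0
  and abs(length(u.request_uri) - s.len_avg) > 3 * s.len_std
limit 20;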
Built-in functions: get each request path together with the arrays of parameter names and parameter values seen under it
select requestfilename,map_keys(param),map_values(param) from urlparse where param is not null limit 10;
Partial results:
/bbs/plugin.php ["action","identifier","fmid","module"] ["view","family","1+and+1=2+unIon+selecT+%201,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,group_concat(0x3a,0x3a,md5(3.1415),0x3a,0x3a),25,26,27,28,29,30,31--%20-","family"]
/wp-content/plugins/profiles/library/bio-img.php ["id"] ["-1%27%20AND%201=IF(2%3E1,BENCHMARK(10000000,MD5(CHAR(115,113,108,109,97,112))),0)--%20-"]
Built-in functions: get all distinct query-string key-value pairs under /index.php
hive> from (select explode(param) from urlparse where param is not NULL and requestfilename = '/index.php') e select distinct *;
Partial results:
view ../../../../../../../../../../../../../../../../../../boot.ini%00
view ../../../../../../../../../../../../../../../../etc/passwd%00
view c%3A%5CBoot.ini%00
view music
view object
view portfolio
view thread
view timereturns
Built-in functions: get statistics describing the parameter-count distribution for each URL
hive> select s.requestfilename as requestfilename, sum(distinct s.param_num) as sum, avg(s.param_num) as avg, max(s.param_num) as max, min(s.param_num) as min, variance(s.param_num) as variance, var_samp(s.param_num) as var_samp, stddev_pop(s.param_num) as stddev_pop, stddev_samp(s.param_num) as stddev_samp from (select requestfilename as requestfilename,size(param) as param_num from urlparse where param is not null)s group by s.requestfilename limit 10;
Partial results:
requestfilename sum avg max min variance var_samp stddev_pop stddev_samp
/ 21 2.4623655913978495 6 1 0.8077234362354029 0.816503038803179 0.8987343524286823 0.9036055770097808
//m_5_1/govdiropen/que_chooseusers.jsp 1 1.0 1 1 0.0 0.0 0.0 0.0
Step 6: write a user-defined function for IP geolocation lookup
We look up the MaxMind IP database to geolocate remote_addr.
1. Download the MaxMind GeoIP Java API and build it into a jar
git clone https://github.com/maxmind/geoip-api-java.git
cd geoip-api-java/
mvn clean install
This produces target/geoip-api-1.2.15-SNAPSHOT.jar.
2. Get the IP database file
wget http://geolite.maxmind.com/download/geoip/database/GeoLiteCountry/GeoIP.dat.gz
gzip -d GeoIP.dat.gz
3. Write the hive-geo UDF
The source follows https://raw.githubusercontent.com/edwardcapriolo/hive-geoip/master/src/main/java/com/jointhegrid/udf/geoip/GenericUDFGeoIP.java
It consists of four parts:
1. The documentation annotation, which is what describe function shows:
@Description(
  name = "geoip",
  value = "_FUNC_(ip,property,database) - loads database into GEO-IP lookup " +
          "service, then looks up 'property' of ip. ",
  extended = "Example:\n"
          + "> SELECT _FUNC_(ip,'COUNTRY_CODE','/GeoIP.data') from src LIMIT 1;\n "
)
2. The initialize phase, which checks the validity of the arguments and determines their types; in this example the first argument may be either a string or a long:
public ObjectInspector initialize(ObjectInspector[] arguments)
throws UDFArgumentException {
3. The lookup logic itself; the class is instantiated for every place in the query where the function is applied:
public Object evaluate(DeferredObject[] arguments) throws HiveException
4. Used for debugging/display purposes:
public String getDisplayString(String[] children)
Writing your own non-aggregate function basically follows the structure of the source above: copy it and tweak.
Note: aggregate functions are somewhat more complicated to write.
The source is on my GitHub: https://github.com/tanjiti/UDFExample/tree/master
I compiled it into a jar in Eclipse; you can also do it on the command line, see Part 4 if you are interested.

Extra: installing the Fat Jar plugin in Eclipse
Help -> Install New Software -> enter http://kurucz-grafika.de/fatjar in the "Work with" field
4. The remaining steps are Hive operations
add jar /home/tanjiti/UDFExample/UDFExample.jar;                      # this adds the jar to the classpath, though it has some gotchas
add jar /home/tanjiti/UDFExample/lib/geoip-api-1.2.15-SNAPSHOT.jar;
add jar /home/tanjiti/UDFExample/lib/hive-exec-1.1.0.jar;
add file /tmp/GeoIP.dat;                                              # this in effect uses the Hadoop distributed cache
create temporary function geoip as 'udfExample.GenericUDFGeoIP';
select geoip(remote_addr,"COUNTRY_NAME","/tmp/GeoIP.dat") from nginxlog limit 1;
or
select geoip(3514683273,"COUNTRY_NAME","/tmp/GeoIP.dat");
The result:
xxx.xxx.xxx United States
To save yourself unnecessary trouble, use full paths. Full paths. Full paths. Important things are worth saying three times.
That is the end of the introductory part of this article; what follows is optional reading, a tale of blood and tears.
----------------------------------------------------------------------------------- (the blood-and-tears divider)
Old habit: a record of the bugs I ran into.
Error 5:
Exception in thread "main" java.lang.NoClassDefFoundError: com/maxmind/geoip/LookupService
at udfExample.GenericUDFGeoIP.evaluate(GenericUDFGeoIP.java:133)
at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:145)
at org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc.newInstance(ExprNodeGenericFuncDesc.java:232)
at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:958)
at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:1168)
at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:94)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:78)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:132)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:109)
at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:192)
at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:145)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genAllExprNodeDesc(SemanticAnalyzer.java:10530)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:10486)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:3720)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:3499)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPostGroupByBodyPlan(SemanticAnalyzer.java:9011)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:8966)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:9812)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:9705)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genOPTree(SemanticAnalyzer.java:10141)
at org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:286)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10152)
at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:192)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:222)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:421)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:307)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1112)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1160)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1039)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:207)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:159)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:370)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:754)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.ClassNotFoundException: com.maxmind.geoip.LookupService
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 43 more
Anyone with some experience or basic Java knowledge will recognize that java.lang.NoClassDefFoundError is most likely a classpath problem.
But I am a newbie: the error actually appeared because I got impatient with how slowly Fat Jar builds and packaged the jar with command-line tools I was not familiar with, then spent a lot of time tracking down the cause... fortunately I found a fix.
Part 4: Compiling and packaging a jar on the command line
1. First, the source layout
├── bin                          # compiled class files
│   └── udfExample
│       └── GenericUDFGeoIP.class
├── lib                          # external jars used
│   ├── geoip-api-1.2.15-SNAPSHOT.jar
│   └── hive-exec-1.1.0.jar
├── mymainfest                   # the manifest file, very important
├── src                          # Java source files
│   └── udfExample
│       ├── GenericUDFGeoIP.java
│       └── GenericUDFNvl.java
2. Compile the source files
javac -d bin/ -sourcepath src/ -cp lib/hive-exec-1.1.0.jar:lib/geoip-api-1.2.15-SNAPSHOT.jar src/udfExample/GenericUDFGeoIP.java
-sourcepath <path> specifies where the source files live
3. Write the manifest file
vim mymainfest
Edit it to contain:
Main-Class: udfExample.GenericUDFGeoIP
Class-Path: lib/geoip-api-1.2.15-SNAPSHOT.jar lib/hive-exec-1.1.0.jar
4. Package the jar
jar cvfm UDFExample.jar mymainfest lib/* src/* -C bin .
The rest is the Hive-side work shown earlier.
Next, I plan to fill the remaining pit: an introduction to Hadoop MapReduce.