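Big Data: Installing Pig and Usage Examples

Prerequisites:

Basic knowledge of Hadoop Distributed File System (HDFS) operations and the MapReduce computing framework; see "Big Data: Setting Up a Hadoop Pseudo-Cluster and Getting Started with MapReduce Programming".

I. Setting Up the Pig Environment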

http://pig.apache.org/

1. Installation

wget http://mirrors.cnnic.cn/apache/pig/pig-0.15.0/pig-0.15.0.tar.gz

tar zxvf pig-0.15.0.tar.gz

cd pig-0.15.0/

vim bin/pig

Set the following variables:

JAVA_HOME=/opt/jdk1.7.0_67

HADOOP_HOME=/opt/hadoop-2.7.1

HADOOP_CONF_DIR=/opt/hadoop-2.7.1/etc/hadoop


vim ~/.bashrc
Add or update the following:
export JAVA_HOME=/opt/jdk1.7.0_67
export PATH=$JAVA_HOME/bin:$PATH

export HADOOP_HOME=/opt/hadoop-2.7.1
export PATH=$HADOOP_HOME/bin:$PATH

export PIG_HOME=/opt/pig-0.15.0
export PATH=$PIG_HOME/bin:$PATH

Optional

Install Ant (used below to build the tutorial): http://ant.apache.org/bindownload.cgi

export ANT_HOME=/opt/apache-ant-1.9.6
export PATH=$ANT_HOME/bin:$PATH

2. Configuration

vim conf/pig.properties

Edit:

# Should scripts check to prevent multiple stores writing to the same location?

# (default: false) When set to true, stops the execution of script right away.

# Prevents several STORE statements from overwriting results at the same location


pig.location.check.strict=true

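II. Basic Pig Usage

1. Basic operations on the Pig command line

Interactive mode (at the Grunt shell prompt):

fs -ls /    # fs: Hadoop file system operations

sh ls       # sh: shell commands

Non-interactive mode:

pig --help

pig -e 'fs -ls /'   # execute a command

pig -c test.pig     # check the syntax of a Pig script

pig -f test.pig     # run a Pig script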

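2. Pig data analysis example

The official introductory tutorial, which computes popular search terms, is a good way to learn the basics of Pig. It covers the data processing flow (loading data with LOAD, filtering with FILTER, grouping with GROUP, joining with JOIN, sorting with ORDER, merging with UNION, splitting with SPLIT, and storing with STORE) as well as Pig built-in functions and writing Pig UDFs.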
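For readers who have not met these relational operators before, the sketch below mimics each one in plain Python on a tiny in-memory data set. It is only an analogy: the record layout (user, hour, query), the sample rows, and the `labels` lookup table are made up for illustration and are not taken from the tutorial scripts.

```python
from collections import defaultdict

# Toy rows standing in for a LOADed relation: (user, hour, query).
# This layout is an assumption for illustration only.
records = [
    ("u1", "08", "pig latin"),
    ("u2", "08", "hadoop"),
    ("u1", "09", "pig latin"),
    ("u3", "09", ""),
]

# FILTER: keep rows whose query is not empty.
filtered = [r for r in records if r[2]]

# GROUP: collect the rows under each distinct query.
groups = defaultdict(list)
for r in filtered:
    groups[r[2]].append(r)

# FOREACH ... GENERATE COUNT: one (query, count) row per group.
counts = [(query, len(rows)) for query, rows in groups.items()]

# ORDER: sort by count descending, then by query so ties are deterministic.
ordered = sorted(counts, key=lambda qc: (-qc[1], qc[0]))

# SPLIT: route rows into two relations by a condition.
morning = [r for r in filtered if r[1] == "08"]
later = [r for r in filtered if r[1] != "08"]

# UNION: concatenate two relations that share a schema.
merged = morning + later

# JOIN: inner join of the counts with a small lookup relation on the query.
labels = [("pig latin", "language"), ("hadoop", "platform")]
joined = [(q, c, label) for q, c in ordered for lq, label in labels if q == lq]

# STORE: serialize the result as tab-separated lines.
stored = "\n".join(f"{q}\t{c}" for q, c in ordered)
print(stored)
```

In Pig each of these steps is a single statement, and the same script runs unchanged in local mode or as MapReduce jobs on a cluster.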

Build and generate pigtutorial.tar.gz


cd /home/work/lidanqing01/pig-0.15.0/tutorial
vim build.xml
Edit it to add the Pig and Hadoop dependency libraries:

    <path id="tutorial.classpath">
        <fileset dir="../lib/">
          <include name="*.jar"/>
        </fileset>
        <fileset dir="../lib/hadoop1-runtime/">
          <include name="*.jar"/>
        </fileset>
        <fileset dir="..">
          <include name="pig*-core-*.jar"/>
        </fileset>
        <pathelement location="${build.classes}"/>
        <pathelement location="${pigjar}"/>
    </path>

Build:

ant

tar zxvf pigtutorial.tar.gz

cd pigtmp # this directory contains the example Pig scripts

(1) Run in local mode

pig -x local script1-local.pig

Result:

script1-local-results.txt

(2) Run in MapReduce mode

hadoop fs -mkdir -p /user/root

hadoop fs -put excite.log.bz2 . # upload the log to HDFS as /user/{user}/excite.log.bz2

pig -f script1-hadoop.pig

Result:

hadoop fs -ls /user/{user}/script1-hadoop-results

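III. Advanced Pig Usage

Use Pig to count how many times each SQL statement appears in the MySQL general query log.

1. For the format of the MySQL general query log, see section II, "Supplementary Knowledge", of http://danqingdani.blog.163.com/blog/static/186094195201611673420929/

2. Pig script db_parse.pig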

-- Load each line of the MySQL log as a single string
raw_log_entries = LOAD 'mysql.log' using TextLoader AS (line:chararray);
-- Ship the streaming script with the job and pipe every line through it
define sqlicheck `python normalize_mapper.py` ship('normalize_mapper.py');
stream_log_entries = stream raw_log_entries through sqlicheck as(sql_base64:chararray, time_base64:chararray);
-- Group by the base64-encoded SQL and count the rows in each group
group_log_entries = GROUP stream_log_entries by sql_base64;
log_count = foreach group_log_entries generate flatten(group) as sql_base64:chararray, COUNT(stream_log_entries) as count;
-- Sort by count, highest first
log_count_order = order log_count by count desc;
-- Register the Jython UDF and decode the SQL back to plain text before storing
register 'udf_pig.py' using jython as udf_tool;
decode_base64_log_count = foreach log_count_order generate udf_tool.base64_decode(sql_base64), count;
store decode_base64_log_count into 'sql_count';
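To make the data flow of db_parse.pig easier to follow, here is a rough plain-Python equivalent of its group, count, order, and decode steps. It is a sketch under assumptions: `count_statements` and the sample statements are hypothetical, the real normalize_mapper.py may transform the SQL in ways not shown here, and the stated reason for the base64 encoding (keeping tabs and newlines inside SQL from breaking tab-delimited records) is a plausible reading rather than something the script documents.

```python
import base64
from collections import Counter

def count_statements(statements):
    """Return (sql, count) pairs sorted by count, highest first."""
    # Stand-in for the streaming step: encode each statement so that tabs or
    # newlines inside the SQL cannot break a tab-delimited record.
    encoded = [
        base64.b64encode(s.encode("utf-8")).decode("ascii") for s in statements
    ]
    # GROUP ... BY sql_base64 followed by COUNT.
    counts = Counter(encoded)
    # ORDER ... BY count DESC; ties are broken by key to keep the order stable.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # The UDF step: decode the keys back into readable SQL.
    return [(base64.b64decode(key).decode("utf-8"), n) for key, n in ordered]

# Hypothetical input; the third entry contains a tab on purpose.
sample = [
    "select 1",
    "select @@version_comment limit 1",
    "show\ttables",
    "select 1",
]
result = count_statements(sample)
for sql, n in result:
    print(repr(sql), n)
```

The Pig version does the same work, but distributes the grouping and sorting across MapReduce tasks when it runs on a cluster.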

3. Pig UDF udf_pig.py

import base64

@outputSchema("sql:chararray")
def base64_decode(s):
    return base64.b64decode(s)
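udf_pig.py uses `@outputSchema` without importing it, so the decorator is evidently supplied by Pig when the file is registered with `using jython`. To try the function outside Pig, one option is to define a stand-in decorator first. The following is a minimal sketch of that idea; the stub `outputSchema` below is a local substitute written for this example, not Pig's implementation, and the extra `.decode("utf-8")` is added only so that the function returns text on Python 3.

```python
import base64

def outputSchema(schema):
    """Local stand-in for the decorator that Pig supplies to Jython UDFs."""
    def wrap(func):
        # Record the schema string on the function; otherwise do nothing.
        func.outputSchema = schema
        return func
    return wrap

@outputSchema("sql:chararray")
def base64_decode(s):
    # Decode the base64 payload and return it as text.
    return base64.b64decode(s).decode("utf-8")

# Round-trip check with a statement taken from the sample output.
encoded = base64.b64encode(b"select @@version_comment limit 1")
print(base64_decode(encoded))
```

When the file is registered in Pig, the stub should be left out so that the decorator provided by Pig is used instead.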

4. Pig streaming script normalize_mapper.py (see http://danqingdani.blog.163.com/blog/static/186094195201611673420929/)
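The actual normalize_mapper.py is in the linked post and is not reproduced here. To show the contract that the STREAM step in db_parse.pig relies on (one input line in, one tab-separated record of `sql_base64` and `time_base64` out), here is a hypothetical mapper sketch. The log line layout, the two regular expressions, and the `map_lines` helper are all assumptions made for this example, and any SQL normalization the real script performs is omitted.

```python
import base64
import re

# Assumed layout of a log line (not taken from the original script):
# an optional "YYMMDD HH:MM:SS" timestamp, a connection id, a command
# name such as Query, and the command argument.
TIMESTAMP = re.compile(r"^(\d{6}\s+\d{1,2}:\d{2}:\d{2})\s")
QUERY = re.compile(r"^\s*(\d+)\s+Query\s+(.*)$")

def b64(text):
    """Base64-encode text and return the result as an ASCII string."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def map_lines(lines):
    """Yield one 'sql_base64<TAB>time_base64' record per Query line."""
    last_time = ""  # assumption: lines may omit the timestamp, so remember it
    for line in lines:
        line = line.rstrip("\n")
        stamp = TIMESTAMP.match(line)
        if stamp:
            last_time = stamp.group(1)
            line = line[stamp.end():]  # drop the timestamp prefix
        query = QUERY.match(line)
        if query is None:
            continue  # skip headers and non-Query commands
        yield b64(query.group(2)) + "\t" + b64(last_time)

# Hypothetical sample lines. A real streaming script would iterate over
# sys.stdin instead and print each record to stdout.
sample = [
    "160216 18:31:43\t    3 Connect\troot@localhost on ",
    "\t\t    3 Query\tselect @@version_comment limit 1",
    "160216 18:31:50\t    3 Query\tselect 1",
    "\t\t    3 Quit\t",
]
records = list(map_lines(sample))
for record in records:
    print(record)
```

Emitting exactly two tab-separated fields per line matters because the `as(sql_base64:chararray, time_base64:chararray)` clause in db_parse.pig maps the script's output onto those two columns.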

5. Run

pig -x local -f db_parse.pig

The result looks like this:

more sql_count/part-r-00000

select @@version_comment limit 1 1501

IV. Pig Development Environment Setup

1. Sublime Text

https://github.com/matthayes/sublime-text-pig

2. Vim

http://www.vim.org/scripts/script.php?script_id=2186

3. Eclipse

https://wiki.apache.org/pig/PigPen

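V. More

Related reading:

Hive: see "Big Data: Installing Hive and Analyzing Web Logs"

Pig serves the same purpose as Hive and Cascading: it hides the details of MapReduce programming and gives programmers a simpler way to work. Hive uses SQL, Pig uses Pig Latin scripts, and Cascading provides a Java API; all three are ultimately translated into MapReduce jobs. This shortens the development cycle and lets programmers concentrate on the data analysis rather than on how it is executed. The convenience comes at a cost, for example in performance and in the difficulty of implementing uncommon algorithms. Pig is therefore typically used for rapid prototyping, with the final production version implemented directly in MapReduce.

Because Pig is a batch-processing tool, it also inherits the limitations of batch processing: it does not support random reads and writes of data (a NoSQL database such as HBase covers that need), and it does not support real-time stream processing (Storm covers that). There are many big data processing frameworks, and the right choice depends on the scenario.

To get a basic command of Pig, skim the official Pig Latin Basics documentation, read the corresponding source code in the tutorial, practice on real log analysis, and consult a cheat sheet whenever you forget the syntax.

References: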

http://pig.apache.org/

http://pig.apache.org/docs/r0.14.0/admin.html

https://cwiki.apache.org/confluence/display/PIG/Pig+Training

https://github.com/twitter/elephant-bird

https://github.com/linkedin/datafu

https://github.com/alanfgates/programmingpig

http://software.danielwatrous.com/analyze-tomcat-logs-using-pig-hadoop/