apt-get install ssh rsync
The layout of the unpacked Hadoop distribution:
bin/         executables: hadoop, hdfs, mapred, yarn, etc.
sbin/        control scripts: start-dfs.sh, start-yarn.sh, stop-dfs.sh, stop-yarn.sh, etc.
etc/hadoop/  configuration files (the env and site files)
libexec/     configuration helpers: hadoop-config.sh, hdfs-config.sh, mapred-config.sh, yarn-config.sh, etc.
logs/        log files
share/       documentation and jar packages; it is worth browsing these jars, since they provide the MapReduce programming API
include/     header files
lib/         native (dynamic link) libraries
cd hadoop-2.6.0/
vim etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64 [replace with the directory where your JDK is installed]
vim etc/hadoop/core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value> <!-- HDFS address and port -->
    </property>
</configuration>
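As a quick aside, fs.defaultFS is also what Hadoop client code resolves paths against. The following is a minimal sketch, not part of the original walkthrough (the class name DefaultFsCheck is made up); run it with the Hadoop configuration directory on the classpath and it should report the HDFS address configured above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class DefaultFsCheck {
    public static void main(String[] args) throws Exception {
        // A plain Configuration loads core-site.xml from the classpath,
        // so fs.defaultFS should come back as hdfs://localhost:9000.
        Configuration conf = new Configuration();
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
        // FileSystem.get returns the filesystem named by fs.defaultFS,
        // i.e. HDFS here rather than the local disk.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("/ exists: " + fs.exists(new Path("/")));
    }
}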
vim etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <!-- replication factor; the default is 3, change it to 1 for a single-node setup -->
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
vim etc/hadoop/mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
vim etc/hadoop/yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
ssh localhost
bin/hdfs namenode -format
sbin/start-dfs.sh
sbin/start-yarn.sh
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/root
bin/hdfs dfs -put etc/hadoop input
If the put fails with an error like the one below, no DataNode is running (typically because the DataNode data directory is left over from an earlier format and no longer matches the NameNode):
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/xxxxx._COPYING_ could only be replicated to 0 nodes instead of minReplication (=1). There are 0 datanode(s) running and no node(s) are excluded in this operation.
Stop HDFS and clear the stale DataNode data directory:
sbin/stop-dfs.sh
rm -rf /tmp/hadoop-root/dfs/data/*
sbin/start-dfs.sh
jps
20870 DataNode
20478 NameNode
31294 FsShell
19474 Elasticsearch
21294 SecondaryNameNode
bin/hdfs dfs -put etc/hadoop input
Run the bundled grep example from the official tutorial:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep input output 'dfs[a-z.]+'
bin/hdfs dfs -cat output/*
6 dfs.audit.logger
4 dfs.class
3 dfs.server.namenode.
2 dfs.period
2 dfs.audit.log.maxfilesize
2 dfs.audit.log.maxbackupindex
1 dfsmetrics.log
1 dfsadmin
1 dfs.servers
1 dfs.replication
1 dfs.file
We can also merge the results into a single file to view them:
bin/hdfs dfs -getmerge output/ output
3. Getting Started with MapReduce Programming
Environment: Eclipse as the IDE
Import: the Hadoop core jar (hadoop-core-1.2.1.jar)
Export: our own fat jar
The sample code is on my GitHub: https://github.com/tanjiti/mapreduceExample
The Hello World of MapReduce
We will stick with the word-splitting example; word counting is the Hello World of MapReduce.
The project layout is as follows:
├── bin                  compiled class files
├── lib                  dependency jars, copied from share/hadoop/ in the Hadoop distribution
│   ├── commons-cli-1.2.jar
│   ├── hadoop-common-2.6.0.jar
│   └── hadoop-mapreduce-client-core-2.6.0.jar
├── mymainfest           the manifest file, which specifies the classpath
└── src                  source files
    └── mapreduceExample
        └── WordCount.java
I usually write the code in Eclipse (for the convenience of the GUI) and compile and package on the command line (for its efficiency).
Step 1: Write the source
Edit:
vim src/mapreduceExample/WordCount.java
package mapreduceExample;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context
                           ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: mapreduceExample.WordCount <in> <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);    // the mapper class
        job.setCombinerClass(IntSumReducer.class);    // the combiner class; it merges map output locally to cut network I/O
        job.setReducerClass(IntSumReducer.class);     // the reducer class
        job.setOutputKeyClass(Text.class);            // key type of the reduce output
        job.setOutputValueClass(IntWritable.class);   // value type of the reduce output
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));    // input path
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));  // output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Step 2: Compile and package
Compile:
javac -d bin/ -sourcepath src/ -cp lib/hadoop-common-2.6.0.jar:lib/hadoop-mapreduce-client-core-2.6.0.jar:lib/commons-cli-1.2.jar src/mapreduceExample/WordCount.java
Write the manifest file:
vim mymainfest
with the line:
Class-Path: lib/hadoop-common-2.6.0.jar lib/hadoop-mapreduce-client-core-2.6.0.jar lib/commons-cli-1.2.jar
Note: the manifest format is documented at https://docs.oracle.com/javase/tutorial/deployment/jar/manifestindex.html. Another important entry is Main-Class:, which names the entry-point class; because a MapReduce project often has several entry points, we leave it unset here.
Package:
jar cvfm mapreduceExample.jar mymainfest lib/* src/* -C bin .
Check the jar contents:
jar tf mapreduceExample.jar
META-INF/
META-INF/MANIFEST.MF
lib/commons-cli-1.2.jar
lib/hadoop-common-2.6.0.jar
lib/hadoop-mapreduce-client-core-2.6.0.jar
src/mapreduceExample/
src/mapreduceExample/WordCount.java
mapreduceExample/
mapreduceExample/WordCount.class
mapreduceExample/WordCount$TokenizerMapper.class
mapreduceExample/WordCount$IntSumReducer.class
Step 3: Run
Check the usage first:
bin/hadoop jar /home/tanjiti/mapreduceExample/mapreduceExample.jar mapreduceExample.WordCount
which prints:
Usage: mapreduceExample.WordCount <in> <out>
Admittedly this is a bare-bones usage message, but at least it tells us what the arguments are. Now for the real run:
bin/hadoop jar
/home/tanjiti/mapreduceExample/mapreduceExample.jar [path to the jar]
mapreduceExample.WordCount [main class; can be omitted if Main-Class was set in the manifest when packaging]
/in [input path; must already exist when the job is submitted]
/out [output path; must NOT exist when the job is submitted]
Merge the results first, then view them:
bin/hdfs dfs -getmerge /out wordcount_result
Part of the output:
tail wordcount_result
"/?app=vote&controller=vote&action=total&contentid=7%20and%201=2%20union%20select%20group_concat(md5(7));%23" 1
The idea behind MapReduce is really quite simple. Described in one sentence, the process is: start a MapReduce job, set up the appropriate Configuration, specify the paths and formats of the input and output data, specify the K,V types that flow through the job (quite often you need a custom type that implements Writable), and specify the processing steps (map, combine, partition, sort, reduce).
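For example, a custom value type only needs to implement Writable (or WritableComparable if it is also used as a key). Below is a minimal sketch, not taken from the sample repository; the class and field names are made up for illustration.
package mapreduceExample;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
// A custom value type carrying two counters through the shuffle.
public class PairWritable implements Writable {
    private long count;
    private long bytes;
    public PairWritable() {}                    // Hadoop needs the no-arg constructor for deserialization
    public void set(long count, long bytes) {
        this.count = count;
        this.bytes = bytes;
    }
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(count);                   // serialize the fields in a fixed order
        out.writeLong(bytes);
    }
    @Override
    public void readFields(DataInput in) throws IOException {
        count = in.readLong();                  // read them back in exactly the same order
        bytes = in.readLong();
    }
    @Override
    public String toString() {
        return count + "\t" + bytes;            // what TextOutputFormat will print
    }
}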
4. Going Further
Once these basics are working, the next things to think about are how to share data between tasks: read and write HDFS files, pass values through the Configuration, or use the DistributedCache; and how to handle multiple MapReduce jobs: run them linearly, use ControlledJob, or use ChainMapper/ChainReducer.
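Passing a small value through the Configuration is the simplest of the three. Here is a minimal sketch; the property name wordcount.min.length is invented for illustration, not something Hadoop defines.
package mapreduceExample;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
public class ConfiguredWordCount {
    public static class FilteringMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();
        private int minLength;
        @Override
        protected void setup(Context context) {
            // Read the shared parameter once per task instead of once per record.
            minLength = context.getConfiguration().getInt("wordcount.min.length", 1);
        }
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                String token = itr.nextToken();
                if (token.length() >= minLength) {   // use the shared value
                    word.set(token);
                    context.write(word, one);
                }
            }
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("wordcount.min.length", "3");       // driver side: set it before the Job is created
        Job job = Job.getInstance(conf, "configured word count");
        job.setJarByClass(ConfiguredWordCount.class);
        job.setMapperClass(FilteringMapper.class);
        // set the reducer, output types and input/output paths as in WordCount above, then submit
    }
}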
Once the functionality is all in place, we can turn to performance: preprocessing the data (merging small files, filtering out noise, choosing the InputSplit size), deciding whether to enable compression, and tuning job properties such as the number of map and reduce tasks. All of this is best learned by hitting the potholes in real projects.
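As a minimal sketch of what those knobs look like in code (the property names are the Hadoop 2.x ones; the values are placeholders, not recommendations):
package mapreduceExample;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
public class TuningSketch {
    public static Job buildJob() throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.map.output.compress", true);          // compress intermediate map output
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize",
                128L * 1024 * 1024);                                     // cap the InputSplit size at 128 MB
        Job job = Job.getInstance(conf, "word count (tuned)");
        job.setNumReduceTasks(4);                                        // set the number of reduce tasks explicitly
        return job;
    }
}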
References
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html