
Hadoop Installation, Deployment, and Usage Notes: Pseudo-Distributed Mode

I. Environment Configuration for the Hadoop Daemons

Administrators can use the etc/hadoop/hadoop-env.sh script to set site-specific environment variables for the Hadoop daemons; the optional scripts etc/hadoop/mapred-env.sh and etc/hadoop/yarn-env.sh serve the same purpose for their respective daemons. The environment variables commonly used to configure each daemon's JVM options are:

  • HADOOP_NAMENODE_OPTS: options for the NameNode;
  • HADOOP_DATANODE_OPTS: options for the DataNode;
  • HADOOP_SECONDARYNAMENODE_OPTS: options for the Secondary NameNode;
  • YARN_RESOURCEMANAGER_OPTS: options for the ResourceManager;
  • YARN_NODEMANAGER_OPTS: options for the NodeManager;
  • YARN_PROXYSERVER_OPTS: options for the WebAppProxy;
  • HADOOP_JOB_HISTORYSERVER_OPTS: options for the MapReduce Job History Server;
  • HADOOP_PID_DIR: directory where daemon PID files are stored;
  • HADOOP_LOG_DIR: directory where daemon log files are stored;
  • HADOOP_HEAPSIZE / YARN_HEAPSIZE: upper limit on the heap size, 1000 MB by default;

For example, to have the NameNode use the parallel garbage collector, add the following line to hadoop-env.sh:

export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC"

1. Download Hadoop

Official release page: https://hadoop.apache.org/release/2.10.0.html

Download with wget:

[root@centos01 package]# wget https://archive.apache.org/dist/hadoop/common/hadoop-2.10.0/hadoop-2.10.0.tar.gz

2. Configure the Hadoop environment variables

Extract the archive into the target directory:

[root@centos01 ~]# mkdir -pv /bdapps/
[root@centos01 package]# tar xf hadoop-2.10.0.tar.gz -C /bdapps/
[root@centos01 package]# ln -sv /bdapps/hadoop-2.10.0/ /bdapps/hadoop
‘/bdapps/hadoop’ -> ‘/bdapps/hadoop-2.10.0/’

3. Edit the configuration file /etc/profile.d/hadoop.sh

Edit this file and define environment variables along the following lines to set up Hadoop's runtime environment:

export HADOOP_PREFIX="/bdapps/hadoop"
export PATH=$PATH:$HADOOP_PREFIX/bin:$HADOOP_PREFIX/sbin
export HADOOP_COMMON_HOME=${HADOOP_PREFIX}
export HADOOP_HDFS_HOME=${HADOOP_PREFIX}
export HADOOP_MAPRED_HOME=${HADOOP_PREFIX}
export HADOOP_YARN_HOME=${HADOOP_PREFIX}

4. Configure the Java environment

For configuring the Java environment, refer to the 琼杰笔记 article Linux安装Tomcat服务器和部署Web应用 (installing a Tomcat server and deploying web applications on Linux).

Set the JAVA_HOME variable:

[root@centos01 profile.d]# cat /etc/profile.d/java.sh 
export JAVA_HOME=/usr/java/jdk1.8.0_191-amd64

II. Create the Users and Directories for Running the Hadoop Processes

1. Create the user and group

For security and related reasons, the different Hadoop daemons are usually run by dedicated users. For example, with hadoop as the group, three users yarn, hdfs, and mapred run the corresponding daemons.

[root@centos01 ~]# groupadd hadoop
[root@centos01 ~]# useradd -g hadoop yarn
[root@centos01 ~]# useradd -g hadoop hdfs
[root@centos01 ~]# useradd -g hadoop mapred

2. Create the data and log directories

Hadoop needs data and log directories with differing permissions; here /data/hadoop/hdfs serves as the HDFS data storage directory.

[root@centos01 ~]# mkdir -pv /data/hadoop/hdfs/{nn,snn,dn}
[root@centos01 ~]# chown -R hdfs:hadoop /data/hadoop/hdfs/

Then create a logs directory under the Hadoop installation directory and change the owner and group of all Hadoop files:

[root@centos01 ~]# cd /bdapps/hadoop
[root@centos01 hadoop]# mkdir logs
[root@centos01 hadoop]# chmod g+w logs
[root@centos01 hadoop]# chown -R yarn:hadoop ./*

III. Configure Hadoop

1. Edit the configuration file etc/hadoop/core-site.xml

The global configuration file core-site.xml contains the NameNode host address, its RPC listening port, and similar settings. For a pseudo-distributed installation the host address is localhost, and the NameNode's default RPC port is 8020. (The property name fs.default.name used below still works in Hadoop 2.x, but it is a deprecated alias of fs.defaultFS.) A minimal configuration looks like this:

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:8020</value>
        <final>true</final>
    </property>
</configuration>

2. Edit the configuration file etc/hadoop/hdfs-site.xml

hdfs-site.xml configures HDFS-related properties, such as the replication factor (the number of copies of each data block) and the directories the NameNode and DataNodes use to store data. For pseudo-distributed Hadoop the replication factor should be 1, and the storage directories are the paths created specifically for this purpose in the earlier step. Directories were also created for the Secondary NameNode, so its checkpoint paths are configured here as well.

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///data/hadoop/hdfs/nn</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///data/hadoop/hdfs/dn</value>
    </property>
    <property>
        <name>fs.checkpoint.dir</name>
        <value>file:///data/hadoop/hdfs/snn</value>
    </property>
    <property>
        <name>fs.checkpoint.edits.dir</name>
        <value>file:///data/hadoop/hdfs/snn</value>
    </property>
</configuration>

3. Edit the configuration file etc/hadoop/mapred-site.xml

mapred-site.xml configures the cluster's MapReduce framework, which should be set to yarn here; the other possible values are local and classic. mapred-site.xml does not exist by default, but there is a template file mapred-site.xml.template; simply copy it to mapred-site.xml.

[root@centos01 hadoop]# cp mapred-site.xml.template mapred-site.xml

Configure mapred-site.xml:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

4. Edit the configuration file etc/hadoop/yarn-site.xml

yarn-site.xml configures the YARN daemons and related YARN properties. First, specify the host and listening port of the ResourceManager daemon; for the pseudo-distributed model the host is localhost and the default port is 8032. Next, specify the scheduler used by the ResourceManager and the auxiliary services of the NodeManager.

<configuration>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>localhost:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>localhost:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>localhost:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>localhost:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>localhost:8088</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
    </property>
</configuration>

5. Edit the environment scripts etc/hadoop/hadoop-env.sh and etc/hadoop/yarn-env.sh

Hadoop depends on a specific Java environment; you can edit these two scripts, uncomment JAVA_HOME, and set it to an appropriate value. In addition, most Hadoop daemons use a 1 GB heap by default, but in practice the heap sizes of the various daemons may need adjusting, which only requires editing the relevant variables in these two files, such as HADOOP_HEAPSIZE in hadoop-env.sh and YARN_HEAPSIZE in yarn-env.sh.
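As a sketch, such a hadoop-env.sh fragment could look like the following; the heap value is purely illustrative, not a recommendation:

```shell
# etc/hadoop/hadoop-env.sh (illustrative values)
export JAVA_HOME=/usr/java/jdk1.8.0_191-amd64    # uncomment and point at the local JDK
export HADOOP_HEAPSIZE=2048                      # per-daemon heap upper limit, in MB
```

yarn-env.sh is adjusted the same way, using YARN_HEAPSIZE instead.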

6. Edit the slaves configuration file

The slaves file lists all slave nodes of the current cluster. For the pseudo-distributed model its content should be just localhost, which happens to be the file's default value, so in this case the file can be left unchanged.
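For reference, the file holds one slave hostname per line; in the pseudo-distributed case that is simply:

```shell
# etc/hadoop/slaves (default content): one slave hostname per line
localhost
```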

IV. Format HDFS

Before the HDFS NameNode is started, the directory it uses to store data must be initialized. If the directory specified by the dfs.namenode.name.dir property in hdfs-site.xml does not exist, the format command creates it automatically; if it already exists, make sure its permissions are set correctly, and note that formatting will then erase all data inside it and build a new file system. Run the initialization command as the hdfs user:

[root@centos01 ~]# su - hdfs
[hdfs@centos01 ~]$ hdfs namenode -format

The format succeeded if the output contains a line like the following:

......
20/07/30 09:59:31 INFO common.Storage: Storage directory /data/hadoop/hdfs/nn has been successfully formatted.
......

For reference, the hdfs command usage:

[root@centos01 ~]# hdfs --help
Usage: hdfs [--config confdir] [--loglevel loglevel] COMMAND
       where COMMAND is one of:
  dfs                  run a filesystem command on the file systems supported in Hadoop.
  classpath            prints the classpath
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  journalnode          run the DFS journalnode
  zkfc                 run the ZK Failover Controller daemon
  datanode             run a DFS datanode
  debug                run a Debug Admin to execute HDFS debug commands
  dfsadmin             run a DFS admin client
  dfsrouter            run the DFS router
  dfsrouteradmin       manage Router-based federation
  haadmin              run a DFS HA admin client
  fsck                 run a DFS filesystem checking utility
  balancer             run a cluster balancing utility
  jmxget               get JMX exported values from NameNode or DataNode.
  mover                run a utility to move block replicas across
                       storage types
  oiv                  apply the offline fsimage viewer to an fsimage
  oiv_legacy           apply the offline fsimage viewer to an legacy fsimage
  oev                  apply the offline edits viewer to an edits file
  fetchdt              fetch a delegation token from the NameNode
  getconf              get config values from configuration
  groups               get the groups which users belong to
  snapshotDiff         diff two snapshots of a directory or diff the
                       current directory contents with a snapshot
  lsSnapshottableDir   list all snapshottable dirs owned by the current user
						Use -help to see options
  portmap              run a portmap service
  nfs3                 run an NFS version 3 gateway
  cacheadmin           configure the HDFS cache
  crypto               configure HDFS encryption zones
  storagepolicies      list/get/set block storage policies
  version              print the version

Most commands print help when invoked w/o parameters.

V. Start Hadoop

Hadoop 2 daemons can be started and stopped with the dedicated scripts under the sbin directory:

  • NameNode: hadoop-daemon.sh (start|stop) namenode
  • DataNode: hadoop-daemon.sh (start|stop) datanode
  • Secondary NameNode: hadoop-daemon.sh (start|stop) secondarynamenode
  • ResourceManager: yarn-daemon.sh (start|stop) resourcemanager
  • NodeManager: yarn-daemon.sh (start|stop) nodemanager
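Alternatively, the wrapper scripts shipped in sbin start all of the related daemons in one step (they read the slaves file, which for the pseudo-distributed model contains only localhost):

```shell
# As the hdfs user: starts namenode, datanode, and secondarynamenode
start-dfs.sh
# As the yarn user: starts resourcemanager and nodemanager
start-yarn.sh
```

The per-daemon scripts give finer control and are what the following steps use.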

1. Start the HDFS services

HDFS has three daemons: namenode, datanode, and secondarynamenode. All of them can be started or stopped with the hadoop-daemon.sh script; run the commands as the hdfs user:

[root@centos01 sbin]# su - hdfs
Last login: Sat Aug 1 17:10:05 CST 2020 on pts/0
[hdfs@centos01 sbin]$ hadoop-daemon.sh start namenode
starting namenode, logging to /bdapps/hadoop/logs/hadoop-hdfs-namenode-centos01.out
[hdfs@centos01 sbin]$ hadoop-daemon.sh start datanode
starting datanode, logging to /bdapps/hadoop/logs/hadoop-hdfs-datanode-centos01.out
[hdfs@centos01 sbin]$ hadoop-daemon.sh start secondarynamenode
starting secondarynamenode, logging to /bdapps/hadoop/logs/hadoop-hdfs-secondarynamenode-centos01.out

Each of the three commands above prints a pointer to where its log output is saved; note, however, that the file actually used for logging ends in ".log", not ".out". Check the log files to confirm that each daemon started up cleanly. If all processes started normally, the Java process list can be inspected with the JDK's jps command:

[hdfs@centos01 sbin]$ jps
6817 SecondaryNameNode
7074 DataNode
7129 Jps
6700 NameNode

2. Start the YARN services

YARN has two daemons: resourcemanager and nodemanager. Both can be started or stopped with the yarn-daemon.sh script; run the commands as the yarn user:

[root@centos01 profile.d]# su - yarn
[yarn@centos01 sbin]$ yarn-daemon.sh start resourcemanager
starting resourcemanager, logging to /bdapps/hadoop/logs/yarn-yarn-resourcemanager-centos01.out
[yarn@centos01 sbin]$ yarn-daemon.sh start nodemanager
starting nodemanager, logging to /bdapps/hadoop/logs/yarn-yarn-nodemanager-centos01.out

Check the Java process status with jps:

[yarn@centos01 ~]$ jps
8625 NodeManager
8532 ResourceManager
9231 Jps

VI. Web UI Overview

HDFS and the YARN ResourceManager each provide a web interface through which the status of the HDFS and YARN clusters can be checked. Their access URLs are listed below; in practice, replace NameNodeHost and ResourceManagerHost with the corresponding host addresses:

  • HDFS-NameNode: http://<NameNodeHost>:50070/
  • YARN-ResourceManager: http://<ResourceManagerHost>:8088/

Note: if the yarn.resourcemanager.webapp.address property in yarn-site.xml is set to "localhost:8088", the web UI listens only on port 8088 of 127.0.0.1.
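To make the web UI reachable from other hosts, the property can instead be bound to all interfaces, as in this sketch:

```xml
<!-- yarn-site.xml: make the ResourceManager web UI listen on all interfaces -->
<property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>0.0.0.0:8088</value>
</property>
```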

 

VII. Run a Test Program

Hadoop ships with many sample programs, located in the share/hadoop/mapreduce/ directory under the Hadoop installation path; among them, hadoop-mapreduce-examples can be used to test MapReduce jobs.

[hdfs@centos01 ~]$ yarn jar /bdapps/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.0.jar
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.

For example, the pi program estimates the value of Pi (π) with a quasi-Monte Carlo method. The pi command takes two arguments: the first is the number of map tasks to run, and the second is the number of samples per map task; multiplying the two gives the total number of samples.
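A run with 10 map tasks and 1000 samples each (so 10000 samples in total) would look like this; the argument values are arbitrary and only for illustration:

```shell
# Run the pi example as the hdfs user: 10 map tasks, 1000 samples per task
yarn jar /bdapps/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.0.jar pi 10 1000
```

Larger values yield a closer estimate at the cost of a longer run; when the job finishes it prints its estimate of Pi.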

The fully distributed deployment is covered in the companion 琼杰笔记 article:

Hadoop Installation, Deployment, and Usage Notes: Distributed Mode

Source: 琼杰笔记. Do not reproduce without permission.