监控Yarn资源调度平台资源状态

数据科学和工程 2020-10-28

2664

背景

目前国内大部分企业级的大数据平台资源调度系统都是基于Yarn集群。生产环境上，各种大数据计算框架运行在Yarn上，就需要对Yarn平台的资源情况进行实时监控。虽然Yarn本身提供一个Web管理界面展示平台资源使用情况，但是这些运行状态数据需要实时获取和监控。随着智能化运维推进，需要对监控数据能实时分析、异常检测、自动故障处理。这些场景都需要能实时获取到Yarn平台的状态监控数据。

本文将详细介绍各种监控实现的方法，并重点介绍Java实现。

第一部分 Yarn状态数据接口

1.1 命令行方式

yarn
命令在{hadoop_home}/bin
路径下，对于部署hadoop
客户端的客户端需要加载命令环境变量。

参看任务信息

 # 查看所有任务信息
 yarn application -list
 # 查看正在运行的任务信息（带过滤参数appStates）
 yarn application -list -appStates RUNNING

这里参数appStates
的状态有：ALL,NEW,NEW_SAVING,SUBMITTED,ACCEPTED,RUNNING,FINISHED,FAILED,KILLED

另外还可以指定计算框架的类型，例如：

 # 参看所有MapReduce任务
 yarn application -list -appTypes MAPREDUCE

参看指定任务状态信息

 yarn application -status application_1575989345612_32134

1.2 `Restful Api`
接口

ResourceManager
允许用户通过REST API
获取有关群集的信息：群集上的状态、群集上的指标、调度程序信息，另外还有群集中节点的信息以及集群上应用程序的运行信息。

查询整个集群指标

 GET http://http address:port/ws/v1/cluster/metrics

查询集群调度器详情

 GET http://http address:port/ws/v1/cluster/scheduler

监控任务

 curl http://http address:port/ws/v1/cluster/apps/state
 GET http://http address:port/ws/v1/cluster/apps/state

查看指定任务

 GET http://http address:port/ws/v1/cluster/apps/

查看指定任务的详细信息

 curl http://http address:port/proxy/ws/v2/mapreduce/info

杀死任务

yarn application -kill application_id

 curl -v -X PUT -d '{"state": "KILLED"}' http://http address:port>/ws/v1/cluster/apps/
 PUT http://http address:port/ws/v1/cluster/apps/state

1.2 `JMX Metrics`
监控

首先需要开启jmx
，编辑{hadoop_home}/etc/hadoop/yarn-env.sh
配置文件，最后添加下面三行配置：

 YARN_OPTS="$YARN_OPTS -Dcom.sun.management.jmxremote.authenticate=false"
 YARN_OPTS="$YARN_OPTS -Dcom.sun.management.jmxremote.port=8001"
 YARN_OPTS="$YARN_OPTS -Dcom.sun.management.jmxremote.ssl=false"

其中8001
是服务监听端口。jmx
提供了Cluster、Queue、Jvm、FSQueue
等Metrics
信息。

 # 获取YARN相关的jmx
 http://http address:8088/jmx
 # 如果想获取NameNode的jmx
 http://http address:50070/jmx

上面的方式会获取服务所有的信息(json
格式)。如果需要精准获得准确信息，org.apache.hadoop.jmx.JMXJsonServlet
类支持三个参数：callback
、qry
、get
。其中qry
用于过滤，下面的url
用于查询Yarn
上spark
用户在default
队列上任务信息。

 http://192.168.1.2:8088/jmx?qry=Hadoop:service=ResourceManager,name=QueueMetrics,q0=root,q1=default,user=spark

更详细的信息参考官网：https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/jmx/JMXJsonServlet.html

1.3 `Python Api`
接口

对于Python
有第三方包支持和yarn
进行交互，github
地址为:https://github.com/CODAIT/hadoop-yarn-api-python-client

案例代码：

 from yarn_api_client import ApplicationMaster, HistoryServer, NodeManager, ResourceManager
 
 rm = ResourceManager(address='192.168.1.2', port=8088)
 # 获取到ResourceManager的所有apps的信息
 rm.cluster_applications().data
 # 获取到ResourceManager的具体任务的信息
 rm.cluster_application('application_1437445095118_265798').data

对于Hadoop
安全集群，还需要部署认证包requests_kerberos
。具体可以参考说明文档：https://python-client-for-hadoop-yarn-api.readthedocs.io/en/latest/index.html

第二部分 Java实现

2.1 maven依赖

根据Hadoop
的版本添加下面的依赖包：

         <dependency>
             <groupId>org.apache.hadoop</groupId>
             <artifactId>hadoop-yarn-api</artifactId>
             <version>2.7.2</version>
         </dependency>
         <dependency>
             <groupId>org.apache.hadoop</groupId>
             <artifactId>hadoop-yarn-client</artifactId>
             <version>2.7.2</version>
         </dependency>
         <dependency>
             <groupId>org.apache.hadoop</groupId>
             <artifactId>hadoop-client</artifactId>
             <version>2.7.2</version>
         </dependency>

2.2 接口实现

我们将相关配置文件放在resources/conf
路径下面，涉及的文件有：

 # 集群配置文件
 core-site.xml
 hdfs-site.xml
 yarn-site.xml
 # 安全认证文件
 user.keytab
 krb5.conf

下面是案例代码：

 package com.main.yarnmonitor;
 
 import com.google.common.collect.Sets;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.security.UserGroupInformation;
 import org.apache.hadoop.yarn.api.records.ApplicationReport;
 import org.apache.hadoop.yarn.api.records.ApplicationResourceUsageReport;
 import org.apache.hadoop.yarn.api.records.ContainerReport;
 import org.apache.hadoop.yarn.api.records.YarnApplicationState;
 import org.apache.hadoop.yarn.client.api.YarnClient;
 import org.apache.hadoop.yarn.exceptions.YarnException;
 
 import java.io.File;
 import java.io.IOException;
 import java.util.EnumSet;
 import java.util.List;
 import java.util.Set;
 
 /**
  * @program: yarnmonitor
  * @description:
  * @author: rongxiang
  * @create: 2020-03-27 16:44
  **/
 
 
 public class yarnmonitor {
     //配置文件路径
     private static  String confPath = Thread.currentThread().getContextClassLoader().getResource("").getPath()+ File.separator + "conf";
     
     public static void main(String[] args) {
         //加载配置文件
         Configuration configuration = initConfiguration(confPath);
         //初始化安全集群Kerberos配置
         initKerberosENV(configuration);
         //初始化Yarn 客户端
         YarnClient yarnClient = YarnClient.createYarnClient();
         yarnClient.init(configuration);
         yarnClient.start();
         try {
             //获得运行的任务应用清单
             List<ApplicationReport> applications = yarnClient.getApplications(EnumSet.of(YarnApplicationState.RUNNING));
             //存储需要关注的任务信息
             HashMap<String, ArrayList<String>> applicationInformation = new HashMap<>();
 
             for (ApplicationReport application:applications) {
                 String applicationType = application.getApplicationType();
                 // 只关注CPU核数资源使用超过500的任务
                 if (getApplicationInfo(application).get(1)>=500) {
                     applicationInformation.put(String.valueOf(application.getApplicationId()), new ArrayList<String>(){{
                         add(String.valueOf(application.getName()));
                         add(String.valueOf(application.getApplicationType()));
                         add(String.valueOf(application.getQueue()));
                         add(String.valueOf(getApplicationInfo(application).get(0)));
                         add(String.valueOf(getApplicationInfo(application).get(1)));
                         add(String.valueOf(getApplicationInfo(application).get(2)));
                     }});
                 }
                 System.out.println(applicationInformation);
             }
             //设置监控信息发送邮件
             if(!applicationInformation.isEmpty()){
                 //发送邮件
                 sendMailYarn();
             }
         } catch (YarnException e) {
             e.printStackTrace();
         } catch (IOException e) {
             e.printStackTrace();
         }
     }
     
     /**
      * 从ApplicationReport获取信息
      * @return applicationInfo
      */    
     public static ArrayList<Integer> getApplicationInfo(ApplicationReport Application) {
         ArrayList<Integer> applicationInfo = new ArrayList<>();
         ApplicationResourceUsageReport resourceReport = Application.getApplicationResourceUsageReport();
         if (resourceReport != null) {
             Resource usedResources = resourceReport.getUsedResources();
             int allocatedMb = usedResources.getMemory();
             int allocatedVcores = usedResources.getVirtualCores();
             int runningContainers = resourceReport.getNumUsedContainers();
             //赋值
             applicationInfo.add(allocatedMb);
             applicationInfo.add(allocatedVcores);
             applicationInfo.add(runningContainers);
         }
         return applicationInfo;
     }
 
     /**
      * 初始化YARN Configuration
      * @return configuration
      */
     public static Configuration initConfiguration(String confPath) {
         Configuration configuration = new Configuration();
         System.out.println(confPath + File.separator + "core-site.xml");
         configuration.addResource(new Path(confPath + File.separator + "core-site.xml"));
         configuration.addResource(new Path(confPath + File.separator + "hdfs-site.xml"));
         configuration.addResource(new Path(confPath + File.separator + "yarn-site.xml"));
         return configuration;
     }
 
     /**
      * 安全集群配置（如果非安全集群这无需该方法）
      */
     public static void initKerberosENV(Configuration conf) {
         System.setProperty("java.security.krb5.conf", confPath+File.separator+"krb5.conf");
         System.setProperty("javax.security.auth.useSubjectCredsOnly", "false");
         System.setProperty("sun.security.krb5.debug", "false");
         try {
             UserGroupInformation.setConfiguration(conf);
             UserGroupInformation.loginUserFromKeytab("user@HADOOP.COM", confPath+File.separator+"user.keytab");
             System.out.println(UserGroupInformation.getCurrentUser());
         } catch (IOException e) {
             e.printStackTrace();
         }
     }
 }

Yarn客户端YarnClient
中定义了方法getApplications
，获取到正在运行的任务清单，返回数据类型是：List<ApplicationReport>
，如下：

  List<ApplicationReport> applications = yarnClient.getApplications(EnumSet.of(YarnApplicationState.RUNNING));

对于数据类型ApplicationReport
具有方法getApplicationResourceUsageReport()
获得每个Yarn任务的ApplicationResourceUsageReport
(任务资源报告)：

 ApplicationResourceUsageReport resourceReport = Application.getApplicationResourceUsageReport();

ApplicationResourceUsageReport
提供了获取各类资源的方法：

 Resource usedResources = resourceReport.getUsedResources();
 //任务使用的内存资源
 int allocatedMb = usedResources.getMemory();
 //任务使用的CPU资源
 int allocatedVcores = usedResources.getVirtualCores();
 //任务使用的容器的数量
 int runningContainers = resourceReport.getNumUsedContainers();

第三部分总结

Java的案例中我们使用了HashMap
(applicationInformation
)数据类型存储关注的任务信息，然后使用邮件接口发出。在实际使用中可以根据需要存储在elasticsearch
集群。

另外对于其他方法，作者没有实际使用，可能存在部分信息未涵盖，可以参考官网文档使用。

参考文献及资料

1、YARN Application Security
，链接：https://hadoop.apache.org/docs/r2.7.4/hadoop-yarn/hadoop-yarn-site/YarnApplicationSecurity.html

2、ApplicationResourceUsageReport
接口，链接：http://hadoop.apache.org/docs/r2.7.0/api/org/apache/hadoop/yarn/api/records/ApplicationResourceUsageReport.html

3、ResourceManager REST API’s
，链接：https://hadoop.apache.org/docs/r2.7.3/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html

4、基于Yarn API
的Spark程序监控，链接:https://yq.aliyun.com/articles/710902

数据库

文章转载自数据科学和工程，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。

监控Yarn资源调度平台资源状态

目录

背景

第一部分 Yarn状态数据接口

1.1 命令行方式

1.2 `Restful Api`
接口

1.2 `JMX Metrics`
监控

1.3 `Python Api`
接口

第二部分 Java实现

2.1 maven依赖

2.2 接口实现

第三部分总结

参考文献及资料

评论

监控Yarn资源调度平台资源状态

目录

背景

第一部分 Yarn状态数据接口

1.1 命令行方式

1.2 Restful Api接口

1.2 JMX Metrics监控

1.3 Python Api接口

第二部分 Java实现

2.1 maven依赖

2.2 接口实现

第三部分 总结

参考文献及资料

评论

1.2 `Restful Api`
接口

1.2 `JMX Metrics`
监控

1.3 `Python Api`
接口

第三部分总结