
Still using Zabbix/Nagios? Drop it (Part 2)

一元殿 2019-09-01





High-Availability Monitoring Solution

----02. Deployment breakdown




Management of monitored servers


·       Monitor new online servers

o   Modify the initialization process for new servers

Before a new service goes live, install node_exporter uniformly,
start it as a background process, and open the required ports on firewalls and other network devices. A provisioning sketch follows below.
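
One way to fold this into server initialization; the version, install path, and firewalld usage here are assumptions for illustration, not values from the original setup:

 # Fetch and unpack node_exporter (version assumed; use an internal mirror if you have one)
 wget https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.linux-amd64.tar.gz
 mkdir -p /opt/node_exporter && tar -zxf node_exporter-0.18.1.linux-amd64.tar.gz -C /opt/node_exporter --strip-components=1

 # Start it in the background and open its default port (9100)
 nohup /opt/node_exporter/node_exporter > /opt/node_exporter/node_exporter.log 2>&1 &
 firewall-cmd --permanent --add-port=9100/tcp && firewall-cmd --reload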


o   Run in parallel with the traditional monitoring system (Nagios/Zabbix)

In the early days of the new monitoring system, install both Nagios's NRPE collection plugin and the Prometheus exporter. Then analyze and compare the accuracy of the collected data for a period of time.

o   Summary analysis

Compare 60 days of data to check the accuracy and timeliness of alerts. Then observe for a further period, tuning alert thresholds and collection intervals as appropriate. Finally, write up a comparison report and check whether anything still needs improvement.


·       Migrate live servers to the new monitoring system

o   Determine device type

First, distinguish the type of equipment, since different equipment needs a different installation procedure.
For example, Linux servers get the Linux collector, Windows servers get the Windows collector, and network equipment likewise has its own corresponding collector.


o   Server refinement management

Label servers differently for different purposes, install different collectors on servers with different purposes, and generate different dashboards for each group of servers. A labeling sketch follows below.
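
One place these purpose labels can live is the Prometheus scrape configuration; a minimal sketch, where the job name, target, and label values are hypothetical:

 - job_name: db_servers
   static_configs:
   - targets:
     - 10.0.0.10:9100          # hypothetical database host
     labels:
       purpose: database       # drives which dashboards and alert rules apply
       team: dba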


·       Migrate existing business applications to the new monitoring system

o   Get the necessary information

Get a list of all back-end servers

Get all application modules in use

Get all service API interfaces


o   Make a migration plan

Install the appropriate exporters on the servers according to their use
Generate different dashboards
Create an overview of the API health checks


o   Customized monitoring schemes

Research new monitoring metrics according to the business departments' requirements
Collect the new monitoring metrics for the business system
Create business trend dashboards. A sketch of one way to expose custom business metrics follows below.
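
A low-friction way to expose such custom metrics is node_exporter's textfile collector, which this document enables later via --collector.textfile.directory; a minimal sketch, where the metric name, value, and directory are hypothetical:

# Drop a business metric into the textfile-collector directory;
# a cron job or application hook would regenerate this file periodically
cat > /opt/node_exporter/metric/orders.prom <<'EOF'
# HELP business_orders_pending Number of orders waiting to be processed.
# TYPE business_orders_pending gauge
business_orders_pending 42
EOF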



Component information


Main Monitoring System

NFS

1. Purpose
NFS is a distributed file system that lets users access shared files on remote computers over the network as if they were on a local file system.
Here, we use it to store shared configuration files and similar information.

For more information about NFS, see http://nfs.sourceforge.net/

2. Installed version
NFS version: NFSv4 (server)

3. Mode of installation
Installed via yum

4. Installation configuration

yum install nfs-utils rpcbind -y
systemctl start rpcbind && systemctl enable rpcbind
systemctl start nfs-server && systemctl enable nfs-server

5. Server configuration
NFS is a C/S architecture program

  • For Server

echo "/var/log/fmw 10.67.194.0/24(rw)" >> /etc/exports
systemctl restart nfs
chmod -R 767 /var/log/fmw  


  • For Client:

mkdir /nfs
mount -t nfs 10.67.194.42:/var/log/fmw/ /nfs
echo "10.67.194.42:/var/log/fmw /nfs nfs defaults,_netdev 0 0" >> /etc/fstab

6. Server requirements

  •  Use NAS storage

  •  8CPU
     16GB Memory
     500G SSD disk storage

7. Supporting departments required
Network department

Storage department


Nginx

1. Purpose
NGINX is a free, open-source HTTP server and reverse proxy, known for its high performance, stability, rich feature set, simple configuration, and low resource consumption.
Here, we use it as a load balancer that distributes incoming traffic round-robin to the two back-end Grafana servers.

For more information about NGINX, see https://www.nginx.com/resources/wiki/
2. Installed version
nginx version: nginx/1.12.2

3. Mode of installation
Installed via yum
4. Installation configuration

yum install epel-release -y
yum install nginx -y
systemctl enable nginx && systemctl start nginx 

5. Server configuration

mv /etc/nginx/nginx.conf /etc/nginx/nginx.bak 

cat > /etc/nginx/nginx.conf <<'EOF'   # quoted 'EOF' stops the shell expanding $remote_addr etc. in log_format
user nginx;
worker_processes auto;
error_log /var/log/nginx/error.log;
pid /run/nginx.pid;

include /usr/share/nginx/modules/*.conf;

events {
    worker_connections 1024;
}

http {
    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  /var/log/nginx/access.log  main;
    charset  utf-8;
    sendfile            on;
    tcp_nopush          on;
    tcp_nodelay         on;
    keepalive_timeout   65;
    types_hash_max_size 2048;

    include             /etc/nginx/mime.types;
    default_type        application/octet-stream;

    include /etc/nginx/conf.d/*.conf;

    server {
        listen       80 default_server;
        listen       [::]:80 default_server;
        server_name  _;
        root         /usr/share/nginx/html;

        include /etc/nginx/default.d/*.conf;

        location / {
        root html;
        index  index.html index.htm;
        proxy_pass http://grafanaservers;
        }

        error_page 404 /404.html;
            location = /40x.html {
        }

        error_page 500 502 503 504 /50x.html;
            location = /50x.html {
        }
    }
    upstream grafanaservers {
        server 10.67.194.44:3000 weight=1;
        server 10.67.194.37:3000 weight=1;
    }
}

EOF

Do not forget to test and reload nginx, as shown below.
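
Both commands are standard nginx/systemd usage:

 nginx -t                  # test the configuration for syntax errors
 systemctl reload nginx    # apply it without dropping connections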

6. Server requirements
4CPU
 8GB Memory
 500GB disk storage

7. Supporting departments required
Managed by ourselves


Grafana

1. Purpose
Grafana is an open-source, feature-rich metrics dashboard and graph editor for Graphite, Elasticsearch, OpenTSDB, Prometheus, and InfluxDB.
Here, we use it to turn Prometheus data into better graphs.

For more information about Grafana, see https://grafana.com/

2. Installed version
Grafana version: Version 6.3.3 (commit: 84d0a71, branch: HEAD)

3. Mode of installation
Installed via yum

4. Installation configuration

This is a highly available Grafana pair, so perform the same steps on both servers.

  1. wget https://dl.grafana.com/oss/release/grafana-6.3.3-1.x86_64.rpm

The download can be slow. If you are on the homecredit.cn intranet, you can fetch the mirrored package instead:

  wget http://10.67.194.43:8099/grafana-6.3.3-1.x86_64.rpm

Then install the rpm package and start the service (the systemd unit installed by the rpm is grafana-server):

  2. yum localinstall grafana-6.3.3-1.x86_64.rpm

  3. systemctl enable grafana-server && systemctl start grafana-server
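
Grafana listens on port 3000 by default and exposes a health endpoint; a quick check on each server:

 curl -s http://localhost:3000/api/health    # should return a small JSON document reporting "database": "ok"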

5. Server configuration
This is a highly available Grafana pair, so perform the same steps on both servers.
Since the front-end nginx distributes traffic across these two Grafana servers, their configuration is saved on NFS and reached through soft links.

Both servers act as NFS clients; refer to the NFS client operations above.

Share the Grafana configuration:

 mkdir -p /nfs/grafana/{etc,lib}

 mv /etc/grafana /nfs/grafana/etc/
 mv /var/lib/grafana /nfs/grafana/lib/

 ln -s /nfs/grafana/etc/grafana /etc/grafana
 ln -s /nfs/grafana/lib/grafana /var/lib/grafana

Here is the configured grafana.ini:

cat > /nfs/grafana/etc/grafana/grafana.ini <<'EOF'
[paths]
[server]
enable_gzip = true
[database]
log_queries =
[remote_cache]
[dataproxy]
[analytics]
[security]
[snapshots]
[dashboards]
[users]
[auth]
[auth.anonymous]
[auth.github]
[auth.google]
[auth.generic_oauth]
[auth.grafana_com]
[auth.proxy]
[auth.basic]
[auth.ldap]
enabled = true
config_file = /etc/grafana/ldap.toml
[smtp]
enabled = true 
host = relay.homecredit.cn:25
user = hcg\hccservicedesk
from_address = uathccservicedesk@homecredit.cn
from_name = Grafana.hcctool.net
ehlo_identity = dashboard.example.com
[emails]
[log]
filters = ldap:debug
[log.console]
[log.file]
[log.syslog]
[alerting]
enabled = true
execute_alerts = true
error_or_timeout = alerting
[explore]
[metrics]
[metrics.graphite]
[tracing.jaeger]
[grafana_com]
[external_image_storage]
[external_image_storage.s3]
[external_image_storage.webdav]
[external_image_storage.gcs]
[external_image_storage.azure_blob]
[external_image_storage.local]
[rendering]
[enterprise]
[panels]
[plugins]

EOF

The files under /var/lib/grafana do not need to be changed; the remaining setup is done through the web UI.
6. Server requirements

  •  8CPU
     16GB Memory
     500G SSD disk storage

  •  8CPU
     16GB Memory
     500G SSD disk storage

7. Supporting departments required
Managed by ourselves


HAproxy

1. Purpose
HAProxy is free, open-source software written in C that provides high availability, load balancing, and proxying for HTTP- and TCP-based applications.
Here, we use it to keep Prometheus highly available: normally one server handles requests and the other stands by as a backup.

2. Installed version
HAProxy version: HA-Proxy version 1.9.0

3. Mode of installation
Compiled and installed from source

4. Installation configuration

 wget https://github.com/haproxy/haproxy/archive/v1.9.0.tar.gz
 tar -zxvf v1.9.0.tar.gz -C /usr/local/
 cd /usr/local/haproxy-1.9.0/
 make -j $(nproc) TARGET=linux2628    # linux2628 is the modern-Linux target in the 1.9.x Makefile; linux-glibc only exists from 2.0
 make install PREFIX=/usr/local/haproxy

5. Server configuration

 mkdir /usr/local/haproxy/etc

 cat >/usr/local/haproxy/etc/haproxy.cfg <<EOF
 global
 log 127.0.0.1  local0
 maxconn 4096
 chroot /usr/local/haproxy
 uid 99                          
 gid 99                         
 daemon                        
 nbproc 4                  
 pidfile /usr/local/haproxy/run/haproxy.pid  
 maxconn     300 

 defaults
 log    global
 log    127.0.0.1      local3        
 mode    http       
 option  httplog                       
 option  httpclose     
 option  dontlognull    
 option  forwardfor      
 option  redispatch            
 retries 2                   
 balance roundrobin                   
 timeout connect      5000              
 timeout client       50000             
 timeout server      50000              
 mode    http
 option  httpchk GET /graph         

 frontend http          
 bind 10.67.194.43:9090      
 default_backend http_back  

 backend http_back    
 server  s1 10.67.194.44:9090 check inter 2000 rise 3 fall 3   
 server  s2 10.67.194.37:9090 check inter 2000 rise 3 fall 3 backup   

 listen admin_stats
         bind 0.0.0.0:9188
         mode http
         stats uri /haproxy-status
EOF
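
The build above installs the binary to /usr/local/haproxy/sbin. A minimal sketch for validating the configuration and starting the service; note that the run directory for the pidfile must exist first:

 mkdir -p /usr/local/haproxy/run
 /usr/local/haproxy/sbin/haproxy -c -f /usr/local/haproxy/etc/haproxy.cfg    # -c checks the configuration and exits
 /usr/local/haproxy/sbin/haproxy -f /usr/local/haproxy/etc/haproxy.cfg       # the 'daemon' directive backgrounds it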

6. Server requirements
4CPU
 8GB Memory
 500G SSD disk storage

7. Supporting departments required
Managed by ourselves

Prometheus

1. Purpose
Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud.
Here, we use it to replace Nagios/Zabbix.
For a more elaborate overview of Prometheus, see https://prometheus.io/docs/introduction/media/

2. Installed version
prometheus, version 2.11.0

go version: go1.12.7

3. Mode of installation
Binary installation

4. Installation configuration
This is a highly available Prometheus pair, so perform the same steps on both servers.

 wget https://github.com/prometheus/prometheus/releases/download/v2.11.0/prometheus-2.11.0.linux-amd64.tar.gz

The download can be slow. If you are on the homecredit.cn intranet, you can fetch the mirrored package instead:

 wget http://10.67.194.43:8099/prometheus-2.11.0.linux-amd64.tar.gz

And then:

 mkdir -p /opt/prometheus
 tar -zxvf prometheus-*.tar.gz -C /opt/prometheus
 cd /opt/prometheus/prometheus-2.11.0.linux-amd64

5. Server configuration
This is a highly available Prometheus pair, so perform the same steps on both servers.

 mkdir -p /nfs/prometheus/{config,alert_rules,record_rules}

cat >/nfs/prometheus/alert_rules/alertingrules.yml <<'EOF'   # quoted 'EOF' keeps {{ $labels }} and {{ $value }} literal
groups:
- name: node
  rules:

  # Alert for any instance that is unreachable for >5 minutes.
  - alert: InstanceDown
    expr: up{job!="ssh_status"} == 0
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 3 minutes."

  # Alert for any node instance's cpu that is greater than 80%/90% for >3m.
  - alert: CPUUsedPcentTooGreater
    expr: |
      instance:node_cpu_used:pcent{instance!~"10.111.4.114:.*",job!="cnode"} > 90
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.instance }} cpu used percent too greater"
      description: |
        Test Node {{ $labels.instance }} cpu used percent greater than 90%
        (current value: {{ $value | printf "%.1f" }}%)
  - alert: CPUUsedPcentTooGreater
    expr: |
      instance:node_cpu_used:pcent{instance!~"10.111.4.114:.*",job!="cnode"} > 80
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "Node {{ $labels.instance }} cpu used percent too greater"
      description: |
        Node {{ $labels.instance }} cpu used percent greater than 80%
        (current value: {{ $value | printf "%.1f" }}%)

  # Alert for any node instance's memory that is greater than 80%/90% for >3m.
  - alert: MemoryUsedPcentTooGreater
    expr: instance:node_memory_MemUsed:pcent{instance!="10.111.2.4:9100"} > 90
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.instance }} memory used percent too greater"
      description: |
        Node {{ $labels.instance }} memory used percent greater than 90%
        (current value: {{ $value | printf "%.1f" }}%)
  - alert: MemoryUsedPcentTooGreater
    expr: instance:node_memory_MemUsed:pcent{instance!="10.111.2.4:9100"} > 80
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "Node {{ $labels.instance }} memory used percent too greater"
      description: |
        Node {{ $labels.instance }} memory used percent greater than 80%
        (current value: {{ $value | printf "%.1f" }}%)

  # Alert for any filesystem.
  - alert: FilesystemUsedPcentTooGreater
    expr: mountpoint:node_filesystem:pcent{cluster="paracloud"} > 90
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "Mountpoint {{ $labels.mountpoint }} used percent too greater"
      description: >
        Node {{ $labels.instance }} mountpoint {{ $labels.mountpoint }} used percent greater than 90%
        (current value: {{ $value | printf "%.1f" }}%)
  - alert: FilesystemUsedPcentTooGreater
    expr: mountpoint:node_filesystem:pcent{cluster="paracloud"} > 80
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "Mountpoint {{ $labels.mountpoint }} used percent too greater"
      description: >
        Node {{ $labels.instance }} mountpoint {{ $labels.mountpoint }} used percent greater than 80%
        (current value: {{ $value | printf "%.1f" }}%)
  - alert: FilesystemIUsedPcentTooGreater
    expr: mountpoint:node_filesystem_files:ipcent{cluster="paracloud"} > 90
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "Mountpoint {{ $labels.mountpoint }} inode used percent too greater"
      description: >
        Node {{ $labels.instance }} mountpoint {{ $labels.mountpoint }} inode used percent greater than 90%
        (current value: {{ $value | printf "%.1f" }}%)
  - alert: FilesystemIUsedPcentTooGreater
    expr: mountpoint:node_filesystem_files:ipcent{cluster="paracloud"} > 80
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "Mountpoint {{ $labels.mountpoint }} inode used percent too greater"
      description: >
        Node {{ $labels.instance }} mountpoint {{ $labels.mountpoint }} inode used percent greater than 80%
        (current value: {{ $value | printf "%.1f" }}%)

  # Alert for GZ ln31 /home avail less than 5GiB
  - alert: FilesystemAvailTooLess
    expr: node_filesystem_avail_bytes{job="node",cluster="gz",mountpoint="/home"} < 5368709120
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "Mountpoint {{ $labels.mountpoint }} avail too less"
      description: "Node {{ $labels.instance }} mountpoint {{ $labels.mountpoint }} less than 5GiB (current value: {{ $value | humanize1024 }}B)"

EOF

Next, create /nfs/prometheus/config/prometheus.yml:

cat >/nfs/prometheus/config/prometheus.yml <<'EOF'   # quoted 'EOF' keeps the ${1} in relabel_configs literal
global:
  scrape_interval: 10s
  scrape_timeout: 10s
  evaluation_interval: 10s
  external_labels:
    hczone: china
    datacenter: wuqing
    location: wuqing
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 10.67.194.37:9093
    scheme: http
    timeout: 10s
rule_files:
# These paths match the soft links created below
- /opt/prometheus/prometheus-2.11.0.linux-amd64/record_rules/recordingrules.yml
- /opt/prometheus/prometheus-2.11.0.linux-amd64/alert_rules/alertingrules.yml
scrape_configs:

- job_name: prometheus
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - 10.67.194.37:9090
    labels:
      cluster: tooltest
      service: monitor
      env: test

- job_name: consul 
  scrape_interval: 5s
  scrape_timeout: 5s
  metrics_path: /metrics
  scheme: http
  consul_sd_configs:
    - server: 10.67.194.37:8500
      services: []
      tags:
        - tooltest
        - test
  relabel_configs:        
    #- source_labels: [__meta_consul_service]
    #  target_label: job
    - regex: __meta_consul_service_metadata_(.*)
      replacement: '${1}'
      action: labelmap

- job_name: http_get 
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /probe
  params:
    module: [http_2xx]  # Look for a HTTP 200 response.
  static_configs:
    - targets:
      - http://baidu.com
      - https://baidu.com
      - http://www.google.com
      - https://www.google.com
      labels:
        cluster: tooltest
        env: test
        service: hcpay
    - targets:
      - http://10.67.194.40:80  # Target to probe with http on port 80.
      - http://10.67.194.41:9997 # Target to probe with http on port 9997
      labels:
        cluster: tooltest
        env: test
        service: capp 
  relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 10.67.194.37:9115  # The blackbox exporter's real hostname:port.   

- job_name: ping 
  scrape_interval: 1s
  scrape_timeout: 1s
  metrics_path: /probe
  params:
    module: [icmp]  # ICMP echo (ping) probe.
  static_configs:
    - targets:
      - baidu.com
      - 10.25.13.105
      - 10.29.248.11  
      - 10.67.194.41 
      - 223.5.5.5
      - google.com  
      labels:
        cluster: tooltest
        env: test
        service: ping_check
  relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 10.67.194.37:9115  # The blackbox exporter's real hostname:port.                  

- job_name: ssh_check 
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /probe
  params:
    module: [ssh_banner]  # SSH banner probe.
  static_configs:
    - targets:
      - 10.25.13.105 
      - 10.67.194.41
      - 10.67.194.42  
      labels:
        cluster: tooltest
        env: test
        service: ssh_check
  relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 10.67.194.37:9115  # The blackbox exporter's real hostname:port.   

- job_name: jira_uat
  scheme: https
  metrics_path: '/plugins/servlet/prometheus/metrics'
  static_configs:
    - targets: ['jira-uat.homecredit.cn']

EOF

Next, create /nfs/prometheus/record_rules/recordingrules.yml:

cat >/nfs/prometheus/record_rules/recordingrules.yml <<'EOF'
groups:
- name: node
  rules:
  # Filesystem
  - record: mountpoint:node_filesystem:used
    expr: >
      node_filesystem_size_bytes{device=~"/dev/.*", mountpoint!~"/var/lib/docker/.*"}
      - node_filesystem_free_bytes{device=~"/dev/.*", mountpoint!~"/var/lib/docker/.*"}
  - record: mountpoint:node_filesystem:pcent
    expr: >
      mountpoint:node_filesystem:used{device=~"/dev/.*", mountpoint!~"/var/lib/docker/.*"}
      * 100
      / (
        mountpoint:node_filesystem:used{device=~"/dev/.*", mountpoint!~"/var/lib/docker/.*"}
        + node_filesystem_avail_bytes{device=~"/dev/.*", mountpoint!~"/var/lib/docker/.*"}
      )
  - record: mountpoint:node_filesystem_files:iused
    expr: >
      node_filesystem_files{device=~"/dev/.*", mountpoint!~"/var/lib/docker/.*"}
      - node_filesystem_files_free{device=~"/dev/.*", mountpoint!~"/var/lib/docker/.*"}
  - record: mountpoint:node_filesystem_files:ipcent
    expr: >
      mountpoint:node_filesystem_files:iused{device=~"/dev/.*", mountpoint!~"/var/lib/docker/.*"}
      * 100
      / node_filesystem_files{device=~"/dev/.*", mountpoint!~"/var/lib/docker/.*"}

  - record: instance:node_memory_MemUsed:pcent
    expr: |
      (
        node_memory_MemTotal_bytes
        - node_memory_MemAvailable_bytes
      )
      * 100
      / node_memory_MemTotal_bytes

  - record: instance:node_cpu_used:pcent
    expr: |
      100 
      - (
          sum(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (service,cluster,job,instance) 
          / count(node_cpu_seconds_total{mode="idle"}) by (service,cluster,job,instance)
      )
      * 100

EOF

Make the soft links:

 ln -s /nfs/prometheus/alert_rules  /opt/prometheus/prometheus-2.11.0.linux-amd64/alert_rules
 ln -s /nfs/prometheus/config       /opt/prometheus/prometheus-2.11.0.linux-amd64/config
 ln -s /nfs/prometheus/record_rules /opt/prometheus/prometheus-2.11.0.linux-amd64/record_rules
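
Before writing the startup scripts, the configuration can be validated with promtool, which ships in the Prometheus tarball:

 cd /opt/prometheus/prometheus-2.11.0.linux-amd64
 ./promtool check config config/prometheus.yml    # also checks the rule files it references
 ./promtool check rules alert_rules/alertingrules.yml record_rules/recordingrules.yml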

Make the startup script

  • Server 1

cat > /opt/prometheus/prometheus-2.11.0.linux-amd64/run.sh << 'EOF'   # quoted 'EOF' keeps ${ROOT_DIR} etc. literal until the script runs
#!/bin/bash

set -o nounset
set -o errexit
#set -x

readonly ROOT_DIR=/opt/prometheus/prometheus-2.11.0.linux-amd64/
readonly CONFIG_DIR=${ROOT_DIR}/config
readonly DATA_DIR=${ROOT_DIR}/data
readonly LOG_FILE=${ROOT_DIR}/log/prometheus.log

export PATH=$PATH:${ROOT_DIR}

nohup prometheus --config.file=${CONFIG_DIR}/prometheus.yml --storage.tsdb.path=${DATA_DIR} --web.external-url=http://10.67.194.44:9090 --web.enable-lifecycle --web.enable-admin-api --web.console.templates=${ROOT_DIR}/consoles --web.console.libraries=${ROOT_DIR}/console_libraries > ${LOG_FILE} 2>&1 &
EOF

  • Server 2

cat > /opt/prometheus/prometheus-2.11.0.linux-amd64/run.sh << 'EOF'   # quoted 'EOF' keeps ${ROOT_DIR} etc. literal until the script runs
#!/bin/bash

set -o nounset
set -o errexit
#set -x

readonly ROOT_DIR=/opt/prometheus/prometheus-2.11.0.linux-amd64/
readonly CONFIG_DIR=${ROOT_DIR}/config
readonly DATA_DIR=${ROOT_DIR}/data
readonly LOG_FILE=${ROOT_DIR}/log/prometheus.log

export PATH=$PATH:${ROOT_DIR}

nohup prometheus --config.file=${CONFIG_DIR}/prometheus.yml --storage.tsdb.path=${DATA_DIR} --web.external-url=http://10.67.194.37:9090 --web.enable-lifecycle --web.enable-admin-api --web.console.templates=${ROOT_DIR}/consoles --web.console.libraries=${ROOT_DIR}/console_libraries > ${LOG_FILE} 2>&1 &
EOF
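
Then make the script executable on each server and verify the instance; /-/healthy is a built-in Prometheus 2.x endpoint:

 cd /opt/prometheus/prometheus-2.11.0.linux-amd64
 mkdir -p log data
 chmod +x run.sh && ./run.sh
 curl -s http://localhost:9090/-/healthy    # should report that Prometheus is healthy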

6. Server requirements

  •  8CPU
     16GB Memory
     500G SSD disk storage

  •  8CPU
     16GB Memory
     500G SSD disk storage

7. Supporting departments required
Managed by ourselves


Alertmanager

1. Purpose
Alertmanager receives alerts from Prometheus. It supports a rich set of notification channels and handles alert deduplication, noise reduction, grouping, and policy-based routing.

For more information about Alertmanager, see https://prometheus.io/docs/alerting/alertmanager/

2. Installed version
Alertmanager version: 0.18.0

3. Mode of installation
Binary installation

4. Installation configuration
The interface of the alertmanager was specified in the Prometheus.yml
file above

wget https://github.com/prometheus/alertmanager/releases/download/v0.18.0/alertmanager-0.18.0.linux-amd64.tar.gz

The download can be slow. If you are on the homecredit.cn intranet, you can fetch the mirrored package instead:

wget http://10.67.194.43:8099/alertmanager-0.18.0.linux-amd64.tar.gz

And then:

mkdir -p /opt/alertmanager
tar -zxvf alertmanager-0.18.0.linux-amd64.tar.gz -C /opt/alertmanager

5. Server configuration

 cd /opt/alertmanager/alertmanager-0.18.0.linux-amd64
 mkdir config

cat > config/alertmanager.yml << 'EOF'
global:
  resolve_timeout: 5m
  smtp_from: 'uathccservicedesk@homecredit.cn'
  smtp_smarthost: 'relay.homecredit.cn:25'
  smtp_auth_username: 'hcg\hccservicedesk'
  smtp_auth_password: ''
  smtp_require_tls: false

route:
  group_by: ['instance']
  group_wait: 10m
  group_interval: 10s
  repeat_interval: 10m
  receiver: 'email.alert.prometheusalert'

receivers:
- name: 'email.alert.prometheusalert'
  email_configs:
  - to: 'XXX@homecredit.cn'
EOF


Make the startup script

cat > run.sh << 'EOF'   # quoted 'EOF' keeps ${ROOT_DIR} etc. literal until the script runs
#!/bin/bash

set -o nounset
set -o errexit
#set -x

readonly ROOT_DIR=/opt/alertmanager/alertmanager-0.18.0.linux-amd64
readonly CONFIG_DIR=${ROOT_DIR}/config
readonly DATA_DIR=${ROOT_DIR}/data
readonly LOG_FILE=${ROOT_DIR}/log/alertmanager.log

export PATH=$PATH:${ROOT_DIR}

nohup alertmanager --config.file=${CONFIG_DIR}/alertmanager.yml --storage.path=${DATA_DIR} --log.level=debug > ${LOG_FILE} 2>&1 &       
EOF
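
amtool, shipped in the same tarball, can validate the configuration before the service is started:

 ./amtool check-config config/alertmanager.yml
 mkdir -p log data
 chmod +x run.sh && ./run.sh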


6. Server requirements
4CPU
 8GB Memory
 500G disk storage

7. Supporting departments required
Managed by ourselves


Consul

1. Purpose
Consul is a highly available, distributed tool for service discovery and shared configuration that supports multiple data centers.
Here, we use Consul as the service-discovery mechanism for Prometheus, so that configuration changes do not require frequent Prometheus restarts.

For more information about Consul, see https://www.consul.io/

2. Installed version
Consul version: v1.6.0

3. Mode of installation
Binary installation

4. Installation configuration
This is a highly available Consul cluster, so perform the same steps on every node.

 wget https://releases.hashicorp.com/consul/1.6.0/consul_1.6.0_linux_amd64.zip

The download can be slow. If you are on the homecredit.cn intranet, you can fetch the mirrored package instead:

 wget http://10.67.194.43:8099/consul_1.6.0_linux_amd64.zip

And then unzip the archive and put the binary in place:

 unzip consul_1*
 mkdir /opt/consul
 cp consul /opt/consul && cd /opt/consul

5. Server configuration
There are four nodes in total: three servers and one client.

  • For server 1

 mkdir -p /opt/consul/data
 mkdir /etc/consul.d
 cat > /opt/consul/run-s1.sh << 'EOF'   # quoted 'EOF' keeps $(ip a ...) and ${DIR} literal until run time
 #!/bin/bash
 DIR=/opt/consul/data
 ADDRESS=$(ip a | grep inet | egrep 'eth|ens' |awk '{print $2}' |awk -F "/"  '{print $1}')
 CONSUL_CONFIG=/etc/consul.d
 nohup ./consul agent -server \
 -bootstrap-expect 3 \
 -data-dir=${DIR} \
 -config-dir=${CONSUL_CONFIG} \
 -node=${ADDRESS} \
 -bind=${ADDRESS} \
 -advertise=${ADDRESS} \
 -client=0.0.0.0 \
 -ui \
 >./consul.out  2>&1 &
EOF


  • For server 2

 mkdir -p /opt/consul/data
 mkdir /etc/consul.d
 cat > /opt/consul/run-s2.sh << 'EOF'   # quoted 'EOF' keeps $(ip a ...) and ${DIR} literal until run time
 #!/bin/bash
 DIR=/opt/consul/data
 ADDRESS=$(ip a | grep inet | egrep 'eth|ens' |awk '{print $2}' |awk -F "/"  '{print $1}')
 CONSUL_CONFIG=/etc/consul.d
 nohup ./consul agent -server \
 -bootstrap-expect 3 \
 -data-dir=${DIR} \
 -config-dir=${CONSUL_CONFIG} \
 -node=${ADDRESS} \
 -bind=${ADDRESS} \
 -advertise=${ADDRESS} \
 -client=0.0.0.0 \
 -join=10.67.194.42 \
 -ui \
 >./consul.out  2>&1 &
EOF


  • For server 3
    The script is exactly the same as for server 2

  • For client 1

 mkdir -p /opt/consul/data
 mkdir /etc/consul.d
 cat > /opt/consul/run-c1.sh << 'EOF'   # without -server the agent runs in client mode; -data-dir is required for clients too
 #!/bin/bash
 DIR=/opt/consul/data
 ADDRESS=$(ip a | grep inet | egrep 'eth|ens' |awk '{print $2}' |awk -F "/"  '{print $1}')
 nohup ./consul agent \
 -data-dir=${DIR} \
 -config-dir=/etc/consul.d \
 -bind=${ADDRESS} \
 -client=0.0.0.0 \
 -join=10.67.194.42 \
 >./consul.out  2>&1 &
EOF


View the cluster membership with the following command:

 ./consul members

Here is an example of a JSON service definition:

 cat > node.json << 'EOF'
 {
   "ID": "wqdcsrv031.cn.infra_node_exporter",
   "Name": "node",
   "Tags": [
     "node_exporter",
     "rhel",
     "test",
     "tooltest"
   ],
   "Address": "10.67.194.40",
   "Port": 10010,
   "Meta": {
     "zone": "WUQING",
     "cluster": "tooltest",
     "service": "splunk",
     "context": "monitoring",
     "ostype": "rhel",
     "env": "test"
   },
   "EnableTagOverride": false,
   "Checks": [
     {
       "HTTP": "http://10.67.194.40:10010/metrics",
       "Interval": "10s"
     }
   ]
 }
EOF

Register the service through the Consul API (note the @, which makes curl send the file's contents rather than the literal string):

curl -X PUT --data @node.json http://10.67.194.41:8500/v1/agent/service/register

10.67.194.41 is the address of the client node.
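
To confirm the registration, query the agent's service list (a standard Consul endpoint):

 curl -s http://10.67.194.41:8500/v1/agent/services    # the new service should appear, keyed by its ID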

6. Server requirements

  •  4CPU
     8GB Memory
     500G  disk storage

  •  4CPU
     8GB Memory
     500G  disk storage

  •  4CPU
     8GB Memory
     500G  disk storage

  •  4CPU
     8GB Memory
     500G  disk storage

7. Supporting departments required
Managed by ourselves


Monitoring Metrics Collection

For regular servers, we just need to install node_exporter; the rest are optional, per-use-case plug-ins.

node_exporter

1. Purpose
As the exporter to the Linux server, contains most of the data we are likely to use
Here, we’re using node_exporter to provide metrics to Prometheus

2. Installed version
node_exporter version: 0.18.1

3. Mode of installation
No complicated operations are required; it is again started from the binary package.

4. Installation configuration

 wget https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.linux-amd64.tar.gz

The download can be slow. If you are on the homecredit.cn intranet, you can fetch the mirrored package instead:

 wget http://10.67.194.43:8099/node_exporter-0.18.1.linux-amd64.tar.gz

 mkdir /opt/node_exporter && tar -zxvf node_exporter-* -C /opt/node_exporter --strip-components=1    # puts the binary directly under /opt/node_exporter, matching the script below

5. Server configuration

 #!/bin/bash

 set -o nounset
 set -o errexit
 #set -x

 readonly ROOT_DIR=/opt/node_exporter
 readonly METRIC_DIR=${ROOT_DIR}/metric
 readonly LOG_FILE=${ROOT_DIR}/node_exporter.log

 export PATH=$PATH:${ROOT_DIR}

 nohup node_exporter --collector.logind \
 --collector.processes \
 --collector.systemd \
 --collector.tcpstat \
 --collector.textfile.directory=${METRIC_DIR} \
 --no-collector.mdadm > ${LOG_FILE} 2>&1 &
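
node_exporter listens on port 9100 by default; create the textfile-collector directory and check the endpoint:

 mkdir -p /opt/node_exporter/metric
 curl -s http://localhost:9100/metrics | head    # should print node_* metrics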

6. Supporting departments required

Linux team

     

jvm_exporter

1. Purpose
Monitoring JVM status

2. Installed version

jmx_prometheus_javaagent-0.12.0.jar

3. Mode of installation

The agent is attached to the JVM at startup as a -javaagent jar

4. Installation configuration

wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.12.0/jmx_prometheus_javaagent-0.12.0.jar

5. Server configuration

java -javaagent:./jmx_prometheus_javaagent-0.12.0.jar=8080:config.yaml -jar yourJar.jar
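
The config.yaml referenced above is not shown in the original; a minimal catch-all sketch that converts every matching MBean attribute into a metric:

cat > config.yaml << 'EOF'
rules:
- pattern: ".*"
EOF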



Oracle_exporter

Refer to https://github.com/iamseth/oracledb_exporter

Mysqld_exporter

Refer to https://github.com/prometheus/mysqld_exporter

Consul_exporter

Refer to https://github.com/prometheus/consul_exporter

Jira_exporter

Refer to https://github.com/AndreyVMarkelov/jira-prometheus-exporter

Memcached_exporter

Refer to https://github.com/prometheus/memcached_exporter

HAproxy_exporter

Refer to https://github.com/prometheus/haproxy_exporter

etc…


For The Future


[Figure: future plans (image not available)]