Published 2020-03-07 · Updated 2024-05-31 · Java

Monitoring Spring Boot Service Health and Alerting with Prometheus

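Once we have shipped a service such as a Web API, we like to think we can finally relax and play a few rounds of games. Reality disagrees: sometimes the service suddenly falls over and we don't even notice, happily grinding ranked matches while our manager, the customer, and the customer's manager all drop by to "check in" and "cheer us on". That is only one scenario, but it is enough to show how much service monitoring matters.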

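High availability for a service usually comes down to monitoring plus redundancy: when one instance dies, the others keep serving traffic, operations is alerted promptly, and the dead node is brought back quickly. Of course, a small company like ours has limited resources, so we need a monitoring system that is quick to deploy and cheap to run. Prometheus, an open-source and widely used monitoring system, fits the bill, so let's set it up.
Here are the Prometheus components we will use: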

Prometheus: the server, which collects and stores the data
Exporter: exposes the local service's metrics to Prometheus
AlertManager: receives alerts and sends out notifications

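Installing Prometheus with Docker

Creating the Prometheus configuration and rule files

First we need to configure Prometheus. The full configuration reference is at https://prometheus.io/docs/prometheus/latest/configuration/configuration/ and it is worth getting familiar with a few of the parameters: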

global:
  # How frequently to scrape targets by default.
  [ scrape_interval: <duration> | default = 1m ]

  # How long until a scrape request times out. 
  [ scrape_timeout: <duration> | default = 10s ]

  # How frequently to evaluate rules.
  [ evaluation_interval: <duration> | default = 1m ]

  # The labels to add to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    [ <labelname>: <labelvalue> ... ]

  # File to which PromQL queries are logged.
  # Reloading the configuration will reopen the file.
  [ query_log_file: <string> ]

# Rule files specifies a list of globs. Rules and alerts are read from
# all matching files.
rule_files:
  [ - <filepath_glob> ... ]

# A list of scrape configurations.
scrape_configs:
  [ - <scrape_config> ... ]

# Alerting specifies settings related to the Alertmanager.
alerting:
  alert_relabel_configs:
    [ - <relabel_config> ... ]
  alertmanagers:
    [ - <alertmanager_config> ... ]

# Settings related to the remote write feature.
remote_write:
  [ - <remote_write> ... ]

# Settings related to the remote read feature.
remote_read:
  [ - <remote_read> ... ]

Now create a prometheus.yml with the following configuration:

global:
    scrape_interval:     30s
    evaluation_interval: 30s 
alerting:       # address of the AlertManager
    alertmanagers:
    - static_configs:
      - targets: [ '182.61.35.33:9093']
scrape_configs:
    - job_name: 'spring boot'
      scrape_interval: 5s
      metrics_path: '/actuator/prometheus'
      static_configs:
        - targets: ['182.61.35.33:8080']
          labels: 
            appname: 'app'
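
As a minimal sketch of a job that covers more than one application, static_configs can list several target groups, each with its own labels. The second address (port 8081) and the label value app-2 are hypothetical placeholders, not part of the setup in this post.

```yaml
scrape_configs:
  - job_name: 'spring boot'
    metrics_path: '/actuator/prometheus'
    static_configs:
      # First instance (the one used in this post)
      - targets: ['182.61.35.33:8080']
        labels:
          appname: 'app'
      # Second instance: hypothetical address, shown only for illustration
      - targets: ['182.61.35.33:8081']
        labels:
          appname: 'app-2'
```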

A single job can cover multiple applications, each listed under static_configs. Now let's run Prometheus with Docker:

docker run -d -p 9090:9090 --name=prometheus \
 -v  /root/prometheus/:/etc/prometheus/  \
prom/prometheus

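At this point you can open the Prometheus web UI (port 9090 in the command above).

But neither AlertManager nor our application is running yet, so let's take care of those step by step.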

Connecting Spring Boot to Prometheus

Creating the project

Here we create a new application and add the following dependencies:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<!-- https://mvnrepository.com/artifact/io.micrometer/micrometer-registry-prometheus -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
    <version>1.3.5</version>
</dependency>

Configuring Prometheus metrics

Add the following to the application's configuration file:

management:
  metrics:
    export:
      prometheus:
        enabled: true
  endpoints:
    web:
      exposure:
        include: health,info,env,prometheus,metrics,httptrace,threaddump,heapdump
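
The property names above match the Spring Boot 2.2.x line used in this post. To the best of my knowledge, Spring Boot 3.x moved the exporter switch to management.prometheus.metrics.export.enabled and renamed the httptrace endpoint to httpexchanges, so on a newer version the equivalent block would look roughly like the sketch below. Treat it as an assumption and check the reference documentation for your version.

```yaml
# Sketch for Spring Boot 3.x (assumed property names, not tested against this project)
management:
  prometheus:
    metrics:
      export:
        enabled: true
  endpoints:
    web:
      exposure:
        include: health,info,env,prometheus,metrics,httpexchanges,threaddump,heapdump
```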

Start the Spring Boot application and request the /actuator/prometheus path; it returns output like the following:

# HELP jvm_threads_states_threads The current number of threads having NEW state
# TYPE jvm_threads_states_threads gauge
jvm_threads_states_threads{state="runnable",} 11.0
jvm_threads_states_threads{state="blocked",} 0.0
jvm_threads_states_threads{state="waiting",} 12.0
jvm_threads_states_threads{state="timed-waiting",} 3.0
jvm_threads_states_threads{state="new",} 0.0
jvm_threads_states_threads{state="terminated",} 0.0
# HELP jvm_gc_pause_seconds Time spent in GC pause
# TYPE jvm_gc_pause_seconds summary
jvm_gc_pause_seconds_count{action="end of major GC",cause="Metadata GC Threshold",} 1.0
jvm_gc_pause_seconds_sum{action="end of major GC",cause="Metadata GC Threshold",} 0.053
jvm_gc_pause_seconds_count{action="end of minor GC",cause="Metadata GC Threshold",} 1.0
jvm_gc_pause_seconds_sum{action="end of minor GC",cause="Metadata GC Threshold",} 0.008
# HELP jvm_gc_pause_seconds_max Time spent in GC pause
# TYPE jvm_gc_pause_seconds_max gauge
jvm_gc_pause_seconds_max{action="end of major GC",cause="Metadata GC Threshold",} 0.053
jvm_gc_pause_seconds_max{action="end of minor GC",cause="Metadata GC Threshold",} 0.008
# HELP jvm_buffer_memory_used_bytes An estimate of the memory that the Java virtual machine is using for this buffer pool
# TYPE jvm_buffer_memory_used_bytes gauge
jvm_buffer_memory_used_bytes{id="direct",} 8192.0
jvm_buffer_memory_used_bytes{id="mapped",} 0.0
# HELP tomcat_sessions_active_max_sessions  
# TYPE tomcat_sessions_active_max_sessions gauge
tomcat_sessions_active_max_sessions 0.0
# HELP process_uptime_seconds The uptime of the Java virtual machine
# TYPE process_uptime_seconds gauge
process_uptime_seconds 14.293
# HELP logback_events_total Number of error level events that made it to the logs
# TYPE logback_events_total counter
logback_events_total{level="warn",} 0.0
logback_events_total{level="debug",} 0.0
logback_events_total{level="error",} 0.0
logback_events_total{level="trace",} 0.0
logback_events_total{level="info",} 8.0
# HELP jvm_gc_memory_allocated_bytes_total Incremented for an increase in the size of the young generation memory pool after one GC to before the next
# TYPE jvm_gc_memory_allocated_bytes_total counter
jvm_gc_memory_allocated_bytes_total 7.6557328E7
# HELP jvm_gc_live_data_size_bytes Size of old generation memory pool after a full GC
# TYPE jvm_gc_live_data_size_bytes gauge
jvm_gc_live_data_size_bytes 1.6091968E7
# HELP jvm_buffer_total_capacity_bytes An estimate of the total capacity of the buffers in this pool
# TYPE jvm_buffer_total_capacity_bytes gauge
jvm_buffer_total_capacity_bytes{id="direct",} 8192.0
jvm_buffer_total_capacity_bytes{id="mapped",} 0.0
# HELP process_cpu_usage The "recent cpu usage" for the Java Virtual Machine process
# TYPE process_cpu_usage gauge
process_cpu_usage 0.015306358354323044
# HELP jvm_gc_memory_promoted_bytes_total Count of positive increases in the size of the old generation memory pool before GC to after GC
# TYPE jvm_gc_memory_promoted_bytes_total counter
jvm_gc_memory_promoted_bytes_total 6929264.0
# HELP jvm_threads_peak_threads The peak live thread count since the Java virtual machine started or peak was reset
# TYPE jvm_threads_peak_threads gauge
jvm_threads_peak_threads 28.0
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap",id="PS Survivor Space",} 0.0
jvm_memory_used_bytes{area="heap",id="PS Old Gen",} 1.6091968E7
jvm_memory_used_bytes{area="heap",id="PS Eden Space",} 1.6104544E7
jvm_memory_used_bytes{area="nonheap",id="Metaspace",} 3.5507288E7
jvm_memory_used_bytes{area="nonheap",id="Code Cache",} 6837440.0
jvm_memory_used_bytes{area="nonheap",id="Compressed Class Space",} 4960320.0
# HELP system_cpu_usage The "recent cpu usage" for the whole system
# TYPE system_cpu_usage gauge
system_cpu_usage 1.0
# HELP tomcat_sessions_active_current_sessions  
# TYPE tomcat_sessions_active_current_sessions gauge
tomcat_sessions_active_current_sessions 0.0
# HELP jvm_gc_max_data_size_bytes Max size of old generation memory pool
# TYPE jvm_gc_max_data_size_bytes gauge
jvm_gc_max_data_size_bytes 1.332740096E9
# HELP jvm_memory_committed_bytes The amount of memory in bytes that is committed for the Java virtual machine to use
# TYPE jvm_memory_committed_bytes gauge
jvm_memory_committed_bytes{area="heap",id="PS Survivor Space",} 1.1010048E7
jvm_memory_committed_bytes{area="heap",id="PS Old Gen",} 7.8118912E7
jvm_memory_committed_bytes{area="heap",id="PS Eden Space",} 1.00139008E8
jvm_memory_committed_bytes{area="nonheap",id="Metaspace",} 3.8273024E7
jvm_memory_committed_bytes{area="nonheap",id="Code Cache",} 6881280.0
jvm_memory_committed_bytes{area="nonheap",id="Compressed Class Space",} 5505024.0
# HELP jvm_classes_unloaded_classes_total The total number of classes unloaded since the Java virtual machine has started execution
# TYPE jvm_classes_unloaded_classes_total counter
jvm_classes_unloaded_classes_total 7.0
# HELP jvm_threads_daemon_threads The current number of live daemon threads
# TYPE jvm_threads_daemon_threads gauge
jvm_threads_daemon_threads 22.0
# HELP tomcat_sessions_alive_max_seconds  
# TYPE tomcat_sessions_alive_max_seconds gauge
tomcat_sessions_alive_max_seconds 0.0
# HELP jvm_threads_live_threads The current number of live threads including both daemon and non-daemon threads
# TYPE jvm_threads_live_threads gauge
jvm_threads_live_threads 26.0
# HELP tomcat_sessions_created_sessions_total  
# TYPE tomcat_sessions_created_sessions_total counter
tomcat_sessions_created_sessions_total 0.0
# HELP jvm_memory_max_bytes The maximum amount of memory in bytes that can be used for memory management
# TYPE jvm_memory_max_bytes gauge
jvm_memory_max_bytes{area="heap",id="PS Survivor Space",} 1.1010048E7
jvm_memory_max_bytes{area="heap",id="PS Old Gen",} 1.332740096E9
jvm_memory_max_bytes{area="heap",id="PS Eden Space",} 6.422528E8
jvm_memory_max_bytes{area="nonheap",id="Metaspace",} -1.0
jvm_memory_max_bytes{area="nonheap",id="Code Cache",} 2.5165824E8
jvm_memory_max_bytes{area="nonheap",id="Compressed Class Space",} 1.073741824E9
# HELP tomcat_sessions_rejected_sessions_total  
# TYPE tomcat_sessions_rejected_sessions_total counter
tomcat_sessions_rejected_sessions_total 0.0
# HELP process_start_time_seconds Start time of the process since unix epoch.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.583576824548E9
# HELP system_cpu_count The number of processors available to the Java virtual machine
# TYPE system_cpu_count gauge
system_cpu_count 8.0
# HELP jvm_buffer_count_buffers An estimate of the number of buffers in the pool
# TYPE jvm_buffer_count_buffers gauge
jvm_buffer_count_buffers{id="direct",} 1.0
jvm_buffer_count_buffers{id="mapped",} 0.0
# HELP tomcat_sessions_expired_sessions_total  
# TYPE tomcat_sessions_expired_sessions_total counter
tomcat_sessions_expired_sessions_total 0.0
# HELP jvm_classes_loaded_classes The number of classes that are currently loaded in the Java virtual machine
# TYPE jvm_classes_loaded_classes gauge
jvm_classes_loaded_classes 7386.0

Building the image

First write a simple Dockerfile:

FROM openjdk:8-jdk-alpine
COPY prometheus-0.0.1-SNAPSHOT.jar app.jar
ENTRYPOINT ["java","-jar","/app.jar"]

Next, build the image:

[root@instance-p0a4erj8 spring]# docker build -t springdemo:v1 . 
Sending build context to Docker daemon  21.18MB
Step 1/3 : FROM openjdk:8-jdk-alpine
8-jdk-alpine: Pulling from library/openjdk
e7c96db7181b: Already exists 
f910a506b6cb: Pull complete 
c2274a1a0e27: Pull complete 
Digest: sha256:94792824df2df33402f201713f932b58cb9de94a0cd524164a0f2283343547b3
Status: Downloaded newer image for openjdk:8-jdk-alpine
 ---> a3562aa0b991
Step 2/3 : COPY prometheus-0.0.1-SNAPSHOT.jar app.jar
 ---> e39e208a176c
Step 3/3 : ENTRYPOINT ["java","-jar","/app.jar"]
 ---> Running in c23af9dfc64a
Removing intermediate container c23af9dfc64a
 ---> 4cd9c1315046
Successfully built 4cd9c1315046
Successfully tagged springdemo:v1

Now let's run it with docker run -d -p 8080:8080 springdemo:v1. The application log looks like this:


  .   ____          _            __ _ _
 /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
 \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
  '  |____| .__|_| |_|_| |_\__, | / / / /
 =========|_|==============|___/=/_/_/_/
 :: Spring Boot ::        (v2.2.5.RELEASE)

2020-03-07 09:50:23.711  INFO 1 --- [           main] c.e.prometheus.PrometheusApplication     : Starting PrometheusApplication v0.0.1-SNAPSHOT on 67ef3675d196 with PID 1 (/app.jar started by root in /)
2020-03-07 09:50:23.729  INFO 1 --- [           main] c.e.prometheus.PrometheusApplication     : No active profile set, falling back to default profiles: default
2020-03-07 09:50:28.317  INFO 1 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat initialized with port(s): 8080 (http)
2020-03-07 09:50:28.392  INFO 1 --- [           main] o.apache.catalina.core.StandardService   : Starting service [Tomcat]
2020-03-07 09:50:28.393  INFO 1 --- [           main] org.apache.catalina.core.StandardEngine  : Starting Servlet engine: [Apache Tomcat/9.0.31]
2020-03-07 09:50:28.674  INFO 1 --- [           main] o.a.c.c.C.[Tomcat].[localhost].[/]       : Initializing Spring embedded WebApplicationContext
2020-03-07 09:50:28.675  INFO 1 --- [           main] o.s.web.context.ContextLoader            : Root WebApplicationContext: initialization completed in 4621 ms
2020-03-07 09:50:29.755  INFO 1 --- [           main] o.s.s.concurrent.ThreadPoolTaskExecutor  : Initializing ExecutorService 'applicationTaskExecutor'
2020-03-07 09:50:30.975  INFO 1 --- [           main] o.s.b.a.e.web.EndpointLinksResolver      : Exposing 14 endpoint(s) beneath base path '/actuator'
2020-03-07 09:50:31.262  INFO 1 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port(s): 8080 (http) with context path ''
2020-03-07 09:50:31.294  INFO 1 --- [           main] c.e.prometheus.PrometheusApplication     : Started PrometheusApplication in 10.981 seconds (JVM running for 15.956)
2020-03-07 09:50:48.347  INFO 1 --- [nio-8080-exec-1] o.a.c.c.C.[Tomcat].[localhost].[/]       : Initializing Spring DispatcherServlet 'dispatcherServlet'
2020-03-07 09:50:48.347  INFO 1 --- [nio-8080-exec-1] o.s.web.servlet.DispatcherServlet        : Initializing Servlet 'dispatcherServlet'
2020-03-07 09:50:48.368  INFO 1 --- [nio-8080-exec-1] o.s.web.servlet.DispatcherServlet        : Completed initialization in 21 ms

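Good, the application is up. Let's take a look in Prometheus.

Now let's stop the container we just started.

The target's state changes accordingly. Next we configure AlertManager so that it notifies us automatically when the service goes down.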

Installing AlertManager

AlertManager supports several notification channels; for now I only configure email. First, create a configuration file:

global:
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'wmymtx@163.com'
  smtp_auth_username: 'wmymtx@163.com'
  smtp_auth_password: 'aaaaaaaa'

route:
  group_interval: 1m    # after the first notification for a group, wait group_interval before sending a new batch of alerts
  repeat_interval: 1m   # after a notification has been sent successfully, wait repeat_interval before sending it again
  receiver: 'mail-receiver'
receivers:
- name: 'mail-receiver'
  email_configs:
    - to: '326076105@qq.com'
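
With both intervals at 1m, a firing alert is re-sent roughly every minute, which is handy while testing but noisy otherwise. Below is a sketch of a quieter route. group_by, group_wait, group_interval and repeat_interval are standard AlertManager route fields; the concrete values are illustrative assumptions, not recommendations from this post.

```yaml
route:
  receiver: 'mail-receiver'
  group_by: ['alertname']   # batch alerts sharing the same alert name into one notification
  group_wait: 30s           # wait before sending the first notification for a new group
  group_interval: 5m        # wait before notifying about new alerts added to an existing group
  repeat_interval: 4h       # wait before re-sending a notification for alerts that are still firing
```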

Now start AlertManager with Docker:

docker run -d -p 9093:9093 --name alertmanager  -v /root/prometheus/alertmanager.yml:/etc/alertmanager/alertmanager.yml prom/alertmanager


[root@instance-p0a4erj8 ~]# docker logs ac65138c1c8b
level=info ts=2020-03-08T02:53:25.577Z caller=main.go:231 msg="Starting Alertmanager" version="(version=0.20.0, branch=HEAD, revision=f74be0400a6243d10bb53812d6fa408ad71ff32d)"
level=info ts=2020-03-08T02:53:25.577Z caller=main.go:232 build_context="(go=go1.13.5, user=root@00c3106655f8, date=20191211-14:13:14)"
level=info ts=2020-03-08T02:53:25.578Z caller=cluster.go:161 component=cluster msg="setting advertise address explicitly" addr=172.17.0.10 port=9094
level=info ts=2020-03-08T02:53:25.581Z caller=cluster.go:623 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2020-03-08T02:53:25.621Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/etc/alertmanager/alertmanager.yml
level=info ts=2020-03-08T02:53:25.622Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/alertmanager.yml
level=info ts=2020-03-08T02:53:25.626Z caller=main.go:497 msg=Listening address=:9093
level=info ts=2020-03-08T02:53:27.581Z caller=cluster.go:648 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000163621s
level=info ts=2020-03-08T02:53:35.582Z caller=cluster.go:640 component=cluster msg="gossip settled; proceeding" elapsed=10.001131157s

Opening port 9093 now shows the AlertManager web UI.

Next, create a configuration file for the alert rules:

groups:
    - name: alert-rule
      rules:
      - alert: HttpMonitor
        expr: sum(up{job="spring boot"}) == 0
        for: 1m
        labels:
          severity: critical
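
Because the rule above aggregates with sum(), it fires only when every target in the job is down, and the resulting alert carries no instance label. Below is a sketch of a per-instance variant with annotations that can be used in the notification text. The alert name InstanceDown and the annotation wording are assumptions for illustration.

```yaml
groups:
  - name: alert-rule
    rules:
      - alert: InstanceDown   # hypothetical name
        expr: up{job="spring boot"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been unreachable for more than 1 minute."
```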

Then add the alert rules to our prometheus.yml