Ease Monitor

1. Product Principles

The Ease Monitor is designed around the following principles:

  • Focus on the SLA. Monitor the APIs consumed by end users.

  • Metrics Aggregation. Connect the infrastructure and middleware metrics with the application metrics.

  • Quick Fault Location. Failures always happen; quickly locating the fault and recovering from it is the key.

In other words, the Ease Monitor is designed for two major scenarios:

  1. Health Checking.

    • Capacity Management. By comparing data trends, it helps the engineering team decide whether to add more resources.

    • Performance Management. Manage the performance of the application stack, and make sure every piece of the stack works well in production.

  2. Diagnosis.

    • Locating the Failure. Once a failure or exception happens, it helps developers find the root location of the failure quickly.

    • Performance Analysis. It helps find software bottlenecks and hot spots so that developers can dive into the code.

The following is a common case: a slow SQL query or a Java Full GC can make the whole site run very slowly.

Figure: Diagnosis Case

2. Design Principles

The Ease Monitor is a kind of APM (Application Performance Management) software, but it differs a bit from traditional APM software.

Two aspects shape the design of the Ease Monitor:

  • Different Engineering Angles. A company has several engineering roles, and each looks at the whole system from a different angle. For example,

    • Managers care about the whole system’s health.
    • Software developers want to know the application’s running status.
    • The operations team wants to know the running status of the infrastructure and the middleware.
  • Don’t Reinvent the Wheel. Developing a monitoring system can look like reinventing another wheel. So we do not want to reinvent everything, and we need to make sure the Ease Monitor is open and flexible enough to be compatible with the current mainstream monitoring technologies.

So, the Ease Monitor has the following design principles:

  • Using Mainstream Technology. Most engineering teams in the world can operate and maintain it.

  • Every Component Can Be Replaced or Tailored. People have different requirements and businesses, so the design must provide enough flexibility for anyone to modify it easily.

  • Tracing the Service Requests. The monitor must trace requests across the distributed system from end to end.

  • Guiding the Engineering. The monitor must guide engineers in at least two things: 1) addressing issues easily, and 2) making engineering decisions easily.

  • Leverage the Automation. The monitor can connect with other control systems to perform automated operations, such as auto-scaling, auto-scheduling, etc.

  • The Whole Stack Metrics Monitoring. We must monitor three layers of software:

    • Service Layer. The key indicators of a service, such as HTTP requests, status codes, throughput, latency, etc.
    • Platform Layer. The key indicators of the middleware, such as Nginx, Redis, Tomcat, Kafka, MySQL, etc.
    • Infrastructure Layer. The key indicators of the operating system, such as CPU, memory, disk, network, etc.
  • Customized Dashboard. The dashboard can be configured by anyone according to their own interests.

3. Architecture

Figure: Ease Monitor Architecture

The whole architecture is based on open source technology.

  • Agents
    • Metrics Collection - Telegraf
    • Logs Collection - Filebeat or Fluentd
    • Java Agent - EaseAgent (developed by us and open-sourced under the Apache 2.0 license; see the Technical Details section)
  • Backend Pipeline
    • Data Bus - Apache Kafka
    • Data ETL - Logstash
    • Data Store - ElasticSearch
    • Event Data - InfluxDB
    • Event Trigger & Handler - This is developed by us. (See the Technical Details section)
    • UI Portal - This is developed by us. (See the Technical Details section)

The whole architecture can not only monitor big clusters; every component can also be flexibly replaced or tailored.

4. Prerequisite and Limitation

Currently, the Ease Monitor only supports:

  • Services written in Java, version 1.6 or later
  • The Linux operating system

5. Features Showcase

5.1 Overview Dashboard

The Overview dashboard shows the overall health and capacity.

Figure: Overview Dashboard

5.2 Service SLA Daily Report

The following diagram shows the daily SLA report, which can cover the whole site or an individual service.

Figure: SLA

5.3 Service Dashboard

The following Service Dashboard brings together the service traffic, the upstream and downstream services, the top APIs, the top 5 slowest traced requests, and the related resources and metrics.

Figure: Service Dashboard

5.4 Tracing

The real-time topology lets us understand the architecture of the services.

Figure: topology

Tracing lets us understand the chain of service calls and their performance.

Figure: External Service

5.5 Top N List

Top N lists show the operations or APIs that consume the most time.
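
One way to build such a list is to sum the time spent per operation and keep the N largest totals. The sketch below is only an illustration: the record shape (operation name, duration in milliseconds), the function name `top_n_by_total_time`, and the sample data are assumptions, not part of the Ease Monitor.

```python
from collections import defaultdict
import heapq


def top_n_by_total_time(records, n):
    """Return the n operations with the largest total duration.

    records: iterable of (operation_name, duration_ms) pairs (hypothetical shape).
    Returns a list of (operation_name, total_ms, call_count), largest total first.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for name, duration_ms in records:
        totals[name] += duration_ms
        counts[name] += 1
    # Rank by total time descending; break ties by name so the order is stable.
    ranked = heapq.nsmallest(n, totals, key=lambda name: (-totals[name], name))
    return [(name, totals[name], counts[name]) for name in ranked]


# Hypothetical sample: two API calls and one JDBC statement.
sample = [
    ("GET /orders", 120.0),
    ("GET /orders", 80.0),
    ("POST /login", 300.0),
    ("SELECT * FROM users", 50.0),
]
top2 = top_n_by_total_time(sample, 2)
# top2 == [("POST /login", 300.0, 1), ("GET /orders", 200.0, 2)]
```

Ranking by total time rather than by a single slowest call surfaces operations that are cheap individually but called very often.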

Service Top API List

Figure: Service Top N List

JDBC Top Operation List

Figure: JDBC Top N List

5.6 Customized Dashboard

The customized dashboard.

Infrastructure Dashboard

Figure: Dashboard

6. Technical Details

6.1 EaseAgent

  • EaseAgent is a Java agent for APM (Application Performance Management) systems.
  • EaseAgent mainly focuses on Spring Boot development environments.
  • EaseAgent is compatible with mainstream monitoring ecosystems, such as Kafka, ElasticSearch, Prometheus, Zipkin, etc.
  • EaseAgent collects basic metrics and service tracing logs, which are very helpful for performance analysis and troubleshooting.

6.1.1 Design Principles

  • Designed for microservice architecture, collecting data from a service perspective.
  • Instruments a Java application in a non-intrusive way.
  • Lightweight, with very low CPU, memory, and I/O resource usage.
  • Safe for the Java application/service.

6.1.2 Compatibility and Requirement

  • Collecting Metric & Tracing Logs.
    • JDBC 4.0
    • HTTP Servlet, HTTP Filter
    • Spring Boot 2.2.x: WebClient, RestTemplate, FeignClient
    • RabbitMQ Client 5.x, Kafka Client 2.4.x
    • Jedis 3.5.x, Lettuce 5.3.x
  • Collecting Access Logs.
    • HTTP Servlet, HTTP Filter
    • Spring Cloud Gateway
  • Instrumenting the traceId and spanId automatically
  • Supplying the health check endpoint
  • Supplying the readiness check endpoint for Spring Boot 2.2.x

6.1.3 Data Collection

  1. HTTP Requests Metrics.
  2. JDBC Connection and Statement Metrics, and related context information (such as URL, SQL statement, etc.)
  3. Distributed service tracing compatible with the Zipkin protocol, which includes:
    • HTTP receive and send
    • SQL execution
    • Redis access
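
For illustration, a traced HTTP request and the SQL execution it triggers could be represented as two spans in the Zipkin v2 JSON format. This is a minimal sketch: the field names follow the Zipkin v2 span model, but the helper names, the service name `order-service`, and the tag values are assumptions and do not describe what EaseAgent actually emits.

```python
import json
import secrets
import time


def new_id(num_bytes=8):
    """Return a random lower-hex ID (8 bytes gives 16 hex characters)."""
    return secrets.token_hex(num_bytes)


def make_span(trace_id, name, kind, service_name, start_us, duration_us,
              parent_id=None, tags=None):
    """Build a dict shaped like a Zipkin v2 span."""
    span = {
        "traceId": trace_id,
        "id": new_id(),
        "name": name,
        "kind": kind,                 # CLIENT, SERVER, PRODUCER or CONSUMER
        "timestamp": start_us,        # epoch time in microseconds
        "duration": duration_us,      # microseconds
        "localEndpoint": {"serviceName": service_name},
        "tags": tags or {},
    }
    if parent_id is not None:
        span["parentId"] = parent_id
    return span


trace_id = new_id(16)  # 128-bit trace ID, 32 hex characters
now_us = int(time.time() * 1_000_000)

# Span for the incoming HTTP request (hypothetical values).
server_span = make_span(trace_id, "get /orders", "SERVER",
                        "order-service", now_us, 45_000)

# Child span for the SQL statement executed while serving that request.
sql_span = make_span(trace_id, "select", "CLIENT", "order-service",
                     now_us + 5_000, 30_000,
                     parent_id=server_span["id"],
                     tags={"sql.query": "SELECT * FROM orders WHERE id = ?"})

# A Zipkin v2 collector accepts a JSON array of spans.
payload = json.dumps([server_span, sql_span])
```

Both spans share one `traceId`, and the SQL span points to the HTTP span through `parentId`, which is what lets a tracing UI rebuild the call chain.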

6.1.4 Usage

Download easeagent.jar from the release page, and simply add the following argument when starting the Java application:

-javaagent:easeagent.jar

6.2 Event

Currently, the Ease Monitor event handling deals with the following cases.

  • Metric - Duration - Threshold. A metric keeps exceeding the threshold for a certain duration (e.g. CPU > 80% lasting 2 minutes).

  • Metric - Duration - Percentile - Threshold. A metric’s percentile (e.g. P99) exceeds the threshold for a certain duration (e.g. response time P90 > 300 ms lasting 2 minutes).

  • Metric - Duration - Function - Threshold. Supports some simple functions (Sum/Average/Min/Max/Count) to trigger the event.

  • Logs - Duration - Keywords - Times. Monitors a certain keyword (regular expressions are supported); if the keyword is matched the configured number of times, the event is reported.
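
The four cases above share one pattern: take the samples inside a time window, reduce them to a single value, and compare that value with a limit. The following is a minimal sketch of that pattern, not the Ease Monitor's actual event engine; the function names, the sample layout (timestamp in seconds, value), and the nearest-rank percentile are all assumptions.

```python
import math
import re

# Simple reduction functions named in the third case.
FUNCTIONS = {
    "sum": sum,
    "average": lambda values: sum(values) / len(values),
    "min": min,
    "max": max,
    "count": len,
}


def window(samples, now, duration_s):
    """Values of (timestamp, value) samples within the last duration_s seconds."""
    return [value for ts, value in samples if now - duration_s < ts <= now]


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def keeps_exceeding(samples, now, duration_s, threshold):
    """Metric - Duration - Threshold: every sample in the window exceeds it."""
    values = window(samples, now, duration_s)
    return bool(values) and all(v > threshold for v in values)


def percentile_exceeds(samples, now, duration_s, pct, threshold):
    """Metric - Duration - Percentile - Threshold."""
    values = window(samples, now, duration_s)
    return bool(values) and nearest_rank_percentile(values, pct) > threshold


def function_exceeds(samples, now, duration_s, func, threshold):
    """Metric - Duration - Function - Threshold."""
    values = window(samples, now, duration_s)
    return bool(values) and FUNCTIONS[func](values) > threshold


def keyword_matched(log_lines, now, duration_s, pattern, times):
    """Logs - Duration - Keywords - Times: regex hits reach the configured count."""
    regex = re.compile(pattern)
    hits = sum(1 for ts, line in log_lines
               if now - duration_s < ts <= now and regex.search(line))
    return hits >= times


# Hypothetical samples: (seconds, value).
cpu = [(0, 85), (30, 90), (60, 82), (90, 88), (120, 91)]
latency_ms = [(10, 120), (40, 150), (70, 310), (100, 200), (115, 180)]
logs = [(100, "ERROR timeout"), (110, "INFO ok"), (118, "ERROR refused")]

cpu_alert = keeps_exceeding(cpu, now=120, duration_s=120, threshold=80)
p90_alert = percentile_exceeds(latency_ms, now=120, duration_s=120, pct=90, threshold=300)
avg_alert = function_exceeds(cpu, now=120, duration_s=120, func="average", threshold=85)
log_alert = keyword_matched(logs, now=120, duration_s=120, pattern=r"ERROR|FATAL", times=2)
```

With these samples all four checks evaluate to `True`. Requiring every sample in the window to exceed the threshold is one reading of "keeps exceeding"; a real rule engine may define it differently.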

6.3 Data Store Schema

The following data schema is used for storing data in ElasticSearch.

6.3.1 Indices Schema

| Index mapping template | Index pattern | Description |
|---|---|---|
| ease-monitor-metrics-* | ease-monitor-metrics-YYYY.MM.DD | Saves time-series metrics of monitored objects from different categories. The metrics from each monitored object are saved into a dedicated document type. |
| ease-monitor-aggregate-metrics-* | ease-monitor-aggregate-metrics-YYYY.MM.DD | Saves calculated performance statistics for the different dimensions that monitoring requires. The statistics from each dimension are saved into a dedicated document type. Because the statistics are calculated directly on the input metrics as a stream and the results are saved into this index in advance, they can be loaded and used without any further aggregation (e.g. grouping and computing). This helps the performance of ad-hoc queries on the fine-grained metrics stored in ES, especially with a large volume of metrics data. This index is designed to save only those statistics that can be calculated by simple (fast) and fixed (implementable at the product design stage instead of at runtime) functions. |
| ease-monitor-logs-* | ease-monitor-logs-YYYY.MM.DD | Saves the logs output by the OS, middleware, and applications. Each kind of log is saved into a dedicated document type. |
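
The index patterns above imply one index per day. As a small sketch, the target index for a record could be derived from its timestamp as shown below. The helper name `daily_index_name` is hypothetical, and reading YYYY.MM.DD as a zero-padded year, month, and day in UTC is an assumption.

```python
from datetime import datetime, timezone
from fnmatch import fnmatchcase

# Index name prefixes taken from the index patterns listed above.
INDEX_PREFIXES = (
    "ease-monitor-metrics",
    "ease-monitor-aggregate-metrics",
    "ease-monitor-logs",
)


def daily_index_name(prefix, when):
    """Build the daily index name for a timezone-aware datetime.

    Assumes the date part is taken in UTC; the real pipeline may use
    a different time zone.
    """
    if prefix not in INDEX_PREFIXES:
        raise ValueError(f"unknown index prefix: {prefix}")
    return f"{prefix}-{when.astimezone(timezone.utc):%Y.%m.%d}"


example = daily_index_name(
    "ease-monitor-metrics",
    datetime(2020, 3, 7, 12, 0, tzinfo=timezone.utc),
)
# example == "ease-monitor-metrics-2020.03.07"

# The generated name falls under the matching index mapping template.
matches_template = fnmatchcase(example, "ease-monitor-metrics-*")
```

Daily indices keep each index small and make it easy to expire old data by dropping whole indices.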

6.3.2 Document Types Schema

The Document Types Schema includes the following:

  • Index mapping template

    • ease-monitor-metrics-* - for metrics data
    • ease-monitor-aggregate-metrics-* - for Java agent metrics data
    • ease-monitor-logs-* - for logs
  • Category

    • application - for Java agent metrics data.
    • platform - for middleware metrics, such as nginx, redis, tomcat, mysql, kafka, etc.
    • infrastructure - for CPU, memory, disk, and network metrics.
  • Document Type

    • The name of the component.
    • The group of the performance counters and statistics.

For example:

| Index mapping template | Category | Document type | Description |
|---|---|---|---|
| ease-monitor-metrics-* | application | http_request | Saves application HTTP request records, which contain the URL address and parameters, execution duration, response code, and other useful fields. |
| ease-monitor-metrics-* | platform | jvm_memory | Saves JVM performance counters and statistics for heap, non-heap, and each space. |
| ease-monitor-metrics-* | platform | jvm_gc | Saves JVM performance counters and statistics for the garbage collector. |
| ease-monitor-metrics-* | platform | tomcat_global | Saves the performance counters and statistics of the global request processor and thread pool. |
| ease-monitor-metrics-* | platform | tomcat_cache | Saves the performance counters and statistics of each context cache. |
| ease-monitor-metrics-* | platform | tomcat_servlet | Saves the performance counters and statistics of each servlet. |
| ease-monitor-metrics-* | platform | nginx | Saves nginx performance counters and statistics. |
| ease-monitor-metrics-* | platform | mysql | Saves mysql performance counters and statistics. |
| ease-monitor-metrics-* | platform | redis_server | Saves redis server performance counters and statistics. |
| ease-monitor-metrics-* | platform | redis_keyspace | Saves redis key space performance counters and statistics. |
| ease-monitor-metrics-* | infrastructure | cpu | Saves the percentage utilization of a specific logical core. |
| ease-monitor-metrics-* | infrastructure | memory | Saves the percentage utilization and the capacity in bytes. |
| ease-monitor-metrics-* | infrastructure | interface | Saves the performance counters and statistics for each interface separately (excluding the ’lo’ loopback device), e.g. tx and rx bytes. |
| ease-monitor-metrics-* | infrastructure | disk | Saves the performance counters and statistics for each block device separately, e.g. iops, mbps (a busy percentage indicator will be added in the future). |
| ease-monitor-metrics-* | infrastructure | df | Saves the utilization counters for each block device. |
| ease-monitor-aggregate-metrics-* | application | http_request | Saves the calculated values of separate and total executions per second over every 1, 5, and 15 minutes. The request count is saved as well. |
| ease-monitor-aggregate-metrics-* | application | jdbc_statement | Saves the calculated values of separate and total executions per second over every 1, 5, and 15 minutes. Also saves the minimum, mean, maximum, and 25%, 50%, 75%, 95%, 98%, 99%, 99.9% percentiles of the user’s execution duration. The execution count is saved as well. |
| ease-monitor-aggregate-metrics-* | application | jdbc_connection | Saves the calculated values of database connection establishments per second over every 1, 5, and 15 minutes. Also saves the minimum, mean, maximum, and 25%, 50%, 75%, 95%, 98%, 99%, 99.9% percentiles of the user’s connection establishment duration. The establishment count is saved as well. |
| ease-monitor-logs-* | application | <component-name> | Saves log records collected from the application’s component. |
| ease-monitor-logs-* | platform | tomcat_exception | Saves the exception messages of the stack. |
| ease-monitor-logs-* | platform | nginx_access | Saves HTTP access records from the nginx access log. |
| ease-monitor-logs-* | platform | nginx_error | Saves error records from the nginx error log. |
| ease-monitor-logs-* | platform | mysql_slow_sql | Saves slow SQL records from the MySQL log. |
| ease-monitor-logs-* | infrastructure | os_syslog | Saves log records from the OS ‘syslog’ file. |
| ease-monitor-logs-* | infrastructure | os_dmesg | Saves log records from the OS ‘dmesg’ file. |
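
To make the aggregate http_request document type more concrete, the sketch below computes executions per second over 1, 5, and 15 minute windows together with the request count. It uses a plain sliding-window count, which is an assumption: the source does not say how the rates are calculated (an exponentially weighted moving average is another common choice), and the field names `m1_rate`, `m5_rate`, `m15_rate`, and `count` are hypothetical.

```python
WINDOWS_MINUTES = (1, 5, 15)


def executions_per_second(timestamps_s, now_s):
    """Summarize request timestamps (in seconds) as per-second rates.

    Returns the total count plus one rate per window, where each rate is
    the number of executions inside the window divided by its length.
    """
    summary = {"count": len(timestamps_s)}
    for minutes in WINDOWS_MINUTES:
        window_s = minutes * 60
        in_window = sum(1 for ts in timestamps_s
                        if now_s - window_s < ts <= now_s)
        summary[f"m{minutes}_rate"] = in_window / window_s
    return summary


# Hypothetical traffic: 120 requests in the last minute,
# plus 180 older requests that still fall inside the 5 minute window.
now = 900
recent = [now - 0.5 * i for i in range(120)]
older = [800 - i for i in range(180)]
stats = executions_per_second(recent + older, now)
# stats["m1_rate"] == 2.0, stats["m5_rate"] == 1.0, stats["count"] == 300
```

Computing such values once, as the data streams in, is what allows the dashboard to read them directly instead of aggregating raw metrics at query time.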