In-depth exploration of the knowledge system in the field of monitoring

Introduction

Monitoring is the most important part of operations and maintenance, and even of the entire product life cycle. Before a fault, it gives timely warnings so the problem is discovered early; after a fault, it provides detailed data for tracing and locating the cause. The industry offers many good open source products, and choosing an open source monitoring system saves time and effort and is usually the most efficient option. Readers who are not yet familiar with monitoring should come away from this article with a deeper understanding of the whole monitoring system.

1. Monitoring target

Let's first understand what monitoring is, why it matters, and what its goals are. Everyone works in a different industry, company, business, and position, and therefore understands monitoring differently. One point deserves attention: monitoring has to be considered from the perspective of the company's business, not from the perspective of a particular monitoring technology.

Uninterrupted real-time monitoring of the system: monitoring means watching the system continuously and in real time;

Real-time feedback on the current status of the system: when we monitor a piece of hardware or a system, we need to see its current status in real time, whether it is normal, abnormal, or faulty;

Ensure service reliability and security: the purpose of monitoring is to ensure the normal operation of systems, services, and the business;

Ensure the continuous and stable operation of the business: if monitoring is thorough, then even when a fault occurs we receive the alarm as soon as possible and handle it as soon as possible, which keeps the business running continuously and stably.

2. Monitoring method

Now that we understand the importance of monitoring and the purpose of monitoring, we need to understand the methods of monitoring.

Understand the monitoring objects: do we actually understand the object we want to monitor? For example, how does the CPU work?

Performance benchmark indicators: which properties of the object do we want to monitor? For the CPU, for example: usage, load, user mode, kernel mode, and context switching.

Alarm threshold definition: what counts as a fault and requires an alarm? For example, how high must the CPU load be to count as high, and what share of time should be spent in user mode and in kernel mode?

Troubleshooting process: after receiving a fault alarm, how do we deal with it? Is there a more efficient process?
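
To make the idea of an alarm threshold concrete, here is a minimal sketch in Python. The names Threshold, classify, and load_status are hypothetical, and the limits (a load of 1.0 per core as warning, 2.0 as critical) are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Warning and critical limits for one metric (values are illustrative)."""
    warning: float
    critical: float


def classify(value: float, threshold: Threshold) -> str:
    """Map a metric value to 'ok', 'warning' or 'critical'."""
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "warning"
    return "ok"


# Hypothetical limits for the load average per CPU core.
LOAD_PER_CORE = Threshold(warning=1.0, critical=2.0)


def load_status(load_1min: float, cores: int) -> str:
    """Normalise the 1-minute load by the core count before comparing."""
    return classify(load_1min / cores, LOAD_PER_CORE)
```

Dividing the load by the number of cores is one possible design choice: it lets the same threshold apply to a 4-core and a 64-core machine.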

3. Monitoring core

We have covered the monitoring method: the monitoring objects, performance indicators, alarm threshold definitions, and the fault handling process. Next we need to know what the core of monitoring is.

Discover the problem: when the system fails, we receive the fault alarm;

Locate the problem: an alarm email usually names the failing host and describes the failure, and we need to analyze that content. For example, if a server cannot be reached, we have to work out the specific cause: a network problem, a load so high that connections fail for a long time, or a developer's action that triggered a firewall blocking rule;

Solve the problem: once we understand the cause, we resolve the faults in order of priority;

Summarize the problem: after a major fault is resolved, we summarize its cause and how to prevent it, so that it does not recur.

4. Monitoring tools

Next we need to choose a monitoring tool that suits the company's business. Here I have roughly classified the monitoring tools.

Older monitoring tools:

MRTG (Multi Router Traffic Grapher) is software for drawing network traffic graphs. It was developed by Tobias Oetiker and Dave Rand in Olten, Switzerland, and is licensed under the GPL. The first version of MRTG was released in 1995. It is written in Perl, runs across platforms, and collects data over SNMP. MRTG presents the collected data on web pages as images in GIF or PNG format.

Ganglia is a cross-platform, scalable distributed monitoring system for high-performance computing environments such as clusters and grids. It has a layered design, builds on widely used technologies, and stores data with RRDtool. It has a visual interface and is suitable for automated monitoring of cluster systems. Its carefully designed data structures and algorithms keep the connection overhead between the monitoring side and the monitored side very low. Thousands of clusters currently use it, and it can easily handle a cluster environment of 2,000 nodes.

Cacti (the plural of cactus) is a graphical network traffic monitoring and analysis tool built on PHP, MySQL, SNMP, and RRDtool. It obtains data through snmpget and draws graphs with RRDtool, but users do not need to understand RRDtool's complex parameters. It provides powerful data and user management: each user can be restricted to a particular tree structure, host device, or graph. It can also be combined with LDAP for user authentication, and it supports custom templates. Its display and monitoring of historical data are quite good.

Cacti makes the monitoring of different devices reusable through templates, offers customizable graphing, and has strong calculation capabilities (data overlay).

Nagios is an enterprise-level monitoring system that can monitor the running status and network information of services, monitor the status of specified local or remote hosts and services, and provide abnormal alarm notification functions.

Nagios runs on Linux and UNIX platforms. At the same time, a web interface is provided to facilitate system administrators to view network status, various system problems, and system-related logs.

The function of Nagios focuses on monitoring the availability of services and can trigger alarms based on the status of monitoring indicators.

Nagios still holds a certain market share. However, it has not kept pace with the times and can no longer meet changing monitoring needs: the scalability of its architecture and its ease of use need improvement, and its advanced functions are only available in the commercial version, Nagios XI.

Smokeping is mainly used to monitor network performance, including regular ping, www server performance, DNS query performance, SSH performance, and so on. It is also built on RRDtool. Its graphs are very attractive: network packet loss and latency are marked with colors and shading, and multiple graphs can be stacked together. Its author also developed MRTG and RRDtool.

Smokeping’s website is: http://tobi.oetiker.cn/hp

OpenTSDB is an open source monitoring system that uses HBase to store all time series data (without downsampling), forming a distributed, scalable time series database. It supports collection at second-level granularity and permanent storage, can be used for capacity planning, and is easy to integrate with existing alarm systems.

OpenTSDB can collect indicators from large clusters (including the network devices, operating systems, and applications in the cluster), then store, index, and serve them, which makes the data easier to understand, for example through web pages and graphs.

Leading monitoring tools:

Zabbix is a distributed monitoring system that supports multiple collection methods and collection clients. It has a dedicated Agent and also supports protocols such as SNMP, IPMI, JMX, Telnet, and SSH. It stores the collected data in a database, analyzes and organizes it, and triggers an alarm when conditions are met. Its flexible extensibility and rich feature set are hard to match among the other monitoring systems, and its overall functionality is excellent. Compared with the systems above, Zabbix has the advantage in features, scalability, support for secondary development, and ease of use. With a little study, readers can build their own monitoring system.

Xiaomi's monitoring system: Open-Falcon. The goal of Open-Falcon is to be the most open and easy-to-use enterprise-level monitoring product for Internet companies.

Third-party monitoring tools:

There are many good third-party monitoring services on the market now, such as Monitoring Bao, Monitoring Easy, and Tingyun, and many cloud vendors have their own monitoring. I am not going to introduce them here; if you want to learn about third-party monitoring, you can consult the vendors' official websites. (I would rather not sound like an advertisement.)

5. Monitoring process

With so many tools introduced above, which one is the most suitable? I recommend several open source monitoring tools: Zabbix, Open-Falcon, and LEPUS (dedicated to monitoring databases).

This article builds the whole monitoring system around Zabbix.

Then let’s talk about the entire process of Zabbix:

Data collection: Zabbix collects data from systems through SNMP, Agent, ICMP, SSH, IPMI, and so on;

Data storage: Zabbix stores its data in MySQL, and other database services can also be used;

Data analysis: when we review and analyze a fault afterwards, Zabbix provides graphs, timestamps, and other relevant information that help us locate the fault;

Data display: web interface (a mobile app is also possible, and a custom web interface can be developed in Java or PHP);

Monitoring and alarming: phone, email, WeChat, and SMS alarms, an alarm escalation mechanism, and so on (practically any alarm channel is available);

Alarm processing: when we receive an alarm, we handle it according to the level of the fault, such as important and urgent, or important but not urgent, and work with the relevant people to resolve it quickly.

6. Monitoring indicators

We have learned about the monitoring methods, goals, processes, and what tools are available for monitoring. Some people may be wondering, what exactly do we want to monitor? Then I have sorted it out here:

6.1 Hardware Monitoring

In the early days, we inspected the computer room and checked the indicator lights on the hardware to decide whether something was faulty. As everyone understands, this wasted manpower on repetitive, non-technical work.

Of course we can now monitor the details of the hardware through IPMI and set alarm thresholds for CPU, memory, disk, temperature, fan, voltage, etc. (We can write a reasonable alarm range for the monitoring alarm content by ourselves)

IPMI Monitoring Hardware Service Reference Material

6.2 System Monitoring

Small and medium-sized enterprises run almost entirely on Linux servers, so we must monitor the usage of system resources. System monitoring is the foundation of the monitoring system.

Main objects to monitor:

CPU: there are several important concepts, namely context switching, the run queue, and usage.

These are also the key indicators for CPU monitoring.

As a rule of thumb, the run queue of each processor should not exceed 3, the user mode to kernel mode ratio of CPU utilization should stay around 70/30, and idle time should stay around 50%. Whether the context switching rate is reasonable has to be judged together with how busy the system is.

Commonly used tools for CPU include: htop, top, vmstat, mpstat, dstat, glances
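
The user mode, kernel mode, and idle shares mentioned above can be derived from two samples of the aggregate cpu line in /proc/stat on Linux. The following is a minimal sketch; the function names are hypothetical, and grouping nice with user time and irq/softirq with kernel time is my own simplification.

```python
def read_cpu_times(stat_text: str) -> dict:
    """Parse the aggregate 'cpu' line of /proc/stat into named counters."""
    fields = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(fields)]]
            return dict(zip(fields, values))
    raise ValueError("no aggregate 'cpu' line found")


def cpu_percentages(before: dict, after: dict) -> dict:
    """Share of time per mode between two samples, in percent."""
    delta = {name: after[name] - before[name] for name in before}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    user = delta["user"] + delta["nice"]
    kernel = delta["system"] + delta.get("irq", 0) + delta.get("softirq", 0)
    return {
        "user": 100 * user / total,
        "kernel": 100 * kernel / total,
        "idle": 100 * delta["idle"] / total,
        "iowait": 100 * delta.get("iowait", 0) / total,
    }
```

In practice the two samples would come from reading /proc/stat twice a few seconds apart; the counters are cumulative since boot, so only the difference between samples is meaningful.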

Zabbix provides system monitoring template: Zabbix Agent Interface

Memory: usually we monitor memory usage and swap usage. We can also have Zabbix draw the memory usage curve, which helps to spot problems such as a service leaking memory.

Commonly used tools for memory include: free, top, vmstat, glances
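
As an illustration of where memory and swap usage come from on Linux, here is a minimal sketch that parses the text of /proc/meminfo. The function names are hypothetical, and the sketch assumes a kernel that reports MemAvailable (3.14 or later).

```python
def parse_meminfo(text: str) -> dict:
    """Return /proc/meminfo values (in kB for most fields), keyed by field name."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            info[name.strip()] = int(parts[0])
    return info


def memory_usage(info: dict) -> dict:
    """Memory and swap usage in percent."""
    total = info["MemTotal"]
    available = info["MemAvailable"]  # assumption: field is present
    swap_total = info.get("SwapTotal", 0)
    swap_free = info.get("SwapFree", 0)
    swap_used = 100 * (swap_total - swap_free) / swap_total if swap_total else 0.0
    return {
        "memory_used_percent": 100 * (total - available) / total,
        "swap_used_percent": swap_used,
    }
```

Using MemAvailable rather than MemFree avoids counting reclaimable page cache as used memory, which would otherwise trigger false alarms on healthy servers.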

IO: IO is divided into disk IO and network IO. More detailed data is only needed for performance tuning; daily monitoring focuses on disk usage, disk throughput, and how busy the disk is, and on the network side, network card traffic.

Commonly used tools include: iostat, iotop, df, iftop, sar, glances
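
For the disk usage part, a small sketch based on os.statvfs (Unix only) might look like the following. The function name is hypothetical; the calculation follows the way df reports usage, and it also covers inode usage, which is easy to forget.

```python
import os


def disk_usage_percent(path: str = "/") -> dict:
    """Space and inode usage of the filesystem that holds `path`, in percent."""
    st = os.statvfs(path)
    # Like df: used blocks relative to used plus blocks available to normal users.
    used_blocks = st.f_blocks - st.f_bfree
    usable_blocks = used_blocks + st.f_bavail
    space = 100 * used_blocks / usable_blocks if usable_blocks else 0.0
    # Some filesystems report no inode totals; treat that as 0% used.
    used_inodes = st.f_files - st.f_ffree
    inodes = 100 * used_inodes / st.f_files if st.f_files else 0.0
    return {"space_percent": space, "inode_percent": inodes}
```

A filesystem can run out of inodes while space is still free (many tiny files), so both values are worth a trigger.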

Other system monitoring covers listening ports of running processes, the number of processes, logged-in users, open files, and so on (see Zabbix's own OS Linux template for details).

6.3 Application Monitoring

With hardware monitoring and system monitoring in place, the next step is to log in to the server and see which services it runs, because all of them need to be monitored.

Application service monitoring is also an important part of the monitoring system. Services such as LVS, Haproxy, Docker, Nginx, PHP, Memcached, Redis, MySQL, and Rabbitmq all need to be monitored with Zabbix.

The author has written about the detailed operation process of service monitoring before, so I will not show them one by one here.

Zabbix provides application service monitoring: Zabbix Agent UserParameter
Java monitoring provided by Zabbix: Zabbix JMX Interface
Percona provides MySQL database monitoring: percona-monitoring-plugins
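
A custom check for a service usually boils down to a small script that prints one value, which an agent can then collect. As a sketch, the following parses the plain-text output of Nginx's stub_status page; the function name is hypothetical, and fetching the page and registering the script with the agent are left out.

```python
import re


def parse_stub_status(text: str) -> dict:
    """Extract the counters from the output of Nginx's stub_status module."""
    active = re.search(r"Active connections:\s+(\d+)", text)
    totals = re.search(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s*$", text, re.MULTILINE)
    states = re.search(r"Reading:\s+(\d+)\s+Writing:\s+(\d+)\s+Waiting:\s+(\d+)", text)
    if not (active and totals and states):
        raise ValueError("unexpected stub_status output")
    accepts, handled, requests = (int(v) for v in totals.groups())
    reading, writing, waiting = (int(v) for v in states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
    }
```

The accepts, handled, and requests values are counters since Nginx started, so a monitoring item would normally store their change per second rather than the raw number.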

6.4 Network Monitoring

As an e-commerce website targeting users across the country, it is also necessary to keep track of the network status of various places and computer rooms at all times.

Network monitoring is something we must consider when building a monitoring platform, especially with multiple computer rooms. The network status between computer rooms, inside each computer room, and across the country is what we need to focus on. So how do we obtain this status information? We use the network monitoring tool Smokeping.

Smokeping is the work of Tobi Oetiker, the author of RRDtool. It is written in Perl and is mainly used to monitor network performance, www server performance, DNS query performance, and so on. It draws graphs with RRDtool and supports distributed deployment, collecting and aggregating data directly from multiple agents.
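
To show the kind of numbers such a tool works with, here is a minimal sketch that summarizes one round of probes into packet loss and latency figures. It is not Smokeping's implementation; the function name and the convention that a lost probe is recorded as None are assumptions.

```python
from statistics import median
from typing import Optional, Sequence


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize one probe round; a lost probe is represented by None."""
    if not rtts_ms:
        raise ValueError("at least one probe is required")
    answered = [rtt for rtt in rtts_ms if rtt is not None]
    lost = len(rtts_ms) - len(answered)
    return {
        "loss_percent": 100 * lost / len(rtts_ms),
        "median_ms": median(answered) if answered else None,
        "min_ms": min(answered) if answered else None,
        "max_ms": max(answered) if answered else None,
    }
```

The median is less sensitive to a single slow reply than the mean, which is why it is a reasonable headline figure for latency.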

At the same time, since you have relatively few monitoring points, you can also use many commercial monitoring tools, such as Monitoring Bao, Tingyun, Keynote, Borui, etc. At the same time, these service providers can also help you monitor the status of your CDN.

6.5 Traffic Analysis

Website traffic analysis is something operation and maintenance personnel must master. For example, for an e-commerce company:

Through statistics and analysis of order sources, we can understand whether our advertising investment on a certain website has achieved the expected results.

You can distinguish the number of visitors from different regions and even the transaction volume of goods.

With Baidu Statistics, Google Analytics, webmaster tools, and similar services, you only need to embed a piece of JavaScript in the page.

However, the data then always sits in someone else's hands and customization is inconvenient. An open source analysis tool such as Piwik (now called Matomo), which you host yourself, avoids this.

6.6 Log monitoring

Normally, as the system runs, the operating system will generate system logs, and the application will generate application access logs, error logs, operation logs, and network logs. We can use ELK for log monitoring.

For log monitoring, the most common requirements are collection, storage, query, and display.

The open source community has a project for each of these: Logstash (collection), Elasticsearch (storage and search), and Kibana (display).

Together these three are called the ELK Stack, so ELK Stack refers to the combination of Elasticsearch, Logstash, and Kibana.

Once logs are collected, an exception caused by a deployment or update can be seen in Kibana immediately.

Of course, you can also filter error logs through Zabbix to generate alerts.
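
Filtering error logs to produce an alert can be as simple as counting response status classes. The sketch below assumes Nginx's default "combined" access log format; the function names are hypothetical, and the value at which the error rate should alarm is left to the reader.

```python
import re
from collections import Counter
from typing import Iterable

# Start of the "combined" format:
# remote_addr - remote_user [time_local] "request" status ...
LOG_LINE = re.compile(r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) ')


def count_status_classes(lines: Iterable[str]) -> Counter:
    """Count responses per status class, e.g. '2xx', '4xx', '5xx'."""
    classes = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            classes[match.group("status")[0] + "xx"] += 1
    return classes


def server_error_rate(classes: Counter) -> float:
    """Share of 5xx responses among all parsed requests, in percent."""
    total = sum(classes.values())
    return 100 * classes["5xx"] / total if total else 0.0
```

Lines that do not match the expected format are skipped silently here; a real check might count them as well, since a burst of unparseable lines is itself a signal.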

6.7 Security Monitoring

There are many open source security products for Linux, such as iptables at layer four and web protection at layer seven with Nginx plus Lua or a WAF. The related logs can all be collected into the ELK Stack, with the different attack types displayed graphically. But this is always time-consuming, and personally I think the results are not very good. At this point we can choose to connect to third-party service providers.

Third-party vendors provide comprehensive vulnerability libraries covering services, backdoors, databases, configuration detection, CGI, SMTP, and other types.

They offer comprehensive detection of host and web application vulnerabilities, and combine their own research with industry sharing to update 0-day vulnerabilities promptly and eliminate the latest security risks.

6.8 API Monitoring

As APIs become more and more important, it is obvious that we also need such data to tell whether the APIs we provide are functioning properly.
We monitor GET, POST, PUT, DELETE, HEAD, and OPTIONS requests to the API. Availability, correctness, and response time are the three major performance indicators.
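
The three indicators can be measured with a single request. The following is a minimal sketch using only the standard library; check_api and its parameters are hypothetical, "available" is taken to mean that any HTTP response came back, and the opener argument exists only so the function can be exercised without a network.

```python
import time
import urllib.error
import urllib.request


def check_api(url, expected_status=200, expected_substring="",
              timeout=5.0, opener=urllib.request.urlopen):
    """Probe one URL and report availability, correctness and response time."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
            body = response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        status, body = exc.code, ""
    except OSError as exc:
        # Covers URLError, refused connections and timeouts.
        return {"available": False, "correct": False, "status": None,
                "elapsed_s": time.monotonic() - start, "error": str(exc)}
    return {
        "available": True,
        "correct": status == expected_status and expected_substring in body,
        "status": status,
        "elapsed_s": time.monotonic() - start,
        "error": None,
    }
```

A call such as check_api("http://127.0.0.1:8080/health", expected_substring="ok") would then yield one sample per indicator; the URL here is only a placeholder.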

6.9 Performance Monitoring

Comprehensive monitoring of web page performance, DNS response time, HTTP connection establishment time, page performance index, response time, availability, element size, etc.
Zabbix provides URL monitoring: Zabbix Web Monitoring

6.10 Business Monitoring

A monitoring platform without business indicator monitoring is not a complete monitoring platform. Usually in our monitoring system, we must monitor our important business indicators and set thresholds for alarm notifications.

For example, in the e-commerce industry:

How many orders are generated per minute;

How many users are registered per minute;

How many active users are there every day;

How many promotion activities are there every day;

How many users are introduced to the promotion activity;

How much traffic does the promotion bring in;

How much profit does the promotion bring?

And so on. Important indicators like these can be added to Zabbix and then displayed on a screen.
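
As an example of turning a business indicator into something a trigger can act on, here is a minimal sketch that counts orders per minute and lists the minutes that fall below a minimum. The function names and the idea of alarming on "too few orders" are illustrative assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable, List


def orders_per_minute(order_times: Iterable[datetime]) -> Counter:
    """Count orders per calendar minute."""
    return Counter(t.replace(second=0, microsecond=0) for t in order_times)


def quiet_minutes(counts: Counter, start: datetime, end: datetime,
                  minimum: int) -> List[datetime]:
    """Minutes in [start, end) with fewer than `minimum` orders, including zero."""
    quiet = []
    minute = start.replace(second=0, microsecond=0)
    while minute < end:
        if counts[minute] < minimum:  # a Counter returns 0 for missing minutes
            quiet.append(minute)
        minute += timedelta(minutes=1)
    return quiet
```

Walking the whole time range matters: a minute with no orders at all never appears in the counts, yet it is exactly the case that should raise an alarm.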

7. Monitoring and alarm

There are many ways to send fault alarm notifications. The most commonly used are SMS and email.

8. Alarm handling

How do we usually deal with a fault after an alarm? First, some faults can be handled automatically through the alarm escalation mechanism. For example, if the Nginx service is down, an escalation step can start Nginx automatically. For a serious failure in the business, however, we usually assign different operation and maintenance personnel according to the level of the failure and the business it affects. Different business forms, architectures, and services may call for different approaches, and there is no fixed model that applies everywhere.
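
The escalation idea can be sketched as a list of steps, each with a delay after which it becomes due while the fault stays unresolved. The step delays and action texts below are purely illustrative, and actions_due is a hypothetical helper, not part of any monitoring product.

```python
from typing import List, Sequence, Tuple

# (minutes since the alarm fired, action); all values are illustrative.
ESCALATION_STEPS: Sequence[Tuple[int, str]] = (
    (0, "restart the service automatically"),
    (5, "notify the on-call engineer"),
    (30, "notify the team lead"),
)


def actions_due(minutes_unresolved: float,
                steps: Sequence[Tuple[int, str]] = ESCALATION_STEPS) -> List[str]:
    """Return every action whose delay has elapsed, in escalation order."""
    return [action for delay, action in steps if minutes_unresolved >= delay]
```

The automatic step comes first on purpose: if the restart fixes the fault within five minutes, nobody needs to be woken up.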

9. Monitoring in interviews

In operation and maintenance interviews, we are often asked questions related to monitoring. So how should we answer them? Based on this article, here is a simple outline for an answer.

Hardware monitoring. Routers and switches can be monitored through SNMP (the vendors can tell you how to set this up), while server temperature and similar readings are available through IPMI. If there is no hardware and everything is in the cloud, just skip this step.

System monitoring. This covers CPU load, context switching, memory usage, disk reads and writes, disk usage, and disk inode usage. The triggers for these need to be tuned, because the default thresholds are too low and cause frequent alarms.

Service monitoring. Take the LAMP architecture used by the company as an example: nginx comes with its own Status module, PHP also has a status page, MySQL can be monitored with Percona's official tools, and Redis exposes information through its own info command, which can then be filtered. The methods are all similar: either rely on what the service itself provides, or use scripts to collect what you want to monitor, together with alarms and graphs.

Network monitoring. With cloud hosts in a single computer room, you can choose not to monitor the network. If you do span several computer rooms, Smokeping is recommended for network-related monitoring. Or leave it to your network engineers, since every profession has its specialists.

Security monitoring. With cloud hosts, you can consider the provider's own security protection, and you can of course also use iptables. With physical hardware, a hardware firewall is recommended. In the cloud, you can purchase anti-DDoS protection so that an attack does not take the service down for a whole day. At the system level, the basics such as permissions, passwords, backup, and recovery must be done well. For the web layer, you can implement a web-level firewall with Nginx and Lua, or use OpenResty, which integrates the two.

Web monitoring. There is a lot to say about web monitoring. For example, you can use the built-in web monitoring to track page latency, js response time, download time, and so on. Here I recommend professional commercial software such as Monitoring Bao or Tingyun, since they have probe points in computer rooms all over the country. (Multiple computer rooms are a separate topic.)

Log monitoring. For the web, you can monitor Nginx's 50x and 40x error logs and PHP's ERROR log. These requirements come down to collection, storage, query, and display, all of which the open source ELK Stack covers: Logstash (collection), Elasticsearch (storage and search), and Kibana (display).

Business monitoring. After all this work, the final goal is still to keep the business running; only then does the monitoring make sense. Monitoring at the business level therefore requires meetings with the developers and managers to agree on the most important business indicators. These can then be collected with a simple script, and finally a trigger is set.

Traffic analysis. Usually we analyze logs with a bunch of tools such as awk and sed, which is not very convenient for counting IP, PV, and UV. Instead you can use Baidu Statistics, Google Analytics, or a commercial product by embedding its code in the page. To avoid privacy concerns, you can also use Piwik for traffic analysis.
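
For comparison with the awk and sed approach, here is a minimal Python sketch that counts IP, PV, and UV from access log lines. It assumes Nginx's "combined" log format, treats a successful GET of a non-static path as a page view, and approximates unique visitors by distinct client IPs, which is a simplification (real UV counting normally relies on cookies). All names are hypothetical.

```python
import re
from typing import Iterable

ACCESS_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)
# Assumption: requests for these suffixes are assets, not page views.
STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".ico")


def traffic_summary(lines: Iterable[str]) -> dict:
    """Count distinct IPs, page views (PV) and approximate unique visitors (UV)."""
    ips, visitors, page_views = set(), set(), 0
    for line in lines:
        match = ACCESS_LINE.match(line)
        if not match:
            continue
        ips.add(match.group("ip"))
        path = match.group("path").split("?", 1)[0]
        is_page = (match.group("method") == "GET"
                   and match.group("status") == "200"
                   and not path.endswith(STATIC_SUFFIXES))
        if is_page:
            page_views += 1
            visitors.add(match.group("ip"))
    return {"ip": len(ips), "pv": page_views, "uv": len(visitors)}
```

Which requests count as a page view is a business decision; the filter above is only one reasonable starting point.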

Visualization. Use Zabbix screens and bring in some third-party libraries to improve the interface. We also need to know when the order volume suddenly rises or falls, or when a large wave of traffic suddenly arrives. Where did this traffic come from? Was it a promotion or an attack? The monitoring platform can be used to sort out the business relationships between the various systems and answer such questions.

Automated monitoring. After all the work above, we certainly cannot add hosts and item keys one by one by hand. This can be automated through Zabbix's active mode and passive mode, and it is best done via the API.
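
As a rough illustration of driving Zabbix through its JSON-RPC API, the sketch below only builds the request body for a host.create call. The helper name, host name, IP address, and the group and template IDs are placeholders; how the authentication token is supplied differs between Zabbix versions and is deliberately left out, so check the API documentation for your version before using anything like this.

```python
import json


def build_host_create(hostname: str, ip: str, group_id: str,
                      template_id: str, request_id: int = 1) -> dict:
    """Build a JSON-RPC request body that registers one agent-monitored host."""
    return {
        "jsonrpc": "2.0",
        "method": "host.create",
        "params": {
            "host": hostname,
            "interfaces": [{
                "type": 1,        # 1 = Zabbix agent interface
                "main": 1,
                "useip": 1,
                "ip": ip,
                "dns": "",
                "port": "10050",  # default agent port
            }],
            "groups": [{"groupid": group_id}],
            "templates": [{"templateid": template_id}],
        },
        "id": request_id,
    }


def to_json(payload: dict) -> str:
    """Serialize the request body for sending to the API endpoint."""
    return json.dumps(payload)
```

Generating such requests from an inventory list is what makes it practical to register hundreds of hosts with the same template.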

Summary

If we really want a more complete monitoring system, the current open source software cannot fully satisfy us. Companies with the resources have begun to develop their own monitoring systems, such as Xiaomi's open source Open-Falcon. There are also relatively good open source monitoring frameworks such as Sensu which, combined with InfluxDB and Grafana, can be used to build a monitoring platform tailored to your own enterprise.

Of course, what I have said is still very basic. My experience is limited, and these are all the ideas I can offer. The above are some of my methods and experiences with monitoring. (Veterans, please go easy on me.)
