How do I diagnose the performance of a Linux server?

Source: Open Source China community. Original link: https://urlify.cn/y7nyqa

Performance diagnosis of Linux

When you log in to a Linux server to solve a performance problem, what should you check in the first minute?

At Netflix, we have a massive EC2 Linux cloud and a large number of performance analysis tools to monitor and investigate its performance. These include Atlas for cloud-wide monitoring and Vector for on-demand instance analysis. While these tools help us solve most issues, we sometimes need to log in to an instance and run some standard Linux performance tools.

In this article, the Netflix Performance Engineering team shows you the first 60 seconds of an optimized performance investigation at the command line, using standard Linux tools you should have available.

The first 60 seconds: summary

By running the following ten commands, you can get a high-level idea of system resource usage and running processes within 60 seconds. Look for errors and saturation metrics first, as they are both easy to interpret, and then look at resource utilization. Saturation is where a resource has more load than it can handle; it shows up either as the length of a request queue or as time spent waiting.

uptime
dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
top

Some of these commands require the sysstat package to be installed. The metrics they expose help you work through the USE method, a methodology for locating performance bottlenecks: for every resource (CPU, memory, disks, and so on), check utilization, saturation, and errors. Also pay attention to when you have checked and exonerated a resource; by process of elimination this narrows the targets to study and directs the follow-up investigation.

The following sections summarize these commands, with examples from a production system. For more information about each tool, see its man page.

1. uptime

$ uptime
 23:51:26 up 21:31,  1 user,  load average: 30.02, 26.43, 19.02

This is a quick way to view the load averages, which indicate the number of tasks (processes) wanting to run. On Linux systems, these numbers include processes wanting to run on CPU as well as processes blocked waiting for I/O (usually disk I/O). This gives only a rough idea of system load and is worth just a quick look; you need other tools to understand the situation in detail.

The three numbers are exponentially damped moving averages of the load over 1 minute, 5 minutes, and 15 minutes. They show how load is changing over time. For example, if you have been asked to check a problem and the 1-minute value is much lower than the 15-minute value, the problem may already have passed and you logged in too late to observe it.

In the example above, the load is increasing over time: the 1-minute value is just above 30, while the 15-minute average is only 19. Numbers this large mean a lot of something, most likely CPU demand. To confirm, run vmstat or mpstat, covered in sections 3 and 4 below.
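The comparison between the 1-minute and 15-minute values can be sketched in a few lines of Python. This is only an illustration: the function name load_trend and the 10% tolerance are assumptions made for this example, not part of uptime or of the original article. The input is the first three fields of /proc/loadavg, which hold the same three averages that uptime prints.

```python
def load_trend(loadavg_text: str, tolerance: float = 0.1) -> str:
    """Classify load as 'rising', 'falling', or 'steady'.

    loadavg_text holds the 1-, 5-, and 15-minute load averages as its
    first three whitespace-separated fields (the /proc/loadavg layout).
    tolerance is an arbitrary band chosen for this sketch.
    """
    one, _five, fifteen = (float(x) for x in loadavg_text.split()[:3])
    if one > fifteen * (1 + tolerance):
        return "rising"   # recent load is higher than the long-term average
    if one < fifteen * (1 - tolerance):
        return "falling"  # the problem may already have passed
    return "steady"


if __name__ == "__main__":
    # Values taken from the uptime example in this section.
    print(load_trend("30.02 26.43 19.02"))
```

On a live system the same function could be fed the contents of /proc/loadavg instead of a literal string.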

2. dmesg | tail

$ dmesg | tail
[1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[…]
[1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child
[1880957.563408] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-rss:0kB
[2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request.  Check SNMP counters.

This shows the last 10 system messages, if there are any. Look for errors that can cause performance issues. The example above includes the OOM killer and TCP dropping a request. Don't miss this step! The dmesg command is always worth checking.
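Scanning the messages for known trouble signs can be automated. The sketch below is a minimal example under stated assumptions: the pattern list covers only the messages seen in the example above, and the name suspicious_lines is made up for illustration. A real checklist would need more patterns.

```python
import re
from typing import List

# Hypothetical pattern list based only on the messages shown in this section.
SUSPECT = re.compile(
    r"oom-killer|out of memory|killed process|syn flooding",
    re.IGNORECASE,
)


def suspicious_lines(dmesg_text: str) -> List[str]:
    """Return the kernel log lines that match any suspect pattern."""
    return [line for line in dmesg_text.splitlines() if SUSPECT.search(line)]


if __name__ == "__main__":
    sample = "\n".join([
        "[1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0",
        "[1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child",
        "[2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request.  Check SNMP counters.",
    ])
    for line in suspicious_lines(sample):
        print(line)
```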

3. vmstat 1

$ vmstat 1
procs ---------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b swpd      free   buff  cache   si   so    bi    bo    in    cs us sy id wa st
34  0    0 200889792  73708 591828    0    0     0     5     6    10 96  1  3  0  0
32  0    0 200889920  73708 591860    0    0     0   592 13284  4282 98  1  1  0  0
[…]
^C

vmstat(8), short for virtual memory statistics, is a commonly available tool (first created for BSD decades ago). It prints a summary of key server statistics on each line.

vmstat was run with an argument of 1 to print one-second summaries. In this version of vmstat, some columns in the first line of output show the average since boot rather than the previous second. For now, skip the first line, unless you want to learn and remember which column is which.

Check these columns:

r: The number of processes running on CPU and waiting for a turn. This is a better signal than load averages for determining CPU saturation, because it does not include I/O. To interpret: an "r" value greater than the CPU count means saturation.

free: Free memory in kilobytes. If there are too many digits to count, you have enough free memory. The "free -m" command, the seventh command below, better explains the state of free memory.

si, so: Swap-ins and swap-outs. If these are non-zero, you are out of memory.

us, sy, id, wa, st: These are breakdowns of CPU time, averaged across all CPUs. They are user time, system time (kernel), idle, wait I/O, and stolen time (taken by other guests, or with Xen, by the guest's own isolated driver domain).

The CPU time breakdowns confirm whether the CPUs are busy, by adding user and system time. A constant degree of wait I/O points to a disk bottleneck; this is where the CPUs are idle because tasks are blocked waiting for pending disk I/O. You can treat wait I/O as another form of CPU idle, one that gives a clue as to why the CPUs are idle.

System time is necessary for I/O processing. A high average system time, over 20%, can be worth exploring further: perhaps the kernel is processing the I/O inefficiently.

In the example above, CPU time is spent almost entirely at user level, which points to application-level usage. The CPUs are also well over 90% utilized on average. This is not necessarily a problem; check the degree of saturation using the "r" column.
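The column checks described above can be expressed as a small Python sketch. It assumes the 17-column layout shown in the example output; the function names, the finding labels, and the 90% "busy" cutoff are choices made for this illustration. The 20% system-time threshold and the "r greater than CPU count" rule come from the text.

```python
from typing import Dict, List

# Column order as printed in the vmstat example in this section.
VMSTAT_COLUMNS = ["r", "b", "swpd", "free", "buff", "cache", "si", "so",
                  "bi", "bo", "in", "cs", "us", "sy", "id", "wa", "st"]


def parse_vmstat_row(row: str) -> Dict[str, int]:
    """Map one vmstat data row onto the column names."""
    values = [int(v) for v in row.split()]
    if len(values) != len(VMSTAT_COLUMNS):
        raise ValueError("unexpected number of vmstat columns")
    return dict(zip(VMSTAT_COLUMNS, values))


def vmstat_findings(row: str, cpu_count: int) -> List[str]:
    """Apply the checks from this section to one vmstat row."""
    s = parse_vmstat_row(row)
    findings = []
    if s["r"] > cpu_count:
        findings.append("cpu saturated")      # more runnable tasks than CPUs
    if s["si"] or s["so"]:
        findings.append("swapping")           # non-zero swap-ins or swap-outs
    if s["us"] + s["sy"] > 90:                # illustrative cutoff
        findings.append("cpu busy")
    if s["sy"] > 20:
        findings.append("high system time")
    return findings


if __name__ == "__main__":
    row = "34 0 0 200889792 73708 591828 0 0 0 5 6 10 96 1 3 0 0"
    # 32 is the CPU count reported in the mpstat header later in this article.
    print(vmstat_findings(row, cpu_count=32))
```

Keep in mind that some columns in the first row of real vmstat output are averages since boot, so in practice the checks are more meaningful on later rows.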

4. mpstat -P ALL 1

$ mpstat -P ALL 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  x86_64  (32 CPU)

07:38:49 PM  CPU   %usr  %nice   %sys %iowait   %irq  %soft %steal %guest %gnice  %idle
07:38:50 PM  all  98.47   0.00   0.75    0.00   0.00   0.00   0.00   0.00   0.00   0.78
[…]
07:38:50 PM    2  98.00   0.00   1.00    0.00   0.00   0.00   0.00   0.00   0.00   1.00
07:38:50 PM    3  96.97   0.00   0.00    0.00   0.00   0.00   0.00   0.00   0.00   3.03
[…]

This command prints CPU time breakdowns per CPU, which can be used to check for an imbalance. A single hot CPU can be evidence of a single-threaded application.
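One way to check per-CPU figures for an imbalance is to compare each CPU's busy percentage with the median. The sketch below is an assumption-laden illustration: the function name hot_cpus and the 50-point gap are invented for this example, and the input is a plain mapping from CPU number to %idle as read from mpstat output.

```python
from statistics import median
from typing import Dict, List


def hot_cpus(idle_pct: Dict[int, float], gap: float = 50.0) -> List[int]:
    """Return CPUs whose busy percentage exceeds the median by more than gap.

    idle_pct maps CPU number to its %idle value. The gap of 50 percentage
    points is an arbitrary threshold chosen for this sketch.
    """
    busy = {cpu: 100.0 - idle for cpu, idle in idle_pct.items()}
    typical = median(busy.values())
    return sorted(cpu for cpu, b in busy.items() if b - typical > gap)


if __name__ == "__main__":
    # CPUs 2 and 3 from the example above: both busy, so no imbalance.
    print(hot_cpus({2: 1.00, 3: 3.03}))
    # Hypothetical values: one busy CPU on an otherwise idle machine.
    print(hot_cpus({0: 95.0, 1: 2.0, 2: 97.0, 3: 96.0}))
```

In the second call only CPU 1 stands out, which is the pattern a single-threaded application tends to produce.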

5. pidstat 1

$ pidstat 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  x86_64  (32 CPU)

07:41:02 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
07:41:03 PM     0         9    0.00    0.94    0.00    0.94     1  rcuos/0
07:41:03 PM     0      4214    5.66    5.66    0.00   11.32    15  mesos-slave
07:41:03 PM     0      4354    0.94    0.94    0.00    1.89     8  java
07:41:03 PM     0      6521 1596.23    1.89    0.00 1598.11    27  java
07:41:03 PM     0      6564 1571.70    7.55    0.00 1579.25    28  java
07:41:03 PM 60004     60154    0.94    4.72    0.00    5.66     9  pidstat

07:41:03 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
07:41:04 PM     0      4214    6.00    2.00    0.00    8.00    15  mesos-slave
07:41:04 PM     0      6521 1590.00    1.00    0.00 1591.00    27  java
07:41:04 PM     0      6564 1573.00   10.00    0.00 1583.00    28  java
07:41:04 PM   108      6718    1.00    0.00    0.00    1.00     0  snmp-pass
07:41:04 PM 60004     60154    1.00    4.00    0.00    5.00     9  pidstat
^C

pidstat is a little like top's per-process summary, but it prints a rolling summary instead of clearing the screen. This is useful for watching patterns over time, and also for recording what you saw (copy and paste) into a record of your investigation.

The example above identifies two java processes as responsible for consuming CPU. The %CPU column is the total across all CPUs; 1591% shows that this java process is consuming almost 16 CPUs.
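The conversion from %CPU to a CPU count is a simple division by 100. The short sketch below works through it for the two java processes in the second sample; the helper names are made up for this illustration.

```python
def cpus_consumed(percent_cpu: float) -> float:
    """Convert a pidstat %CPU value (summed over all CPUs) to a CPU count."""
    return percent_cpu / 100.0


def share_of_machine(percent_cpu: float, cpu_count: int) -> float:
    """Fraction of the whole machine's CPU capacity that percent_cpu represents."""
    return percent_cpu / (100.0 * cpu_count)


if __name__ == "__main__":
    # %CPU of the two java processes (PIDs 6521 and 6564) in the second sample.
    java = [1591.00, 1583.00]
    print(round(cpus_consumed(java[0]), 2))                 # 15.91
    print(round(sum(cpus_consumed(p) for p in java), 2))    # 31.74
    print(round(share_of_machine(sum(java), 32), 3))        # share of a 32-CPU host
```

Together the two processes account for roughly 31.7 of the 32 CPUs reported in the pidstat header, which is consistent with the high user time seen in vmstat.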

6. iostat -xz 1

$ iostat -xz 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  x86_64  (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          73.96    0.00    3.73    0.03    0.06   22.21

Device:   rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda        0.00     0.23    0.21    0.18     4.52     2.08    34.37     0.00    9.98   13.80    5.42   2.44   0.09
xvdb        0.01     0.00    1.02    8.94   127.97   598.53   145.79     0.00    0.43    1.78    0.28   0.25   0.25
xvdc        0.01     0.00    1.02    8.86   127.79   595.94   146.50     0.00    0.45    1.82    0.30   0.27   0.26
dm-0        0.00     0.00    0.69    2.32    10.47    31.69    28.01     0.01    3.23    0.71    3.98   0.13   0.04
dm-1        0.00     0.00    0.00    0.94     0.01     3.78     8.00     0.33  345.84    0.04  346.81   0.01   0.00
dm-2        0.00     0.00    0.09    0.07     1.35     0.36    22.50     0.00    2.55    0.23    5.62   1.78   0.03
[…]
^C

This is a great tool for understanding block devices (disks), both the workload applied and the resulting performance. Look for:

r/s, w/s, rkB/s, wkB/s: These are the reads, writes, read kilobytes, and write kilobytes delivered to the device per second. Use these to characterize the workload. A performance problem may simply be due to an excessive load applied.

await: The average time for the I/O in milliseconds. This is the time the application suffers, because it includes both time queued and time being serviced. Larger than expected average times can indicate device saturation or device problems.

avgqu-sz: The average number of requests issued to the device. Values greater than 1 can be evidence of saturation (although devices can typically operate on requests in parallel, especially virtual devices that front multiple back-end disks).

%util: Device utilization. This is really a busy percentage, showing the time each second that the device was doing work. Values greater than 60% typically lead to poor performance (which should show up in await), although it depends on the device. Values close to 100% usually indicate saturation.

If the storage device is a logical disk device fronting many back-end disks, then 100% utilization may only mean that some I/O is being processed 100% of the time; the back-end disks may be far from saturated and may be able to handle much more work.

Bear in mind that poorly performing disk I/O is not necessarily an application issue. Many techniques are typically used to perform I/O asynchronously, so that the application does not block and suffer the latency directly (for example, read-ahead for reads and buffering for writes).
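The per-device checks above can be sketched as follows. The column list is copied from the sample output in this section, and other sysstat versions may print different columns. The function name, the finding labels, and the optional await_limit_ms parameter are assumptions for this illustration; the article only says await should be compared with what you expect from the device.

```python
from typing import List, Optional, Tuple

# Column order as printed in the iostat example in this section.
IOSTAT_COLUMNS = ["rrqm/s", "wrqm/s", "r/s", "w/s", "rkB/s", "wkB/s",
                  "avgrq-sz", "avgqu-sz", "await", "r_await", "w_await",
                  "svctm", "%util"]


def iostat_findings(row: str,
                    util_limit: float = 60.0,
                    queue_limit: float = 1.0,
                    await_limit_ms: Optional[float] = None) -> Tuple[str, List[str]]:
    """Return (device, findings) for one iostat -x device row."""
    fields = row.split()
    device = fields[0]
    stats = dict(zip(IOSTAT_COLUMNS, (float(v) for v in fields[1:])))
    findings = []
    if stats["%util"] > util_limit:
        findings.append("high utilization")
    if stats["avgqu-sz"] > queue_limit:
        findings.append("queue above 1")
    # Expected latency depends on the device, so this check is opt-in.
    if await_limit_ms is not None and stats["await"] > await_limit_ms:
        findings.append("await above expected")
    return device, findings


if __name__ == "__main__":
    dm1 = "dm-1 0.00 0.00 0.00 0.94 0.01 3.78 8.00 0.33 345.84 0.04 346.81 0.01 0.00"
    print(iostat_findings(dm1))
    print(iostat_findings(dm1, await_limit_ms=100.0))
```

With the default thresholds dm-1 raises no findings, even though its await is far higher than the other devices; that only surfaces once an expected latency is supplied.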

7. free -m

$ free -m
             total       used       free     shared    buffers     cached
Mem:        245998      24545     221453         83         59        541
-/+ buffers/cache:      23944     222053
Swap:            0          0          0

The right two columns show:

buffers: The buffer cache, used for block device I/O.

cached: The page cache, used by file systems.

We just want to check that these are not near zero in size, which can lead to higher disk I/O (confirm using iostat) and worse performance. The example above looks fine, with many megabytes in each.

The "-/+ buffers/cache" row gives less confusing values for used and free memory than the first row. Linux uses otherwise idle memory for caches and reclaims it quickly when applications need it, so memory used as cache is, in a way, free memory. There is even a website about this confusion: linuxatemyram.

It can be additionally confusing if ZFS on Linux is used, because ZFS has its own file system cache that is not counted in the free -m columns. The system can appear to be low on free memory when that memory is in fact available from the ZFS cache as needed.
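The "-/+ buffers/cache" row can be reproduced from the "Mem:" row with two lines of arithmetic: subtract buffers and cache from used, and add them to free. The sketch below uses the values from the example; the function name is invented for this illustration.

```python
from typing import Tuple


def buffers_cache_row(used_mb: int, free_mb: int,
                      buffers_mb: int, cached_mb: int) -> Tuple[int, int]:
    """Return (used, free) with buffers and page cache counted as free."""
    reclaimable = buffers_mb + cached_mb
    return used_mb - reclaimable, free_mb + reclaimable


if __name__ == "__main__":
    # Values from the "Mem:" row of the free -m example in this section.
    used, free = buffers_cache_row(used_mb=24545, free_mb=221453,
                                   buffers_mb=59, cached_mb=541)
    print(used, free)  # 23945 222053
```

The free value matches the example exactly. The used value differs from the printed 23944 by one megabyte, presumably because free rounds each figure to whole megabytes independently.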

8. sar -n DEV 1

$ sar -n DEV 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  x86_64  (32 CPU)

12:16:48 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
12:16:49 AM      eth0  18763.00   5032.00  20686.42    478.30      0.00      0.00      0.00      0.00
12:16:49 AM        lo     14.00     14.00      1.36      1.36      0.00      0.00      0.00      0.00
12:16:49 AM   docker0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

12:16:49 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
12:16:50 AM      eth0  19763.00   5101.00  21999.10    482.56      0.00      0.00      0.00      0.00
[…]
12:16:50 AM   docker0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
^C

Use this tool to check network interface throughput, rxkB/s and txkB/s, and whether any limit has been reached. In the example above, eth0 receive is reaching 22 Mbytes/s, which is 176 Mbits/sec (well under a limit of, say, 1 Gbit/sec).

This version also provides %ifutil as a measure of device utilization (the maximum of both directions for full duplex). We also use Brendan's nicstat tool to measure this. As with nicstat, this value is hard to get right, and it does not appear to be working in this example (0.00).
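The throughput conversion in the eth0 example can be written out as a short calculation. This sketch follows the article's own rounding (22 Mbytes/s times 8 gives 176 Mbits/sec) and therefore treats one kB as 1000 bytes; that factor, the function names, and the 1 Gbit/sec link speed are assumptions to adjust for your environment.

```python
def kb_per_s_to_mbit_per_s(kb_per_s: float, bytes_per_kb: int = 1000) -> float:
    """Convert a sar rxkB/s or txkB/s value to megabits per second.

    bytes_per_kb defaults to 1000 to mirror the arithmetic in the text;
    pass 1024 if your tool reports binary kilobytes.
    """
    return kb_per_s * bytes_per_kb * 8 / 1_000_000


def link_utilization(kb_per_s: float, link_mbit_per_s: float) -> float:
    """Fraction of the link capacity in use for one direction."""
    return kb_per_s_to_mbit_per_s(kb_per_s) / link_mbit_per_s


if __name__ == "__main__":
    rx = 21999.10  # eth0 rxkB/s from the second sample above
    print(round(kb_per_s_to_mbit_per_s(rx), 1))   # about 176 Mbit/s
    print(round(link_utilization(rx, 1000.0), 3))  # share of an assumed 1 Gbit/s link
```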

9. sar -n TCP,ETCP 1

$ sar -n TCP,ETCP 1
Linux 3.13.0-49-generic (titanclusters-xxxx)  07/14/2015  x86_64  (32 CPU)

12:17:19 AM  active/s passive/s    iseg/s    oseg/s
12:17:20 AM      1.00      0.00  10233.00  18846.00

12:17:19 AM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
12:17:20 AM      0.00      0.00      0.00      0.00      0.00

12:17:20 AM  active/s passive/s    iseg/s    oseg/s
12:17:21 AM      1.00      0.00   8359.00   6039.00

12:17:20 AM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
12:17:21 AM      0.00      0.00      0.00      0.00      0.00
^C

This is a summarized view of some key TCP metrics. These include:

active/s: Number of locally initiated TCP connections per second (for example, via connect()).

passive/s: Number of remotely initiated TCP connections per second (for example, via accept()).

retrans/s: Number of TCP retransmits per second.

The active and passive counts are often useful as a rough measure of server load: the number of new accepted connections (passive) and the number of downstream connections (active). It may help to think of active as outbound and passive as inbound, although this is not strictly true (consider, for example, a localhost to localhost connection).

Retransmits are a sign of a network or server issue. The cause may be an unreliable network (for example, the public Internet), or a server that is overloaded and dropping packets. The example above shows just one new TCP connection per second.
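A minimal sketch of how one pair of sar rows could be reduced to the figures discussed above. The function name and the dictionary keys are invented for this illustration, and the inputs are the numeric columns only (without the timestamp), in the order shown in the example headers.

```python
from typing import Dict, Union


def tcp_summary(tcp_row: str, etcp_row: str) -> Dict[str, Union[float, bool]]:
    """Summarize one sar -n TCP,ETCP sample.

    tcp_row:  "active/s passive/s iseg/s oseg/s" values
    etcp_row: "atmptf/s estres/s retrans/s isegerr/s orsts/s" values
    """
    active, passive, _iseg, _oseg = (float(v) for v in tcp_row.split())
    _atmptf, _estres, retrans, _isegerr, _orsts = (float(v) for v in etcp_row.split())
    return {
        "new_connections_per_s": active + passive,
        "retransmits_per_s": retrans,
        "retransmitting": retrans > 0,
    }


if __name__ == "__main__":
    # First sample from the example above.
    print(tcp_summary("1.00 0.00 10233.00 18846.00", "0.00 0.00 0.00 0.00 0.00"))
```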

10. top

$ top
top - 00:15:40 up 21:56,  1 user,  load average: 31.09, 29.87, 29.92
Tasks: 871 total,   1 running, 868 sleeping,   0 stopped,   2 zombie
%Cpu(s): 96.8 us,  0.4 sy,  0.0 ni,  2.7 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  25190241+total, 24921688 used, 22698073+free,    60448 buffers
KiB Swap:        0 total,        0 used,        0 free.   554208 cached Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 20248 root      20   0  0.227t 0.012t  18748 S  3090  5.2  29812:58 java
  4213 root      20   0 2722544  64640  44232 S  23.5  0.0 233:35.37 mesos-slave
 66128 titancl+  20   0   24344   2332   1172 R   1.0  0.0   0:00.07 top
  5235 root      20   0 38.227g 547004  49996 S   0.7  0.2   2:02.74 java
  4299 root      20   0 20.015g 2.682g  16836 S   0.3  1.1   33:14.4 java
     1 root      20   0   33620   2920   1496 S   0.0  0.0   0:03.82 init
     2 root      20   0       0      0      0 S   0.0  0.0   0:00.02 kthreadd
     3 root      20   0       0      0      0 S   0.0  0.0   0:05.35 ksoftirqd/0
     5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
     6 root      20   0       0      0      0 S   0.0  0.0   0:06.94 kworker/u256:0
     8 root      20   0       0      0      0 S   0.0  0.0   2:38.05 rcu_sched

The top command includes many of the metrics we checked earlier. It can be handy to run it to see whether anything looks wildly different from the output of the earlier commands, which would indicate that the load is variable.

A downside of top is that it is harder to see patterns over time, which are clearer in tools like vmstat and pidstat that provide rolling output. Evidence of intermittent issues can also be lost when the screen clears, if you do not pause the output quickly enough (Ctrl-S to pause, Ctrl-Q to continue).

Subsequent analysis

There are many more commands and methodologies you can apply to drill deeper. See the Linux Performance Tools tutorial from the Velocity 2015 conference, which works through over 40 commands covering observability, benchmarking, tuning, static performance tuning, profiling, and tracing.

Tackling system reliability and performance problems at scale is one of our passions.
