
Linux ftrace - Kernel Performance Analysis - Flame Graphs

Original by digoal, published 2022-01-20

Author

digoal

Date

2021-12-16

Tags

PostgreSQL , Linux , flame graph , trace , ftrace , strace , ptrace , read , write , IO


https://www.kernel.org/doc/Documentation/trace/ftrace.txt

https://dev.to/yugabyte/the-anatomy-of-a-linux-read-call-5g9j
https://dev.to/yugabyte/the-anatomy-of-a-linux-read-call-part-2-1aoi

Reading data is fundamental to many applications, especially databases, whose job is to manage data: they use a disk to persist data and to read the stored data back.

A lot of people take the database's latency measurement, or the latency measured by a tool, as the fundamental IO speed. This blog post is about what that figure actually means, and should help you diagnose and reason about that latency.

Know your IO: buffered and unbuffered/direct IO

A fundamental property that MUST be known is whether the read call is done buffered (using the Linux page cache) or unbuffered (reading directly into the executing process's address space). The IO call itself doesn't determine whether the IO is performed buffered or unbuffered; the file descriptor flags determine that.

The file descriptor flags are set by the open() or openat() call, and can also be changed later with the fcntl() call while the process is running.

The file descriptor flags bitmap can be seen in the flags field of /proc/[PID]/fdinfo/[file descriptor #]:

[root@yb-1 ~]# ls /proc/self/fdinfo   
0  1  2  3   
[root@yb-1 ~]# cat /proc/self/fdinfo/0   
pos:    0   
flags:  02002   
mnt_id: 26   

I haven't been able to find a simple explanation of the flags. There are two methods I use to understand what these flags are. One is to trace the open call with strace, which shows something like:

openat(AT_FDCWD, "/dev/sdb", O_RDONLY|O_DIRECT) = 3   

The other one is to use the lsof utility with '+fg':

lsof +fg -p $(pgrep -x dd)   
...   
dd      32863 root    0r   BLK    DIR,LG   8,16 0x83b00000      1503 /dev/sdb   
...   

You can see the 'DIR' for 'O_DIRECT'.
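
If you want to decode the fdinfo flags value directly, you can test the octal bitmap against the flag constants. A minimal bash sketch, assuming the usual x86_64 values from asm-generic/fcntl.h (O_RDWR=02, O_APPEND=02000, O_DIRECT=040000; the numbers differ on a few architectures):

flags=02002                        # octal value read from /proc/<pid>/fdinfo/<fd>
for f in O_RDWR:02 O_APPEND:02000 O_DIRECT:040000; do
    bit=${f#*:}                    # numeric (octal) value of this flag
    printf '%-9s %s\n' "${f%:*}" "$(( flags & bit ? 1 : 0 ))"
done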

Now that the methods to determine buffered or unbuffered IO are known: by default Linux performs everything buffered. When a file is opened unbuffered, a.k.a. with O_DIRECT, that is an explicit choice made by the application.

This blogpost assumes buffered IO starting from here.

Linux layers

We established we're doing buffered IO. Now, how do we understand the latency of a read call? There are several ways of measuring read latency, and the measurements take their data from different points in the operating system. Most operating systems are set up in a layered way to keep different activities grouped and things organized.

A fundamental grouping is the separation of userspace, where an application is running, and kernelspace, where (obviously) the kernel is running, and the hardware/device layers.

| # | Layer | Components |
| --- | --- | --- |
| 1 | Userspace | user application |
| 2 | Userspace | GNU C library (glibc) |
| 3 | Kernelspace | System Call Interface |
| 4 | Kernelspace | Subsystems: virtual filesystem, memory management, process management |
| 5 | Kernelspace | Architecture Dependent Code, device drivers |
| 6 | Hardware | Physical devices |

This is a rough grouping that I created for the sake of the argument about measuring latency; you could probably create many more or different versions of such a table.

The table is meant to be read from the top to the bottom, so from userspace to the hardware layer, which is typically how a call might flow.

Not all calls flow to the bottom. If code is performing work on data in already allocated memory, it will not go further than layer #2. This work is done in the user space layer. Typically, all CPU time spent there is recorded as 'user' CPU time.

Any time processing needs something from outside the current user space, such as allocating or freeing memory, or reading from a system device (the network stack, but also a disk), the process needs to call into the kernel, which is done via a system call. This includes performing a disk IO.

The system call performs a mode switch to kernel mode. In kernel mode devices can be accessed. All CPU time spent in kernel mode is recorded as 'system' CPU.

Measuring IO

When you look at IO figures and latencies, you have to understand two things. The first is whether the measurement you are looking at is an average or an absolute measurement. Averages hide peaks, and the longer the period the measurement is taken over, the more averaged the figure is.

The other important thing is which of the layers described above the measurement comes from.

  • iostat, sar, node exporter, or anything else that takes figures from /proc comes from level 5.
  • bcc (eBPF) tools xfsdist and xfsslower come from level 3.
  • bcc (eBPF) tools biosnoop, biolatency, and biotop come from level 5.
  • perf trace and strace come from level 3.
  • application figures come from level 1.

Measurements at different layers do include different amounts of work and thus might yield different figures. Also some tools can add (significant) latency (!!), such as strace and perf. Be extremely cautious in production situations!

I hope that, if you made it this far, you see that these layers matter. In a lot of situations, when you encounter high latencies in figures obtained from, for example, application-measured IO times, it simply means your disk device encountered that latency, and the latency figures are roughly equal across the different layers.

What does a latency figure look like?

But let's first look at latency figures. I said you have to know whether a figure is an average or an absolute measurement. The reason is that IO latencies are never a single figure. They fluctuate.

This might come as a shock to some people, because lots (and I mean: LOTS) of people talk about IO latency as a single figure. IO latency fluctuates, even with modern NVMe SSDs, because there are lots and lots of things happening at the same time on a modern computer system, which can and sometimes will influence your latency.

Storage latency in modern clouds fluctuates too, and you have to assess whether the fluctuations are acceptable. Some clouds have outliers that are quite high, but because of their low frequency still show an acceptable average latency.

If the fluctuations I just described surprise you, and you thought that running in a professional cloud means latency is consistent, please perform a fio benchmark on your cloud to find out. I was very surprised to see HDD-era latencies. I am not kidding here!
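
A minimal fio sketch to get a feel for the latency distribution (the file path, size, and runtime are placeholders; adjust them to your environment). The "clat percentiles" section of fio's output shows the outliers that an average hides:

fio --name=randread-latency \
    --filename=/data/fio.testfile --size=1g \
    --rw=randread --bs=8k --ioengine=libaio --direct=1 --iodepth=1 \
    --runtime=60 --time_based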

Know your environment

In most cases there is no need for a deep dive into kernel internals: your physical IO latency simply was high. This can reasonably easily be established by linking latencies from the application and from tools that obtain them at layer 3, and comparing them with latencies obtained at layer 5, as in the sketch below.
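
A sketch of that comparison, assuming the bcc tools mentioned earlier are installed (on many distributions they live under /usr/share/bcc/tools; the path may differ on yours), sampling both layers over the same window:

/usr/share/bcc/tools/xfsdist 30 1 &      # layer 3: XFS read/write latency histogram
/usr/share/bcc/tools/biolatency 30 1 &   # layer 5: block-device latency histogram
wait
# if the two histograms roughly agree, the latency really comes from the device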

In a lot of cases that I have seen, there is a lack of understanding of disk bandwidth. The first observation is that people have a hard time understanding or appropriately reporting IOPS (IO operations per second) and MBPS (megabytes per second). My advice here is: do not fight the figures (perform measurements to see if they can be reached), but work with the figures (carefully use them to see how well your application performs IO).

The second observation is that I see a lot of people ignoring the fact that with a little bit of work you can understand the IO limits of your system. Please put in the effort to go to the cloud vendor's documentation and work out what IO limits you actually got. Without them you are flying blind.

In all cases, make your environment, which in the cloud means CPU and IO, predictable and consistent. Do not use bursting. Any assessment done with any form of bursting can and should be binned, because bursting alters the limits during your run. Essentially, bursting makes you measure two different systems: the one before and the one after the burst state of your virtual machine changed.

When IO limits are hit, for either IOPS or MBPS, in the cloud or outside the cloud with local devices, requests will queue. Queueing means a request has to wait before it gets served, which appears to the application layer as increased latency. If you want to know more about this, read up on https://en.wikipedia.org/wiki/Queueing_theory.
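
As a stylized illustration only (a single M/M/1 queue, which is not a faithful model of a real disk, but it shows the mechanism): a device that can serve 1000 IOPS (a 1 ms service time) and receives 900 IOPS has an average response time of 1/(1000-900) s = 10 ms, ten times the bare service time, even though the device is "only" 90% busy.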

Finally diving into the read call

When you encounter an issue and it provides inconsistent figures between userspace (level 1 or 3) and kernel device level (level 5), you might want to look deeper into the internals of a read call. This is rare!

The call I am diving into is pread64(). The pread64() call is what PostgreSQL uses on Linux to read data into its address space, as does YugabyteDB.

Most Linux tools will do a read() call, so I prepared a small C program that performs a pread64() call:

#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main() {
  int fd, sz;
  char *c = (char *) calloc(100, sizeof(char));  /* zero-filled read buffer */
  fd = open("tt", O_RDONLY);                     /* buffered: no O_DIRECT flag */
  if (fd < 0) { perror("tt"); exit(1); }
  sz = pread(fd, c, 10, 0);                      /* read up to 10 bytes at offset 0 */
  if (sz < 0) { perror("pread"); exit(1); }
  printf("called pread(% d, c, 10). returned that"
         " %d bytes were read.\n", fd, sz);
  c[sz] = '\0';
  printf("Those bytes are as follows: %s\n", c);
  close(fd);
  free(c);
  return 0;
}

Save it in a file called 'pread.c'. You compile the program in the following way:

cc pread.c -o pread   

And prepare the file it is going to read:

echo "x" > tt   

You can validate that it performs a pread64() call by running the program under strace:

# strace -e pread64 ./pread   
pread64(3, "x\n", 10, 0)                = 2   
called pread( 3, c, 10). returned that 2 bytes were read.   
Those bytes are as follows: x   
+++ exited with 0 +++   

Now let's perform the actual deep dive. Using ftrace, it's possible to obtain the exact functions executed inside the kernel.

ftrace uses the tracefs filesystem, which is normally mounted at /sys/kernel/debug/tracing. Inside tracefs, perform the following tasks:

For safety, first make sure tracing is off (it should already be off):

echo 0 > tracing_on   

Set the current tracer to 'function_graph':

echo function_graph > current_tracer   

Enable tracing for the pread64 system call:

echo ksys_pread64 > set_graph_function   

Trace annotation: add the executable name and PID of the program to the trace (funcgraph-proc), and add the name of the function to the closing bracket too (funcgraph-tail):

echo funcgraph-tail > trace_options   
echo funcgraph-proc > trace_options   

Turn tracing on:

echo 1 > tracing_on   

Now go to the other session and execute pread:

./pread   
called pread( 3, c, 10). returned that 2 bytes were read.   
Those bytes are as follows: x   

Turn tracing off:

echo 0 > tracing_on   

And look in the trace:

 1)  <...>-40173   |               |  ksys_pread64() {   
 1)  <...>-40173   |               |    __fdget() {   
 1)  <...>-40173   |   0.076 us    |      __fget_light();   
 1)  <...>-40173   |   0.768 us    |    } /* __fdget */   
 1)  <...>-40173   |               |    vfs_read() {   
 1)  <...>-40173   |               |      rw_verify_area() {   
 1)  <...>-40173   |               |        security_file_permission() {   
 1)  <...>-40173   |               |          selinux_file_permission() {   
 1)  <...>-40173   |               |            __inode_security_revalidate() {   
 1)  <...>-40173   |               |              _cond_resched() {   
 1)  <...>-40173   |   0.041 us    |                rcu_all_qs();   
 1)  <...>-40173   |   0.362 us    |              } /* _cond_resched */   
 1)  <...>-40173   |   0.696 us    |            } /* __inode_security_revalidate */   
 1)  <...>-40173   |   0.042 us    |            avc_policy_seqno();   
 1)  <...>-40173   |   1.390 us    |          } /* selinux_file_permission */   
 1)  <...>-40173   |   0.042 us    |          bpf_lsm_file_permission();   
 1)  <...>-40173   |   0.043 us    |          __fsnotify_parent();   
 1)  <...>-40173   |   0.053 us    |          fsnotify();   
 1)  <...>-40173   |   2.785 us    |        } /* security_file_permission */   
 1)  <...>-40173   |   3.122 us    |      } /* rw_verify_area */   
 1)  <...>-40173   |               |      __vfs_read() {   
 1)  <...>-40173   |               |        new_sync_read() {   
 1)  <...>-40173   |               |          xfs_file_read_iter [xfs]() {   
 1)  <...>-40173   |               |            xfs_file_buffered_aio_read [xfs]() {   
 1)  <...>-40173   |               |              xfs_ilock [xfs]() {   
 1)  <...>-40173   |               |                down_read() {   
 1)  <...>-40173   |               |                  _cond_resched() {   
 1)  <...>-40173   |   0.040 us    |                    rcu_all_qs();   
 1)  <...>-40173   |   0.362 us    |                  } /* _cond_resched */   
 1)  <...>-40173   |   0.686 us    |                } /* down_read */   
 1)  <...>-40173   |   1.042 us    |              } /* xfs_ilock [xfs] */   
 1)  <...>-40173   |               |              generic_file_read_iter() {   
 1)  <...>-40173   |               |                generic_file_buffered_read() {   
 1)  <...>-40173   |               |                  _cond_resched() {   
 1)  <...>-40173   |   0.041 us    |                    rcu_all_qs();   
 1)  <...>-40173   |   0.383 us    |                  } /* _cond_resched */   
 1)  <...>-40173   |               |                  pagecache_get_page() {   
 1)  <...>-40173   |               |                    find_get_entry() {   
 1)  <...>-40173   |   0.041 us    |                      PageHuge();   
 1)  <...>-40173   |   1.151 us    |                    } /* find_get_entry */   
 1)  <...>-40173   |   1.487 us    |                  } /* pagecache_get_page */   
 1)  <...>-40173   |   0.042 us    |                  mark_page_accessed();   
 1)  <...>-40173   |               |                  _cond_resched() {   
 1)  <...>-40173   |   0.042 us    |                    rcu_all_qs();   
 1)  <...>-40173   |   0.364 us    |                  } /* _cond_resched */   
 1)  <...>-40173   |               |                  _cond_resched() {   
 1)  <...>-40173   |   0.042 us    |                    rcu_all_qs();   
 1)  <...>-40173   |   0.359 us    |                  } /* _cond_resched */   
 1)  <...>-40173   |               |                  pagecache_get_page() {   
 1)  <...>-40173   |               |                    find_get_entry() {   
 1)  <...>-40173   |   0.041 us    |                      PageHuge();   
 1)  <...>-40173   |   0.382 us    |                    } /* find_get_entry */   
 1)  <...>-40173   |   0.708 us    |                  } /* pagecache_get_page */   
 1)  <...>-40173   |               |                  touch_atime() {   
 1)  <...>-40173   |               |                    atime_needs_update() {   
 1)  <...>-40173   |               |                      current_time() {   
 1)  <...>-40173   |   0.051 us    |                        ktime_get_coarse_real_ts64();   
 1)  <...>-40173   |   0.044 us    |                        timestamp_truncate();   
 1)  <...>-40173   |   0.707 us    |                      } /* current_time */   
 1)  <...>-40173   |   1.076 us    |                    } /* atime_needs_update */   
 1)  <...>-40173   |   1.399 us    |                  } /* touch_atime */   
 1)  <...>-40173   |   7.264 us    |                } /* generic_file_buffered_read */   
 1)  <...>-40173   |   7.627 us    |              } /* generic_file_read_iter */   
 1)  <...>-40173   |               |              xfs_iunlock [xfs]() {   
 1)  <...>-40173   |   0.041 us    |                up_read();   
 1)  <...>-40173   |   0.389 us    |              } /* xfs_iunlock [xfs] */   
 1)  <...>-40173   | + 10.019 us   |            } /* xfs_file_buffered_aio_read [xfs] */   
 1)  <...>-40173   | + 10.454 us   |          } /* xfs_file_read_iter [xfs] */   
 1)  <...>-40173   | + 10.863 us   |        } /* new_sync_read */   
 1)  <...>-40173   | + 11.227 us   |      } /* __vfs_read */   
 1)  <...>-40173   |   0.045 us    |      __fsnotify_parent();   
 1)  <...>-40173   |   0.048 us    |      fsnotify();   
 1)  <...>-40173   | + 15.714 us   |    } /* vfs_read */   
 1)  <...>-40173   | + 17.359 us   |  } /* ksys_pread64 */   

Depending on the amount of concurrency, you might need to find the correct pread64 call.
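
On a busy system, an easy way to locate your own call is to note the PID of the test program and filter the trace on it (the funcgraph-proc option enabled above adds the "comm-PID" column that this grep matches). A small sketch:

./pread &                                      # run the test program while tracing is on
pid=$!                                         # this PID is what the trace lines will show
wait "$pid"
echo 0 > /sys/kernel/debug/tracing/tracing_on
grep -e "-$pid" /sys/kernel/debug/tracing/trace | head -40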

What do we see? At the bottom of the trace we see that this pread64 call took about 17 microseconds in total. That is very fast. How can this be so fast?

There is already a lot going on; these are some of the important functions:

  • security_file_permission: security related functions, normally no latency issues.
  • xfs_ilock: the inode is locked.
  • pagecache_get_page: look up the page in the page cache.
  • xfs_iunlock: the inode is unlocked.

That's right, there was no physical IO in this case! That also explains the low latency! Obviously, for this to happen the requested pages must already be in the page cache; if they are not, they must be read from disk.
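
If you want to see the physical-read variant yourself before the next post, a minimal sketch: drop the page cache (as root; this affects the whole system, so do not do it on a busy production machine) and repeat the trace, so the read should miss the cache and go to disk.

sync
echo 3 > /proc/sys/vm/drop_caches
./pread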

Join me for the next post about the anatomy of a read call that includes a physical read.

https://zhuanlan.zhihu.com/p/267925553

Ftrace is a tracer built directly into the kernel.

Ftrace gives Linux the ability to look at and trace what is happening inside the kernel, which makes it easier to locate problem areas and track down bugs. Ftrace can trace the events leading up to a kernel crash, so the cause of the crash can be pinpointed accurately, which helps developers fix the problem properly.

If you have read the earlier article, you can think of ftrace as a kernel-side strace: it traces call activity in kernel space, but it is far more powerful than strace.

Enabling ftrace

The ftrace API lives in the kernel's debugfs filesystem, which is mounted at /sys/kernel/debug. When ftrace is enabled, a tracing directory is created inside debugfs:

[~]# cd /sys/kernel/debug/tracing   
[tracing]#   

Ftrace depends on the following kernel configuration options (a quick way to check them is shown after the list):

  • CONFIG_FUNCTION_TRACER
  • CONFIG_FUNCTION_GRAPH_TRACER
  • CONFIG_STACK_TRACER
  • CONFIG_DYNAMIC_FTRACE
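
A quick check, assuming your distribution ships the kernel config as /boot/config-$(uname -r) (some expose /proc/config.gz instead):

grep -E 'CONFIG_(FUNCTION_TRACER|FUNCTION_GRAPH_TRACER|STACK_TRACER|DYNAMIC_FTRACE)=' \
    "/boot/config-$(uname -r)"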

Function tracing

The function tracer is one of the main tracers in ftrace. It relies on the kernel being compiled with gcc's -pg option, so that every function call invokes mcount().

With CONFIG_DYNAMIC_FTRACE enabled, the mcount call sites are converted into NOPs after boot, preserving close to 100% of performance. The mcount() call sites are recorded at compile time and patched to NOPs at boot. NOPs are useless for tracing and are only converted back into tracing calls when the function (or function_graph) tracer is enabled. Enabling CONFIG_DYNAMIC_FTRACE is therefore recommended for performance.

Note also that even though the NOPs themselves do not affect performance, compiling with -pg still introduces a slight overhead.
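
As a small, hedged aside, you can get a glimpse of dynamic ftrace on a live system through two files that also appear in the tracing directory listing later in this post (the exact output format differs between kernel versions):

cd /sys/kernel/debug/tracing
cat dyn_ftrace_total_info              # how many call sites dynamic ftrace has recorded
wc -l < available_filter_functions     # functions that can currently be traced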

1) To find out which tracers are available, cat available_tracers:

[tracing]# cat available_tracers    
function_graph function sched_switch nop   

2) To enable the function tracer, simply echo "function" into current_tracer:

[tracing]# echo function > current_tracer   
[tracing]# cat current_tracer   
function   
[tracing]# cat trace | head -10   
    # tracer: function   
    #   
    #           TASK-PID    CPU#    TIMESTAMP  FUNCTION   
    #              | |       |          |         |   
                bash-16939 [000]  6075.461561: mutex_unlock <-tracing_set_tracer   
              <idle>-0     [001]  6075.461561: _spin_unlock_irqrestore <-hrtimer_get_next_event   
              <idle>-0     [001]  6075.461562: rcu_needs_cpu <-tick_nohz_stop_sched_tick   
                bash-16939 [000]  6075.461563: inotify_inode_queue_event <-vfs_write   
              <idle>-0     [001]  6075.461563: mwait_idle <-cpu_idle   
                bash-16939 [000]  6075.461563: __fsnotify_parent <-vfs_write   

Refer to the header for the meaning of the columns. The first two columns are the name and PID of the traced process.

CPU# is the CPU the traced process ran on.

TIMESTAMP is the timestamp (counted from boot).

FUNCTION shows the function call relationship: A <- B means B called A.

3) The function_graph tracer shows the call relationships in a different way:

 [tracing]# echo function_graph > current_tracer    
    [tracing]# cat trace | head -20   
    # tracer: function_graph   
    #   
    # CPU  DURATION                  FUNCTION CALLS   
    # |     |   |                     |   |   |   |   
     1)   1.015 us    |        _spin_lock_irqsave();   
     1)   0.476 us    |        internal_add_timer();   
     1)   0.423 us    |        wake_up_idle_cpu();   
     1)   0.461 us    |        _spin_unlock_irqrestore();   
     1)   4.770 us    |      }   
     1)   5.725 us    |    }   
     1)   0.450 us    |    mutex_unlock();   
     1) + 24.243 us   |  }   
     1)   0.483 us    |  _spin_lock_irq();   
     1)   0.517 us    |  _spin_unlock_irq();   
     1)               |  prepare_to_wait() {   
     1)   0.468 us    |    _spin_lock_irqsave();   
     1)   0.502 us    |    _spin_unlock_irqrestore();   
     1)   2.411 us    |  }   
     1)   0.449 us    |  kthread_should_stop();   
     1)               |  schedule() {   

In the DURATION column, a '+' marker means the call took more than 10 microseconds; if it took more than 100 microseconds, the marker is '!'.
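
Based on these markers, a small sketch that pulls only the slow calls out of a function_graph trace:

grep -E '[+!] +[0-9]+\.[0-9]+ us' /sys/kernel/debug/tracing/trace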

Using trace_printk()

printk() is a great debugging tool, but it has problems. If it is used in a very hot path, such as a timer interrupt handler, the scheduler, or the network stack, printk() can crash the system or create a live lock. Sometimes a bug even stops reproducing once printk() is added. These are all problems introduced by printk() itself.

Ftrace introduces a new kind of print called trace_printk(). It can be called just like printk(), and it can run in any context (interrupt code, NMI code, and so on). The advantage of trace_printk() is that it does not write to the console; it writes into the trace buffer instead, and the output can be read through the trace file.

A trace_printk() write into the trace buffer takes only about a tenth of a microsecond. With printk(), especially when writing to a serial console, each write can take several milliseconds. That performance advantage means trace_printk() barely disturbs even the most latency-sensitive areas of the kernel.

For example:

trace_printk("read foo %d out of bar %p\n", bar->foo, bar);   

Then cat the trace file:

 [tracing]# cat trace   
    # tracer: nop   
    #   
    #           TASK-PID    CPU#    TIMESTAMP  FUNCTION   
    #              | |       |          |         |   
               <...>-10690 [003] 17279.332920: : read foo 10 out of bar ffff880013a5bef8   

trace_printk() output shows up in any tracer, including the function and function_graph tracers:

 [tracing]# echo function_graph > current_tracer   
    [tracing]# insmod ~/modules/foo.ko   
    [tracing]# cat trace   
    # tracer: function_graph   
    #   
    # CPU  DURATION                  FUNCTION CALLS   
    # |     |   |                     |   |   |   |   
     3) + 16.283 us   |      }   
     3) + 17.364 us   |    }   
     3)               |    do_one_initcall() {   
     3)               |      /* read foo 10 out of bar ffff88001191bef8 */   
     3)   4.221 us    |    }   
     3)               |    __wake_up() {   
     3)   0.633 us    |      _spin_lock_irqsave();   
     3)   0.538 us    |      __wake_up_common();   
     3)   0.563 us    |      _spin_unlock_irqrestore();   

Starting and stopping the trace

1) Turn tracing on:

[tracing]# echo 1 > tracing_on   

2) Turn tracing off:

[tracing]# echo 0 > tracing_on   

Note that the following will not work (without the space, the shell parses 0> as a redirection, and nothing is written to the file):

echo 0> tracing_on   /* this will not work! */   

3) To trace a single command, wrap it like this:

[tracing]# echo 0 > tracing_on   
[tracing]# echo function_graph > current_tracer   
[tracing]# echo 1 > tracing_on; run_test; echo 0 > tracing_on   

https://linux.cn/article-9838-1.html

There are many tools for analyzing events at the kernel level: SystemTap, ktap, Sysdig, LTTng, and so on, and you can find plenty of introductory articles and material about them on the web.

Far less material exists about Linux's native mechanism for tracing system events and retrieving/analyzing information about failures. That mechanism is ftrace, the first tracing tool added to the kernel. Let's take a look at what it can do, starting with some important terminology.

Kernel tracing and profiling

Kernel profiling can reveal performance "bottlenecks". Profiling helps us find the exact place in a program where performance is lost. A dedicated program generates a profile (a summary of events) that helps us figure out which functions consume a large share of the run time, although such programs cannot tell us why the performance is lost.

Bottlenecks often occur under conditions that cannot be identified through profiling. To infer why an event happened, the context at the moment it happened has to be preserved, and that is what tracing is for.

Tracing can be understood as the process of collecting information about the activity of a normally working system. It is done with dedicated tools, much like a recorder records sound: it records various system events.

Tracing programs can trace application-level and OS-level events at the same time. The information they collect can be used to diagnose a variety of system problems.

Tracing is sometimes compared with logging. The two are indeed similar, but there are differences.

With tracing, the recorded information consists of low-level events, and there are hundreds or even thousands of them. With logging, the recorded information consists of high-level events, and there are usually far fewer of them: user logins, application errors, database transactions, and so on.

Like logs, trace data can be read as-is, but it is more useful to extract the information with dedicated applications. All tracing programs can do this.

For kernel tracing and profiling, the Linux kernel provides three main mechanisms:

  • tracepoints: a mechanism that works with static instrumentation code
  • kprobes: a dynamic tracing mechanism used to interrupt the execution of kernel code at an arbitrary point, call its own handler, and return after the required operations are done
  • perf_events: an interface for accessing the PMU (Performance Monitoring Unit)

I am not going to write about these mechanisms here; anyone interested can visit Brendan Gregg's blog.

Using ftrace, we can interact with these mechanisms and get debugging information directly from user space. We will discuss this in detail below. All command-line examples in this article were run on Ubuntu 14.04 with kernel version 3.13.0-24.

ftrace: general information

ftrace is short for Function Trace, but it can do far more than that: it can trace context switches, measure the time a process stays blocked, compute the active time of high-priority tasks, and more.

ftrace was developed by Steven Rostedt and has been included in the kernel since version 2.6.27, released in 2008. It is a framework that provides a debugging ring buffer for recording data; the data is collected by tracers integrated into the kernel.

ftrace works on top of the debugfs filesystem, which is mounted by default in most modern Linux distributions. To start using ftrace, go to the /sys/kernel/debug/tracing directory (it is available to the root user only):

# cd /sys/kernel/debug/tracing   

The contents of this directory should look something like this:

available_filter_functions  options            stack_trace_filter   
available_tracers           per_cpu             trace   
buffer_size_kb              printk_formats      trace_clock   
buffer_total_size_kb        README              trace_marker   
current_tracer              saved_cmdlines      trace_options   
dyn_ftrace_total_info       set_event           trace_pipe   
enabled_functions           set_ftrace_filter   trace_stat   
events                      set_ftrace_notrace  tracing_cpumask   
free_buffer                 set_ftrace_pid      tracing_max_latency   
function_profile_enabled    set_graph_function  tracing_on   
instances                   set_graph_notrace   tracing_thresh   
kprobe_events               snapshot            uprobe_events   
kprobe_profile              stack_max_size      uprobe_profile   

I am not going to describe all of these files and subdirectories; their descriptions are already well covered in the official documentation. I only want to look in detail at the files that are relevant to this article:

  • available_tracers: the available tracers
  • current_tracer: the tracer that is currently running
  • tracing_on: the file responsible for enabling or disabling writes into the ring buffer (write the number 1 to it to enable, 0 to disable)
  • trace: the file that holds the trace data in a human-readable format

Available tracers

We can view the list of available tracers with the following command:

root@andrei:/sys/kernel/debug/tracing#: cat available_tracers   
blk mmiotrace function_graph wakeup_rt wakeup function nop   

A quick look at the features of each tracer:

  • function: a function call tracer without arguments
  • function_graph: a function call tracer that shows subcalls
  • blk: a tracer for calls and events related to block I/O (it is what blktrace uses)
  • mmiotrace: a memory-mapped I/O operation tracer
  • nop: the simplest tracer, which, as the name suggests, does nothing (although it can come in handy in some situations, which we will describe later)

The function tracer

Before we get into the ftrace function tracer, let's look at a test script:

#!/bin/sh   
dir=/sys/kernel/debug/tracing   
sysctl kernel.ftrace_enabled=1   
echo function > ${dir}/current_tracer   
echo 1 > ${dir}/tracing_on   
sleep 1   
echo 0 > ${dir}/tracing_on   
less ${dir}/trace   

The script is very simple, but there are a few things worth noting. The command sysctl kernel.ftrace_enabled=1 enables the function tracer. We then enable the current tracer by writing its name into the current_tracer file.

Next, we write 1 into tracing_on, which enables the ring buffer. The syntax requires a space between the 1 and the > symbol; echo 1> tracing_on will not work. One line later we disable it (note: if 0 is written to tracing_on, the buffer is not cleared, and ftrace itself is not disabled).

Why do we do this? Between the two echo commands we see the command sleep 1. We enable the buffer, run this command, and then disable the buffer. This lets the tracer collect information about all the system calls that occurred while the command was running.

In the last line of the script, we issue a command that displays the trace data on the console.

Once the script has finished, we will see output like the following (just a small fragment):

# tracer: function   
#   
# entries-in-buffer/entries-written: 29571/29571   #P:2   
#   
#                           _-----=> irqs-off   
#                           / _----=> need-resched   
#                           | / _---=> hardirq/softirq   
#                           || / _--=> preempt-depth   
#                           ||| /   delay   
#           TASK-PID   CPU#  ||||   TIMESTAMP  FUNCTION   
#           | |     |   ||||    |       |   
        trace.sh-1295  [000] ....   90.502874: mutex_unlock <-rb_simple_write   
        trace.sh-1295  [000] ....   90.502875: __fsnotify_parent <-vfs_write   
        trace.sh-1295  [000] ....   90.502876: fsnotify <-vfs_write   
        trace.sh-1295  [000] ....   90.502876: __srcu_read_lock <-fsnotify   
        trace.sh-1295  [000] ....   90.502876: __srcu_read_unlock <-fsnotify   
        trace.sh-1295  [000] ....   90.502877: __sb_end_write <-vfs_write   
        trace.sh-1295  [000] ....   90.502877: syscall_trace_leave <-int_check_syscall_exit_work   
        trace.sh-1295  [000] ....   90.502878: context_tracking_user_exit <-syscall_trace_leave   
        trace.sh-1295  [000] ....   90.502878: context_tracking_user_enter <-syscall_trace_leave   
        trace.sh-1295  [000] d...   90.502878: vtime_user_enter <-context_tracking_user_enter   
        trace.sh-1295  [000] d...   90.502878: _raw_spin_lock <-vtime_user_enter   
        trace.sh-1295  [000] d...   90.502878: __vtime_account_system <-vtime_user_enter   
        trace.sh-1295  [000] d...   90.502878: get_vtime_delta <-__vtime_account_system   
        trace.sh-1295  [000] d...   90.502879: account_system_time <-__vtime_account_system   
        trace.sh-1295  [000] d...   90.502879: cpuacct_account_field <-account_system_time   
        trace.sh-1295  [000] d...   90.502879: acct_account_cputime <-account_system_time   
        trace.sh-1295  [000] d...   90.502879: __acct_update_integrals <-acct_account_cputime   

The output starts with the number of entries in the buffer and the total number of entries written. The difference between these two numbers is the number of events lost from the buffer (no events were lost in our example).

Then there is a list of functions that includes the following information:

  • the process identifier (PID)
  • the CPU the process ran on (CPU#)
  • the timestamp of the event (TIMESTAMP)
  • the name of the traced function and the parent function that called it; for example, in the first line of our output, rb_simple_write called the mutex_unlock function.

The function_graph tracer

The function_graph tracer works just like the function tracer, but in more detail: it shows the entry and exit point of every function. With this tracer we can trace functions together with their subcalls and measure the run time of each function.

Let's edit the script from the previous example:
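
The edited script presumably just switches the tracer to function_graph, along these lines:

#!/bin/sh
dir=/sys/kernel/debug/tracing
sysctl kernel.ftrace_enabled=1
echo function_graph > ${dir}/current_tracer
echo 1 > ${dir}/tracing_on
sleep 1
echo 0 > ${dir}/tracing_on
less ${dir}/trace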

   