Cgroup v2 and Page Cache #
The cgroup subsystem is the kernel’s way to distribute and limit system resources fairly. It organizes processes into a hierarchy where the leaf nodes depend on their parents and inherit their settings. In addition, cgroups provide a lot of helpful resource counters and statistics.
Control groups are everywhere. Even if you don’t use them explicitly, they are turned on by default in all modern GNU/Linux distributions and are integrated into systemd. This means that every service on a modern Linux system runs under its own cgroup.
Overview #
We have already touched on the cgroup subsystem several times in this series, but now let’s take a closer look at the entire picture. The cgroup plays a critical role in understanding Page Cache usage. It also helps to debug issues and configure software better by providing detailed stats. As mentioned earlier, the LRU lists use cgroup memory limits to make eviction decisions and to size the LRU lists themselves.
Another important capability of cgroup v2, which was unachievable with v1, is proper tracking of Page Cache IO writeback. v1 can’t tell which memory cgroup generates disk IOPS and therefore incorrectly tracks and limits disk operations. Fortunately, v2 fixes these issues and already provides a bunch of new features that help with Page Cache writeback.
The simplest way to find out all cgroups and their limits is to look into the /sys/fs/cgroup directory. But there are more convenient ways to get such info:
- systemd-cgls and systemd-cgtop to understand which cgroups systemd has and what they consume;
- below, an atop-like tool for cgroups: https://github.com/facebookincubator/below.
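For instance, a quick way to see which cgroup your current shell runs in and to get an overview of the whole tree:
$ cat /proc/self/cgroup       # the cgroup of the current shell
$ systemd-cgls                # the cgroup tree with the processes inside each cgroup
$ systemd-cgtop               # top-like live view of per-cgroup resource usage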
Memory cgroup files #
Now let’s review the most important parts of the cgroup memory controller from the perspective of Page Cache.
memory.current – shows the total amount of memory currently used by the cgroup and its descendants. It, of course, includes the Page Cache size.
NOTE
It may be tempting to use this value in order to set your cgroup/container memory limit, but wait a bit for the following chapter.
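For example, for the session scope cgroup used in the memory.stat example below (the path will differ on your system), the current usage in bytes can be read directly:
$ cat /sys/fs/cgroup/user.slice/user-1000.slice/session-3.scope/memory.current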
memory.stat – shows a lot of memory counters; the most important ones for us can be filtered by the file keyword:
$ grep file /sys/fs/cgroup/user.slice/user-1000.slice/session-3.scope/memory.stat
file 19804160 ❶
file_mapped 0 ❷
file_dirty 0 ❸
file_writeback 0 ❹
inactive_file 6160384 ❺
active_file 13643776 ❺
workingset_refault_file 0 ❻
workingset_activate_file 0 ❻
workingset_restore_file 0 ❻
where
- ❶ file – the size of the Page Cache;
- ❷ file_mapped – the size of file memory mapped with mmap();
- ❸ file_dirty – the size of dirty pages;
- ❹ file_writeback – how much data is being flushed at the moment;
- ❺ inactive_file and active_file – the sizes of the LRU lists;
- ❻ workingset_refault_file, workingset_activate_file and workingset_restore_file – metrics to better understand memory thrashing and refault logic.
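As a practical illustration, the Page Cache portion of a cgroup can be extracted from memory.stat with a one-liner and compared against memory.current (the cgroup path below is the same illustrative session scope as above):
$ CG=/sys/fs/cgroup/user.slice/user-1000.slice/session-3.scope
$ awk '$1 == "file" {print $2}' $CG/memory.stat   # Page Cache size in bytes
$ cat $CG/memory.current                          # total usage of the cgroup, Page Cache included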
memory.numa_stat – shows the above stats but for each NUMA node.
memory.min, memory.low, memory.high and memory.max – cgroup limits. I don’t want to repeat the cgroup v2 documentation here and recommend you go and read it first. But what you need to keep in mind is that using the hard max or min limits is not the best strategy for your applications and systems. A better approach is to set only the low and/or high limits close to what you think the working set size of your application is (see the example after the memory.events output below). We will talk about measuring and predicting it in the next section.
memory.events – shows how many times the cgroup hit the above limits:
memory.events
low 0
high 0
max 0
oom 0
oom_kill 0
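Returning to the limits above, here is a sketch of how they can be set, either by writing to the cgroup files directly or via systemd. The unit name myapp.service and the values are made up for illustration:
$ echo 1G | sudo tee /sys/fs/cgroup/system.slice/myapp.service/memory.high
$ echo 512M | sudo tee /sys/fs/cgroup/system.slice/myapp.service/memory.low
# or let systemd manage the same knobs for the unit:
$ sudo systemctl set-property --runtime myapp.service MemoryHigh=1G MemoryLow=512M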
memory.pressure – this file contains the Pressure Stall Information (PSI). It shows the general cgroup memory health by measuring the CPU time that was lost due to a lack of memory. This file is the key to understanding the reclaim process in the cgroup and, consequently, the Page Cache. Let’s talk about PSI in more detail.
Pressure Stall Information (PSI) #
Before PSI, it was hard to tell whether a system and/or a cgroup had resource contention or not, and whether a cgroup’s limits were overcommitted or under-provisioned. If the limit for a cgroup can be set lower, then where is the threshold? The PSI feature removes this guesswork: it not only gives us this information in real time, but also lets us set up user-space triggers and get notifications, so we can maximize hardware utilization without service degradation and OOM risks.
PSI works for the memory, CPU and IO controllers. For example, here is the output for memory:
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
where
- some – means that at least one task was stalled on memory for that average percentage of wall-time during the last 10, 60 and 300 seconds. The “total” field shows the absolute value in microseconds in order to reveal any spikes;
- full – means the same but for all tasks in the cgroup at once. This metric is a good indication of trouble and usually means under-provisioning of the resource or incorrect software settings.
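The same three files also exist system-wide in procfs, assuming PSI is enabled in your kernel (CONFIG_PSI and, on some distributions, the psi=1 boot parameter):
$ cat /proc/pressure/memory
$ cat /proc/pressure/io
$ cat /proc/pressure/cpu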
EXAMPLE
The systemd-oomd daemon, which is a part of modern GNU/Linux systems, uses PSI to be more proactive than the kernel’s OOM killer in recognizing memory scarcity and finding targets to kill.
I also highly recommend reading the original PSI doc.
Writeback and IO #
One of the most significant features of the cgroup v2 implementation is the ability to track, observe and limit asynchronous Page Cache writeback for each cgroup. Nowadays, the kernel writeback process can identify which cgroup’s IO limit to use in order to persist dirty pages to disk.
But what is also important is that it works in the other direction too. If a cgroup experiences memory pressure and tries to reclaim some pages by flushing its dirty pages, it will use its own IO limits and won’t harm other cgroups. Thus memory pressure translates into disk IO and, if there are a lot of writes, eventually into disk pressure for that cgroup. Both controllers have PSI files, which should be used for proactive management and for tuning your software settings.
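So when a cgroup under memory pressure starts generating writeback, it’s worth checking both of its PSI files together (the unit name here is purely illustrative):
$ cat /sys/fs/cgroup/system.slice/some-app.service/memory.pressure
$ cat /sys/fs/cgroup/system.slice/some-app.service/io.pressure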
In order to control how often dirty pages are flushed, the Linux kernel has several sysctl knobs. If you want, you can make the background writeback process more or less aggressive:
$ sudo sysctl -a | grep dirty
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500
vm.dirtytime_expire_seconds = 43200
Some of the above work for cgroups too. The kernel chooses and applies whichever threshold is reached first: the system-wide one or the cgroup’s.
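For instance, a sketch of how to make the background writeback kick in earlier; the values are illustrative, not a recommendation:
# start background flushing when dirty memory exceeds 5% of the dirtyable memory
$ sudo sysctl -w vm.dirty_background_ratio=5
# or express the threshold in bytes instead of a ratio
# (writing one of the pair automatically resets the other to 0)
$ sudo sysctl -w vm.dirty_background_bytes=268435456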
The cgroup v2 also brings new IO controllers: io.cost and io.latency. They provide two different approaches to limiting and guaranteeing disk operations. Please read the cgroup v2 documentation for more details and distinctions. But I would say that if your setup is not complex, starting with the less invasive io.latency makes sense.
As with the memory controller, the kernel also provides a bunch of files to control and observe IO:
- io.stat – the stat file with per-device data;
- io.latency – the latency target time in microseconds;
- io.pressure – the PSI file;
- io.weight – the target weight if io.cost was chosen;
- io.cost.qos and io.cost.model – the configuration files of the io.cost cgroup controller.
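For example, a minimal sketch of setting an io.latency target, assuming the protected workload lives in a hypothetical system.slice/myapp.service cgroup with the io controller enabled, and that the disk is device 8:0:
$ lsblk                           # the MAJ:MIN column shows the device numbers
# set a 2 ms (2000 us) latency target for that device
$ echo "8:0 target=2000" | sudo tee /sys/fs/cgroup/system.slice/myapp.service/io.latency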
Memory and IO cgroup ownership #
Several processes from multiple cgroups can obviously work with the same files. For example, cgroup1 can open and read the first 10 KiB of a file, and some time later, another cgroup2 can append 2 KiB to the end of the same file and read the first 4 KiB. The question is: whose memory and IO limits will the kernel use?
The logic of memory ownership (and therefore Page Cache ownership) is built per page. A page is charged to a cgroup on the first access (page fault) and won’t switch to any other cgroup until this page is completely reclaimed and evicted. The term ownership means that these pages are used to calculate the cgroup’s Page Cache usage and are included in all of its stats.
For example, cgroup1 is the owner of the first 10 KiB, and cgroup2 is the owner of the last 2 KiB. No matter what cgroup1 does with the file (it can even close it), cgroup1 remains the owner of the first 4 KiB (not all 10 KiB) as long as cgroup2 keeps working with this first 4 KiB of the file. In this situation, the kernel keeps the pages in the Page Cache and keeps updating the LRU lists accordingly.
For cgroup IO, ownership works per inode. So for our example, cgroup2 owns all writeback operations for the file. The inode is assigned to the cgroup on the first writeback, but unlike the memory ownership logic, the IO ownership may migrate to another cgroup if the kernel notices that this other cgroup generates more dirty pages.
In order to troubleshoot memory ownership, we can use a pair of procfs files: /proc/pid/pagemap and /proc/kpagecgroup. The page-types tool supports showing per-page cgroup information, but it’s hard to use it for a directory of files and get a well-formatted output. That’s why I wrote my own cgtouch tool in order to troubleshoot cgroup memory ownership.
$ sudo go run ./main.go /var/tmp/ -v
/var/tmp/file1.db
cgroup inode percent pages path
- 85.9% 28161 not charged
1781 14.1% 4608 /sys/fs/cgroup/user.slice/user-1000.slice/session-3.scope
--
/var/tmp/ubuntu-21.04-live-server-amd64.iso
cgroup inode percent pages path
- 0.0% 0 not charged
2453 100.0% 38032 /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/run-u10.service
--
Files: 2
Directories: 7
Resident Pages: 42640/70801 166.6M/276.6M 60.2%
cgroup inode percent pages path
- 39.8% 28161 not charged
1781 6.5% 4608 /sys/fs/cgroup/user.slice/user-1000.slice/session-3.scope
2453 53.7% 38032 /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/run-u10.service
Safe ad-hoc tasks #
Let’s assume we need to run a wget command or manually install some packages by calling a configuration management system (e.g. saltstack). Both of these tasks can be unpredictably heavy on disk I/O. In order to run them safely and not interfere with any production load, we should not run them in the root cgroup or the current terminal’s cgroup, because these usually don’t have any limits. So we need a new cgroup with some limits. It would be very tedious and cumbersome to manually create and configure a cgroup for every ad-hoc task. But fortunately, we don’t have to: all modern GNU/Linux distributions come with systemd out of the box with cgroup v2. systemd-run, along with many other cool systemd features, makes our life easier and saves a lot of time.
So, for example, a wget task can be run in the following manner:
systemd-run --user -P -t -G --wait -p MemoryMax=12M wget http://ubuntu.ipacct.com/releases/21.04/ubuntu-21.04-live-server-amd64.iso
Running as unit: run-u2.service ⬅ LOOK HERE
Press ^] three times within 1s to disconnect TTY.
--2021-09-11 19:53:33-- http://ubuntu.ipacct.com/releases/21.04/ubuntu-21.04-live-server-amd64.iso
Resolving ubuntu.ipacct.com (ubuntu.ipacct.com)... 195.85.215.252, 2a01:9e40::252
Connecting to ubuntu.ipacct.com (ubuntu.ipacct.com)|195.85.215.252|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1174243328 (1.1G) [application/octet-stream]
Saving to: ‘ubuntu-21.04-live-server-amd64.iso.5’
...
The run-u2.service is my brand new cgroup with a memory limit. I can get its metrics:
$ find /sys/fs/cgroup/ -name run-u2.service
/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/run-u2.service
$ cat /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/run-u2.service/memory.pressure
some avg10=0.00 avg60=0.00 avg300=0.00 total=70234
full avg10=0.00 avg60=0.00 avg300=0.00 total=69717
$ grep file /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/run-u2.service/memory.stat
file 11100160
file_mapped 0
file_dirty 77824
file_writeback 0
file_thp 0
inactive_file 5455872
active_file 5644288
workingset_refault_file 982
workingset_activate_file 0
workingset_restore_file 0
As you can see from the above, we have nearly 12 MiB of file memory and some refaults.
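The same approach works with the softer limits discussed earlier. A sketch: MemoryHigh= throttles the task and presses reclaim on it instead of invoking the OOM killer:
systemd-run --user -P -t -G --wait -p MemoryHigh=12M wget http://ubuntu.ipacct.com/releases/21.04/ubuntu-21.04-live-server-amd64.iso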
To get the full power of systemd and cgroups, please read the systemd resource control documentation.
Read next chapter →