Automatic Kernel Tunables

Version 0.1
2006/02/06

Nadia Derbey
Nadia.Derbey@bull.net



1. Overview
2. Issues
3. Tunables adjustment routine
4. "All kernel" solution
    4.1. Initialization routine (akt_init_params)
    4.2. Adjustment routine (akt_adjust)
        4.2.1. Upwards adjustment
        4.2.2. Downwards adjustment
    4.3. Enabling Automatic Kernel Tunables
5. Module solution
6. Impacts
    6.1. Time impact with a limitation to 128 semaphores
    6.2. Time impact with no limitation
    6.3. Results interpretation
7. Mixing the tunables adjustments
8. Final framework
    8.1. Architecture
    8.2. The akt kernel component
        8.2.1. Data structures
        8.2.2. Provided routines
            8.2.2.1. akt_add_node()
            8.2.2.2. akt_rem_node()
            8.2.2.3. akt_register()
            8.2.2.4. akt_unregister()
            8.2.2.5. akt_get_valid()
        8.2.3. Configuration
    8.3. The akt_mod module
        8.3.1. akt_adj_routine()
        8.3.2. akt_init()
        8.3.3. akt_mod and configfs
            8.3.3.1. Configfs objects for akt_mod
    8.4. Complete description of the mechanism
    8.5. Serialization
        8.5.1. akt_table array
        8.5.2. akt_node structures
        8.5.3. Locks hierarchy
        8.5.4. RCU locks
9. Application
10. Deliverables
11. Reference documents
End of document


1. Overview

The procfs pseudo-filesystem, which is used as an interface to kernel data structures, mostly contains read-only files: these files are used to get statistics about the system activity, or more generally, information about the system resources.

In the procfs pseudo-filesystem there are also some writable files: they allow kernel variables to be dynamically changed. This means that any new value stored in those files is taken into account by the kernel without any need to reboot the system.
These files contain default values that do not necessarily suit the system's needs for a given activity. As an example, one of the values contained in /proc/sys/kernel/sem gives the maximum number of semaphore identifiers allowed in the system. The default value for this variable is 128. But if a running application creates a huge number of semaphores, this value may quickly prove too low.

This raises the need for 2 things:
  1. A way to periodically check the resources that are used on the system
  2. A way to automatically adjust the kernel tunables as the resources are seen to be running out.

2. Issues

  1. Given the wide variety of resources and associated tunables, it seems impossible to define a general framework to check any resource usage and adjust the associated tunable.
  2. Defining a framework in the user space implies a latency that makes it difficult to react fast enough to resources running out.
  3. Defining a framework in the kernel space might lead to a situation where the kernel spends all its time checking the resources usage and adjusting the needed tunables.
To solve these issues, we propose the following solutions:

3. Tunables adjustment routine

In all the following chapters, we assume we need to adjust tunables that control the maximum number of a given resource (see chapter 2).

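It should be noted that such a tunable can be adjusted in both directions: upwards, when a resource is allocated, and downwards, when a resource is released (see chapters 4.2.1 and 4.2.2).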
In order to define a generic routine that adjusts a tunable value, we need the following information:
A new structure is defined to hold all this information (akt_node).

Ex: in the semaphores case, this structure should be filled as follows:
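A minimal sketch of how such a structure might be filled for semaphores, reusing the arguments that chapter 8.4 passes to akt_add_node(). The member names and the SEMTHRESH value of 80 are assumptions made for illustration only; SEMMNI (128), IPCMNI (32768) and TUN_SEMMNI (1027) are the values quoted elsewhere in this document, and sem_in_use stands in for sem_ids.in_use.

```c
/* Hypothetical layout: member names are assumed, not taken from the AKT sources. */
struct akt_node {
    char  threshold; /* percentage of *tunable that triggers an adjustment */
    short key;       /* tunable identifier, e.g. TUN_SEMMNI */
    int   min;       /* lowest value the tunable may take */
    int   max;       /* highest value the tunable may take */
    int  *tunable;   /* address of the tunable to adjust */
    int  *checked;   /* address of the resource usage counter */
};

#define SEMTHRESH  80     /* assumed value, for illustration only */
#define TUN_SEMMNI 1027
#define SEMMNI     128
#define IPCMNI     32768

/* Stand-ins for the kernel variables sc_semmni and sem_ids.in_use. */
int sc_semmni = SEMMNI;
int sem_in_use;

struct akt_node sem_akt = {
    .threshold = SEMTHRESH,
    .key       = TUN_SEMMNI,
    .min       = SEMMNI,
    .max       = IPCMNI,
    .tunable   = &sc_semmni,
    .checked   = &sem_in_use,
};
```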

4. "All kernel" solution

This solution consists in providing 2 routines: an initialization routine and an adjustment routine, described in the following chapters. These routines may be called by any kernel component that wishes to adjust its tunables.

4.1. Initialization routine (akt_init_params)

This routine is called to initialize the information needed by the generic adjustment routine (see chapter 3). It takes the following parameters:
This routine should be called from the initialization routine of any kernel component that wants to activate the adjustment in the future. Ex: sem_init() in the semaphores case.

4.2. Adjustment routine (akt_adjust)

This routine is called to activate the adjustment of a tunable if needed. It takes the following parameters:
This routine should be called by any kernel component when it allocates a new resource (to adjust up) or when it releases a previously created resource (to adjust down), given that this resource is controlled by a tunable.

4.2.1. Upwards adjustment

A tunable upwards adjustment is asked for when allocating a resource that is controlled by a tunable.
Since the threshold is expressed as a percentage, adjustment is needed if the checked variable has reached threshold% of the tunable value (the checked variable address and the tunable address are both present in the structure pointed to by the 2nd parameter of the routine).
If the tunable adjustment is needed, the new tunable value is set to (200 - threshold)% of its current value.
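The rule above can be sketched as a small function. This is an illustration only: the function name akt_adjust_up() and the clamp to a max value are assumptions (the clamp is suggested by the max parameter of akt_add_node() in chapter 8.2.2.1, not by this chapter), and integer division truncates the result.

```c
/*
 * Upwards adjustment sketch. Returns the new tunable value, or the
 * unchanged value when no adjustment is needed.
 * The max clamp is an assumption, not stated in this chapter.
 */
int akt_adjust_up(int tunable, int checked, int threshold, int max)
{
    /* Adjustment is needed once checked reaches threshold% of tunable. */
    if ((long)checked * 100 < (long)tunable * threshold)
        return tunable;

    /* New value is (200 - threshold)% of the current value. */
    long new_val = (long)tunable * (200 - threshold) / 100;
    return new_val > max ? max : (int)new_val;
}
```

With a tunable of 128 and a hypothetical threshold of 80, the adjustment triggers when the checked variable reaches 103 and the tunable becomes 153.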

Example: in the semaphore case:

4.2.2. Downwards adjustment

This is the reverse operation.
A tunable downwards adjustment is asked for when releasing a resource that is controlled by a tunable. Adjustment is needed if the checked variable has fallen under threshold% of the tunable previous value, i.e. under threshold% of (tunable * 100 / (200 - threshold)), i.e. under (threshold * tunable / (200 - threshold)).
In that case, the new tunable value is set back to its previous value, i.e. to (tunable * 100 / (200 - threshold)).
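The reverse rule can be sketched in the same way. The function name akt_adjust_down() and the clamp to a min value are assumptions (the clamp is suggested by the min parameter of akt_add_node() in chapter 8.2.2.1); it also absorbs the rounding loss of integer division, so that a tunable raised from its minimum does not fall below it again.

```c
/*
 * Downwards adjustment sketch. Returns the new tunable value, or the
 * unchanged value when no adjustment is needed.
 * The min clamp is an assumption, not stated in this chapter.
 */
int akt_adjust_down(int tunable, int checked, int threshold, int min)
{
    /* Value the tunable had before the last upwards adjustment. */
    long previous = (long)tunable * 100 / (200 - threshold);

    /* Adjustment is needed only under threshold% of that previous value. */
    if ((long)checked * 100 >= previous * threshold)
        return tunable;

    return previous < min ? min : (int)previous;
}
```

With a tunable of 240 and a hypothetical threshold of 80, the previous value is 200, so the tunable is set back to 200 once the checked variable falls under 160.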

Example: in the semaphore case:

4.3. Enabling Automatic Kernel Tunables

This functionality can be enabled or disabled when configuring the kernel: a choice is added to the pseudo filesystems menu (in fs/Kconfig). It depends on the procfs choice: support for automatic kernel tuning is offered only if procfs support is selected.

5. Module solution

This is a solution that avoids inserting too much code in the kernel tree. It is basically the same solution as the one described in chapter 4, except that:

6. Impacts

A small test has been written that does nothing but create semaphores in a loop. This test is used for time measurements: each call to semget() is immediately preceded and followed by a call to gettimeofday().
Then the difference between both times is computed. Since semget() creates one semaphore at a time, we can deduce the time spent creating one semaphore.
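The measurement described above could look like the following sketch. It is not the original test: the function names, the use of IPC_PRIVATE, the 0600 permissions and the choice of one semaphore per set are assumptions.

```c
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/time.h>

/* Microseconds elapsed between two gettimeofday() samples. */
long elapsed_us(const struct timeval *start, const struct timeval *end)
{
    return (end->tv_sec - start->tv_sec) * 1000000L
         + (end->tv_usec - start->tv_usec);
}

/*
 * Creates one semaphore set holding a single semaphore and returns the
 * time spent in semget(), in microseconds, or -1 if semget() fails (for
 * example when the maximum number of identifiers is reached).
 * The identifier is stored in *semid so that the caller can remove it
 * later with semctl(*semid, 0, IPC_RMID).
 */
long time_one_semget(int *semid)
{
    struct timeval before, after;

    gettimeofday(&before, NULL);
    *semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    gettimeofday(&after, NULL);

    return *semid < 0 ? -1 : elapsed_us(&before, &after);
}
```

Calling time_one_semget() in a loop and printing its result every N iterations gives the kind of figures reported in chapters 6.1 and 6.2.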
The test has been run in the following environments:
  1. without CONFIG_AKT, i.e. base kernel:
  2. with CONFIG_AKT, but module not loaded:
  3. with CONFIG_AKT and module loaded:
  4. without CONFIG_AKT, kernel code changed to directly set sc_semmni to IPCMNI (32768):
Environments 1 and 2 have been compared, as well as environments 3 and 4 (the other combinations are not comparable).

6.1. Time impact with a limitation to 128 semaphores

In this case, the time spent in semget() was output every 10 calls to semget().
Time is expressed in microseconds.


semaphore number    without CONFIG_AKT      with CONFIG_AKT,
                                            module unloaded
0                   4                       4
10                  1                       2
20                  1                       1
30                  1                       1
40                  1                       1
50                  1                       1
60                  2                       1
70                  1                       2
80                  1                       2
90                  2                       1
100                 2                       3
110                 2                       1
120                 1                       2
Total: 13           20                      22

6.2. Time impact with no limitation

In this case, the time spent in semget() was output in the following way:


semaphore number    without CONFIG_AKT,     with CONFIG_AKT,
                    sc_semmni = 32768       module loaded
                    (hardcoded)
0                   6                       4
10                  1                       1
20                  1                       1
30                  1                       1
40                  1                       1
50                  2                       1
60                  3                       1
70                  1                       2
80                  2                       2
90                  2                       1
100                 3                       1
110                 2                       1
120                 2                       2
200                 3                       2
300                 6                       2
400                 5                       3
500                 5                       3
600                 8                       4
700                 6                       5
800                 9                       5
900                 10                      6
1000                10                      6
2000                20                      16
3000                37                      29
4000                71                      71
5000                107                     122
6000                167                     173
7000                184                     220
8000                247                     298
9000                308                     368
10000               445                     420
15000               629                     636
20000               866                     861
25000               1045                    1047
30000               1291                    1285
Total: 33           5506                    5601

6.3. Results interpretation

The time measurements show that the overhead due to AKT code itself is negligible.
But we can see that enlarging the semaphores array (whether using the automatic adjustment or hardcoding a large number of entries) degrades performance when creating the last semaphores: creating a semaphore takes at most 10 microseconds while no more than 1000 semaphores already exist, whereas it takes more than 1 millisecond once 25000 semaphores have been reached. This overhead is due to the way the semaphores array is scanned.

7. Mixing the tunables adjustments

As seen in chapter 6.3, if tunables adjustment is not done carefully, we may reach a situation where all the kernel arrays have a huge number of entries, leading to long processing times and a large memory footprint.
In order to avoid such problems, we propose to combine what has been presented so far with a configuration mechanism. This mechanism would make it possible to give the list of tunables that are candidates for adjustment. Any tunable outside this list would not be authorized to be adjusted.
For example, if an administrator wants to favor an application that needs a large number of semaphores and shared memory segments, he will fill a configuration file with the associated list of tunables. This means that only sc_semmni and shm_ctlmni will be adjusted as needed, leaving any other tunable unchanged. We propose that this configuration part be supported by configfs.
Deciding which tunables are candidates for adjustment could be done with the help of the profiles described in [4], [5] or [6].

8. Final framework

8.1. Architecture

The following diagram shows the complete architecture of the final framework:

Fig 1: AKT framework architecture


The AKT framework is made of an AKT kernel component and a module (akt_mod) that provide all the necessary routines and interact in order to make the kernel tunables dynamically adjustable.
The principle is the following: any kernel component that wants one or more of its kernel tunables to be dynamically adjusted registers these tunables with the akt kernel component. These tunables are not authorized to be dynamically adjusted until the akt_mod module is loaded. When akt_mod is loaded, users may declare tunables as being adjustable. At that time, akt_mod marks the corresponding kernel structures as valid. This validity is tested as soon as akt_adjust() is called (upon resource creation) in order to know whether the tunable should be dynamically adjusted.

8.2. The akt kernel component

The akt kernel component is in charge of:

8.2.1. Data structures

8.2.2. Provided routines

8.2.2.1. akt_add_node()

akt_node_t *akt_add_node(char threshold, short key, int min, int max, int *tunable, int *checked)

This is the routine called by any kernel component that wants to use akt services in the future (i.e. make a tunable dynamically adjusted). Its processing is as follows:
  1. It allocates an akt_node structure and fills it with the values passed in as parameters.
  2. It chains that structure into one of the lists pointed to by akt_table: the appropriate index in that table is found by hashing the key parameter.
  3. It returns to the caller the pointer to the allocated akt_node structure.
This routine is exported to be used by kernel modules if needed.

8.2.2.2. akt_rem_node()

int akt_rem_node(akt_node_t *node)

This routine reverses the operations of akt_add_node(). It is called by any kernel component that doesn't need akt services anymore. In practice it is useful for kernel modules when they are unloaded, if they added an akt_node upon loading.
Its processing is the following:
  1. It unchains the akt_node structure from the list pointed to by akt_table (the appropriate index in that table is found by hashing the key member of the akt_node structure passed in as a parameter).
  2. It frees the structure.
  3. It returns 0 if successful and a negative value upon failure.
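The processing of akt_add_node() and akt_rem_node() can be sketched in user space as follows. The signatures come from this document; everything else is an assumption: the table size, the hash function (a plain modulo), the member names, the use of calloc()/free() in place of kernel allocators, and the choice of -1 as the failure value. Locking (see chapter 8.5) is left out.

```c
#include <stdlib.h>

#define AKT_TABLE_SIZE 16   /* assumed table size */

typedef struct akt_node {
    struct akt_node *next;   /* chaining inside one akt_table list */
    char  threshold;
    short key;
    int   min, max;
    int  *tunable;
    int  *checked;
    int   valid;             /* set later by akt_register() */
} akt_node_t;

static akt_node_t *akt_table[AKT_TABLE_SIZE];

/* Assumed hash: the document does not specify the function used. */
static unsigned int akt_hash(short key)
{
    return (unsigned short)key % AKT_TABLE_SIZE;
}

akt_node_t *akt_add_node(char threshold, short key, int min, int max,
                         int *tunable, int *checked)
{
    akt_node_t *node = calloc(1, sizeof(*node));

    if (!node)
        return NULL;
    node->threshold = threshold;
    node->key = key;
    node->min = min;
    node->max = max;
    node->tunable = tunable;
    node->checked = checked;

    /* Chain at the head of the list selected by the hashed key. */
    unsigned int i = akt_hash(key);
    node->next = akt_table[i];
    akt_table[i] = node;
    return node;
}

int akt_rem_node(akt_node_t *node)
{
    akt_node_t **link;

    if (!node)
        return -1;
    /* Walk the list selected by the node's key and unchain the node. */
    for (link = &akt_table[akt_hash(node->key)]; *link; link = &(*link)->next) {
        if (*link == node) {
            *link = node->next;
            free(node);
            return 0;
        }
    }
    return -1;   /* node was not found in its list */
}
```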

8.2.2.3. akt_register()

int akt_register(short key)

This routine marks as valid the akt_node structure that contains the key passed in as a parameter. It is called by the akt_mod module when it needs to start dynamic adjustment on a given tunable.

8.2.2.4. akt_unregister()

int akt_unregister(short key)

This routine marks as invalid the akt_node structure that contains the key passed in as a parameter. It is called by the akt_mod module when it needs to stop dynamic adjustment on a given tunable.

8.2.2.5. akt_get_valid()

int akt_get_valid(short key)

This routine returns the value of the valid member for the akt_node structure that contains the key passed in as a parameter. It is called by the akt_mod module when it needs to check if a given tunable is dynamically adjustable.
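The three key-based routines (akt_register(), akt_unregister() and akt_get_valid()) share the same lookup, which can be sketched as follows. The helpers akt_find() and akt_set_valid(), the table size, the modulo hash and the -1 value returned for an unknown key are all assumptions; the akt_node structure is reduced to the members used here, and locking is left out.

```c
#include <stddef.h>

#define AKT_TABLE_SIZE 16   /* assumed table size */

typedef struct akt_node {
    struct akt_node *next;
    short key;
    int   valid;   /* 1: adjustment allowed, 0: not allowed */
} akt_node_t;      /* reduced to the members used in this sketch */

static akt_node_t *akt_table[AKT_TABLE_SIZE];

/* Assumed lookup helper: hash the key, then scan the selected list. */
static akt_node_t *akt_find(short key)
{
    akt_node_t *node = akt_table[(unsigned short)key % AKT_TABLE_SIZE];

    while (node && node->key != key)
        node = node->next;
    return node;
}

/* Shared by akt_register() and akt_unregister(). */
static int akt_set_valid(short key, int valid)
{
    akt_node_t *node = akt_find(key);

    if (!node)
        return -1;   /* no kernel component declared this tunable */
    node->valid = valid;
    return 0;
}

int akt_register(short key)   { return akt_set_valid(key, 1); }
int akt_unregister(short key) { return akt_set_valid(key, 0); }

int akt_get_valid(short key)
{
    akt_node_t *node = akt_find(key);

    return node ? node->valid : -1;
}
```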

8.2.3. Configuration

The kernel can be configured to support (or not support) AKT: a new boolean choice is added to the "pseudo-filesystems" menu (under the "filesystems" menu). It depends on procfs support.

8.3. The akt_mod module

This module defines the adjustment routine and allows a user to enable or disable dynamic adjustment for a given tunable (this is done with the help of the configfs pseudo-filesystem).

8.3.1. akt_adj_routine()

int akt_adj_routine(int cmd, akt_node_t *params)

This is the routine that does the actual adjustment (see chapter 4.2). Before doing anything, it checks whether the tunable in the akt_node structure passed in is authorized to be adjusted (by testing the valid member of the structure). If adjustment is not allowed, it just returns.

8.3.2. akt_init()

This is the entry point of the akt_mod module: it starts the configfs part of akt_mod and sets the akt_adjust kernel pointer to akt_adj_routine() address.
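The interplay between the akt_adjust pointer and akt_adj_routine() can be sketched as follows. This is a user-space illustration, not the module code: akt_call_hook() is a hypothetical name for what a kernel component does at resource creation or release, the return values and the adjustments_done counter exist only for this sketch, and the configfs part of akt_init() is left out.

```c
#include <stddef.h>

typedef struct akt_node {
    int valid;   /* only the member needed in this sketch */
} akt_node_t;

/* Kernel side: the hook stays NULL until akt_mod is loaded. */
int (*akt_adjust)(int cmd, akt_node_t *params) = NULL;

int adjustments_done;   /* counter used only by this sketch */

/* Module side: stands in for the real adjustment routine. */
int akt_adj_routine(int cmd, akt_node_t *params)
{
    (void)cmd;               /* up or down; unused in this sketch */
    if (!params->valid)
        return 0;            /* tunable not authorized: just return */
    adjustments_done++;      /* the actual adjustment would happen here */
    return 1;
}

/* Module entry point (configfs setup omitted). */
int akt_init(void)
{
    akt_adjust = akt_adj_routine;
    return 0;
}

/* Hypothetical call site in a kernel component. */
int akt_call_hook(int cmd, akt_node_t *node)
{
    return akt_adjust ? akt_adjust(cmd, node) : 0;
}
```

Keeping the validity test inside the adjustment routine means the call site only pays for a NULL check while the module is not loaded.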

8.3.3. akt_mod and configfs

akt_mod uses configfs in the following way:
  1. When it is loaded, the module creates an "akt_config" directory under configfs. When first created, this directory only contains:
  2. Each time a user wants to make a tunable automatically adjustable, he has to create a "tunXXX" directory (XXX is the decimal value of the tunable constant as defined in /usr/include/libtune.h)
  3. When 'tunXXX' is created it contains a file called 'valid'. This file contains the value 1: this means that the corresponding tunable can be automatically adjusted from now on.
    This file is read-write but cannot be removed.
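  4. When a user doesn't need the tunable XXX to be adjusted anymore, he can do one of:
    1. store 0 into the 'valid' file of the 'tunXXX' directory (see chapter 8.4, step 8),
    2. remove the 'tunXXX' directory (see chapter 8.4, step 13).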

8.3.3.1. Configfs objects for akt_mod

8.4. Complete description of the mechanism

This description is based on the scheme presented in Fig. 1, and we will assume that the validated tunable is sc_semmni (TUN_SEMMNI).
  1. During the kernel components initialization, an akt_node structure is registered for each tunable as needed, by calling akt_add_node().
    Ex: in sem_init(), the following should be called
    sem_akt = akt_add_node(SEMTHRESH, TUN_SEMMNI, SEMMNI, IPCMNI, &sc_semmni, &(sem_ids.in_use));
    Notes:
    1. akt kernel component initialization part should occur early during the system initialization, in order to enable any kernel component to register its akt_node structure(s).
    2. a module that wants to use akt services will have to call akt_add_node() from its initialization routine, and akt_rem_node() from its exit routine.
    When the kernel has finished booting, each akt_table entry points to a list of akt_node structures declared by the kernel components.

  2. When akt_mod is loaded, its initialization routine (akt_init()) registers in configfs: the /config/akt_mod directory is created.

  3. akt_init() also stores the address of the adjustment routine (akt_adj_routine()) into akt_adjust.
    Note:
    At that point, if a resource is created (ex: sys_semget() called), akt_adjust is not NULL anymore, so akt_adj_routine() will be called (ex: akt_adj_routine(sem_akt)), but it will immediately return, since the akt_node (ex: sem_akt) has not been validated yet.

  4. A user wants to allow the tunable XXX to be dynamically adjustable: he creates tunXXX under /config/akt_mod.
    Ex: for TUN_SEMMNI the following command should be typed:
    mkdir /config/akt_mod/tun1027

    Note:
    The file /config/akt_mod/tunXXX/valid is automatically created during this operation.

  5. The make_item() method (akt_tun_mkdir()) is called: it invokes akt_register(XXX) (ex: akt_register(TUN_SEMMNI)). This turns on the valid member of the corresponding akt_node structure (ex: sem_akt->valid = 1).

  6. At that point, if a resource is created (ex: sys_semget() called), akt_adj_routine() is called (ex: akt_adj_routine(sem_akt)), and it will dynamically adjust the corresponding tunable as needed (ex: sc_semmni), since the akt_node (ex: sem_akt) has been validated.

  7. This applies to resource removal too.

  8. A user doesn't need the tunable XXX to be adjusted anymore: he stores the '0' character into /config/akt_mod/tunXXX/valid.
    Ex: to disallow sc_semmni dynamic adjustment, just type
    echo 0 > /config/akt_mod/tun1027/valid

  9. The store_attribute() method (akt_tun_set_valid(0)) is called: it invokes akt_unregister(XXX) (ex: akt_unregister(TUN_SEMMNI)). This turns off the valid member of the corresponding akt_node structure (ex: sem_akt->valid = 0).
    Note:
    At that point, if a resource is created (ex: sys_semget() called), akt_adjust is not NULL, so akt_adj_routine() will be called (ex: akt_adj_routine(sem_akt)), but it will immediately return, since the akt_node (ex: sem_akt) has been reset to 0.

  10. A user wants to re-allow the tunable XXX to be dynamically adjustable: he stores '1' into /config/akt_mod/tunXXX/valid.
    Ex: to re-allow sc_semmni dynamic adjustment, just type:
    echo 1 > /config/akt_mod/tun1027/valid

    The store_attribute() method (akt_tun_set_valid(1)) is called: it invokes akt_register(XXX) (ex: akt_register(TUN_SEMMNI)). This turns on the valid member of the corresponding akt_node structure (ex: sem_akt->valid = 1). Thus, if a resource is created, akt_adj_routine() will be called and will dynamically adjust the corresponding tunable as needed (since the akt_node has been re-validated).

  11. A user wants to check if the tunable XXX is allowed to be dynamically adjustable: he displays the contents of /config/akt_mod/tunXXX/valid.
    Ex: to check if sc_semmni dynamic adjustment is allowed, just type:
    cat /config/akt_mod/tun1027/valid

  12. The show_attribute() method (akt_tun_get_valid()) is called: it invokes akt_get_valid(XXX) (ex: akt_get_valid(TUN_SEMMNI)). This routine returns the value of the valid member of the corresponding akt_node structure (ex: sem_akt->valid), which is then displayed.

  13. There is another way of disallowing dynamic adjustment of a tunable XXX: the user removes the /config/akt_mod/tunXXX directory. The drop_item() method (akt_tun_rmdir()) is called: it invokes akt_unregister(XXX). This turns off the valid member of the corresponding akt_node structure. So, if a resource is created, akt_adj_routine() will be called, but it will return immediately.

8.5. Serialization

The following diagram shows how akt_table and the akt_node structures are accessed by the various routines provided by the akt framework:


Fig 2: Accesses to the AKT data structures


8.5.1. akt_table array

akt_add_node(), akt_rem_node(), akt_register() and akt_unregister() access a given akt_node through akt_table: the key parameter is hashed to get an index into akt_table. This index points to the list that contains the akt_node  that will be:
Thus, many ways of concurrently accessing a given list of akt_node structures are possible.
Examples:
So a lock is needed to serialize the access to each list of akt_node structures, in order to maintain the integrity of that list. This lock (list_lck) should be a member of the structure stored in each index of akt_table[]. Storing the lock in each akt_table index is more efficient from a performance point of view than defining a global lock for the entire table: that way, 2 distinct lists of this table may be processed in parallel.

8.5.2. akt_node structures

akt_adj_routine() accesses an akt_node structure to get all the information that is needed to do the actual adjustment. This routine has direct access to a given akt_node (a pointer to it is passed in as a parameter): since it may be called by any kernel component that is creating or releasing a resource, performance must not be degraded. Thus it is more efficient to access the akt_node directly than to hash the associated key and look for the akt_node in a list.

Here too, there are many ways of concurrently accessing a given akt_node structure.
Examples:
A lock is needed here too, to maintain the integrity of each node. This lock (node_lck) is a member of the akt_node structure.

8.5.3. Locks hierarchy

For the routines that need to take both locks, the hierarchy is the following:
  1. take akt_table[i].list_lck first
  2. take node->node_lck second
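The hierarchy above can be illustrated with a user-space sketch in which POSIX mutexes stand in for whatever kernel locking primitive is eventually chosen. The names list_lck and node_lck come from this document; akt_table_init(), akt_set_valid_locked(), the table size and the modulo hash are assumptions made for illustration.

```c
#include <pthread.h>
#include <stddef.h>

#define AKT_TABLE_SIZE 16   /* assumed table size */

typedef struct akt_node {
    struct akt_node *next;
    short key;
    int   valid;
    pthread_mutex_t node_lck;   /* protects the members of this node */
} akt_node_t;

struct akt_list {
    pthread_mutex_t list_lck;   /* protects the chaining of this list */
    akt_node_t *head;
};

static struct akt_list akt_table[AKT_TABLE_SIZE];

/* One lock per table entry, so that distinct lists can be used in parallel. */
void akt_table_init(void)
{
    for (int i = 0; i < AKT_TABLE_SIZE; i++) {
        pthread_mutex_init(&akt_table[i].list_lck, NULL);
        akt_table[i].head = NULL;
    }
}

/* Sets the valid member of the node holding key; returns 0, or -1 if absent. */
int akt_set_valid_locked(short key, int valid)
{
    struct akt_list *list = &akt_table[(unsigned short)key % AKT_TABLE_SIZE];
    akt_node_t *node;
    int ret = -1;

    pthread_mutex_lock(&list->list_lck);           /* 1. list lock first */
    for (node = list->head; node; node = node->next) {
        if (node->key == key) {
            pthread_mutex_lock(&node->node_lck);   /* 2. node lock second */
            node->valid = valid;
            pthread_mutex_unlock(&node->node_lck);
            ret = 0;
            break;
        }
    }
    pthread_mutex_unlock(&list->list_lck);
    return ret;
}
```

Always taking the two locks in the same order is what prevents two routines from deadlocking on the same list and node.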

8.5.4. RCU locks

Using RCU locks should be considered here. TBD.

9. Application

This chapter lists the tunables to which this method (i.e. dynamic adjustment) can be applied without inducing any problem.
TBD.

10. Deliverables

The AKT package will be delivered in the following phases:

11. Reference Documents


[1] Linux Tunables Inventory
[2] libtune API documentation
[3] configfs - Userspace driven kernel object configuration
[4] Tuning Linux kernel for UDB
[5] Tuning Linux kernel for GPFS
[6] Tuning Linux kernel for nfsv4
[7] What is RCU?




End of document