
What does Linux irqbalance do?

irqbalance optimizes interrupt distribution. It periodically collects system data, analyzes the usage pattern, and, depending on the current load, puts itself into either Performance mode or Power-save mode. In Performance mode, irqbalance spreads interrupts as evenly as possible across the CPU cores to take full advantage of them and maximize throughput.
In Power-save mode, irqbalance concentrates interrupts on the first CPU, so that the other, idle CPUs can sleep longer and power consumption drops.

On RHEL-family distributions this daemon is enabled at boot by default. How do we check its status?

# service irqbalance status
irqbalance (pid PID) is running…

In practice, though, our dedicated applications are usually pinned to specific CPUs, so we often don't need irqbalance at all. If it is already running, we can stop it with:

# service irqbalance stop
Stopping irqbalance: [ OK ]

Or simply disable it at boot:

# chkconfig irqbalance off
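On distributions that have moved to systemd (RHEL 7 and later, for example), the equivalent commands would be the following, assuming the service keeps the name irqbalance there as well:

```shell
# systemd equivalents of "service irqbalance stop" and
# "chkconfig irqbalance off"
systemctl stop irqbalance
systemctl disable irqbalance
```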

Next, let's analyze how irqbalance works, so we know exactly when to use it and when not to.

Since irqbalance optimizes interrupt distribution, we should start with interrupts themselves. This is a long article; take a deep breath, here we go!

For background on SMP IRQ affinity, see this article.
The key excerpt:

SMP affinity is controlled by manipulating files in the /proc/irq/ directory.
In /proc/irq/ are directories that correspond to the IRQs present on your
system (not all IRQs may be available). In each of these directories is
the “smp_affinity” file, and this is where we will work our magic.

In short: write a mask of the CPUs you want the IRQ bound to into the /proc/irq/N/smp_affinity file! On setting interrupt affinity by hand, see my earlier posts: here and here
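As a minimal sketch of how such a mask is built (IRQ 98 and CPU 1 are just placeholder values here), the mask is a hex bitmask in which bit N stands for CPU N:

```shell
# Build the affinity mask for a single CPU: bit N of the mask is CPU N.
CPU=1                                # example: pin to CPU1
MASK=$(printf '%x' $((1 << CPU)))
echo "$MASK"                         # prints 2
# Then, as root (IRQ 98 is a placeholder -- use your device's IRQ):
# echo $MASK > /proc/irq/98/smp_affinity
```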

Now for some background concepts: let's look at CPU topology, starting with how the parts of an Intel CPU relate to each other:

(figure: cpu-term0)

A NUMA node contains one or more sockets plus the local memory attached to them. A multi-core socket contains several cores. If the CPU supports HT, the OS additionally treats each core as two logical processors.

Plenty of tools can display the topology; lscpu or Intel's cpu_topology64 tool both work, see here and here
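For instance, lscpu (from util-linux) summarizes the same hierarchy; the exact labels below may vary with the version:

```shell
# Show sockets, cores per socket, threads per core and NUMA nodes
lscpu | grep -E 'Socket|Core|Thread|NUMA'
```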

This time, using likwid-topology from the Likwid toolbox we introduced recently, we can see:

./likwid-topology

(figure: cpu-topology)

CPU topology is essential knowledge for any kind of CPU affinity pinning on high-performance servers. Getting a feel for it?

With all this background and vocabulary in place, we can investigate how irqbalance works:

//irqbalance.c
int main(int argc, char** argv)
{
        /* ... */
        while (keep_going) {
                sleep_approx(SLEEP_INTERVAL); // #define SLEEP_INTERVAL 10
                /* ... */
                clear_work_stats();
                parse_proc_interrupts();
                parse_proc_stat();
                /* ... */
                calculate_placement();
                activate_mappings();
                /* ... */
        }
        /* ... */
}
The main loop makes the logic clear: until told to exit, every 10 seconds it does the following:
1. clear the statistics
2. parse the interrupt counts
3. parse the interrupt load
4. compute a balanced placement from that load
5. apply the interrupt affinity changes

All right, a quick look at how irqbalance is invoked:

man irqbalance

--oneshot
Causes irqbalance to be run once, after which the daemon exits
--debug
Causes irqbalance to run in the foreground and extra debug information to be printed

Running irqbalance in debug mode gives us a lot of detailed information:

#./irqbalance --oneshot --debug

Take a sip of water, and let's go through each of those steps in detail:

First, how interrupts are distributed across the CPUs:

$cat /proc/interrupts|tr -s ' ' '\t'|cut -f 1-3
        CPU0 CPU1
        0: 2622846291
        1: 7
        4: 234
        8: 1
        9: 0
        12: 4
        50: 6753
        66: 228
        90: 497
        98: 31
209: 2 0
217: 0 0
225: 29 556
233: 0 0
NMI: 7395302 4915439
LOC: 2622846035 2622833187
ERR: 0
MIS: 0
The first column of the output is the IRQ number; the two columns after it are the interrupt counts on CPU0 and CPU1.
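The per-CPU columns can also be totaled with a little awk over /proc/interrupts (a sketch: it sums only the numeric columns of numbered IRQ lines and stops at the device-name columns):

```shell
# Sum each CPU column across all numbered IRQ lines.
# Column 2 of /proc/interrupts is CPU0, column 3 is CPU1, and so on.
awk '$1 ~ /^[0-9]+:$/ {
        for (i = 2; i <= NF && $i + 0 == $i; i++) {
                sum[i] += $i
                if (i > max) max = i
        }
}
END {
        for (i = 2; i <= max; i++)
                printf "CPU%d: %d interrupts\n", i - 2, sum[i]
}' /proc/interrupts
```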

But how do we know what kind of device an interrupt, say IRQ 98, belongs to? No more talk, on to the code!

//classify.c
char *classes[] = {
        "other",
        "legacy",
        "storage",
        "timer",
        "ethernet",
        "gbit-ethernet",
        "10gbit-ethernet",
        0
};
 
#define MAX_CLASS 0x12
/*
 * Class codes lifted from pci spec, appendix D.
 * and mapped to irqbalance types here
 */
static short class_codes[MAX_CLASS] = {
        IRQ_OTHER,
        IRQ_SCSI,
        IRQ_ETH,
        IRQ_OTHER,
        IRQ_OTHER,
        IRQ_OTHER,
        IRQ_LEGACY,
        IRQ_OTHER,
        IRQ_OTHER,
        IRQ_LEGACY,
        IRQ_OTHER,
        IRQ_OTHER,
        IRQ_LEGACY,
        IRQ_ETH,
        IRQ_SCSI,
        IRQ_OTHER,
        IRQ_OTHER,
        IRQ_OTHER,
};
int map_class_to_level[7] =
{ BALANCE_PACKAGE, BALANCE_CACHE, BALANCE_CACHE, BALANCE_NONE, BALANCE_CORE, BALANCE_CORE, BALANCE_CORE };
irqbalance divides interrupts into seven classes, and different classes are balanced over different scopes: some across a PACKAGE, some across a CACHE domain, some across a CORE.
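Spelling out the two C tables above as one lookup (assuming, as in the irqbalance source, that the IRQ_* constants follow the order of classes[]):

```shell
# classes[] joined with map_class_to_level[]: the scope each
# interrupt class is balanced over. Note that timers are not
# balanced at all, and network interrupts stick to a core.
classes="other legacy storage timer ethernet gbit-ethernet 10gbit-ethernet"
levels="package cache cache none core core core"
i=1
for cls in $classes; do
        lvl=$(echo $levels | cut -d' ' -f$i)
        printf '%-16s -> balance at %s\n' "$cls" "$lvl"
        i=$((i + 1))
done
```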
So where does the class information come from? Again, straight to the code!

//#define SYSDEV_DIR "/sys/bus/pci/devices"
static struct irq_info *add_one_irq_to_db(const char *devpath, int irq, struct user_irq_policy *pol)
{
...
        sprintf(path, "%s/class", devpath);
 
        fd = fopen(path, "r");
 
        if (!fd) {
                perror("Can't open class file: ");
                goto get_numa_node;
        }
 
        rc = fscanf(fd, "%x", &class);
        fclose(fd);
 
        if (!rc)
                goto get_numa_node;
 
        /*
         * Restrict search to major class code
         */
        class >>= 16;
 
        if (class >= MAX_CLASS)
                goto get_numa_node;
 
        new->class = class_codes[class];
        if (pol->level >= 0)
                new->level = pol->level;
        else
                new->level = map_class_to_level[class_codes[class]];
get_numa_node:
        numa_node = -1;
        sprintf(path, "%s/numa_node", devpath);
        fd = fopen(path, "r");
        if (!fd)
                goto assign_node;
 
        rc = fscanf(fd, "%d", &numa_node);
        fclose(fd);
 
assign_node:
        new->numa_node = get_numa_node(numa_node);
 
        sprintf(path, "%s/local_cpus", devpath);
        fd = fopen(path, "r");
        if (!fd) {
                cpus_setall(new->cpumask);
                goto assign_affinity_hint;
        }
        lcpu_mask = NULL;
        ret = getline(&lcpu_mask, &blen, fd);
        fclose(fd);
        if (ret <= 0) {
                cpus_setall(new->cpumask);
        } else {
                cpumask_parse_user(lcpu_mask, ret, new->cpumask);
        }
        free(lcpu_mask);
 
assign_affinity_hint:
        cpus_clear(new->affinity_hint);
        sprintf(path, "/proc/irq/%d/affinity_hint", irq);
        fd = fopen(path, "r");
        if (!fd)
                goto out;
        lcpu_mask = NULL;
        ret = getline(&lcpu_mask, &blen, fd);
        fclose(fd);
        if (ret <= 0)
            goto out;
        cpumask_parse_user(lcpu_mask, ret, new->affinity_hint);
        free(lcpu_mask);
out:
...
}
The C code above translates roughly into this shell script:

$cat>x.sh
SYSDEV_DIR="/sys/bus/pci/devices/"
for dev in `ls $SYSDEV_DIR`
do
    IRQ=`cat $SYSDEV_DIR$dev/irq`
    CLASS=$(((`cat $SYSDEV_DIR$dev/class`)>>16))
    printf "irq %s: class[%s] " $IRQ $CLASS
    if [ -f "/proc/irq/$IRQ/affinity_hint" ]; then
        printf "affinity_hint[%s] " `cat /proc/irq/$IRQ/affinity_hint`
    fi
    if [ -f "$SYSDEV_DIR$dev/local_cpus" ]; then
        printf "local_cpus[%s] " `cat $SYSDEV_DIR$dev/local_cpus`
    fi
    if [ -f "$SYSDEV_DIR$dev/numa_node" ]; then
        printf "numa_node[%s]" `cat $SYSDEV_DIR$dev/numa_node`
    fi
    echo
done
CTRL+D
$ tree /sys/bus/pci/devices
/sys/bus/pci/devices
|-- 0000:00:00.0 -> ../../../devices/pci0000:00/0000:00:00.0
|-- 0000:00:01.0 -> ../../../devices/pci0000:00/0000:00:01.0
|-- 0000:00:03.0 -> ../../../devices/pci0000:00/0000:00:03.0
|-- 0000:00:07.0 -> ../../../devices/pci0000:00/0000:00:07.0
|-- 0000:00:09.0 -> ../../../devices/pci0000:00/0000:00:09.0
|-- 0000:00:13.0 -> ../../../devices/pci0000:00/0000:00:13.0
|-- 0000:00:14.0 -> ../../../devices/pci0000:00/0000:00:14.0
|-- 0000:00:14.1 -> ../../../devices/pci0000:00/0000:00:14.1
|-- 0000:00:14.2 -> ../../../devices/pci0000:00/0000:00:14.2
|-- 0000:00:14.3 -> ../../../devices/pci0000:00/0000:00:14.3
|-- 0000:00:1a.0 -> ../../../devices/pci0000:00/0000:00:1a.0
|-- 0000:00:1a.7 -> ../../../devices/pci0000:00/0000:00:1a.7
|-- 0000:00:1d.0 -> ../../../devices/pci0000:00/0000:00:1d.0
|-- 0000:00:1d.1 -> ../../../devices/pci0000:00/0000:00:1d.1
|-- 0000:00:1d.2 -> ../../../devices/pci0000:00/0000:00:1d.2
|-- 0000:00:1d.7 -> ../../../devices/pci0000:00/0000:00:1d.7
|-- 0000:00:1e.0 -> ../../../devices/pci0000:00/0000:00:1e.0
|-- 0000:00:1f.0 -> ../../../devices/pci0000:00/0000:00:1f.0
|-- 0000:00:1f.2 -> ../../../devices/pci0000:00/0000:00:1f.2
|-- 0000:00:1f.3 -> ../../../devices/pci0000:00/0000:00:1f.3
|-- 0000:00:1f.5 -> ../../../devices/pci0000:00/0000:00:1f.5
|-- 0000:01:00.0 -> ../../../devices/pci0000:00/0000:00:01.0/0000:01:00.0
|-- 0000:01:00.1 -> ../../../devices/pci0000:00/0000:00:01.0/0000:01:00.1
|-- 0000:04:00.0 -> ../../../devices/pci0000:00/0000:00:09.0/0000:04:00.0
`-- 0000:05:00.0 -> ../../../devices/pci0000:00/0000:00:1e.0/0000:05:00.0
 
$chmod +x x.sh
$./x.sh|grep 98
irq 98: class[2] local_cpus[00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000]
A quick decode of the numbers: class_codes[2] = IRQ_ETH, which means this interrupt belongs to a network card.
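As a quick sanity check of the shift used in add_one_irq_to_db(): a NIC's /sys class file typically holds a value like 0x020000 (the value here is an example), and the major class is the top byte:

```shell
# Reproduce "class >>= 16" from add_one_irq_to_db():
# 0x020000 is a typical network controller class value.
printf 'major class = %d\n' $(( 0x020000 >> 16 ))   # prints: major class = 2
```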

So how is an interrupt's load calculated? Keep reading the code!

//procinterrupts.c
void parse_proc_stat(void)
{
  ...
        file = fopen("/proc/stat", "r");
        if (!file) {
                log(TO_ALL, LOG_WARNING, "WARNING cant open /proc/stat. balancing is broken\n");
                return;
        }
 
        /* first line is the header we don't need; nuke it */
        if (getline(&line, &size, file)==0) {
                free(line);
                log(TO_ALL, LOG_WARNING, "WARNING read /proc/stat. balancing is broken\n");
                fclose(file);
                return;
        }
        cpucount = 0;
        while (!feof(file)) {
                if (getline(&line, &size, file)==0)
                        break;
 
                if (!strstr(line, "cpu"))
                        break;
 
                cpunr = strtoul(&line[3], NULL, 10);
 
                if (cpu_isset(cpunr, banned_cpus))
                        continue;
 
                rc = sscanf(line, "%*s %*d %*d %*d %*d %*d %d %d", &irq_load, &softirq_load);
                if (rc < 2)
                        break;
 
                cpu = find_cpu_core(cpunr);
 
                if (!cpu)
                        break;
 
                cpucount++;
                /*
                 * For each cpu add the irq and softirq load and propagate that
                 * all the way up the device tree
                 */
                if (cycle_count) {
                        cpu->load = (irq_load + softirq_load) - (cpu->last_load);
                        /*
                         * the [soft]irq_load values are in jiffies, which are
                         * units of 10ms, multiply by 1000 to convert that to
                         * 1/10 milliseconds. This give us a better integer
                         * distribution of load between irqs
                         */
                        cpu->load *= 1000;
                }
                cpu->last_load = (irq_load + softirq_load);
        }
...
}
Which is roughly equivalent to this command:

$grep cpu15 /proc/stat
cpu15 30068830 85841 22995655 3212064899 536154 91145 2789328 0

Let's learn the format of the /proc/stat file!

The relevant excerpt about the cpu lines:

cpu — Measures the number of jiffies (1/100 of a second for x86 systems) that the system has been in user mode, user mode with low priority (nice), system mode, idle task, I/O wait, IRQ (hardirq ), and softirq respectively. The IRQ (hardirq) is the direct response to a hardware event. The IRQ takes minimal work for queuing the “heavy” work up for the softirq to execute. The softirq runs at a lower priority than the IRQ and therefore may be interrupted more frequently. The total for all CPUs is given at the top, while each individual CPU is listed below with its own statistics. The following example is a 4-way Intel Pentium Xeon configuration with multi-threading enabled, therefore showing four physical processors and four virtual processors totaling eight processors.

From this we know that fields 7 and 8 of that line are the time spent servicing hardirqs and softirqs; their sum is what irqbalance treats as the CPU's interrupt load.
This matches the per-interrupt workload that irqbalance reports, as the figure shows:

(figure: ib_irq_type_workload)
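The same per-CPU irq+softirq figure can be pulled straight out of /proc/stat with a one-liner (a sketch; fields 7 and 8 as described in the excerpt above):

```shell
# Print the hardirq + softirq jiffies for every per-CPU line in /proc/stat.
awk '/^cpu[0-9]/ { printf "%s: irq+softirq = %d jiffies\n", $1, $7 + $8 }' /proc/stat
```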

Feeling a bit dizzy? Have some water!
Next, let's see how irqbalance computes load at the package level. The figure below, combined with the CPU topology from earlier, makes it clear:

(figure: irqbalance_package)

Each CORE's load is the sum of the loads of the interrupts attached to it; each DOMAIN's load is the sum of its COREs; each PACKAGE's load is the sum of its DOMAINs; the totals roll up the hierarchy like a tree.
Once the load of every CORE, DOMAIN, and PACKAGE is known, all that remains is to find the least-loaded object within the scope of each interrupt's class and migrate the interrupt there.

The migration scope is decided by exactly the table we saw earlier:

int map_class_to_level[7] =
{ BALANCE_PACKAGE, BALANCE_CACHE, BALANCE_CACHE, BALANCE_NONE, BALANCE_CORE, BALANCE_CORE, BALANCE_CORE };

Too much water, a short break, and we're back!

Finally, how does irqbalance actually apply the affinity changes? More code:

// activate.c
static void activate_mapping(struct irq_info *info, void *data __attribute__((unused)))
{
...
        if ((hint_policy == HINT_POLICY_EXACT) &&
            (!cpus_empty(info->affinity_hint))) {
                applied_mask = info->affinity_hint;
                valid_mask = 1;
        } else if (info->assigned_obj) {
                applied_mask = info->assigned_obj->mask;
                valid_mask = 1;
                if ((hint_policy == HINT_POLICY_SUBSET) &&
                    (!cpus_empty(info->affinity_hint)))
                        cpus_and(applied_mask, applied_mask, info->affinity_hint);
        }
 
        /*
         * only activate mappings for irqs that have moved
         */
        if (!info->moved && (!valid_mask || check_affinity(info, applied_mask)))
                return;
 
        if (!info->assigned_obj)
                return;
 
        sprintf(buf, "/proc/irq/%i/smp_affinity", info->irq);
        file = fopen(buf, "w");
        if (!file)
                return;
 
        cpumask_scnprintf(buf, PATH_MAX, applied_mask);
        fprintf(file, "%s", buf);
        fclose(file);
        info->moved = 0; /*migration is done*/
}
 
void activate_mappings(void)
{
        for_each_irq(NULL, activate_mapping, NULL);
}
Translated into simple shell, the code above boils down to:

#echo MASK > /proc/irq/N/smp_affinity

Of course, if the user's policy is HINT_POLICY_EXACT, the mask written is taken directly from /proc/irq/N/affinity_hint;
if the policy is HINT_POLICY_SUBSET, the mask written is affinity_hint & applied_mask (their intersection, as the cpus_and() call shows).
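A tiny numeric illustration of the SUBSET case (both masks are made-up example values):

```shell
# HINT_POLICY_SUBSET keeps only the CPUs present in BOTH masks,
# i.e. the bitwise AND that cpus_and() computes.
HINT=0x30        # affinity_hint: CPUs 4 and 5 (example value)
CHOSEN=0x3c      # balancer's chosen mask: CPUs 2-5 (example value)
printf 'applied mask = %x\n' $(( HINT & CHOSEN ))   # prints: applied mask = 30
```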

All right, the analysis is finally done!

Summary:
irqbalance automatically migrates interrupts based on the system's interrupt load to keep them balanced, while also taking power saving into account. But on real-time systems it makes interrupts drift around on their own, introducing performance jitter; in high-performance setups it is best disabled.

Have fun!