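The Linux kernel hardlockup mechanism:
hardlockup is a debugging facility built on the watchdog framework. It targets the case where a CPU stays occupied after an interrupt has fired, so that other interrupts can no longer be serviced and the system misbehaves. The timeout is typically 10 s and can be changed through the sysctl watchdog_thresh.
When a hardlockup is detected, the kernel prints the current call stack; if it is configured to panic, it triggers a panic and prints the current stack. This behavior can be changed at runtime through the sysctl hardlockup_panic, and its default can be set at build time through CONFIG_BOOTPARAM_HARDLOCKUP_PANIC.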
Foundations of the hardlockup mechanism:
The hardlockup implementation depends on the following:
a) The kernel watchdog framework.
b) The high-resolution timer framework: high-resolution timers (hrtimer) are backed by different hardware on different computer architectures.
c) The perf event framework: the perf event implementation likewise differs between architectures, since it depends on the specific hardware. We briefly analyzed how ARM implements perf events in an earlier article; see that article for details.
Framework diagram of the hardlockup implementation:
[Figure: hardlockup implementation mechanism]
Source walkthrough of how hardlockup works (for the architecture-dependent perf event part, the ARM PMU is used as the example):
The watchdog hrtimer is started and the perf event is created as follows:
//kernel/watchdog.c
void __init lockup_detector_init(void){
    ...
    if (!watchdog_nmi_probe()) // create the corresponding perf event
        nmi_watchdog_available = true;
    lockup_detector_setup(); // start the hrtimer-based watchdog and enable the perf event
}
Next, let us look at how the perf event is created.
//kernel/watchdog_hld.c
int __init hardlockup_detector_perf_init(void){
    int ret = hardlockup_detector_event_create(); // hardlockup creates its perf event here
    ...
}
// type and config used when creating the perf event
static struct perf_event_attr wd_hw_attr = {
    .type = PERF_TYPE_HARDWARE,
    .config = PERF_COUNT_HW_CPU_CYCLES,
    .size = sizeof(struct perf_event_attr),
    .pinned = 1,
    .disabled = 1,
};
static int hardlockup_detector_event_create(void)
{
    ...
    struct perf_event_attr *wd_attr;
    struct perf_event *evt;
    wd_attr = &wd_hw_attr;
    // This line is architecture specific: for the ARM PMU the threshold is converted into a cycle count.
    wd_attr->sample_period = hw_nmi_get_sample_period(watchdog_thresh);
    /* Try to register using hardware perf events */
    /* watchdog_overflow_callback is the handler invoked when the cycle counter overflows.
     * In terms of the earlier article "PMU, the cornerstone of Perf Event", it is reached
     * from armv8pmu_handle_irq through the call to perf_event_overflow. */
    evt = perf_event_create_kernel_counter(wd_attr, cpu, NULL,
                                           watchdog_overflow_callback, NULL);
    ...
    return 0;
}
We will cover the details of the creation later. For now it is enough to know that the callback is dispatched through perf_event_overflow: when the corresponding PMU counter overflows, it raises a non-maskable interrupt (NMI), and the interrupt handler ends up calling watchdog_overflow_callback. Let us first look at how watchdog_overflow_callback is implemented:
//kernel/watchdog_hld.c
/* Note that the parameters of this function match the ones passed to
 * perf_event_overflow, which is called from armv8pmu_handle_irq.
 * We will look later at how this function is attached to the perf event. */
static void watchdog_overflow_callback(struct perf_event *event,
                                       struct perf_sample_data *data,
                                       struct pt_regs *regs){
    ...
    // watchdog_nmi_touch lets other code paths explicitly skip one check; not discussed here
    if (__this_cpu_read(watchdog_nmi_touch) == true) {
        __this_cpu_write(watchdog_nmi_touch, false);
        return;
    }
    // skip the check if this NMI arrived too soon after the previous one
    if (!watchdog_check_timestamp())
        return;
    /* is_hardlockup compares hrtimer_interrupts with hrtimer_interrupts_saved,
     * the value stored at the previous check. If they are equal, the hrtimer
     * has not run in between, which means a hardlockup has been detected. */
    if (is_hardlockup()) {
        ...
        /* only print hardlockups once */
        if (__this_cpu_read(hard_watchdog_warn) == true)
            return;
        // show the registers, or dump the stack if no registers are available
        if (regs)
            show_regs(regs);
        else
            dump_stack();
        ...
        if (hardlockup_panic)
            nmi_panic(regs, "Hard LOCKUP"); // trigger a kernel panic
        ...
    }
}
Now let us look at how hrtimer_interrupts and hrtimer_interrupts_saved are updated.
//kernel/watchdog.c
lockup_detector_init
  --> lockup_detector_setup
    --> lockup_detector_reconfigure
      --> softlockup_start_all
        --> smp_call_on_cpu // run once on each CPU core, bound to that core
          --> watchdog_enable
// If CPU hotplug is supported, the same function is also called when a CPU comes online
static void watchdog_enable(unsigned int cpu) {
    struct hrtimer *hrtimer = this_cpu_ptr(&watchdog_hrtimer);
    struct completion *done = this_cpu_ptr(&softlockup_completion);
    ...
    /* Start the timer first to prevent the NMI watchdog triggering
     * before the timer has a chance to fire.
     */
    /* watchdog_timer_fn fires every sample_period = watchdog_thresh*2*NSEC_PER_SEC/5,
     * i.e. every 4 s with the default watchdog_thresh of 10 s. */
    hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
    hrtimer->function = watchdog_timer_fn;
    hrtimer_start(hrtimer, ns_to_ktime(sample_period), HRTIMER_MODE_REL_PINNED);
    ...
    // Enable the perf event created earlier, creating it first if it does not exist yet
    if (watchdog_enabled & NMI_WATCHDOG_ENABLED)
        watchdog_nmi_enable(cpu);
}
//watchdog kicker functions
static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer){
    ...
    /* kick the hardlockup detector */
    watchdog_interrupt_count(); // update hrtimer_interrupts
    ...
}
That covers the code behind the "hardlockup implementation mechanism" figure. Now let us examine the other key point: how the perf event itself is created, i.e. how perf_event_create_kernel_counter is implemented.
//kernel/events/core.c
/**
 * perf_event_create_kernel_counter
 *
 * @attr: attributes of the counter to create
 * @cpu: cpu in which the counter is bound
 * @task: task to profile (NULL for percpu)
 */
struct perf_event *
perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
        struct task_struct *task, perf_overflow_handler_t overflow_handler, void *context){
    struct perf_event_context *ctx;
    struct perf_event *event;
    ...
    /* Create a cycle counter with type PERF_TYPE_HARDWARE, config PERF_COUNT_HW_CPU_CYCLES
     * and a period equal to 10 s worth of cycles */
    event = perf_event_alloc(attr, cpu, task, NULL, NULL, overflow_handler, context, -1);
    ...
    // find (or allocate) the matching context
    ctx = find_get_context(event->pmu, task, event);
    ...
    perf_install_in_context(ctx, event, cpu);
    perf_unpin_context(ctx);
    ...
    return event;
}
/* Allocate and initialize the perf event */
static struct perf_event *
perf_event_alloc(struct perf_event_attr *attr, int cpu, struct task_struct *task,
        struct perf_event *group_leader, struct perf_event *parent_event,
        perf_overflow_handler_t overflow_handler, void *context, int cgroup_fd){
    struct pmu *pmu;
    struct perf_event *event;
    struct hw_perf_event *hwc;
    ...
    // allocate memory for the perf_event
    event = kzalloc(sizeof(*event), GFP_KERNEL);
    ... // initialize fields
    init_waitqueue_head(&event->waitq);
    init_irq_work(&event->pending, perf_pending_event);
    ...
    /* Initialize the perf_event down to the config of its specific type:
     * --> perf_init_event
     *   --> perf_try_init_event
     *     --> pmu->event_init(event)
     */
    pmu = perf_init_event(event);
    ...
}
//drivers/perf/arm_pmu.c
static int armpmu_event_init(struct perf_event *event){
    ...
    /* As analyzed in the earlier article "PMU, the cornerstone of Perf Event", map_event
     * matches the event against armv8_pmuv3_perf_map in the PMU driver. Since the config
     * passed in is PERF_COUNT_HW_CPU_CYCLES, the resulting PMU event is
     * ARMV8_PMUV3_PERFCTR_CPU_CYCLES. */
    if (armpmu->map_event(event) == -ENOENT)
        return -ENOENT;
    return __hw_perf_event_init(event);
}
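At this point the PERF_COUNT_HW_CPU_CYCLES perf event has been created successfully. The rest of the flow works as discussed in the article "PMU, the cornerstone of Perf Event".
Summary:
hardlockup is essentially a mechanism for debugging a CPU that has been hung by an interrupt. It uses an NMI (non-maskable interrupt) to check periodically whether the hrtimer interrupt count has advanced during the monitoring window. If it has not, something has gone wrong, and what happens next depends on the configuration: the kernel either prints the stack or panics.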