Skip to content

Manage ticks to suppress RCU CPU stall warning #75

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Feb 16, 2025

Conversation

Mes0903
Copy link
Collaborator

@Mes0903 Mes0903 commented Feb 14, 2025

Since the emulator currently operates using sequential emulation, the execution time for the boot process is relatively long, which can result in the generation of RCU CPU stall warnings.

To address this issue, there are several potential solutions:

  1. Scale the frequency to slow down emulator time during the boot process, thereby eliminating RCU CPU stall warnings.
  2. During the boot process, avoid using 'clock_gettime' to update ticks and instead manage the tick increment relationship manually.
  3. Implement multi-threaded emulation to accelerate the emulator's execution speed.

For the third point, while implementing multi-threaded emulation can significantly accelerate the emulator's execution speed, it cannot guarantee that this issue will not reappear as the number of cores increases in the future. Therefore, a better approach is to use methods 1 and 2 to allow the emulator to set an expected time for completing the boot process.

The scale method requires three pieces of information to be implemented:

  • the cost of 'semu_timer_clocksource'
  • percentage of 'semu_timer_clocksource' within the boot process
  • the number of times 'semu_timer_clocksource' is called.

In contrast, the increment method only requires the third information to be implemented, making its implementation simpler.

Furthermore, through statistical analysis, we found that the other two values (cost and percentage) exhibit different distributions across different environments. Therefore, using the scale method would require an additional profiling step when the emulator starts, adding complexity to the implementation.

Finally, since the increment method does not obtain real-time timestamps during the boot process, its overhead is lower compared to the scale method. Because 'semu_timer_clocksource' is called frequently, this reduction in overhead accumulates and results in a noticeable performance improvement.

Therefore, this commit opts for the increment method to address this issue. This commit divides time into emulator time and real time. During the boot process, the timer manage the ticks increment to slow down the growth of emulator time, eliminating RCU CPU stall warnings. After the boot process is complete, the growth of emulator time aligns with real time.

According to Using RCU’s CPU Stall Detector [1], the grace period for RCU CPU stalls is typically set to 21 seconds. By dividing this value by two as the expected completion time, we can provide a sufficient buffer to reduce the impact of errors and avoid RCU CPU stall warnings.

By statistic, the number of times 'semu_timer_clocksource' called is approximately

$$ \text{SMP count} \times 2.15 \times 10^8 $$

By the time the boot process is completed, the emulator will have a total of

$$ \text{boot seconds} \times \text{frequency} $$

ticks. Therefore, each time,

$$ \frac{\text{boot seconds} \times \text{frequency}}{(2.15 \times 10^8 \times \text{SMP count})} $$

ticks need to be added.

Close #51

[1] https://docs.kernel.org/RCU/stallwarn.html#config-rcu-cpu-stall-timeout

Summary by Bito

Implementation of a sophisticated timer management system with dual-phase timing to prevent RCU CPU stall warnings during boot process. The system features manual tick increment management during boot, transitioning to real-time timing after detecting boot completion via U mode transition. The solution includes statistical calculations for optimal tick increments based on hart count, creating a robust timing infrastructure.

Unit tests added: False

Estimated effort to review (1-5, lower is better): 2

@jserv
Copy link
Collaborator

jserv commented Feb 14, 2025

clang-format version 12 would incorrectly indent as following:

--- utils.h
+++ utils.h
@@ -87,7 +87,7 @@ static inline void list_del_init(struct
     })
 #else
 #define container_of(ptr, type, member) \
-    ((type *) ((char *) (ptr) - offsetof(type, member)))
+    ((type *) ((char *) (ptr) -offsetof(type, member)))
 #endif
 
 #define list_entry(node, type, member) container_of(node, type, member)

We shall migrate to newer versions later.

Copy link

bito-code-review bot commented Feb 14, 2025

Code Review Agent Run #9f603b

Actionable Suggestions - 6
  • Makefile - 1
  • utils.c - 5
    • Consider initializing static timing variables · Line 24-25
    • Consider using integer constant for nanoseconds · Line 54-54
    • Use consistent integer constant in fallback · Line 68-68
    • Consider making static variables thread-safe · Line 87-88
    • Consider consistent integer type usage · Line 25-25
Review Details
  • Files reviewed - 5 · Commit Range: e895724..e895724
    • Makefile
    • main.c
    • riscv.c
    • utils.c
    • utils.h
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • Fb Infer (Static Code Analysis) - ✖︎ Failed

AI Code Review powered by Bito Logo

Copy link

bito-code-review bot commented Feb 14, 2025

Changelist by Bito

This pull request implements the following key changes.

Key Change Files Impacted
Feature Improvement - Timer Management Enhancement for RCU Stall Prevention

Makefile - Added configuration for boot target time to manage RCU CPU stall warnings

main.c - Updated timer initialization to include hart count parameter

riscv.c - Added boot completion detection logic using U mode transition

utils.c - Implemented new timer management system with boot and real-time phases

utils.h - Added declarations for boot completion flag and updated timer initialization interface

@@ -133,6 +133,7 @@ E :=
S := $E $E

SMP ?= 1
CFLAGS += -D SEMU_BOOT_TARGET_TIME=10

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider making boot time configurable

Consider making the boot target time configurable via environment variable instead of hardcoding it in the Makefile. This would allow users to override the value without modifying the Makefile. Could use something like: SEMU_BOOT_TARGET_TIME ?= 10

Code suggestion
Check the AI-generated fix before applying
Suggested change
CFLAGS += -D SEMU_BOOT_TARGET_TIME=10
SEMU_BOOT_TARGET_TIME ?= 10
CFLAGS += -D SEMU_BOOT_TARGET_TIME=$(SEMU_BOOT_TARGET_TIME)

Code Review Run #9f603b


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

Comment on lines +24 to +25
static double ticks_increment;
static double boot_ticks;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider initializing static timing variables

Consider initializing boot_ticks and ticks_increment with meaningful default values to avoid potential undefined behavior when these variables are first accessed.

Code suggestion
Check the AI-generated fix before applying
Suggested change
static double ticks_increment;
static double boot_ticks;
static double ticks_increment = 0.0;
static double boot_ticks = 0.0;

Code Review Run #9f603b


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

return t.tv_sec * freq + mult_frac(t.tv_nsec, freq, 1e9);
struct timespec ts;
clock_gettime(CLOCKID, &ts);
return (uint64_t) ts.tv_sec * 1e9 + (uint64_t) ts.tv_nsec;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using integer constant for nanoseconds

Consider using UINT64_C(1000000000) instead of 1e9 for explicit integer constant. Using floating point literals for integer calculations could lead to precision issues.

Code suggestion
Check the AI-generated fix before applying
Suggested change
return (uint64_t) ts.tv_sec * 1e9 + (uint64_t) ts.tv_nsec;
return (uint64_t) ts.tv_sec * UINT64_C(1000000000) + (uint64_t) ts.tv_nsec;

Code Review Run #9f603b


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

return time(0) * freq;
/* Fallback to non-HRT calls time(0) in seconds => convert to ns. */
time_t now_sec = time(0);
return (uint64_t) now_sec * 1e9;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use consistent integer constant in fallback

Similar to above, consider using UINT64_C(1000000000) instead of 1e9 in the fallback path for consistency and to avoid potential floating point precision issues.

Code suggestion
Check the AI-generated fix before applying
Suggested change
return (uint64_t) now_sec * 1e9;
return (uint64_t) now_sec * UINT64_C(1000000000);

Code Review Run #9f603b


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

Comment on lines +87 to +88
static int64_t offset = 0;
static bool first_switch = true;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider making static variables thread-safe

Consider making offset and first_switch variables thread-safe since they are static variables accessed from multiple threads. Using atomic variables would prevent potential race conditions.

Code suggestion
Check the AI-generated fix before applying
Suggested change
static int64_t offset = 0;
static bool first_switch = true;
static atomic_int64_t offset = ATOMIC_VAR_INIT(0);
static atomic_bool first_switch = ATOMIC_VAR_INIT(true);

Code Review Run #9f603b


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

@@ -19,6 +20,10 @@
#endif
#endif

bool boot_complete = false;
static double ticks_increment;
static double boot_ticks;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider consistent integer type usage

The boot_ticks variable is declared as double but cast to uint64_t without checking for potential overflow or precision loss. Consider using uint64_t type consistently.

Code suggestion
Check the AI-generated fix before applying
Suggested change
static double boot_ticks;
static uint64_t boot_ticks;

Code Review Run #9f603b


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

Since the emulator currently operates using sequential emulation, the
execution time for the boot process is relatively long, which can result
in the generation of RCU CPU stall warnings.

To address this issue, there are several potential solutions:

1. Scale the frequency to slow down emulator time during the boot
   process, thereby eliminating RCU CPU stall warnings.
2. During the boot process, avoid using 'clock_gettime' to update ticks
   and instead manage the tick increment relationship manually.
3. Implement multi-threaded emulation to accelerate the emulator's
   execution speed.

For the third point, while implementing multi-threaded emulation can
significantly accelerate the emulator's execution speed, it cannot
guarantee that this issue will not reappear as the number of cores
increases in the future. Therefore, a better approach is to use methods
1 and 2 to allow the emulator to set an expected time for completing the
boot process.

The scale method requires three pieces of information to be implemented:
- the cost of 'semu_timer_clocksource'
- percentage of 'semu_timer_clocksource' within the boot process
- the number of times 'semu_timer_clocksource' is called.

In contrast, the increment method only requires the third information to
be implemented, making its implementation simpler.

Furthermore, through statistical analysis, we found that the other two
values (cost and percentage) exhibit different distributions across
different environments. Therefore, using the scale method would require
an additional profiling step when the emulator starts, adding complexity
to the implementation.

Finally, since the increment method does not obtain real-time timestamps
during the boot process, its overhead is lower compared to the scale
method. Because 'semu_timer_clocksource' is called frequently, this
reduction in overhead accumulates and results in a noticeable
performance improvement.

Therefore, this commit opts for the increment method to address this
issue. This commit divides time into emulator time and real time. During
the boot process, the timer manage the ticks increment to slow down the
growth of emulator time, eliminating RCU CPU stall warnings. After the
boot process is complete, the growth of emulator time aligns with real
time.

According to Using RCU’s CPU Stall Detector [1], the grace period for
RCU CPU stalls is typically set to 21 seconds. By dividing this value by
two as the expected completion time, we can provide a sufficient buffer
to reduce the impact of errors and avoid RCU CPU stall warnings.

By statistic, the number of times 'semu_timer_clocksource' called is
approximately 'SMP count * 2.15 * 1e8'. By the time the boot process is
completed, the emulator will have a total of 'boot seconds * frequency'
ticks. Therefore, each time, '(boot seconds * frequency) / (2.15 * 1e8 *
SMP count)' ticks need to be added.

Close sysprog21#51

[1] https://docs.kernel.org/RCU/stallwarn.html#config-rcu-cpu-stall-timeout
Copy link

bito-code-review bot commented Feb 16, 2025

Code Review Agent Run #cbf1c4

Actionable Suggestions - 1
  • utils.c - 1
    • Consider validating hart_count parameter value · Line 105-105
Review Details
  • Files reviewed - 5 · Commit Range: cd565d0..cd565d0
    • Makefile
    • main.c
    • riscv.c
    • utils.c
    • utils.h
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • Fb Infer (Static Code Analysis) - ✖︎ Failed

AI Code Review powered by Bito Logo

@jserv jserv merged commit b42df15 into sysprog21:master Feb 16, 2025
3 checks passed
@jserv
Copy link
Collaborator

jserv commented Feb 16, 2025

Thank @Mes0903 for contributing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RCU CPU stall warning in a multi-core system simulation
2 participants