Regarding isqrt performance #137786
Comments
This implementation of
// SQRTS[i<256] = (i << 8).isqrt() as u8
const fn new_isqrt(x: u32) -> u32 {
    if x < 256 {
        return SQRTS[x as usize] as u32 >> 4;
    }
    let idx = x >> ((25 - x.leading_zeros()) & !1);
    // SAFETY: If x has y leading zeros, the shift count is either 24 - y or
    // 25 - y. Thus idx has either 24 or 25 leading zeroes, and in particular
    // it's less than 256.
    unsafe { std::hint::assert_unchecked(idx < 256) };
    let approx1 = SQRTS[idx as usize] as u32;
    // SAFETY: Every element of SQRTS is at least 16 except element 0, and
    // idx is positive so that cannot be selected.
    unsafe { std::hint::assert_unchecked(approx1 >= 16) };
    let mut approx = (approx1 + 1) << ((25 - x.leading_zeros()) / 2) >> 4;
    approx = (approx + x / approx) / 2;
    if approx * approx > x {
        approx -= 1;
    }
    approx
}
On my M3 Max laptop, this implementation takes ~6.7 seconds to evaluate for all 2^32 inputs, while |
With my Ryzen 7 9800X3D on Linux, that's significantly slower at first:
That turns around with
|
To avoid the
/// Fixed-point square roots of 0..=255, with 4 bits integer part and
/// 4 bits fractional part.
const FIXED_POINT_SQRTS: [u8; 256] = {
    let mut result = [0; 256];
    let mut sqrt = 0_u32;
    let mut i = 0_u32;
    while i < 256 {
        while sqrt * sqrt <= (i << 8) {
            sqrt += 1;
        }
        sqrt -= 1;
        result[i as usize] = sqrt as u8;
        i += 1;
    }
    result
};
|
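For reference, here is a minimal sketch of how the table and the lookup-plus-one-Newton-step routine from the first comment might fit together. It assumes that `SQRTS` in that comment holds the same contents as `FIXED_POINT_SQRTS` (the two descriptions match), and it uses ordinary bounds-checked indexing instead of `assert_unchecked`. The names `isqrt_table_sketch`, `is_floor_sqrt` and `count_boundary_mismatches` are made up for this illustration.

```rust
/// Fixed-point square roots of 0..=255 (4 integer bits, 4 fractional bits),
/// built the same way as the table above.
const FIXED_POINT_SQRTS: [u8; 256] = {
    let mut result = [0u8; 256];
    let mut sqrt = 0_u32;
    let mut i = 0_u32;
    while i < 256 {
        while sqrt * sqrt <= (i << 8) {
            sqrt += 1;
        }
        sqrt -= 1;
        result[i as usize] = sqrt as u8;
        i += 1;
    }
    result
};

/// Hypothetical safe variant of `new_isqrt`: same arithmetic, but with
/// bounds-checked indexing in place of the `assert_unchecked` hints.
const fn isqrt_table_sketch(x: u32) -> u32 {
    if x < 256 {
        return FIXED_POINT_SQRTS[x as usize] as u32 >> 4;
    }
    // For x >= 256 this is in 2..=25, so the subtraction cannot underflow.
    let bits = 25 - x.leading_zeros();
    // Keep the top 7 or 8 significant bits of x, dropping an even number of bits.
    let idx = x >> (bits & !1);
    let approx1 = FIXED_POINT_SQRTS[idx as usize] as u32;
    // Scale the table entry back up and drop the 4 fractional bits.
    let mut approx = ((approx1 + 1) << (bits / 2)) >> 4;
    // One Newton step, then a single downward correction.
    approx = (approx + x / approx) / 2;
    if approx * approx > x {
        approx -= 1;
    }
    approx
}

/// True if r == floor(sqrt(x)), checked in u64 so nothing can overflow.
fn is_floor_sqrt(x: u32, r: u32) -> bool {
    let (x, r) = (x as u64, r as u64);
    r * r <= x && x < (r + 1) * (r + 1)
}

/// Counts wrong answers at every perfect square, the value just below it,
/// and u32::MAX. This is a spot check, not an exhaustive test.
fn count_boundary_mismatches() -> u64 {
    let mut bad = 0;
    for k in 1..=65_535u32 {
        for x in [k * k - 1, k * k] {
            if !is_floor_sqrt(x, isqrt_table_sketch(x)) {
                bad += 1;
            }
        }
    }
    if !is_floor_sqrt(u32::MAX, isqrt_table_sketch(u32::MAX)) {
        bad += 1;
    }
    bad
}
```

Calling `count_boundary_mismatches()` from a test or from `main` gives a quick sanity check. Only a run over all 2^32 inputs, as in the timings quoted in this thread, covers every case.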
This is caused by an issue with how |
Does LLVM have enough information to make a better register selection, in theory? If so, this would be worth opening an issue (if one doesn't already exist). |
I'm not sure how feasible it is to solve that by picking a better output register, but it's also solvable by initializing the destination register immediately before the |
I've made a slightly faster isqrt implementation.
// SQRTS[i<256] = (i * 256).isqrt()
// RECIPS[i<192] = (1 << 38).div_ceil(SQRTS[i + 64] + 1)
fn newer_isqrt2(x: u32) -> u32 {
    if x < 256 {
        return SQRTS[x as usize] as u32 >> 4;
    }
    let idx = x >> ((25 - x.leading_zeros()) & !1);
    // SAFETY: If x has y leading zeros, the shift count is either 24 - y or
    // 25 - y. Thus idx has either 24 or 25 leading zeroes, and in particular
    // it's in 64..256.
    unsafe { std::hint::assert_unchecked(64 <= idx) };
    unsafe { std::hint::assert_unchecked(idx < 256) };
    let approx1 = SQRTS[idx as usize] as u32 + 1;
    let approx2 = approx1 << ((25 - x.leading_zeros()) / 2);
    let divmult = RECIPS[idx as usize - 64] as u64;
    // Approximately `x * 256 / approx2`, i.e. 16 times `x` divided by the
    // root estimate `approx2 / 16`.
    let approxr = (x as u64 * divmult >> (25 - x.leading_zeros()) / 2 + 30) as u32;
    let mut approx3 = approx2 + approxr >> 5;
    if approx3 * approx3 > x {
        approx3 -= 1;
    }
    approx3
}
Godbolt. ( On my laptop, I get:
|
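The comment above only gives the formula for `RECIPS`, not its construction. The following is a sketch of how both tables could be built at compile time. It assumes `u32` elements for `RECIPS`, which is enough because the largest entry, ceil(2^38 / 129), is below 2^32. The function `isqrt_recip_sketch` is a hypothetical bounds-checked restatement of `newer_isqrt2` with explicit parentheses, not the code that was benchmarked.

```rust
/// SQRTS[i] = floor(sqrt(i * 256)), as described in the comment above.
const SQRTS: [u8; 256] = {
    let mut result = [0u8; 256];
    let mut sqrt = 0_u32;
    let mut i = 0_u32;
    while i < 256 {
        while sqrt * sqrt <= i * 256 {
            sqrt += 1;
        }
        sqrt -= 1;
        result[i as usize] = sqrt as u8;
        i += 1;
    }
    result
};

/// RECIPS[i] = ceil(2^38 / (SQRTS[i + 64] + 1)) for i in 0..192.
/// Assumption: u32 elements; the divisor ranges over 129..=256.
const RECIPS: [u32; 192] = {
    let mut result = [0u32; 192];
    let mut i = 0;
    while i < 192 {
        let d = SQRTS[i + 64] as u64 + 1;
        // Ceiling division written out by hand.
        result[i] = (((1u64 << 38) + d - 1) / d) as u32;
        i += 1;
    }
    result
};

/// Hypothetical safe restatement of `newer_isqrt2`.
fn isqrt_recip_sketch(x: u32) -> u32 {
    if x < 256 {
        return SQRTS[x as usize] as u32 >> 4;
    }
    let half = (25 - x.leading_zeros()) / 2; // in 1..=12 for x >= 256
    let idx = (x >> (2 * half)) as usize; // in 64..256
    // 16 times an estimate of sqrt(x).
    let approx2 = (SQRTS[idx] as u32 + 1) << half;
    let divmult = RECIPS[idx - 64] as u64;
    // Roughly x * 256 / approx2: a multiply and a shift replace the division.
    let approxr = ((x as u64 * divmult) >> (half + 30)) as u32;
    // Average the two 16x-scaled values and drop the scale: (a + x / a) / 2.
    let mut approx3 = (approx2 + approxr) >> 5;
    if approx3 * approx3 > x {
        approx3 -= 1;
    }
    approx3
}
```

For example, `isqrt_recip_sketch(256)` works out to 16 and `isqrt_recip_sketch(u32::MAX)` to 65535. A handful of values like these does not replace comparing against a reference over the full input range.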
Here's a version which is (probably) a bit faster on 32-bit targets.
// SQRTS[i<256] = (i * 256).isqrt()
// RECIPS[i<192] = (1 << 39).div_ceil(SQRTS[i + 64] + 1)
fn new_isqrt_32bit(x: u32) -> u32 {
    if x < 256 {
        return SQRTS[x as usize] as u32 >> 4;
    }
    let idx = x >> ((25 - x.leading_zeros()) & !1);
    // SAFETY: If x has y leading zeros, the shift count is either 24 - y or
    // 25 - y. Thus idx has either 24 or 25 leading zeroes, and in particular
    // it's in 64..256.
    unsafe { std::hint::assert_unchecked(64 <= idx) };
    unsafe { std::hint::assert_unchecked(idx < 256) };
    let approx1 = SQRTS[idx as usize] as u32 + 1;
    let approx2 = approx1 << ((25 - x.leading_zeros()) / 2);
    let divmult = RECIPS[idx as usize - 64] as u64;
    // Approximately `x * 256 / approx2`, i.e. 16 times `x` divided by the
    // root estimate `approx2 / 16`.
    let approxr = ((x as u64 * divmult >> 32) as u32) >> (25 - x.leading_zeros()) / 2 - 1;
    let mut approx3 = approx2 + approxr >> 5;
    if approx3 * approx3 > x {
        approx3 -= 1;
    }
    approx3
}
|
The std lib could use a const icbrt as well :-) I've implemented it twice in my own Rust code, and one of those implementations was buggy. |
I recently found out about the That issue says that some microcontrollers have only 16 kiB total memory, and since |
|
This StackOverflow question has some nice implementations. |
We need to be careful because if we start with some of that code and modify it (such as porting it to Rust and/or improving its speed), we'd be making a derivative work and be bound by its license. I believe that we try to keep everything in the Rust standard library under MIT or Apache 2.0, and I'm not sure whether Stack Overflow's chosen CC-BY-SA licenses are compatible with that. One way around it would be if we decided which implementation(s) we wanted to work with, and then we got permission from the answerers in your linked post to license it under MIT and Apache 2.0. It appears both of the answerers in your linked post have been active on Stack Overflow in the past week. We'd need to follow the chain backwards in case they modified someone else's code (and that code wasn't under MIT or Apache 2.0) and so forth. |
It appears that the author of the longer answer there also was the original author of Python's |
A basic icbrt version:
// Integer cube root for u32 values.
// (Don't use this to create a u64 version).
const fn icbrt_u32(x: u32) -> u32 {
    let mut x = x as u64;
    let mut y = 0;
    let mut s: i32 = 63;
    while s >= 0 {
        y += y;
        let b = 3 * y * (y + 1) + 1;
        if (x >> s) >= b {
            x -= b << s;
            y += 1;
        }
        s -= 3;
    }
    y as _
}
fn main() {
    for x in 0 ..= u32::MAX {
        if x % (1 << 24) == 0 {
            println!("{x}");
        }
        let c1 = icbrt_u32(x);
        let c2 = f64::from(x).cbrt() as u32;
        if c1 != c2 {
            println!("{x} {c1} {c2}");
        }
    }
}
|
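The `main` above compares against `f64::cbrt`, whose precision the Rust documentation does not pin down. As a complement, here is a sketch of an integer-only check at the inputs where an off-by-one is most likely: each perfect cube and the value just below it. `icbrt_u32` is copied from the comment above with an explicit `u64` annotation on `y`. The helper `cube_boundaries_ok` is a hypothetical name introduced for this illustration.

```rust
// Integer cube root for u32 values (copied from the comment above).
const fn icbrt_u32(x: u32) -> u32 {
    let mut x = x as u64;
    let mut y: u64 = 0;
    let mut s: i32 = 63;
    while s >= 0 {
        y += y;
        let b = 3 * y * (y + 1) + 1;
        if (x >> s) >= b {
            x -= b << s;
            y += 1;
        }
        s -= 3;
    }
    y as _
}

/// Hypothetical helper: checks `icbrt_u32` at c^3 and c^3 - 1 for every
/// cube that fits in a u32, using integer arithmetic only.
fn cube_boundaries_ok() -> bool {
    // 1625^3 = 4_291_015_625 is the largest cube below 2^32.
    for c in 0..=1625u64 {
        let cube = c * c * c;
        if icbrt_u32(cube as u32) as u64 != c {
            return false;
        }
        if c > 0 && icbrt_u32((cube - 1) as u32) as u64 != c - 1 {
            return false;
        }
    }
    icbrt_u32(u32::MAX) == 1625
}
```

This visits about 3,250 inputs, so it runs instantly. It is a boundary spot check rather than a substitute for the exhaustive loop.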
This is experimental code; it can be compiled with either implementation of the integer square root:
As you can see, the floating-point based isqrt implementation is almost five times faster on my PC (using rustc 1.87.0-nightly). The std lib isqrt has the upside of being const, so I can use it for const generics calculations and similar situations. When I need to compute only one or a few isqrt values, this performance difference isn't important, and I use the std one. But when I need to compute a lot of them, I consider faster alternatives.
While I don't propose replacing the std library isqrt with the code I've shown here, I'd like the std lib implementers to weigh in on using a floating-point sqrt plus an integer cast in some vetted and safe cases.
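To make the "vetted and safe cases" concrete, here is a sketch of what a floating-point based isqrt might look like. This is not the code that was benchmarked above, and all three function names are hypothetical. The `u32` case needs no correction because every `u32` is exactly representable as an `f64` and IEEE 754 `sqrt` is correctly rounded. The `u64` case gets an integer fix-up because converting a `u64` to `f64` can round.

```rust
/// Hypothetical float-based isqrt for u32. The true root of k*k - 1 lies
/// about 1/(2k) below k, far more than one f64 ulp for k <= 65536, so the
/// truncating cast cannot land on the wrong integer.
fn isqrt_f64(x: u32) -> u32 {
    f64::from(x).sqrt() as u32
}

/// Hypothetical u64 variant. `x as f64` may round, so the float result is
/// only a starting guess that is then corrected with integer arithmetic.
fn isqrt_f64_u64(x: u64) -> u64 {
    let mut r = (x as f64).sqrt() as u64;
    // Step down while r*r overflows or exceeds x.
    while r.checked_mul(r).map_or(true, |sq| sq > x) {
        r -= 1;
    }
    // Step up while (r+1)^2 still fits in u64 and does not exceed x.
    while (r + 1).checked_mul(r + 1).map_or(false, |sq| sq <= x) {
        r += 1;
    }
    r
}

/// Spot check of the u32 version at every perfect square and the value
/// just below it.
fn u32_boundaries_ok() -> bool {
    for k in 1..=65_535u32 {
        if isqrt_f64(k * k) != k || isqrt_f64(k * k - 1) != k - 1 {
            return false;
        }
    }
    isqrt_f64(0) == 0 && isqrt_f64(u32::MAX) == 65_535
}
```

Unlike the std lib isqrt discussed above, this sketch is not written as a `const fn`, so it does not cover the const generics use case.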