General matrix multiplication of f32 and f64 matrices in Rust. Supports matrices with general strides.
I am trying to figure out what to use as the optimal kernel parameters for different architectures. For example, it looks like BLIS uses 8x4 for Sandy Bridge but 8x6 for Haswell. Why? What led them to this setup? Specifically, since operations usually work on 4 doubles at a time, how does the 6 fit in there? Is Haswell able to execute a `_mm256` and a `_mm` operation *at the same time*? Furthermore, if we have non-square kernels, as for dgemm, is there a scenario where choosing 4x8 over 8x4 is better?
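One way to make the question concrete is to count SIMD registers. The sketch below is only a back-of-the-envelope model, not a statement of how BLIS actually derived its parameters. It assumes 256-bit registers holding 4 doubles, 16 such registers (as on x86-64 with AVX/AVX2), and a kernel that vectorizes along the 8-element dimension while the other dimension is handled one broadcast scalar at a time. The function names are hypothetical and exist only for this illustration.

```rust
/// Number of SIMD registers needed to hold an `mr x nr` accumulator tile,
/// assuming the `mr` dimension is vectorized with `lanes` elements per register
/// and the `nr` dimension is covered by one accumulator set per column.
fn accumulator_registers(mr: usize, nr: usize, lanes: usize) -> usize {
    // Round up: a partially filled vector still occupies a whole register.
    let vectors_per_column = (mr + lanes - 1) / lanes;
    vectors_per_column * nr
}

/// Registers left over for loading A and broadcasting B, or `None` if the
/// accumulator tile alone does not fit into `total` registers.
fn spare_registers(mr: usize, nr: usize, lanes: usize, total: usize) -> Option<usize> {
    total.checked_sub(accumulator_registers(mr, nr, lanes))
}
```

Under these assumptions, `spare_registers(8, 4, 4, 16)` gives `Some(8)` and `spare_registers(8, 6, 4, 16)` gives `Some(4)`, while an 8x8 tile would use all 16 registers for accumulators and leave none. In this model the 6 is not a vector width at all: it is simply the number of columns, each of which needs two 4-wide accumulators, so no mixed `_mm256`/`_mm` execution is required for it to fit.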
The issue was opened by SuperFluffy and has received 7 comments.