These variants are implemented mainly so that LANPR can run smoothly with its double-precision internal calculations.
The matrix multiply function needs a new __SSE2__ implementation.
B is a float matrix rather than a double one, so this mixed variant might break _mm_mul_ps, which operates on packed single-precision floats.
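To make the mixed-precision case concrete, here is a minimal scalar sketch, not the actual patch. The function name `mul_m4db_m4db_m4fl_sketch`, the row-major `m[row][col]` layout, and the `R = A * B` argument order are all assumptions for illustration. Each float element of B is promoted to double before the multiply, which is the step a packed-float instruction cannot do for us.

```c
#include <assert.h>

/* Hypothetical mixed-precision multiply: R = A * B, where R and A are double
 * and B is float. Assumes row-major m[row][col] storage and that R does not
 * alias A. This is an illustration, not the function from the patch. */
void mul_m4db_m4db_m4fl_sketch(double R[4][4],
                               const double A[4][4],
                               const float B[4][4])
{
  assert((const void *)R != (const void *)A);

  for (int i = 0; i < 4; i++) {
    for (int j = 0; j < 4; j++) {
      double sum = 0.0;
      for (int k = 0; k < 4; k++) {
        /* Promote the float element to double explicitly before multiplying. */
        sum += A[i][k] * (double)B[k][j];
      }
      R[i][j] = sum;
    }
  }
}
```

Since the float operand is widened element by element, nothing here depends on the register layout that `_mm_mul_ps` expects.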
I'm changing my matrices to use double only, for simplicity, but some other problems popped up as well. I'm probably going to add a mul_m4_m4m4_db_uniq() where all three matrices are double precision. Maybe use the SSE instructions only in the pure-double case and leave this one without SSE? (Or I could remove this variant completely.)
__m128d holds doubles instead of floats, but only two of them, and __m256d (four doubles) needs AVX support. I removed the SIMD instructions for those and left the optimization to the compiler. The double matrix variants are not called frequently, so there should be little performance impact.
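For reference, a rough sketch of what an all-double multiply without intrinsics could look like. The name `mul_m4_m4m4_db_uniq_sketch` and its signature are hypothetical, and I am assuming row-major `m[row][col]` storage, `R = A * B`, and that "uniq" means the output must not alias either input. The plain loops leave any vectorization to the compiler.

```c
#include <assert.h>

/* Hypothetical all-double 4x4 multiply: R = A * B, with no SIMD intrinsics.
 * Assumes row-major m[row][col] storage and that R aliases neither input.
 * This is an illustration, not the function from the patch. */
void mul_m4_m4m4_db_uniq_sketch(double R[4][4],
                                const double A[4][4],
                                const double B[4][4])
{
  assert((const void *)R != (const void *)A);
  assert((const void *)R != (const void *)B);

  for (int i = 0; i < 4; i++) {
    for (int j = 0; j < 4; j++) {
      /* Dot product of row i of A with column j of B. */
      R[i][j] = A[i][0] * B[0][j] + A[i][1] * B[1][j] +
                A[i][2] * B[2][j] + A[i][3] * B[3][j];
    }
  }
}
```

Because R is required not to alias A or B, the function can write results directly without a temporary copy.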