Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

feat: extend intrinsic matmul #951

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wassup05 wants to merge 30 commits into fortran-lang:master
base: master
Choose a base branch
Loading
from wassup05:matmul
Open

Conversation

@wassup05
Copy link
Contributor

@wassup05 wassup05 commented Mar 13, 2025

This PR attempts to address #931

interface stdlib_matmul is created and extended to handle 3-5 matrices. (works for integer, real, complex)

API

A = stdlib_matmul(B, C, D, E, F)
A = stdlib_matmul(B, C, D, E)
A = stdlib_matmul(B, C, D)

The Algorithm for optimal parenthesization is as given in "Introduction to Algorithms" by Cormen, ed4, ch14, section-2.
numpy's linalg.multidot also uses the same algorithm.

Although as @jalvesz had suggested I should have used gemm for the multiplication of component matrices this uses matmul for now, I can make this if deemed appropriate once the major implementation has been given a green signal.

I have added a very basic example to play around with, and I will be adding the detailed docs, specs and tests once everybody approves of the major implementation.

I am not really happy with some parts of the code like computing the size of all the matrices individually, If anyone has any suggestions regarding that or any cleaner way of implementing some other stuff (perhaps some fypp magic) please do let me know

Notes

  • I think another interesting enhancement would be, if the first array is 1-D, then treat it as a row vector. If the last error is 1-D treat it as a column vector just as numpy does it.

jalvesz and zoziha reacted with thumbs up emoji
Copy link
Contributor

jalvesz commented Mar 14, 2025
edited
Loading

Here a few ideas for consideration:

  1. Regarding the use of gemm or plain matmult: both gfortran and intel (ifort/ifx) enable replacing the intrinsic matmul by gemm using flags: worth reading https://stackoverflow.com/questions/31494176/will-fortrans-matmul-make-use-of-mkl-if-i-include-the-library and https://community.intel.com/t5/Intel-Fortran-Compiler/qopt-matmul-with-mkl-sequential/td-p/1003110 so it might not be that harmful after all to implement using matmul.
  2. Regarding the signature of the key functions, you could consider:
pure function matmul_chain_order(p) result(s)
 integer, intent(in) :: p(:)
 integer :: s(1:size(p)-1, 2:size(p))
 integer :: m(1:size(p), 1:size(p))
 integer :: l, i, j, k, q, n
 n = size(p)
 ...
end function matmul_chain_order
pure module function stdlib_matmul (m1, m2, m3, m4, m5) result(e)
 real, intent(in) :: m1(:,:), m2(:,:)
 real, intent(in), optional :: m3(:,:), m4(:,:), m5(:,:) !> from the 3rd matrix they can be all optional
 real, allocatable :: e(:,:)
 integer :: p(5), i, num_present
 integer :: s(3,2:4)
 p(1) = size(m1, 1)
 p(2) = size(m2, 1)
 num_present = 2
 if(present(m3)) then
 p(3) = size(m3, 1)
 num_present = num_present + 1
 end if
 if(present(m4)) then
 p(4) = size(m4, 1)
 num_present = num_present + 1
 end if
 if(present(m5)) then
 p(5) = size(m5, 2)
 num_present = num_present + 1
 end if
 s = matmul_chain_order(p(1:num_present))
 ... 
end function stdlib_matmul

For the procedure computing the orders you only need the p array, the dimension can be obtained. One might argue that this induces extra time. I think it will most likely be transparent.

For the main procedure, you could consider using optional arguments from the 3rd matrix onwards and then manage the rest within the same procedure without recursion.

Regarding the example, I would consider much more interesting to set an example a non-trivial example: the actual ordering being different to the naive sequence.

Just some food-for-thought.

wassup05 reacted with heart emoji

Copy link
Contributor Author

wassup05 commented Mar 14, 2025
edited
Loading

Thank you very much @jalvesz for your detailed remarks, they are quite helpful.
I have refactored the algorithm to only accept p, as n can very well be calculated from that.

Regarding the signature I am a bit confused, because if we don't use recursion we would have to handle 14 cases (4th catalan number) if the number of matrices is 5, which would become quite messy quickly. For 3 matrices it is much efficient to compare the two ordering costs and dispatch an ordering appropriately (numpy claims that it is 15 more efficient than calculating a s matrix for it), 4 matrices case is also somewhat fine considering there are only 5 cases, But I am not sure about how to handle the 5 matrix one without handling all the cases explicitly.. (Or maybe we can just limit this to 4 matrices?) maybe I am missing something here.

And yes I will add some non-trivial examples for sure.

Looking forward to hearing your thoughts.

Copy link
Contributor

jalvesz commented Mar 14, 2025

Indeed! another idea:

For 3 and 4 matrices make them internal procedures including passing as extra argument the ordering slice corresponding to those matrices. For the one public procedure with 2+3optionals, you compute just once the ordering and call the internal versions for 3 and 4 depending on the case.

Copy link
Contributor Author

Thank you @jalvesz, I have implemented them accordingly and added some better examples although I am not sure of how I may test the correctness of multiplication with large matrices or compare the time taken by native matmul vs this..

Copy link
Contributor

jalvesz commented Mar 17, 2025
edited
Loading

In terms of correctness I would say that the simplest approach would be to write down a couple of analytical cases for which the exact solution is fixed, also a couple of cases with random matrices for which the optimal ordering is different from the trivial one and compare: stdlib_matmul vs trivial sequential calls to intrinsic matmul. For integer arrays the error should be zero, for real/complex arrays the error tolerance should be somewhere around 100*epsilon(0._wp). The test could be done using the Frobenius norm error = mnorm(A-A_ref), where A would result from the current proposal and A_ref from the intrinsic matmul using the naive sequence.

In terms of performance, It might not be necessary to add that to the PR but, a separate small program showcasing two scenarios might be useful:

  1. Many multiple calls for stdlib_matmul vs naive matmul for small matrices
  2. Few multiple calls for stdlib_matmul vs naive matmul for medium/large matrices

@loiseaujc maybe you have some use cases worth testing?

Copy link
Contributor

Oh boy ! I had not seen this PR. Super happy with it. I'll take some time this afternoon to look at the code and give you some feedback. As for the test cases, anything involving multiplications with a low-rank matrix already factored would do. As far as I'm concerned, these are the typical cases I encounter where such an interface would prove very useful syntactically. I'll see as I get to the lab if I have a sufficiently simple yet somewhat realistic set of data we could use for testing the performances.

Copy link
Contributor Author

I have replaced all the matmul's by calls to gemm... The code has become a bit complicated and verbose... If I have missed anything or there was a shorter way of doing the same, do let me know please.
And also now the interface handles only real and complex values.

@jalvesz @perazz

Copy link
Member

@perazz perazz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good progress @wassup05.

n = p(start + 3)
k = p(start + 2)
allocate(temp(m,n))
call gemm('N', 'N', m, n, k, one_${s},ドル m2, m, m3, k, zero_${s},ドル temp, m)
Copy link
Member

@perazz perazz Mar 21, 2025
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good progress @wassup05, thank you! Imho this PR is almost ready to be merged. As you suggest, it would be good to have a nice wrapper for gemm. It has been discussed before. I would like to suggest that all calls to gemm are also wrapped into a stdlib_matmul function - now with two matrices only. This would give stdlib fully functional matmul functionality.

Here I suggest two possible APIs, and I will ask @jalvesz @jvdp1 @loiseaujc to discuss that together:

The first would be similar to gemm

! API Similar to gemm
${t1}$ function stdlib_matmul(A, Astate, B, Bstate) result(C)
 ${t1},ドル intent(in) :: A(:,:), B(:,:)
 character, intent(in), optional :: Astate, Bstate

and could use the matrix state definitions already in use for the sparse operations

character(1), parameter :: sparse_op_none = 'N' !! no transpose
character(1), parameter :: sparse_op_transpose = 'T' !! transpose
character(1), parameter :: sparse_op_hermitian = 'H' !! conjugate or hermitian transpose

The second would be more ambitious and essentially zero-overhead, it would wrap the operation in a derived type: (to be templated of course)

type :: matrix_state_rdp
 real(dp), pointer :: A(:,:) => null()
 character(1) :: Astate = 'N'
end type matrix_state_rdp
interface transposed
 module procedure transposed_new_rdp
 ...
end interface transposed 
type(matrix_state_rdp) function transposed_new_rdp(A) result(AT)
 real(dp), intent(inout), target :: A(:,:)
 AT%A => A
 AT%Astate = 'T'
end function

Then we could define a templated base interface

! Work with normal matrices
${t1}$ function stdlib_matmul(A, B) result(C)
 ${t1},ドル intent(in) :: A(:,:), B(:,:)
! Work with transposed/hermitian swaps
${rt}$ function stdlib_matmul(A, B) result(C)
 matrix_state_${rn},ドル intent(in) :: A, B

So the user writing code would have it clear:

C = strlib_matmul(A, B)
C = stdlib_matmul(transposed(A), B)
C = stlib_matmul(A, hermitian(C))

we could even make it an operator:

C = strlib_matmul(A, B)
C = stdlib_matmul(.t.A, B)
C = stlib_matmul(A, .h.C)

without it triggering any actual data movement.

Copy link
Contributor

@loiseaujc loiseaujc Mar 24, 2025
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C = stdlib_matmul(transposed(A), B)
C = stlib_matmul(A, hermitian(C))

How would that work in the written code? Would A and B have to be declared as type(matrix_state_type)?
If so, I'm a bit afraid most people would prefer the more "natural" real, dimension(m, n) :: A, B and play around either directly with the intrinsic transpose or the classical gemm dummy arguments. Maybe I'm missing something though?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would that work in the written code? Would A and B have to be declared as type(matrix_state_type)?

I also had the same reaction at first, after thinking about it longer I saw that it is actually a pretty clever solution, the user would declare the matrices as regular dense matrices. It is the internal interface that would make the distinction. This would imply though implementing internally several versions to account for the combinations (dense,dense) / (dense,type) / (type,dense). This looks interesting but I wonder if it should be pursued at this stage. The first proposal by @perazz looks easier and totally valid but I would propose it with a slight modification:

${t1}$ function stdlib_matmul(A, B, op_a, op_b) result(C)
 ${t1},ドル intent(in) :: A(:,:), B(:,:)
 character, intent(in), optional :: op_a, op_b

to let all optional arguments of a procedure at the end of the signature.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes @jalvesz @loiseaujc, I don't know how to write it better, but it is outlined in the previous post.
There would be an interface for matmul(A,B) (resolved at compile time) with 4 options for each kind:

  1. both A, B are simple matrices (2d arrays)
  2. A is a matrix_state_, B is a 2D array
  3. B is a matrix_state_, A is a 2d array
  4. Both are matrix_state_*

Only function 4. is actually implemented, and is a gemm wrapper: the other implementations just wrap against it with fypp.

Copy link
Contributor

@loiseaujc loiseaujc Mar 25, 2025
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see. I quite like that indeed. I still believe though that, as a starting point, restricting ourselves to standard stuff might be easier. Introducing new derived types to represent matrices definitely is something I'm looking forward to but, in line with the discussion here, it might require a broader discussion to have a well-designed set of derived types.

I also prefer this signature

${t1}$ function stdlib_matmul(A, B, op_a, op_b) result(C)
 ${t1},ドル intent(in) :: A(:,:), B(:,:)
 character, intent(in), optional :: op_a, op_b

for the reasons you've mentioned.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, here I would define these "convenience" derived types, because they're only used to make the matmul interface better and more readable. I would only expose to the user the actual interface (i.e., either the operators .t. .h., or the function names transposed hermitian).

Copy link
Contributor Author

@wassup05 wassup05 Apr 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the first proposal quite a lot too! the 2nd proposal seems convenient too with the .t., .h. operators but maybe the initialization of arrays as type(matrix_state_*) would put some people off..? Maybe we should go ahead with the 1st proposal for now?

Copy link
Contributor Author

wassup05 commented Apr 3, 2025

I thought CI failing on windows was just pure chance, but I don't think so now.. most probably has something to do with random numbers... I am not sure what exactly tho

Copy link
Member

@perazz perazz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wassup05 the state type introduction LGTM. I have polished the error messages a bit. Are you still planning to include the gemm wrapper as part of this PR? Otherwise, we could defer that to a separate PR imho.

Copy link
Contributor

jalvesz commented Apr 22, 2025
edited
Loading

@perazz I think the following error:

 1827 | err0 = linalg_state_type(this, LINALG_VALUE_ERROR, 'matrices m4=',shape(m4),', m5=',shape(m5),' have incompatible sizes')
 | 1
Error: Unterminated character constant beginning at (1)

is related with the line limit by default of 132 characters. This line has 139 characters. Maybe if you split it with a line continuation it should pass.

Copy link
Contributor Author

yes @perazz defering that to another PR seems a better option to me.

So, some final touches to this PR (I think) would be for me to add

  • Some tests
  • an example
  • specs
perazz reacted with thumbs up emoji

Copy link
Member

perazz commented Apr 22, 2025

@wassup05 please also note that the example program seems to be failing on Windows in the CI.

wassup05 reacted with thumbs up emoji

Copy link
Contributor Author

@jalvesz @perazz I have added the specs, example and some tests. I did not add any 5 argument test cases because as you can see the error tolerance should be increased quite a lot since the error is increasing as args are increasing... For 3 args it needed epsilon*300 and for 4 epsilon*1500 and even with that some tests have failed in the CI

Copy link
Contributor

jalvesz commented May 23, 2025

@wassup05 could you try the following: instead of testing each individual value, test the L2 norm of the residual on the whole matrix operation:

instead of

call check(error, all(abs(r-r1) <= epsilon(0._${k}$) * 1500), "real, ${k},ドル 4 args: error too large")

do

call check(error, norm2(r-r1) <= epsilon(0._${k}$) * factor , "real, ${k},ドル 4 args: error too large")

per-value evaluations tend to be more sensitive, whole array norms should be a bit more robust. This does not remove the source problem which is the propagation of the error at each multiplication, but this is just the limit of the finite arithmetic operation at hand.

Copy link
Contributor Author

I tried that @jalvesz, I think the tolerance went even higher than the individual checking did, for 3 args it took around *800 and for 4 args it is *2500 and ongoing... Maybe I should reduce the size of the matrix

jalvesz reacted with thumbs up emoji

r = stdlib_matmul(x, y, z) ! the optimal ordering would be (x(yz))
r1 = matmul(matmul(x, y), z) ! the opposite order to induce a difference

call check(error, all(abs(r-r1) <= epsilon(0._${k}$) * 300), "real, ${k},ドル 3 args: error too large")
Copy link
Contributor

@loiseaujc loiseaujc Jun 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm probably late for this party but isn't epsilon(0.0_wp) like the smallest number that can be represented in the working precision wp? If it is indeed the case, then I wouldn't be surprised to see the tests fail.

I guess, a better indicator would be to consider the matrix 2-norm (i.e. the largest singular value of $A - \tilde{A}$ where $A$ is the matrix you'd compute with a chain of matmul and $\tilde{A}$ the one from your implementation).

Copy link
Contributor

@jalvesz jalvesz Jun 14, 2025
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but isn't epsilon(0.0_wp) like the smallest number that can be represented in the working precision wp

That would be tiny(0.0) which is indeed a very tiny number. epsilon() represents the smallest significant difference between two consecutive real numbers in the given precision.

Copy link
Contributor Author

wassup05 commented Jul 5, 2025

I have reduced the size and updated the tolerance accordingly, I think it should be fine now

loiseaujc reacted with thumbs up emoji

Copy link
Member

perazz commented Sep 24, 2025

In absence of further comments and activity, this can be merged soon imho.

Copy link
Contributor

jalvesz commented Sep 24, 2025

Agreed @perazz, I also think this is ready!

Copy link
Contributor Author

just added some comments to make the example more understandable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Reviewers

@loiseaujc loiseaujc loiseaujc left review comments

@perazz perazz perazz approved these changes

@jalvesz jalvesz jalvesz approved these changes

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

AltStyle によって変換されたページ (->オリジナル) /