Hi Chia-Min,
Your understanding of the roles of these two environment variables is correct, assuming that the BLAS you are using is actually MKL. A user with a different BLAS, such as OpenBLAS, would need to set OPENBLAS_NUM_THREADS instead of MKL_NUM_THREADS.
To answer your last question, within ITensor the only explicit parallelization is over quantum number blocks, with the number of threads used for that controlled by OMP_NUM_THREADS. However, that does not mean ITensor "only" supports that kind of multithreading. Since we use BLAS to do the tensor contractions, if you turn on multithreading for your BLAS, then calls to the BLAS by ITensor will also be multithreaded. ITensor does not control this itself; you control it by setting MKL_NUM_THREADS or the equivalent variable for your BLAS.
If you did not see any effect from setting MKL_NUM_THREADS, there could be a number of different reasons. A less likely one is that it is not being set properly by your code or terminal. More likely, it is one of two other things (or both):
1. the multithreading over the blocks and the multithreading over the matrix data within BLAS are competing for the same resources
2. many of the blocks or tensors are just too small for the BLAS multithreading to have much of an effect
In general we have seen that BLAS multithreading often does not scale very well, and gives only something like a factor of 2 speedup even if more than two threads are used for it.
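If it helps to separate these effects on your own machine, here is a minimal sketch of a Python script that runs a command under a few thread settings and times each run. This is not ITensor code and not how the benchmarks were run; the names `thread_env` and `time_command` and the placeholder command are just for illustration, and I am assuming MKL as the BLAS (pass a different `blas_var` otherwise).

```python
import os
import subprocess
import sys
import time


def thread_env(blas_threads, block_threads, blas_var="MKL_NUM_THREADS"):
    """Return a copy of the current environment with both thread counts set."""
    env = dict(os.environ)
    env[blas_var] = str(blas_threads)            # threads used inside BLAS calls
    env["OMP_NUM_THREADS"] = str(block_threads)  # threads over quantum number blocks
    return env


def time_command(cmd, blas_threads, block_threads, blas_var="MKL_NUM_THREADS"):
    """Run cmd once with the given thread settings; return wall time in seconds."""
    env = thread_env(blas_threads, block_threads, blas_var)
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    return time.perf_counter() - start


# Placeholder command (an assumption): it only prints the two variables as the
# child process sees them. Replace it with your own executable and arguments.
cmd = [
    sys.executable,
    "-c",
    "import os; print(os.environ['MKL_NUM_THREADS'], os.environ['OMP_NUM_THREADS'])",
]

# Compare: no threading, BLAS threading only, block threading only.
for blas_threads, block_threads in [(1, 1), (4, 1), (1, 4)]:
    seconds = time_command(cmd, blas_threads, block_threads)
    print(f"BLAS threads={blas_threads}, block threads={block_threads}: {seconds:.3f} s")
```

With the placeholder command, each run just echoes the values the child process received, which is a quick way to confirm the variables are reaching your program at all. With your real executable, comparing the three timings should show which kind of threading (if any) is helping for your tensor sizes.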
To be more precise about all these things, please see the benchmarks in the latest version of the ITensor paper, Section 12: https://arxiv.org/abs/2007.14822
Here is a link to the actual code that was used to obtain these benchmarks. I link to the line that sets the MKL_NUM_THREADS and OMP_NUM_THREADS variables so you can see that this is indeed how it is done:
https://github.com/ITensor/ITensorBenchmarks.jl/blob/12e3a1f0ff3e587fd026d22d79a00bf36668cb34/src/runbenchmarks.jl#L234
Of course if you have any followup questions please ask. Also I might ask Matt Fishman to weigh in since he did those benchmarks and wrote the block-sparse multithreading code.
Best regards,
Miles