+1 vote
asked by (760 points)

In the documentation, parallelization is controlled by two environment variables, MKL_NUM_THREADS and OMP_NUM_THREADS. However, in my benchmark, changing MKL_NUM_THREADS has no effect, and the number of CPUs used is fully controlled by OMP_NUM_THREADS. Given this observation, I have the following questions.

Is it correct that MKL_NUM_THREADS controls the number of CPUs used in a matrix-matrix multiplication, and OMP_NUM_THREADS controls the parallelization over different quantum number blocks? If so, does the current version of ITensor only support parallelization over quantum number blocks?

1 Answer

+1 vote
answered by (70.1k points)
selected by
Best answer

Hi Chia-Min,
Your understanding of the role of these two environment variables is correct, assuming that the BLAS you are using is actually MKL. Of course, a user with a different BLAS, such as OpenBLAS, would need to set OPENBLAS_NUM_THREADS instead of MKL_NUM_THREADS.

To answer your last question, within ITensor the only explicit parallelization is over quantum number blocks, with the number of threads used for that controlled by OMP_NUM_THREADS. However, that does not mean ITensor "only" supports that kind of multithreading. Since we use BLAS to do the tensor contractions, if you turn on multithreading for your BLAS, then calls to the BLAS made by ITensor will also be multithreaded. In fact, it is something ITensor does not control itself; you control it by setting MKL_NUM_THREADS or similar.
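To make the two knobs concrete, here is a minimal sketch of setting both variables for a benchmark run and confirming that a child process actually sees them. Python is used only for illustration, and the helper names `thread_env` and `visible_settings` are hypothetical, not part of ITensor. Substitute OPENBLAS_NUM_THREADS for the BLAS variable if you are not using MKL.

```python
import os
import subprocess
import sys


def thread_env(blas_threads, block_threads, blas_var="MKL_NUM_THREADS"):
    """Return a copy of the current environment with both thread counts set.

    blas_var is the variable read by the BLAS (assumed to be MKL here);
    OMP_NUM_THREADS is the one used for the quantum number block threading.
    """
    env = dict(os.environ)
    env[blas_var] = str(blas_threads)
    env["OMP_NUM_THREADS"] = str(block_threads)
    return env


def visible_settings(env, names=("MKL_NUM_THREADS", "OMP_NUM_THREADS")):
    """Start a child process with env and report the values it sees."""
    code = (
        "import os, sys; "
        "print(','.join(os.environ.get(n, 'unset') for n in sys.argv[1:]))"
    )
    result = subprocess.run(
        [sys.executable, "-c", code, *names],
        env=env, capture_output=True, text=True, check=True,
    )
    return dict(zip(names, result.stdout.strip().split(",")))


if __name__ == "__main__":
    # One BLAS thread, four block threads: isolates the block-level threading.
    print(visible_settings(thread_env(blas_threads=1, block_threads=4)))
```

In a real benchmark you would pass the same `env` to the process that runs your ITensor program instead of the small checker used here.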

If you did not see any effect of setting MKL_NUM_THREADS, this could be for a number of different reasons. A less likely one is that it's not properly set by your code or terminal. But more likely, it's one of two other things (or both):
1. there is a competition for resources between the multithreading over the blocks and the multithreading over the matrix data within BLAS
2. many of the blocks or tensors are just too small for the BLAS multithreading to have much of an effect
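
One way to probe the first possibility is to benchmark only those combinations where the two thread counts multiply to the number of cores, so the two levels of threading are not oversubscribing the machine. This is a rule of thumb I am assuming here, not something ITensor enforces, and `thread_splits` is a hypothetical helper written in Python for illustration:

```python
def thread_splits(total_cores):
    """List (block_threads, blas_threads) pairs whose product equals total_cores.

    block_threads would go into OMP_NUM_THREADS and blas_threads into
    MKL_NUM_THREADS (or the equivalent variable for your BLAS).
    """
    splits = []
    for block_threads in range(1, total_cores + 1):
        if total_cores % block_threads == 0:
            splits.append((block_threads, total_cores // block_threads))
    return splits


if __name__ == "__main__":
    # On a 4-core machine this gives the pairs (1, 4), (2, 2) and (4, 1).
    for block_threads, blas_threads in thread_splits(4):
        print(f"OMP_NUM_THREADS={block_threads} MKL_NUM_THREADS={blas_threads}")
```

Comparing timings across these settings, including the two extremes where one of the counts is 1, should show which level of threading is actually helping for your tensors.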

In general, we have seen that BLAS multithreading often does not scale very well, and typically gives only something like a factor of 2 speedup even if more than two threads are used for it.

To be more precise about all these things, please see the benchmarks in the latest version of the ITensor paper, Section 12: https://arxiv.org/abs/2007.14822

Here is a link to the actual code that was used to obtain these benchmarks - I link here to the line that sets the MKL_NUM_THREADS and OMP_NUM_THREADS variables so you can see that is indeed how it is done:

Of course, if you have any follow-up questions, please ask. Also, I might ask Matt Fishman to weigh in since he did those benchmarks and wrote the block-sparse multithreading code.

Best regards,

commented by (760 points)
Thank you for guiding me to the paper. The benchmark is very useful.