Hi Jin,
I think the answer to #1 is complicated. I'm not sure what the difference in error would be between the two methods without doing a careful theoretical error analysis. Barring that, if you're really curious about both methods, or suspect one might not be accurate enough for what you need, then you could just code both of them and plot the results to see how the errors behave.
Regarding BLAS parallelism, that's also complicated. Sometimes we have seen speedups from letting the BLAS use multiple cores, but other times we've seen what you report: it does a bad job and can even hurt speed. Basically this is a feature of your BLAS, which is outside of ITensor and which we do not control. Its effectiveness is also highly dependent on the details of the algorithm you are running, whether DMRG or another algorithm, as well as the sizes of the tensors, their block structure, etc. So again I think your best bet is just to adjust the settings as you have done and see what works best for your needs. You might also get different results with a different BLAS implementation, such as MKL, which is a very high-quality BLAS library.
Best regards,
Miles
P.S. About the BLAS, you might want to write a simple test code which just multiplies two very large matrices, and run it with different OPENBLAS_NUM_THREADS settings to see what results you get. If it doesn't give a speedup even for rather large matrices, you might conclude that multithreading is just not a very useful feature of your BLAS. On the other hand, if you do see a speedup, you should check what matrix sizes are needed to see it and compare those to the typical sizes in your DMRG calculation. Perhaps it could help to turn it on for a very large DMRG calculation in the last few sweeps.
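As one possible sketch of such a test (using Python/NumPy here, since NumPy's matrix multiply calls straight into the system BLAS; the matrix size and thread count below are just example values, and note that the thread-count environment variables have to be set before NumPy is imported, because the BLAS reads them at load time):

```python
import os

# Must be set BEFORE importing numpy. Rerun with "1", "2", "4", ...
# and compare the timings. OPENBLAS_NUM_THREADS is specific to
# OpenBLAS; OMP_NUM_THREADS is a more generic fallback.
os.environ["OPENBLAS_NUM_THREADS"] = "4"
os.environ["OMP_NUM_THREADS"] = "4"

import time
import numpy as np

n = 2000  # example matrix dimension; try a range of sizes
A = np.random.rand(n, n)
B = np.random.rand(n, n)

# Warm-up multiply so the timed run isn't skewed by first-call overhead
A @ B

t0 = time.perf_counter()
C = A @ B
elapsed = time.perf_counter() - t0

# ~2*n^3 floating-point operations for an n x n matrix product
print(f"n={n}: {elapsed:.3f} s  (~{2 * n**3 / elapsed / 1e9:.1f} GFLOP/s)")
```

If the GFLOP/s figure doesn't improve as you raise the thread count, even at large n, that would support the conclusion that BLAS multithreading isn't helping on your machine.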