# Random IQIndex Mismatch for long runtimes


[UPDATE: Problem also happens in unitary time evolution, see below]

Hello everybody,

I've come across a weird error with which I've been struggling for quite some time now. I am trying to implement a stochastic wave function approach (arXiv:quant-ph/9806026, see sec. II.A) for the time evolution in open quantum systems. This method combines non-unitary time evolution of wave functions under a non-Hermitian Hamiltonian @@H = H_{\text{closed}} - i \gamma C^{\dagger} C@@ with the stochastic application of jump operators.

For this I use IQTensors for the wave functions and gates for the bond operators of a second-order even-odd time evolution. The model is a Heisenberg model using the SpinHalf site set. Most of the time this gives the right results (compared to ED), but after longer times I get an index mismatch error:

----------------------------------------
IQIndexSet 1 =

IQIndex(S=1/2 2,2,Site,416) <Out>
(Up 2,1,Site,957) QN(1)
(Dn 2,1,Site,622) QN(-1)

----------------------------------------
IQIndexSet 2 =
IQIndex(uc,4,Site,416) <Out>
(c0,1,Site,216) QN(2)
(c1,2,Site,631) QN(0)
(c2,1,Site,174) QN(-2)

IQIndex(S=1/2 2,2,Site,416) <In>
(Up 2,1,Site,957) QN(1)
(Dn 2,1,Site,622) QN(-1)

----------------------------------------
Mismatched IQIndex IQIndex(S=1/2 2,2,Site,416) <Out>
(Up 2,1,Site,957) QN(1)
(Dn 2,1,Site,622) QN(-1)


A backtrace of the error shows that it occurs in the position() function, during the SVD in the normalization process (see below). I also verified this with couts right before and after the function call. Interestingly, the error is not deterministically reproducible. If I store the MPS just before entering the position() call where it crashes with writeToMPS(), reload it and perform position(), everything works as expected. Also, the point where it crashes differs between runs of the same parameter set (incl. the same seed). I therefore suspected the problem might originate in some dynamic memory error. To circumvent this I tried allocating the MPS explicitly on the heap

MPSt<IQTensor> *psi = new IQMPS(psi0);


where psi0 is an InitState, but that didn't solve the problem.

Backtrace of error:

[cmnode006:296505] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x2b9079850475]
[cmnode006:296505] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x180) [0x2b90798536f0]
[cmnode006:296505] [ 3] condor_exec.exe() [0x9f1dcb]
[cmnode006:296505] [ 4] condor_exec.exe(itensor::detail::checkArrows(itensor::IndexSetT<itensor::IQIndex> const&, itensor::IndexSetT<itensor::IQIndex> const&, bool)+0x3cc) [0xa7400c]
[cmnode006:296505] [ 5] condor_exec.exe(itensor::ITensorT<itensor::IQIndex>::operator*=(itensor::ITensorT<itensor::IQIndex> const&)+0x85d) [0xa8175d]
[cmnode006:296505] [ 6] condor_exec.exe(itensor::Spectrum itensor::svd<itensor::ITensorT<itensor::IQIndex> >(itensor::ITensorT<itensor::IQIndex>, itensor::ITensorT<itensor::IQIndex>&, itensor::ITensorT<itensor::IQIndex>&, itensor::ITensorT<itensor::IQIndex>&, itensor::Args)+0x466) [0xa079d6]
[cmnode006:296505] [ 7] condor_exec.exe(itensor::Spectrum itensor::orthMPS<itensor::ITensorT<itensor::IQIndex> >(itensor::ITensorT<itensor::IQIndex>&, itensor::ITensorT<itensor::IQIndex>&, itensor::Direction, itensor::Args const&)+0x24a) [0xb8ecea]
[cmnode006:296505] [ 8] condor_exec.exe(itensor::MPSt<itensor::ITensorT<itensor::IQIndex> >::position(int, itensor::Args const&)+0xc2) [0xb97eb2]
[cmnode006:296505] [ 9] condor_exec.exe(local_magnetization(itensor::MPSt<itensor::ITensorT<itensor::IQIndex> >, itensor::BasicSiteSet<itensor::SpinHalfSite>, int)+0x4f) [0xa051ef]


I also once got the error message

Index::write: Index is default initialized


This might be an effect of the commonIndex() function used in orthMPS(). Generally I see the error across the whole parameter space, but it happens earlier (in terms of runtime) for smaller systems with lower cutoff, which suggests it depends loosely on the number of gates applied. I also tried a purely Hamiltonian evolution, i.e. without jump operators, but the program still crashes.

Does anyone have an idea what might be going on here? I would appreciate any kind of help.
Best,
Stefan

P.S.: for completeness, I attach part of the code:

MPSt<IQTensor> *psi = new IQMPS(psi0);
*psi *= 1./norm(*psi);
compute_quantum_trajectory(sites, *psi, tensor_gates,
                           ofile_name, N, jump_operators,
                           uni, args, nstart, nfinal, Measurements);


void compute_quantum_trajectory(SpinHalf &sites,
MPSt<IQTensor> &psi,
vector< pair< int , IQTensor > > &tensor_gates,
string ofile_name,
int N,
vector<JumpOperator> &jump_operators,
my_uni_RNG uni,
Args args,
int nstart,
int nfinal,
vector<TimeSeries> &Measurements
)
{
{
    int meas_counter = 0;

    double eta = uni(); // draws a random number from the boost RNG
    for(int step = nstart; step < nfinal; ++step)
    {
        measure(psi, sites, Measurements, step, ofile_name, args);

        do_non_unitary_time_evolution(tensor_gates, psi, args);

        // Waiting-time scheme: keep the same eta until the squared norm
        // drops below it, then apply a jump, renormalize and draw a new eta
        if(norm(psi) * norm(psi) >= eta) continue;

        JumpOperator Li
            = choose_quantum_jump_operator(psi, sites, jump_operators, uni, args);
        psi.Anc(Li.site) *= sites.op(Li.operator_string, Li.site);
        psi.Anc(Li.site).noprime();

        psi *= 1./norm(psi);

        eta = uni();
    }
}


void
do_non_unitary_time_evolution(vector< pair< int , IQTensor > > &tensor_gates,
MPSt<IQTensor> &psi,
Args args)
{
    for(auto& G : tensor_gates)
    {
        auto b = G.first;
        psi.position(b, args);

        // Contract the two site tensors, apply the gate, remove primes
        IQTensor AA = psi.A(b)*psi.A(b+1);
        AA *= G.second;
        AA.noprime();

        // Split back into two site tensors; absorb the singular values
        // into the right-hand tensor
        IQTensor D;
        svd(AA, psi.Anc(b), D, psi.Anc(b+1), args);
        psi.Anc(b+1) *= D;
    }
}


[UPDATE]
As mentioned by G. Misguich, the problem also appears for Hamiltonian time evolution. I tested it with a code very similar to the one provided in itensor/tutorial/05_gates/gates.cc.

commented by (150 points)
Hi. Just a brief comment: I am facing the same kind of error (Index Mismatch)
when performing some long real-time (unitary) evolutions on simple 1D spin-half models. As noted by swolf, the point where it crashes is different for different runs of the same parameter set.
commented by (250 points)
I also see the problem in the unitary time evolution and updated the question.
commented by (1.2k points)
Hello, I also came across this kind of problem (Index Mismatch) a long time ago, when I was calculating the unitary time evolution for the Bose-Hubbard model using "gates.cc" from the tutorial. It happens occasionally; usually running the program again would finish the calculation successfully. I wanted to mention this earlier but didn't know how to describe it. Thanks.


Hi, so unfortunately this is the toughest kind of bug to debug (it occurs at random times and only at long times). We had a bug like this in version 1 of ITensor and, though I never explicitly found it, we believe it had to do with memory management in the matrix layer, which has since been completely rewritten from scratch to be more memory safe.

Hopefully the bug you're seeing isn't tied to a memory management issue in ITensor, but it's hard to know a priori.

To reproduce it I would need more than just parts of your code; a complete working (minimal) code would be very helpful.

One other possibility is that there is a memory bug in your driver code that only shows up later inside ITensor but is not actually caused by ITensor. This happened once a few years back with a driver code I was using that silently overstepped an array bound.

Let me know how you want to proceed. If you email me a working driver code I could take a look at it and run it myself to see if I get the same bug.

Miles

commented by (250 points)
Hi Miles,

Thank you for offering your help! I have uploaded a most minimal version of a time-evolution simulation in which the error is encountered to GitHub (https://github.com/stefwolff/TimeEvolutionItensor/) and also added the input file and Makefile used.

The error occurs very rarely: usually I start 10 time-evolution simulations simultaneously on single compute cores in a cluster, and the first crash typically happens after a few hours, depending on the parameter set. I have also seen that the time of occurrence seems to depend on how many time steps have been executed, i.e. for a comparably large cutoff m (~100) I could run simulations for more than one week without encountering any issues. That's why I defined very long end times and a very low cutoff (m=15).

I also got different kinds of errors. Specifically with the provided code I have seen the following two:
++++++++++++++++++++++++++++++++++++++++++
terminate called after throwing an instance of 'ITError'
what():  Duplicate indices in index set
++++++++++++++++++++++++++++++++++++++++++

and
++++++++++++++++++++++++++++++++++++++++++
----------------------------------------
IQIndexSet 1 =

IQIndex(S=1/2 19,2,Site,159) <Out>
(Up 19,1,Site,951) QN(1)
(Dn 19,1,Site,528) QN(-1)

----------------------------------------
IQIndexSet 2 =
IQIndex(uc,2,Site,679) <Out>
(c0,1,Site,730) QN(19)
(c1,1,Site,26) QN(17)

IQIndex(S=1/2 19,2,Site,159) <In>
(Up 19,1,Site,951) QN(1)
(Dn 19,1,Site,528) QN(-1)

----------------------------------------

Mismatched IQIndex arrows
++++++++++++++++++++++++++++++++++++++++++

Somehow indices seem to be getting overwritten by others here.

I also tested the time evolution in two other scenarios, one using toExpH as explained in the tutorial, and one using an even-odd time evolution but calculating exp(-itH) numerically exactly using diagHermitian(); both of them also crashed randomly with similar errors after relatively long times. If it would help, I can also upload the code for those.

In one case I could backtrace the error; although this was for the stochastic-sampling case, it might be helpful anyway. I also added the output to GitHub (backtrace.txt).

I also hope that there is a simple explanation for this, I really like the library and it is very convenient to work with ITensor!

Once again, thank you very much; if there is anything I can do please let me know :-)
Best,
Stefan
commented by (70.1k points)
Hi Stefan,
Sorry progress has been so slow on this. Just wanted to let you know I haven't forgotten about it, but (a) it may not be an easy bug to find and (b) this is the very busiest time of the year for me.

In the meantime, is there a possible workaround that can help you keep going on your research? For example, can you write the wavefunction to disk after a certain amount of time steps then restart the calculation? That way you can get the same result without needing to run the code for as long in a single calculation.
commented by (250 points)
Hi Miles,

Thanks for keeping me up to date and for your effort so far. I've tried to build a workaround, but it's a little difficult. I originally parallelized the program with MPI, since the number of time-evolved wave functions must be at least of the order of ~1000 to give reasonable Monte Carlo averages. The problem with this is that if one of the wave functions hits this memory problem, the whole program is terminated and all the progress of the currently evolving wave functions is lost. Saving the MPS of all currently (parallel) evolving wave functions every once in a while therefore does not seem practical. I did implement checkpointing after each Monte Carlo update though, so the simulation is restartable from the last finished trajectory.

I tried to parallelize manually then, so that if the bug appears in one trajectory calculation, it doesn’t affect the others. The implementation of this is for now a bit clumsy, and still, a lot of wave functions (roughly 1 out of 10) are lost due to program crashes.

Stefan
commented by (70.1k points)
Ok that's helpful to know e.g. that it's a pretty serious issue for you.

One other thing that might shed some light: have you tried switching to a different BLAS / LAPACK library or version? Perhaps there is a memory bug in the version you are using that is causing these crashes. I will still check to see if I can reproduce the bug myself as well.
commented by (250 points)
Yes, unfortunately it's really quite harmful at the moment. I will try switching between different LAPACK versions, this might take a while, as I need to coordinate with the administrator of the HPC-cluster I'm using. I will let you know as soon as there are news on that side.
Best,
Stefan
commented by (70.1k points)
So the good news is I've been able to reproduce your bug. So thanks for sending the sample code so I could do that.

Also I was able to reproduce it with two different LAPACK implementations (MKL and Apple Accelerate/Veclib) so it does appear to be a bug within ITensor.

The bad news is that the bug exists at all, because as I mentioned these are tough to find and fix. Hopefully I can find it by using automated memory-checker tools such as valgrind. Of course I'll let you know if I find it, but it could be a while (the last time we had a bug like this we never found it over a period of 2 years! Hopefully we'll have better luck this time).
commented by (250 points)
Ok, thanks for testing. I hope this time it is easier to find. I will then continue to work around the error as well as possible. I'm sorry about the bad news!
commented by (250 points)
Hi Miles,
I have done a bit more research to narrow down the cases in which the error occurs; maybe this helps in finding the bug. I uploaded a few more minimal examples to https://github.com/stefwolff/TimeEvolutionItensor. I have done the following things:
(1) svd_and_MPS_contractions.cpp:
- repeatedly contracting local ITensors of an MPS
- repeated application of an SVD on randomly created tensor(s)
=> program terminates
(2) only_svd.cpp:
- repeated application of an SVD on randomly created tensor(s)
=> program terminates
(3) only_MPS_contraction.cpp:
- repeatedly contracting local tensors of an IQMPS (also including the definition of gates)
=> so far no crashes

The error messages are:
terminate called after throwing an instance of 'ITError'
what():  Duplicate indices in index set
or:
A must be matrix-like (rank 2)

I think these error messages correspond to the IQIndex mismatch for IQTensors with conserved quantum numbers, as those cases were also sometimes trying to contract the same indices (same id but different arrow direction). So without knowing precisely where the error originates, it seems only to appear if I use an SVD operation (it also shows up, for example, when calling position() to normalize in case (3)). I hope this is useful.
Stefan
commented by (250 points)
Hi Miles,
I did a lot of testing in the last days, and have found a potential candidate for the source of the error:
I used the following test code:

auto k = Index("index k",5);
auto l = Index("index l",15);
auto m = Index("index m",5);
auto T = randomTensor(k,l,m);

ITensor U(k,m),S,V;
svd(T,U,S,V,{"Cutoff",1E-12,"Maxm",10});

In the svd() function in decomp.h, the vector Uinds, which contains the indices that should go to the left side of the SVD (i.e. here k and m), in some cases (those which crash) contains a set of indices like this:

So the index l is mistakenly also moved to the left side and later also given to the creation of the combiner. I think this is because the id_ is the same for l and m (737) and, as the prime level is also the same, both indices are treated as equal by the overloaded == operator. As a result the rank-2 matrix is different than expected (sometimes it isn't really rank 2 at all).

I think the source of this is the random generation of the index IDs with the Mersenne-Twister generator in the Index constructor. I don't think this guarantees unique index IDs. Is there any reason, which I haven't thought of so far, for using random numbers as IDs?

I tried a quick workaround by assigning the index IDs incrementally, starting from zero. A program containing only the SVD application, which had produced the mentioned error after ~2h in one out of 10 runs, has now been running for 18h across 40 jobs. I also tested this in the context of the stochastic wave functions (incl. everything as mentioned before, i.e. IQTensors, MPI, ...), which also seems to be ok for now, although I think a bit more careful testing needs to be done.

Best wishes,
Stefan
commented by (70.1k points)
Hi Stefan,
Thanks for working more on this. I have been meaning to work on it myself, but have a backlog of other work. Maybe it's for the best because I'm not sure I would have thought of the index id's as being the culprit.

That's great news about the 18h runtime. Fingers crossed that it keeps working for a much longer time even.

First of all, I should point out that the id's which are printed are only a small piece of the full id number. So even though you see two indices with 737, their actual id numbers could be different. To really know, you can call J.id() for an index J to get its full id number, which I think is a 64 bit long integer.

So if changing the ids to be incremental fixes the bug, then it seems like the most logical reason is either that the random number generator itself has a memory error (seems unlikely since I got the bug too and it would probably be caught and fixed by others already) or that indeed some indices are getting the same id number. It's surprising since the Mersenne twister generator is supposed to have quite a long period (2^{19937}-1).

Kind of a puzzle but if the sequential id's do fix the issue we may move to that for the foreseeable future instead of random id's.

Best regards,
Miles
commented by (250 points)
Hello,
thanks for clearing up the reason for using the RNG. I have redone the runs, printing Uinds[i].id():

So the indices are, to my knowledge, really the same. As far as I understand from http://www.cplusplus.com/reference/random/mt19937/, the std::mt19937 RNG returns random numbers from [0, 2^32-1], and 2^{19937}-1 is the period of the random-number sequence. So the random numbers created in one sequence are not necessarily unique, although the probability of two indices sharing the same number is quite low. But for, as in my case, a few hundred thousand time evolutions, the collision probability (roughly the ratio of the number of created indices to 2^32) is not negligible anymore (it should be of order 10^-1 or 10^-2). I think similar things can happen in single (very long) time evolutions, where indices are constantly created in the SVD.

So I’m still optimistic, that this might have been the problem. I still haven’t seen crashes on the runs using the incremental assignment of IDs.

Best,
Stefan
commented by (70.1k points)
Ah ok! This is very interesting. Good points about the non-uniqueness of the random numbers and their range. So it seems like one solution could be to change the RNG type to std::mt19937_64, which returns 64-bit integers.

When you made the id's incremental, did you use a 32 or 64 bit integer type (and signed or unsigned)?
commented by (250 points)
I used a usual signed 32-bit int and started from 0. I'm not completely sure, but shouldn't the RNG give (pseudo-)random integer numbers from the interval [0, 2^32-1], and not just a randomly ordered sequence of all integers in the interval? In that case one could even draw the same ID for the first two created indices if one is "unlucky". Then I think 64-bit ints would reduce the probability of getting the same ID, but would not eliminate the problem fully.
commented by (70.1k points)
Yes, you are right that even the first two numbers generated could in principle be the same. So 64 bit wouldn't guarantee it couldn't happen, but for a well-designed RNG the odds of this should be quite small at 64 bit, since 2^64 ~ 1E19. Or at least I should research it more to see if there are theoretical estimates of how likely it is to happen for various RNGs besides the Mersenne twister.

But the random id's are not really needed for most purposes anyway. I put them in when I was doing parallel DMRG because otherwise I had issues if I didn't set the starting numbers on the separate computers at a different value. So it would be nice to have them but we can go back to sequential and then put in random as a compile-time option, or do more testing until we're comfortable that it will definitely work.
commented by (250 points)
Ah ok, I see; in that sense I agree. I think using a 64-bit RNG might also be a practical solution in my application, since the probability of getting the same indices is suppressed by another factor of 2^-32 compared to the previous one, so again almost vanishingly small.
commented by (70.1k points)
To follow up, I just tested your code running it about 6 times with the 64-bit rng and I didn't observe any crashes. So it seems like going to 64 bit does fix it.

I may go to sequential numbering, however, for the simple reason that we could keep using 32 bit numbers. Changing the id size would break the write-to-disk feature for people who have written tensors to file with the previous code.

Thanks again for your report and let me know if the problem does show up again even with sequential or 64 bit id's.
commented by (250 points)
ok, that's good to hear. Thanks for the testing and for the support! Good to know that this issue is most likely solved.