# Very simple scalar contraction

Hello everyone!

I just discovered ITensor, and I think it's a fantastic library: thank you so much for making it available for Julia too!

I was trying to figure out whether I can use it to speed up the full contraction between a multidimensional array and 5 vectors (this is what my research project currently requires).

So far I have gotten the best performance with Julia's LoopVectorization package (I've tried several).

Here is the full contraction (in Julia). Of course this is a meaningless toy example, intended only to illustrate the situation:

```julia
using LoopVectorization

function contraction(D, A, a, b, c, f, e, indices)
    @turbo for i = 1:D
        a[i] = A[i, indices[1], indices[2], indices[3], indices[4]]
        b[i] = A[i, indices[5], indices[6], indices[7], indices[8]]
        c[i] = A[i, indices[9], indices[10], indices[11], indices[12]]
        f[i] = A[i, indices[13], indices[14], indices[15], indices[16]]
        e[i] = A[i, indices[17], indices[18], indices[19], indices[20]]
    end

    s = 0.0
    @turbo for q in 1:D, l in 1:D, k in 1:D, j in 1:D, m in 1:D
        s += A[m, j, k, l, q] * a[q] * b[l] * c[k] * f[j] * e[m]
    end
    return s
end
```


Here is the performance I measured:

```julia
using BenchmarkTools

indices = rand(1:5, 20);
A = rand(5, 5, 5, 5, 5);
a = zeros(Float64, 5)
b = zeros(Float64, 5)
c = zeros(Float64, 5)
f = zeros(Float64, 5)
e = zeros(Float64, 5)
D = 5

@btime contraction($D, $A, $a, $b, $c, $f, $e, $indices)
# 736.952 ns (0 allocations: 0 bytes)
# 29.147187769886045
```


It is nothing more than a full scalar contraction between a multidimensional array and 5 vectors. I wrote it in such a way that there are no memory allocations.
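In index notation it is just s = Σ A[m,j,k,l,q] e[m] f[j] c[k] b[l] a[q]. For clarity, here is the same quantity as a plain base-Julia loop (a reference sketch without `@turbo`, under the hypothetical name `contraction_plain`):

```julia
# Reference version of the same scalar contraction in plain base Julia
# (no @turbo), just to make the computed quantity explicit. The `indices`
# bookkeeping and the vector-filling loop are dropped here, since they are
# only setup; the quintuple loop is the contraction itself.
function contraction_plain(A, a, b, c, f, e)
    D = size(A, 1)
    s = 0.0
    for q in 1:D, l in 1:D, k in 1:D, j in 1:D, m in 1:D
        s += A[m, j, k, l, q] * a[q] * b[l] * c[k] * f[j] * e[m]
    end
    return s
end
```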

I have read the documentation, but I still don't understand whether and how such a contraction can be implemented with ITensor. Ideally this would run on the GPU, especially when the dimension is much larger than the 5 used in this toy example.

Can anyone give me some advice about it? Thanks a lot in advance, and have a nice day everyone!


Given the arrays `A`, `a`, `b`, `c`, `f`, `e`, you can just do:

```julia
using ITensors

n = ndims(A)
is = Tuple(Index(size(A, d)) for d in 1:n)
Ait = itensor(A, is)
Ms = (a, b, c, f, e)
tn = [itensor(Ms[j], is[j]) for j in 1:n]
pushfirst!(tn, Ait)
contract(tn)
```


For number of dimensions n == 5, I get this result: 9.823 μs (73 allocations: 14.75 KiB), so it is much slower than LoopVectorization. This isn't entirely surprising for such small dimensions, since ITensor uses a transpose-transpose-gemm-transpose (TTGT) approach (https://arxiv.org/pdf/1607.00145.pdf), where we first permute the tensor data so that the contraction can be performed as a matrix multiplication. Additionally, we do the contractions pairwise, and each pairwise contraction potentially allocates a nontrivial amount of intermediate data.
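To make the TTGT idea concrete, here is a rough base-Julia sketch of a single contraction step (my own illustration, not ITensor's actual implementation): permute the array so the contracted index comes last, reshape it into a matrix, and the contraction becomes a single matrix-vector product.

```julia
using LinearAlgebra

# Sketch of one TTGT step: contract index `dim` of tensor A with vector v.
function contract_dim(A::AbstractArray{T,N}, v::AbstractVector, dim::Int) where {T,N}
    perm = (filter(!=(dim), 1:N)..., dim)   # move the contracted index last
    Ap = permutedims(A, perm)               # the "transpose" step
    sz = size(Ap)
    M = reshape(Ap, prod(sz[1:end-1]), sz[end])
    reshape(M * v, sz[1:end-1])             # rank-(N-1) result (the "gemm" step)
end
```

Both the permuted copy `Ap` and the result are freshly allocated, which is where the intermediate data comes from.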

On the other hand, in this case LoopVectorization is able to perform the calculation with no intermediate tensor values, and also does a bunch of tricks to perform the loops in very efficient ways. I'm curious how they compare for larger tensor orders and dimensions, but this very specific tensor contraction that reduces to a single number may be better suited for the strategy that LoopVectorization takes (and in principle we can detect this kind of contraction and dispatch to LoopVectorization for special contractions like this, which we would like to investigate).
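To illustrate why intermediates appear in the pairwise strategy, here is a base-Julia sketch (a hypothetical helper, not ITensor code) applied to this particular contraction: each step absorbs one vector via a matrix-vector product, allocating a rank-4, rank-3, rank-2, and rank-1 intermediate before the final scalar.

```julia
using LinearAlgebra

# Pairwise strategy sketch: contract the last remaining index of T with one
# vector per step. Each step allocates an intermediate one rank lower.
function pairwise_contract(A, vs...)
    T = A
    for v in reverse(vs)   # contract the last index first
        sz = size(T)
        T = reshape(reshape(T, prod(sz[1:end-1]), sz[end]) * v, sz[1:end-1])
    end
    return only(T)         # 0-dimensional array -> scalar
end
```

The fused LoopVectorization kernel computes the same number without ever materializing these intermediates.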

As for performing the calculation on GPU, you should be able to just install the ITensorsGPU package (https://github.com/ITensor/ITensorsGPU.jl) and perform the contraction like this instead:

```julia
using ITensorsGPU

tn_gpu = cu.(tn)
contract(tn_gpu)
```


Note that I've found that using single precision (Float32) gives the best performance on GPU, so you may want to try that.