Think of the site reordering as a series of pairwise swaps of neighboring sites, e.g.:
s1,s2,s3,s4 -> s1,s2,s4,s3 -> s1,s4,s2,s3
Each step is then one fermionic swap gate: the first implements (s3,s4) -> (s4,s3), and the second implements (s2,s4) -> (s4,s2). The same fermionic swap gate implements the general operation (si,sj) -> (sj,si). For spinless fermions, for example, the fermionic swap gate would be something like:
1 0 0 0
0 0 1 0
0 1 0 0
0 0 0 -1
where the basis is ordered |00>, |01>, |10>, |11> and the `-1` acts on the doubly-occupied state |11>, since swapping two occupied sites exchanges two fermions.
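
As a rough illustration, here is a minimal NumPy sketch of that gate applied to a plain state vector. It is not tied to any particular tensor network library. The names `FSWAP`, `basis_state` and `apply_fswap` are made up for this example, and sites are numbered from 0, so s1..s4 are sites 0..3.

```python
import numpy as np

# Fermionic swap gate for two spinless-fermion sites,
# in the basis |00>, |01>, |10>, |11>.
FSWAP = np.array([
    [1, 0, 0,  0],
    [0, 0, 1,  0],
    [0, 1, 0,  0],
    [0, 0, 0, -1],
])

def basis_state(occ):
    """State vector for an occupation list such as [0, 1, 0, 1] (hypothetical helper)."""
    idx = int("".join(str(n) for n in occ), 2)
    vec = np.zeros(2 ** len(occ))
    vec[idx] = 1.0
    return vec

def apply_fswap(state, i, n_sites):
    """Apply FSWAP to neighboring sites (i, i+1) of a 2**n_sites state vector."""
    psi = state.reshape([2] * n_sites)
    gate = FSWAP.reshape(2, 2, 2, 2)  # axes: (out_i, out_j, in_i, in_j)
    psi = np.tensordot(gate, psi, axes=([2, 3], [i, i + 1]))
    # tensordot puts the gate's output axes first; move them back to i, i+1
    psi = np.moveaxis(psi, [0, 1], [i, i + 1])
    return psi.reshape(-1)

# s1,s2,s3,s4 -> s1,s2,s4,s3 -> s1,s4,s2,s3 with occupations (0,1,0,1)
psi = basis_state([0, 1, 0, 1])
psi = apply_fswap(psi, 2, 4)  # (s3,s4) -> (s4,s3): only one site occupied, no sign
psi = apply_fswap(psi, 1, 4)  # (s2,s4) -> (s4,s2): both occupied, picks up -1
```

In this example the final state is minus the basis state with occupations (0,1,1,0): the first swap moves the fermion without a sign, and the second swap exchanges two occupied sites and contributes the `-1`.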