My impression is that all the MPI binaries need to be compiled with the same version of MPICH to work together, and that version needs to be the same as the one used for the mpirun command that starts them.
That part actually works OK - I do have a shared /mnt/cluster over NFS where each node is forced to put things under different specific directories (e.g. /mnt/cluster/project/aarch64 vs /mnt/cluster/project/x86_64 etc.), but every system otherwise has a local /home/gtoal and the usual /usr/local/bin etc., where it can have a unique version of any file under the same filename across all systems. But if you look at my first post you'll see that the biggest problem is not filenames, but that somewhere in the system (and I think it may be in the undocumented and obscure 'orted' daemon) there are explicit version number checks.
I've run into this with a Singularity container. The container used Alpine Linux but the cluster used Red Hat. Even though it's all the same architecture, Alpine binaries don't run on Red Hat and vice versa because the C libraries differ (musl on Alpine, glibc on Red Hat). Compiling the same version of MPICH from source both inside and outside the container was the only way forward I found.
I expect the problem and the solution are the same with different operating systems running on physical hardware: compile and install the same version of MPICH from source on all nodes.
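Before rebuilding everything, it may help to confirm which nodes actually disagree. Here is a minimal Python sketch of that check. It assumes you have already collected one version string per node by whatever means suits your setup (for example by running MPICH's mpichversion on each host); the host names, version strings and function names below are made up for illustration and are not part of any MPI tool.

```python
from collections import defaultdict


def group_hosts_by_version(reported):
    """Group host names by the MPI version string each host reported.

    `reported` maps host name -> version string. Surrounding whitespace
    is stripped so a trailing newline does not count as a difference.
    """
    groups = defaultdict(list)
    for host, version in reported.items():
        groups[version.strip()].append(host)
    return {version: sorted(hosts) for version, hosts in groups.items()}


def find_mismatches(reported):
    """Return {} if every host agrees, else the version -> hosts grouping."""
    groups = group_hosts_by_version(reported)
    return {} if len(groups) <= 1 else groups


if __name__ == "__main__":
    # Hypothetical sample data: replace with strings gathered from your nodes.
    sample = {
        "pi4-node": "4.1.2",
        "x86-node": "4.0.3",
        "pi5-node": "4.1.2\n",
    }
    mismatches = find_mismatches(sample)
    if not mismatches:
        print("All nodes report the same MPI version.")
    else:
        for version, hosts in sorted(mismatches.items()):
            print(f"{version}: {', '.join(hosts)}")
```

This only compares the strings as reported, so it tells you which nodes need rebuilding; it says nothing about whether two different versions would happen to interoperate.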
Statistics: Posted by ejolson — Sun Jul 14, 2024 2:38 pm