Training on heterogeneous CPUs

Modern desktop CPUs are often heterogeneous, i.e. they mix performance (P) and efficiency (E) cores. If you launch the data loader with the default number of workers, the pipeline can become bottlenecked by the much slower efficiency cores and slow down training. This effect can be quite pronounced: for example, on a 13th Gen Intel(R) Core(TM) i7-13700K, training with 24 loader workers runs at about 350 UTT/s, while training with only 8 workers (one per performance core) runs close to 500 UTT/s. To determine how many performance cores your machine has, run lstopo --of console (you may need to apt install hwloc first), then set --loader_workers_per_gpu= accordingly.
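
If you prefer to detect the worker count programmatically instead of reading lstopo output by hand, the sketch below is one way to do it on Linux. It assumes the /sys/devices/cpu_core/cpus file that recent kernels expose for hybrid Intel parts; the helper name and fallback behaviour are illustrative, not part of this project.

    # Sketch (assumption): count the logical CPUs backed by performance cores on a
    # hybrid Intel CPU under Linux by reading the kernel's cpu_core PMU CPU list.
    # Falls back to the total CPU count on non-hybrid machines.
    import os

    def p_core_threads() -> int:
        try:
            with open("/sys/devices/cpu_core/cpus") as f:
                spec = f.read().strip()        # e.g. "0-15" or "0-7,16-23"
        except FileNotFoundError:
            return os.cpu_count() or 1         # not a hybrid CPU: use everything
        count = 0
        for part in spec.split(","):
            lo, _, hi = part.partition("-")
            count += (int(hi) - int(lo) + 1) if hi else 1
        return count

    if __name__ == "__main__":
        # This counts logical CPUs on P-cores; halve it if Hyper-Threading is
        # enabled to get physical performance cores (the 8 in the example above).
        print(p_core_threads())

Since the flag is per GPU, you would also divide the result by the number of GPUs on the node if you train with more than one.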