Implement force and energy accumulation in fixed-point (fixed precision) arithmetic. The current flavors of the atomicAdd_x_y functions in vectype_ops.clh could then drop their costly while loop in favor of the existing atomic functions for integer accumulation.
Evaluate performance, compare the precision of the final results against the current floating-point implementation, and decide whether the kernels should use the fixed-point or the floating-point implementation.
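
The fixed-point accumulation described above can be sketched on the host side as follows. This is only a minimal illustration in plain C11, not the kernel code: the names FIXED_SCALE, fixed_acc_t, to_fixed, from_fixed and atomic_add_fixed are hypothetical, and the 2^32 scale factor and 64-bit accumulator width are assumptions chosen for the example. In an OpenCL kernel the analogous step would presumably use an integer atomic such as atomic_add (32-bit) or atom_add from cl_khr_int64_base_atomics (64-bit) instead of atomic_fetch_add.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed scale: 32 fractional bits, leaving roughly +/-2^31 of integer range
 * in a signed 64-bit accumulator. A real kernel would have to pick the scale
 * from the expected force/energy magnitudes. */
#define FIXED_SCALE 4294967296.0 /* 2^32 */

/* Hypothetical accumulator type: one 64-bit integer per force/energy component. */
typedef _Atomic int64_t fixed_acc_t;

/* Convert a float contribution to fixed point, rounding to nearest. */
static inline int64_t to_fixed(float x)
{
    /* Guard the representable range; values outside it would overflow. */
    assert(x > -2147483648.0f && x < 2147483648.0f);
    double scaled = (double)x * FIXED_SCALE;
    return (int64_t)(scaled + (scaled >= 0.0 ? 0.5 : -0.5));
}

/* Convert an accumulated fixed-point value back to float. */
static inline float from_fixed(int64_t v)
{
    return (float)((double)v / FIXED_SCALE);
}

/* A single integer atomic add replaces the retry loop that a
 * floating-point atomic add needs when no native instruction exists. */
static inline void atomic_add_fixed(fixed_acc_t *acc, float x)
{
    atomic_fetch_add_explicit(acc, to_fixed(x), memory_order_relaxed);
}
```

Because integer addition is associative, the accumulated value does not depend on the order in which work-items add their contributions, unlike a floating-point sum. The cost is a fixed absolute resolution (here 2^-32) and a limited range, which is what the precision comparison needs to quantify.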