Ordinarily, if a message isn't available in the current domain, it's copied from another domain through main memory. If, however, an allocator is capable of copying directly from another device (think Nvidia SLI), it can implement that direct copy, which may be faster than staging the message through main memory.
Allow an allocator to express which domains it prefers to copy from.
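A minimal sketch, assuming each allocator owns one domain and lists, in order of preference, the domains it can copy from directly. The names `Allocator`, `Message`, `preferred_sources`, `plan_copy` and the `"host"` label for main memory are all hypothetical and not part of any existing API. The sketch only plans the copy hops; it does not move any bytes.

```python
from dataclasses import dataclass, field

HOST = "host"  # hypothetical label for the main-memory domain


@dataclass
class Allocator:
    """Hypothetical allocator that owns a single memory domain."""
    domain: str
    # Domains this allocator can copy from directly (e.g. a peer GPU),
    # most preferred first. Empty means no direct paths are available.
    preferred_sources: list[str] = field(default_factory=list)


@dataclass
class Message:
    payload: bytes
    # Domains that currently hold a valid copy of the payload.
    resident_in: set[str] = field(default_factory=set)


def plan_copy(msg: Message, dst: Allocator) -> list[tuple[str, str]]:
    """Return the (src, dst) hops needed to make msg resident in dst.domain."""
    if not msg.resident_in:
        raise ValueError("message has no resident copy")
    if dst.domain in msg.resident_in:
        return []  # already available, nothing to copy

    # Preferred path: a direct copy from the most-preferred domain that holds the data.
    for src in dst.preferred_sources:
        if src in msg.resident_in:
            return [(src, dst.domain)]

    # Fallback: stage through main memory.
    if HOST in msg.resident_in:
        return [(HOST, dst.domain)]
    src = min(msg.resident_in)  # any holder; min() only makes the choice deterministic
    if dst.domain == HOST:
        return [(src, HOST)]
    return [(src, HOST), (HOST, dst.domain)]
```

For example, an allocator created as `Allocator("gpu1", preferred_sources=["gpu0"])` gets a single direct hop from `gpu0` when the message lives there, while `Allocator("gpu2")`, with no preferences, falls back to two hops through main memory. Keeping the preference as an ordered list lets an allocator rank several peers (say, by link bandwidth) without the planner needing to know why one is faster.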