Skip to content

Suggestions for improving code performance #17

@xjb714

Description

@xjb714
uint64_t
encode_ten_thousands(uint64_t hi, uint64_t lo)
{
    uint64_t merged = hi | (lo << 32);
    /* Truncate division by 100: 10486 / 2**20 ~= 1/100. */
    uint64_t top = ((merged * 10486ULL) >> 20) & ((0x7FULL << 32) | 0x7FULL);
    /* Trailing 2 digits in the 1e4 chunks. */
    uint64_t bot = merged - 100ULL * top;
    uint64_t hundreds;
    uint64_t tens;

    /*
     * We now have 4 radix-100 digits in little-endian order, each
     * in its own 16 bit area.
     */
    hundreds = (bot << 16) + top;

    /* Divide and mod by 10 all 4 radix-100 digits in parallel. */
    tens = (hundreds * 103ULL) >> 10;
    tens &= (0xFULL << 48) | (0xFULL << 32) | (0xFULL << 16) | 0xFULL;
    //tens += (hundreds - 10ULL * tens) << 8;
    tens = (hundreds << 8) - 2559ull * tens;

    return tens;
}

Explanation:

decimal: a*10+b=x
BCD: a+256*b=a+256*(x-a*10)=a+256*x-2560*x=256*x-2559*a=256*x-2559*(x/10)

I haven't conducted performance tests, but the above implementation has fewer instructions and fewer instruction dependencies, so there should be some performance improvement.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions