Let’s say that we have an integer value representing a 64-bit array (or bitmap) and we only care about some of the bits, say, every 9th one (i.e., the 1st, 9th, 18th, all the way to the 64th). We would like to extract and gather those bits as efficiently as possible, i.e., create an 8 bit value containing just those bits, in order.

Would you believe that it is possible to do so with just 3 arithmetic operations?

As motivation for this problem, many board games are played on an 8x8 board (http://www.boardgamegeek.com lists more than one hundred games with the tag 8x8). My own favorite board game, Othello (aka Reversi), is played on an 8x8 board. Such boards can be efficiently represented using 64-bit unsigned integers. For instance, an Othello playing software might represent such a board using two bitmaps (or bitboards), one for white and one for black. Taking every 9th bit, from the least significant to the most significant one, means extracting the main diagonal.

A trivial solution would involve looking at every bit by bitwise-AND-ing with pre-calculated masks, and then setting a bit in the destination variable. That solution would require dozens of operations.

We could also use lookup tables, e.g., a 65536-entry lookup table could be used to extract the first two bits (a1 and b2), another such lookup table for c3 and d4, for a total of 4 lookups in 256 KiB of data. The total cost would be 3 shifts, 4 loads and 4 bitwise and operations. The problem with lookup tables, particularly larger ones, is that they interfere with the cache hierarchy. Such a solution may perform well in a microbenchmark. In a more realistic setting, however, its behavior will likely disappoint.

We could also use lookup tables, e.g., a 65536-entry lookup table could be used to extract the first two bits (a1 and b2), another such lookup table for c3 and d4, for a total of 4 lookups in 256 KiB of data. The total cost would be 3 shifts, 4 loads and 4 bitwise and operations. The problem with lookup tables, particularly larger ones, is that they interfere with the cache hierarchy. Such a solution may perform well in a microbenchmark. In a more realistic setting, however, its behavior will likely disappoint.

Solution

Can we do any better? My proposed solution uses 3 arithmetic operations and no loads.

The first operation is a bitwise AND that clears up all the non-diagonal bits (e.g., b4). The AND mask corresponding to the diagonal bits is 0x80 40 20 10 08 04 02 01.

The second instruction is, perhaps surprisingly, multiplication. To understand how multiplication can help us gather the bits, let’s take a look at how multiplication works in binary. Let’s say that we have an arbitrary 3-bit base-2 number, ABC(2), that we would like to multiply by 10001(2). The pen-and-pencil multiplication diagram in base 2 is:

It does not particularly matter whether each of A, B, C, is either 0 or 1. The result is ABC0ABC(2). The key observation is that, at a binary level, multiplication makes transposed (shifted) copies of the first operand wherever the second operand has a bit of 1, and adds all the transposed copies together. Of course, multiplication is commutative, therefore you can swap the operands in the previous statement - and in the diagram.

We will use this particular property to gather the evenly-spaced bits. For simplicity, let’s consider a 16 bit array, or 4x4 bitmap, and we would like to extract every 5th bit.

In binary, the number would be D0000C0000B0000A(2). The top-left corner is the least significant bit, and the bottom-right corner is the most significant bit.

How might we produce DCBA(2) somewhere in the output result?

Let’s start with a simpler step - how do we bring C before D? Assuming that we keep D in its current place - 16th bit - we could bring C from position 11 to position 15 by shifting left by four bits the initial number, and then superimposing it with the unshifted number. Thankfully a multiplication via 10001(2) achieves exactly that. Notice, there are only three 0s in between the two 1s. In the cleaned-up number, there are 4 zeroes between A, B, C and D.

For brevity I’m not showing the zero rows in the multiplication diagram.

Moving on, to bring B on position 14 from position 6, we need to left shift the original number by 8 bits, which is equivalent to multiplying the first operand by 100000000(2). Combining that with 10001(2), we get 100010001(2)

Great, now we have DCB on positions 16, 15 and 14.

It should not be particularly surprising that multiplying by 1000100010001(2) produces the DCBA string between positions 16 and 13.

We can now shift right by 13 and retain the first 4 bits (and with 0xF) to obtain our result. But wait, you might say, isn’t that a total 4 operations? For the 4x4 bitmap, that is correct.

For the original problem, of extracting 8 diagonal bits from an 8x8 bitmap, we multiply the 64 bit number by

100000001000000010000000100000001000000010000000100000001(2)

We obtain the diagonal bits on positions 63-56. For a 64 bit unsigned integer, those are the most significant 8 bits. In the C language, unsigned integer overflow is well-defined and keeps the low-order bits of the result. After multiplying the two unsigned 64 bit integers, all that is needed is a shift right by 56 bits.

#include <stdint.h>

uint64_t extract_diagonal(uint64_t bitmap) {

return ((uint64_t) ((bitmap & 0x8040201008040201ULL) *

0x0101010101010101ULL)) >> 56;

}

}

Actually, if we wanted to extract the diagonal bits from a 16-bit bitmap, we could do it with 3 operations as well. To obtain our DCBA string at position 31-28, we would multiply not by 1000100010001(2), but by 1000100010001 0000000000000000(2). Essentially, we combine the multiplication with a left shift (which is the same as a multiplication by a power of 2), so that overflow does the work of cleaning up the upper junk bits. We would use uint32_t, and then shift right by 28 bits.

Isn’t multiplication powerful?

Performance

What kind of assembly code is generated for the above?

Using a 64-bit x86 linux box, with gcc 4.7.3 in opt mode (-O3), I obtained the following assembly snippet. Note that objdump normally uses the AT&T assembly style, which puts the output operand last; this is different from Intel assembly style, which puts the output first.

movabs $0x8040201008040201,%rdx

and %rdx,%rax

movabs $0x101010101010101,%rdx

imul %rdx,%rax

shr $0x38,%rax

We start with the input value in register %rax . We have two instructions to load constants, and the three arithmetic operations - and, multiply and shift right.

In terms of instruction latency, the and, shr and movabs operations normally each take a cycle. According to http://www.agner.org/optimize/instruction_tables.pdf, 64 bit multiplication takes between 3 and 6 cycles on modern processors. The first movabs and the three arithmetic operations execute serially, because they depend on the previous instruction’s input. However, the second and third instruction may execute in parallel, because they are independent. So I estimate between 6 and 10 cycles.

Of course, this is just an example of generated code - the compiler may do something different. By the way, if the code is compiled for a 32 bit architecture, the resulting assembly code will not be as efficient, because all 64 bit operations will be simulated via 32 ones. It is probably more efficient to extract 4 bits from each of the 32 bit words (upper and lower) and combine them afterwards.

As is generally the case with modern processors that are superscalar (i.e., they can schedule multiple instructions or micro-ops per cycle) and execute instructions out-of-order (i.e., they can execute independent instructions in a flexible order, e.g. the second movabs before the first one), your mileage may vary. Always measure, and also keep in mind Amdahl’s law!

Correctness

Now you may wonder - why does it work? Or, are there cases where this may not work?

The multiplication trick works as long as there is no carry between two columns. In other words, if you look at any column in the multiplication diagram, there should be a single bit that is potentially non-zero. In plain English, that can be achieved if the gap between the bits we would like to extract is sufficiently large to hold all the collapsed bits.

If we need to collect k bits, at positions c, c+n, c+n*2 … c+n*(k-1), then n should be greater than or equal to k.

For the diagonal example, n was 9 and k was 8 (and c was 0). If n and k were equal to each other (8) then the would extract a column.

However, the algorithm cannot extract the second diagonal (a8 to h1), because n would be 7 and k 8. It could extract 7 bits off the second diagonal, and then 8th bit could be patched using more conventional approaches.

Let’s take a look at the second diagonal example.

Using the 4x4 example, the elements of the second diagonal are 2 bits apart.

The multiplication diagram shows how the overlapping of C and A clobbers, by producing a carry bit, all the useful higher order bits.

Note that I did not include the trailing three zeros in the original bitmap - those bits are inconsequential, i.e. overflow occurs independently of them.

Note that I did not include the trailing three zeros in the original bitmap - those bits are inconsequential, i.e. overflow occurs independently of them.

A meta algorithm: Python implementation

The meta algorithm takes c, n and k and produces three values: the AND mask, the multiplier and the final shift amount.

The AND mask simply has c, c+n, c+n*2 … c+n*(k-1) bits set to one, and all other bits set to zero.

The final shift value is also easy to determine, as the algorithm places the k bits at the end of the target integer. Assuming a 64 bit integer, the shift value is 64-k.

The multiplication mask is a composite of two factors:

- The mixing factor: the value given by setting the 0, n-1, (n-1)*2 … (n-1)*(k-1) bits to 1, and all other bits to zero. Note that I’m considering bit 0 to be the least significant bit.
- The shift required to bring the result to the most significant bits. It is meant to bring the last element, c+n*(k-1), to the most significant bit, 63, so its value is 63 - c - n*(k-1).

A sample Python code to generate these values:

def repeatedBit64(start, count, step):

assert start + (count - 1) * step < 64

value = 0

for position in range(0, count):

value |= 1 << (start + position * step)

return value

def bitGatherTerms(first, count, step):

assert step >= count

andMask = repeatedBit64(first, count, step)

multiplier = repeatedBit64(0, count, step-1) * \

(1 << (63 - first - (count - 1) * step))

rightShift = 64 - count

return (andMask, multiplier, rightShift)

Note that Python’s integers are arbitrarily long, as opposed to C’s integers which are fixed size. To test the above constants in Python, after performing the multiplication one needs to AND the result with 0xffffffffffffffff to keep only the lower 64 bits.

The download section includes a python generator/tester based on the above snippet and also a C++ metaprogramming solution.

The download section includes a python generator/tester based on the above snippet and also a C++ metaprogramming solution.

Bit reversal

In the previous examples, we took bitmaps with useful bits that were n positions apart and combined them with a mask made of bits n-1 positions apart. What if we instead used a mixing factor made of bits n+1 positions apart?
Let’s use the secondary diagonal example, but with the new mixing factor.

Wow … the new mixing factor not only reversed the bits - from D00C00B00A(2) it produced an ABCD string - it also bought us an extra position. Remember, we were not able to extract the secondary diagonal via the direct mixer, because of carry/overflow. However, the reverse mixing factor gave us the extra bit to prevent overflow.

In other words, we can extract the second diagonal as long we don’t mind the bit reversal. For a symmetric board game such as Othello/Reversi, reversal is generally not a problem.

It should be fairly obvious from the above diagram that only a single bit was gained by using the reversing mixing factor - if the original bits were any “tighter”, carry would occur.

The mixing factors we use are composed of evenly-spaced 1 bits that were either step+1 or step-1 bits apart, where step represents the interval of useful bits in the input. The reason for the “one off” is simple - it produces a result where the useful bits are adjacent. If we used a mixing factor with 1 bits every step+2 positions, the bits would be 2 positions apart, possibly intermixed with other “stuff” - a less useful pattern.

In other words, we can extract the second diagonal as long we don’t mind the bit reversal. For a symmetric board game such as Othello/Reversi, reversal is generally not a problem.

It should be fairly obvious from the above diagram that only a single bit was gained by using the reversing mixing factor - if the original bits were any “tighter”, carry would occur.

The mixing factors we use are composed of evenly-spaced 1 bits that were either step+1 or step-1 bits apart, where step represents the interval of useful bits in the input. The reason for the “one off” is simple - it produces a result where the useful bits are adjacent. If we used a mixing factor with 1 bits every step+2 positions, the bits would be 2 positions apart, possibly intermixed with other “stuff” - a less useful pattern.

Generalizing the technique

A trivial generalization would be to extract groups of adjacent bits (e.g., bits 1, 2, 8, 9, 16, 17, etc.). The restrictions are similar to the direct and reverse extraction techniques.

In the more general sense, we would like to extract bits on arbitrary positions.

Looking back at the multiplication diagram, the result is composed of three section:

- lower junk bits
- staging area
- higher junk bits

Of the three, the higher junk bits are completely inconsequential, because bit carry only propagates upstream (whatever happens there, it won’t affect the staging area). We normally get rid of the higher junk bits by shifting them to a position that’s above the natural unsigned integer size (32 or 64 bits).

For the lower junk bits, the requirement is that they don’t produce a carry bit towards the staging area.

So, the technique works as long as there exists a mixing factor that:

For the lower junk bits, the requirement is that they don’t produce a carry bit towards the staging area.

So, the technique works as long as there exists a mixing factor that:

- produces the desired bit pattern within the staging area
- does not generate a carry from the lower junk bits to the staging area

The generalized algorithm is left as future work.

It might also be desirable to intentionally overlap the bits. If we lined up the D, C, B, A bits, instead of separating them by a one bit, then we could calculate a form of bit count. The usefulness of such a bit counting is unclear, as other techniques are likely to work better.

Hardware support

The technique presented here requires only C-style multiplication, bitwise AND and right shift - all common instructions that most architectures offer.

Of course, having an instruction dedicated to bit gathering is likely to work considerably better than using a series of operations. At the moment of this writing - September 2013 - Intel had previously announced bit manipulation extensions (BMI), in particular, the PEXT instruction which performs ordered bit gathering via a bitmask. Though the first Haswell processor generation (June 2013) was expected, before its launch, to include these extensions, in the end it did not. It is likely that a future Intel processor will include such instructions.

Though less general, the techniques presented here are nonetheless more portable.

**Download**

License

The source code is provided under the 2-clause BSD license. The text of the license is embedded in the source code files.

Final Note
After first publishing this article, I started digging some more on chessprogramming.wikispaces.com to look for other similar techniques. I eventually found the "magic multiplier" bit trick, which is essentially what I describe here -

http://chessprogramming.wikispaces.com/Magic+Bitboards

Originally, I was Google searching for "bit gathering" which yielded no useful results, and prompted me to think about a different solution.

*it has been known in the chess programming community for quite a while*.http://chessprogramming.wikispaces.com/Magic+Bitboards

Originally, I was Google searching for "bit gathering" which yielded no useful results, and prompted me to think about a different solution.

Copyright 2013 Vlad Petric.