Saturday, September 24, 2016

Akro Build - an extreme C++ build system

This blog post describes a new build system for C++, Akro. By the end of this somewhat lengthy writeup, you'll see how to build complex C++ binaries with only three lines of build specification.


The C++ build fiasco


A long time ago I was a C/C++ developer who started to look at the new-language-on-the-block, Java. Language differences aside, what really impressed me was the ease of building Java applications with ant. A straightforward but functional build file only needed to specify the top level of the source tree, and a target destination for the compiled class files.

E.g.:
 <javac srcdir="java_src/" destdir="out/" />

In contrast, C/C++ developers had to deal with crummy, easy-to-break, and eventually bit-rotting Makefiles. The main reason for this fiasco is the way C/C++ handles modularization - through the preprocessor, which textually includes header files within the to-be-compiled C/C++ file. Header files contain declarations of structures, functions, and classes, whereas C/C++ files contain their definitions. To be link-compatible within the same executable, different C/C++ files have to include exactly the same declarations.

The separation between declaration and definition makes it particularly difficult to construct efficient build systems. Because one can't tell, in general, where the definition (or implementation) corresponding to a declaration resides, C/C++ build tools require the developer to manually specify all C/C++ files, and potentially the header files as well.

Note that the vast majority of modern languages have reasonable modularization systems which make it straightforward to design and write build systems. Nonetheless, no modern programming language with reasonable industry support matches C++'s speed across the board. In other words, C++ is not obsolete, even though it has many evolutionary vestiges and mis-features.

Meet Akro


Can we do any better than having to specify all the C/C++ sources and header files at least once? In the fully general case, probably not. On the other hand, just because we can implement a function anywhere we like doesn't mean that we should. Good coding practice requires that the declaration and definition reside in parallel files, with a different extension but everything else equal.

E.g., ntest, the premier Othello-playing engine has the following files:

~/ntest/src$ ls n64/endgameSearch
endgameSearch.cpp      endgameSearch.h 


endgameSearch.h contains declarations for two classes, EndgameSearch and Empty, while the cpp file has the implementations. Akro relies on these conventions, and makes the following fundamental assumption:

  • If you include a header file, it means that you want to build and link the associated C++ file, if it exists
Normally, you give Akro a top-level file. If you're building an executable, the top-level file is generally the one with the main() function. Akro determines all the header files included by that top-level file, by calling the compiler in a dependency-tracking mode (gcc -M). It then looks at the header files that reside within the project's directory tree (essentially project header files, not system ones from /usr/include), and from those header files it collects all the associated C++ files. If the top-level file included endgameSearch.h, then endgameSearch.cpp is added to the list of cpp files. The process is repeated for every collected file (so endgameSearch.cpp is itself compiled in dependency-tracking mode, and all its C++ dependencies added to the list). A file is only added once, and once the collected list is transitively closed, the process stops.
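The collection loop can be sketched in a few lines of Python. This is a toy model, not Akro's actual Ruby implementation: the include graph is hard-coded where Akro would invoke gcc -M, all file names are made up, and only the .cpp extension is handled.

```python
import os

# Hypothetical include graph: file -> project headers it includes
# (what `gcc -M` would report, minus system headers).
INCLUDES = {
    "main.cpp": ["endgameSearch.h", "util.h"],
    "endgameSearch.cpp": ["endgameSearch.h", "board.h"],
    "board.cpp": ["board.h"],
}
SOURCES = {"main.cpp", "endgameSearch.cpp", "board.cpp"}  # .cpp files on disk

def sibling_source(header):
    # endgameSearch.h -> endgameSearch.cpp, if such a file exists.
    candidate = os.path.splitext(header)[0] + ".cpp"
    return candidate if candidate in SOURCES else None

def transitive_sources(top):
    to_compile, worklist = {top}, [top]
    while worklist:
        current = worklist.pop()
        for header in INCLUDES.get(current, []):
            src = sibling_source(header)
            if src is not None and src not in to_compile:
                to_compile.add(src)   # a file is only added once...
                worklist.append(src)  # ...and is itself scanned for headers
    return to_compile
```

Here transitive_sources("main.cpp") picks up endgameSearch.cpp directly and board.cpp transitively, while util.h is skipped because it has no sibling source file.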

To make all this tracking and collection conceptually simple, Akro builds all files in the same top-level directory, and with the same compilation flags.


What does Akro need?


Akro requires a rakefile within the top-level directory of your project. Akro is implemented on top of Rake, a superb generic build system written in Ruby.

Normally C++ projects need to specify compile flags. So the first line in the rakefile is:

$COMPILE_FLAGS = "-std=c++1y -fPIC -pthread -Wall -Werror -msse4.2 -I."

If a project links in additional libraries, those need to be specified via additional link flags. For instance, the following says to link in the tcmalloc library (a great replacement for the really slow gnu malloc):

$ADDITIONAL_LINK_FLAGS = "-L/opt/gperftools/lib/ -ltcmalloc_and_profiler"

The final line declares several binaries to build:

add_binaries("bookplay.exe", "ntest.exe")

With the above 3 lines, you can run:

akro release

which will build release/bookplay.exe and release/ntest.exe.

Note 1: none of the above three lines is mandatory. The default compilation flags are -Wall, plus -g3 in debug mode or -O3 -g3 in release mode. The default additional link flags are empty. If you do not specify binaries to build, you can still build by naming them on the command line, such as:
akro release/ntest.exe

So, it is technically possible to build with an empty rakefile, though the default compilation flags are unlikely to work for you, and you probably won't want to type the binary names every time.

Note 2: the above example is a simplified version of ntest's build file.

What Akro is and is not


Akro is a C++ build system based on Rake, but is not:
  • An automated and portable compilation flag detection system, like autoconf/automake
    • Nonetheless, build portability within Akro may be achieved by running commands such as uname -a and setting variables such as $COMPILE_FLAGS depending on the output. Ruby in fact makes it really easy to do the above.
  • pkg-config, though functionality to use pkg-config to determine build parameters may be added in the future. Akro only tracks dependencies within the project, not external ones.
  • A generic C++ build system. If you don't follow the highlighted best practices, don't use Akro.
  • Make. Akro does not use Make in any way, but relies on Rake instead (seriously, why do projects use an outdated system like Make anyway?)
Akro is a layer on top of Rake - it generates tasks and rules to build C++ projects. All Rake functionality is kept as well.


Installing Akro


You need a Linux system, with Ruby 2.0 or later, though 2.3 is preferred. Ideally your system already has it installed.

$ ruby --version
ruby 2.3.1p112 (2016-04-26) [x86_64-linux-gnu]

If not, it should already be present in the distribution's standard repositories. For instance, on Debian or Ubuntu you could apt-get install ruby.

Akro may work on other Unices, such as Mac OS X, though it hasn't been tested on them yet.

If you're unlucky enough to run on an ancient Linux system (such as Red Hat Enterprise Linux 6, whose acronym would be a good pun if it had another L), you may have to install the Ruby Virtual Machine manually, into your own directory. See https://rvm.io/rvm/install

If your system administrators are not cooperating, you may want to inform them that Linux is more akin to a cheap pale ale than wine or scotch - it really does not age well.

In any case, do not proceed until ruby --version shows 2.0.0 or later.

The next step is generally simple (may require root privileges):

gem install akro


Using Akro


Once akro is installed, you should be able to run it:

$ akro
rake aborted!
No Rakefile found (looking for: rakefile, Rakefile, rakefile.rb, Rakefile.rb)
/var/lib/gems/2.3.0/gems/akro-0.0.1/bin/akro:42:in `<top (required)>'
/usr/local/bin/akro:23:in `load'
/usr/local/bin/akro:23:in `<main>'
(See full trace by running task with --trace)

The error is to be expected, since we don't have a rakefile.

Now, go to your project's top-level directory. Create a rakefile which sets the proper $COMPILE_FLAGS. Then, run:

akro debug/path/to/main.exe

The above assumes that your top level file is path/to/main.{c,C,cpp,c++}. The main file may also reside directly in the top-level directory, in which case the command is akro debug/main.exe

Note that all the compilations happen in the top-level directory where the rakefile resides. It is your responsibility to set the correct -I include paths as part of $COMPILE_FLAGS.

You can also change the $COMPILER, which is by default g++.

Once you manage to compile your project, you may still run into link issues. If you need to link in third-party libraries, use $ADDITIONAL_LINK_FLAGS.

Finally, add your binaries to the default build list via add_binaries, so that you can just run akro release or akro debug.

Hacking Akro


Akro's code is hosted on Github: https://github.com/vladpetric/akro

It is released under the MIT license, which I consider the most permissive Free Software license.

It currently consists of three files:

bin/akro
lib/akro.rb
lib/akrobuild.rake

lib/akro.rb contains mostly variables and definitions which users may override in their rakefiles. It's a good idea to take a look at it.

lib/akrobuild.rake contains the code which generates Rake rules and tasks.

Final Notes


Akro's current version is 0.0.1. It's been tested extensively, but it has a really small number of users. 
Your constructive feedback is greatly appreciated :)


Acknowledgements


I'd like to thank Maxim Trokhimtchouk for abuild, whose concepts served as an inspiration for Akro.

Friday, March 14, 2014

Bit Gathering and base-2-to-base-3 conversion for Othello

The previous article, Bit Gathering Via Multiplication, showed how to extract bit patterns from 64 bit integers which represent 8x8 board game positions (a.k.a. bitboards). Note, in the chess programming community the technique has been known for quite a while as kindergarten bitboards or magic multiply, but I prefer the term bit gathering.

This write-up presents an optimization to bit gathering which pertains to tri-state games such as Othello/Reversi.

Problem


Leading Othello engines such as NTest or Zebra use pattern tables to statically evaluate a position (see references). For instance, each row, by itself, will produce such a pattern. Symmetrically, every column will also produce a pattern. Diagonals and certain regions around the corners create patterns as well.

Each of these patterns has an associated "goodness" value depending on the stage of the game (certain patterns in fact have very different values at the beginning of the game versus the end of the game). The engines extract all the relevant patterns from the board, and then sum up their corresponding "goodness" values from the pattern database*.

The following figure shows how a column pattern and a diagonal pattern slice through the position (note: this position, with White to move, is a relatively balanced one). Both patterns are bolded. Dots indicate empty positions in the patterns.


[8x8 board figure; the bolded diagonal and column patterns are spelled out below]

The highlighted secondary diagonal pattern is .WBWW., with the dot representing an empty space. Similarly, the highlighted column pattern is B.WWWBW.. Because each of the 64 board positions in Othello can be Empty (.), White (W), or Black (B), a pattern is essentially a base 3 number, with one digit per board position. A row has 8 positions, thus there are at most 3 to the power 8 = 6561 possible patterns. I say at most because some patterns are clearly not possible (e.g., a center column, D or E, with the middle positions, 4 or 5, empty). Nonetheless, the pattern space is sufficiently dense that the base 3 representation provides an excellent space/encoding overhead trade-off.
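As a quick sketch, the base 3 index of a pattern string can be computed directly, assuming the digit assignment 0 = empty, 1 = White, 2 = Black (the convention used below):

```python
def pattern_index(pattern):
    # '.' -> 0 (empty), 'W' -> 1, 'B' -> 2; the leftmost square is the
    # most significant base-3 digit.
    digits = {'.': 0, 'W': 1, 'B': 2}
    index = 0
    for square in pattern:
        index = index * 3 + digits[square]
    return index
```

For the diagonal above, pattern_index(".WBWW.") yields 147; an 8-square pattern tops out at pattern_index("BBBBBBBB"), i.e. 6560.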

Let's assume that within the pattern database, 0 means empty, 1 White and 2 Black **. The diagonal pattern will be 012110(3), or 1 * 3 ^ 4 + 2 * 3 ^ 3 + 1 * 3 ^ 2 + 1 * 3 ^ 1 = 147.

So how would an engine produce the pattern for the 6-long secondary diagonal?

Let's assume that White and Black are each represented by a bitboard. 

Using the bit gather approach:
  1. Extract the 6 bit quantities from both the Black and White bitboards. Cost:  2 x (1 mask, 1 multiply, 1 shift). This step is described in the previous article.
  2. Convert the base 2 representation to a base 3 representation with a look-up table.

    The table essentially converts abcdefgh(2) to abcdefgh(3), where each of a,b,c,d,e,f,g,h is a binary digit - either 0 or 1.

    E.g., the array entry 27, which is:

    11011(2) or
     2 ^ 4 + 2 ^ 3 + 2 ^ 1 + 2 ^ 0,

    contains

    11011(3), which is:
     3 ^ 4 + 3 ^ 3 + 3 ^ 1 + 3 ^ 0 == 112.

    Cost: 2 loads (from a small table of 512 bytes, which is unlikely to cause caching problems)
  3. Return 2 * base_3_Black + base_3_White: one or two basic operations. On x86, the LEA instruction can compute 2 * A + B in one shot.
To be clear, this approach is already highly efficient on a modern processor. But can we do any better? As it turns out, in some cases we can eliminate the need for the table look-ups altogether.
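Steps 2 and 3 can be sketched in Python. The 256-entry table mirrors the 512-byte table mentioned above (256 entries of 16 bits each); the 6-bit Black and White masks are assumed to come out of the bit-gather step, most significant square first:

```python
# Step 2: abcdefgh(2) -> abcdefgh(3), i.e., reinterpret the binary
# digits of the index as base-3 digits.
TO_BASE3 = [int(bin(i)[2:], 3) for i in range(256)]

def pattern_value(black_bits, white_bits):
    # Step 3: each square contributes 2 (Black), 1 (White) or 0 (empty)
    # to its base-3 digit.
    return 2 * TO_BASE3[black_bits] + TO_BASE3[white_bits]
```

For the .WBWW. diagonal, Black occupies only the third square (0b001000) and White the second, fourth and fifth (0b010110), so pattern_value(0b001000, 0b010110) gives 2 * 27 + 93 = 147, i.e., 012110(3).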

Folding bit gathering and base 2 to base 3 conversion

Let's take another look at the secondary diagonal that we'd like to extract, gather, and convert to base 3:


[8x8 board figure: a 6-long diagonal, elements labeled a through f]

Much like the regular bit gather, the first step is a bitwise AND to clean up anything else. We thus end up with a00000000b00000000c00000000d00000000e00000000f00(2). Whereas the bit gather step would produce abcdef(2) via a multiply and shift, this technique will produce directly abcdef(3) via a multiply and shift as well.

If you recall from the first article about bit gathering, multiplication makes transposed (shifted) copies of the first operand wherever the second operand has a bit of 1, and adds all the transposed copies together. To extract and base-2-to-base-3 convert a pattern at the same time, the first operand will be an array of powers of 3. The second operand will be the cleaned-up bit pattern. Because the diagonal bit pattern is made up of bits that are 9 positions apart, the powers of three in the first operand will also be 9 positions apart.

The following are the powers of 3 up to and including 5, in binary:

power  decimal  binary     alias
0      1        1          p
1      3        11         qq
2      9        1001       rrrr
3      27       11011      sssss
4      81       1010001    ttttttt
5      243      11110011   uuuuuuuu
Total: 364      101101100

Without further ado, the multiplication diagram, which, for simplicity, assumes that all a, b, c, d, e, and f are 1:

                  p0000000qq00000rrrr0000sssss00ttttttt0uuuuuuuu *
                  a00000000b00000000c00000000d00000000e00000000f
                ------------------------------------------------  
                  p0000000qq00000rrrr0000sssss00ttttttt0uuuuuuuu
         p0000000qq00000rrrr0000sssss00ttttttt0uuuuuuuu
  000000qq00000rrrr0000sssss00ttttttt0uuuuuuuu
  0000rrrr0000sssss00ttttttt0uuuuuuuu
  000sssss00ttttttt0uuuuuuuu
  0ttttttt0uuuuuuuu
  --------------------------------------------------------------
  ________XXXXXXXXX_____________________________________________

The highlighted portion, XXXXXXXXX, shall contain a * uuuuuuuu + b * ttttttt + c * sssss + d * rrrr + e * qq + f * p, or abcdef(3). The maximum result value shall not exceed 364, which can be represented on 9 bits. By placing the powers of 3 with exactly the same "pitch" as the bits in our pattern (i.e., 9 bits), we ensured the XXXXXXXXX area in the result will have all the powers of 3 precisely lined up. Of course, uuuuuuuu, ttttttt, sssss, rrrr, qq, and p are just aliases for the powers of 3. The multiplication factor is 0x2030486CA2F3.

This technique works as long as there is no carry between the multiplication groups. In other words, the sum of the powers of 3 should be representable on the number of bits between two elements (9 bits). It should now be clear why a 6-long NWSE diagonal was chosen. The NWSE direction provides a 9-bit separation between elements; the NESW diagonal provides only 7 bits of separation. Also, if we chose a 7- or 8-long NWSE diagonal, the powers of 3 would exceed 9 bits, as the sum of the powers of 3 up to 6 is 1093 (11 bits).

However, for 5-long diagonals the technique works in both directions, as 7 bits are sufficient to sum up all the powers of 3 up to 4.
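The folded extraction can be sketched in Python for a 6-long NWSE diagonal placed at bits 2, 11, ..., 47 (matching the operand with two trailing zeros shown earlier; the multiplier is constructed from the powers of 3 instead of being hardcoded):

```python
# Powers of 3 spaced 9 bits apart - the same pitch as the diagonal bits.
# 3^5 sits in the lowest lane so that the topmost diagonal bit picks up
# the largest power of 3.
MULT = sum(3 ** (5 - k) << (9 * k) for k in range(6))
MASK = sum(1 << (2 + 9 * k) for k in range(6))  # bits 2, 11, 20, 29, 38, 47

def diag6_base3(bitboard):
    # The 9-bit staging area starts at bit 47 (= 2 + 45); every lane sum
    # stays below 2^9, so no carry leaks between lanes.
    return ((bitboard & MASK) * MULT >> 47) & 0x1FF
```

Running this once per color and combining the results as 2 * black + white produces the base-3 pattern index with two multiplies and no table loads.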

Acknowledgements

I would like to thank Chris Welty, the main author of NTest, and Won Chun for independently giving me the idea of combining magic multiplies and base 2 to base 3 conversion.

References

Ntest: Main source tree: https://github.com/weltyc/ntest
My clone of ntest's source tree: https://github.com/vladpetric/ntest

Zebra: http://radagast.se/othello/download.html

"Experiments with Multi-ProbCut and a New High-Quality Evaluation Function for Othello" (1997) by Michael Buro describes additive, pattern-based Othello

Notes

* NTest uses a total of 42 patterns: 16 for all the rows and columns (8 + 8), 14 for diagonals (size 5 to size 8), and 12 for corner regions (3 per corner). In addition to summing up pattern values, NTest also uses mobility, potential mobility and parity to statically evaluate a position.

** NTest uses a homologous representation - 0 for the side to move, 1 for empty and 2 for the opposing side. For the purpose of this post, the different convention makes little difference.

Tuesday, September 17, 2013

Bit Gather Via Multiplication

Problem

Let’s say that we have an integer value representing a 64-bit array (or bitmap) and we only care about some of the bits, say, every 9th one (i.e., the 1st, 10th, 19th, all the way up to the 64th). We would like to extract and gather those bits as efficiently as possible, i.e., create an 8 bit value containing just those bits, in order.

Would you believe that it is possible to do so with just 3 arithmetic operations?

As motivation for this problem, many board games are played on an 8x8 board (http://www.boardgamegeek.com lists more than one hundred games with the tag 8x8). My own favorite board game, Othello (aka Reversi), is played on an 8x8 board. Such boards can be efficiently represented using 64-bit unsigned integers. For instance, an Othello playing software might represent such a board using two bitmaps (or bitboards), one for white and one for black. Taking every 9th bit, from the least significant to the most significant one, means extracting the main diagonal.



A trivial solution would involve looking at every bit by bitwise-AND-ing with pre-calculated masks, and then setting a bit in the destination variable. That solution would require dozens of operations. 

We could also use lookup tables, e.g., a 65536-entry lookup table could be used to extract the first two bits (a1 and b2), another such lookup table for c3 and d4, for a total of 4 lookups in 256 KiB of data. The total cost would be 3 shifts, 4 loads and 4 bitwise and operations. The problem with lookup tables, particularly larger ones, is that they interfere with the cache hierarchy. Such a solution may perform well in a microbenchmark. In a more realistic setting, however, its behavior will likely disappoint.

Solution

Can we do any better? My proposed solution uses 3 arithmetic operations and no loads.

The first operation is a bitwise AND that clears up all the non-diagonal bits (e.g., b4). The AND mask corresponding to the diagonal bits is 0x80 40 20 10 08 04 02 01. 

The second instruction is, perhaps surprisingly, multiplication. To understand how multiplication can help us gather the bits, let’s take a look at how multiplication works in binary. Let’s say that we have an arbitrary 3-bit base-2 number, ABC(2), that we would like to multiply by 10001(2). The pen-and-pencil multiplication diagram in base 2 is:

      ABC *
    10001
  -------
      ABC
  ABC
  -------
  ABC0ABC
It does not particularly matter whether each of A, B, C, is either 0 or 1. The result is ABC0ABC(2). The key observation is that, at a binary level, multiplication makes transposed (shifted) copies of the first operand wherever the second operand has a bit of 1, and adds all the transposed copies together. Of course, multiplication is commutative, therefore you can swap the operands in the previous statement - and in the diagram.

We will use this particular property to gather the evenly-spaced bits. For simplicity, let’s consider a 16 bit array, or 4x4 bitmap, and we would like to extract every 5th bit.

  A . . .
  . B . .
  . . C .
  . . . D
In binary, the number would be D0000C0000B0000A(2). The top-left corner is the least significant bit, and the bottom-right corner is the most significant bit.

How might we produce DCBA(2) somewhere in the output result? 

Let’s start with a simpler step - how do we bring C before D? Assuming that we keep D in its current place - bit 15 - we could bring C from bit 10 to bit 14 by shifting the initial number left by four bits, and then superimposing it on the unshifted number. Thankfully, a multiplication by 10001(2) achieves exactly that. Notice that there are only three 0s in between the two 1s, whereas in the cleaned-up number there are 4 zeroes between A, B, C and D.

      D0000C0000B0000A *
                 10001
  --------------------
      D0000C0000B0000A
  D0000C0000B0000A
  --------------------
  D000DC000CB000BA000A
For brevity I’m not showing the zero rows in the multiplication diagram.

Moving on, to bring B to bit 13 from bit 5, we need to left shift the original number by 8 bits, which is equivalent to multiplying the first operand by 100000000(2). Combining that with 10001(2), we get 100010001(2)

Great, now we have DCB in bits 15, 14 and 13.

It should not be particularly surprising that multiplying by 1000100010001(2) produces the DCBA string in bits 15 through 12.

We can now shift right by 12 and retain the low 4 bits (AND with 0xF) to obtain our result. But wait, you might say, isn’t that a total of 4 operations? For the 4x4 bitmap, that is correct.
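The whole 4x4 extraction can be checked with a quick Python sketch of the three steps plus the final AND:

```python
MASK4 = 0x8421  # bits 0, 5, 10, 15: A, B, C, D
MIX4 = 0x1111   # 1000100010001(2): bits 0, 4, 8, 12

def gather4(x):
    # DCBA lands in bits 15 through 12; shift down and mask off the junk.
    return ((x & MASK4) * MIX4 >> 12) & 0xF
```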

For the original problem, of extracting 8 diagonal bits from an 8x8 bitmap, we multiply the 64 bit number by 

100000001000000010000000100000001000000010000000100000001(2)

We obtain the diagonal bits on positions 63-56. For a 64 bit unsigned integer, those are the most significant 8 bits. In the C language, unsigned integer overflow is well-defined and keeps the low-order bits of the result. After multiplying the two unsigned 64 bit integers, all that is needed is a shift right by 56 bits.

#include <stdint.h>
uint64_t extract_diagonal(uint64_t bitmap) {
   return ((uint64_t) ((bitmap & 0x8040201008040201ULL) *
                       0x0101010101010101ULL)) >> 56;
}

Actually, if we wanted to extract the diagonal bits from a 16-bit bitmap, we could do it with 3 operations as well. To obtain our DCBA string in bits 31-28, we would multiply not by 1000100010001(2), but by 1000100010001 0000000000000000(2). Essentially, we combine the multiplication with a left shift (which is the same as a multiplication by a power of 2), so that overflow does the work of cleaning up the upper junk bits. We would use uint32_t, and then shift right by 28 bits.
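The same trick, sketched in Python with the 32-bit truncation written out explicitly:

```python
def gather4_folded(x):
    # The multiplier is pre-shifted by 16, so the uint32 truncation
    # (here an explicit mask) throws away the upper junk bits, and DCBA
    # ends up in bits 31-28.
    return ((x & 0x8421) * (0x1111 << 16) & 0xFFFFFFFF) >> 28
```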

Isn’t multiplication powerful?

Performance

What kind of assembly code is generated for the above?

Using a 64-bit x86 linux box, with gcc 4.7.3 in opt mode (-O3), I obtained the following assembly snippet. Note that objdump normally uses the AT&T assembly style, which puts the output operand last; this is different from Intel assembly style, which puts the output first.


movabs $0x8040201008040201,%rdx
and    %rdx,%rax
movabs $0x101010101010101,%rdx
imul   %rdx,%rax
shr    $0x38,%rax

We start with the input value in register %rax. We have two instructions to load constants, and the three arithmetic operations - and, multiply, and shift right.

In terms of instruction latency, the and, shr and movabs operations normally take a cycle each. According to http://www.agner.org/optimize/instruction_tables.pdf, 64 bit multiplication takes between 3 and 6 cycles on modern processors. The first movabs and the three arithmetic operations execute serially, because each depends on the previous instruction’s result. However, the second and third instructions may execute in parallel, because they are independent. So I estimate between 6 and 10 cycles.

Of course, this is just an example of generated code - the compiler may do something different. By the way, if the code is compiled for a 32 bit architecture, the resulting assembly code will not be as efficient, because all 64 bit operations will be simulated via 32 ones. It is probably more efficient to extract 4 bits from each of the 32 bit words (upper and lower) and combine them afterwards.
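That split could look like the following sketch. The masks and shifts below are one possible choice, derived the same way as the 64-bit constants; each half is a 4-element gather, with explicit masks standing in for 32/64-bit truncation:

```python
MASK64 = 0x8040201008040201
MIX64 = 0x0101010101010101

def diag64(bb):
    # The one-multiply version from the C function above.
    return ((bb & MASK64) * MIX64 & 0xFFFFFFFFFFFFFFFF) >> 56

def diag32x2(bb):
    lo = bb & 0xFFFFFFFF  # a1, b2, c3, d4 live at bits 0, 9, 18, 27
    hi = bb >> 32         # e5, f6, g7, h8 live at bits 4, 13, 22, 31
    lo4 = (((lo & 0x08040201) * 0x01010101 & 0xFFFFFFFF) >> 24) & 0xF
    hi4 = ((hi & 0x80402010) * 0x01010101 & 0xFFFFFFFF) >> 28
    return (hi4 << 4) | lo4  # same bit order as diag64
```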

As is generally the case with modern processors that are superscalar (i.e., they can schedule multiple instructions or micro-ops per cycle) and execute instructions out-of-order (i.e., they can execute independent instructions in a flexible order, e.g. the second movabs before the first one), your mileage may vary. Always measure, and also keep in mind Amdahl’s law!

Correctness

Now you may wonder - why does it work? Or, are there cases where this may not work?

The multiplication trick works as long as there is no carry between two columns. In other words, if you look at any column in the multiplication diagram, there should be a single bit that is potentially non-zero. In plain English, that can be achieved if the gap between the bits we would like to extract is sufficiently large to hold all the collapsed bits. 

If we need to collect k bits, at positions c, c+n, c+n*2 … c+n*(k-1), then n should be greater than or equal to k.

For the diagonal example, n was 9 and k was 8 (and c was 0). If n and k were equal to each other (8), then we would extract a column.

However, the algorithm cannot extract the second diagonal (a8 to h1), because n would be 7 and k 8. It could extract 7 bits off the second diagonal, and the 8th bit could be patched using more conventional approaches.

Let’s take a look at the second diagonal example.

  . . . A
  . . B .
  . C . .
  D . . .
Using the 4x4 example, the elements of the second diagonal have two zero bits in between them (i.e., they are 3 positions apart).


The multiplication diagram shows how the overlapping of C and A clobbers, by producing a carry bit, all the useful higher order bits. 

Note that I did not include the trailing three zeros in the original bitmap - those bits are inconsequential, i.e. overflow occurs independently of them.

A meta algorithm: Python implementation

The meta algorithm takes c, n and k and produces three values: the AND mask, the multiplier and the final shift amount. 

The AND mask simply has c, c+n, c+n*2 … c+n*(k-1) bits set to one, and all other bits set to zero.

The final shift value is also easy to determine, as the algorithm places the k bits at the end of the target integer. Assuming a 64 bit integer, the shift value is 64-k.

The multiplication mask is a composite of two factors:

  • The mixing factor: the value given by setting the 0, n-1, (n-1)*2 … (n-1)*(k-1) bits to 1, and all other bits to zero. Note that I’m considering bit 0 to be the least significant bit.
  • The shift required to bring the result to the most significant bits. It is meant to bring the last element, c+n*(k-1), to the most significant bit, 63, so its value is 63 - c - n*(k-1).

A sample Python code to generate these values:

def repeatedBit64(start, count, step):
   assert start + (count - 1) * step < 64
   value = 0
   for position in range(0, count):
       value |= 1 << (start + position * step)
   return value

def bitGatherTerms(first, count, step):
   assert step >= count
   andMask = repeatedBit64(first, count, step)
   multiplier = repeatedBit64(0, count, step-1) * \
       (1 << (63 - first - (count - 1) * step))
   rightShift = 64 - count
   return (andMask, multiplier, rightShift)

Note that Python’s integers are arbitrarily long, as opposed to C’s integers which are fixed size. To test the above constants in Python, after performing the multiplication one needs to AND the result with 0xffffffffffffffff to keep only the lower 64 bits.
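As a sanity check, here is a compact restatement of the generator (same logic, with the multiply-by-a-power-of-2 written as a shift), together with the familiar diagonal constants it reproduces:

```python
def repeated_bit64(start, count, step):
    # Set bits start, start + step, ..., start + (count - 1) * step.
    value = 0
    for position in range(count):
        value |= 1 << (start + position * step)
    return value

def bit_gather_terms(first, count, step):
    and_mask = repeated_bit64(first, count, step)
    multiplier = repeated_bit64(0, count, step - 1) << (63 - first - (count - 1) * step)
    return and_mask, multiplier, 64 - count

mask, mul, shift = bit_gather_terms(0, 8, 9)  # the main diagonal
```

Applying the three constants as ((x & mask) * mul & 0xFFFFFFFFFFFFFFFF) >> shift reproduces the extract_diagonal C function from earlier.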

The download section includes a python generator/tester based on the above snippet and also a C++ metaprogramming solution.

Bit reversal

In the previous examples, we took bitmaps with useful bits that were n positions apart and combined them with a mask made of bits n-1 positions apart. What if we instead used a mixing factor made of bits n+1 positions apart?

Let’s use the secondary diagonal example, but with the new mixing factor.


Wow … the new mixing factor not only reversed the bits - from D00C00B00A(2) it produced an ABCD string - it also bought us an extra position. Remember, we were not able to extract the secondary diagonal via the direct mixer, because of carry/overflow. However, the reverse mixing factor gave us the extra bit to prevent overflow. 

In other words, we can extract the second diagonal as long as we don’t mind the bit reversal. For a symmetric board game such as Othello/Reversi, reversal is generally not a problem.

It should be fairly obvious from the above diagram that only a single bit was gained by using the reversing mixing factor - if the original bits were any “tighter”, carry would occur.

The mixing factors we use are composed of evenly-spaced 1 bits that were either step+1 or step-1 bits apart, where step represents the interval of useful bits in the input. The reason for the “one off” is simple - it produces a result where the useful bits are adjacent. If we used a mixing factor with 1 bits every step+2 positions, the bits would be 2 positions apart, possibly intermixed with other “stuff” - a less useful pattern.
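The reversed gather from the discussion above, as a Python sketch for the D00C00B00A layout (pattern bits at positions 0, 3, 6, 9; mixing factor bits 4 apart):

```python
PATTERN = 0b1001001001        # A, B, C, D at bits 0, 3, 6, 9
REVERSE_MIX = 0b1000100010001  # bits 0, 4, 8, 12: step + 1 = 4 apart

def gather_reversed(x):
    # The elements line up reversed, as ABCD, in bits 12 through 9.
    return ((x & PATTERN) * REVERSE_MIX >> 9) & 0xF
```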

Generalizing the technique

A trivial generalization would be to extract groups of adjacent bits (e.g., bits 1, 2, 8, 9, 16, 17, etc.). The restrictions are similar to the direct and reverse extraction techniques.

In the more general sense, we would like to extract bits at arbitrary positions.
Looking back at the multiplication diagram, the result is composed of three sections:

  • lower junk bits
  • staging area
  • higher junk bits

Of the three, the higher junk bits are completely inconsequential, because bit carry only propagates upstream (whatever happens there, it won’t affect the staging area). We normally get rid of the higher junk bits by shifting them to a position that’s above the natural unsigned integer size (32 or 64 bits).

For the lower junk bits, the requirement is that they don’t produce a carry bit towards the staging area.

So, the technique works as long as there exists a mixing factor that:
  • produces the desired bit pattern within the staging area
  • does not generate a carry from the lower junk bits to the staging area

The generalized algorithm is left as future work.

It might also be desirable to intentionally overlap the bits. If we lined up the D, C, B, A bits, instead of separating them by a one bit, then we could calculate a form of bit count. The usefulness of such a bit counting is unclear, as other techniques are likely to work better.
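For completeness, a sketch of that bit-count variant on the same D00C00B00A layout: using the pattern itself as the mixing factor stacks all four bits onto one lane, where they add up.

```python
PATTERN = 0b1001001001  # four pattern bits, 3 positions apart

def count4(x):
    # All four bits land on the lane at bit 9; their sum (0..4) fits in
    # 3 bits, and no neighboring lane spills into bits 9-11.
    return ((x & PATTERN) * PATTERN >> 9) & 0x7
```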

Hardware support

The technique presented here requires only C-style multiplication, bitwise AND and right shift - all common instructions that most architectures offer.

Of course, having an instruction dedicated to bit gathering is likely to work considerably better than using a series of operations. At the moment of this writing - September 2013 - Intel had announced bit manipulation extensions (BMI2), in particular the PEXT instruction, which performs ordered bit gathering via a bitmask. These extensions first shipped with the Haswell processor generation (June 2013).

Though less general, the techniques presented here are nonetheless more portable.

Download

License

The source code is provided under the 2-clause BSD license. The text of the license is embedded in the source code files.

Final Note

After first publishing this article, I started digging some more on chessprogramming.wikispaces.com to look for other similar techniques. I eventually found the "magic multiplier" bit trick, which is essentially what I describe here - it has been known in the chess programming community for quite a while.

http://chessprogramming.wikispaces.com/Magic+Bitboards

Originally, I was Google searching for "bit gathering" which yielded no useful results, and prompted me to think about a different solution.

Copyright 2013 Vlad Petric.