# Fast Bit Counting

2009.02.21 19:53

조회 수:1624

Puzzle: Devise a fast algorithm for computing the number of 1-bits in an unsigned integer. If the machine supports n-bit integers, can we compute the number of 1-bits in O(log n) machine instructions? Can we compute the number of 1-bits in O(b) machine instructions where b is the number of 1-bits in the integer?

Source: Commonly asked in job interviews for Software Engineers. I first heard it in 1996 when I was a graduate student.

Solution: This article presents six solutions to this problem. Source code in C is available.

1) Iterated Count
int bitcount (unsigned int n) {
int count = 0;
while (n) {
count += n & 0x1u;
n >>= 1;
}
return count;
}

Iterated Count runs in time proportional to the total number of bits. It simply loops through all the bits, terminating slightly earlier because of the while condition. Useful if 1’s are sparse and among the least significant bits.

2a) Sparse Ones
int bitcount (unsigned int n)  {
int count = 0 ;
while (n)  {
count++ ;
n &= (n - 1) ;
}
return count ;
}

Sparse Ones runs in time proportional to the number of 1 bits. The mystical line n &= (n - 1) simply sets the rightmost 1 bit in n to 0.

2b) Dense Ones
int bitcount (unsigned int n)   {
int count = 8 * sizeof(int) ;
n ^= (unsigned int) - 1 ;
while (n)  {
count-- ;
n &= (n - 1) ;
}
return count ;
}

Dense Ones runs in time proportional to the number of 0 bits. It is the same as Sparse Ones, except that it first toggles all bits (n ~= -1), and continually subtracts the number of 1 bits from sizeof(int).

Sparse Ones and Dense Ones were first described by Peter Wegner in “A Technique for Counting Ones in a Binary Computer”, Communications of the ACM, Volume 3 (1960) Number 5, page 322.

3a) Precompute_8bit
static int bits_in_char  ;

int bitcount (unsigned int n)  {
// works only for 32-bit ints

return bits_in_char [n  & 0xffu]
+  bits_in_char [(n >>  8 ) & 0xffu]
+  bits_in_char [(n >> 16) & 0xffu]
+  bits_in_char [(n >> 24) & 0xffu] ;
}

Precompute_8bit assumes an array bits_in_char such that bits_in_char[i] contains the number of 1 bits in the binary representation for i. It repeatedly updates count by masking out the last eight bits in n, and indexing into bits_in_char.

3b) Precompute_16bit
static char bits_in_16bits [0x1u << 16] ;

int bitcount (unsigned int n)  {
// works only for 32-bit ints

return bits_in_16bits [n         & 0xffffu]
+  bits_in_16bits [(n >> 16) & 0xffffu] ;
}

Precompute_16bit is a variant of Precompute_8bit in that an array bits_in_16bits[] stores the number of 1 bits in successive 16 bit numbers (shorts).

4) Parallel Count
#define TWO(c)     (0x1u << (c))
#define MASK(c)    (((unsigned int)(-1)) / (TWO(TWO(c)) + 1u))

int bitcount (unsigned int n)  {
n = COUNT(n, 0) ;
n = COUNT(n, 1) ;
n = COUNT(n, 2) ;
n = COUNT(n, 3) ;
n = COUNT(n, 4) ;
/* n = COUNT(n, 5) ;    for 64-bit integers */
return n ;
}

Parallel Count carries out bit counting in a parallel fashion. Consider n after the first line has finished executing. Imagine splitting n into pairs of bits. Each pair contains the number of ones in those two bit positions in the original n. After the second line has finished executing, each nibble contains the number of ones in those four bits positions in the original n. Continuing this for five iterations, the 64 bits contain the number of ones among these sixty-four bit positions in the original n. That is what we wanted to compute.

5) Nifty Parallel Count
int bitcount (unsigned int n) {
n = (n & MASK_01010101) + ((n >> 1) & MASK_01010101) ;
n = (n & MASK_00110011) + ((n >> 2) & MASK_00110011) ;
n = (n & MASK_00001111) + ((n >> 4) & MASK_00001111) ;
return n % 255 ;
}

Nifty Parallel Count works the same way as Parallel Count for the first three iterations. At the end of the third line (just before the return), each byte of n contains the number of ones in those eight bit positions in the original n. A little thought then explains why the remainder modulo 255 works.

According to Don Knuth (The Art of Computer Programming Vol IV, p 11), in the first textbook on programming, The Preparation of Programs for an Electronic Digital Computer by Wilkes, Wheeler and Gill (1957, reprinted 1984), pages 191–193 presented Nifty Parallel Count by D B Gillies and J C P Miller.

6) MIT HAKMEM Count
int bitcount(unsigned int n) {
/* works for 32-bit numbers only    */
/* fix last line for 64-bit numbers */

register unsigned int tmp;

tmp = n - ((n >> 1) & 033333333333)
- ((n >> 2) & 011111111111);
return ((tmp + (tmp >> 3)) & 030707070707) % 63;
}

MIT HAKMEM Count is funky. Consider a 3 bit number as being 4a+2b+c. If we shift it right 1 bit, we have 2a+b. Subtracting this from the original gives 2a+b+c. If we right-shift the original 3-bit number by two bits, we get a, and so with another subtraction we have a+b+c, which is the number of bits in the original number. How is this insight employed? The first assignment statement in the routine computes tmp. Consider the octal representation of tmp. Each digit in the octal representation is simply the number of 1’s in the corresponding three bit positions in n. The last return statement sums these octal digits to produce the final answer. The key idea is to add adjacent pairs of octal digits together and then compute the remainder modulus 63. This is accomplished by right-shifting tmp by three bits, adding it to tmp itself and ANDing with a suitable mask. This yields a number in which groups of six adjacent bits (starting from the LSB) contain the number of 1’s among those six positions in n. This number modulo 63 yields the final answer. For 64-bit numbers, we would have to add triples of octal digits and use modulus 1023. This is HACKMEM 169, as used in X11 sources. Source: MIT AI Lab memo, late 1970’s.

7) Builtin Instructions

GNU compiler allows for

int __builtin_popcount (unsigned int x);which translates into a single CPU instruction if the underlying machine architecture supports it. For example, Intel machines have POPCNT (SSE4 Instruction set announced in 2006). Many GCC builtin functions exist.

No Optimization         Some Optimization       Heavy Optimization

Precomp_16 52.94 Mcps    Precomp_16 76.22 Mcps    Precomp_16 80.58 Mcps
Precomp_8 29.74 Mcps     Precomp_8 49.83 Mcps     Precomp_8 51.65 Mcps
Parallel 19.30 Mcps      Parallel 36.00 Mcps      Parallel 38.55 Mcps
MIT 16.93 Mcps           MIT 17.10 Mcps         Nifty 31.82 Mcps
Nifty 12.78 Mcps         Nifty 16.07 Mcps           MIT 29.71 Mcps
Sparse  5.70 Mcps        Sparse 15.01 Mcps        Sparse 14.62 Mcps
Dense  5.30 Mcps        Dense  14.11 Mcps         Dense 14.56 Mcps
Iterated  3.60 Mcps      Iterated  3.84 Mcps      Iterated  9.24 Mcps

Mcps = Million counts per second

Which of the several bit counting routines is the fastest? Results of speed trials on an i686 are summarized in the table on left. “No Optimization” was compiled with plain gcc. “Some Optimizations” was gcc -O3. “Heavy Optimizations” corresponds to gcc -O3 -mcpu=i686 -march=i686 -fforce-addr -funroll-loops -frerun-cse-after-loop -frerun-loop-opt -malign-functions=4.

Thanks to Seth Robertson who suggested performing speed trials by extending bitcount.c. Seth also pointed me to MIT_Hackmem routine. Thanks to Denny Gursky who suggested the idea of Precompute_11bit. That would require three sums (11-bit, 11-bit and 10-bit precomputed counts). I then tried Precompute_16bit which turned out to be even faster.

If you have niftier solutions up your sleeves, please send me an e-mail or write comments below!

HAKMEM (bit counting is memo number 169), MIT AI Lab, Artificial Intelligence Memo No. 239, February 29, 1972.
Bit Twiddling Hacks by Sean Anderson at Stanford University.
Bitwise Tricks and Techniques by Don Knuth (The Art of Computer Programming, Part IV).
Posted in Computer Science | Tagged algorithms, bit counting, microsoft interview problems, mit hakmem, puzzles | 13 Comments

## 댓글 0

여기에 파일을 끌어 놓거나 파일 첨부 버튼을 클릭하세요.

파일 크기 제한 : 0MB (허용 확장자 : *.*)

0개 첨부 됨 ( / )

번호 제목 글쓴이 날짜 조회 수
64 문길 설치법 2017.04.19 78
63 진 여신전생 스트레인지 저니 에디터 코드 2015.06.18 147
62 VC 2012이후에서 릴리즈 빌드 스택변수들 디버깅하기 2014.03.05 230
61 Large Address Aware 2014.03.05 714
60 빠른 1/sqrt(n) 계산방법 2014.03.05 420
59 마포에서 AF되는 포서드용 렌즈들 2014.02.13 183
58 디카 이미지 슬라이드필름처럼 보정하기 2014.02.13 160
57 비디오 카드의 정의 2013.11.30 171
56 Windows 8.1 복구 파티션 만들기 2013.11.13 6530
55 마력 구하는 공식 2013.07.11 1272
54 [공지] 게시판 용도 2005.06.16 1097
53 Stream of Life 2009.06.29 1157
52 골프의 물리학 2009.05.30 1554
» Fast Bit Counting 2009.02.21 1624
50 vTune 사용법 2009.01.13 1257
49 제트 추력 엔진 2008.04.29 1452
48 std::tr1 2008.03.13 1347
47 일본어 날짜 2008.02.19 1136
46 Networking Best Practices in XBOX360 2007.12.19 2072
45 개인적으로 만들어 본 리눅스 게임 서버 FAQ 2007.11.27 1112