REN

Ph.D. in Computer Science at Rutgers University

Sign Extension in C

When we cast a signed integer to a wider type in C, sign extension is unavoidable. In this article, I'll walk through two examples of how sign extension occurs during type casting. Here are the general rules for sign extension when converting a signed value from a narrower type to a wider one:

1. If the value is non-negative (sign bit 0), the new high bits are filled with 0.
2. If the value is negative (sign bit 1), the new high bits are filled with 1.
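The two rules above can be sketched as a small function. This is only an illustration under stated assumptions: the name sign_extend_8_to_16 is hypothetical and not part of any library, and it works on raw bit patterns (uint8_t in, uint16_t out) so that the filling of the high bits is visible.

```c
#include <stdint.h>

/* Hypothetical helper (an assumption for illustration, not a library call):
 * widen an 8-bit two's-complement bit pattern to 16 bits by copying the
 * sign bit into every new high bit. */
uint16_t sign_extend_8_to_16(uint8_t bits)
{
    if (bits & 0x80u) {
        /* Rule 2: sign bit is 1, so fill the high byte with 1s. */
        return (uint16_t)(0xFF00u | bits);
    }
    /* Rule 1: sign bit is 0, so the high byte stays 0. */
    return (uint16_t)bits;
}
```

For instance, the pattern 0x87 widens to 0xFF87, while 0x07 widens to 0x0007.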

Now, let's take a look at an example. To follow the code below, you should be familiar with two's complement.

/* test.c */
#include <stdio.h>

int main() {
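	char c = 0x87;               /* char is signed with GCC on x86, so c holds -121 */
	short i = (short)c;          /* sign-extended to 16 bits */
	short j = (short)(c&0xffff); /* c is promoted to int before &; the low 16 bits are kept */
	short k = (short)(c&0x00ff); /* the mask clears the sign-extended high bits */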
	
	printf("i = %d\n", i);
	printf("j = %d\n", j);
	printf("k = %d\n", k);

	return 0;
}

Let's compile this program with GCC:

gcc -o test test.c

The program prints the following (on x86_64 Linux):

i = -121
j = -121
k = 135

Let's see how these results come about:

c = (1)000 0111 // (two's complement, 8 bits)
(short)c = (1)111 1111 1000 0111 // (two's complement, sign-extended to 16 bits)
i = (1)000 0000 0111 1001 = -121 // (true form)
// In the next two expressions, c is first promoted to int; only the low 16 bits, which the cast to short keeps, are shown.
(short)(c&0xffff) = (1)111 1111 1000 0111 & 1111 1111 1111 1111 = (1)111 1111 1000 0111 // (two's complement)
j = (1)000 0000 0111 1001 = -121 // (true form)
(short)(c&0x00ff) = (1)111 1111 1000 0111 & 0000 0000 1111 1111 = (0)000 0000 1000 0111 // (two's complement)
k = (0)000 0000 1000 0111 = 135 // (true form)
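The step from two's complement to true form (sign-magnitude) in the derivation above can be checked with a short function. This is a minimal sketch: the name to_sign_magnitude16 is hypothetical, and it assumes the input is a 16-bit two's-complement bit pattern stored in a uint16_t.

```c
#include <stdint.h>

/* Hypothetical helper (an assumption for illustration): convert a 16-bit
 * two's-complement pattern into its true form (sign-magnitude) pattern.
 * Note: 0x8000 (-32768) has no sign-magnitude representation in 16 bits
 * and is returned unchanged. */
uint16_t to_sign_magnitude16(uint16_t twos)
{
    if (twos & 0x8000u) {
        /* Negative: invert the bits and add 1 to get the magnitude,
         * then put the sign bit back on top. */
        uint16_t magnitude = (uint16_t)(~twos + 1u);
        return (uint16_t)(0x8000u | magnitude);
    }
    /* Non-negative: both representations are identical. */
    return twos;
}
```

Passing 0xFF87 (the pattern of i and j) yields 0x8079, which is (1)000 0000 0111 1001, and passing 0x0087 (the pattern of k) yields 0x0087 unchanged.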

How sign extension is performed at the ISA level

We now know how sign extension works. Next, let's look at how it is performed at the ISA level. As we know, the ALU in a processor uses the same adder for both addition and subtraction, and the adder works on operands of equal width, so a narrower operand has to be widened before it is added. Now, let's look at a simple program:

/* test.c */
#include <stdio.h>

int main() {
	char a = 0x87;
	unsigned char b = 0x87;
	int c = a + b;
	printf("%d\n", c);
	return 0;
}

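In the code above, 0x87 (135) is too large for a signed char, so a ends up holding -121 with its sign bit set (char is signed with GCC on x86). When a is promoted to int, its high bits must therefore be filled with 1. Now, let's see how the compiler does this (32-bit x86 disassembly):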

4		char a = 0x87;
   0x0804841c <+17>:	movb   $0x87,-0xe(%ebp)

5		unsigned char b = 0x87;
   0x08048420 <+21>:	movb   $0x87,-0xd(%ebp)

6		int c = a + b;
   0x08048424 <+25>:	movsbl -0xe(%ebp),%edx
   0x08048428 <+29>:	movzbl -0xd(%ebp),%eax
   0x0804842c <+33>:	add    %edx,%eax
   0x0804842e <+35>:	mov    %eax,-0xc(%ebp)

From the assembly code, we can clearly see that variable a is extended with 1 bits by the movsbl instruction, which sign-extends a byte to a long word, and the result is moved into %edx. Variable b, on the other hand, is extended with 0 bits by the movzbl instruction, which zero-extends a byte to a long word, and the result is moved into %eax.
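To make the two instructions concrete, here is a rough C sketch of what they compute. The function names emulate_movsbl, emulate_movzbl and add_like_listing are hypothetical, and registers are modeled as uint32_t bit patterns; this only mirrors the behavior described above and is not how the compiler or the CPU implements it.

```c
#include <stdint.h>

/* Hypothetical model of movsbl: sign-extend a byte to a 32-bit pattern. */
uint32_t emulate_movsbl(uint8_t byte)
{
    uint32_t wide = byte;
    if (byte & 0x80u) {
        wide |= 0xFFFFFF00u;   /* sign bit set: fill the upper 24 bits with 1 */
    }
    return wide;
}

/* Hypothetical model of movzbl: zero-extend a byte to a 32-bit pattern. */
uint32_t emulate_movzbl(uint8_t byte)
{
    return (uint32_t)byte;     /* the upper 24 bits are 0 */
}

/* Mirrors the movsbl / movzbl / add sequence from the listing:
 * a_bits is the byte stored for the signed char, b_bits the byte stored
 * for the unsigned char. Unsigned addition wraps modulo 2^32, like add. */
uint32_t add_like_listing(uint8_t a_bits, uint8_t b_bits)
{
    uint32_t edx = emulate_movsbl(a_bits);
    uint32_t eax = emulate_movzbl(b_bits);
    return eax + edx;
}
```

With both bytes equal to 0x87, the sign-extended operand is 0xFFFFFF87 (-121) and the zero-extended operand is 0x00000087 (135), so the sum wraps to 0x0000000E, which is 14.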