What Is a Linux System Call Under the Hood?

The operating system must fulfill many objectives, but one of the most important is:

  • Provide an execution environment to the applications that run on the computer system (the so-called user programs).

A Linux system call (syscall) is the controlled transition between User Space and Kernel Space. Whenever a process makes a system call (i.e., a request to the kernel), the hardware switches the privilege mode from User Mode to Kernel Mode, and the process starts executing a kernel procedure with a strictly limited purpose. Once the request is fully satisfied, the kernel procedure forces the hardware to return to User Mode, and the process continues from the instruction following the system call. From a security standpoint, it is crucial that User Space applications cannot directly use functionality reserved for the kernel. The system call mechanism is the API that the Linux kernel developers created for exactly this purpose.

Kernel

The Linux kernel is a monolithic Unix-like computer operating system kernel created by Linus Torvalds in 1991. The Linux kernel, developed by contributors worldwide, is a prominent example of free and open source software.

Application binary interface

In computer software, an application binary interface (ABI) is the interface between two program modules, one of which is often a library or operating system, at the level of machine code.

Software interrupt

Transitions between User Mode and Kernel Mode happen only through well-established hardware mechanisms called interrupts and exceptions.

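At the very start of x86 memory in real mode, down at segment 0, offset 0, is a special lookup table with 256 entries. Each entry is a complete memory address including segment and offset portions, for a total of 4 bytes per entry. The first 1,024 bytes of memory are reserved for this table, and no other code or data may be placed there. (In protected and long mode, which Linux runs in, the equivalent table is the Interrupt Descriptor Table: its entries are larger and its location is held in the IDTR register rather than fixed at address 0, but the lookup principle is the same.)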

Each of the addresses in the table is called an interrupt vector. The table as a whole is called the interrupt vector table. Each vector has a number, from 0 to 255. The vector occupying bytes 0 through 3 in the table is vector 0. The vector occupying bytes 4 through 7 is vector 1, and so on.

None of the addresses is burned into permanent memory the way the PC BIOS routines are. When your machine starts up, Linux and BIOS fill many of the slots in the interrupt vector table with addresses of certain service routines within themselves. Each version of Linux knows the location of its innermost parts, and when you upgrade to a new version of Linux, that new version will fill the appropriate slots in the interrupt vector table with upgraded and accurate addresses.

The x86 CPUs include a machine instruction that has special powers to make use of the interrupt vector table. INT (INTerrupt) is an assembly language instruction for x86 processors that generates a software interrupt.

When the INT 80h instruction is executed, the CPU goes down to the interrupt vector table, fetches the address from slot 80h, and then jumps execution to that address. The transition from user space to kernel space is clean and completely controlled. On the other side of the address stored in table slot 80h, the kernel's system call dispatcher picks up execution and performs the service that your program requests.

The INT instruction takes the interrupt number as a one-byte operand. An interrupt transfers the program flow to whoever handles that interrupt, which is interrupt 0x80 in this case. On Linux, the handler for interrupt 0x80 belongs to the kernel, and programs use it to make system calls.

The kernel learns which system call the program wants to make by examining the value in the register %rax (gas syntax; RAX in Intel syntax). Each system call has its own requirements for the other registers. For example, a value of 1 in %rax selects exit(), and %rbx holds the exit status code. Note that INT 80h enters the kernel through the 32-bit (i386) system call interface, even from a 64-bit program, so only the low 32 bits of each register are used.

Behind the scenes, the INT 80h instruction does something else: it pushes the address of the next instruction (that is, the instruction immediately following the INT 80h instruction) onto the stack before it follows vector 80h into the Linux kernel.

In effect, the INT 80h instruction leaves breadcrumbs on the stack so that the CPU can find its way back to the User Space program after the excursion down into Linux.

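The system call numbers used with INT 80h are listed in https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_32.tbl (the 64-bit syscall instruction uses a different numbering, listed in https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl). C programs see them as SYS_* constants through sys/syscall.h. For the 32-bit interface the values are:

...
#define	SYS_exit	1
#define	SYS_fork	2
#define	SYS_read	3
#define	SYS_write	4
#define	SYS_open	5
#define	SYS_close	6
#define	SYS_waitpid	7
...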

Implementation

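The write system call (sys_write) is implemented in https://github.com/torvalds/linux/blob/master/fs/read_write.c. The version shown here is from an older kernel; newer kernels move the body into a helper, ksys_write():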

SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
		size_t, count)
{
	struct fd f = fdget_pos(fd);
	ssize_t ret = -EBADF;

	if (f.file) {
		loff_t pos = file_pos_read(f.file);
		ret = vfs_write(f.file, buf, count, &pos);
		if (ret >= 0)
			file_pos_write(f.file, pos);
		fdput_pos(f);
	}

	return ret;
}

Assembly code

Let’s write a simple hello world program in assembly language that illustrates the use of system calls and the INT 80h instruction:

; Author: Alex Bod
; Website: http://www.alexbod.com
; License: The GNU General Public License, version 2
; helloworld.asm: What is Linux System Call Under the Hood?
;

SECTION .data      ; Section containing initialised data
 
	Msg: db "Hello World!",10
	Len: equ $-Msg

SECTION .bss       ; Section containing uninitialised data

SECTION .text      ; Section containing code

global _start      ; Linker needs this to find the entry point

_start:
	; Syswrite
	mov rax,4
	mov rbx,1
	mov rcx, Msg
	mov rdx, Len
	int 80H

	; Sysexit
	mov rax,1
	mov rbx,0
	int 80H

Assembly code explanation

; Syswrite
mov rax,4 - move 4 (the number of the write system call) into the RAX register
mov rbx,1 - move 1 (the standard output file descriptor) into the RBX register
mov rcx, Msg - move the address of the hello world string into the RCX register
mov rdx, Len - move the length of the hello world string into the RDX register
int 80H - raise software interrupt 80h, the Linux system call entry; with 4 in RAX the kernel performs write

; Sysexit
mov rax,1 - move 1 (the number of the exit system call) into the RAX register
mov rbx,0 - move 0 (exit status 0, meaning the program finished normally) into the RBX register
int 80H - raise software interrupt 80h again; with 1 in RAX the kernel performs exit

Assemble it with nasm:

$ nasm -f elf64 -g -F dwarf helloworld.asm

Link it with ld:

$ ld -o helloworld helloworld.o

Execute it:

$ ./helloworld
Hello World!

Check the size:

$ ls -l
1928 bytes

That’s rather big for such a simple assembly Hello World program.

Strip it:

$ strip helloworld

Check the size again:

$ ls -l
512 bytes

Now we have a much smaller hello world program.

Check the linking:

$ ldd helloworld
not a dynamic executable

$ file helloworld
helloworld: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped

It’s linked to nothing. Awesome.

Disassemble it:

$ objdump -d helloworld

helloworld:     file format elf64-x86-64

Disassembly of section .text:

00000000004000b0 <.text>:
  4000b0:	b8 04 00 00 00       	mov    $0x4,%eax
  4000b5:	bb 01 00 00 00       	mov    $0x1,%ebx
  4000ba:	48 b9 d8 00 60 00 00 	movabs $0x6000d8,%rcx
  4000c1:	00 00 00 
  4000c4:	ba 0d 00 00 00       	mov    $0xd,%edx
  4000c9:	cd 80                	int    $0x80
  4000cb:	b8 01 00 00 00       	mov    $0x1,%eax
  4000d0:	bb 00 00 00 00       	mov    $0x0,%ebx
  4000d5:	cd 80                	int    $0x80