Operating system internals

I spend a lot of time using computers, but I am still scared of what goes on beneath the covers.
To change that, I am following along with the free Operating Systems: Three Easy Pieces book offered by Computer Science professors at the University of Wisconsin-Madison


  1. Introduction
  2. Virtualization
    1. Virtualizing The CPU
      1. Mechanisms And Policies
      2. Process Creation
      3. Limited Direct Execution
      4. Scheduling
    2. Virtualizing Memory
      1. Memory Allocation Api
      2. Address Translation
      3. Segmentation
      4. Free Memory Management
      5. Paging
      6. Translation Lookaside Buffers
      7. Advanced Paging
      8. Swap Space
  3. Concurrency
  4. Persistence


When we say a program is “running” we mean that a program is doing the following over and over again:

This is the Von Neumann model of computing. That.is.it.

However, just having a single program running at a time is boring. As it stands it is a simple batch processing system. You cannot run multiple programs at once and you cannot interact with the system while it is running your program. We want more. We need more.

The body of software that allows us to interact with our laptops in the way we do is called the operating system. It enables multiple programs to run seemingly at the same time – giving us the desktop “environment” we are used to.

As I am writing this in a text editor, my browser is open to the page containing this lesson, music is playing from another tab on the browser, I have a status bar on the bottom showing the time, volume, date, battery, and the status of my wireless connection. I also have another workspace with a terminal window open. None of this would be possible without the wild wild things the operating system does behind the scenes while programs are seemingly just fetching, decoding and executing instructions.

So how does an operating system (or OS) do this magic? Three easy pieces; virtualization, concurrency and persistence.


Virtualization: “Sharing one peach among many peach-eaters while making them think they each have a peach to themselves”

Virtualizing the CPU

Program – A program is a lifeless thing, sitting on disk. A bunch of instructions and maybe some data, waiting to spring into action

Process – A running program.

It is the operating system that takes a program and gets it running, transforming it into something useful. The operating system needs a CPU to “run” these programs in.

A usable computer should be able to run many programs at a time. So how do we create the illusion of having multiple CPUs for these programs to run on? How do we give the illusion of access other physical resources that the computer might be connected to ? The answer is clearly “virtualization”…but how does the OS do that?

Mechanisms and Policies

Mechanisms are low-level operations, methods or protocols to implement a needed piece of functionality. They represent the step-by-step details of how to do something.

Policies are the algorithms for making some kind of decision. For example, deciding which process to run next.

Back to processes. Some terms

Machine state: The parts of the machine that are important to the execution of a running process. Some examples are

Process API: The way you expect to be able to interact with a process. For example, we should be able to create, destroy, and wait for a process. Other operations might also be possible.

Process List: Data structure that the OS uses to track information about processes that are running, ready or blocked. The individual members in this list are sometimes called PCBs (process control blocks) or process descriptors.

Mechanism: Process Creation

To have multiple processes running on the same computer, the operating system must be able to create them. These is the main application programming interface for process creation:

Mechanism: Limited Direct Execution

When a process is running on the CPU, nothing else is, including the operating system. So how can the OS (which is not running) manage processes (which are running)?

A mechanism called Limited Direct Execution is used. The elements of this are:

Policy: Scheduling

Some Definitions:

Turnaround Time: The time from when the job entered the system to when it completes

Response Time: The time from when the job enters the system to when it first starts executing.

Different scheduling policies yield different results for these two metrics. In general, policies that are considered “fair” are terrible for turnaround time but good for response time.

Policies that are considered “efficient” are great at turnaround time, but terrible at response time.

Multi-Level-Feedback Queue Scheduler

The crux of the problem: “How can we design a scheduler that both minimizes response time for interactive jobs while also minimizing turnaround time without previous knowledge of job length?

We do so by coming up with the concept of job priorities.

Here are the rules:

  1. If Priority(A) > Priority(B), A runs (B does not)
  2. If Priority(A) = Priority(B), A & B run in round-robin fashion using the time slice of the given queue.
  3. When a job enters the system, it is placed at the highest priority (the topmost queue).
  4. Once a job uses up its time allotment at a given level (regardless of how many times it has given up the CPU), its priority is reduced (i.e., it moves down one queue).
  5. After some time period S, move all the jobs in the system to the topmost queue

The Pros:

The Cons:

Lottery Scheduling

Exactly what it sounds like. We pick the next process to run at random with the added twist that we can assign “weights” to certain processes so they are more likely to be picked.

The longer the processes run for the more “fair” this scheduling is as each process runs for exactly the proportion of time its weights corresponds to. These weights can be referred to as “tickets”. The more tickets a process has, the more likely it is to be picked when the scheduler is deciding what job to run next.

The Pros:

The Cons:

Aside: Linux uses a variant of this and calls it the CFS (Completely Fair Scheduler) As each process runs, it accumulates vruntime. Once a time-slice is over, the scheduler simply picks the process with the lowest vruntime.

The time-slice is determined by another knob called sched_latency, which is (usually) the period over which the scheduler will be completely fair. It takes this value and divides it by the number of processes, which will give it the time-slice to use.

Yet another knob is used to make sure time-slices are not too short. This is the min_granularity, which is the minimum amount of time a time-slice should be. If the set sched_latency implies a value lower than min_granularity, it is set to min_granularity instead.

Virtualizing Memory

We are not yet done with virtualization though. The OS also needs to virtualize any shared hardware that processes might want to use. This includes memory. The virtual memory manager is responsible for giving processes the illusion that they have a sequential block of memory to themselves. This is called the “address space” for process.

This goal of this illusion is easy of use and protection.
It would be terrible to program in an environment where you needed to make sure you did not overwrite some other program’s memory.
It would also be terrible if other programs could overwrite your own memory.

In the early days of computing, this did not matter much. Computers only needed to only run a few jobs non-interactively. But humans are demanding and eventually we wanted our computers to be interactive! we wanted them fast and capable of running multiple programs at the same time! and they must support multiple concurrent users! .__. So OS designers had no choice. Virtual memory it is. Gone are the simple days

Mechanism: Memory Allocation Api

Generally, when running a program, there are two types of memory “stack” and “heap”. The underlying operating system does not make this distinction. It’s all just memory.
However most programming languages introduce this concept to make a distinction between short-lived and long-lived memory.

Short-lived memory or stack memory only lasts within a function call.
Long-lived memory or heap memory can last between function invocations.

Stack memory is usually automatically allocated AND de-allocated for you. Heap memory however, you need to take care of yourself.

In UNIX/C programs, two methods are available for allocating heap memory

These are not system calls. They are provided by the C standard library as easier interfaces to the brk and sbrk system calls.

Mechanism: Address Translation

The idea is the same with limited directed execution. The operating system must partner with the hardware to be able to provide a good memory interface to programs.

Specifically, running programs are made to think they have a continuous chunk of memory starting at 0 which they can use. Perhaps unsurprisingly, this is not actually the case.

If multiple programs are to share memory, the OS needs to be able to place them at different spots in memory, which means the addresses the programs use will be wrong. To fix this issue, the OS uses address translation or base and bounds.

The idea is that the running program thinks it is accessing address 128 to get its next instruction. However, the next instruction for that program might not be at physical address 128 because of where the program was placed in memory. So the Operating System with great help from the hardware translates that 128 virtual address its physical address.

This is done by recording a base and bounds for every program. We can think of the base as the offset from 0 to where the program actually resides in memory. So when a program whose base is 10000 tries to access memory location 128, the hardware will actually translate that to the physical memory location 10128. The bounds helps enforce the rule that processes can only access their own memory. If a process tries to access memory greater than the bounds, the hardware can run a handler (set up by the OS) that will kill the offending process.

Mechanism: Segmentation

If programmers were content creatures, we would be be content with having base and bounds registers. But no. We must have more flexibility. We cannot simply give each process a physical block of continuous address space! We need to be able to partition it and place its parts wherever we want, so we can be more memory-efficient and avoid wasted space!

And that is what happened and they called it “segmentation” and it added a few more registers to mark where different segments started, stopped and the direction in which they grew.

This is also the origin of the “segmentation fault” error we see in programs. It happens when we try to access memory in a segment that does not belong to us.

Free Memory Management

Free memory management goes hand in hand with segmentation. Since the address space of a process is not one continuous range of memory, we need to come up with clever techniques for keeping track of free space.

In managing free space, there are a few considerations.

  1. What is the interface we will presenting to the user? e.g. what is the process through which they can request memory? If the user is finished using a chunk of memory, is there a way they can signal that they do not need said chunk any longer? If so, how?
  2. How will we map virtual addresses to physical ones?
  3. How will we implement coalescing and splitting?
    • Splitting: When a user requests some free space, the allocator will find some chunk that can satisfy the request. It will break this chunk into two. The first part will be returned back to the caller and the second will be placed back into the free list.
    • Coalescing: If a user no longer has need for a chunk of memory, the allocator should make sure to store neighboring chunks of memory as one big chunk instead of separate ones.
    • These two techniques help to minimize external fragmentation – when free space gets chopped up into little pieces subsequent requests might fail to find one continuous chunk of memory that satisfies the demand.

Mechanism: Paging

With segmentation, we chop up free space into different size chunks using different policies to manage the fragmentation that occurs.
Thankfully, not all computer scientists went crazy with using segmentation. We have mostly moved to using a different mechanism for managing memory – paging. This is a much simpler technique where memory is represented as fixed-size chunks (or pages). In the context of physical memory, we use the term page frames. In the context of virtual memory (process address space), we use the term virtual pages, or just pages.

For this to work, each process needs to have a page table ~ a mapping from virtual pages to physical ones. To translate a virtual address to a physical one, we split the address into two parts: a virtual page number (VPN) and an offset. The vpn lets us know what virtual page the memory address is in and the offset is…well, the offset into that virtual page.
This works well because given a continuous process address space, we can designate a portion of a the higher-order bits to represent the page number. As we move throughout the address space the page number/offset works because higher order bits get incremented slower than lower order bits, and when they do, the lower order bits “reset” to zero. This works both in binary and decimal. Food for thought.

So the page table is really just a data structure for mapping virtual page numbers to physical page numbers. In practice, we use it to store some other information about pages though:

When you really think about it though, you realize that page tables can get big. We can no longer store the values we need to translate addresses in registers, we have to use memory. Another pretty big downside to this is that every memory access incurs more costs than usual because we have to make an additional fetch to get the page table from memory. More fetches might be necessary depending on how things are setup.

Nevertheless, i’m still a much bigger fan of paging that segmentation. It seems much “saner”, the right thing to do.

Translation Lookaside Buffers

Paging on its own is slow because memory access is costly (relative to the cpu speed). So for paging to be practical, we need to find a way to deal with the page-table access penalty we get.

One of such ways is using an Address Translation Cache (for historical? reasons, it is called a Translation Lookaside Buffer). This will be a small, but fast area close to the cpu that will be used to store recent page table translations.
The first time a process tries to translate a virtual page table to a physical one, we will have a TLB miss because the translation is not in the cache yet. So we grit our teeth, and fetch the page table from memory as usual. We then add that translation to the TLB. On subsequent requests to translate that page table, its a TLB hit because the translation exists in cache.

This mechanism works quite well because of spatial locality (if a process accesses a memory location, it is likely to access nearby memory locations) and temporal locality (if a process accesses a memory location, it is likely to access it again in the near future). Loading a page table translation into the TLB means that subsequent address translation that fit within that page table will be TLB hits and thus fast. WIN.

If it’s easy, you are missing something. That’s my new motto for Operating System things. Speaking of which, we are missing something. Context switches. Since each process has its own page table, processes cannot share the TLB. UGH. There are a few ways to deal with this. One is to flush the TLB on a context switch, so that each process starts with a fresh TLB. The downside of this is that each process starts with a fresh TLB, so all the caching work just goes down the drain. Another approach is to allow multiple processes to use the TLB, but differentiate which entry belongs to which process by using an “Address Space Identifier” or ASI. The downside of this is that processes need to share the already small TLB, so even fewer translations can be cached.

Then there is the issue of cache eviction. As the quote goes “There are only two hard things in Computer Science: cache invalidation and naming things”. The TLB will eventually get full, but subsequent translations will still need to be saved. How do we decide which entry to evict? TBD.

Advanced Paging

Do not let the word “advanced” scare you off. Stay. It’s “fun”.
Anyways, the previous section on the TLB was about how we can avoid the expensive additional calls to memory when using the paging technique for memory virtualization (yes, this is still about memory virtualization. See the hole we have dug for ourselves?).

If you recall, paging actually had two problems. We addressed the first one ~ we reduce the number of calls to memory by introducing caching. But we have yet to address the second one: Page tables get big.

Page tables get big because we need a page table entry for each page within a process address space, whether it is being used by the process or not. Lots of wasted space.
One way to reduce the space taken up by the page table is to increase the page size. This means less pages, which means less entries, which means smaller page table? Let’s not get too hasty, my simple-minded friend. Bigger page sizes now means that a process might be given more memory that it needs. This leads to “internal fragmentation” ~ applications end up allocating pages but only using small bits and pieces of each, filling up memory with barely used areas.

Out of the many other solutions, I will just point out three, because if you really think about it, this is not a textbook, but notes about one.

Swap space

Recall that one of the reasons virtualization exists in the first place is to give the illusion that interacting with the hardware is easier that it actually is. This is a good thing. Yet another thing we need to figure out to make such an illusion seamless is address the case when a process needs an address space greater than what can fit into memory.

Thankfully, this magic is not all that complicated. We have a small, but fast TLB (address translation cache). We have a bigger but slower memory (RAM). Now it’s time to introduce the much bigger, but much slower “disk”. This could be a hard drive (HD) or a solid state drive (SSD). It’s big, permanent and slow storage. If there is no space in memory to fit an address space, and THE SHOW MUST GO ON, we have no option but to start throwing address spaces onto disk.

This process of swapping pages from memory to disk and vice versa is controlled by the present bit in a process’s page table. The bit will tell us if that page is present in memory or not. This process must be done carefully though. If memory is full, we need a way to choose which pages to move to disk. Ideally, these would be pages that are not frequently accessed.

The act of moving pages in and out of swap space is done by the swap daemon or page daemon. This is a special process the operating system will run when it notices that there are fewer than a certain number of pages in memory. This “certain number” is called the “Low Watermark”. The special process will run until there are at least another certain number of pages in memory. This “another certain number” is called the “High Watermark”.

When we are low on memory, how does the OS decide which pages to evict?

Given that main memory holds a subset of all the pages in the system, it can be viewed as a sort of cache. Our goal is to minimize cache misses (or maximize cache hits). The number of cache hits/misses lets us calculate the AMAT (Average Memory Access Time) which is a metric we can use to compare how different memory swap policies compare to each other.

The formula for AMAT is straightforward: AMAT = "Time to access memory" + ("Probability of a page fault" * "Time to access disk")

What this formula means is that a page fault is costly because accessing disk after a page fault takes A LOT more time to do than accessing memory. I do not know exactly how much longer it takes, but the difference is pretty huge. Do not take it lightly. This is why reducing the probability of a page fault is very important.