Operating System Internals

I spend a lot of time using computers, but I am still scared of what goes on beneath the covers.
To change that, I am following along with the free Operating Systems: Three Easy Pieces book, offered by Computer Science professors at the University of Wisconsin-Madison.

Contents

  1. Introduction
  2. Virtualization
    1. Virtualizing the CPU
      1. Mechanisms and policies
      2. Process creation
      3. Limited direct execution
      4. Scheduling
    2. Virtualizing memory
      1. Memory allocation API
      2. Address translation
      3. Segmentation
      4. Free memory management
      5. Paging
      6. Translation lookaside buffers
      7. Advanced paging
      8. Swap space
  3. Concurrency
    1. Introduction to threads
    2. Thread API
    3. Thread locking
    4. Concurrent data structures
    5. Condition variables
    6. Semaphores
    7. Concurrency bugs
    8. Event based concurrency
  4. Persistence

Introduction

When we say a program is "running", we mean that it is doing the following over and over again:

  1. Fetch an instruction from memory
  2. Decode it (figure out which instruction it is)
  3. Execute it (do what the instruction says)

This is the Von Neumann model of computing. That. Is. It.

However, just having a single program running at a time is boring. As it stands, this is a simple batch-processing system: you cannot run multiple programs at once, and you cannot interact with the system while it is running your program. We want more. We need more.

The body of software that allows us to interact with our laptops in the way we do is called the operating system. It enables multiple programs to run seemingly at the same time -- giving us the desktop "environment" we are used to.

As I write this in a text editor, my browser is open to the page containing this lesson, music is playing from another browser tab, and a status bar at the bottom shows the time, volume, date, battery, and the status of my wireless connection. I also have another workspace with a terminal window open. None of this would be possible without the wild, wild things the operating system does behind the scenes while programs are seemingly just fetching, decoding and executing instructions.

So how does an operating system (or OS) do this magic? Three easy pieces: virtualization, concurrency and persistence.

Virtualization

Virtualization: "Sharing one peach among many peach-eaters while making them think they each have a peach to themselves"

Virtualizing the CPU

Program -- A lifeless thing, sitting on disk: a bunch of instructions and maybe some data, waiting to spring into action.

Process -- A running program.

It is the operating system that takes a program and gets it running, transforming it into something useful. The operating system needs a CPU to "run" these programs on.

A usable computer should be able to run many programs at a time. So how do we create the illusion of having multiple CPUs for these programs to run on? How do we give the illusion of access to the other physical resources the computer might be connected to? The answer is clearly "virtualization"...but how does the OS do that?

Mechanisms and policies

Mechanisms are low-level operations, methods or protocols to implement a needed piece of functionality. They represent the step-by-step details of how to do something.

Policies are the algorithms for making some kind of decision. For example, deciding which process to run next.

Back to processes. Some terms:

Machine state: The parts of the machine that are important to the execution of a running process. Examples are the process's memory (its address space), its registers (including special ones like the program counter and stack pointer), and I/O information such as the files it has open.

Process API: The way you expect to be able to interact with a process. For example, we should be able to create, destroy, and wait for a process. Other operations might also be possible.

Process List: Data structure that the OS uses to track information about processes that are running, ready or blocked. The individual members in this list are sometimes called PCBs (process control blocks) or process descriptors.

Mechanism: Process creation

To have multiple processes running on the same computer, the operating system must be able to create them. On UNIX-like systems, this is the main application programming interface for process creation:

  • fork() -- creates a new process that is an (almost exact) copy of the calling one
  • exec() -- replaces the calling process's program with a new one
  • wait() -- lets a parent wait for one of its child processes to finish
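
A minimal sketch of how these three calls fit together on a UNIX-like system (the program being exec'd and the error handling are just illustrative):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void) {
        pid_t pid = fork();              // create a near-identical copy of this process
        if (pid < 0) {                   // fork failed
            perror("fork");
            exit(1);
        } else if (pid == 0) {           // child: replace our image with a new program
            char *args[] = {"ls", "-l", NULL};
            execvp(args[0], args);       // only returns if exec failed
            perror("execvp");
            exit(1);
        } else {                         // parent: wait for the child to finish
            int status;
            waitpid(pid, &status, 0);
            printf("child %d exited\n", (int)pid);
        }
        return 0;
    }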

Mechanism: Limited direct execution

When a process is running on the CPU, nothing else is, including the operating system. So how can the OS (which is not running) manage processes (which are running)?

A mechanism called Limited Direct Execution is used. The elements of this are:

  • Run the program directly on the CPU, but in a restricted "user mode" that cannot perform privileged operations
  • System calls: when a process needs a privileged operation, it traps into the kernel, which does the work and then returns control
  • A trap table, set up by the OS at boot time, so the hardware knows which OS code to run when a trap or interrupt occurs
  • A timer interrupt that periodically hands control back to the OS, so it can switch to another process (a context switch) even if the running process never gives up the CPU

Policy: Scheduling

Some Definitions:

Turnaround Time: The time from when the job enters the system to when it completes.

Response Time: The time from when the job enters the system to when it first starts executing.

Different scheduling policies yield different results for these two metrics. In general, policies that are considered "fair" are terrible for turnaround time but good for response time.

Policies that are considered "efficient" are great at turnaround time, but terrible at response time.
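
A tiny illustration of the two metrics, using made-up jobs run back to back in FIFO order:

    #include <stdio.h>

    // A job arrives at `arrival` and needs `runtime` units of CPU.
    struct job { const char *name; double arrival; double runtime; };

    int main(void) {
        // Three jobs, all arriving at t=0, run in FIFO order (made-up numbers).
        struct job jobs[] = {{"A", 0, 10}, {"B", 0, 10}, {"C", 0, 10}};
        double now = 0;
        for (int i = 0; i < 3; i++) {
            double start = now;                      // FIFO: starts when the previous job ends
            double finish = start + jobs[i].runtime;
            printf("%s: response = %.1f, turnaround = %.1f\n",
                   jobs[i].name,
                   start - jobs[i].arrival,          // response: arrival -> first run
                   finish - jobs[i].arrival);        // turnaround: arrival -> completion
            now = finish;
        }
        return 0;
    }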

Multi-Level-Feedback Queue Scheduler

The crux of the problem: "How can we design a scheduler that both minimizes response time for interactive jobs while also minimizing turnaround time without previous knowledge of job length?"

We do so by coming up with the concept of job priorities.

Here are the rules:

  1. If Priority(A) > Priority(B), A runs (B does not)
  2. If Priority(A) = Priority(B), A & B run in round-robin fashion using the time slice of the given queue.
  3. When a job enters the system, it is placed at the highest priority (the topmost queue).
  4. Once a job uses up its time allotment at a given level (regardless of how many times it has given up the CPU), its priority is reduced (i.e., it moves down one queue).
  5. After some time period S, move all the jobs in the system to the topmost queue.

The Pros:

  • It needs no a-priori knowledge of how long jobs will run; it learns by observing how each job behaves
  • Interactive jobs that give up the CPU often stay at high priority, so response time is good
  • Thanks to the periodic priority boost, long-running jobs still make progress and cannot starve

The Cons:

  • Without rule 4, a job can "game" the scheduler by yielding just before its time slice expires in order to stay at high priority
  • There are a lot of knobs to tune: how many queues, the time slice for each queue, and how often to boost priorities (the period S)

Lottery Scheduling

Exactly what it sounds like. We pick the next process to run at random with the added twist that we can assign "weights" to certain processes so they are more likely to be picked.

The longer the processes run, the more "fair" this scheduling becomes, as each process's share of CPU time approaches the proportion its weight corresponds to. These weights can be referred to as "tickets". The more tickets a process has, the more likely it is to be picked when the scheduler is deciding what job to run next.
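
A minimal sketch of the ticket-drawing step (the process names and ticket counts are made up):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    // One entry per process: its name and how many tickets it holds.
    struct proc { const char *name; int tickets; };

    // Pick the next process to run: draw a random winning ticket and walk the
    // list until the running ticket count passes the winning number.
    const char *lottery_pick(struct proc *procs, int n, int total_tickets) {
        int winner = rand() % total_tickets;
        int counter = 0;
        for (int i = 0; i < n; i++) {
            counter += procs[i].tickets;
            if (counter > winner)
                return procs[i].name;
        }
        return NULL; // unreachable if total_tickets is correct
    }

    int main(void) {
        srand(time(NULL));
        struct proc procs[] = {{"A", 75}, {"B", 20}, {"C", 5}}; // weights are made up
        for (int i = 0; i < 10; i++)
            printf("next to run: %s\n", lottery_pick(procs, 3, 100));
        return 0;
    }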

The Pros:

  • It is very simple to implement: all you need is a decent random number generator and a ticket count per process
  • There is little per-process bookkeeping, and randomness avoids pathological corner-case behaviors

The Cons:

  • It is only probabilistically fair; over short runs a process can get noticeably more or less than its share
  • It is not obvious how tickets should be assigned to jobs in the first place

Aside: Linux uses a variant of this idea called CFS (the Completely Fair Scheduler). As each process runs, it accumulates vruntime. Once a time-slice is over, the scheduler simply picks the process with the lowest vruntime.

The time-slice is determined by another knob called sched_latency, which is (usually) the period over which the scheduler will be completely fair. It takes this value and divides it by the number of processes, which will give it the time-slice to use.

Yet another knob is used to make sure time-slices do not get too short. This is min_granularity, the minimum length a time-slice can be. If dividing sched_latency by the number of processes would give a value lower than min_granularity, the time-slice is set to min_granularity instead.
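
A rough sketch of how the time-slice falls out of those two knobs (the numbers are illustrative, not Linux's actual defaults):

    #include <stdio.h>

    // Split the fairness period across processes, but never below the minimum.
    double time_slice_ms(double sched_latency_ms, double min_granularity_ms, int nproc) {
        double slice = sched_latency_ms / nproc; // divide the latency period evenly
        if (slice < min_granularity_ms)          // clamp so slices don't get too short
            slice = min_granularity_ms;
        return slice;
    }

    int main(void) {
        double sched_latency = 48.0, min_granularity = 6.0; // example values
        for (int n = 2; n <= 16; n *= 2)
            printf("%2d processes -> %.1f ms slice\n",
                   n, time_slice_ms(sched_latency, min_granularity, n));
        return 0;
    }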

Virtualizing memory

We are not yet done with virtualization though. The OS also needs to virtualize any shared hardware that processes might want to use, including memory. The virtual memory manager is responsible for giving each process the illusion that it has a contiguous block of memory to itself. This is called the process's "address space".

The goal of this illusion is ease of use and protection.
It would be terrible to program in an environment where you needed to make sure you did not overwrite some other program's memory.
It would also be terrible if other programs could overwrite your own memory.

In the early days of computing, this did not matter much. Computers only needed to run a few jobs non-interactively. But humans are demanding and eventually we wanted our computers to be interactive! We wanted them fast and capable of running multiple programs at the same time! And they must support multiple concurrent users! .__. So OS designers had no choice. Virtual memory it is. Gone are the simple days.

Mechanism: Memory allocation API

Generally, when running a program, there are two types of memory: "stack" and "heap". The underlying operating system does not make this distinction. It's all just memory.
However most programming languages introduce this concept to make a distinction between short-lived and long-lived memory.

Short-lived memory or stack memory only lasts within a function call.
Long-lived memory or heap memory can last between function invocations.

Stack memory is usually automatically allocated AND de-allocated for you. Heap memory however, you need to take care of yourself.

In UNIX/C programs, two methods are available for managing heap memory:

  • malloc() -- asks the allocator for a chunk of heap memory of a given size
  • free() -- hands a previously allocated chunk back so it can be reused

These are not system calls. They are provided by the C standard library as easier interfaces to the brk and sbrk system calls.
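
A minimal usage sketch of the two calls:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        // Ask the allocator for enough heap space to hold 10 ints.
        int *arr = malloc(10 * sizeof(*arr));
        if (arr == NULL) {            // allocation can fail; always check
            perror("malloc");
            return 1;
        }
        for (int i = 0; i < 10; i++)
            arr[i] = i * i;
        printf("arr[9] = %d\n", arr[9]);
        free(arr);                    // hand the chunk back; using arr after this is a bug
        return 0;
    }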

Mechanism: Address translation

The idea is the same as with limited direct execution: the operating system must partner with the hardware to provide a good memory interface to programs.

Specifically, running programs are made to think they have a contiguous chunk of memory starting at address 0 which they can use. Perhaps unsurprisingly, this is not actually the case.

If multiple programs are to share memory, the OS needs to be able to place them at different spots in physical memory, which means the addresses a program uses internally will not match where it actually lives. To fix this, the OS uses address translation, starting with a simple scheme called base and bounds.

The idea is that the running program thinks it is accessing address 128 to get its next instruction. However, the next instruction for that program might not be at physical address 128 because of where the program was placed in memory. So the operating system, with great help from the hardware, translates that virtual address 128 into its physical address.

This is done by recording a base and a bounds for every program. We can think of the base as the offset from 0 to where the program actually resides in memory. So when a program whose base is 10000 tries to access memory location 128, the hardware will actually translate that to physical memory location 10128. The bounds helps enforce the rule that processes can only access their own memory: if a process tries to access memory beyond the bounds, the hardware raises an exception and runs a handler (set up by the OS) that will typically kill the offending process.
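
A sketch of that translation step in C, assuming the OS has already loaded the base and bounds values for the running process (the bounds value here is made up):

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>

    // What the hardware does on every memory access under base and bounds.
    uint64_t translate(uint64_t vaddr, uint64_t base, uint64_t bounds) {
        if (vaddr >= bounds) {                 // outside the process's address space
            fprintf(stderr, "out of bounds access: %llu\n", (unsigned long long)vaddr);
            exit(1);                           // in reality the hardware traps into the OS
        }
        return base + vaddr;                   // otherwise, just add the base
    }

    int main(void) {
        uint64_t base = 10000, bounds = 4096;  // base from the example above; bounds is made up
        printf("virtual 128 -> physical %llu\n",
               (unsigned long long)translate(128, base, bounds));
        return 0;
    }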

Mechanism: Segmentation

If programmers were content creatures, we would be content with having base and bounds registers. But no. We must have more flexibility. We cannot simply give each process one contiguous block of physical address space! We need to be able to partition it and place its parts wherever we want, so we can be more memory-efficient and avoid wasted space!

And that is what happened and they called it "segmentation" and it added a few more registers to mark where different segments started, stopped and the direction in which they grew.

This is also the origin of the "segmentation fault" error we see in programs. It happens when we try to access memory in a segment that does not belong to us.

Free memory management

Free memory management goes hand in hand with segmentation. Since the address space of a process is no longer one contiguous range of memory, we need to come up with clever techniques for keeping track of free space.

In managing free space, there are a few considerations.

  1. What is the interface we will be presenting to the user? e.g. what is the process through which they can request memory? If the user is finished using a chunk of memory, is there a way they can signal that they do not need said chunk any longer? If so, how?
  2. How will we map virtual addresses to physical ones?
  3. How will we implement coalescing and splitting?
    • Splitting: When a user requests some free space, the allocator will find a chunk that can satisfy the request and break it into two. The first part is returned to the caller and the remainder is placed back on the free list (see the sketch after this list).
    • Coalescing: When a user returns a chunk of memory, the allocator should merge it with any neighboring free chunks so they are stored as one big chunk instead of several small ones.
    • These two techniques help to minimize external fragmentation -- when free space gets chopped up into little pieces, subsequent requests might fail to find one contiguous chunk of memory that satisfies the demand.
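
Here is the promised sketch: a toy free list doing first-fit allocation with splitting (the chunk addresses and sizes are made up, and coalescing is left out to keep it short):

    #include <stddef.h>
    #include <stdio.h>

    // A toy free list: each node describes one free chunk (start address + size).
    struct chunk { size_t addr; size_t size; struct chunk *next; };

    // First-fit allocation with splitting: find a chunk big enough, carve the
    // request off the front, and leave the remainder on the free list.
    size_t alloc(struct chunk *free_list, size_t want) {
        for (struct chunk *c = free_list; c != NULL; c = c->next) {
            if (c->size >= want) {
                size_t addr = c->addr;
                c->addr += want;      // split: the remainder stays free
                c->size -= want;
                return addr;
            }
        }
        return (size_t)-1;            // no chunk large enough (external fragmentation!)
    }

    int main(void) {
        struct chunk c2 = {200, 100, NULL};   // free chunk covering [200, 300)
        struct chunk c1 = {0, 50, &c2};       // free chunk covering [0, 50)
        // A request for 80 bytes cannot use the first chunk, so it splits the second:
        printf("got address %zu\n", alloc(&c1, 80));   // prints 200; [280, 300) stays free
        return 0;
    }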

Mechanism: Paging

With segmentation, we chop up free space into variable-sized chunks, using different policies to manage the fragmentation that occurs.
Thankfully, not all computer scientists went crazy with segmentation. We have mostly moved to a different mechanism for managing memory -- paging. This is a much simpler technique where memory is divided into fixed-size chunks (or pages). In the context of physical memory, we use the term page frames. In the context of virtual memory (a process's address space), we use the term virtual pages, or just pages.

For this to work, each process needs to have a page table ~ a mapping from virtual pages to physical ones. To translate a virtual address to a physical one, we split the address into two parts: a virtual page number (VPN) and an offset. The VPN tells us which virtual page the memory address is in, and the offset is...well, the offset into that virtual page.
This works well because, given a contiguous process address space, we can designate the higher-order bits to represent the page number. As we move through the address space, the page-number/offset split works because higher-order bits get incremented more slowly than lower-order bits, and when they do, the lower-order bits "reset" to zero. This works both in binary and decimal. Food for thought.
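
A small sketch of that split, assuming 4 KB pages (so the low 12 bits are the offset):

    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_SIZE   4096                 // assume 4 KB pages
    #define OFFSET_BITS 12                   // log2(4096)

    int main(void) {
        uint32_t vaddr = 0x00003ABC;                 // an arbitrary 32-bit virtual address
        uint32_t vpn    = vaddr >> OFFSET_BITS;      // high-order bits: which virtual page
        uint32_t offset = vaddr & (PAGE_SIZE - 1);   // low-order bits: where inside that page
        printf("vaddr 0x%08x -> VPN %u, offset 0x%03x\n", vaddr, vpn, offset);
        // Translation then looks up the VPN in the page table to get a physical
        // frame number (PFN); the physical address is (PFN << OFFSET_BITS) | offset.
        return 0;
    }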

So the page table is really just a data structure for mapping virtual page numbers to physical page (frame) numbers. In practice, we use it to store some other information about pages too:

  • A valid bit: is this part of the address space in use at all?
  • Protection bits: may the page be read, written, or executed?
  • A present bit: is the page in physical memory or swapped out to disk?
  • A dirty bit: has the page been modified since being brought into memory?
  • A reference (accessed) bit: has the page been used recently? (handy when deciding what to evict)

When you really think about it though, you realize that page tables can get big. We can no longer store the values we need to translate addresses in registers; we have to keep them in memory. Another pretty big downside is that every memory access now incurs extra cost, because we have to make an additional fetch to get the page table entry from memory. More fetches might be necessary depending on how things are set up.

Nevertheless, I am still a much bigger fan of paging than segmentation. It seems much "saner", the right thing to do.

Translation lookaside buffers

Paging on its own is slow because memory access is costly (relative to the CPU speed). So for paging to be practical, we need to find a way to deal with the page-table access penalty.

One such way is to use an address translation cache (for historical? reasons, it is called a Translation Lookaside Buffer, or TLB). This is a small but fast area close to the CPU that stores recent page table translations.
The first time a process tries to translate an address on a given virtual page, we get a TLB miss because the translation is not in the cache yet. So we grit our teeth, walk the page table in memory as usual, and then add that translation to the TLB. On subsequent translations for that page, it's a TLB hit because the translation already exists in the cache.

This mechanism works quite well because of spatial locality (if a process accesses a memory location, it is likely to access nearby memory locations) and temporal locality (if a process accesses a memory location, it is likely to access it again in the near future). Loading a translation into the TLB means that subsequent address translations that fall within that page will be TLB hits and thus fast. WIN.

If it's easy, you are missing something. That's my new motto for Operating System things. Speaking of which, we are missing something: context switches. Since each process has its own page table, processes cannot naively share the TLB. UGH. There are a few ways to deal with this. One is to flush the TLB on a context switch so that each process starts with a fresh TLB; the downside is that all the caching work goes down the drain on every switch. Another approach is to let multiple processes use the TLB at once, but tag each entry with the process it belongs to using an "Address Space Identifier" (ASID); the downside is that processes now share the already small TLB, so even fewer translations can be cached per process.
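
A rough sketch of what a (software-managed) TLB entry and lookup might look like with ASIDs; the field names and TLB size are illustrative, since real TLBs live in hardware:

    #include <stdbool.h>
    #include <stdint.h>

    struct tlb_entry {
        bool     valid;
        uint16_t asid;   // which address space (process) this translation belongs to
        uint32_t vpn;    // virtual page number
        uint32_t pfn;    // physical frame number
    };

    #define TLB_SIZE 64
    static struct tlb_entry tlb[TLB_SIZE];

    // Returns true on a TLB hit and fills in *pfn; false means a miss, and the
    // page table would have to be walked in memory (and the entry then inserted).
    bool tlb_lookup(uint16_t asid, uint32_t vpn, uint32_t *pfn) {
        for (int i = 0; i < TLB_SIZE; i++) {
            if (tlb[i].valid && tlb[i].asid == asid && tlb[i].vpn == vpn) {
                *pfn = tlb[i].pfn;   // hit: translation already cached for this process
                return true;
            }
        }
        return false;                // miss: fall back to the page table
    }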

Then there is the issue of cache eviction. As the quote goes "There are only two hard things in Computer Science: cache invalidation and naming things". The TLB will eventually get full, but subsequent translations will still need to be saved. How do we decide which entry to evict? TBD.

Advanced paging

Do not let the word "advanced" scare you off. Stay. It's "fun".
Anyways, the previous section on the TLB was about how we can avoid the expensive additional calls to memory when using the paging technique for memory virtualization (yes, this is still about memory virtualization. See the hole we have dug for ourselves?).

If you recall, paging actually has two problems. We addressed the first one ~ we reduced the number of trips to memory by introducing caching. But we have yet to address the second one: page tables get big.

Page tables get big because we need a page table entry for each page within a process's address space, whether the process is using it or not. Lots of wasted space.
One way to reduce the space taken up by the page table is to increase the page size. This means fewer pages, which means fewer entries, which means a smaller page table? Let's not get too hasty, my simple-minded friend. Bigger pages mean a process might be given more memory than it needs. This leads to "internal fragmentation" ~ applications end up allocating pages but only using small bits and pieces of each, filling up memory with barely used areas.

Out of the many other solutions, I will just point out three, because if you really think about it, this is not a textbook, but notes about one:

  1. Hybrid paging and segmentation: keep a separate page table per segment, so unused regions of the address space need no entries at all
  2. Multi-level page tables: turn the page table into a tree, so whole ranges of invalid entries collapse into a single "not present" marker at the upper level
  3. Inverted page tables: keep one table with an entry per physical frame and search it (usually via a hash) to find which process/virtual page maps there

Swap space

Recall that one of the reasons virtualization exists in the first place is to give the illusion that interacting with the hardware is easier than it actually is. This is a good thing. Yet another thing we need to figure out to make this illusion seamless is what to do when a process needs an address space larger than what can fit into physical memory.

Thankfully, this magic is not all that complicated. We have a small but fast TLB (address translation cache). We have a bigger but slower memory (RAM). Now it's time to introduce the much bigger, but much slower, "disk". This could be a hard drive (HDD) or a solid state drive (SSD). It's big, permanent and slow storage. If there is no space in memory to fit everything, and THE SHOW MUST GO ON, we have no option but to start moving parts of those address spaces (pages) onto disk.

This process of swapping pages between memory and disk is tracked by the present bit in a process's page table. The bit tells us whether that page is currently in memory or not. This must be done carefully though: if memory is full, we need a way to choose which pages to move to disk. Ideally, these would be pages that are not frequently accessed.

The act of moving pages in and out of swap space is done by the swap daemon or page daemon. This is a special process the operating system runs when it notices that fewer than a certain number of pages are free in memory. This "certain number" is called the "Low Watermark". The daemon keeps evicting pages until at least some other number of pages are free. This "other number" is called the "High Watermark".

When we are low on memory, how does the OS decide which pages to evict?

Given that main memory holds a subset of all the pages in the system, it can be viewed as a sort of cache. Our goal is to minimize cache misses (or maximize cache hits). The hit/miss rate lets us calculate the AMAT (Average Memory Access Time), a metric we can use to compare different page-replacement policies.

The formula for AMAT is straightforward: AMAT = "Time to access memory" + ("Probability of a page fault" * "Time to access disk")

What this formula means is that a page fault is costly because accessing disk after a page fault takes A LOT more time to do than accessing memory. I do not know exactly how much longer it takes, but the difference is pretty huge. Do not take it lightly. This is why reducing the probability of a page fault is very important.
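
A quick back-of-the-envelope sketch of the formula, using illustrative (not measured) access times of roughly 100 ns for memory and 10 ms for disk:

    #include <stdio.h>

    int main(void) {
        double t_mem_ns  = 100.0;                    // illustrative RAM access time
        double t_disk_ns = 10.0 * 1000.0 * 1000.0;   // illustrative disk access time (10 ms)

        // AMAT = time to access memory + (probability of a page fault * time to access disk)
        double rates[] = {0.0, 0.0001, 0.001};       // hypothetical page-fault probabilities
        for (int i = 0; i < 3; i++) {
            double amat = t_mem_ns + rates[i] * t_disk_ns;
            printf("fault rate %.4f -> AMAT %.0f ns\n", rates[i], amat);
        }
        // Even a 0.1% fault rate makes the average access ~100x slower, which is
        // why keeping the fault probability tiny matters so much.
        return 0;
    }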

Additionally, we need to remember that if a page was modified while in memory, we need to write it back to disk before evicting it. We cannot simply discard it. Hence the dirty bit, which flags that a page in memory has been modified since it was swapped in. When such a page is being evicted, we need to write it to disk first (which is expensive). If a page is not dirty, no write is necessary and its frame can be reused immediately.

Lastly, thrashing is what happens when we simply do not have enough physical memory and are constantly swapping pages in and out of disk. Different approaches have been created to deal with this. Some operating systems implement mechanisms to limit the set of processes that will be run; this is called admission control. Others implement an out-of-memory killer that will choose a memory-intensive process and kill it. This is risky if it kills an important process (X.org on Linux, for example).

But let us be honest. The REAL solution to thrashing is "buy more memory".

Concurrency

To keep on with the peach analogy, imagine you have a group of people around a table full of peaches. Each person would like a peach. How do we get them all to get their peaches as fast as possible but without tripping over each other?

Introduction to threads

Up until now, we have implicitly assumed that a process (i.e. a running program) is sequential: it runs one instruction, then the next, then the next. While this is true, it is possible to provide yet another layer of virtualization: threads.
Why do this? First, our computers have evolved to have multiple processors, so software must evolve to make use of them. Second, having multiple threads in a program allows useful work to continue on one thread even while other threads are waiting for I/O. Third, all the threads in a single process share the same address space. (This is both a pro and a con, but that's a discussion for another time).

Apart from the shared address space part, threads and processes are similar. Both are things that don't actually exist, but the OS makes us believe they do. Both have data structures holding per-item state, like the registers being used and their values. Both are run on the CPU(s) at the whim of the hopefully smart scheduler. Both need to be context-switched.

Thread API

I was not going to have a section for this chapter, but the book had a very funny paragraph that I want to reproduce here. It is talking about thread creation, and the author just finished giving an example on how to perform it. Emphasis mine:

"And there it is! Once you create a thread, you really have another live executing entity, complete with its own call stack, running within the same address space as all the currently existing threads in the program. The fun thus begins!"

Thread locking

Thread concurrency comes with a host of problems because it introduces shared state. One such class of problem is the "data race": two threads access shared data at the same time and at least one of them is writing. What to do?

Let's backtrack a bit. The reason two threads accessing shared data at the same time is a problem is that they might not do so atomically. At any point while updating this piece of shared data, a thread might be pre-empted by the operating system scheduler, and another thread might run and try to update the same data, leading to potential corruption. The solution is to make sure that whenever a thread is executing some operation on shared data (i.e. executing in a critical section), it is the only thread doing so.

This is achieved via locks, also called mutexes. When a thread acquires a lock, it can execute the critical section, then release the lock once done. If other threads come around and try to enter the critical section, they will be unable to acquire the lock, and thus must wait until it is free before they can go any further.

Let's dig deeper then. How are locks implemented? Locks are supposed to guarantee that a certain operation will only be executed by one thread without interruption. How do we guarantee that the OS won't interrupt and schedule another thread to run that same operation before the first thread is done? Surprise, we will probably need some help from the hardware, in the form of atomic instructions such as:

  • Test-and-set (atomic exchange)
  • Compare-and-swap
  • Load-linked / store-conditional
  • Fetch-and-add
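
As one hedged example, here is roughly what a test-and-set spin lock looks like when built on C11 atomics (the book builds the same idea directly on the hardware instruction):

    #include <stdatomic.h>

    // A spin lock built on an atomic test-and-set (exchange) operation.
    typedef struct { atomic_flag held; } spinlock_t;

    spinlock_t my_lock = { ATOMIC_FLAG_INIT };   // starts out free

    void spin_lock(spinlock_t *l) {
        // atomic_flag_test_and_set atomically sets the flag and returns its old
        // value; if the old value was "set", someone else holds the lock, so spin.
        while (atomic_flag_test_and_set(&l->held))
            ;   // spin-wait (burns CPU; see the discussion below)
    }

    void spin_unlock(spinlock_t *l) {
        atomic_flag_clear(&l->held);   // release: the next test-and-set will succeed
    }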


All these approaches handle the atomic aspect of implementing a lock. However, they still involve spin locking. That is, when a thread is unable to acquire a lock implemented with one of the above techniques, it sits in a while loop. This is not efficient: multiple threads can get scheduled only to waste CPU cycles spinning in a while loop while the thread that actually holds the lock is not even running. Spin locking also means there is no guarantee that a thread won't starve if it gets scheduled and de-scheduled at just the right times (or wrong times, depending on your perspective). Solving this requires some more additions to the locking code. Look up park/unpark, Linux futexes and two-phase locks for examples of how the spin locking problem is solved.

Concurrent data structures

The next step after implementing basic locks is incorporating them into common data structures. This makes those data structures thread-safe ~ meaning they can be safely used when accessed by multiple threads.

As with many things in this course, there is a trade-off between simplicity and performance. Most of the time, the obvious implementation where you just lock the entire structure before accessing it does not scale well. So usually, we optimize the structure until it is "good enough". This is left undefined on purpose, because "good enough" depends on the context in which the data structure will be used. Sometimes, there is no need to add complexity for some slight increase in performance if the current performance works.
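
The canonical starting point is a counter protected by a single lock; a minimal sketch with pthreads (error checking omitted):

    #include <pthread.h>
    #include <stdio.h>

    // The simplest thread-safe data structure: a counter guarded by one lock.
    typedef struct {
        long            value;
        pthread_mutex_t lock;
    } counter_t;

    void counter_init(counter_t *c)      { c->value = 0; pthread_mutex_init(&c->lock, NULL); }
    void counter_increment(counter_t *c) { pthread_mutex_lock(&c->lock); c->value++; pthread_mutex_unlock(&c->lock); }
    long counter_get(counter_t *c)       { pthread_mutex_lock(&c->lock); long v = c->value; pthread_mutex_unlock(&c->lock); return v; }

    static counter_t shared;

    void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++)
            counter_increment(&shared);   // safe: only one thread inside at a time
        return NULL;
    }

    int main(void) {
        counter_init(&shared);
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("count = %ld (expected 200000)\n", counter_get(&shared));
        return 0;
    }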

RANDOM TIP:

Beware of control flow changes that lead to early function returns. Because many functions begin by acquiring a lock, allocating memory, or doing some other kind of initialization, when errors arise the code has to undo all of that state before returning, which is error prone.

Condition variables

In addition to synchronization via locks, threads will also need some sort of primitive to check if a condition is true before continuing execution. For example, a parent thread might wish to check if all its child threads have completed before moving on (a call to join() would need this).

The thing we are looking for is called a condition variable. In its simplest form, a thread checks a condition and, if it is false, calls wait() in order to be put to sleep until said condition changes. Another thread, responsible for making that condition true, will at some point call signal() when the desired condition is met. This wakes up the waiting thread.

Of course, nothing is ever simple in the OS world. Race conditions will crop up if we just use signal() and wait() as described; condition variables should always be used together with a lock (and a state variable recording whether the event has happened). The rule of thumb is "always hold the lock when signaling". If this weren't the case, we could easily fall into a situation where a parent thread spins off a child but is interrupted by the scheduler before calling wait(). The child calls signal(), and since nothing is waiting yet, just continues on and eventually returns. The parent then resumes, calls wait(), and waits forever. Sad.

Locks and condition variables are a gateway drug to concurrency. Using these two primitives, we can create programs that (try to) do useful work by using fine-grained synchronization between threads. One example is the bounded buffer or producer/consumer problem: a group of threads are designated as producers and another group as consumers, with a shared buffer between them. The producers place data into the buffer as long as it is not full, and the consumers read from the buffer whenever data is available.
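
A sketch of the bounded buffer with pthreads, following the usual rules (wait in a while loop, signal while holding the lock); the buffer size and item counts are arbitrary:

    #include <pthread.h>
    #include <stdio.h>

    #define MAX 8          // buffer capacity (arbitrary)

    static int buffer[MAX];
    static int fill = 0, use = 0, count = 0;

    static pthread_mutex_t mutex     = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

    void put(int value) {
        pthread_mutex_lock(&mutex);
        while (count == MAX)                       // buffer full: wait (re-check on wakeup!)
            pthread_cond_wait(&not_full, &mutex);
        buffer[fill] = value;
        fill = (fill + 1) % MAX;
        count++;
        pthread_cond_signal(&not_empty);           // signal while holding the lock
        pthread_mutex_unlock(&mutex);
    }

    int get(void) {
        pthread_mutex_lock(&mutex);
        while (count == 0)                         // buffer empty: wait
            pthread_cond_wait(&not_empty, &mutex);
        int value = buffer[use];
        use = (use + 1) % MAX;
        count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&mutex);
        return value;
    }

    void *producer(void *arg) { (void)arg; for (int i = 0; i < 20; i++) put(i); return NULL; }
    void *consumer(void *arg) { (void)arg; for (int i = 0; i < 20; i++) printf("got %d\n", get()); return NULL; }

    int main(void) {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }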

Start tangent:

Getting such programs right in the C language (which is what the book uses) is quite annoying though. First, there is no easy way to test whether the code is correct, because the error conditions in these programs depend on the order in which threads run rather than on the code the threads are executing. Second, finding these edge cases is a purely manual process; one needs to imagine different scheduler behaviors and how they might affect thread execution. For non-trivial programs, it is likely you will miss at least one such behavior. Third, if you do manage to figure out all the edge cases, your code will be a monster because of the sum of all the tweaks required to get things just right. I think I am finally convinced that I need to look elsewhere than C for a language to learn for "low-level" stuff.

End tangent

Semaphores

Semaphores are yet another synchronization construct, pretty similar to condition variables: a semaphore holds an integer value; sem_wait() decrements it and puts the caller to sleep if no "slots" are left, while sem_post() increments it and wakes a waiting thread if there is one. The initial value determines how the semaphore behaves (initialize it to 1 and you have a lock; to 0 and you have something you can use for ordering, much like a condition variable).

The Good:
Personally, I think semaphores are the most useful of the bunch, partly because they are the only construct I have actually used. In my case, I found you could very easily use one to throttle the number of threads concurrently issuing web requests to a server. Made my life very easy (see the sketch below).
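
A sketch of that throttling trick with POSIX semaphores; do_request() is a hypothetical stand-in for the actual web request, and the thread counts are made up:

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NUM_THREADS    10
    #define MAX_IN_FLIGHT   3   // at most 3 threads doing "requests" at once

    static sem_t slots;

    // Hypothetical stand-in for issuing a web request.
    void do_request(long id) {
        printf("thread %ld working\n", id);
        sleep(1);
    }

    void *worker(void *arg) {
        sem_wait(&slots);          // take a slot; blocks if 3 are already in flight
        do_request((long)arg);
        sem_post(&slots);          // give the slot back
        return NULL;
    }

    int main(void) {
        sem_init(&slots, 0, MAX_IN_FLIGHT);   // counting semaphore initialized to 3
        pthread_t threads[NUM_THREADS];
        for (long i = 0; i < NUM_THREADS; i++)
            pthread_create(&threads[i], NULL, worker, (void *)i);
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_join(threads[i], NULL);
        sem_destroy(&slots);
        return 0;
    }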

The OK:
It might be tempting to say that semaphores are a generalization of locks and condition variables, but that kind of thinking is dangerous. Implementing condition variables with semaphores has proven to be extremely tricky and error prone, to the point that it is rarely worth it if you can avoid it.

The Bad:
Validating the correctness of an implementation that makes use of locks, condition variables, or semaphores is rough. Following along with the book, I can generally find the obvious problem in the naive solutions pretty easily. Anything past that would require me spending a few hours, if not days, looking at and thinking about the code. And even then, you might still miss some. I just don't think it's worth it anymore. It does not seem responsible to build critical systems in languages whose only solution to managing concurrency is these three constructs. Rust and Erlang are looking more appealing.

Concurrency bugs

This chapter is more of a summary of a study on real-world concurrency bugs.
I read the paper, and it is mostly what you would expect. Most of these bugs happen because we cannot reason about concurrent programs in a fool-proof way. The paper classifies the bugs and their fixes into various categories (I would argue it does so arbitrarily, but what do I know? I don't have a PhD).
No notes will be taken on this section because I'm lazy and have no real interest in remembering these categories. I will note that this paper only reinforces my deep skepticism about sharing memory among threads. There has to be a better way.

Event based concurrency

To finish up the section on concurrency, the authors introduce an alternative to using threads: event-based concurrency. The idea is to sidestep all the headaches of thread synchronization by doing all the work on one thread, which sits in a continuous loop where it (a) listens for "events" and then (b) calls the appropriate handler for each event. This is usually called the event loop.
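
A skeleton of such a loop using select(), watching a single descriptor (stdin) to keep it short; real servers watch many sockets, but the structure is the same:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/select.h>

    int main(void) {
        for (;;) {
            fd_set readfds;
            FD_ZERO(&readfds);
            FD_SET(STDIN_FILENO, &readfds);

            // (a) block until at least one watched descriptor has an event
            if (select(STDIN_FILENO + 1, &readfds, NULL, NULL, NULL) < 0) {
                perror("select");
                return 1;
            }

            // (b) dispatch to the handler for each ready descriptor
            if (FD_ISSET(STDIN_FILENO, &readfds)) {
                char buf[256];
                ssize_t n = read(STDIN_FILENO, buf, sizeof(buf) - 1);
                if (n <= 0) break;                 // EOF or error: stop the loop
                buf[n] = '\0';
                printf("handler saw: %s", buf);    // handlers must be quick!
            }
        }
        return 0;
    }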

So is this a viable alternative to thread-based concurrency? Eh, I would not be so quick to jump aboard. The obvious downside is that an event handler cannot be a long-running operation, because it would halt the world. So it's not the best choice for programs that spend a long time doing "stuff". However, it is a good fit for programs that can be written as "reactions" to outside events. A user interface is one such example: it only needs to do anything when the user initiates an action.