Commit f53494c2 authored by rsc

DO NOT MAIL: xv6 web pages

Parent ee3f75f2
index.html: index.txt mkhtml
mkhtml index.txt >_$@ && mv _$@ $@
<html>
<head>
<title>OS Bugs</title>
</head>
<body>
<h1>OS Bugs</h1>
<p>Required reading: Bugs as Deviant Behavior
<h2>Overview</h2>
<p>Operating systems must obey many rules for correctness and
performance. Example rules:
<ul>
<li>Do not call blocking functions with interrupts disabled or spin
lock held
<li>Check for NULL results
<li>Do not allocate large stack variables
<li>Do not re-use already-allocated memory
<li>Check user pointers before using them in kernel mode
<li>Release acquired locks
</ul>
<p>In addition, there are standard software engineering rules, such as
using function results in consistent ways.
<p>These rules are typically not checked by a compiler, even though
they could be checked by a compiler, in principle. The goal of the
meta-level compilation project is to allow system implementors to
write system-specific compiler extensions that check the source code
for rule violations.
<p>The results are good: many new bugs found (500-1000) in Linux
alone. The paper for today studies these bugs and attempts to draw
lessons from them.
<p>Are kernel errors worse than user-level errors? That is, if we get
the kernel correct, will we no longer have system crashes?
<h2>Errors in JOS kernel</h2>
<p>What are unstated invariants in the JOS?
<ul>
<li>Interrupts are disabled in kernel mode
<li>Only env 1 has access to disk
<li>All registers are saved & restored on context switch
<li>Application code is never executed with CPL 0
<li>Don't allocate an already-allocated physical page
<li>Propagate error messages to user applications (e.g., out of
resources)
<li>Map pipe before fd
<li>Unmap fd before pipe
<li>A spawned program should have open only file descriptors 0, 1, and 2.
<li>Pass size sometimes in bytes and sometimes in blocks to a
given file system function.
<li>User pointers should be run through TRUP before being used by the kernel
</ul>
<p>Could these errors have been caught by metacompilation? Would
metacompilation have caught the pipe race condition? (Probably not,
it happens in only one place.)
<p>How confident are you that your code is correct? For example,
are you sure interrupts are always disabled in kernel mode? How would
you test?
<h2>Metacompilation</h2>
<p>A system programmer writes the rule checkers in a high-level,
state-machine language (metal). These checkers are dynamically linked
into an extensible version of g++, xg++. Xg++ applies the rule
checkers to every possible execution path of a function that is being
compiled.
<p>An example rule from
the <a
href="http://www.stanford.edu/~engler/exe-ccs-06.pdf">OSDI
paper</a>:
<pre>
sm check_interrupts {
   decl { unsigned } flags;
   pat enable  = { sti(); } | { restore_flags(flags); };
   pat disable = { cli(); };

   is_enabled: disable ==> is_disabled
             | enable  ==> { err("double enable") };
   ...
</pre>
A more complete version found 82 errors in the Linux 2.3.99 kernel.
<p>Common mistake:
<pre>
get_free_buffer ( ... ) {
    ....
    save_flags (flags);
    cli ();
    if ((bh = sh->buffer_pool) == NULL)
        return NULL;    /* bug: returns with interrupts still disabled */
    ....
}
</pre>
<p>(Figure 2 also lists a simple metarule.)
<p>Some checkers produce false positives, because of limitations of
both static analysis and the checkers, which mostly use local
analysis.
<p>How does the <b>block</b> checker work? The first pass is a rule
that marks functions as potentially blocking. After processing a
function, the checker emits the function's flow graph to a file
(including annotations and functions called). The second pass takes
the merged flow graph of all function calls, and produces a file with
all functions that have a path in the control-flow-graph to a blocking
function call. For the Linux kernel this results in 3,000 functions
that potentially could call sleep. Yet another checker like
check_interrupts checks if a function calls any of the 3,000 functions
with interrupts disabled. Etc.
<h2>This paper</h2>
<p>Writing rules is painful. First, you have to write them. Second,
how do you decide what to check? Was it easy to enumerate all
conventions for JOS?
<p>Insight: infer programmer "beliefs" from code and cross-check
for contradictions. If <i>cli</i> is always followed by <i>sti</i>,
except in one case, perhaps something is wrong. This simplifies
life because we can write generic checkers instead of checkers
that specifically check for <i>sti</i>, and perhaps we get lucky
and find other temporal ordering conventions.
<p>Do we know which case is wrong? The 999 times or the 1 time that
<i>sti</i> is absent? (No, this method cannot figure out what the correct
sequence is, but it can flag that something is weird, which in practice
is useful.) The method just detects inconsistencies.
<p>Is every inconsistency an error? No, some inconsistencies don't
indicate an error. If a call to function <i>f</i> is often followed
by call to function <i>g</i>, does that imply that f should always be
followed by g? (No!)
<p>Solution: MUST beliefs and MAYBE beliefs. MUST beliefs are
invariants that must hold; any inconsistency indicates an error. If a
pointer is dereferenced, then the programmer MUST believe that the
pointer is pointing to something that can be dereferenced (i.e., the
pointer is definitely not zero). MUST beliefs can be checked using
"internal inconsistencies".
<p>An aside: can zero pointers be detected at runtime?
(Sure, unmap the page at address zero.) Why is metacompilation still
valuable? (At runtime you will find only the null pointers that your
test code dereferenced, not all possible dereferences of null
pointers.) An even more convincing example for metacompilation is
tracking user pointers that the kernel dereferences. (Is this a MUST
belief?)
<p>MAYBE beliefs are invariants that are suggested by the code, but
they may be coincidences. MAYBE beliefs are ranked by statistical
analysis, and perhaps augmented with input about function names
(e.g., alloc and free are important). Is it computationally feasible
to check every MAYBE belief? Could there be much noise?
<p>What errors won't this approach catch?
<h2>Paper discussion</h2>
<p>This paper is best discussed by studying every code fragment. Most
code fragments are pieces of code from Linux distributions; these
mistakes are real!
<p>Section 3.1. what is the error? how does metacompilation catch
it?
<p>Figure 1. what is the error? is there one?
<p>Code fragments from 6.1. what is the error? how does metacompilation catch
it?
<p>Figure 3. what is the error? how does metacompilation catch
it?
<p>Section 8.3. what is the error? how does metacompilation catch
it?
</body>
</html>
<html>
<head>
<title>L9</title>
</head>
<body>
<h1>Coordination and more processes</h1>
<p>Required reading: remainder of proc.c, sys_exec, sys_sbrk,
sys_wait, sys_exit, and sys_kill.
<h2>Overview</h2>
<p>Big picture: more programs than processors. How to share the
limited number of processors among the programs? Last lecture
covered basic mechanism: threads and the distinction between process
and thread. Today we expand on that: how to coordinate the interactions
between threads explicitly, and some operations on processes.
<p>Sequence coordination. This is a different type of coordination
from mutual-exclusion coordination (whose goal is to make
actions atomic so that threads don't interfere). The goal of
sequence coordination is for threads to coordinate the sequences in
which they run.
<p>For example, a thread may want to wait until another thread
terminates. One way to do so is to have the thread run periodically,
let it check if the other thread terminated, and if not give up the
processor again. This is wasteful, especially if there are many
threads.
<p>With primitives for sequence coordination one can do better. The
thread could tell the thread manager that it is waiting for an event
(e.g., another thread terminating). When the other thread
terminates, it explicitly wakes up the waiting thread. This is more
work for the programmer, but more efficient.
<p>Sequence coordination often interacts with mutual-exclusion
coordination, as we will see below.
<p>The operating system literature has a rich set of primitives for
sequence coordination. We study a very simple version of condition
variables in xv6: sleep and wakeup, with a single lock.
<h2>xv6 code examples</h2>
<h3>Sleep and wakeup - usage</h3>
Let's consider implementing a producer/consumer queue
(like a pipe) that can be used to hold a single non-null char pointer:
<pre>
struct pcq {
  void *ptr;
};

void*
pcqread(struct pcq *q)
{
  void *p;

  while((p = q-&gt;ptr) == 0)
    ;
  q-&gt;ptr = 0;
  return p;
}

void
pcqwrite(struct pcq *q, void *p)
{
  while(q-&gt;ptr != 0)
    ;
  q-&gt;ptr = p;
}
</pre>
<p>Easy and correct, at least assuming there is at most one
reader and at most one writer at a time.
<p>Unfortunately, the while loops are inefficient.
Instead of polling, it would be great if there were
primitives saying ``wait for some event to happen''
and ``this event happened''.
That's what sleep and wakeup do.
<p>Second try:
<pre>
void*
pcqread(struct pcq *q)
{
  void *p;

  if(q-&gt;ptr == 0)
    sleep(q);
  p = q-&gt;ptr;
  q-&gt;ptr = 0;
  wakeup(q);  /* wake pcqwrite */
  return p;
}

void
pcqwrite(struct pcq *q, void *p)
{
  if(q-&gt;ptr != 0)
    sleep(q);
  q-&gt;ptr = p;
  wakeup(q);  /* wake pcqread */
}
</pre>
That's better, but there is still a problem.
What if the wakeup happens between the check in the if
and the call to sleep?
<p>Add locks:
<pre>
struct pcq {
  void *ptr;
  struct spinlock lock;
};

void*
pcqread(struct pcq *q)
{
  void *p;

  acquire(&amp;q-&gt;lock);
  if(q-&gt;ptr == 0)
    sleep(q, &amp;q-&gt;lock);
  p = q-&gt;ptr;
  q-&gt;ptr = 0;
  wakeup(q);  /* wake pcqwrite */
  release(&amp;q-&gt;lock);
  return p;
}

void
pcqwrite(struct pcq *q, void *p)
{
  acquire(&amp;q-&gt;lock);
  if(q-&gt;ptr != 0)
    sleep(q, &amp;q-&gt;lock);
  q-&gt;ptr = p;
  wakeup(q);  /* wake pcqread */
  release(&amp;q-&gt;lock);
}
</pre>
This is okay, and now safer for multiple readers and writers,
except that wakeup wakes up everyone who is asleep on chan,
not just one guy.
So some of the guys who wake up from sleep might not
be cleared to read or write from the queue. Have to go back to looping:
<pre>
struct pcq {
  void *ptr;
  struct spinlock lock;
};

void*
pcqread(struct pcq *q)
{
  void *p;

  acquire(&amp;q-&gt;lock);
  while(q-&gt;ptr == 0)
    sleep(q, &amp;q-&gt;lock);
  p = q-&gt;ptr;
  q-&gt;ptr = 0;
  wakeup(q);  /* wake pcqwrite */
  release(&amp;q-&gt;lock);
  return p;
}

void
pcqwrite(struct pcq *q, void *p)
{
  acquire(&amp;q-&gt;lock);
  while(q-&gt;ptr != 0)
    sleep(q, &amp;q-&gt;lock);
  q-&gt;ptr = p;
  wakeup(q);  /* wake pcqread */
  release(&amp;q-&gt;lock);
}
</pre>
The difference between this and our original version is that
the body of the while loop is a much more efficient way to pause.
<p>Now we've figured out how to use it, but we
still need to figure out how to implement it.
<h3>Sleep and wakeup - implementation</h3>
<p>
Simple implementation:
<pre>
void
sleep(void *chan, struct spinlock *lk)
{
  struct proc *p = curproc[cpu()];

  release(lk);
  p-&gt;chan = chan;
  p-&gt;state = SLEEPING;
  sched();
}

void
wakeup(void *chan)
{
  for(each proc p) {
    if(p-&gt;state == SLEEPING &amp;&amp; p-&gt;chan == chan)
      p-&gt;state = RUNNABLE;
  }
}
</pre>
<p>What's wrong? What if the wakeup runs right after
the release(lk) in sleep?
It still misses the sleep.
<p>Move the lock down:
<pre>
void
sleep(void *chan, struct spinlock *lk)
{
  struct proc *p = curproc[cpu()];

  p-&gt;chan = chan;
  p-&gt;state = SLEEPING;
  release(lk);
  sched();
}

void
wakeup(void *chan)
{
  for(each proc p) {
    if(p-&gt;state == SLEEPING &amp;&amp; p-&gt;chan == chan)
      p-&gt;state = RUNNABLE;
  }
}
</pre>
<p>This almost works. Recall from last lecture that we also need
to acquire the proc_table_lock before calling sched, to
protect p-&gt;jmpbuf.
<pre>
void
sleep(void *chan, struct spinlock *lk)
{
  struct proc *p = curproc[cpu()];

  p-&gt;chan = chan;
  p-&gt;state = SLEEPING;
  acquire(&amp;proc_table_lock);
  release(lk);
  sched();
}
</pre>
<p>The problem is that now we're using lk to protect
access to the p-&gt;chan and p-&gt;state variables
but other routines besides sleep and wakeup
(in particular, proc_kill) will need to use them and won't
know which lock protects them.
So instead of protecting them with lk, let's use proc_table_lock:
<pre>
void
sleep(void *chan, struct spinlock *lk)
{
  struct proc *p = curproc[cpu()];

  acquire(&amp;proc_table_lock);
  release(lk);
  p-&gt;chan = chan;
  p-&gt;state = SLEEPING;
  sched();
}

void
wakeup(void *chan)
{
  acquire(&amp;proc_table_lock);
  for(each proc p) {
    if(p-&gt;state == SLEEPING &amp;&amp; p-&gt;chan == chan)
      p-&gt;state = RUNNABLE;
  }
  release(&amp;proc_table_lock);
}
</pre>
<p>One could probably make things work with lk as above,
but the relationship between data and locks would be
more complicated with no real benefit. Xv6 takes the easy way out
and says that elements in the proc structure are always protected
by proc_table_lock.
<h3>Use example: exit and wait</h3>
<p>If proc_wait decides there are children to be waited for,
it calls sleep at line 2462.
When a process exits, proc_exit scans the process table
to find the parent and wakes it at 2408.
<p>Which lock protects sleep and wakeup from missing each other?
Proc_table_lock. Have to tweak sleep again to avoid double-acquire:
<pre>
  if(lk != &amp;proc_table_lock) {
    acquire(&amp;proc_table_lock);
    release(lk);
  }
</pre>
<h3>New feature: kill</h3>
<p>Proc_kill marks a process as killed (line 2371).
When the process finally exits the kernel to user space,
or if a clock interrupt happens while it is in user space,
it will be destroyed (line 2886, 2890, 2912).
<p>Why wait until the process ends up in user space?
<p>What if the process is stuck in sleep? It might take a long
time to get back to user space.
Don't want to have to wait for it, so make sleep wake up early
(line 2373).
<p>This means all callers of sleep should check
whether they have been killed, but none do.
Bug in xv6.
<h3>System call handlers</h3>
<p>Sheet 32
<p>Fork: discussed copyproc in earlier lectures.
Sys_fork (line 3218) just calls copyproc
and marks the new proc runnable.
Does fork create a new process or a new thread?
Is there any shared context?
<p>Exec: we'll talk about exec later, when we talk about file systems.
<p>Sbrk: Saw growproc earlier. Why setupsegs before returning?
</body>
</html>
<html>
<head>
<title>L10</title>
</head>
<body>
<h1>File systems</h1>
<p>Required reading: iread, iwrite, and wdir, and code related to
these calls in fs.c, bio.c, ide.c, file.c, and sysfile.c
<h2>Overview</h2>
<p>The next 3 lectures are about file systems:
<ul>
<li>Basic file system implementation
<li>Naming
<li>Performance
</ul>
<p>Users want to store their data durably, so that it survives when
the user turns off the computer. The primary media for doing so are:
magnetic disks, flash memory, and tapes. We focus on magnetic disks
(e.g., through the IDE interface in xv6).
<p>To allow users to remember where they stored a file, they can
assign a symbolic name to a file, which appears in a directory.
<p>The data in a file can be organized in a structured way or not.
The structured variant is often called a database. UNIX uses the
unstructured variant: files are streams of bytes. Any particular
structure is likely to be useful to only a small class of
applications, and other applications will have to work hard to fit
their data into one of the pre-defined structures. Besides, if you
want structure, you can easily write a user-mode library program that
imposes that format on any file. The end-to-end argument in action.
(Databases have special requirements and support an important class of
applications, and thus have a specialized plan.)
<p>The API for a minimal file system consists of: open, read, write,
seek, close, and stat. Dup duplicates a file descriptor. For example:
<pre>
fd = open("x", O_RDWR);
read(fd, buf, 100);
write(fd, buf, 512);
close(fd);
</pre>
<p>Maintaining the file offset behind the read/write interface is an
interesting design decision. The alternative is that the state of a
read operation should be maintained by the process doing the reading
(i.e., that the pointer should be passed as an argument to read).
This argument is compelling in view of the UNIX fork() semantics,
which clone a process that shares the file descriptors of its
parent: a read by the parent on a shared file descriptor (e.g.,
stdin) changes the read pointer seen by the child. On the other
hand, the alternative would make it difficult to get "(date; ls) > x"
right.
<p>The Unix API doesn't specify that the effects of a write are on
disk before the write returns. That is left to the implementation
of the file system, within certain bounds. Choices (not mutually
exclusive) include:
<ul>
<li>At some point in the future, if the system stays up (e.g., after
30 seconds);
<li>Before the write returns;
<li>Before close returns;
<li>User specified (e.g., before fsync returns).
</ul>
<p>A design issue is the semantics of a file system operation that
requires multiple disk writes. In particular, what happens if the
logical update requires writing multiple disk blocks and the power
fails during the update? For example, creating a new file
requires allocating an inode (which requires updating the list of
free inodes on disk) and writing a directory entry to record the
allocated i-node under the name of the new file (which may require
allocating a new block and updating the directory inode). If the
power fails during the operation, the list of free inodes and blocks
may be inconsistent with the blocks and inodes in use. Again, it is
up to the implementation of the file system to keep the on-disk data
structures consistent:
<ul>
<li>Don't worry about it much, but use a recovery program to bring
file system back into a consistent state.
<li>Journaling file system. Never let the file system get into an
inconsistent state.
</ul>
<p>Another design issue is the semantics of concurrent writes to
the same data item. What is the order of two updates that happen at
the same time? For example, two processes open the same file and write
to it. Modern Unix operating systems allow the application to lock a
file to get exclusive access. If file locking is not used and if the
file descriptor is shared, then the bytes of the two writes will get
into the file in some order (this happens often for log files). If
the file descriptor is not shared, the end result is not defined. For
example, one write may overwrite the other one (e.g., if they are
writing to the same part of the file.)
<p>An implementation issue is performance, because writing to magnetic
disk is relatively expensive compared to computing. Three primary ways
to improve performance are: careful file system layout that induces
few seeks, an in-memory cache of frequently-accessed blocks, and
overlapping I/O with computation so that file operations don't have to
wait for completion and so that the disk driver has more
data to write, which allows disk scheduling. (We will talk about
performance in detail later.)
<h2>xv6 code examples</h2>
<p>xv6 implements a minimal Unix file system interface. xv6 doesn't
pay attention to file system layout. It overlaps computation and I/O,
but doesn't do any disk scheduling. Its cache is write-through, which
simplifies keeping on-disk data structures consistent, but is bad for
performance.
<p>On disk files are represented by an inode (struct dinode in fs.h),
and blocks. Small files have up to 12 block addresses in their inode;
large files use the last address in the inode as a disk address
of a block with 128 further disk addresses (512/4). The size of a file is
thus limited to 12 * 512 + 128*512 bytes. What would you change to
support larger files? (Ans: e.g., double indirect blocks.)
<p>Directories are files with a bit of structure to them. The file
consists of records of type struct dirent. Each entry contains the
name for a file (or directory) and its corresponding inode number.
How many files can appear in a directory?
<p>In memory files are represented by struct inode in fsvar.h. What is
the role of the additional fields in struct inode?
<p>What is xv6's disk layout? How does xv6 keep track of free blocks
and inodes? See balloc()/bfree() and ialloc()/ifree(). Is this
layout a good one for performance? What are other options?
<p>Let's assume that an application created a file x that
contains 512 bytes, and that the application now calls read(fd, buf,
100); that is, it is requesting to read 100 bytes into buf.
Furthermore, let's assume that the inode for x is i. Let's trace
what happens by investigating readi(), line 4483.
<ul>
<li>4488-4492: can iread be called on other objects than files? (Yes.
For example, read from the keyboard.) Everything is a file in Unix.
<li>4495: what does bmap do?
<ul>
<li>4384: what block is being read?
</ul>
<li>4483: what does bread do? does bread always cause a read to disk?
<ul>
<li>4006: what does bget do? it implements a simple cache of
recently-read disk blocks.
<ul>
<li>How big is the cache? (see param.h)
<li>3972: look if the requested block is in the cache by walking down
a circular list.
<li>3977: we had a match.
<li>3979: some other process has "locked" the block; wait until it
releases it. The other process releases the block using brelse().
Why lock a block?
<ul>
<li>Atomic read and update. For example, allocating an inode: read
block containing inode, mark it allocated, and write it back. This
operation must be atomic.
</ul>
<li>3982: it is ours now.
<li>3987: it is not in the cache; we need to find a cache entry to
hold the block.
<li>3987: what is the cache replacement strategy? (see also brelse())
<li>3988: found an entry that we are going to use.
<li>3989: mark it ours but don't mark it valid (there is no valid data
in the entry yet).
</ul>
<li>4007: if the block was in the cache and the entry has the block's
data, return.
<li>4010: if the block wasn't in the cache, read it from disk. are
read's synchronous or asynchronous?
<ul>
<li>3836: a bounded buffer of outstanding disk requests.
<li>3809: tell the disk to move arm and generate an interrupt.
<li>3851: go to sleep and run some other process to run. time sharing
in action.
<li>3792: interrupt: arm is in the right position; wakeup requester.
<li>3856: read block from disk.
<li>3860: remove request from bounded buffer. wakeup processes that
are waiting for a slot.
<li>3864: start next disk request, if any. xv6 can overlap I/O with
computation.
</ul>
<li>4011: mark the cache entry as holding the data.
</ul>
<li>4498: To where is the block copied? is dst a valid user address?
</ul>
<p>Now let's suppose that the process is writing 512 bytes at the end
of the file. How many disk writes will happen?
<ul>
<li>4567: allocate a new block
<ul>
<li>4518: allocate a block: scan block map, and write entry
<li>4523: How many disk operations if the process would have been appending
to a large file? (Answer: read indirect block, scan block map, write
block map.)
</ul>
<li>4572: read the block that the process will be writing, in case the
process writes only part of the block.
<li>4574: write it. is it synchronous or asynchronous? (Ans:
synchronous but with timesharing.)
</ul>
<p>Lots of code to implement reading and writing of files. How about
directories?
<ul>
<li>4722: look through the directory, reading directory blocks to see
if a directory entry is unused (inum == 0).
<li>4729: use it and update it.
<li>4735: write the modified block.
</ul>
<p>Reading and writing of directories is trivial.
</body>
</html>
<html>
<head><title>Lecture 6: Interrupts &amp; Exceptions</title></head>
<body>
<h1>Interrupts &amp; Exceptions</h1>
<p>
Required reading: xv6 <code>trapasm.S</code>, <code>trap.c</code>, <code>syscall.c</code>, <code>usys.S</code>.
<br>
You will need to consult
<a href="../readings/ia32/IA32-3.pdf">IA32 System
Programming Guide</a> chapter 5 (skip 5.7.1, 5.8.2, 5.12.2).
<h2>Overview</h2>
<p>
Big picture: kernel is trusted third-party that runs the machine.
Only the kernel can execute privileged instructions (e.g.,
changing MMU state).
The processor enforces this protection through the ring bits
in the code segment.
If a user application needs to carry out a privileged operation
or other kernel-only service,
it must ask the kernel nicely.
How can a user program change to the kernel address space?
How can the kernel transfer to a user address space?
What happens when a device attached to the computer
needs attention?
These are the topics for today's lecture.
<p>
There are three kinds of events that must be handled
by the kernel, not user programs:
(1) a system call invoked by a user program,
(2) an illegal instruction or other kind of bad processor state (memory fault, etc.),
and
(3) an interrupt from a hardware device.
<p>
Although these three events are different, they all use the same
mechanism to transfer control to the kernel.
This mechanism consists of three steps that execute as one atomic unit.
(a) change the processor to kernel mode;
(b) save the old processor state somewhere (usually the kernel stack);
and (c) change the processor state to the values set up as
the &ldquo;official kernel entry values.&rdquo;
The exact implementation of this mechanism differs
from processor to processor, but the idea is the same.
<p>
We'll work through examples of these today in lecture.
You'll see all three in great detail in the labs as well.
<p>
A note on terminology: sometimes we'll
use interrupt (or trap) to mean both interrupts and exceptions.
<h2>
Setting up traps on the x86
</h2>
<p>
See handout Table 5-1, Figure 5-1, Figure 5-2.
<p>
xv6 Sheet 07: <code>struct gatedesc</code> and <code>SETGATE</code>.
<p>
xv6 Sheet 28: <code>tvinit</code> and <code>idtinit</code>.
Note setting of gate for <code>T_SYSCALL</code>
<p>
xv6 Sheet 29: <code>vectors.pl</code> (also see generated <code>vectors.S</code>).
<h2>
System calls
</h2>
<p>
xv6 Sheet 16: <code>init.c</code> calls <code>open("console")</code>.
How is that implemented?
<p>
xv6 <code>usys.S</code> (not in book).
(No saving of registers. Why?)
<p>
Breakpoint <code>0x1b:"open"</code>,
step past <code>int</code> instruction into kernel.
<p>
See handout Figure 9-4 [sic].
<p>
xv6 Sheet 28: in <code>vectors.S</code> briefly, then in <code>alltraps</code>.
Step through to <code>call trap</code>, examine registers and stack.
How will the kernel find the argument to <code>open</code>?
<p>
xv6 Sheet 29: <code>trap</code>, on to <code>syscall</code>.
<p>
xv6 Sheet 31: <code>syscall</code> looks at <code>eax</code>,
calls <code>sys_open</code>.
<p>
(Briefly)
xv6 Sheet 52: <code>sys_open</code> uses <code>argstr</code> and <code>argint</code>
to get its arguments. How do they work?
<p>
xv6 Sheet 30: <code>fetchint</code>, <code>fetcharg</code>, <code>argint</code>,
<code>argptr</code>, <code>argstr</code>.
<p>
What happens if a user program divides by zero
or accesses unmapped memory?
Exception. Same path as system call until <code>trap</code>.
<p>
What happens if kernel divides by zero or accesses unmapped memory?
<h2>
Interrupts
</h2>
<p>
Like system calls, except:
devices generate them at any time,
there are no arguments in CPU registers,
nothing to return to,
usually can't ignore them.
<p>
How do they get generated?
Device essentially phones up the
interrupt controller and asks to talk to the CPU.
Interrupt controller then buzzes the CPU and
tells it, &ldquo;keyboard on line 1.&rdquo;
Interrupt controller is essentially the CPU's
<strike>secretary</strike> administrative assistant,
managing the phone lines on the CPU's behalf.
<p>
Have to set up interrupt controller.
<p>
(Briefly) xv6 Sheet 63: <code>pic_init</code> sets up the interrupt controller,
<code>irq_enable</code> tells the interrupt controller to let the given
interrupt through.
<p>
(Briefly) xv6 Sheet 68: <code>pit8253_init</code> sets up the clock chip,
telling it to interrupt on <code>IRQ_TIMER</code> 100 times/second.
<code>console_init</code> sets up the keyboard, enabling <code>IRQ_KBD</code>.
<p>
In Bochs, set breakpoint at 0x8:"vector0"
and continue, loading kernel.
Step through clock interrupt, look at
stack, registers.
<p>
Was the processor executing in kernel or user mode
at the time of the clock interrupt?
Why? (Have any user-space instructions executed at all?)
<p>
Can the kernel get an interrupt at any time?
Why or why not? <code>cli</code> and <code>sti</code>,
<code>irq_enable</code>.
</body>
</html>
<html>
<head>
<title>L11</title>
</head>
<body>
<h1>Naming in file systems</h1>
<p>Required reading: namei(), and all other file system code.
<h2>Overview</h2>
<p>To help users to remember where they stored their data, most
systems allow users to assign their own names to their data.
Typically the data is organized in files and users assign names to
files. To deal with many files, users can organize their files in
directories, in a hierarchical manner. Each name is a pathname, with
the components separated by "/".
<p>To avoid users having to type long absolute names (i.e., names
starting with "/" in Unix), users can change their working directory
and use relative names (i.e., names that don't start with "/").
<p>User file namespace operations include create, mkdir, mv, ln
(link), unlink, and chdir. (How is "mv a b" implemented in xv6?
Answer: "link a b"; "unlink a".) To be able to name the current
directory and the parent directory, every directory includes two
entries "." and "..". Files and directories can be reclaimed if users
cannot name them anymore (i.e., after the last unlink).
<p>Recall from last lecture that all directory entries contain a name,
followed by an inode number. The inode number names an inode of the
file system. How can we merge file systems from different disks into
a single name space?
<p>A user grafts new file systems on a name space using mount. Umount
removes a file system from the name space. (In DOS, a file system is
named by its device letter.) Mount takes the root inode of the
to-be-mounted file system and grafts it on the inode of the name space
entry where the file system is mounted (e.g., /mnt/disk1). The
in-memory inode of /mnt/disk1 records the major and minor number of
the file system mounted on it. When namei sees an inode on which a
file system is mounted, it looks up the root inode of the mounted file
system, and proceeds with that inode.
<p>Mount is not a durable operation; it doesn't survive power failures.
After a power failure, the system administrator must remount the file
system (i.e., often in a startup script that is run from init).
<p>Links are convenient, because with them users can create synonyms for
file names. But they create the potential for cycles in
the naming tree. For example, consider link("a/b/c", "a"). This
makes c a synonym for a. This cycle can complicate matters; for
example:
<ul>
<li>If a user subsequently calls unlink ("a"), then the user cannot
name the directory "b" and the link "c" anymore, but how can the
file system decide that?
</ul>
<p>This problem can be solved by detecting cycles, or by
computing which files are reachable from "/" and
reclaiming all the ones that aren't. Unix takes a simpler
approach: avoid cycles by disallowing users to create links for
directories. If there are no cycles, then reference counts can be
used to see if a file is still referenced. In the inode maintain a
field for counting references (nlink in xv6's dinode). link
increases the reference count, and unlink decreases the count; if
the count reaches zero the inode and disk blocks can be reclaimed.
<p>How to handle symbolic links across file systems (i.e., from one
mounted file system to another)? Since inodes are not unique across
file systems, we cannot create a link across file systems; the
directory entry only contains an inode number, not the inode number
and the name of the disk on which the inode is located. To handle
this case, Unix provides a second type of link, called a soft link.
<p>Soft links are a special file type (e.g., T_SYMLINK). If namei
encounters an inode of type T_SYMLINK, it resolves the name in
the symlink file to an inode, and continues from there. With
symlinks one can create cycles and they can point to non-existing
files.
<p>The design of the name system can have security implications. For
example, if you test whether a name exists and then use the name, an
adversary can change the binding from name to object between the
test and the use. Such problems are called TOCTTOU (time of check to
time of use) problems.
<p>An example of TOCTTOU is as follows. Let's say root runs a script
every night to remove files in /tmp. This gets rid of the files
that editors might have left behind but that will never be used again.
An adversary can exploit this script as follows:
<pre>
Root                              Attacker
                                  mkdir("/tmp/etc")
                                  creat("/tmp/etc/passwd")
readdir("/tmp")
lstat("/tmp/etc")
readdir("/tmp/etc")
                                  rename("/tmp/etc", "/tmp/x")
                                  symlink("/etc", "/tmp/etc")
unlink("/tmp/etc/passwd")
</pre>
Lstat checks that /tmp/etc is not a symbolic link, but by the time root
runs unlink, the attacker has had time to put a symbolic link in the
place of /tmp/etc, pointing at a password file of the adversary's choice.
<p>This problem could have been avoided if every user or process group
had its own private /tmp, or if access to the shared one was
mediated.
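<p>One standard mitigation along these lines (a suggestion of this sketch,
not something the notes prescribe) is to avoid predictable names in shared
directories entirely: mkstemp(3) picks a fresh name and opens it with
O_EXCL in one step, so there is no window between the existence check
and the open:

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

// Create a uniquely named temporary file atomically. mkstemp() both
// chooses an unused name and opens it exclusively, so an attacker
// cannot pre-plant a symlink under that name.
int make_private_tmp(char *path_template) {
    int fd = mkstemp(path_template);  // template must end in "XXXXXX"
    if (fd >= 0)
        unlink(path_template);  // drop the name; the fd keeps the file alive
    return fd;
}
```

After the unlink, the file is reachable only through the descriptor,
so no name-based race against it remains.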
<h2>V6 code examples</h2>
<p> namei (sheet 46) is the core of the Unix naming system. namei can
be called in several ways: NAMEI_LOOKUP (resolve a name to an inode
and lock the inode), NAMEI_CREATE (resolve a name, but lock the parent
inode), and NAMEI_DELETE (resolve a name, lock the parent inode, and
return the offset in the directory). The reason that namei is
complicated is that we want to atomically test whether a name exists
and remove/create it if it does; otherwise, two concurrent processes
could interfere with each other and the directory could end up in an
inconsistent state.
<p>Let's trace open("a", O_RDWR), focusing on namei:
<ul>
<li>5263: we will look at creating a file in a bit.
<li>5277: call namei with NAMEI_LOOKUP
<li>4629: if the path name starts with "/", look up the root inode (1).
<li>4632: otherwise, use inode for current working directory.
<li>4638: consume a run of "/" characters, for example in "/////a////b"
<li>4641: if we are done with NAMEI_LOOKUP, return inode (e.g.,
namei("/")).
<li>4652: if the inode in which we are searching for the name isn't a
directory, give up.
<li>4657-4661: determine length of the current component of the
pathname we are resolving.
<li>4663-4681: scan the directory for the component.
<li>4682-4696: the entry wasn't found. If we are at the end of the
pathname and NAMEI_CREATE is set, lock the parent directory and return a
pointer to the start of the component. In all other cases, unlock the
inode of the directory, and return 0.
<li>4701: if NAMEI_DELETE is set, return locked parent inode and the
offset of the to-be-deleted component in the directory.
<li>4707: lookup inode of the component, and go to the top of the loop.
</ul>
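<p>The component-at-a-time loop above can be sketched in C. This is
modeled loosely on xv6's skipelem() helper, with simplifications (e.g.
components longer than DIRSIZ are truncated here):

```c
#include <string.h>

#define DIRSIZ 14  // maximum directory-entry name length, as in v6/xv6

// Copy the next path element into name and return the remainder,
// consuming runs of '/' on both sides -- modeled on xv6's skipelem():
//   skipelem("/////a////b", name) leaves name = "a", returns "b".
// Returns 0 when there are no more elements.
static const char *skipelem(const char *path, char *name) {
    while (*path == '/')
        path++;                 // consume a run of '/'
    if (*path == 0)
        return 0;               // nothing left, e.g. namei("/")
    const char *s = path;
    while (*path != '/' && *path != 0)
        path++;
    int len = path - s;
    if (len >= DIRSIZ)
        len = DIRSIZ - 1;       // truncate long components (simplification)
    memcpy(name, s, len);
    name[len] = 0;
    while (*path == '/')
        path++;                 // consume trailing run of '/'
    return path;
}
```

namei's loop then calls this repeatedly, looking each component up in
the current directory inode until the path is exhausted.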
<p>Now let's look at creating a file in a directory:
<ul>
<li>5264: if the last component doesn't exist, but first part of the
pathname resolved to a directory, then dp will be 0, last will point
to the beginning of the last component, and ip will be the locked
parent directory.
<li>5266: create an entry for last in the directory.
<li>4772: mknod1 allocates a new named inode and adds it to an
existing directory.
<li>4776: ialloc: scan the inode blocks, find an unused entry, and write
it. (if lucky, 1 read and 1 write.)
<li>4784: fill out the inode entry, and write it. (another write)
<li>4786: write the entry into the directory (if lucky, 1 write)
</ul>
Why must the parent directory be locked? If two processes try to
create the same name in the same directory, only one should succeed
and the other should receive an error (file exists).
<p>Link, unlink, chdir, mount, umount could have taken file
descriptors instead of their path arguments. In fact, this would get
rid of some possible race conditions (some of which have security
implications, e.g. TOCTTOU). However, this would require that the
current working directory be remembered by the process, and UNIX didn't
have good ways of maintaining static state shared among all processes
belonging to a given user. The easiest way to create shared state
is to place it in the kernel.
<p>We have one piece of code in xv6 that we haven't studied: exec.
With all the groundwork we have done, this code can be easily
understood (see sheet 54).
</body>
Security
-------------------
I. 2 Intro Examples
II. Security Overview
III. Server Security: Offense + Defense
IV. Unix Security + POLP
V. Example: OKWS
VI. How to Build a Website
I. Intro Examples
--------------------
1. Apache + OpenSSL 0.9.6a (CAN 2002-0656)
- SSL = More security!
unsigned int j;
p=(unsigned char *)s->init_buf->data;
j= *(p++);
s->session->session_id_length=j;
memcpy(s->session->session_id,p,j);
- the result: an Apache worm
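   - The missing check, sketched in C. Names like SESSION_ID_MAX and
     copy_session_id are invented for illustration; OpenSSL's actual fix
     differs in detail. The point: j comes off the wire, the destination
     buffer is fixed-size, so the copy length must be validated first.

```c
#include <string.h>

#define SESSION_ID_MAX 32  // assumed fixed size of the session_id buffer

// The attacker controls both the length byte j and the payload bytes,
// but session_id is a fixed-size buffer -- so reject oversized lengths
// before the memcpy instead of trusting j.
// Returns the number of bytes copied, or -1 if the length is rejected.
int copy_session_id(unsigned char *session_id,
                    const unsigned char *p, unsigned int j) {
    if (j > SESSION_ID_MAX)
        return -1;           // the check missing from the vulnerable code
    memcpy(session_id, p, j);
    return (int)j;
}
```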
2. SparkNotes.com 2000:
- New profile feature that displays "public" information about users
but bug that made e-mail addresses "public" by default.
- New program for getting that data:
http://www.sparknotes.com/getprofile.cgi?id=1343
II. Security Overview
----------------------
What Is Security?
- Protecting your system from attack.
What's an attack?
- Stealing data
- Corrupting data
- Controlling resources
- DOS
Why attack?
- Money
- Blackmail / extortion
- Vendetta
- intellectual curiosity
- fame
Security is a Big topic
- Server security -- today's focus. There's some machine sitting on the
Internet somewhere, with a certain interface exposed, and attackers
want to circumvent it.
- Why should you trust your software?
- Client security
- Clients are usually servers, so they have many of the same issues.
- Slight simplification: people across the network cannot typically
initiate connections.
- Has a "fallible operator":
- Spyware
- Drive-by-Downloads
- Client security turns out to be much harder -- GUI considerations,
look inside the browser and the applications.
- Systems community can more easily handle server security.
- We think mainly of servers.
III. Server Security: Offense and Defense
-----------------------------------------
- Show picture of a Web site.
Attacks | Defense
----------------------------------------------------------------------------
1. Break into DB from net | 1. FW it off
2. Break into WS on telnet | 2. FW it off
3. Buffer overrun in Apache | 3. Patch apache / use better lang?
4. Buffer overrun in our code | 4. Use better lang / isolate it
5. SQL injection | 5. Better escaping / don't interpret code.
6. Data scraping. | 6. Use a sparse UID space.
7. PW sniffing | 7. ???
8. Fetch /etc/passwd and crack | 8. Don't expose /etc/passwd
PW |
9. Root escalation from apache | 9. No setuid programs available to Apache
10. XSS |10. Filter JS and input HTML code.
11. Keystroke recorded on sys- |11. Client security
admin's desktop (planetlab) |
12. DDOS |12. ???
Summary:
 - That we want private data to be available to the right people makes
   this problem hard in the first place. Internet servers are there
   for a reason.
- Security != "just encrypt your data;" this in fact can sometimes
make the problem worse.
- Best to prevent break-ins from happening in the first place.
- If they do happen, want to limit their damage (POLP).
- Security policies are difficult to express / package up neatly.
IV. Design According to POLP (in Unix)
---------------------------------------
- Assume any piece of a system can be compromised, by either bad
programming or malicious attack.
- Try to limit the damage done by such a compromise (along the lines
of the 4 attack goals).
<Draw a picture of a server process on Unix, w/ other processes>
What's the goal on Unix?
 - Keep processes that don't have to communicate from communicating:
- limit FS, IPC, signals, ptrace
- Strip away unneeded privilege
- with respect to network, FS.
- Strip away FS access.
How on Unix?
- setuid/setgid
- system call interposition
- chroot (away from setuid executables, /etc/passwd, /etc/ssh/..)
<show Code snippet>
How do you write chroot'ed programs?
- What about shared libraries?
- /etc/resolv.conf?
- Can chroot'ed programs access the FS at all? What if they need
to write to the FS or read from the FS?
- Fd's are *capabilities*; can pass them to chroot'ed services,
   thereby opening new files on their behalf.
- Unforgeable - can only get them from the kernel via open/socket, etc.
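 - A minimal sketch of descriptor passing with the standard SCM_RIGHTS
   mechanism (helper names are made up; error handling is minimal). The
   kernel installs a fresh descriptor for the same open file in the
   receiver, which is exactly the capability hand-off described above:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

// Send one file descriptor over a Unix-domain socket. One dummy data
// byte is required; the fd rides in the SCM_RIGHTS control message.
int send_fd(int sock, int fd) {
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    char ctrl[CMSG_SPACE(sizeof(int))];
    memset(ctrl, 0, sizeof ctrl);
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = ctrl, .msg_controllen = sizeof ctrl };
    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type = SCM_RIGHTS;
    cm->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cm), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

// Receive a file descriptor sent by send_fd; returns -1 on error.
int recv_fd(int sock) {
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    char ctrl[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = ctrl, .msg_controllen = sizeof ctrl };
    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    if (cm == NULL)
        return -1;
    int fd;
    memcpy(&fd, CMSG_DATA(cm), sizeof(int));
    return fd;
}
```

   A chroot'ed service that receives such an fd can use the file even
   though it could never have opened it by name.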
Unix Shortcomings (round 1)
- It's bad to run as root!
- Yet, need root for:
- chroot
- setuid/setgid to a lower-privileged user
- create a new user ID
- Still no guarantee that we've cut off all channels
- 200 syscalls!
- Default is to give most/all privileges.
- Can "break out" of chroot jails?
- Can still exploit race conditions in the kernel to escalate privileges.
Sidebar
- setuid / setuid misunderstanding
- root / root misunderstanding
- effective vs. real vs. saved set-user-ID
V. OKWS
-------
- Taking these principles as far as possible.
- C.f. Figure 1 From the paper..
- Discussion of which privileges are in which processes
<Table of how to hack, what you get, etc...>
- Technical details: how to launch a new service
- Within the launcher (running as root):
<on board:>
// receive FDs from logger, pubd, demux
fork ();
chroot ("/var/okws/run");
chdir ("/coredumps/51001");
setgid (51001);
setuid (51001);
exec ("login", fds ... );
- Note no chroot -- why not?
- Once launched, how does a service get new connections?
- Note the goal - minimum tampering with each other in the
case of a compromise.
Shortcoming of Unix (2)
- A lot of plumbing involved with this system. FDs flying everywhere.
- Isolation still not fine enough. If a service gets taken over,
can compromise all users of that service.
VI. Reflections on Building Websites
---------------------------------
- OKWS interesting "experiment"
- Need for speed; also, good gzip support.
- If you need compiled code, it's a good way to go.
- RPC-like system a must for backend communication
- Connection-pooling for free
Biggest difficulties:
- Finding good C++ programmers.
- Compile times.
- The DB is still always the problem.
Hard to Find good Alternatives
- Python / Perl - you might spend a lot of time writing C code /
integrating with lower level languages.
- Have to worry about DB pooling.
 - Java -- most viable, and is getting better. Scary that you can't
   peer inside.
- .Net / C#-based system might be the way to go.
=======================================================================
Extra Material:
Capabilities (From the Eros Paper in SOSP 1999)
- "Unforgeable pair made up of an object ID and a set of authorized
operations (an interface) on that object."
- c.f. Dennis and van Horn. "Programming semantics for multiprogrammed
computations," Communications of the ACM 9(3):143-154, Mar 1966.
- Thus:
<object ID, set of authorized OPs on that object>
- Examples:
"Process X can write to file at inode Y"
"Process P can read from file at inode Z"
- Familiar example: Unix file descriptors
- Why are they secure?
- Capabilities are "unforgeable"
- Processes can get them only through authorized interfaces
- Capabilities are only given to processes authorized to hold them
- How do you get them?
- From the kernel (e.g., open)
- From other applications (e.g., FD passing)
- How do you use them?
- read (fd), write(fd).
- How do you revoke them once granted?
- In Unix, you do not.
- In some systems, a central authority ("reference monitor") can revoke.
- How do you store them persistently?
- Can have circular dependencies (unlike an FS).
- What happens when the system starts up?
- Revert to checkpointed state.
- Often capability systems chose a single-level store.
- Capability systems, a historical perspective:
  - KeyKOS, Eros, Coyotos (university research)
- Never saw any applications
- IBM Systems (System 38, later AS/400, later 'i Series')
- Commercially viable
- Problems:
- All bets are off when a capability is sent to the wrong place.
- Firewall analogy?
<html>
<head>
<title>Plan 9</title>
</head>
<body>
<h1>Plan 9</h1>
<p>Required reading: Plan 9 from Bell Labs</p>
<h2>Background</h2>
<p>Had moved away from the ``one computing system'' model of
Multics and Unix.</p>
<p>Many computers (`workstations'), self-maintained, not a coherent whole.</p>
<p>Pike and Thompson had been batting around ideas about a system glued together
by a single protocol as early as 1984.
Various small experiments involving individual pieces (file server, OS, computer)
tried throughout 1980s.</p>
<p>Ordered the hardware for the ``real thing'' at the beginning of 1989,
built up WORM file server, kernel, throughout that year.</p>
<p>Some time in early fall 1989, Pike and Thompson were
trying to figure out a way to fit the window system in.
On the way home from dinner, both independently realized that
they needed to be able to mount a user-space file descriptor,
not just a network address.</p>
<p>Around Thanksgiving 1989, spent a few days rethinking the whole
thing, added bind, new mount, flush, and spent a weekend
making everything work again. The protocol at that point was
essentially identical to the 9P in the paper.</p>
<p>In May 1990, tried to use system as self-hosting.
File server kept breaking, had to keep rewriting window system.
Dozen or so users by then, mostly using terminal windows to
connect to Unix.</p>
<p>Paper written and submitted to UKUUG in July 1990.</p>
<p>Because it was an entirely new system, could take the
time to fix problems as they arose, <i>in the right place</i>.</p>
<h2>Design Principles</h2>
<p>Three design principles:</p>
<p>
1. Everything is a file.<br>
2. There is a standard protocol for accessing files.<br>
3. Private, malleable name spaces (bind, mount).
</p>
<h3>Everything is a file.</h3>
<p>Everything is a file (more everything than Unix: networks, graphics).</p>
<pre>
% ls -l /net
% lp /dev/screen
% cat /mnt/wsys/1/text
</pre>
<h3>Standard protocol for accessing files</h3>
<p>9P is the only protocol the kernel knows: other protocols
(NFS, disk file systems, etc.) are provided by user-level translators.</p>
<p>Only one protocol, so easy to write filters and other
converters. <i>Iostats</i> puts itself between the kernel
and a command.</p>
<pre>
% iostats -xvdfdf /bin/ls
</pre>
<h3>Private, malleable name spaces</h3>
<p>Each process has its own private name space that it
can customize at will.
(Full disclosure: can arrange groups of
processes to run in a shared name space. Otherwise how do
you implement <i>mount</i> and <i>bind</i>?)</p>
<p><i>Iostats</i> remounts the root of the name space
with its own filter service.</p>
<p>The window system mounts a file system that it serves
on <tt>/mnt/wsys</tt>.</p>
<p>The network is actually a kernel device (no 9P involved)
but it still serves a file interface that other programs
use to access the network.
Easy to move out to user space (or replace) if necessary:
<i>import</i> network from another machine.</p>
<h3>Implications</h3>
<p>Everything is a file + can share files =&gt; can share everything.</p>
<p>Per-process name spaces help move toward ``each process has its own
private machine.''</p>
<p>One protocol: easy to build custom filters to add functionality
(e.g., reestablishing broken network connections).
<h3>File representation for networks, graphics, etc.</h3>
<p>Unix sockets are file descriptors, but you can't use the
usual file operations on them. Also far too much detail that
the user doesn't care about.</p>
<p>In Plan 9:
<pre>dial("tcp!plan9.bell-labs.com!http");
</pre>
(Protocol-independent!)</p>
<p>Dial more or less does:<br>
write to /net/cs: tcp!plan9.bell-labs.com!http
read back: /net/tcp/clone 204.178.31.2!80
write to /net/tcp/clone: connect 204.178.31.2!80
read connection number: 4
open /net/tcp/4/data
</p>
<p>Details don't really matter. Two important points:
protocol-independent, and ordinary file operations
(open, read, write).</p>
<p>Networks can be shared just like any other files.</p>
<p>Similar story for graphics, other resources.</p>
<h2>Conventions</h2>
<p>Per-process name spaces mean that even full path names are ambiguous
(<tt>/bin/cat</tt> means different things on different machines,
or even for different users).</p>
<p><i>Convention</i> binds everything together.
On a 386, <tt>bind /386/bin /bin</tt>.
<p>In Plan 9, always know where the resource <i>should</i> be
(e.g., <tt>/net</tt>, <tt>/dev</tt>, <tt>/proc</tt>, etc.),
but not which one is there.</p>
<p>Can break conventions: on a 386, <tt>bind /alpha/bin /bin</tt>, just won't
have usable binaries in <tt>/bin</tt> anymore.</p>
<p>Object-oriented in the sense of having objects (files) that all
present the same interface and can be substituted for one another
to arrange the system in different ways.</p>
<p>Very little ``type-checking'': <tt>bind /net /proc; ps</tt>.
Great benefit (generality) but must be careful (no safety nets).</p>
<h2>Other Contributions</h2>
<h3>Portability</h3>
<p>Plan 9 still is the most portable operating system.
Not much machine-dependent code, no fancy features
tied to one machine's MMU, multiprocessor from the start (1989).</p>
<p>Many other systems are still struggling with converting to SMPs.</p>
<p>Has run on MIPS, Motorola 68000, Nextstation, Sparc, x86, PowerPC, Alpha, others.</p>
<p>All the world is not an x86.</p>
<h3>Alef</h3>
<p>New programming language: convenient, but difficult to maintain.
Retired when author (Winterbottom) stopped working on Plan 9.</p>
<p>Good ideas transferred to C library plus conventions.</p>
<p>All the world is not C.</p>
<h3>UTF-8</h3>
<p>Thompson invented UTF-8. Pike and Thompson
converted Plan 9 to use it over the first weekend of September 1992,
in time for X/Open to choose it as the Unicode standard byte format
at a meeting the next week.</p>
<p>UTF-8 is now the standard character encoding for Unicode on
all systems and interoperating between systems.</p>
<h3>Simple, easy to modify base for experiments</h3>
<p>Whole system source code is available, simple, easy to
understand and change.
There's a reason it only took a couple days to convert to UTF-8.</p>
<pre>
49343 file server kernel
181611 main kernel
78521 ipaq port (small kernel)
20027 TCP/IP stack
15365 ipaq-specific code
43129 portable code
1326778 total lines of source code
</pre>
<h3>Dump file system</h3>
<p>Snapshot idea might well have been ``in the air'' at the time.
(<tt>OldFiles</tt> in AFS appears to be independently derived,
use of WORM media was common research topic.)</p>
<h3>Generalized Fork</h3>
<p>Picked up by other systems: FreeBSD, Linux.</p>
<h3>Authentication</h3>
<p>No global super-user.
Newer, more Plan 9-like authentication described in later paper.</p>
<h3>New Compilers</h3>
<p>Much faster than gcc, simpler.</p>
<p>8s to build acme for Linux using gcc; 1s to build acme for Plan 9 using 8c (but running on Linux)</p>
<h3>IL Protocol</h3>
<p>Now retired.
For better or worse, TCP has all the installed base.
IL didn't work very well on asymmetric or high-latency links
(e.g., cable modems).</p>
<h2>Idea propagation</h2>
<p>Many ideas have propagated out to varying degrees.</p>
<p>Linux even has bind and user-level file servers now (FUSE),
but still not per-process name spaces.</p>
</body>
<title>Scalable coordination</title>
<html>
<head>
</head>
<body>
<h1>Scalable coordination</h1>
<p>Required reading: Mellor-Crummey and Scott, Algorithms for Scalable
Synchronization on Shared-Memory Multiprocessors, TOCS, Feb 1991.
<h2>Overview</h2>
<p>Shared memory machines are a bunch of CPUs sharing physical memory.
Typically each processor also maintains a cache (for performance),
which introduces the problem of keeping the caches coherent. If processor 1
writes a memory location whose value processor 2 has cached, then
processor 2's cache must be updated in some way. How?
<ul>
<li>Bus-based schemes. Any CPU can access any memory
equally (a "dance hall" architecture). Use "snoopy" protocols: each CPU's cache
listens to the memory bus. With a write-through architecture, a cache invalidates its
copy when it sees a write. Or one can have an "ownership" scheme with a write-back
cache (e.g., Pentium caches have MESI bits---modified, exclusive,
shared, invalid). If the E bit is set, the CPU caches the line exclusively and can do
write back. But the bus places limits on scalability.
<li>More scalability w. NUMA schemes (non-uniform memory access). Each
CPU comes with fast "close" memory. Slower to access memory that is
stored with another processor. Use a directory to keep track of who is
caching what. For example, processor 0 is responsible for all memory
starting with address "000", processor 1 is responsible for all memory
starting with "001", etc.
<li>COMA - cache-only memory architecture. Each CPU has local RAM,
treated as cache. Cache lines migrate around to different nodes based
on access pattern. Data only lives in cache, no permanent memory
location. (These machines aren't too popular any more.)
</ul>
<h2>Scalable locks</h2>
<p>This paper is about cost and scalability of locking; what if you
have 10 CPUs waiting for the same lock? For example, what would
happen if xv6 runs on an SMP with many processors?
<p>What's the cost of a simple spinning acquire/release? Algorithm 1
*without* the delays, which is like xv6's implementation of acquire
and release (xv6 uses XCHG instead of test_and_set):
<pre>
each of the 10 CPUs gets the lock in turn
meanwhile, remaining CPUs in XCHG on lock
lock must be X in cache to run XCHG
otherwise all might read, then all might write
so bus is busy all the time with XCHGs!
can we avoid constant XCHGs while lock is held?
</pre>
<p>test-and-test-and-set
<pre>
only run expensive TSL if not locked
spin on ordinary load instruction, so cache line is S
acquire(l)
while(1){
while(l->locked != 0) { }
if(TSL(&l->locked) == 0)
return;
}
</pre>
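<p>The pseudocode above can be written with C11 atomics; a minimal
sketch, with atomic_exchange playing the role of TSL/XCHG:

```c
#include <stdatomic.h>

// Test-and-test-and-set: spin with ordinary loads (the lock's cache
// line stays Shared), and only attempt the expensive atomic exchange
// once the lock looks free.
typedef struct { atomic_int locked; } ttaslock;

void ttas_acquire(ttaslock *l) {
    for (;;) {
        while (atomic_load(&l->locked) != 0)
            ;  // local spin: no bus traffic while the line stays cached
        if (atomic_exchange(&l->locked, 1) == 0)  // the TSL/XCHG step
            return;  // we won the race
    }
}

void ttas_release(ttaslock *l) {
    atomic_store(&l->locked, 0);
}
```

With a single thread this degenerates to a plain spinlock; the benefit
only shows up when several CPUs are spinning on the same lock.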
<p>suppose 10 CPUs are waiting, let's count cost in total bus
transactions
<pre>
CPU1 gets lock in one cycle
sets lock's cache line to I in other CPUs
9 CPUs each use bus once in XCHG
then everyone has the line S, so they spin locally
  CPU1 releases the lock
  CPU2 gets the lock in one cycle
    8 CPUs each use bus once...
  So 10 + 9 + 8 + ... + 1 = 55 transactions, O(n^2) in # of CPUs!
Look at "test-and-test-and-set" in Figure 6
</pre>
<p> Can we have <i>n</i> CPUs acquire a lock in O(<i>n</i>) time?
<p>What is the point of the exponential backoff in Algorithm 1?
<pre>
Does it buy us O(n) time for n acquires?
Is there anything wrong with it?
may not be fair
exponential backoff may increase delay after release
</pre>
<p>What's the point of the ticket locks, Algorithm 2?
<pre>
one interlocked instruction to get my ticket number
then I spin on now_serving with ordinary load
release() just increments now_serving
</pre>
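<p>A minimal C11 sketch of the ticket lock described above (the field
names follow the paper's next_ticket/now_serving):

```c
#include <stdatomic.h>

// Ticket lock: one atomic fetch-and-increment to take a ticket, then
// spin with ordinary loads on now_serving; release just advances
// now_serving, handing the lock to the next ticket in FIFO order.
typedef struct {
    atomic_uint next_ticket;
    atomic_uint now_serving;
} ticketlock;

void ticket_acquire(ticketlock *l) {
    unsigned me = atomic_fetch_add(&l->next_ticket, 1);  // my place in line
    while (atomic_load(&l->now_serving) != me)
        ;  // ordinary load: no atomic instruction while waiting
}

void ticket_release(ticketlock *l) {
    atomic_fetch_add(&l->now_serving, 1);  // serve the next waiter
}
```

An atomic increment is used in release for simplicity; since only the
holder ever advances now_serving, a plain store of the next value would
also work.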
<p>why is that good?
<pre>
+ fair
+ no exponential backoff overshoot
  + no spinning on an atomic instruction (ordinary loads of now_serving)
</pre>
<p>but what's the cost, in bus transactions?
<pre>
while lock is held, now_serving is S in all caches
release makes it I in all caches
  then each waiter uses a bus transaction to get the new value
so still O(n^2)
</pre>
<p>What's the point of the array-based queuing locks, Algorithm 3?
<pre>
a lock has an array of "slots"
waiter allocates a slot, spins on that slot
release wakes up just next slot
so O(n) bus transactions to get through n waiters: good!
anderson lines in Figure 4 and 6 are flat-ish
they only go up because lock data structures protected by simpler lock
but O(n) space *per lock*!
</pre>
<p>Algorithm 5 (MCS), the new algorithm of the paper, uses
compare_and_swap:
<pre>
int compare_and_swap(addr, v1, v2) {
int ret = 0;
// stop all memory activity and ignore interrupts
if (*addr == v1) {
*addr = v2;
ret = 1;
}
// resume other memory activity and take interrupts
return ret;
}
</pre>
<p>What's the point of the MCS lock, Algorithm 5?
<pre>
constant space per lock, rather than O(n)
one "qnode" per thread, used for whatever lock it's waiting for
lock holder's qnode points to start of list
lock variable points to end of list
acquire adds your qnode to end of list
then you spin on your own qnode
release wakes up next qnode
</pre>
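<p>A minimal C11 sketch of the MCS lock described above. Only the
uncontended path is exercised below; under contention a waiter's qnode
is linked behind its predecessor's and spun on locally:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

// MCS lock: constant space per lock; each waiter spins only on its own
// qnode, so a release invalidates one cache line, not N of them.
typedef struct qnode {
    _Atomic(struct qnode *) next;
    atomic_bool locked;
} qnode;

typedef struct { _Atomic(qnode *) tail; } mcslock;

void mcs_acquire(mcslock *l, qnode *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    qnode *prev = atomic_exchange(&l->tail, me);  // join the end of the queue
    if (prev != NULL) {
        atomic_store(&prev->next, me);            // link myself behind prev
        while (atomic_load(&me->locked))
            ;                                     // spin on my own qnode only
    }
}

void mcs_release(mcslock *l, qnode *me) {
    qnode *succ = atomic_load(&me->next);
    if (succ == NULL) {
        qnode *expect = me;
        // No visible successor: if the tail is still me, the queue is empty.
        if (atomic_compare_exchange_strong(&l->tail, &expect, NULL))
            return;
        // A waiter swapped in but hasn't linked itself yet; wait for it.
        while ((succ = atomic_load(&me->next)) == NULL)
            ;
    }
    atomic_store(&succ->locked, false);           // hand the lock over
}
```

Each thread supplies its own qnode (typically on its stack), which is
what makes the per-lock space constant.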
<h2>Wait-free or non-blocking data structures</h2>
<p>The previous implementations all block threads when there is
contention for a lock. Other atomic hardware operations allow one
to build wait-free data structures. For example, one
can implement an insert of an element into a shared list that doesn't
block a thread. Such versions are called wait free.
<p>A linked list with locks is as follows:
<pre>
Lock list_lock;
insert(int x) {
element *n = new Element;
n->x = x;
acquire(&list_lock);
n->next = list;
list = n;
release(&list_lock);
}
</pre>
<p>A wait-free implementation is as follows:
<pre>
insert (int x) {
element *n = new Element;
n->x = x;
do {
n->next = list;
} while (compare_and_swap (&list, n->next, n) == 0);
}
</pre>
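<p>A runnable C11 version of this insert, using
atomic_compare_exchange_weak in place of the pseudocode's
compare_and_swap (on failure it reloads the observed head into old,
so the retry loop always links against the current head):

```c
#include <stdatomic.h>
#include <stdlib.h>

// Lock-free push onto a singly linked list: retry the CAS until no
// other thread changed the head between reading it and swinging it to n.
typedef struct element { int x; struct element *next; } element;

_Atomic(element *) list = NULL;  // shared list head

void insert(int x) {
    element *n = malloc(sizeof *n);
    n->x = x;
    element *old = atomic_load(&list);
    do {
        n->next = old;  // on CAS failure, old has been reloaded for us
    } while (!atomic_compare_exchange_weak(&list, &old, n));
}
```

No thread ever holds a lock, so a thread that is preempted mid-insert
cannot stall the others.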
<p>How many bus transactions with 10 CPUs inserting one element in the
list? Could you do better?
<p><a href="http://www.cl.cam.ac.uk/netos/papers/2007-cpwl.pdf">This
paper by Fraser and Harris</a> compares lock-based implementations
versus corresponding non-blocking implementations of a number of data
structures.
<p>It is not possible to make every operation wait-free, and there are
times when we will need an implementation of acquire and release.
Research on non-blocking data structures is active; the last word
hasn't been said on this topic yet.
</body>
<html>
<head>
<title>XFI</title>
</head>
<body>
<h1>XFI</h1>
<p>Required reading: XFI: software guards for system address spaces.
<h2>Introduction</h2>
<p>Problem: how to use untrusted code (an "extension") in a trusted
program?
<ul>
<li>Use untrusted jpeg codec in Web browser
<li>Use an untrusted driver in the kernel
</ul>
<p>What are the dangers?
<ul>
<li>No fault isolation: the extension modifies trusted code unintentionally
<li>No protection: extension causes a security hole
<ul>
<li>The extension has a buffer overrun problem
<li>The extension calls trusted program's functions that it shouldn't
<li>The extension calls a trusted program's function that it is allowed to
call, but supplies "bad" arguments
<li>The extension calls privileged hardware instructions (when extending the
kernel)
<li>The extension reads data out of the trusted program that it shouldn't.
</ul>
</ul>
<p>Possible solution approaches:
<ul>
<li>Run extension in its own address space with minimal
privileges. Rely on hardware and operating system protection
mechanism.
<li>Restrict the language in which the extension is written:
<ul>
<li>Packet filter language. The language is limited in its capabilities,
and it is easy to guarantee "safe" execution.
<li>Type-safe language. Language runtime and compiler guarantee "safe"
execution.
</ul>
<li>Software-based sandboxing.
</ul>
<h2>Software-based sandboxing</h2>
<p>Sandboxer. A compiler or binary-rewriter sandboxes all unsafe
instructions in an extension by inserting additional instructions.
For example, every indirect store is preceded by a few instructions
that compute and check the target of the store at runtime.
<p>Verifier. When the extension is loaded into the trusted program, the
verifier checks whether the extension is appropriately sandboxed (e.g.,
are all indirect stores sandboxed? does it call any privileged
instructions?). If not, the extension is rejected. If yes, the
extension is loaded and can run. As the extension runs, the
instructions that sandbox unsafe instructions check whether each unsafe
instruction is used in a safe way.
<p>The verifier must be trusted, but the sandboxer need not be. We can do
without the verifier if the trusted program can establish that the
extension has been sandboxed by a trusted sandboxer.
<p>The paper refers to this setup as an instance of proof-carrying code.
<h2>Software fault isolation</h2>
<p><a href="http://citeseer.ist.psu.edu/wahbe93efficient.html">SFI</a>
by Wahbe et al. explored how to use sandboxing for fault-isolating
extensions; that is, using sandboxing to ensure that stores and jumps
stay within a specified memory range (i.e., that they don't overwrite or
jump into addresses in the trusted program unchecked). They
implemented SFI for a RISC processor, which simplifies things since
memory can be written only by store instructions (other instructions
modify registers). In addition, they assumed that there were plenty
of registers, so that a few can be dedicated to sandboxing code.
<p>The extension is loaded into a specific range (called a segment)
within the trusted application's address space. The segment is
identified by the upper bits of the addresses in the
segment. Separate code and data segments are necessary to prevent an
extension from overwriting its own code.
<p>An unsafe instruction on the MIPS is an instruction that jumps or
stores to an address that cannot be statically verified to be within
the correct segment. Most control-transfer operations, such as
program-counter-relative branches, can be statically verified. Stores to
static variables often use an immediate addressing mode and can be
statically verified. Indirect jumps and indirect stores are unsafe.
<p>To sandbox those instructions the sandboxer could generate the
following code for each unsafe instruction:
<pre>
DR0 <- target address
R0 <- DR0 >> shift-register; // load in R0 segment id of target
CMP R0, segment-register; // compare to segment id to segment's ID
BNE fault-isolation-error // if not equal, branch to trusted error code
STORE using DR0
</pre>
In this code, DR0, shift-register, and segment-register
are <i>dedicated</i>: they cannot be used by the extension code. The
verifier must check that the extension doesn't use these registers. R0
is a scratch register, but doesn't have to be dedicated. The
dedicated registers are necessary because otherwise the extension could
load DR0 and jump to the STORE instruction directly, skipping the
check.
<p>This implementation costs 4 registers, and 4 additional instructions
for each unsafe instruction. One could do better, however:
<pre>
DR0 <- target address & and-mask-register // mask segment ID from target
DR0 <- DR0 | segment register // insert this segment's ID
STORE using DR0
</pre>
This code just sets the right segment-ID bits. It doesn't catch
illegal addresses; it just ensures that illegal addresses stay within
the segment, harming the extension but no other code. Even if the
extension jumps to the second instruction of this sandbox sequence,
nothing bad will happen (because DR0 will already contain the correct
segment ID).
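<p>The masking sequence can be modeled in C. The segment layout here
(an 8-bit segment ID in the top bits of a 32-bit address) is an
assumption for illustration:

```c
#include <stdint.h>

// Sandbox a store target by forcing its upper bits to the segment ID:
// mask off the target's segment bits, then OR in the extension's own
// segment. An out-of-segment address is not caught -- it is silently
// redirected into the extension's segment, so only the extension can
// be harmed.
#define SEG_SHIFT 24                        // segment ID = top 8 bits
#define AND_MASK  ((1u << SEG_SHIFT) - 1)   // keeps the low 24 bits

uint32_t sandbox(uint32_t target, uint32_t segment_id) {
    return (target & AND_MASK) | (segment_id << SEG_SHIFT);
}
```

Note the operation is idempotent, which is why jumping into the middle
of the two-instruction sequence is harmless.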
<p>Optimizations include:
<ul>
<li>use guard zones for <i>store value, offset(reg)</i>
<li>treat SP as dedicated register (sandbox code that initializes it)
<li>etc.
</ul>
<h2>XFI</h2>
<p>XFI extends SFI in several ways:
<ul>
<li>Handles fault isolation and protection
<li>Uses control-flow integrity (CFI) to get good performance
<li>Doesn't use dedicated registers
<li>Uses two stacks (a scoped stack and an allocation stack); only the
allocation stack can be corrupted by buffer-overrun attacks. The
scoped stack cannot be corrupted via computed memory references.
<li>Uses a binary rewriter.
<li>Works for the x86
</ul>
<p>x86 is challenging, because of its limited number of registers and its
variable-length instructions. The SFI technique won't work with the x86
instruction set. For example, if the binary contains:
<pre>
25 CD 80 00 00 # AND eax, 0x80CD
</pre>
and an adversary can arrange to jump to the second byte, then the
adversary performs a Linux system call, since INT 0x80 has the binary
representation CD 80. Thus, XFI must control the execution flow.
<p>XFI policy goals:
<ul>
<li>Memory-access constraints (like SFI)
<li>Interface restrictions (extension has fixed entry and exit points)
<li>Scoped-stack integrity (calling stack is well formed)
<li>Simplified instructions semantics (remove dangerous instructions)
<li>System-environment integrity (ensure certain machine model
invariants, such as x86 flags register cannot be modified)
<li>Control-flow integrity: execution must follow a static, expected
control-flow graph. (enter at beginning of basic blocks)
<li>Program-data integrity (certain global variables in extension
cannot be accessed via computed memory addresses)
</ul>
<p>The binary rewriter inserts guards to ensure these properties. The
verifier checks whether the appropriate guards are in place. The primary
mechanisms used are:
<ul>
<li>CFI guards on computed control-flow transfers (see figure 2)
<li>Two stacks
<li>Guards on computed memory accesses (see figure 3)
<li>Module header has a section that contains access permissions for
regions
<li>Binary rewriter, which performs intra-procedure analysis, and
generates guards, code for stack use, and verification hints
<li>Verifier checks specific conditions per basic block. Hints specify
the verification state for the entry to each basic block, and at
exit of basic block the verifier checks that the final state implies
the verification state at entry to all possible successor basic
blocks. (see figure 4)
</ul>
<p>Can XFI protect against the attack discussed in last lecture?
<pre>
unsigned int j;
p = (unsigned char *)s->init_buf->data;
j = *(p++);
s->session->session_id_length = j;
memcpy(s->session->session_id, p, j);
</pre>
Where will <i>j</i> be located?
<p>How about the following one from the paper <a href="http://research.microsoft.com/users/jpincus/beyond-stack-smashing.pdf"><i>Beyond stack smashing:
recent advances in exploiting buffer overruns</i></a>?
<pre>
void f2b(void *arg, size_t len) {
    char buf[100];
    long val = ..;
    long *ptr = ..;
    extern void (*f)();
    memcpy(buf, arg, len);   /* overrun if len > 100 */
    *ptr = val;
    f();
    ...
    return;
}
</pre>
What code can <i>(*f)()</i> call? Code that the attacker inserted?
Code in libc?
<p>How about an attack that uses <i>ptr</i> in the above code to
overwrite a method's address in a class's dispatch table with the
address of a support function?
<p>How about <a href="http://research.microsoft.com/~shuochen/papers/usenix05data_attack.pdf">data-only attacks</a>? For example, attacker
overwrites <i>pw_uid</i> in the heap with 0 before the following
code executes (when downloading /etc/passwd and then uploading it with a
modified entry).
<pre>
FILE *getdatasock( ... ) {
seteuid(0);
setsockopt ( ...);
...
seteuid(pw->pw_uid);
...
}
</pre>
<p>How much does XFI slow down applications? How many more
instructions are executed? (see Tables 1-4)
</body>
<title>L1</title>
<html>
<head>
</head>
<body>
<h1>OS overview</h1>
<h2>Overview</h2>
<ul>
<li>Goal of course:
<ul>
<li>Understand operating systems in detail by designing and
implementing a minimal OS
<li>Hands-on experience with building systems ("Applying 6.033")
</ul>
<li>What is an operating system?
<ul>
<li>a piece of software that turns the hardware into something useful
<li>layered picture: hardware, OS, applications
<li>Three main functions: fault-isolate applications, abstract the
hardware, manage the hardware
</ul>
<li>Examples:
<ul>
<li>OS-X, Windows, Linux, *BSD, ... (desktop, server)
<li>PalmOS Windows/CE (PDA)
<li>Symbian, JavaOS (Cell phones)
<li>VxWorks, pSOS (real-time)
<li> ...
</ul>
<li>OS Abstractions
<ul>
<li>processes: fork, wait, exec, exit, kill, getpid, brk, nice, sleep,
trace
<li>files: open, close, read, write, lseek, stat, sync
<li>directories: mkdir, rmdir, link, unlink, mount, umount
<li>users + security: chown, chmod, getuid, setuid
<li>interprocess communication: signals, pipe
<li>networking: socket, accept, snd, recv, connect
<li>time: gettimeofday
<li>terminal:
</ul>
<li>Sample Unix System calls (mostly POSIX)
<ul>
<li> int read(int fd, void*, int)
<li> int write(int fd, void*, int)
<li> off_t lseek(int fd, off_t, int [012])
<li> int close(int fd)
<li> int fsync(int fd)
<li> int open(const char*, int flags [, int mode])
<ul>
<li> O_RDONLY, O_WRONLY, O_RDWR, O_CREAT
</ul>
<li> mode_t umask(mode_t cmask)
<li> int mkdir(char *path, mode_t mode);
<li> DIR *opendir(char *dirname)
<li> struct dirent *readdir(DIR *dirp)
<li> int closedir(DIR *dirp)
<li> int chdir(char *path)
<li> int link(char *existing, char *new)
<li> int unlink(char *path)
<li> int rename(const char*, const char*)
<li> int rmdir(char *path)
<li> int stat(char *path, struct stat *buf)
<li> int mknod(char *path, mode_t mode, dev_t dev)
<li> int fork()
<ul>
<li> returns childPID in parent, 0 in child; only
difference
</ul>
<li>int getpid()
<li> int waitpid(int pid, int* stat, int opt)
<ul>
<li> pid==-1: any; opt==0||WNOHANG
<li> returns pid or error
</ul>
<li> void _exit(int status)
<li> int kill(int pid, int signal)
<li> int sigaction(int sig, struct sigaction *, struct sigaction *)
<li> int sleep (int sec)
<li> int execve(char* prog, char** argv, char** envp)
<li> void *sbrk(int incr)
<li> int dup2(int oldfd, int newfd)
<li> int fcntl(int fd, F_SETFD, int val)
<li> int pipe(int fds[2])
<ul>
<li> writes on fds[1] will be read on fds[0]
<li> when last fds[1] closed, read on fds[0] returns EOF
<li> when last fds[0] closed, write on fds[1] raises SIGPIPE/fails
with EPIPE
</ul>
<li> int fchown(int fd, uid_t owner, gid_t group)
<li> int fchmod(int fd, mode_t mode)
<li> int socket(int domain, int type, int protocol)
<li> int accept(int socket_fd, struct sockaddr*, int* namelen)
<ul>
<li> returns new fd
</ul>
<li> int listen(int fd, int backlog)
<li> int connect(int fd, const struct sockaddr*, int namelen)
<li> void* mmap(void* addr, size_t len, int prot, int flags, int fd,
off_t offset)
<li> int munmap(void* addr, size_t len)
<li> int gettimeofday(struct timeval*)
</ul>
</ul>
<p>See the <a href="../reference.html">reference page</a> for links to
the early Unix papers.
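<p>As a tiny example of the core calls above, here is a sketch of a copy loop built from only read and write (the buffer size is arbitrary, and partial writes are ignored for brevity):
<pre>
#include &lt;unistd.h&gt;

/* Sketch: copy fd `in` to fd `out` using only read and write;
 * returns the number of bytes copied, or -1 on error. */
long copy_fd(int in, int out)
{
    char buf[512];
    long total = 0;
    for (;;) {
        long n = read(in, buf, sizeof buf);
        if (n < 0)
            return -1;
        if (n == 0)
            return total;   /* end of file */
        write(out, buf, n);
        total += n;
    }
}
</pre>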
<h2>Class structure</h2>
<ul>
<li>Lab: minimal OS for x86 in an exokernel style (50%)
<ul>
<li>kernel interface: hardware + protection
<li>libOS implements fork, exec, pipe, ...
<li>applications: file system, shell, ..
<li>development environment: gcc, bochs
<li>lab 1 is out
</ul>
<li>Lecture structure (20%)
<ul>
<li>homework
<li>45min lecture
<li>45min case study
</ul>
<li>Two quizzes (30%)
<ul>
<li>mid-term
<li>final during exam week
</ul>
</ul>
<h2>Case study: the shell (simplified)</h2>
<ul>
<li>interactive command execution and a programming language
<li>Nice example that uses various OS abstractions. See <a
href="../readings/ritchie74unix.pdf">Unix
paper</a> if you are unfamiliar with the shell.
<li>Final lab is a simple shell.
<li>Basic structure:
<pre>
while (1) {
printf ("$");
readcommand (command, args); // parse user input
if ((pid = fork ()) == 0) { // child?
exec (command, args, 0);
} else if (pid > 0) { // parent?
wait (0); // wait for child to terminate
} else {
perror ("Failed to fork\n");
}
}
</pre>
<p>The split of creating a process with a new program in fork and exec
is mostly a historical accident. See the <a
href="../readings/ritchie79evolution.html">assigned paper</a> for today.
<li>Example:
<pre>
$ ls
</pre>
<li>why call "wait"? to wait for the child to terminate and collect
its exit status. (if child finishes, child becomes a zombie until
parent calls wait.)
<li>I/O: file descriptors. Child inherits open file descriptors
from parent. By convention:
<ul>
<li>file descriptor 0 for input (e.g., keyboard). read_command:
<pre>
read (0, buf, bufsize)
</pre>
<li>file descriptor 1 for output (e.g., terminal)
<pre>
write (1, "hello\n", strlen("hello\n"))
</pre>
<li>file descriptor 2 for error (e.g., terminal)
</ul>
<li>How does the shell implement:
<pre>
$ls > tmp1
</pre>
just before exec insert:
<pre>
close (1);
fd = open ("tmp1", O_CREAT|O_WRONLY); // fd will be 1!
</pre>
<p>The kernel will return the first free file descriptor, 1 in this case.
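<p>An equivalent, more explicit way to set up the redirection is dup2, which avoids relying on the kernel handing back descriptor 1 (a sketch; the path and target descriptor are illustrative, not the shell's actual code):
<pre>
#include &lt;fcntl.h&gt;
#include &lt;unistd.h&gt;

/* Sketch: make descriptor `target` refer to `path`, as the shell
 * does for fd 1 just before exec. Returns target, or -1 on error. */
int redirect_fd(const char *path, int target)
{
    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (fd != target) {
        dup2(fd, target);   /* target now refers to the file */
        close(fd);          /* drop the extra descriptor */
    }
    return target;
}
</pre>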
<li>How does the shell implement sharing an output file:
<pre>
$ls 2> tmp1 > tmp1
</pre>
replace last code with:
<pre>
close (1);
close (2);
fd1 = open ("tmp1", O_CREAT|O_WRONLY); // fd will be 1!
fd2 = dup (fd1);
</pre>
both file descriptors share the file offset
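<p>The shared offset can be demonstrated directly (a sketch; the path is illustrative):
<pre>
#include &lt;fcntl.h&gt;
#include &lt;unistd.h&gt;

/* Sketch: fd1 and fd2 are dup'd, so they share one file offset;
 * the second write appends after the first instead of overwriting. */
int shared_offset_demo(const char *path)
{
    int fd1 = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    int fd2 = dup(fd1);
    write(fd1, "ab", 2);
    write(fd2, "cd", 2);              /* lands at offset 2, not 0 */
    int end = (int) lseek(fd1, 0, SEEK_CUR);
    close(fd1);
    close(fd2);
    return end;                       /* 4 if the offset is shared */
}
</pre>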
<li>how do programs communicate?
<pre>
$ sort file.txt | uniq | wc
</pre>
or
<pre>
$ sort file.txt > tmp1
$ uniq tmp1 > tmp2
$ wc tmp2
$ rm tmp1 tmp2
</pre>
or
<pre>
$ kill -9
</pre>
<li>A pipe is a one-way communication channel. Here is an example
where the parent is the writer and the child is the reader:
<pre>
int fdarray[2];
if (pipe(fdarray) < 0) panic ("error");
if ((pid = fork()) < 0) panic ("error");
else if (pid > 0) {
close(fdarray[0]);
write(fdarray[1], "hello world\n", 12);
} else {
close(fdarray[1]);
n = read (fdarray[0], buf, MAXBUF);
write (1, buf, n);
}
</pre>
<li>How does the shell implement pipelines (i.e., cmd 1 | cmd 2 |..)?
We want to arrange that the output of cmd 1 is the input of cmd 2.
The way to achieve this goal is to manipulate stdout and stdin.
<li>The shell creates a process for each command in
the pipeline, hooks up their stdin and stdout correctly,
and waits for the last process of the
pipeline to exit. A sketch of the core modifications to our shell for
setting up a pipe is:
<pre>
int fdarray[2];
if (pipe(fdarray) < 0) panic ("error");
if ((pid = fork ()) == 0) { // child (left end of pipe)
close (1);
tmp = dup (fdarray[1]); // fdarray[1] is the write end, tmp will be 1
close (fdarray[0]); // close read end
close (fdarray[1]); // close fdarray[1]
exec (command1, args1, 0);
} else if (pid > 0) { // parent (right end of pipe)
close (0);
tmp = dup (fdarray[0]); // fdarray[0] is the read end, tmp will be 0
close (fdarray[0]);
close (fdarray[1]); // close write end
exec (command2, args2, 0);
} else {
printf ("Unable to fork\n");
}
</pre>
<li>Why close the read end and the write end? Multiple reasons: it
maintains the invariant that every process starts with 3 file
descriptors, and reading from an empty
pipe blocks the reader, while reading from a closed pipe returns end of
file.
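<p>These end-of-file semantics can be checked with a few lines of C (a sketch):
<pre>
#include &lt;unistd.h&gt;

/* Sketch: after the last write end is closed, reads drain the
 * buffered data and then return 0 (end of file), not block. */
int pipe_eof_demo(void)
{
    int fds[2];
    char buf[8];
    if (pipe(fds) < 0)
        return -1;
    write(fds[1], "hi", 2);
    close(fds[1]);                          /* last write end closed */
    int n1 = read(fds[0], buf, sizeof buf); /* 2: the buffered bytes */
    int n2 = read(fds[0], buf, sizeof buf); /* 0: end of file */
    close(fds[0]);
    return n1 * 10 + n2;
}
</pre>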
<li>How do you background jobs?
<pre>
$ compute &
</pre>
<li>How does the shell implement "&", backgrounding? (Don't call wait
immediately).
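<p>One way the shell might implement this is to make the wait conditional on "&amp;" (a sketch, not the actual shell code; the helper name is invented):
<pre>
#include &lt;sys/wait.h&gt;
#include &lt;unistd.h&gt;

/* Sketch: fork+exec a command; wait only when it runs in the
 * foreground. Returns the exit status, or 0 for background jobs. */
int run_command(const char *cmd, char *const argv[], int background)
{
    pid_t pid = fork();
    if (pid == 0) {
        execvp(cmd, argv);
        _exit(127);                /* exec failed */
    }
    if (background)
        return 0;                  /* reap later, e.g. with WNOHANG */
    int status = 0;
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
</pre>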
<li>More details in the shell lecture later in the term.
</body>
<title>High-performance File Systems</title>
<html>
<head>
</head>
<body>
<h1>High-performance File Systems</h1>
<p>Required reading: soft updates.
<h2>Overview</h2>
<p>A key problem in designing file systems is how to obtain good
performance on file system operations while providing consistency.
By consistency, we mean that file system invariants are maintained
on disk. These invariants include that if a file is created, it
appears in its directory, etc. If the file system data structures are
consistent, then it is possible to rebuild the file system to a
correct state after a failure.
<p>To ensure consistency of on-disk file system data structures,
modifications to the file system must respect certain rules:
<ul>
<li>Never point to a structure before it is initialized. An inode must
be initialized before a directory entry references it. A block must
be initialized before an inode references it.
<li>Never reuse a structure before nullifying all pointers to it. An
inode pointer to a disk block must be reset before the file system can
reallocate the disk block.
<li>Never reset the last pointer to a live structure before a new
pointer is set. When renaming a file, the file system should not
remove the old name for an inode until after the new name has been
written.
</ul>
The paper calls these dependencies update dependencies.
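<p>The first rule can be mimicked with a tiny in-memory sketch (the block numbers, block contents, and helpers here are all invented for illustration):
<pre>
#include &lt;string.h&gt;

/* Sketch with an in-memory "disk": create_file() must write the
 * inode block before the directory block that references it,
 * mirroring the first update rule. */
enum { NBLOCKS = 8, BSIZE = 16 };
static char disk[NBLOCKS][BSIZE];

static void write_block(int b, const char *data) /* "synchronous" */
{
    memcpy(disk[b], data, BSIZE);
}

void create_file(int inode_block, int dir_block)
{
    write_block(inode_block, "inode: allocate");  /* rule 1: init first */
    write_block(dir_block,   "dirent -> inode");  /* then reference it  */
}
</pre>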
<p>xv6 ensures these rules by writing every block synchronously, and
by ordering the writes appropriately. By synchronous, we mean
that a process waits until the current disk write has
completed before continuing with execution.
<ul>
<li>What happens if power fails after line 4776 in mknod1? Did we lose the
inode forever? No: we have a separate program (called fsck) that
can rebuild the disk structures correctly and can put the inode back on
the free list.
<li>Does the order of writes in mknod1 matter? Say, what if we wrote
directory entry first and then wrote the allocated inode to disk?
This violates the update rules and is not a good plan. If a
failure happens after the directory write, then on recovery we have
a directory pointing to an unallocated inode, which may now be
allocated by another process for another file!
<li>Can we turn the writes (i.e., the ones invoked by iupdate and
wdir) into delayed writes without creating problems? No, because
the buffer cache might write them back to the disk in an incorrect
order. It has no information to decide in what order to write them.
</ul>
<p>xv6 is a nice example of the tension between consistency and
performance. To get consistency, xv6 uses synchronous writes,
but these writes are slow, because they perform at the rate of a
seek instead of the rate of the maximum data transfer rate. The
bandwidth to a disk is reasonably high for large transfers (around
50 MB/s), but latency is high, because of the cost of moving the
disk arm(s) (the seek latency is about 10 msec).
<p>This tension is an implementation-dependent one. The Unix API
doesn't require that writes are synchronous. Updates don't have to
appear on disk until a sync, fsync, or open with O_SYNC. Thus, in
principle, the UNIX API allows delayed writes, which are good for
performance:
<ul>
<li>Batch many writes together in a big one, written at the disk data
rate.
<li>Absorb writes to the same block.
<li>Schedule writes to avoid seeks.
</ul>
<p>Thus the question: how to delay writes and achieve consistency?
The paper provides an answer.
<h2>This paper</h2>
<p>The paper surveys some of the existing techniques and introduces a
new one to achieve the goal of performance and consistency.
<p>
<p>Techniques possible:
<ul>
<li>Equip system with NVRAM, and put buffer cache in NVRAM.
<li>Logging. Often used in UNIX file systems for metadata updates.
LFS is an extreme version of this strategy.
<li>Flusher-enforced ordering. All writes are delayed. The flusher
is aware of dependencies between blocks, but this alone doesn't work
because circular dependencies need to be broken by writing blocks out.
</ul>
<p>Soft updates is the solution explored in this paper. It doesn't
require NVRAM, and performs as well as the naive strategy of keeping all
dirty blocks in main memory. Compared to logging, it is unclear whether
soft updates is better. The default BSD file system uses soft
updates, but most Linux file systems use logging.
<p>Soft updates is a sophisticated variant of flusher-enforced
ordering. Instead of maintaining dependencies at the block level, it
maintains dependencies at the file-structure level (per inode, per
directory, etc.), reducing circular dependencies. Furthermore, it
breaks any remaining circular dependencies by undoing changes before
writing the block and then redoing them after the write completes.
<p>Pseudocode for create:
<pre>
create (f) {
allocate inode in block i (assuming inode is available)
add i to directory data block d (assuming d has space)
mark d as dependent on i, and create undo/redo record
update directory inode in block di
mark di as dependent on d
}
</pre>
<p>Pseudocode for the flusher:
<pre>
flushblock (b)
{
lock b;
for all dependencies that b is relying on
"remove" that dependency by undoing the change to b
mark the dependency as "unrolled"
write b
}
write_completed (b) {
remove dependencies that depend on b
reapply "unrolled" dependencies that b depended on
unlock b
}
</pre>
<p>Apply flush algorithm to example:
<ul>
<li>A list of two dependencies: directory->inode, inode->directory.
<li>Lets say syncer picks directory first
<li>Undo directory->inode changes (i.e., unroll <A,#4>)
<li>Write directory block
<li>Remove met dependencies (i.e., remove inode->directory dependency)
<li>Perform redo operation (i.e., redo <A,#4>)
<li>Select inode block and write it
<li>Remove met dependencies (i.e., remove directory->inode dependency)
<li>Select directory block (it is dirty again!)
<li>Write it.
</ul>
<p>A file operation that is important for file-system consistency
is rename. Rename conceptually works as follows:
<pre>
rename (from, to)
unlink (to);
link (from, to);
unlink (from);
</pre>
<p>Rename is often used by programs to make a new version of a file
the current version. Committing to a new version must happen
atomically. Unfortunately, without transaction-like support,
atomicity is impossible to guarantee, so a typical file system
provides weaker semantics for rename: if to already exists, an
instance of to will always exist, even if the system should crash in
the middle of the operation. Does the above implementation of rename
guarantee these semantics? (Answer: no.)
<p>If rename is implemented as unlink, link, unlink, then it is
difficult to guarantee even the weak semantics. Modern UNIXes provide
rename as a file system call:
<pre>
update dir block for to point to from's inode // write block
update dir block for from to free entry // write block
</pre>
<p>fsck may need to correct refcounts in the inode if the file
system fails during rename. For example, a crash after the first
write followed by fsck should set the refcount to 2, since both from
and to are pointing at the inode.
<p>This semantics is sufficient, however, for an application to ensure
atomicity. Before the call, there is a from and perhaps a to. If the
call is successful, following the call there is only a to. If there
is a crash, there may be both a from and a to, in which case the
caller knows the previous attempt failed, and must retry. The
subtlety is that if you now follow the two links, the "to" name may
link to either the old file or the new file. If it links to the new
file, that means that there was a crash and you just detected that the
rename operation was composite. On the other hand, the retry
procedure can be the same for either case (do the rename again), so it
isn't necessary to discover how it failed. The function follows the
golden rule of recoverability, and it is idempotent, so it lays all
the needed groundwork for use as part of a true atomic action.
<p>With soft updates, rename becomes:
<pre>
rename (from, to) {
i = namei(from);
add to "to" directory data block td a reference to inode i
mark td as dependent on block i
update "to" directory inode in block tdi
mark tdi as dependent on td
remove from "from" directory data block fd the reference to inode i
mark fd as dependent on tdi
update "from" directory inode in block fdi
mark fdi as dependent on fd
}
</pre>
<p>No synchronous writes!
<p>What needs to be done on recovery? (Inspect every statement in
rename and see what inconsistencies could exist on the disk; e.g.,
refcnt of an inode could be too high.) None of these inconsistencies require
fixing before the file system can operate; they can be fixed by a
background file system repairer.
<h2>Paper discussion</h2>
<p>Do soft updates perform any useless writes? (A useless write is a
write that will be immediately overwritten.) (Answer: yes.) Fix the
syncer to be careful about which block it starts with. Fix cache
replacement to select the LRU block with no pending dependencies.
<p>Can a log-structured file system implement rename better? (Answer:
yes, since it can get the refcnts right).
<p>Discuss all graphs.
</body>
<title>Lecture 5</title>
<html>
<head>
</head>
<body>
<h2>Address translation and sharing using page tables</h2>
<p> Reading: <a href="../readings/i386/toc.htm">80386</a> chapters 5 and 6<br>
<p> Handout: <b> x86 address translation diagram</b> -
<a href="x86_translation.ps">PS</a> -
<a href="x86_translation.eps">EPS</a> -
<a href="x86_translation.fig">xfig</a>
<br>
<p>Why do we care about x86 address translation?
<ul>
<li>It can simplify s/w structure by placing data at fixed known addresses.
<li>It can implement tricks like demand paging and copy-on-write.
<li>It can isolate programs to contain bugs.
<li>It can isolate programs to increase security.
<li>JOS uses paging a lot, and segments more than you might think.
</ul>
<p>Why aren't protected-mode segments enough?
<ul>
<li>Why did the 386 add translation using page tables as well?
<li>Isn't it enough to give each process its own segments?
</ul>
<p>Translation using page tables on x86:
<ul>
<li>paging hardware maps linear address (la) to physical address (pa)
<li>(we will often interchange "linear" and "virtual")
<li>page size is 4096 bytes, so there are 1,048,576 pages in a 2^32-byte address space
<li>why not just have a big array with each page #'s translation?
<ul>
<li>table[20-bit linear page #] => 20-bit phys page #
</ul>
<li>386 uses 2-level mapping structure
<li>one page directory page, with 1024 page directory entries (PDEs)
<li>up to 1024 page table pages, each with 1024 page table entries (PTEs)
<li>so la has 10 bits of directory index, 10 bits table index, 12 bits offset
<li>What's in a PDE or PTE?
<ul>
<li>20-bit phys page number, present, read/write, user/supervisor
</ul>
<li>cr3 register holds physical address of current page directory
<li>puzzle: what do PDE read/write and user/supervisor flags mean?
<li>puzzle: can supervisor read/write user pages?
<li>Here's how the MMU translates an la to a pa:
<pre>
uint
translate (uint la, bool user, bool write)
{
uint pde, pte;
pde = read_mem (%CR3 + 4*(la >> 22));
access (pde, user, write);
pte = read_mem ( (pde & 0xfffff000) + 4*((la >> 12) & 0x3ff));
access (pte, user, write);
return (pte & 0xfffff000) + (la & 0xfff);
}
// check protection. pxe is a pte or pde.
// user is true if CPL==3
void
access (uint pxe, bool user, bool write)
{
if (!(pxe & PG_P))
=> page fault -- page not present
if (!(pxe & PG_U) && user)
=> page fault -- no access for user
if (write && !(pxe & PG_W))
if (user)
=> page fault -- not writable
else if (!(pxe & PG_U))
=> page fault -- not writable
else if (%CR0 & CR0_WP)
=> page fault -- not writable
}
</pre>
<li>CPU's TLB caches vpn => ppn mappings
<li>if you change a PDE or PTE, you must flush the TLB!
<ul>
<li>by re-loading cr3
</ul>
<li>turn on paging by setting the CR0_PG bit of %cr0
</ul>
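<p>The 10/10/12 split of a linear address can be written down as C macros (a sketch using JOS-style names):
<pre>
#include &lt;stdint.h&gt;

/* Sketch: split a 32-bit linear address into its three parts. */
#define PDX(la)   (((uint32_t)(la) >> 22) & 0x3FF)  /* page directory index */
#define PTX(la)   (((uint32_t)(la) >> 12) & 0x3FF)  /* page table index */
#define PGOFF(la) ((uint32_t)(la) & 0xFFF)          /* offset within page */
</pre>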
Can we use paging to limit what memory an app can read/write?
<ul>
<li>user can't modify cr3 (requires privilege)
<li>is that enough?
<li>could user modify page tables? after all, they are in memory.
</ul>
<p>How we will use paging (and segments) in JOS:
<ul>
<li>use segments only to switch privilege level into/out of kernel
<li>use paging to structure process address space
<li>use paging to limit process memory access to its own address space
<li>below is the JOS virtual memory map
<li>why map both kernel and current process? why not 4GB for each?
<li>why is the kernel at the top?
<li>why map all of phys mem at the top? i.e. why multiple mappings?
<li>why map page table a second time at VPT?
<li>why map page table a third time at UVPT?
<li>how do we switch mappings for a different process?
</ul>
<pre>
4 Gig --------> +------------------------------+
| | RW/--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
: . :
: . :
: . :
|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~| RW/--
| | RW/--
| Remapped Physical Memory | RW/--
| | RW/--
KERNBASE -----> +------------------------------+ 0xf0000000
| Cur. Page Table (Kern. RW) | RW/-- PTSIZE
VPT,KSTACKTOP--> +------------------------------+ 0xefc00000 --+
| Kernel Stack | RW/-- KSTKSIZE |
| - - - - - - - - - - - - - - -| PTSIZE
| Invalid Memory | --/-- |
ULIM ------> +------------------------------+ 0xef800000 --+
| Cur. Page Table (User R-) | R-/R- PTSIZE
UVPT ----> +------------------------------+ 0xef400000
| RO PAGES | R-/R- PTSIZE
UPAGES ----> +------------------------------+ 0xef000000
| RO ENVS | R-/R- PTSIZE
UTOP,UENVS ------> +------------------------------+ 0xeec00000
UXSTACKTOP -/ | User Exception Stack | RW/RW PGSIZE
+------------------------------+ 0xeebff000
| Empty Memory | --/-- PGSIZE
USTACKTOP ---> +------------------------------+ 0xeebfe000
| Normal User Stack | RW/RW PGSIZE
+------------------------------+ 0xeebfd000
| |
| |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
. .
. .
. .
|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|
| Program Data & Heap |
UTEXT --------> +------------------------------+ 0x00800000
PFTEMP -------> | Empty Memory | PTSIZE
| |
UTEMP --------> +------------------------------+ 0x00400000
| Empty Memory | PTSIZE
0 ------------> +------------------------------+
</pre>
<h3>The VPT </h3>
<p>Remember how the X86 translates virtual addresses into physical ones:
<p><img src=pagetables.png>
<p>CR3 points at the page directory. The PDX part of the address
indexes into the page directory to give you a page table. The
PTX part indexes into the page table to give you a page, and then
you add the low bits in.
<p>But the processor has no concept of page directories, page tables,
and pages being anything other than plain memory. So there's nothing
that says a particular page in memory can't serve as two or three of
these at once. The processor just follows pointers:
<pre>
pd = lcr3();
pt = *(pd+4*PDX);
page = *(pt+4*PTX);
</pre>
<p>Diagrammatically, it starts at CR3, follows three arrows, and then stops.
<p>If we put a pointer into the page directory that points back to itself at
index Z, as in
<p><img src=vpt.png>
<p>then when we try to translate a virtual address with PDX and PTX
equal to V, following three arrows leaves us at the page directory.
So that virtual page translates to the page holding the page directory.
In Jos, V is 0x3BD, so the virtual address of the VPD is
(0x3BD&lt;&lt;22)|(0x3BD&lt;&lt;12).
<p>Now, if we try to translate a virtual address with PDX = V but an
arbitrary PTX != V, then following three arrows from CR3 ends
one level up from usual (instead of two as in the last case),
which is to say in the page tables. So the set of virtual pages
with PDX=V form a 4MB region whose page contents, as far
as the processor is concerned, are the page tables themselves.
In Jos, V is 0x3BD so the virtual address of the VPT is (0x3BD&lt;&lt;22).
<p>So because of the "no-op" arrow we've cleverly inserted into
the page directory, we've mapped the pages being used as
the page directory and page table (which are normally virtually
invisible) into the virtual address space.
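<p>The resulting fixed virtual addresses follow directly from the slot number V (a sketch; the text above uses V = 0x3BD for JOS):
<pre>
#include &lt;stdint.h&gt;

/* Sketch: with self-mapping slot V in the page directory, the 4MB
 * window of page tables (PDX = V) and the page directory page itself
 * (PDX = PTX = V) land at addresses computable from V alone. */
uint32_t vpt_base(uint32_t v) { return v << 22; }
uint32_t vpd_addr(uint32_t v) { return (v << 22) | (v << 12); }
</pre>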
</body>
#!/usr/bin/perl
my @lines = <>;
my $text = join('', @lines);
my $title;
if($text =~ /^\*\* (.*?)\n/m){
$title = $1;
$text = $` . $';
}else{
$title = "Untitled";
}
$text =~ s/[ \t]+$//mg;
$text =~ s/^$/<br><br>/mg;
$text =~ s!\b([a-z0-9]+\.(c|s|pl|h))\b!<a href="src/$1.html">$1</a>!g;
$text =~ s!^(Lecture [0-9]+\. .*?)$!<b><i>$1</i></b>!mg;
$text =~ s!^\* (.*?)$!<h2>$1</h2>!mg;
$text =~ s!((<br>)+\n)+<h2>!\n<h2>!g;
$text =~ s!</h2>\n?((<br>)+\n)+!</h2>\n!g;
$text =~ s!((<br>)+\n)+<b>!\n<br><br><b>!g;
$text =~ s!\b\s*--\s*\b!\&ndash;!g;
$text =~ s!\[([^\[\]|]+) \| ([^\[\]]+)\]!<a href="$1">$2</a>!g;
$text =~ s!\[([^ \t]+)\]!<a href="$1">$1</a>!g;
$text =~ s!``!\&ldquo;!g;
$text =~ s!''!\&rdquo;!g;
print <<EOF;
<!-- AUTOMATICALLY GENERATED: EDIT the .txt version, not the .html version -->
<html>
<head>
<title>$title</title>
<style type="text/css"><!--
body {
background-color: white;
color: black;
font-size: medium;
line-height: 1.2em;
margin-left: 0.5in;
margin-right: 0.5in;
margin-top: 0;
margin-bottom: 0;
}
h1 {
text-indent: 0in;
text-align: left;
margin-top: 2em;
font-weight: bold;
font-size: 1.4em;
}
h2 {
text-indent: 0in;
text-align: left;
margin-top: 2em;
font-weight: bold;
font-size: 1.2em;
}
--></style>
</head>
<body bgcolor=#ffffff>
<h1>$title</h1>
<br><br>
EOF
print $text;
print <<EOF;
</body>
</html>
EOF
<title>Homework: xv6 and Interrupts and Exceptions</title>
<html>
<head>
</head>
<body>
<h1>Homework: xv6 and Interrupts and Exceptions</h1>
<p>
<b>Read</b>: xv6's trapasm.S, trap.c, syscall.c, vectors.S, and usys.S. Skim
lapic.c, ioapic.c, and picirq.c
<p>
<b>Hand-In Procedure</b>
<p>
You are to turn in this homework during lecture. Please
write up your answers to the exercises below and hand them in to a
6.828 staff member at the beginning of the lecture.
<p>
<b>Introduction</b>
<p>Try to understand
xv6's trapasm.S, trap.c, syscall.c, vectors.S, and usys.S.
You will need to consult:
<p>Chapter 5 of <a href="../readings/ia32/IA32-3.pdf">IA-32 Intel
Architecture Software Developer's Manual, Volume 3: System programming
guide</a>; you can skip sections 5.7.1, 5.8.2, and 5.12.2. Be aware
that terms such as exceptions, traps, interrupts, faults and aborts
have no standard meaning.
<p>Chapter 9 of the 1987 <a href="../readings/i386/toc.htm">i386
Programmer's Reference Manual</a> also covers exception and interrupt
handling in IA32 processors.
<p><b>Assignment</b>:
In xv6, set a breakpoint at the beginning of <code>syscall()</code> to
catch the very first system call. What values are on the stack at
this point? Turn in the output of <code>print-stack 35</code> at that
breakpoint with each value labeled as to what it is (e.g.,
saved <code>%ebp</code> for <code>trap</code>,
<code>trapframe.eip</code>, etc.).
<p>
<b>This completes the homework.</b>
</body>
<title>Homework: Intro to x86 and PC</title>
<html>
<head>
</head>
<body>
<h1>Homework: Intro to x86 and PC</h1>
<p>Today's lecture is an introduction to the x86 and the PC, the
platform for which you will write an operating system. The assigned
book is a reference for the x86 assembly programming that you will
do in this course.
<p><b>Assignment</b> Make sure to do exercise 1 of lab 1 before
coming to lecture.
</body>
<title>Homework: x86 MMU</title>
<html>
<head>
</head>
<body>
<h1>Homework: x86 MMU</h1>
<p>Read chapters 5 and 6 of
<a href="../readings/i386/toc.htm">Intel 80386 Reference Manual</a>.
These chapters explain
the x86 Memory Management Unit (MMU),
which we will cover in lecture today and which you need
to understand in order to do lab 2.
<p>
<b>Read</b>: bootasm.S and setupsegs() in proc.c
<p>
<b>Hand-In Procedure</b>
<p>
You are to turn in this homework during lecture. Please
write up your answers to the exercises below and hand them in to a
6.828 staff member by the beginning of lecture.
<p>
<p><b>Assignment</b>: Try to understand setupsegs() in proc.c.
What values are written into <code>gdt[SEG_UCODE]</code>
and <code>gdt[SEG_UDATA]</code> for init, the first user-space
process?
(You can use Bochs to answer this question.)
</body>
<html>
<head>
<title>Homework: Files and Disk I/O</title>
</head>
<body>
<h1>Homework: Files and Disk I/O</h1>
<p>
<b>Read</b>: bio.c, fd.c, fs.c, and ide.c
<p>
This homework should be turned in at the beginning of lecture.
<p>
<b>File and Disk I/O</b>
<p>Insert a print statement in bwrite so that you get a
print every time a block is written to disk:
<pre>
cprintf("bwrite sector %d\n", sector);
</pre>
<p>Build and boot a new kernel and run these four commands at the shell:
<pre>
echo &gt;a
echo &gt;a
rm a
mkdir d
</pre>
(You can try <tt>rm d</tt> if you are curious, but it should look
almost identical to <tt>rm a</tt>.)
<p>You should see a sequence of bwrite prints after running each command.
Record the list and annotate it with the calling function and
what block is being written.
For example, this is the <i>second</i> <tt>echo &gt;a</tt>:
<pre>
$ echo >a
bwrite sector 121 # writei (data block)
bwrite sector 3 # iupdate (inode block)
$
</pre>
<p>Hint: the easiest way to get the name of the
calling function is to add a string argument to bwrite,
edit all the calls to bwrite to pass the name of the
calling function, and just print it.
You should be able to reason about what kind of
block is being written just from the calling function.
<p>You need not write the following up, but try to
understand why each write is happening. This will
help your understanding of the file system layout
and the code.
<p>
<b>This completes the homework.</b>
</body>
<title>Homework: intro to xv6</title>
<html>
<head>
</head>
<body>
<h1>Homework: intro to xv6</h1>
<p>This lecture is the introduction to xv6, our re-implementation of
Unix v6. Read the source code in the assigned files. You won't have
to understand the details yet; we will focus on how the first
user-level process comes into existence after the computer is turned
on.
<p>
<b>Hand-In Procedure</b>
<p>
You are to turn in this homework during lecture. Please
write up your answers to the exercises below and hand them in to a
6.828 staff member at the beginning of lecture.
<p>
<p><b>Assignment</b>:
<br>
Fetch and un-tar the xv6 source:
<pre>
sh-3.00$ wget http://pdos.csail.mit.edu/6.828/2007/src/xv6-rev1.tar.gz
sh-3.00$ tar xzvf xv6-rev1.tar.gz
xv6/
xv6/asm.h
xv6/bio.c
xv6/bootasm.S
xv6/bootmain.c
...
$
</pre>
Build xv6:
<pre>
$ cd xv6
$ make
gcc -O -nostdinc -I. -c bootmain.c
gcc -nostdinc -I. -c bootasm.S
ld -N -e start -Ttext 0x7C00 -o bootblock.o bootasm.o bootmain.o
objdump -S bootblock.o > bootblock.asm
objcopy -S -O binary bootblock.o bootblock
...
$
</pre>
Find the address of the <code>main</code> function by
looking in <code>kernel.asm</code>:
<pre>
% grep main kernel.asm
...
00102454 &lt;mpmain&gt;:
mpmain(void)
001024d0 &lt;main&gt;:
10250d: 79 f1 jns 102500 &lt;main+0x30&gt;
1025f3: 76 6f jbe 102664 &lt;main+0x194&gt;
102611: 74 2f je 102642 &lt;main+0x172&gt;
</pre>
In this case, the address is <code>001024d0</code>.
<p>
Run the kernel inside Bochs, setting a breakpoint
at the beginning of <code>main</code> (i.e., the address
you just found).
<pre>
$ make bochs
if [ ! -e .bochsrc ]; then ln -s dot-bochsrc .bochsrc; fi
bochs -q
========================================================================
Bochs x86 Emulator 2.2.6
(6.828 distribution release 1)
========================================================================
00000000000i[ ] reading configuration from .bochsrc
00000000000i[ ] installing x module as the Bochs GUI
00000000000i[ ] Warning: no rc file specified.
00000000000i[ ] using log file bochsout.txt
Next at t=0
(0) [0xfffffff0] f000:fff0 (unk. ctxt): jmp far f000:e05b ; ea5be000f0
(1) [0xfffffff0] f000:fff0 (unk. ctxt): jmp far f000:e05b ; ea5be000f0
&lt;bochs&gt;
</pre>
Look at the registers and the stack contents:
<pre>
&lt;bochs&gt; info reg
...
&lt;bochs&gt; print-stack
...
&lt;bochs&gt;
</pre>
Which part of the stack printout is actually the stack?
(Hint: not all of it.) Identify all the non-zero values
on the stack.<p>
<b>Turn in:</b> the output of print-stack with
the valid part of the stack marked. Write a short (3-5 word)
comment next to each non-zero value explaining what it is.
<p>
Now look at kernel.asm for the instructions in main that read:
<pre>
10251e: 8b 15 00 78 10 00 mov 0x107800,%edx
102524: 8d 04 92 lea (%edx,%edx,4),%eax
102527: 8d 04 42 lea (%edx,%eax,2),%eax
10252a: c1 e0 04 shl $0x4,%eax
10252d: 01 d0 add %edx,%eax
10252f: 8d 04 85 1c ad 10 00 lea 0x10ad1c(,%eax,4),%eax
102536: 89 c4 mov %eax,%esp
</pre>
(The addresses and constants might be different on your system,
and the compiler might use <code>imul</code> instead of the <code>lea,lea,shl,add,lea</code> sequence.
Look for the move into <code>%esp</code>).
<p>
Which lines in <code>main.c</code> do these instructions correspond to?
<p>
Set a breakpoint at the first of those instructions
and let the program run until the breakpoint:
<pre>
&lt;bochs&gt; vb 0x8:0x10251e
&lt;bochs&gt; s
...
&lt;bochs&gt; c
(0) Breakpoint 2, 0x0010251e (0x0008:0x0010251e)
Next at t=1157430
(0) [0x0010251e] 0008:0x0010251e (unk. ctxt): mov edx, dword ptr ds:0x107800 ; 8b1500781000
(1) [0xfffffff0] f000:fff0 (unk. ctxt): jmp far f000:e05b ; ea5be000f0
&lt;bochs&gt;
</pre>
(The first <code>s</code> command is necessary
to single-step past the breakpoint at <code>main</code>; otherwise <code>c</code>
will not make any progress.)
<p>
Inspect the registers and stack again
(<code>info reg</code> and <code>print-stack</code>).
Then step past those seven instructions
(<code>s 7</code>)
and inspect the registers and stack again.
Convince yourself that the stack has changed correctly.
<p>
<b>Turn in:</b> answers to the following questions.
Look at the assembly for the call to
<code>lapic_init</code> that occurs after the
stack switch. Where does the
<code>bcpu</code> argument come from?
What would have happened if <code>main</code>
stored <code>bcpu</code>
on the stack before those seven assembly instructions?
Would the code still work? Why or why not?
<p>
</body>
</html>
<title>Homework: Locking</title>
<html>
<head>
</head>
<body>
<h1>Homework: Locking</h1>
<p>
<b>Read</b>: spinlock.c
<p>
<b>Hand-In Procedure</b>
<p>
You are to turn in this homework at the beginning of lecture. Please
write up your answers to the exercises below and hand them in to a
6.828 staff member at the beginning of lecture.
<p>
<b>Assignment</b>:
In this assignment we will explore some of the interaction
between interrupts and locking.
<p>
Make sure you understand what would happen if the kernel executed
the following code snippet:
<pre>
struct spinlock lk;
initlock(&amp;lk, "test lock");
acquire(&amp;lk);
acquire(&amp;lk);
</pre>
(Feel free to use Bochs to find out. <code>acquire</code> is in <code>spinlock.c</code>.)
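<p>A toy user-space model of why the second <code>acquire</code> is fatal
(the field and the error-return convention here are assumptions for this
sketch; the real acquire uses an atomic exchange and panics rather than
returning an error):
<pre>
struct spinlock { int locked; };

// Returns 0 on success, -1 where the real acquire would panic --
// without that check it would spin forever waiting on itself.
int acquire_(struct spinlock *lk)
{
    if (lk->locked)
        return -1;      // lock already held by this processor
    lk->locked = 1;     // the real code uses an atomic xchg
    return 0;
}

int main(void)
{
    struct spinlock lk = { 0 };
    int first  = acquire_(&lk);   // succeeds
    int second = acquire_(&lk);   // would deadlock: the caller holds lk
    return !(first == 0 && second == -1);
}
</pre>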
<p>
An <code>acquire</code> ensures interrupts are off
on the local processor using <code>cli</code>,
and interrupts remain off until the <code>release</code>
of the last lock held by that processor
(at which point they are enabled using <code>sti</code>).
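<p>The rule can be sketched with a per-processor count of held locks (the
names below are assumptions for this sketch; xv6 keeps a similar counter
for each CPU):
<pre>
static int nlock;             // locks held by this processor
static int ints_enabled = 1;  // stand-in for the IF flag

void acquire_(void) { ints_enabled = 0; nlock++; }          // cli on any acquire
void release_(void) { if (--nlock == 0) ints_enabled = 1; } // sti on the last release

int main(void)
{
    acquire_();                  // take two locks
    acquire_();
    release_();
    if (ints_enabled) return 1;  // still holding one: interrupts stay off
    release_();
    if (!ints_enabled) return 2; // the last release re-enables them
    return 0;
}
</pre>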
<p>
Let's see what happens if we turn on interrupts while
holding the <code>ide</code> lock.
In <code>ide_rw</code> in <code>ide.c</code>, add a call
to <code>sti()</code> after the <code>acquire()</code>.
Rebuild the kernel and boot it in Bochs.
Chances are the kernel will panic soon after boot; try booting Bochs a few times
if it doesn't.
<p>
<b>Turn in</b>: explain in a few sentences why the kernel panicked.
You may find it useful to look up the stack trace
(the sequence of <code>%eip</code> values printed by <code>panic</code>)
in the <code>kernel.asm</code> listing.
<p>
Remove the <code>sti()</code> you added,
rebuild the kernel, and make sure it works again.
<p>
Now let's see what happens if we turn on interrupts
while holding the <code>kalloc_lock</code>.
In <code>kalloc()</code> in <code>kalloc.c</code>, add
a call to <code>sti()</code> after the call to <code>acquire()</code>.
You will also need to add
<code>#include "x86.h"</code> at the top of the file after
the other <code>#include</code> lines.
Rebuild the kernel and boot it in Bochs.
It will not panic.
<p>
<b>Turn in</b>: explain in a few sentences why the kernel didn't panic.
What is different about <code>kalloc_lock</code>
as compared to <code>ide_lock</code>?
<p>
You do not need to understand anything about the details of the IDE hardware
to answer this question, but you may find it helpful to look
at which functions acquire each lock, and then at when those
functions get called.
<p>
(There is a very small but non-zero chance that the kernel will panic
with the extra <code>sti()</code> in <code>kalloc</code>.
If the kernel <i>does</i> panic, make doubly sure that
you removed the <code>sti()</code> call from
<code>ide_rw</code>. If it continues to panic and the
only extra <code>sti()</code> is in <code>kalloc.c</code>,
then mail <i>6.828-staff&#64;pdos.csail.mit.edu</i>
and think about buying a lottery ticket.)
<p>
<b>Turn in</b>: Why does <code>release()</code> clear
<code>lock-&gt;pcs[0]</code> and <code>lock-&gt;cpu</code>
<i>before</i> clearing <code>lock-&gt;locked</code>?
Why not wait until after?
</body>
</html>
<html>
<head>
<title>Homework: Naming</title>
</head>
<body>
<h1>Homework: Naming</h1>
<p>
<b>Read</b>: namei in fs.c, fd.c, sysfile.c
<p>
This homework should be turned in at the beginning of lecture.
<p>
<b>Symbolic Links</b>
<p>
As you read namei and explore its varied uses throughout xv6,
think about what steps would be required to add symbolic links
to xv6.
A symbolic link is simply a file with a special type (e.g., T_SYMLINK
instead of T_FILE or T_DIR) whose contents contain the path being
linked to.
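<p>As a toy model of the idea (every name here is an assumption for
illustration, not xv6 code), resolution simply re-runs the lookup on the
stored target path, with a depth limit so that a cycle of links cannot
loop forever:
<pre>
enum { T_FILE = 1, T_DIR, T_SYMLINK };

struct entry { const char *name; int type; const char *target; };

// A fake two-entry directory: "b" is a symlink to "a".
static struct entry fs[] = {
    { "a", T_FILE,    0   },
    { "b", T_SYMLINK, "a" },
};

static int streq(const char *a, const char *b)
{
    while (*a && *a == *b) { a++; b++; }
    return *a == *b;
}

static struct entry *lookup(const char *name)
{
    int i;
    for (i = 0; i < 2; i++)
        if (streq(fs[i].name, name))
            return &fs[i];
    return 0;
}

// namei-style resolution: follow symlinks, give up after 10 hops.
static struct entry *resolve(const char *name)
{
    struct entry *e = lookup(name);
    int depth = 0;
    while (e && e->type == T_SYMLINK) {
        if (++depth > 10)
            return 0;           // too many links, likely a cycle
        e = lookup(e->target);
    }
    return e;
}

int main(void)
{
    return !(resolve("b") == lookup("a"));  // link followed to the file
}
</pre>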
<p>
Turn in a short writeup of how you would change xv6 to support
symlinks. List the functions that would have to be added or changed,
with short descriptions of the new functionality or changes.
<p>
<b>This completes the homework.</b>
<p>
The following is <i>not required</i>. If you want to try implementing
symbolic links in xv6, here are the files that the course staff
had to change to implement them:
<pre>
fs.c: 20 lines added, 4 modified
syscall.c: 2 lines added
syscall.h: 1 line added
sysfile.c: 15 lines added
user.h: 1 line added
usys.S: 1 line added
</pre>
Also, here is an <i>ln</i> program:
<pre>
#include "types.h"
#include "user.h"

int
main(int argc, char *argv[])
{
  int (*ln)(char*, char*);

  ln = link;
  if(argc &gt; 1 &amp;&amp; strcmp(argv[1], "-s") == 0){
    ln = symlink;
    argc--;
    argv++;
  }
  if(argc != 3){
    printf(2, "usage: ln [-s] old new (%d)\n", argc);
    exit();
  }
  if(ln(argv[1], argv[2]) &lt; 0){
    printf(2, "%s failed\n", ln == symlink ? "symlink" : "link");
    exit();
  }
  exit();
}
</pre>
</body>
<title>Homework: Threads and Context Switching</title>
<html>
<head>
</head>
<body>
<h1>Homework: Threads and Context Switching</h1>
<p>
<b>Read</b>: swtch.S and proc.c (focus on the code that switches
between processes, specifically <code>scheduler</code> and <code>sched</code>).
<p>
<b>Hand-In Procedure</b>
<p>
You are to turn in this homework during lecture. Please
write up your answers to the exercises below and hand them in to a
6.828 staff member at the beginning of lecture.
<p>
<b>Introduction</b>
<p>
In this homework you will investigate how the kernel switches between
two processes.
<p>
<b>Assignment</b>:
<p>
Suppose a process that is running in the kernel
calls <code>sched()</code>, which ends up jumping
into <code>scheduler()</code>.
<p>
<b>Turn in</b>:
Where is the stack that <code>sched()</code> executes on?
<p>
<b>Turn in</b>:
Where is the stack that <code>scheduler()</code> executes on?
<p>
<b>Turn in:</b>
When <code>sched()</code> calls <code>swtch()</code>,
does that call to <code>swtch()</code> ever return? If so, when?
<p>
<b>Turn in:</b>
Why does <code>swtch()</code> copy %eip from the stack into the
context structure, only to copy it from the context
structure to the same place on the stack
when the process is re-activated?
What would go wrong if <code>swtch()</code> just left the
%eip on the stack and didn't store it in the context structure?
<p>
Surround the call to <code>swtch()</code> in <code>scheduler()</code> with calls
to <code>cons_putc()</code> like this:
<pre>
cons_putc('a');
swtch(&amp;cpus[cpu()].context, &amp;p-&gt;context);
cons_putc('b');
</pre>
<p>
Similarly,
surround the call to <code>swtch()</code> in <code>sched()</code> with calls
to <code>cons_putc()</code> like this:
<pre>
cons_putc('c');
swtch(&amp;cp-&gt;context, &amp;cpus[cpu()].context);
cons_putc('d');
</pre>
<p>
Rebuild your kernel and boot it in Bochs.
With a few exceptions,
you should see a regular four-character pattern repeated over and over.
<p>
<b>Turn in</b>: What is the four-character pattern?
<p>
<b>Turn in</b>: The very first characters are <code>ac</code>. Why does
this happen?
<p>
<b>Turn in</b>: Near the start of the last line you should see
<code>bc</code>. How could this happen?
<p>
<b>This completes the homework.</b>
</body>
</html>
<title>Homework: sleep and wakeup</title>
<html>
<head>
</head>
<body>
<h1>Homework: sleep and wakeup</h1>
<p>
<b>Read</b>: pipe.c
<p>
<b>Hand-In Procedure</b>
<p>
You are to turn in this homework at the beginning of lecture. Please
write up your answers to the questions below and hand them in to a
6.828 staff member at the beginning of lecture.
<p>
<b>Introduction</b>
<p>
Recall that in lecture 7 we discussed locking for a linked list
implementation. The insert code was:
<pre>
struct list *l;
l = list_alloc();
l->next = list_head;
list_head = l;
</pre>
and if we run the insert on multiple processors simultaneously with no locking,
this ordering of instructions can cause one of the inserts to be lost:
<pre>
CPU1                        CPU2

struct list *l;
l = list_alloc();
l->next = list_head;
                            struct list *l;
                            l = list_alloc();
                            l->next = list_head;
                            list_head = l;
list_head = l;
</pre>
(Even though the instructions can happen simultaneously, we
write out orderings where only one CPU is "executing" at a time,
to avoid complicating things more than necessary.)
<p>
In this case, the list element allocated by CPU2 is lost from
the list by CPU1's update of list_head.
Adding a lock that protects the final two instructions makes
the read and write of list_head atomic, so that this
ordering is impossible.
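<p>The fixed insert can be sketched like this (the toy flag lock below
stands in for xv6's acquire and release; a real spinlock needs an atomic
exchange, which a plain flag does not provide):
<pre>
struct list { struct list *next; };

static struct list *list_head;
static int list_lock;    // 0 = free; toy stand-in for a spinlock

void acquire_(void) { while (list_lock) ; list_lock = 1; }
void release_(void) { list_lock = 0; }

// With the read and the write of list_head inside one critical
// section, no other CPU can slip its own insert in between.
void insert(struct list *l)
{
    acquire_();
    l->next = list_head;
    list_head = l;
    release_();
}

int main(void)
{
    static struct list a, b;
    insert(&a);
    insert(&b);
    return !(list_head == &b && b.next == &a && a.next == 0);
}
</pre>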
<p>
The reading for this lecture is the implementation of sleep and wakeup,
which are used for coordination between different processes executing
in the kernel, perhaps simultaneously.
<p>
If there were no locking at all in sleep and wakeup, it would be
possible for a sleep and its corresponding wakeup, executing
simultaneously on different processors, to miss each other:
the wakeup would find no process to wake up, yet the process calling
sleep would go to sleep anyway, never to be woken. Obviously this is
something we'd like to avoid.
<p>
Read the code with this in mind.
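<p>One bad interleaving can be re-enacted deterministically in a few
lines (the names are assumptions; this single-threaded sketch only
models the ordering, not real concurrency):
<pre>
static int sleeping;   // is the process asleep on the channel?
static int woken;

void wakeup_(void) { if (sleeping) { sleeping = 0; woken = 1; } }
void sleep_(void)  { sleeping = 1; }

int main(void)
{
    // The wakeup runs after the sleeper has checked its condition
    // but before it has actually gone to sleep:
    wakeup_();   // finds no sleeper -- the wakeup is lost
    sleep_();    // goes to sleep anyway...
    return !(sleeping == 1 && woken == 0);  // ...never to be woken
}
</pre>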
<p>
<br><br>
<b>Questions</b>
<p>
(Answer and hand in.)
<p>
1. How does the proc_table_lock help avoid this problem? Give an
ordering of instructions (like the above example for linked list
insertion)
that could result in a wakeup being missed if the proc_table_lock were not used.
You need only include the relevant lines of code.
<p>
2. sleep is also protected by a second lock, its second argument,
which need not be the proc_table_lock. Look at the example in ide.c,
which uses the ide_lock. Give an ordering of instructions that could
result in a wakeup being missed if the ide_lock were not used.
(Hint: this should not be the same as your answer to question 1; the
two locks serve different purposes.)<p>
<br><br>
<b>This completes the homework.</b>
</body>
</html>