| abstract |
in this paper, i try to give an overview
of the problems encountered while developing a disk driver for a
multithreaded and multiprogrammed environment. It focuses on the programmed-IO approach and also deals with concurrent access and racing conditions avoidance for the driver's resources |
| intended audience |
i'll assume that you have a good
experience in multi-threaded kernel programming, and that semaphores,
page tables, etc. are your daily breakfast :) |
There are a few things your multitasking kernel will have to handle before you can actually start writing a good multitasked disk driver.
- be able to suspend and resume threads. This will be mandatory so that the system can continue doing useful job while a requestor thread is waiting for I/O operation to complete.
- be able to perform dynamic memory allocation. As we'll need request queues, buffer pages list, etc. it is mandatory that the kernel programming interface provide a efficient way to allocate and free small structures on demand.
- be able to allocate memory that suits I/O needs. As a first step, we'll just use programmed I/O modes for ATA disks, which have no special needs in what memory buffers are used, but floppies (which use ISA DMA) and ultra-dma mode (which uses PCI-DMA) behave differently and will require for instance the guarantee that memory buffers are in the lowest 16MB of physical memory and do not cross a 64KB barrier.
- be able to lock I/O regions for a given driver. For instance, you must ensure that only one software component may attemp to perform I/O operations on the disk controller, to ensure operations coherency. If you're in a monolithic kernel that doesn't support any kind of dynamic loading of kernel-level code, it's just a compile-time problem. If you have microkernel or modular kernel approach, you'll probably need a centralized IO map that tells which IO region is free and who use locked regions. io/ioman/ioman.c does something like this.
- be able to access buffer memory whatever the current address space is. As the IO operation is likely to complete while the CPU is running another process than the requestor process, something will have to be done to modify the requestor process's address space from another process (at kernel level, of course). We'll see later how this can be done.
Rough overview
A hard disk can be seen as a large array of sectors, every sector having the very same byte size (usually 512 bytes for PC-compatible systems). Sectors are identified either geographically (with head-positionning, disk orientation and head selection informations -- this is known as Cylinder-Head-Sector addressing) or logically (the disk takes care of translating O..N-1 sector address into geographic information).
Requests can be done to read up to M sectors in memory or to write memory data in M contiguous sectors. The notion of "contiguous" depends on the disk, but usually, it is designed so that latency to access the next sector is minimized, which is:
To achieve good performance, it is critical that related data are kept on close cylinders -- or better, on the same cylinder. The more the heads move, the lower your throughput is. FAT system that have fragmented files and concentrate every disk management structures at the head of the disk are one of the worst possible organisation, from a performance point of view...
- the next sector is on the same "track", i.e. it just uses the same head position and selection - but the disk must rotate so that it can be accessed. As the disks are usually continuously rotating, this make the next sector readable almost immediately after the current one.
- the next sector is on the next "track", but in the same "cylinder", i.e. another head must be selected, but the heads do not need to move. Because the hardware usually uses only one head at a time, it may need some extra time for the head to synchronize bit starts and ends (see docs on bits encoding for more explanation), but it can roughly be assumed that the access time is close to case 1.
- the next sector is on the next cylinder, i.e. the heads have to move to another physical position to perform the read/write operation. Heads displacement are usually the slowest part of disk access, because it involves mechanical moves, while almost everything was electronic only, so far.
Programming a disk access - the PIO approach
In Programmed I/O mode, the following things need to be done:
Once the command has been issued, the disk will move heads to the given address (if needed) and starts reading the data in its internal buffer. As soon as one sector has been completely read, an interrupt is sent to the processor, which will:
- make sure the controller is ready to receive a new request.
- select disk on which we operate (if there are several disks for the same controller -- which is the case for ATA)
- write the sector address and the amount of sectors to be transferred in the controller's registers.
- issue the command (read or write, with the proper mode selection)
For ATA disks in legacy PIO mode, having read the last data word means that you're ready to receive the next interrupt (as soon as the disk is ready too). What is important to note is that the disk's internal buffer allow us to decouple CPU and disk logic -- i.e. if for some reason, reading the data from the controller is slower than reading the data from the disk into the controller, we just don't have to worry.
- check the status of the controller, making sure the request has been successful (for instance, if we ask for reading a sector that is not a valid address for this disk, an error code will be present).
- read datas out of the controller's buffer and writing them in their final location.
Writing requests are handled a very similar manner, except that you must fill the controller buffer with the data for one sector before the interrupt is raised, but after making sure the command has been accepted by the disk (i.e. the controller status returns to "not BUSY").
Basically, we have two options when handling buffers for disk I/O: either the buffer memory belongs to the driver or it belongs to the requestor. The former approach requires a copy of the data from the buffer to the final target place, while the latter one allow us to use the final target (or source) place immediately.
If the device allow us to do it (i.e. if this is not a ISA-DMA device), we will chose the latter approach for it gives better performances.
However, this implies that the interrupt (which may appear at any time, thus in any process) will have to access the requestor address space from the interrupt-handler space. Dealing with this kind of situation means that we'll need a memory-manager service that allow its client to say "please map the following page frames on some virtual addresses of your choice", while the usual approach is "please fill the given virtual addresses with page frames of your choice".
The problem is solved by a two-step model: the buffer provided by the application is first lockedand a page list is build from it, then the page list (or more likely parts of it) is mapped in the handler's address space, as depicted on the following picture.
so far, i have nothing to offer for people that want to lock the same pages several times... any attempt to do so in Clicker will result in a "locked pages" error returned by the disk driver.
The root component is the ataController, which is made of a list of low-level function abstracting the real ATA programming work, and an interrupt handler. The actual interrupt handler is made as generic as possible, delegating the interrupt processing to the current ataRequest. The current.handle() function's result tells the controller whether the request is still going on or if it has completed, in which case, the next pending request is issue()'d.
By this choice of leaving the ataRequest responsible of programming the controller (optionnally using the helper functions provided by the controller), we will allow ourself to add support of new access mode in a very modular way: all we'll have to do is replacing the component that submit()s requests at the queue scheduler, which can be done while the system is running.
Note that despites there's a single ioDiskList, there may be several physical disks and thus different scheduling policies for those disks (i.e. up to one queueScheduler per disk). The role of the queue scheduler is to set the requests order for a single disk - for instance to improve throughput, etc. A queue scheduler that is willing to operate on the ioDiskList must steal() the list first, leaving an empty list to the controller, and restore() it once its ordering job is complete. While it may sound weirds, this allows less interrupt blocking. If the controller finds no pending request, it will just stand by, and the scheduler will issue the head request when restoring the list.
On top of this design, you can add the legacy "block device" interface (with read/write calls) which simply creates the appropriate ataRequest implementation and submit them.
Low-level functions: the ataCore component.
I read once in a book a theory about bottom-up programming approach: "if your machine is not smart enough for solving your problem, define a virtual machine atop of it that will better fit your problem." This was exactly the kind of approach Tim Robinson used in his ATA module, so i reused it (enjoy GPL :)
Those basic commands could be done by just looking at the docs and do what is said, but having them for the start was really helpful.
kStatus ataSetup(ioAtaController *ctrl,
union ataAddr addr, byte cmd,
byte count)
{
enter(DBIO,"ataCtrl::setup");
if (ctrl->needReset) ataReset(ctrl);
if (!ataWaitFor(ctrl, STATUS_BSY,0))
errlog(KLOG_ERROR,KERR_HARDWARE,
"controller not ready");
sign("sending head, drive & lba bit");
outb(addr.chs.ldh,ctrl->base+REG_LDH);
if (!ataWaitFor(ctrl, STATUS_BSY,0))
errlog(KLOG_ERROR,KERR_HARDWARE,
"drive not ready");
if (count) {
kSysLock(&ctrl->lock);
outb(CTL_EIGHTHEADS, ctrl->base+REG_CTL); // ??
outb(0, ctrl->base + REG_PRECOMP);
outb(count, ctrl->base + REG_COUNT);
outb(addr.chs.sector, ctrl->base + REG_SECTOR);
outb(addr.chs.cyl_lo, ctrl->base + REG_CYL_LO);
outb(addr.chs.cyl_hi, ctrl->base + REG_CYL_HI);
sign("sending command byte");
outb(cmd, ctrl->base + REG_COMMAND);
kSysUnlock(&ctrl->lock);
} else {
outb(cmd, ctrl->base + REG_COMMAND);
}
return intleave(SUCCESS);
}
ataSetup is intended to issue a single command to the controller. both the command, address and sector count are written in the proper I/O registers.
Nothing magic here but interrupts locking around the I/O programming block.
As a generic I/O programming rule, still make sure you'll check status byte everytime needed and report the error if any.
void ataReset(ioAtaController *ctrl)
{
enter(DBIO,"ataCtrl::reset");
outb(CTL_RESET|CTL_INTDISABLE,ctrl->base+REG_CTL);
/** here should come a small delay, 400ns according to Mobius */
_nanosleep(400);
outb(0,ctrl->base+REG_CTL);
leave();
}
Sends the reset command according to the ATA specs. in Mobius, the 400ns delay is ignored (in the sources i have), Clicker handles it with a loop assuming a variable tells how much CPU cycle there are per microsecond. kStatus ataDoWrite(ioAtaController *ctrl, ioPagesList* io, dword from)
{
word *buf;
unsigned int i;
enter(DBIO,"ata::doWrite");
if (!(buf = ioBufferMap(io,from, SECTOR_SIZE)))
errlog(KLOG_ERROR,0,"cannot map memory for diskwrite");
sign("buffer mapped @%x",buf);
for (i=0;i<SECTOR_SIZE/2;i++)
outw(buf[i],ctrl->base + REG_DATA);
ioBufferUnmap(io,buf,SECTOR_SIZE);
return intleave(SUCCESS);
}
sends the datas to the controller after a write has been programmed. you'll notice that we must send words and not bytes or doublewords.
I/O devices are sometimes more this kind of details than memory ...
kStatus ataDoRead(ioAtaController *ctrl, ioPagesList* io, dword from)
{
word *buf;
unsigned int i;
enter(DBIO,"ata::doRead");
if (!(buf=ioBufferMap(io, from, SECTOR_SIZE)))
errlog(KLOG_ERROR,0,"cannot map memory for diskread");
sign("buffer mapped @%x",buf);
for (i=0;i<SECTOR_SIZE/2;i++)
buf[i]=inw(ctrl->base + REG_DATA);
ioBufferUnmap(io,buf, SECTOR_SIZE);
return intleave(SUCCESS);
}
very same job here ... but for reading
kStatus ataWaitFor(ioAtaController *ctrl, byte mask, byte value)
{
timeID notifier;
byte status;
volatile byte *timeout=&ctrl->timeout;
enter(DBIO,"ata::waitfor");
*timeout=0;
if ((inb(ctrl->base+REG_STATUS)& mask)==value)
return intleave(SUCCESS);
sign("programming the time-out (store value @%x): ",timeout);
anykey();
notifier=timing_notify_setAfter(ATA_STATUS_WAITTIME,
&ctrl->timeout);
sign("%i ticks to go for %x",
ATA_STATUS_WAITTIME,notifier);
do {
status=inb(ctrl->base+REG_STATUS);
} while (((status & mask)!=value) && !(*timeout));
if (*timeout) {
char pstatus[9];
_printable_status(status, pstatus);
log(KLOG_WARNING,0,"ATA[%s]: timeout (guru meditation %s)",
ctrl->pname,pstatus);
ctrl->needReset++;
return intleave(FAIL);
} else {
timing_notify_cancel(notifier);
sign("wait_for completed succesfully. status=%i",status);
anykey();
return intleave(SUCCESS);
}
}
here's probably the most complex part of ataCore. basically, this is a busy loop waiting for the controller status to reach a specific value.
If for some reason, the controller never gets ready, we need an escape route, which is performed by the "timeout event".
Before entering the loop, we program the timer so that it will change the timeout byte when the given STATUS_WAITTIME will be over.
an alternative would have been to poll the "uptime" counter and compute the difference before entering the loop and the current uptime, but as Clicker has a timing service, let's use it ;)
as you have notice, all these low-level function operate on the ioAtaController structure, which is the reason why they appear as methods of the "ataController class" on the UML diagram.
Also take care that the above code assumes CHS-LBA translation has been performed before calling the ataSetup function. This is not a problem because Clicker will load the appropriate driver (LBA or not) according to what the disk supports.
Defining a "sector read" request
One of the job of the disk driver will be to submit appropriate requests to the queue scheduler. Each request has a "common" part that make it processable by queuers and controller, and a "private part" that stores variables needed for issueing and handling the request. This occurs exactly as if ataDataRequest was a sub-class that completes the abstract diskRequest class.
The implementation of issueRead and handleRead therefore looks like
As the scheduler must tell whether two request are close from each other or not, the "block identifier" (i.e. the LBA address) is stored in the 'common' part. Note that it is not mandatory that it actually stores the block id. It could only be a "cylinder" number, provided that all the submitters use the same approach.
So far, we have two kind of request: the notice request that will just send an event to the submitter when executed and the datarequest (which can be a read request or a write request depending on the issue & handle functions ... the data structure remains the same). The notice requets can for instance be used for sync calls and it will usually be associated with a 'magic' block identifier (-1) which acts as a barrier at the scheduler: any request submitted after the "-1" block cannot be issued before the "-1" request is completed.
If we take the "request" point of view, things look like this:
- issue() is called a random time after the request has been submitted (i.e. as soon as the controller is ready for it :) It will then program the controller using ataSetup() and return a success code telling the request is now being processed by the disk.
- when the disk has read the first requested sector, it will raise an interrupt, which is hooked by the ataController and leads to current.handle(). the request can then extract the data from the disk cache using ataDoRead() and update its "count" and "offset" fields. it then returns a "continue" code that tells the controller to wait and see.
- when the last sector of the request has been read out by ataDoRead() count has reached 0 and the request sends the completed event to its requestor thread. A "done" code is then returned to the controller which now knows that it can issue the next request (if any).
- if, at any time the request fails, it will send the "failed" event, and tag the controller as "needing a reset" before it tells it to issue the next request.
kStatus _issueRead(void *c, struct ioDiskRequest* r)
{
ioAtaController *ctrl=c;
ioAtaDataRequest *request=(ioAtaDataRequest*)r;
ataDevice *device=request->dev;
union ataAddr addr;
enter(DBIO,"idelba(kdrv)::issueRead");
addr.lba=request->header.block;
addr.chs.ldh&=0x0f;
addr.chs.ldh|=device->nfo->ldhDefault;
sign("sending SCCH=%ac\n",4,&addr);
if (!ataSetup(ctrl, addr, CMD_READ, request->count))
errsign(0,"setup failed");
log(KLOG_LOG, 0, "issued read command to %s",ctrl->pname);
return intleave(SUCCESS);
fail:
ctrl->needReset++;
eSend(request->header.requestor->pnfo,
request->failed, ctrl->status+request->count*256, request->header.block);
return SUCCESS;
}
diskHandleStatus _handleRead(void* c, ioDiskRequest *r, byte stat)
{
ioAtaController *ctrl=c;
ioAtaDataRequest *rq=(ioAtaDataRequest*)r;
enter(DBIO,"IDE::handle_read");
if (stat&STATUS_ERR) {
log(KLOG_LOG,KERR_HARDWARE,"[IDE]: read failed %x, %x",
rq->header.block,rq->count);
eSend(rq->header.requestor->pnfo, rq->failed, stat+rq->count*256,
rq->header.block);
return intleave(IO_DISK_ERROR);
} else {
ataDoRead(ctrl,rq->pages,rq->offset);
rq->offset+=SECTOR_SIZE;
rq->header.block++;
if(--rq->count) return intleave(IO_DISK_CONTINUE);
log(KLOG_DEBUG,0,"[IDE]: read command completed@%x", rq->header.block);
eSend(rq->header.requestor->pnfo, rq->completed, stat+rq->count*256,
rq->header.block);
return intleave(IO_DISK_DONE);
}
}
which should no longer require a lot of explainations. Just notice that we've had to use a strange union to translate LBA block identifiers in bytes for filling the ATA control registers.
Also note that we must patch the "drive and head" register with the drive identifier we got from ataDevice and the "LBA" bit (which should be set for this implementation).
i will not show the "disk::read" API function, as all it does is allocating the requets structure and filling fields, once the "page list" has been constructed by ioBufferLock().
Of course, as several thread could possibly be sending requests at the same time, we must ensure that the state of the controller and of the queue remains consistent by a proper mutual exclusion scheme. However, as we're also coding in interrupt handlers, we cannot afford things like making the thread sleep until the object get unlocked.
We must also make sure that we minimize the amount of time while the interrupts are locked. calling ataDoRead() or ataDoWrite() for instance may be fairly long and we don't wish to block high-priority events (like clock or keyboard interrupt).
Thus rather than keeping interrupts disabled during the whole handler, it is good practice to re-enable them as soon as possible (for instance, just after having read the status out of the controller). We will not, however, send the "clearance code" outb(0x20,0x20) to the interrupt controller, which means we're sure that no other interrupt will occur for this disk controller until we're done.
Available synchronization primitives
Two synchronization techniques have been used for handling mutual exclusion in this driver: semaphores and system locks. The semaphore implements the well-known abstraction: when the resource is available, ksemWait() gets it atomicly and if it is not, the current thread will sleep until it becomes available.
System locks (called "mutexes" internally) never put the thread in sleep state. On single-cpu system, they just lock interrupts (and thus task switching) for a short time period, while on multi-cpu, it is planned that they will additionnally do a busy-waiting loop on some memory value. This implies that system locks must be used with care and only for very short amount of time ...
This being said, let's see what kind of conflict we want to avoid ...
Current request modification
The controller has a "current request" field that no other code but the interrupt handler may change. This guarantee us that we can always handle an interrupt regardless of what the schedulers are doing once the "current request" has been issued.
The only operation that modifies the "current request" consist of picking the head of the pending list and setting it as the new "current" request. A system lock is used to make sure noone else will try to access the pending list during this time.
Modifying the request queue
Unlike picking the list head, modifying the request queue may be a long operation, especially when a complex policy is in place (i.e. we have to sweep the list to find where to add the new request to minimize some metric). It wouldn't be nice to lock interrupts during that time interval, and we can't use a semaphore either, because the interrupt handler that will try to read the queue may not sleep.
The technique used here consist of stealing the list to the controller: for the duration of the operation, the queue will be replaced by an empty list (the replacement's atomicity is guaranteed by the same system lock as the one used for reading). Once the insertion has been performed by the queue scheduler, the whole list is restored to the controller.
If the scheduler detects that the controller entered the "idle" state during the operation (that is, the current request is now null), it will also promote the head of the list as the new current request and issue it. There's no risk that this operation cause any kind of trouble as there's no longer pending interrupts at this point (no operation is being held).
Concurrent updates of the request queue
So far, we only made sure that the interrupt handlerof the ataController will not attemp to read the request queue while being modified, but we also need an additionnal mechanism to make sure there will not be several attempt to update the queue by different queueSchedulers. This is achieved by adding a semaphore that will control mutual access to the request list as a writer.
In the absence of this semaphore, the following nightmare could occur:
operator component
operation
ending state
QS1
steal the controller's queue for an update
current!=NULL
pending==NULL
QS2
try to add a new request. No need to steal an empty queue.
current!=NULL
pending=={rq}
QS1
restores the modified queue. {rq} is lost.
so we clearly see we need that semaphore to prevent the queue from being "stolen" twice. As the queue modification between a steal and a restore may be long and as the queue scheduler do not operate at interrupt-handling priority, we can safely use a semaphore for these two (or more) components.
Note that it's a common technique to use two distinct synchronizator objects for reading and writing access to a resource. What is a bit special here is the asymetry between the two synchronizators used.
typedef struct ioDiskList {
ioDiskRequest *head;
ioDiskRequest *tail;
kMutex lock;
kSemaphore *write;
ioDiskRequest *request;
} ioDiskList;
typedef struct ioDiskRequest {
struct ioDiskRequest *next;
struct ioDiskRequest *prev;
...
} ioDiskRequest;
typedef struct ioAtaController {
ioDiskList pending;
...
};
a list of IO requests, as the disk queue scheduler can use it. the "lock" mutex is used to prevent the controller to access the list while a scheduler is actually accessing it.
In addition, we have a semaphore that provides mutual exclusion between schedulers in the case of a shared controller.
list.request is the current processed request.
ioDiskRequest *ioDiskListGet(ioDiskList *list)
{
kSysLock(&list->lock);
if (( list->request=list->head )) {
if (list->head==list->tail) list->tail=NULL;
list->head=list->head->next;
}
kSysUnlock(&list->lock);
return list->request;
}get the first request of the list. We just use the "list lock" mutex.
ioDiskRequest *ioDiskListSteal(ioDiskList *list, ioDiskRequest **tail)
{
ioDiskRequest *res=NULL;
ksemWait(list->write);
kSysLock(&list->lock);
res=list->head;
if (tail) *tail=list->tail;
list->head=NULL;
list->tail=NULL;
kSysUnlock(&list->lock);
return res;
}
leaves an empty list on the controller while updating it. You need to "restore" the list later because any attemp to steal it again will be blocked. Note that the "sysLock" occurs *after* the semWait, so that we're not waiting for a semaphore in a sysLock region.
ioDiskRequest *ioDiskListRestore(ioDiskList *list, ioDiskRequest *head,
ioDiskRequest *tail)
{
void* current;
kSysLock(&list->lock);
list->head=head;
list->tail=tail;
current=list->request;
kSysUnlock(&list->lock);
if (!current) {
while (ioDiskListGet(list)) {
if (list->request->issue && list->request->issue(list, list->request))
break;
}
}
ksemSignal(list->write);
}
restore the modified list. in the case of an idle controller at the time of the restore, the newly installed list is read and the new "current" request is issued.
Note that because list->request is modified within the "syslock" region of Get(), we don't fear concurrent issue of several requests. Moreover, the controller is expected to be idle and therefore it will *not* interfere.