Creation of Cybook 2416 (actually Gen4) repository

This commit is contained in:
mlt
2009-12-18 17:10:00 +00:00
committed by godzil
commit 76f20f4d40
13791 changed files with 6812321 additions and 0 deletions

View File

@@ -0,0 +1,167 @@
Anticipatory IO scheduler
-------------------------
Nick Piggin <piggin@cyberone.com.au> 13 Sep 2003
Attention! Database servers, especially those using "TCQ" disks should
investigate performance with the 'deadline' IO scheduler. Any system with high
disk performance requirements should do so, in fact.
If you see unusual performance characteristics of your disk systems, or you
see big performance regressions versus the deadline scheduler, please email
me. Database users don't bother unless you're willing to test a lot of patches
from me ;) its a known issue.
Also, users with hardware RAID controllers, doing striping, may find
highly variable performance results with using the as-iosched. The
as-iosched anticipatory implementation is based on the notion that a disk
device has only one physical seeking head. A striped RAID controller
actually has a head for each physical device in the logical RAID device.
However, setting the antic_expire (see tunable parameters below) produces
very similar behavior to the deadline IO scheduler.
Selecting IO schedulers
-----------------------
To choose IO schedulers at boot time, use the argument 'elevator=deadline'.
'noop', 'as' and 'cfq' (the default) are also available. IO schedulers are
assigned globally at boot time only presently. It's also possible to change
the IO scheduler for a determined device on the fly, as described in
Documentation/block/switching-sched.txt.
Anticipatory IO scheduler Policies
----------------------------------
The as-iosched implementation implements several layers of policies
to determine when an IO request is dispatched to the disk controller.
Here are the policies outlined, in order of application.
1. one-way Elevator algorithm.
The elevator algorithm is similar to that used in deadline scheduler, with
the addition that it allows limited backward movement of the elevator
(i.e. seeks backwards). A seek backwards can occur when choosing between
two IO requests where one is behind the elevator's current position, and
the other is in front of the elevator's position. If the seek distance to
the request in back of the elevator is less than half the seek distance to
the request in front of the elevator, then the request in back can be chosen.
Backward seeks are also limited to a maximum of MAXBACK (1024*1024) sectors.
This favors forward movement of the elevator, while allowing opportunistic
"short" backward seeks.
2. FIFO expiration times for reads and for writes.
This is again very similar to the deadline IO scheduler. The expiration
times for requests on these lists is tunable using the parameters read_expire
and write_expire discussed below. When a read or a write expires in this way,
the IO scheduler will interrupt its current elevator sweep or read anticipation
to service the expired request.
3. Read and write request batching
A batch is a collection of read requests or a collection of write
requests. The as scheduler alternates dispatching read and write batches
to the driver. In the case a read batch, the scheduler submits read
requests to the driver as long as there are read requests to submit, and
the read batch time limit has not been exceeded (read_batch_expire).
The read batch time limit begins counting down only when there are
competing write requests pending.
In the case of a write batch, the scheduler submits write requests to
the driver as long as there are write requests available, and the
write batch time limit has not been exceeded (write_batch_expire).
However, the length of write batches will be gradually shortened
when read batches frequently exceed their time limit.
When changing between batch types, the scheduler waits for all requests
from the previous batch to complete before scheduling requests for the
next batch.
The read and write fifo expiration times described in policy 2 above
are checked only when in scheduling IO of a batch for the corresponding
(read/write) type. So for example, the read FIFO timeout values are
tested only during read batches. Likewise, the write FIFO timeout
values are tested only during write batches. For this reason,
it is generally not recommended for the read batch time
to be longer than the write expiration time, nor for the write batch
time to exceed the read expiration time (see tunable parameters below).
When the IO scheduler changes from a read to a write batch,
it begins the elevator from the request that is on the head of the
write expiration FIFO. Likewise, when changing from a write batch to
a read batch, scheduler begins the elevator from the first entry
on the read expiration FIFO.
4. Read anticipation.
Read anticipation occurs only when scheduling a read batch.
This implementation of read anticipation allows only one read request
to be dispatched to the disk controller at a time. In
contrast, many write requests may be dispatched to the disk controller
at a time during a write batch. It is this characteristic that can make
the anticipatory scheduler perform anomalously with controllers supporting
TCQ, or with hardware striped RAID devices. Setting the antic_expire
queue parameter (see below) to zero disables this behavior, and the
anticipatory scheduler behaves essentially like the deadline scheduler.
When read anticipation is enabled (antic_expire is not zero), reads
are dispatched to the disk controller one at a time.
At the end of each read request, the IO scheduler examines its next
candidate read request from its sorted read list. If that next request
is from the same process as the request that just completed,
or if the next request in the queue is "very close" to the
just completed request, it is dispatched immediately. Otherwise,
statistics (average think time, average seek distance) on the process
that submitted the just completed request are examined. If it seems
likely that that process will submit another request soon, and that
request is likely to be near the just completed request, then the IO
scheduler will stop dispatching more read requests for up time (antic_expire)
milliseconds, hoping that process will submit a new request near the one
that just completed. If such a request is made, then it is dispatched
immediately. If the antic_expire wait time expires, then the IO scheduler
will dispatch the next read request from the sorted read queue.
To decide whether an anticipatory wait is worthwhile, the scheduler
maintains statistics for each process that can be used to compute
mean "think time" (the time between read requests), and mean seek
distance for that process. One observation is that these statistics
are associated with each process, but those statistics are not associated
with a specific IO device. So for example, if a process is doing IO
on several file systems on separate devices, the statistics will be
a combination of IO behavior from all those devices.
Tuning the anticipatory IO scheduler
------------------------------------
When using 'as', the anticipatory IO scheduler there are 5 parameters under
/sys/block/*/queue/iosched/. All are units of milliseconds.
The parameters are:
* read_expire
Controls how long until a read request becomes "expired". It also controls the
interval between which expired requests are served, so set to 50, a request
might take anywhere < 100ms to be serviced _if_ it is the next on the
expired list. Obviously request expiration strategies won't make the disk
go faster. The result basically equates to the timeslice a single reader
gets in the presence of other IO. 100*((seek time / read_expire) + 1) is
very roughly the % streaming read efficiency your disk should get with
multiple readers.
* read_batch_expire
Controls how much time a batch of reads is given before pending writes are
served. A higher value is more efficient. This might be set below read_expire
if writes are to be given higher priority than reads, but reads are to be
as efficient as possible when there are no writes. Generally though, it
should be some multiple of read_expire.
* write_expire, and
* write_batch_expire are equivalent to the above, for writes.
* antic_expire
Controls the maximum amount of time we can anticipate a good read (one
with a short seek distance from the most recently completed request) before
giving up. Many other factors may cause anticipation to be stopped early,
or some processes will not be "anticipated" at all. Should be a bit higher
for big seek time devices though not a linear correspondence - most
processes have only a few ms thinktime.

View File

@@ -0,0 +1,271 @@
I/O Barriers
============
Tejun Heo <htejun@gmail.com>, July 22 2005
I/O barrier requests are used to guarantee ordering around the barrier
requests. Unless you're crazy enough to use disk drives for
implementing synchronization constructs (wow, sounds interesting...),
the ordering is meaningful only for write requests for things like
journal checkpoints. All requests queued before a barrier request
must be finished (made it to the physical medium) before the barrier
request is started, and all requests queued after the barrier request
must be started only after the barrier request is finished (again,
made it to the physical medium).
In other words, I/O barrier requests have the following two properties.
1. Request ordering
Requests cannot pass the barrier request. Preceding requests are
processed before the barrier and following requests after.
Depending on what features a drive supports, this can be done in one
of the following three ways.
i. For devices which have queue depth greater than 1 (TCQ devices) and
support ordered tags, block layer can just issue the barrier as an
ordered request and the lower level driver, controller and drive
itself are responsible for making sure that the ordering constraint is
met. Most modern SCSI controllers/drives should support this.
NOTE: SCSI ordered tag isn't currently used due to limitation in the
SCSI midlayer, see the following random notes section.
ii. For devices which have queue depth greater than 1 but don't
support ordered tags, block layer ensures that the requests preceding
a barrier request finishes before issuing the barrier request. Also,
it defers requests following the barrier until the barrier request is
finished. Older SCSI controllers/drives and SATA drives fall in this
category.
iii. Devices which have queue depth of 1. This is a degenerate case
of ii. Just keeping issue order suffices. Ancient SCSI
controllers/drives and IDE drives are in this category.
2. Forced flushing to physical medium
Again, if you're not gonna do synchronization with disk drives (dang,
it sounds even more appealing now!), the reason you use I/O barriers
is mainly to protect filesystem integrity when power failure or some
other events abruptly stop the drive from operating and possibly make
the drive lose data in its cache. So, I/O barriers need to guarantee
that requests actually get written to non-volatile medium in order.
There are four cases,
i. No write-back cache. Keeping requests ordered is enough.
ii. Write-back cache but no flush operation. There's no way to
guarantee physical-medium commit order. This kind of devices can't to
I/O barriers.
iii. Write-back cache and flush operation but no FUA (forced unit
access). We need two cache flushes - before and after the barrier
request.
iv. Write-back cache, flush operation and FUA. We still need one
flush to make sure requests preceding a barrier are written to medium,
but post-barrier flush can be avoided by using FUA write on the
barrier itself.
How to support barrier requests in drivers
------------------------------------------
All barrier handling is done inside block layer proper. All low level
drivers have to are implementing its prepare_flush_fn and using one
the following two functions to indicate what barrier type it supports
and how to prepare flush requests. Note that the term 'ordered' is
used to indicate the whole sequence of performing barrier requests
including draining and flushing.
typedef void (prepare_flush_fn)(request_queue_t *q, struct request *rq);
int blk_queue_ordered(request_queue_t *q, unsigned ordered,
prepare_flush_fn *prepare_flush_fn,
unsigned gfp_mask);
int blk_queue_ordered_locked(request_queue_t *q, unsigned ordered,
prepare_flush_fn *prepare_flush_fn,
unsigned gfp_mask);
The only difference between the two functions is whether or not the
caller is holding q->queue_lock on entry. The latter expects the
caller is holding the lock.
@q : the queue in question
@ordered : the ordered mode the driver/device supports
@prepare_flush_fn : this function should prepare @rq such that it
flushes cache to physical medium when executed
@gfp_mask : gfp_mask used when allocating data structures
for ordered processing
For example, SCSI disk driver's prepare_flush_fn looks like the
following.
static void sd_prepare_flush(request_queue_t *q, struct request *rq)
{
memset(rq->cmd, 0, sizeof(rq->cmd));
rq->flags |= REQ_BLOCK_PC;
rq->timeout = SD_TIMEOUT;
rq->cmd[0] = SYNCHRONIZE_CACHE;
}
The following seven ordered modes are supported. The following table
shows which mode should be used depending on what features a
device/driver supports. In the leftmost column of table,
QUEUE_ORDERED_ prefix is omitted from the mode names to save space.
The table is followed by description of each mode. Note that in the
descriptions of QUEUE_ORDERED_DRAIN*, '=>' is used whereas '->' is
used for QUEUE_ORDERED_TAG* descriptions. '=>' indicates that the
preceding step must be complete before proceeding to the next step.
'->' indicates that the next step can start as soon as the previous
step is issued.
write-back cache ordered tag flush FUA
-----------------------------------------------------------------------
NONE yes/no N/A no N/A
DRAIN no no N/A N/A
DRAIN_FLUSH yes no yes no
DRAIN_FUA yes no yes yes
TAG no yes N/A N/A
TAG_FLUSH yes yes yes no
TAG_FUA yes yes yes yes
QUEUE_ORDERED_NONE
I/O barriers are not needed and/or supported.
Sequence: N/A
QUEUE_ORDERED_DRAIN
Requests are ordered by draining the request queue and cache
flushing isn't needed.
Sequence: drain => barrier
QUEUE_ORDERED_DRAIN_FLUSH
Requests are ordered by draining the request queue and both
pre-barrier and post-barrier cache flushings are needed.
Sequence: drain => preflush => barrier => postflush
QUEUE_ORDERED_DRAIN_FUA
Requests are ordered by draining the request queue and
pre-barrier cache flushing is needed. By using FUA on barrier
request, post-barrier flushing can be skipped.
Sequence: drain => preflush => barrier
QUEUE_ORDERED_TAG
Requests are ordered by ordered tag and cache flushing isn't
needed.
Sequence: barrier
QUEUE_ORDERED_TAG_FLUSH
Requests are ordered by ordered tag and both pre-barrier and
post-barrier cache flushings are needed.
Sequence: preflush -> barrier -> postflush
QUEUE_ORDERED_TAG_FUA
Requests are ordered by ordered tag and pre-barrier cache
flushing is needed. By using FUA on barrier request,
post-barrier flushing can be skipped.
Sequence: preflush -> barrier
Random notes/caveats
--------------------
* SCSI layer currently can't use TAG ordering even if the drive,
controller and driver support it. The problem is that SCSI midlayer
request dispatch function is not atomic. It releases queue lock and
switch to SCSI host lock during issue and it's possible and likely to
happen in time that requests change their relative positions. Once
this problem is solved, TAG ordering can be enabled.
* Currently, no matter which ordered mode is used, there can be only
one barrier request in progress. All I/O barriers are held off by
block layer until the previous I/O barrier is complete. This doesn't
make any difference for DRAIN ordered devices, but, for TAG ordered
devices with very high command latency, passing multiple I/O barriers
to low level *might* be helpful if they are very frequent. Well, this
certainly is a non-issue. I'm writing this just to make clear that no
two I/O barrier is ever passed to low-level driver.
* Completion order. Requests in ordered sequence are issued in order
but not required to finish in order. Barrier implementation can
handle out-of-order completion of ordered sequence. IOW, the requests
MUST be processed in order but the hardware/software completion paths
are allowed to reorder completion notifications - eg. current SCSI
midlayer doesn't preserve completion order during error handling.
* Requeueing order. Low-level drivers are free to requeue any request
after they removed it from the request queue with
blkdev_dequeue_request(). As barrier sequence should be kept in order
when requeued, generic elevator code takes care of putting requests in
order around barrier. See blk_ordered_req_seq() and
ELEVATOR_INSERT_REQUEUE handling in __elv_add_request() for details.
Note that block drivers must not requeue preceding requests while
completing latter requests in an ordered sequence. Currently, no
error checking is done against this.
* Error handling. Currently, block layer will report error to upper
layer if any of requests in an ordered sequence fails. Unfortunately,
this doesn't seem to be enough. Look at the following request flow.
QUEUE_ORDERED_TAG_FLUSH is in use.
[0] [1] [2] [3] [pre] [barrier] [post] < [4] [5] [6] ... >
still in elevator
Let's say request [2], [3] are write requests to update file system
metadata (journal or whatever) and [barrier] is used to mark that
those updates are valid. Consider the following sequence.
i. Requests [0] ~ [post] leaves the request queue and enters
low-level driver.
ii. After a while, unfortunately, something goes wrong and the
drive fails [2]. Note that any of [0], [1] and [3] could have
completed by this time, but [pre] couldn't have been finished
as the drive must process it in order and it failed before
processing that command.
iii. Error handling kicks in and determines that the error is
unrecoverable and fails [2], and resumes operation.
iv. [pre] [barrier] [post] gets processed.
v. *BOOM* power fails
The problem here is that the barrier request is *supposed* to indicate
that filesystem update requests [2] and [3] made it safely to the
physical medium and, if the machine crashes after the barrier is
written, filesystem recovery code can depend on that. Sadly, that
isn't true in this case anymore. IOW, the success of a I/O barrier
should also be dependent on success of some of the preceding requests,
where only upper layer (filesystem) knows what 'some' is.
This can be solved by implementing a way to tell the block layer which
requests affect the success of the following barrier request and
making lower lever drivers to resume operation on error only after
block layer tells it to do so.
As the probability of this happening is very low and the drive should
be faulty, implementing the fix is probably an overkill. But, still,
it's there.
* In previous drafts of barrier implementation, there was fallback
mechanism such that, if FUA or ordered TAG fails, less fancy ordered
mode can be selected and the failed barrier request is retried
automatically. The rationale for this feature was that as FUA is
pretty new in ATA world and ordered tag was never used widely, there
could be devices which report to support those features but choke when
actually given such requests.
This was removed for two reasons 1. it's an overkill 2. it's
impossible to implement properly when TAG ordering is used as low
level drivers resume after an error automatically. If it's ever
needed adding it back and modifying low level drivers accordingly
shouldn't be difficult.

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,78 @@
Deadline IO scheduler tunables
==============================
This little file attempts to document how the deadline io scheduler works.
In particular, it will clarify the meaning of the exposed tunables that may be
of interest to power users.
Each io queue has a set of io scheduler tunables associated with it. These
tunables control how the io scheduler works. You can find these entries
in:
/sys/block/<device>/queue/iosched
assuming that you have sysfs mounted on /sys. If you don't have sysfs mounted,
you can do so by typing:
# mount none /sys -t sysfs
********************************************************************************
read_expire (in ms)
-----------
The goal of the deadline io scheduler is to attempt to guarantee a start
service time for a request. As we focus mainly on read latencies, this is
tunable. When a read request first enters the io scheduler, it is assigned
a deadline that is the current time + the read_expire value in units of
milliseconds.
write_expire (in ms)
-----------
Similar to read_expire mentioned above, but for writes.
fifo_batch
----------
When a read request expires its deadline, we must move some requests from
the sorted io scheduler list to the block device dispatch queue. fifo_batch
controls how many requests we move, based on the cost of each request. A
request is either qualified as a seek or a stream. The io scheduler knows
the last request that was serviced by the drive (or will be serviced right
before this one). See seek_cost and stream_unit.
write_starved (number of dispatches)
-------------
When we have to move requests from the io scheduler queue to the block
device dispatch queue, we always give a preference to reads. However, we
don't want to starve writes indefinitely either. So writes_starved controls
how many times we give preference to reads over writes. When that has been
done writes_starved number of times, we dispatch some writes based on the
same criteria as reads.
front_merges (bool)
------------
Sometimes it happens that a request enters the io scheduler that is contigious
with a request that is already on the queue. Either it fits in the back of that
request, or it fits at the front. That is called either a back merge candidate
or a front merge candidate. Due to the way files are typically laid out,
back merges are much more common than front merges. For some work loads, you
may even know that it is a waste of time to spend any time attempting to
front merge requests. Setting front_merges to 0 disables this functionality.
Front merges may still occur due to the cached last_merge hint, but since
that comes at basically 0 cost we leave that on. We simply disable the
rbtree front sector lookup when the io scheduler merge function is called.
Nov 11 2002, Jens Axboe <axboe@suse.de>

View File

@@ -0,0 +1,176 @@
Block io priorities
===================
Intro
-----
With the introduction of cfq v3 (aka cfq-ts or time sliced cfq), basic io
priorities is supported for reads on files. This enables users to io nice
processes or process groups, similar to what has been possible to cpu
scheduling for ages. This document mainly details the current possibilites
with cfq, other io schedulers do not support io priorities so far.
Scheduling classes
------------------
CFQ implements three generic scheduling classes that determine how io is
served for a process.
IOPRIO_CLASS_RT: This is the realtime io class. This scheduling class is given
higher priority than any other in the system, processes from this class are
given first access to the disk every time. Thus it needs to be used with some
care, one io RT process can starve the entire system. Within the RT class,
there are 8 levels of class data that determine exactly how much time this
process needs the disk for on each service. In the future this might change
to be more directly mappable to performance, by passing in a wanted data
rate instead.
IOPRIO_CLASS_BE: This is the best-effort scheduling class, which is the default
for any process that hasn't set a specific io priority. The class data
determines how much io bandwidth the process will get, it's directly mappable
to the cpu nice levels just more coarsely implemented. 0 is the highest
BE prio level, 7 is the lowest. The mapping between cpu nice level and io
nice level is determined as: io_nice = (cpu_nice + 20) / 5.
IOPRIO_CLASS_IDLE: This is the idle scheduling class, processes running at this
level only get io time when no one else needs the disk. The idle class has no
class data, since it doesn't really apply here.
Tools
-----
See below for a sample ionice tool. Usage:
# ionice -c<class> -n<level> -p<pid>
If pid isn't given, the current process is assumed. IO priority settings
are inherited on fork, so you can use ionice to start the process at a given
level:
# ionice -c2 -n0 /bin/ls
will run ls at the best-effort scheduling class at the highest priority.
For a running process, you can give the pid instead:
# ionice -c1 -n2 -p100
will change pid 100 to run at the realtime scheduling class, at priority 2.
---> snip ionice.c tool <---
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <getopt.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <asm/unistd.h>
extern int sys_ioprio_set(int, int, int);
extern int sys_ioprio_get(int, int);
#if defined(__i386__)
#define __NR_ioprio_set 289
#define __NR_ioprio_get 290
#elif defined(__ppc__)
#define __NR_ioprio_set 273
#define __NR_ioprio_get 274
#elif defined(__x86_64__)
#define __NR_ioprio_set 251
#define __NR_ioprio_get 252
#elif defined(__ia64__)
#define __NR_ioprio_set 1274
#define __NR_ioprio_get 1275
#else
#error "Unsupported arch"
#endif
_syscall3(int, ioprio_set, int, which, int, who, int, ioprio);
_syscall2(int, ioprio_get, int, which, int, who);
enum {
IOPRIO_CLASS_NONE,
IOPRIO_CLASS_RT,
IOPRIO_CLASS_BE,
IOPRIO_CLASS_IDLE,
};
enum {
IOPRIO_WHO_PROCESS = 1,
IOPRIO_WHO_PGRP,
IOPRIO_WHO_USER,
};
#define IOPRIO_CLASS_SHIFT 13
const char *to_prio[] = { "none", "realtime", "best-effort", "idle", };
int main(int argc, char *argv[])
{
int ioprio = 4, set = 0, ioprio_class = IOPRIO_CLASS_BE;
int c, pid = 0;
while ((c = getopt(argc, argv, "+n:c:p:")) != EOF) {
switch (c) {
case 'n':
ioprio = strtol(optarg, NULL, 10);
set = 1;
break;
case 'c':
ioprio_class = strtol(optarg, NULL, 10);
set = 1;
break;
case 'p':
pid = strtol(optarg, NULL, 10);
break;
}
}
switch (ioprio_class) {
case IOPRIO_CLASS_NONE:
ioprio_class = IOPRIO_CLASS_BE;
break;
case IOPRIO_CLASS_RT:
case IOPRIO_CLASS_BE:
break;
case IOPRIO_CLASS_IDLE:
ioprio = 7;
break;
default:
printf("bad prio class %d\n", ioprio_class);
return 1;
}
if (!set) {
if (!pid && argv[optind])
pid = strtol(argv[optind], NULL, 10);
ioprio = ioprio_get(IOPRIO_WHO_PROCESS, pid);
printf("pid=%d, %d\n", pid, ioprio);
if (ioprio == -1)
perror("ioprio_get");
else {
ioprio_class = ioprio >> IOPRIO_CLASS_SHIFT;
ioprio = ioprio & 0xff;
printf("%s: prio %d\n", to_prio[ioprio_class], ioprio);
}
} else {
if (ioprio_set(IOPRIO_WHO_PROCESS, pid, ioprio | ioprio_class << IOPRIO_CLASS_SHIFT) == -1) {
perror("ioprio_set");
return 1;
}
if (argv[optind])
execvp(argv[optind], &argv[optind]);
}
return 0;
}
---> snip ionice.c tool <---
March 11 2005, Jens Axboe <axboe@suse.de>

View File

@@ -0,0 +1,88 @@
struct request documentation
Jens Axboe <axboe@suse.de> 27/05/02
1.0
Index
2.0 Struct request members classification
2.1 struct request members explanation
3.0
2.0
Short explanation of request members
Classification flags:
D driver member
B block layer member
I I/O scheduler member
Unless an entry contains a D classification, a device driver must not access
this member. Some members may contain D classifications, but should only be
access through certain macros or functions (eg ->flags).
<linux/blkdev.h>
2.1
Member Flag Comment
------ ---- -------
struct list_head queuelist BI Organization on various internal
queues
void *elevator_private I I/O scheduler private data
unsigned char cmd[16] D Driver can use this for setting up
a cdb before execution, see
blk_queue_prep_rq
unsigned long flags DBI Contains info about data direction,
request type, etc.
int rq_status D Request status bits
kdev_t rq_dev DBI Target device
int errors DB Error counts
sector_t sector DBI Target location
unsigned long hard_nr_sectors B Used to keep sector sane
unsigned long nr_sectors DBI Total number of sectors in request
unsigned long hard_nr_sectors B Used to keep nr_sectors sane
unsigned short nr_phys_segments DB Number of physical scatter gather
segments in a request
unsigned short nr_hw_segments DB Number of hardware scatter gather
segments in a request
unsigned int current_nr_sectors DB Number of sectors in first segment
of request
unsigned int hard_cur_sectors B Used to keep current_nr_sectors sane
int tag DB TCQ tag, if assigned
void *special D Free to be used by driver
char *buffer D Map of first segment, also see
section on bouncing SECTION
struct completion *waiting D Can be used by driver to get signalled
on request completion
struct bio *bio DBI First bio in request
struct bio *biotail DBI Last bio in request
request_queue_t *q DB Request queue this request belongs to
struct request_list *rl B Request list this request came from

View File

@@ -0,0 +1,82 @@
Block layer statistics in /sys/block/<dev>/stat
===============================================
This file documents the contents of the /sys/block/<dev>/stat file.
The stat file provides several statistics about the state of block
device <dev>.
Q. Why are there multiple statistics in a single file? Doesn't sysfs
normally contain a single value per file?
A. By having a single file, the kernel can guarantee that the statistics
represent a consistent snapshot of the state of the device. If the
statistics were exported as multiple files containing one statistic
each, it would be impossible to guarantee that a set of readings
represent a single point in time.
The stat file consists of a single line of text containing 11 decimal
values separated by whitespace. The fields are summarized in the
following table, and described in more detail below.
Name units description
---- ----- -----------
read I/Os requests number of read I/Os processed
read merges requests number of read I/Os merged with in-queue I/O
read sectors sectors number of sectors read
read ticks milliseconds total wait time for read requests
write I/Os requests number of write I/Os processed
write merges requests number of write I/Os merged with in-queue I/O
write sectors sectors number of sectors written
write ticks milliseconds total wait time for write requests
in_flight requests number of I/Os currently in flight
io_ticks milliseconds total time this block device has been active
time_in_queue milliseconds total wait time for all requests
read I/Os, write I/Os
=====================
These values increment when an I/O request completes.
read merges, write merges
=========================
These values increment when an I/O request is merged with an
already-queued I/O request.
read sectors, write sectors
===========================
These values count the number of sectors read from or written to this
block device. The "sectors" in question are the standard UNIX 512-byte
sectors, not any device- or filesystem-specific block size. The
counters are incremented when the I/O completes.
read ticks, write ticks
=======================
These values count the number of milliseconds that I/O requests have
waited on this block device. If there are multiple I/O requests waiting,
these values will increase at a rate greater than 1000/second; for
example, if 60 read requests wait for an average of 30 ms, the read_ticks
field will increase by 60*30 = 1800.
in_flight
=========
This value counts the number of I/O requests that have been issued to
the device driver but have not yet completed. It does not include I/O
requests that are in the queue but not yet issued to the device driver.
io_ticks
========
This value counts the number of milliseconds during which the device has
had I/O requests queued.
time_in_queue
=============
This value counts the number of milliseconds that I/O requests have waited
on this block device. If there are multiple I/O requests waiting, this
value will increase as the product of the number of milliseconds times the
number of requests waiting (see "read ticks" above for an example).

View File

@@ -0,0 +1,22 @@
As of the Linux 2.6.10 kernel, it is now possible to change the
IO scheduler for a given block device on the fly (thus making it possible,
for instance, to set the CFQ scheduler for the system default, but
set a specific device to use the anticipatory or noop schedulers - which
can improve that device's throughput).
To set a specific scheduler, simply do this:
echo SCHEDNAME > /sys/block/DEV/queue/scheduler
where SCHEDNAME is the name of a defined IO scheduler, and DEV is the
device name (hda, hdb, sga, or whatever you happen to have).
The list of defined schedulers can be found by simply doing
a "cat /sys/block/DEV/queue/scheduler" - the list of valid names
will be displayed, with the currently selected scheduler in brackets:
# cat /sys/block/hda/queue/scheduler
noop anticipatory deadline [cfq]
# echo anticipatory > /sys/block/hda/queue/scheduler
# cat /sys/block/hda/queue/scheduler
noop [anticipatory] deadline cfq