Creation of Cybook 2416 (actually Gen4) repository
167	Documentation/block/as-iosched.txt	Normal file
@@ -0,0 +1,167 @@
Anticipatory IO scheduler
-------------------------
Nick Piggin <piggin@cyberone.com.au>	13 Sep 2003

Attention! Database servers, especially those using "TCQ" disks, should
investigate performance with the 'deadline' IO scheduler. In fact, any system
with high disk performance requirements should do so.

If you see unusual performance characteristics of your disk systems, or big
performance regressions versus the deadline scheduler, please email me.
Database users needn't bother unless willing to test a lot of patches from
me ;) it's a known issue.

Also, users with hardware RAID controllers doing striping may find highly
variable performance results when using the as-iosched. The as-iosched
anticipatory implementation is based on the notion that a disk device has
only one physical seeking head; a striped RAID controller actually has a
head for each physical device in the logical RAID device.

However, setting antic_expire to zero (see tunable parameters below)
produces behavior very similar to the deadline IO scheduler.

Selecting IO schedulers
-----------------------
To choose an IO scheduler at boot time, use the kernel argument
'elevator=deadline'. 'noop', 'as' and 'cfq' (the default) are also available.
The boot-time selection applies globally; the IO scheduler for a given
device can then be changed on the fly, as described in
Documentation/block/switching-sched.txt.

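For example, a boot loader entry might append the parameter to the kernel
command line like this (the kernel image name and root device here are
purely illustrative):

	kernel /vmlinuz-2.6.x root=/dev/hda1 ro elevator=as
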
Anticipatory IO scheduler Policies
----------------------------------
The as-iosched implementation applies several layers of policies to
determine when an IO request is dispatched to the disk controller. Here
are the policies, outlined in order of application.

1. One-way elevator algorithm.

The elevator algorithm is similar to that used in the deadline scheduler,
with the addition that it allows limited backward movement of the elevator
(i.e. seeks backwards). A seek backwards can occur when choosing between
two IO requests where one is behind the elevator's current position and
the other is in front of it. If the seek distance to the request behind
the elevator is less than half the seek distance to the request in front
of it, then the request behind can be chosen. Backward seeks are also
limited to a maximum of MAXBACK (1024*1024) sectors. This favors forward
movement of the elevator, while allowing opportunistic "short" backward
seeks.

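As a minimal sketch (plain C, not taken from the as-iosched source; names
and types here are illustrative only), the backward-seek rule above can be
expressed like this:

#define MAXBACK (1024 * 1024)	/* maximum backward seek, in sectors */

/*
 * Given the elevator head position and one candidate request on each
 * side of it, return nonzero if the request behind the head should be
 * chosen: the backward distance must be under the MAXBACK cap and less
 * than half the forward distance.
 */
static int choose_back_request(long head, long back, long front)
{
	long back_dist = head - back;	/* back <= head */
	long front_dist = front - head;	/* front >= head */

	return back_dist <= MAXBACK && 2 * back_dist < front_dist;
}
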
2. FIFO expiration times for reads and for writes.

This is again very similar to the deadline IO scheduler. The expiration
times for requests on these lists are tunable using the parameters
read_expire and write_expire discussed below. When a read or a write
expires in this way, the IO scheduler will interrupt its current elevator
sweep or read anticipation to service the expired request.

3. Read and write request batching.

A batch is a collection of read requests or a collection of write
requests. The as scheduler alternates dispatching read and write batches
to the driver. In the case of a read batch, the scheduler submits read
requests to the driver as long as there are read requests to submit and
the read batch time limit has not been exceeded (read_batch_expire).
The read batch time limit begins counting down only when there are
competing write requests pending.

In the case of a write batch, the scheduler submits write requests to
the driver as long as there are write requests available and the write
batch time limit has not been exceeded (write_batch_expire). However,
the length of write batches will be gradually shortened when read batches
frequently exceed their time limit.

When changing between batch types, the scheduler waits for all requests
from the previous batch to complete before scheduling requests for the
next batch. A batch-switching skeleton is sketched below.

The read and write FIFO expiration times described in policy 2 above
are checked only when scheduling IO of a batch of the corresponding
(read/write) type. So, for example, the read FIFO timeout values are
tested only during read batches, and the write FIFO timeout values are
tested only during write batches. For this reason, it is generally not
recommended for the read batch time to be longer than the write expiration
time, nor for the write batch time to exceed the read expiration time
(see tunable parameters below).

When the IO scheduler changes from a read to a write batch, it begins
the elevator from the request that is at the head of the write expiration
FIFO. Likewise, when changing from a write batch to a read batch, the
scheduler begins the elevator from the first entry on the read expiration
FIFO.

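The batch alternation can be pictured with the following skeleton (plain
C; an illustration of the policy described above, not the actual
as-iosched code - every name in it is made up):

enum batch_type { BATCH_READ, BATCH_WRITE };

struct batch_state {
	enum batch_type type;		/* current batch type */
	unsigned long expires;		/* batch deadline, in jiffies */
	int in_flight;			/* requests not yet completed */
};

/* Called when the current batch runs out of requests or time. */
static void switch_batch(struct batch_state *b, unsigned long now,
			 unsigned long read_batch_expire,
			 unsigned long write_batch_expire)
{
	/* all requests of the old batch must complete first */
	if (b->in_flight)
		return;

	if (b->type == BATCH_READ) {
		b->type = BATCH_WRITE;
		b->expires = now + write_batch_expire;
	} else {
		b->type = BATCH_READ;
		b->expires = now + read_batch_expire;
	}
}
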
4. Read anticipation.

Read anticipation occurs only when scheduling a read batch. This
implementation of read anticipation allows only one read request to be
dispatched to the disk controller at a time. In contrast, many write
requests may be dispatched to the disk controller at a time during a
write batch. It is this characteristic that can make the anticipatory
scheduler perform anomalously with controllers supporting TCQ, or with
hardware striped RAID devices. Setting the antic_expire queue parameter
(see below) to zero disables this behavior, and the anticipatory
scheduler behaves essentially like the deadline scheduler.

When read anticipation is enabled (antic_expire is not zero), reads are
dispatched to the disk controller one at a time. At the end of each read
request, the IO scheduler examines its next candidate read request from
its sorted read list. If that next request is from the same process as
the request that just completed, or if the next request in the queue is
"very close" to the just completed request, it is dispatched immediately.
Otherwise, statistics (average think time, average seek distance) on the
process that submitted the just completed request are examined. If it
seems likely that that process will submit another request soon, and that
request is likely to be near the just completed request, then the IO
scheduler will stop dispatching more read requests for up to antic_expire
milliseconds, hoping that process will submit a new request near the one
that just completed. If such a request is made, then it is dispatched
immediately. If the antic_expire wait time expires, then the IO scheduler
will dispatch the next read request from the sorted read queue.

To decide whether an anticipatory wait is worthwhile, the scheduler
maintains statistics for each process that can be used to compute the
mean "think time" (the time between read requests) and the mean seek
distance for that process. One observation is that these statistics are
associated with each process, but not with a specific IO device. So, for
example, if a process is doing IO on several file systems on separate
devices, the statistics will be a combination of IO behavior from all
those devices.

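The shape of that decision can be sketched as follows (plain C; a rough
illustration of the heuristic described above, not the real code - the
statistics structure, threshold parameter and function names are all
invented for this example):

struct proc_iostats {
	unsigned long mean_think_ms;	/* mean time between reads */
	unsigned long mean_seek;	/* mean seek distance, sectors */
};

/*
 * Anticipate only if the process tends to issue its next read before
 * the anticipation timeout would fire, and that read tends to land
 * close to the request that just completed.
 */
static int worth_anticipating(const struct proc_iostats *ps,
			      unsigned long antic_expire_ms,
			      unsigned long close_sectors)
{
	return ps->mean_think_ms < antic_expire_ms &&
	       ps->mean_seek < close_sectors;
}
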
Tuning the anticipatory IO scheduler
------------------------------------
When using 'as' (the anticipatory IO scheduler), there are 5 parameters
under /sys/block/*/queue/iosched/. All are in units of milliseconds.

The parameters are:
* read_expire
    Controls how long until a read request becomes "expired". It also
    controls the interval at which expired requests are served, so if it
    is set to 50, a request might take anywhere up to 100 ms to be
    serviced _if_ it is the next on the expired list. Obviously request
    expiration strategies won't make the disk go faster. The result
    basically equates to the timeslice a single reader gets in the
    presence of other IO. 100*((seek time / read_expire) + 1) is very
    roughly the % streaming read efficiency your disk should get with
    multiple readers.

* read_batch_expire
    Controls how much time a batch of reads is given before pending writes
    are served. A higher value is more efficient. This might be set below
    read_expire if writes are to be given higher priority than reads, but
    reads are to be as efficient as possible when there are no writes.
    Generally though, it should be some multiple of read_expire.

* write_expire, and
* write_batch_expire are equivalent to the above, for writes.

* antic_expire
    Controls the maximum amount of time we can anticipate a good read
    (one with a short seek distance from the most recently completed
    request) before giving up. Many other factors may cause anticipation
    to be stopped early, and some processes will not be "anticipated" at
    all. Should be a bit higher for devices with big seek times, though
    not in a linear correspondence - most processes have only a few ms
    of thinktime.

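These parameters can be inspected and changed through sysfs at runtime.
For example (the device name 'hda' and the values shown are illustrative
only):

	# cat /sys/block/hda/queue/iosched/antic_expire
	6
	# echo 0 > /sys/block/hda/queue/iosched/antic_expire

Setting antic_expire to 0 disables read anticipation, as described above.
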
271	Documentation/block/barrier.txt	Normal file
@@ -0,0 +1,271 @@
I/O Barriers
============
Tejun Heo <htejun@gmail.com>, July 22 2005

I/O barrier requests are used to guarantee ordering around the barrier
requests. Unless you're crazy enough to use disk drives for implementing
synchronization constructs (wow, sounds interesting...), the ordering is
meaningful only for write requests for things like journal checkpoints.
All requests queued before a barrier request must be finished (made it
to the physical medium) before the barrier request is started, and all
requests queued after the barrier request must be started only after the
barrier request is finished (again, made it to the physical medium).

In other words, I/O barrier requests have the following two properties.

1. Request ordering

Requests cannot pass the barrier request. Preceding requests are
processed before the barrier and following requests after.

Depending on what features a drive supports, this can be done in one
of the following three ways.

i. For devices which have queue depth greater than 1 (TCQ devices) and
support ordered tags, the block layer can just issue the barrier as an
ordered request, and the lower level driver, controller and drive itself
are responsible for making sure that the ordering constraint is met.
Most modern SCSI controllers/drives should support this.

NOTE: SCSI ordered tags aren't currently used due to a limitation in the
SCSI midlayer; see the random notes section below.

ii. For devices which have queue depth greater than 1 but don't support
ordered tags, the block layer ensures that the requests preceding a
barrier request finish before issuing the barrier request. It also defers
requests following the barrier until the barrier request is finished.
Older SCSI controllers/drives and SATA drives fall in this category.

iii. Devices which have queue depth of 1. This is a degenerate case of
ii; just keeping issue order suffices. Ancient SCSI controllers/drives
and IDE drives are in this category.

2. Forced flushing to physical medium

Again, if you're not gonna do synchronization with disk drives (dang, it
sounds even more appealing now!), the reason you use I/O barriers is
mainly to protect filesystem integrity when power failure or some other
event abruptly stops the drive from operating and possibly makes the
drive lose data in its cache. So, I/O barriers need to guarantee that
requests actually get written to non-volatile medium in order.

There are four cases,

i. No write-back cache. Keeping requests ordered is enough.

ii. Write-back cache but no flush operation. There's no way to guarantee
physical-medium commit order. This kind of device can't do I/O barriers.

iii. Write-back cache and flush operation but no FUA (forced unit
access). We need two cache flushes - before and after the barrier
request.

iv. Write-back cache, flush operation and FUA. We still need one flush
to make sure requests preceding a barrier are written to medium, but the
post-barrier flush can be avoided by using a FUA write on the barrier
itself.


How to support barrier requests in drivers
------------------------------------------

All barrier handling is done inside the block layer proper. All a low
level driver has to do is implement its prepare_flush_fn and use one of
the following two functions to indicate what barrier type it supports
and how to prepare flush requests. Note that the term 'ordered' is used
to indicate the whole sequence of performing barrier requests, including
draining and flushing.

typedef void (prepare_flush_fn)(request_queue_t *q, struct request *rq);

int blk_queue_ordered(request_queue_t *q, unsigned ordered,
		      prepare_flush_fn *prepare_flush_fn,
		      unsigned gfp_mask);

int blk_queue_ordered_locked(request_queue_t *q, unsigned ordered,
			     prepare_flush_fn *prepare_flush_fn,
			     unsigned gfp_mask);

The only difference between the two functions is whether or not the
caller is holding q->queue_lock on entry. The latter expects the caller
to be holding the lock.

@q			: the queue in question
@ordered		: the ordered mode the driver/device supports
@prepare_flush_fn	: this function should prepare @rq such that it
			  flushes cache to physical medium when executed
@gfp_mask		: gfp_mask used when allocating data structures
			  for ordered processing

For example, the SCSI disk driver's prepare_flush_fn looks like the
following.

static void sd_prepare_flush(request_queue_t *q, struct request *rq)
{
	memset(rq->cmd, 0, sizeof(rq->cmd));
	rq->flags |= REQ_BLOCK_PC;
	rq->timeout = SD_TIMEOUT;
	rq->cmd[0] = SYNCHRONIZE_CACHE;
}

The following seven ordered modes are supported. The following table
shows which mode should be used depending on what features a
device/driver supports. In the leftmost column of the table, the
QUEUE_ORDERED_ prefix is omitted from the mode names to save space.

The table is followed by a description of each mode. Note that in the
descriptions of QUEUE_ORDERED_DRAIN*, '=>' is used whereas '->' is used
for the QUEUE_ORDERED_TAG* descriptions. '=>' indicates that the
preceding step must be complete before proceeding to the next step.
'->' indicates that the next step can start as soon as the previous
step is issued.

		write-back cache	ordered tag	flush	FUA
-----------------------------------------------------------------------
NONE		yes/no			N/A		no	N/A
DRAIN		no			no		N/A	N/A
DRAIN_FLUSH	yes			no		yes	no
DRAIN_FUA	yes			no		yes	yes
TAG		no			yes		N/A	N/A
TAG_FLUSH	yes			yes		yes	no
TAG_FUA		yes			yes		yes	yes

QUEUE_ORDERED_NONE
	I/O barriers are not needed and/or supported.

	Sequence: N/A

QUEUE_ORDERED_DRAIN
	Requests are ordered by draining the request queue and cache
	flushing isn't needed.

	Sequence: drain => barrier

QUEUE_ORDERED_DRAIN_FLUSH
	Requests are ordered by draining the request queue and both
	pre-barrier and post-barrier cache flushings are needed.

	Sequence: drain => preflush => barrier => postflush

QUEUE_ORDERED_DRAIN_FUA
	Requests are ordered by draining the request queue and
	pre-barrier cache flushing is needed. By using FUA on the
	barrier request, post-barrier flushing can be skipped.

	Sequence: drain => preflush => barrier

QUEUE_ORDERED_TAG
	Requests are ordered by ordered tag and cache flushing isn't
	needed.

	Sequence: barrier

QUEUE_ORDERED_TAG_FLUSH
	Requests are ordered by ordered tag and both pre-barrier and
	post-barrier cache flushings are needed.

	Sequence: preflush -> barrier -> postflush

QUEUE_ORDERED_TAG_FUA
	Requests are ordered by ordered tag and pre-barrier cache
	flushing is needed. By using FUA on the barrier request,
	post-barrier flushing can be skipped.

	Sequence: preflush -> barrier

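Putting the pieces together, a driver's queue setup might register its
ordered mode like the sketch below (illustrative only - the my_* names
are invented; blk_queue_ordered and the QUEUE_ORDERED_* modes are from
this document, and DRAIN_FLUSH is just an example choice for a
write-back-cache device without FUA):

/* prepare @rq so that it flushes the device cache when executed */
static void my_prepare_flush(request_queue_t *q, struct request *rq)
{
	/* fill in a device-specific cache-flush command here,
	 * as sd_prepare_flush does for SCSI disks above */
}

static int my_init_queue(request_queue_t *q)
{
	/* write-back cache, flush supported, no FUA => DRAIN_FLUSH */
	return blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH,
				 my_prepare_flush, GFP_KERNEL);
}
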
Random notes/caveats
--------------------

* The SCSI layer currently can't use TAG ordering even if the drive,
controller and driver support it. The problem is that the SCSI midlayer
request dispatch function is not atomic. It releases the queue lock and
switches to the SCSI host lock during issue, so it is possible, and in
time likely, that requests change their relative positions. Once this
problem is solved, TAG ordering can be enabled.

* Currently, no matter which ordered mode is used, there can be only one
barrier request in progress. All I/O barriers are held off by the block
layer until the previous I/O barrier is complete. This doesn't make any
difference for DRAIN ordered devices, but, for TAG ordered devices with
very high command latency, passing multiple I/O barriers to the low
level *might* be helpful if they are very frequent. Well, this certainly
is a non-issue. I'm writing this just to make clear that no two I/O
barriers are ever passed to the low-level driver at the same time.

* Completion order. Requests in an ordered sequence are issued in order
but are not required to finish in order. The barrier implementation can
handle out-of-order completion of an ordered sequence. IOW, the requests
MUST be processed in order, but the hardware/software completion paths
are allowed to reorder completion notifications - eg. the current SCSI
midlayer doesn't preserve completion order during error handling.

* Requeueing order. Low-level drivers are free to requeue any request
after they have removed it from the request queue with
blkdev_dequeue_request(). As the barrier sequence must be kept in order
when requeued, generic elevator code takes care of putting requests in
order around the barrier. See blk_ordered_req_seq() and the
ELEVATOR_INSERT_REQUEUE handling in __elv_add_request() for details.

Note that block drivers must not requeue preceding requests while
completing later requests in an ordered sequence. Currently, no error
checking is done against this.

* Error handling. Currently, the block layer will report an error to the
upper layer if any request in an ordered sequence fails. Unfortunately,
this doesn't seem to be enough. Look at the following request flow,
where QUEUE_ORDERED_TAG_FLUSH is in use.

 [0] [1] [2] [3] [pre] [barrier] [post]  <  [4] [5] [6] ... >
					    still in elevator

Let's say requests [2] and [3] are write requests to update file system
metadata (journal or whatever) and [barrier] is used to mark that those
updates are valid. Consider the following sequence.

 i.   Requests [0] ~ [post] leave the request queue and enter the
      low-level driver.
 ii.  After a while, unfortunately, something goes wrong and the drive
      fails [2]. Note that any of [0], [1] and [3] could have completed
      by this time, but [pre] couldn't have been finished, as the drive
      must process it in order and it failed before processing that
      command.
 iii. Error handling kicks in and determines that the error is
      unrecoverable and fails [2], and resumes operation.
 iv.  [pre] [barrier] [post] get processed.
 v.   *BOOM* power fails

The problem here is that the barrier request is *supposed* to indicate
that filesystem update requests [2] and [3] made it safely to the
physical medium and, if the machine crashes after the barrier is
written, filesystem recovery code can depend on that. Sadly, that isn't
true in this case anymore. IOW, the success of an I/O barrier should
also depend on the success of some of the preceding requests, where only
the upper layer (filesystem) knows what 'some' is.

This can be solved by implementing a way to tell the block layer which
requests affect the success of the following barrier request, and making
low-level drivers resume operation on error only after the block layer
tells them to do so.

As the probability of this happening is very low and the drive would
already be faulty, implementing the fix is probably overkill. But,
still, it's there.

* In previous drafts of the barrier implementation, there was a fallback
mechanism such that, if FUA or ordered TAG failed, a less fancy ordered
mode could be selected and the failed barrier request retried
automatically. The rationale for this feature was that, as FUA is pretty
new in the ATA world and ordered tags were never used widely, there
could be devices which report support for those features but choke when
actually given such requests.

This was removed for two reasons: 1. it's overkill, and 2. it's
impossible to implement properly when TAG ordering is used, as low-level
drivers resume after an error automatically. If it's ever needed, adding
it back and modifying low-level drivers accordingly shouldn't be
difficult.
1215	Documentation/block/biodoc.txt	Normal file
(file diff suppressed because it is too large)
78	Documentation/block/deadline-iosched.txt	Normal file
@@ -0,0 +1,78 @@
Deadline IO scheduler tunables
==============================

This little file attempts to document how the deadline io scheduler works.
In particular, it will clarify the meaning of the exposed tunables that may
be of interest to power users.

Each io queue has a set of io scheduler tunables associated with it. These
tunables control how the io scheduler works. You can find these entries
in:

/sys/block/<device>/queue/iosched

assuming that you have sysfs mounted on /sys. If you don't have sysfs
mounted, you can do so by typing:

# mount none /sys -t sysfs


********************************************************************************


read_expire (in ms)
-------------------

The goal of the deadline io scheduler is to attempt to guarantee a start
service time for a request. As we focus mainly on read latencies, this is
tunable. When a read request first enters the io scheduler, it is assigned
a deadline that is the current time + the read_expire value in units of
milliseconds.


write_expire (in ms)
--------------------

Similar to read_expire mentioned above, but for writes.


fifo_batch
----------

When a read request expires its deadline, we must move some requests from
the sorted io scheduler list to the block device dispatch queue. fifo_batch
controls how many requests we move, based on the cost of each request. A
request is either qualified as a seek or a stream. The io scheduler knows
the last request that was serviced by the drive (or will be serviced right
before this one). See seek_cost and stream_unit.


writes_starved (number of dispatches)
-------------------------------------

When we have to move requests from the io scheduler queue to the block
device dispatch queue, we always give preference to reads. However, we
don't want to starve writes indefinitely either. So writes_starved
controls how many times we give preference to reads over writes. When
that has been done writes_starved number of times, we dispatch some
writes based on the same criteria as reads.


front_merges (bool)
-------------------

Sometimes it happens that a request enters the io scheduler that is
contiguous with a request that is already on the queue. Either it fits in
the back of that request, or it fits at the front. That is called either
a back merge candidate or a front merge candidate. Due to the way files
are typically laid out, back merges are much more common than front
merges. For some workloads, you may even know that it is a waste of time
to attempt to front merge requests. Setting front_merges to 0 disables
this functionality. Front merges may still occur due to the cached
last_merge hint, but since that comes at basically zero cost we leave it
on. We simply disable the rbtree front sector lookup when the io
scheduler merge function is called.

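As with the other io schedulers, these tunables are plain sysfs files.
For example (device name and values illustrative only):

# cat /sys/block/hda/queue/iosched/front_merges
1
# echo 0 > /sys/block/hda/queue/iosched/front_merges
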

Nov 11 2002, Jens Axboe <axboe@suse.de>

176	Documentation/block/ioprio.txt	Normal file
@@ -0,0 +1,176 @@
Block io priorities
===================


Intro
-----

With the introduction of cfq v3 (aka cfq-ts or time sliced cfq), basic io
priorities are supported for reads on files. This enables users to io nice
processes or process groups, similar to what has been possible with cpu
scheduling for ages. This document mainly details the current possibilities
with cfq; other io schedulers do not support io priorities so far.

Scheduling classes
------------------

CFQ implements three generic scheduling classes that determine how io is
served for a process.

IOPRIO_CLASS_RT: This is the realtime io class. This scheduling class is
given higher priority than any other in the system; processes from this
class are given first access to the disk every time. Thus it needs to be
used with some care: one io RT process can starve the entire system.
Within the RT class, there are 8 levels of class data that determine
exactly how much time this process needs the disk for on each service.
In the future this might change to be more directly mappable to
performance, by passing in a wanted data rate instead.

IOPRIO_CLASS_BE: This is the best-effort scheduling class, which is the
default for any process that hasn't set a specific io priority. The class
data determines how much io bandwidth the process will get; it's directly
mappable to the cpu nice levels, just more coarsely implemented. 0 is the
highest BE prio level, 7 is the lowest. The mapping between cpu nice
level and io nice level is determined as: io_nice = (cpu_nice + 20) / 5.
So, for example, a process at the default cpu nice 0 maps to io nice
level 4.

IOPRIO_CLASS_IDLE: This is the idle scheduling class; processes running
at this level only get io time when no one else needs the disk. The idle
class has no class data, since it doesn't really apply here.

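That mapping is simple integer arithmetic; spelled out in C (a trivial
helper written for this document, not a kernel function):

/* map a cpu nice level (-20..19) to an io nice level (0..7) */
static inline int cpu_nice_to_ionice(int cpu_nice)
{
	return (cpu_nice + 20) / 5;	/* e.g. -20 -> 0, 0 -> 4, 19 -> 7 */
}
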
Tools
-----

See below for a sample ionice tool. Usage:

# ionice -c<class> -n<level> -p<pid>

If pid isn't given, the current process is assumed. IO priority settings
are inherited on fork, so you can use ionice to start the process at a
given level:

# ionice -c2 -n0 /bin/ls

will run ls at the best-effort scheduling class at the highest priority.
For a running process, you can give the pid instead:

# ionice -c1 -n2 -p100

will change pid 100 to run at the realtime scheduling class, at priority 2.

---> snip ionice.c tool <---

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <getopt.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <asm/unistd.h>

extern int sys_ioprio_set(int, int, int);
extern int sys_ioprio_get(int, int);

/* per-architecture syscall numbers for ioprio_set/ioprio_get */
#if defined(__i386__)
#define __NR_ioprio_set		289
#define __NR_ioprio_get		290
#elif defined(__ppc__)
#define __NR_ioprio_set		273
#define __NR_ioprio_get		274
#elif defined(__x86_64__)
#define __NR_ioprio_set		251
#define __NR_ioprio_get		252
#elif defined(__ia64__)
#define __NR_ioprio_set		1274
#define __NR_ioprio_get		1275
#else
#error "Unsupported arch"
#endif

_syscall3(int, ioprio_set, int, which, int, who, int, ioprio);
_syscall2(int, ioprio_get, int, which, int, who);

enum {
	IOPRIO_CLASS_NONE,
	IOPRIO_CLASS_RT,
	IOPRIO_CLASS_BE,
	IOPRIO_CLASS_IDLE,
};

enum {
	IOPRIO_WHO_PROCESS = 1,
	IOPRIO_WHO_PGRP,
	IOPRIO_WHO_USER,
};

#define IOPRIO_CLASS_SHIFT	13

const char *to_prio[] = { "none", "realtime", "best-effort", "idle", };

int main(int argc, char *argv[])
{
	int ioprio = 4, set = 0, ioprio_class = IOPRIO_CLASS_BE;
	int c, pid = 0;

	while ((c = getopt(argc, argv, "+n:c:p:")) != EOF) {
		switch (c) {
		case 'n':
			ioprio = strtol(optarg, NULL, 10);
			set = 1;
			break;
		case 'c':
			ioprio_class = strtol(optarg, NULL, 10);
			set = 1;
			break;
		case 'p':
			pid = strtol(optarg, NULL, 10);
			break;
		}
	}

	switch (ioprio_class) {
	case IOPRIO_CLASS_NONE:
		ioprio_class = IOPRIO_CLASS_BE;
		break;
	case IOPRIO_CLASS_RT:
	case IOPRIO_CLASS_BE:
		break;
	case IOPRIO_CLASS_IDLE:
		/* the idle class has no class data */
		ioprio = 7;
		break;
	default:
		printf("bad prio class %d\n", ioprio_class);
		return 1;
	}

	if (!set) {
		if (!pid && argv[optind])
			pid = strtol(argv[optind], NULL, 10);

		ioprio = ioprio_get(IOPRIO_WHO_PROCESS, pid);

		printf("pid=%d, %d\n", pid, ioprio);

		if (ioprio == -1)
			perror("ioprio_get");
		else {
			/* class is stored above IOPRIO_CLASS_SHIFT,
			 * the prio level in the low bits */
			ioprio_class = ioprio >> IOPRIO_CLASS_SHIFT;
			ioprio = ioprio & 0xff;
			printf("%s: prio %d\n", to_prio[ioprio_class], ioprio);
		}
	} else {
		if (ioprio_set(IOPRIO_WHO_PROCESS, pid, ioprio | ioprio_class << IOPRIO_CLASS_SHIFT) == -1) {
			perror("ioprio_set");
			return 1;
		}

		if (argv[optind])
			execvp(argv[optind], &argv[optind]);
	}

	return 0;
}

---> snip ionice.c tool <---


March 11 2005, Jens Axboe <axboe@suse.de>
88	Documentation/block/request.txt	Normal file
@@ -0,0 +1,88 @@

struct request documentation

Jens Axboe <axboe@suse.de> 27/05/02

1.0 Index

2.0 Struct request members classification

	2.1 struct request members explanation


2.0
Short explanation of request members

Classification flags:

	D	driver member
	B	block layer member
	I	I/O scheduler member

Unless an entry contains a D classification, a device driver must not
access this member. Some members may contain D classifications, but
should only be accessed through certain macros or functions (eg ->flags).

<linux/blkdev.h>

2.1
Member				Flag	Comment
------				----	-------

struct list_head queuelist	BI	Organization on various internal
					queues

void *elevator_private		I	I/O scheduler private data

unsigned char cmd[16]		D	Driver can use this for setting up
					a cdb before execution, see
					blk_queue_prep_rq

unsigned long flags		DBI	Contains info about data direction,
					request type, etc.

int rq_status			D	Request status bits

kdev_t rq_dev			DBI	Target device

int errors			DB	Error counts

sector_t sector			DBI	Target location

sector_t hard_sector		B	Used to keep sector sane

unsigned long nr_sectors	DBI	Total number of sectors in request

unsigned long hard_nr_sectors	B	Used to keep nr_sectors sane

unsigned short nr_phys_segments	DB	Number of physical scatter gather
					segments in a request

unsigned short nr_hw_segments	DB	Number of hardware scatter gather
					segments in a request

unsigned int current_nr_sectors	DB	Number of sectors in first segment
					of request

unsigned int hard_cur_sectors	B	Used to keep current_nr_sectors sane

int tag				DB	TCQ tag, if assigned

void *special			D	Free to be used by driver

char *buffer			D	Map of first segment, also see
					the section on bouncing

struct completion *waiting	D	Can be used by driver to get
					signalled on request completion

struct bio *bio			DBI	First bio in request

struct bio *biotail		DBI	Last bio in request

request_queue_t *q		DB	Request queue this request belongs
					to

struct request_list *rl		B	Request list this request came from
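
As an illustration of the classification rules, a driver should restrict
itself to D-marked members. A minimal sketch (the function name is made
up for this document; only the members come from the table above):

#include <linux/blkdev.h>

/* Program the hardware for one request using driver-visible members. */
static void my_handle_request(struct request *rq)
{
	sector_t start = rq->sector;		/* DBI: target location */
	unsigned long total = rq->nr_sectors;	/* DBI: total sectors */
	unsigned int first = rq->current_nr_sectors; /* DB: first segment */
	char *buf = rq->buffer;			/* D: map of first segment */

	/* ... set up a transfer of 'first' sectors at 'start' from/to
	 * 'buf', with 'total' sectors remaining in the request ... */
}
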
82	Documentation/block/stat.txt	Normal file
@@ -0,0 +1,82 @@
Block layer statistics in /sys/block/<dev>/stat
===============================================

This file documents the contents of the /sys/block/<dev>/stat file.

The stat file provides several statistics about the state of block
device <dev>.

Q. Why are there multiple statistics in a single file?  Doesn't sysfs
   normally contain a single value per file?
A. By having a single file, the kernel can guarantee that the statistics
   represent a consistent snapshot of the state of the device. If the
   statistics were exported as multiple files containing one statistic
   each, it would be impossible to guarantee that a set of readings
   represent a single point in time.

The stat file consists of a single line of text containing 11 decimal
values separated by whitespace. The fields are summarized in the
following table, and described in more detail below.

Name		units		description
----		-----		-----------
read I/Os	requests	number of read I/Os processed
read merges	requests	number of read I/Os merged with in-queue I/O
read sectors	sectors		number of sectors read
read ticks	milliseconds	total wait time for read requests
write I/Os	requests	number of write I/Os processed
write merges	requests	number of write I/Os merged with in-queue I/O
write sectors	sectors		number of sectors written
write ticks	milliseconds	total wait time for write requests
in_flight	requests	number of I/Os currently in flight
io_ticks	milliseconds	total time this block device has been active
time_in_queue	milliseconds	total wait time for all requests

read I/Os, write I/Os
=====================

These values increment when an I/O request completes.

read merges, write merges
=========================

These values increment when an I/O request is merged with an
already-queued I/O request.

read sectors, write sectors
===========================

These values count the number of sectors read from or written to this
block device. The "sectors" in question are the standard UNIX 512-byte
sectors, not any device- or filesystem-specific block size. The
counters are incremented when the I/O completes.

read ticks, write ticks
=======================

These values count the number of milliseconds that I/O requests have
waited on this block device. If there are multiple I/O requests waiting,
these values will increase at a rate greater than 1000/second; for
example, if 60 read requests wait for an average of 30 ms, the read_ticks
field will increase by 60*30 = 1800.

in_flight
=========

This value counts the number of I/O requests that have been issued to
the device driver but have not yet completed. It does not include I/O
requests that are in the queue but not yet issued to the device driver.

io_ticks
========

This value counts the number of milliseconds during which the device has
had I/O requests queued.

time_in_queue
=============

This value counts the number of milliseconds that I/O requests have
waited on this block device. If there are multiple I/O requests waiting,
this value will increase as the product of the number of milliseconds
times the number of requests waiting (see "read ticks" above for an
example).
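
Since the file is a single whitespace-separated line, it is easy to read
programmatically. A small userspace sketch (the device path is just an
example):

#include <stdio.h>

int main(void)
{
	static const char *name[11] = {
		"read I/Os", "read merges", "read sectors", "read ticks",
		"write I/Os", "write merges", "write sectors", "write ticks",
		"in_flight", "io_ticks", "time_in_queue",
	};
	unsigned long long v[11];
	int i, n;
	FILE *f = fopen("/sys/block/hda/stat", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	for (n = 0; n < 11; n++)
		if (fscanf(f, "%llu", &v[n]) != 1)
			break;
	fclose(f);
	for (i = 0; i < n; i++)
		printf("%-14s %llu\n", name[i], v[i]);
	return 0;
}
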
22	Documentation/block/switching-sched.txt	Normal file
@@ -0,0 +1,22 @@
As of the Linux 2.6.10 kernel, it is now possible to change the IO
scheduler for a given block device on the fly (thus making it possible,
for instance, to set the CFQ scheduler as the system default, but set a
specific device to use the anticipatory or noop schedulers - which can
improve that device's throughput).

To set a specific scheduler, simply do this:

echo SCHEDNAME > /sys/block/DEV/queue/scheduler

where SCHEDNAME is the name of a defined IO scheduler, and DEV is the
device name (hda, hdb, sda, or whatever you happen to have).

The list of defined schedulers can be found by simply doing a
"cat /sys/block/DEV/queue/scheduler" - the list of valid names will be
displayed, with the currently selected scheduler in brackets:

# cat /sys/block/hda/queue/scheduler
noop anticipatory deadline [cfq]
# echo anticipatory > /sys/block/hda/queue/scheduler
# cat /sys/block/hda/queue/scheduler
noop [anticipatory] deadline cfq