Creation of Cybook 2416 (actually Gen4) repository
This commit is contained in:
commit
76f20f4d40
1
.cross_compile
Normal file
1
.cross_compile
Normal file
@ -0,0 +1 @@
|
||||
/usr/local/arm/4.2.2-eabi/usr/bin/arm-linux-
|
||||
98
.mailmap
Normal file
98
.mailmap
Normal file
@ -0,0 +1,98 @@
|
||||
#
|
||||
# This list is used by git-shortlog to fix a few botched name translations
|
||||
# in the git archive, either because the author's full name was messed up
|
||||
# and/or not always written the same way, making contributions from the
|
||||
# same person appearing not to be so or badly displayed.
|
||||
#
|
||||
# repo-abbrev: /pub/scm/linux/kernel/git/
|
||||
#
|
||||
|
||||
Aaron Durbin <adurbin@google.com>
|
||||
Adam Oldham <oldhamca@gmail.com>
|
||||
Adam Radford <aradford@gmail.com>
|
||||
Adrian Bunk <bunk@stusta.de>
|
||||
Alan Cox <alan@lxorguk.ukuu.org.uk>
|
||||
Alan Cox <root@hraefn.swansea.linux.org.uk>
|
||||
Aleksey Gorelov <aleksey_gorelov@phoenix.com>
|
||||
Al Viro <viro@ftp.linux.org.uk>
|
||||
Al Viro <viro@zenIV.linux.org.uk>
|
||||
Andreas Herrmann <aherrman@de.ibm.com>
|
||||
Andrew Morton <akpm@osdl.org>
|
||||
Andrew Vasquez <andrew.vasquez@qlogic.com>
|
||||
Andy Adamson <andros@citi.umich.edu>
|
||||
Arnaud Patard <arnaud.patard@rtp-net.org>
|
||||
Arnd Bergmann <arnd@arndb.de>
|
||||
Axel Dyks <xl@xlsigned.net>
|
||||
Ben Gardner <bgardner@wabtec.com>
|
||||
Ben M Cahill <ben.m.cahill@intel.com>
|
||||
Björn Steinbrink <B.Steinbrink@gmx.de>
|
||||
Brian Avery <b.avery@hp.com>
|
||||
Brian King <brking@us.ibm.com>
|
||||
Christoph Hellwig <hch@lst.de>
|
||||
Corey Minyard <minyard@acm.org>
|
||||
David Brownell <david-b@pacbell.net>
|
||||
David Woodhouse <dwmw2@shinybook.infradead.org>
|
||||
Domen Puncer <domen@coderock.org>
|
||||
Douglas Gilbert <dougg@torque.net>
|
||||
Ed L. Cashin <ecashin@coraid.com>
|
||||
Evgeniy Polyakov <johnpol@2ka.mipt.ru>
|
||||
Felipe W Damasio <felipewd@terra.com.br>
|
||||
Felix Kuhling <fxkuehl@gmx.de>
|
||||
Felix Moeller <felix@derklecks.de>
|
||||
Filipe Lautert <filipe@icewall.org>
|
||||
Franck Bui-Huu <vagabon.xyz@gmail.com>
|
||||
Frank Zago <fzago@systemfabricworks.com>
|
||||
Greg Kroah-Hartman <greg@echidna.(none)>
|
||||
Greg Kroah-Hartman <gregkh@suse.de>
|
||||
Greg Kroah-Hartman <greg@kroah.com>
|
||||
Henk Vergonet <Henk.Vergonet@gmail.com>
|
||||
Henrik Kretzschmar <henne@nachtwindheim.de>
|
||||
Herbert Xu <herbert@gondor.apana.org.au>
|
||||
Jacob Shin <Jacob.Shin@amd.com>
|
||||
James Bottomley <jejb@mulgrave.(none)>
|
||||
James Bottomley <jejb@titanic.il.steeleye.com>
|
||||
James E Wilson <wilson@specifix.com>
|
||||
James Ketrenos <jketreno@io.(none)>
|
||||
Jean Tourrilhes <jt@hpl.hp.com>
|
||||
Jeff Garzik <jgarzik@pretzel.yyz.us>
|
||||
Jens Axboe <axboe@suse.de>
|
||||
Jens Osterkamp <Jens.Osterkamp@de.ibm.com>
|
||||
John Stultz <johnstul@us.ibm.com>
|
||||
Juha Yrjola <at solidboot.com>
|
||||
Juha Yrjola <juha.yrjola@nokia.com>
|
||||
Juha Yrjola <juha.yrjola@solidboot.com>
|
||||
Kay Sievers <kay.sievers@vrfy.org>
|
||||
Kenneth W Chen <kenneth.w.chen@intel.com>
|
||||
Koushik <raghavendra.koushik@neterion.com>
|
||||
Leonid I Ananiev <leonid.i.ananiev@intel.com>
|
||||
Linas Vepstas <linas@austin.ibm.com>
|
||||
Matthieu CASTET <castet.matthieu@free.fr>
|
||||
Michael Buesch <mb@bu3sch.de>
|
||||
Michael Buesch <mbuesch@freenet.de>
|
||||
Michel Dänzer <michel@tungstengraphics.com>
|
||||
Mitesh shah <mshah@teja.com>
|
||||
Morten Welinder <terra@gnome.org>
|
||||
Morten Welinder <welinder@anemone.rentec.com>
|
||||
Morten Welinder <welinder@darter.rentec.com>
|
||||
Morten Welinder <welinder@troll.com>
|
||||
Nguyen Anh Quynh <aquynh@gmail.com>
|
||||
Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
|
||||
Patrick Mochel <mochel@digitalimplant.org>
|
||||
Peter A Jonsson <pj@ludd.ltu.se>
|
||||
Praveen BP <praveenbp@ti.com>
|
||||
Rajesh Shah <rajesh.shah@intel.com>
|
||||
Ralf Baechle <ralf@linux-mips.org>
|
||||
Ralf Wildenhues <Ralf.Wildenhues@gmx.de>
|
||||
Rémi Denis-Courmont <rdenis@simphalempin.com>
|
||||
Rudolf Marek <R.Marek@sh.cvut.cz>
|
||||
Rui Saraiva <rmps@joel.ist.utl.pt>
|
||||
Sachin P Sant <ssant@in.ibm.com>
|
||||
Sam Ravnborg <sam@mars.ravnborg.org>
|
||||
Simon Kelley <simon@thekelleys.org.uk>
|
||||
Stéphane Witzmann <stephane.witzmann@ubpmes.univ-bpclermont.fr>
|
||||
Stephen Hemminger <shemminger@osdl.org>
|
||||
Tejun Heo <htejun@gmail.com>
|
||||
Thomas Graf <tgraf@suug.ch>
|
||||
Tony Luck <tony.luck@intel.com>
|
||||
Tsuneo Yoshioka <Tsuneo.Yoshioka@f-secure.com>
|
||||
Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
|
||||
356
COPYING
Normal file
356
COPYING
Normal file
@ -0,0 +1,356 @@
|
||||
|
||||
NOTE! This copyright does *not* cover user programs that use kernel
|
||||
services by normal system calls - this is merely considered normal use
|
||||
of the kernel, and does *not* fall under the heading of "derived work".
|
||||
Also note that the GPL below is copyrighted by the Free Software
|
||||
Foundation, but the instance of code that it refers to (the Linux
|
||||
kernel) is copyrighted by me and others who actually wrote it.
|
||||
|
||||
Also note that the only valid version of the GPL as far as the kernel
|
||||
is concerned is _this_ particular version of the license (ie v2, not
|
||||
v2.2 or v3.x or whatever), unless explicitly otherwise stated.
|
||||
|
||||
Linus Torvalds
|
||||
|
||||
----------------------------------------
|
||||
|
||||
GNU GENERAL PUBLIC LICENSE
|
||||
Version 2, June 1991
|
||||
|
||||
Copyright (C) 1989, 1991 Free Software Foundation, Inc.
|
||||
51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
|
||||
Everyone is permitted to copy and distribute verbatim copies
|
||||
of this license document, but changing it is not allowed.
|
||||
|
||||
Preamble
|
||||
|
||||
The licenses for most software are designed to take away your
|
||||
freedom to share and change it. By contrast, the GNU General Public
|
||||
License is intended to guarantee your freedom to share and change free
|
||||
software--to make sure the software is free for all its users. This
|
||||
General Public License applies to most of the Free Software
|
||||
Foundation's software and to any other program whose authors commit to
|
||||
using it. (Some other Free Software Foundation software is covered by
|
||||
the GNU Library General Public License instead.) You can apply it to
|
||||
your programs, too.
|
||||
|
||||
When we speak of free software, we are referring to freedom, not
|
||||
price. Our General Public Licenses are designed to make sure that you
|
||||
have the freedom to distribute copies of free software (and charge for
|
||||
this service if you wish), that you receive source code or can get it
|
||||
if you want it, that you can change the software or use pieces of it
|
||||
in new free programs; and that you know you can do these things.
|
||||
|
||||
To protect your rights, we need to make restrictions that forbid
|
||||
anyone to deny you these rights or to ask you to surrender the rights.
|
||||
These restrictions translate to certain responsibilities for you if you
|
||||
distribute copies of the software, or if you modify it.
|
||||
|
||||
For example, if you distribute copies of such a program, whether
|
||||
gratis or for a fee, you must give the recipients all the rights that
|
||||
you have. You must make sure that they, too, receive or can get the
|
||||
source code. And you must show them these terms so they know their
|
||||
rights.
|
||||
|
||||
We protect your rights with two steps: (1) copyright the software, and
|
||||
(2) offer you this license which gives you legal permission to copy,
|
||||
distribute and/or modify the software.
|
||||
|
||||
Also, for each author's protection and ours, we want to make certain
|
||||
that everyone understands that there is no warranty for this free
|
||||
software. If the software is modified by someone else and passed on, we
|
||||
want its recipients to know that what they have is not the original, so
|
||||
that any problems introduced by others will not reflect on the original
|
||||
authors' reputations.
|
||||
|
||||
Finally, any free program is threatened constantly by software
|
||||
patents. We wish to avoid the danger that redistributors of a free
|
||||
program will individually obtain patent licenses, in effect making the
|
||||
program proprietary. To prevent this, we have made it clear that any
|
||||
patent must be licensed for everyone's free use or not licensed at all.
|
||||
|
||||
The precise terms and conditions for copying, distribution and
|
||||
modification follow.
|
||||
|
||||
GNU GENERAL PUBLIC LICENSE
|
||||
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
|
||||
|
||||
0. This License applies to any program or other work which contains
|
||||
a notice placed by the copyright holder saying it may be distributed
|
||||
under the terms of this General Public License. The "Program", below,
|
||||
refers to any such program or work, and a "work based on the Program"
|
||||
means either the Program or any derivative work under copyright law:
|
||||
that is to say, a work containing the Program or a portion of it,
|
||||
either verbatim or with modifications and/or translated into another
|
||||
language. (Hereinafter, translation is included without limitation in
|
||||
the term "modification".) Each licensee is addressed as "you".
|
||||
|
||||
Activities other than copying, distribution and modification are not
|
||||
covered by this License; they are outside its scope. The act of
|
||||
running the Program is not restricted, and the output from the Program
|
||||
is covered only if its contents constitute a work based on the
|
||||
Program (independent of having been made by running the Program).
|
||||
Whether that is true depends on what the Program does.
|
||||
|
||||
1. You may copy and distribute verbatim copies of the Program's
|
||||
source code as you receive it, in any medium, provided that you
|
||||
conspicuously and appropriately publish on each copy an appropriate
|
||||
copyright notice and disclaimer of warranty; keep intact all the
|
||||
notices that refer to this License and to the absence of any warranty;
|
||||
and give any other recipients of the Program a copy of this License
|
||||
along with the Program.
|
||||
|
||||
You may charge a fee for the physical act of transferring a copy, and
|
||||
you may at your option offer warranty protection in exchange for a fee.
|
||||
|
||||
2. You may modify your copy or copies of the Program or any portion
|
||||
of it, thus forming a work based on the Program, and copy and
|
||||
distribute such modifications or work under the terms of Section 1
|
||||
above, provided that you also meet all of these conditions:
|
||||
|
||||
a) You must cause the modified files to carry prominent notices
|
||||
stating that you changed the files and the date of any change.
|
||||
|
||||
b) You must cause any work that you distribute or publish, that in
|
||||
whole or in part contains or is derived from the Program or any
|
||||
part thereof, to be licensed as a whole at no charge to all third
|
||||
parties under the terms of this License.
|
||||
|
||||
c) If the modified program normally reads commands interactively
|
||||
when run, you must cause it, when started running for such
|
||||
interactive use in the most ordinary way, to print or display an
|
||||
announcement including an appropriate copyright notice and a
|
||||
notice that there is no warranty (or else, saying that you provide
|
||||
a warranty) and that users may redistribute the program under
|
||||
these conditions, and telling the user how to view a copy of this
|
||||
License. (Exception: if the Program itself is interactive but
|
||||
does not normally print such an announcement, your work based on
|
||||
the Program is not required to print an announcement.)
|
||||
|
||||
These requirements apply to the modified work as a whole. If
|
||||
identifiable sections of that work are not derived from the Program,
|
||||
and can be reasonably considered independent and separate works in
|
||||
themselves, then this License, and its terms, do not apply to those
|
||||
sections when you distribute them as separate works. But when you
|
||||
distribute the same sections as part of a whole which is a work based
|
||||
on the Program, the distribution of the whole must be on the terms of
|
||||
this License, whose permissions for other licensees extend to the
|
||||
entire whole, and thus to each and every part regardless of who wrote it.
|
||||
|
||||
Thus, it is not the intent of this section to claim rights or contest
|
||||
your rights to work written entirely by you; rather, the intent is to
|
||||
exercise the right to control the distribution of derivative or
|
||||
collective works based on the Program.
|
||||
|
||||
In addition, mere aggregation of another work not based on the Program
|
||||
with the Program (or with a work based on the Program) on a volume of
|
||||
a storage or distribution medium does not bring the other work under
|
||||
the scope of this License.
|
||||
|
||||
3. You may copy and distribute the Program (or a work based on it,
|
||||
under Section 2) in object code or executable form under the terms of
|
||||
Sections 1 and 2 above provided that you also do one of the following:
|
||||
|
||||
a) Accompany it with the complete corresponding machine-readable
|
||||
source code, which must be distributed under the terms of Sections
|
||||
1 and 2 above on a medium customarily used for software interchange; or,
|
||||
|
||||
b) Accompany it with a written offer, valid for at least three
|
||||
years, to give any third party, for a charge no more than your
|
||||
cost of physically performing source distribution, a complete
|
||||
machine-readable copy of the corresponding source code, to be
|
||||
distributed under the terms of Sections 1 and 2 above on a medium
|
||||
customarily used for software interchange; or,
|
||||
|
||||
c) Accompany it with the information you received as to the offer
|
||||
to distribute corresponding source code. (This alternative is
|
||||
allowed only for noncommercial distribution and only if you
|
||||
received the program in object code or executable form with such
|
||||
an offer, in accord with Subsection b above.)
|
||||
|
||||
The source code for a work means the preferred form of the work for
|
||||
making modifications to it. For an executable work, complete source
|
||||
code means all the source code for all modules it contains, plus any
|
||||
associated interface definition files, plus the scripts used to
|
||||
control compilation and installation of the executable. However, as a
|
||||
special exception, the source code distributed need not include
|
||||
anything that is normally distributed (in either source or binary
|
||||
form) with the major components (compiler, kernel, and so on) of the
|
||||
operating system on which the executable runs, unless that component
|
||||
itself accompanies the executable.
|
||||
|
||||
If distribution of executable or object code is made by offering
|
||||
access to copy from a designated place, then offering equivalent
|
||||
access to copy the source code from the same place counts as
|
||||
distribution of the source code, even though third parties are not
|
||||
compelled to copy the source along with the object code.
|
||||
|
||||
4. You may not copy, modify, sublicense, or distribute the Program
|
||||
except as expressly provided under this License. Any attempt
|
||||
otherwise to copy, modify, sublicense or distribute the Program is
|
||||
void, and will automatically terminate your rights under this License.
|
||||
However, parties who have received copies, or rights, from you under
|
||||
this License will not have their licenses terminated so long as such
|
||||
parties remain in full compliance.
|
||||
|
||||
5. You are not required to accept this License, since you have not
|
||||
signed it. However, nothing else grants you permission to modify or
|
||||
distribute the Program or its derivative works. These actions are
|
||||
prohibited by law if you do not accept this License. Therefore, by
|
||||
modifying or distributing the Program (or any work based on the
|
||||
Program), you indicate your acceptance of this License to do so, and
|
||||
all its terms and conditions for copying, distributing or modifying
|
||||
the Program or works based on it.
|
||||
|
||||
6. Each time you redistribute the Program (or any work based on the
|
||||
Program), the recipient automatically receives a license from the
|
||||
original licensor to copy, distribute or modify the Program subject to
|
||||
these terms and conditions. You may not impose any further
|
||||
restrictions on the recipients' exercise of the rights granted herein.
|
||||
You are not responsible for enforcing compliance by third parties to
|
||||
this License.
|
||||
|
||||
7. If, as a consequence of a court judgment or allegation of patent
|
||||
infringement or for any other reason (not limited to patent issues),
|
||||
conditions are imposed on you (whether by court order, agreement or
|
||||
otherwise) that contradict the conditions of this License, they do not
|
||||
excuse you from the conditions of this License. If you cannot
|
||||
distribute so as to satisfy simultaneously your obligations under this
|
||||
License and any other pertinent obligations, then as a consequence you
|
||||
may not distribute the Program at all. For example, if a patent
|
||||
license would not permit royalty-free redistribution of the Program by
|
||||
all those who receive copies directly or indirectly through you, then
|
||||
the only way you could satisfy both it and this License would be to
|
||||
refrain entirely from distribution of the Program.
|
||||
|
||||
If any portion of this section is held invalid or unenforceable under
|
||||
any particular circumstance, the balance of the section is intended to
|
||||
apply and the section as a whole is intended to apply in other
|
||||
circumstances.
|
||||
|
||||
It is not the purpose of this section to induce you to infringe any
|
||||
patents or other property right claims or to contest validity of any
|
||||
such claims; this section has the sole purpose of protecting the
|
||||
integrity of the free software distribution system, which is
|
||||
implemented by public license practices. Many people have made
|
||||
generous contributions to the wide range of software distributed
|
||||
through that system in reliance on consistent application of that
|
||||
system; it is up to the author/donor to decide if he or she is willing
|
||||
to distribute software through any other system and a licensee cannot
|
||||
impose that choice.
|
||||
|
||||
This section is intended to make thoroughly clear what is believed to
|
||||
be a consequence of the rest of this License.
|
||||
|
||||
8. If the distribution and/or use of the Program is restricted in
|
||||
certain countries either by patents or by copyrighted interfaces, the
|
||||
original copyright holder who places the Program under this License
|
||||
may add an explicit geographical distribution limitation excluding
|
||||
those countries, so that distribution is permitted only in or among
|
||||
countries not thus excluded. In such case, this License incorporates
|
||||
the limitation as if written in the body of this License.
|
||||
|
||||
9. The Free Software Foundation may publish revised and/or new versions
|
||||
of the General Public License from time to time. Such new versions will
|
||||
be similar in spirit to the present version, but may differ in detail to
|
||||
address new problems or concerns.
|
||||
|
||||
Each version is given a distinguishing version number. If the Program
|
||||
specifies a version number of this License which applies to it and "any
|
||||
later version", you have the option of following the terms and conditions
|
||||
either of that version or of any later version published by the Free
|
||||
Software Foundation. If the Program does not specify a version number of
|
||||
this License, you may choose any version ever published by the Free Software
|
||||
Foundation.
|
||||
|
||||
10. If you wish to incorporate parts of the Program into other free
|
||||
programs whose distribution conditions are different, write to the author
|
||||
to ask for permission. For software which is copyrighted by the Free
|
||||
Software Foundation, write to the Free Software Foundation; we sometimes
|
||||
make exceptions for this. Our decision will be guided by the two goals
|
||||
of preserving the free status of all derivatives of our free software and
|
||||
of promoting the sharing and reuse of software generally.
|
||||
|
||||
NO WARRANTY
|
||||
|
||||
11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
|
||||
FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
|
||||
OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
|
||||
PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
|
||||
OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
|
||||
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
|
||||
TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
|
||||
PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
|
||||
REPAIR OR CORRECTION.
|
||||
|
||||
12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
|
||||
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
|
||||
REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
|
||||
INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
|
||||
OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
|
||||
TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
|
||||
YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
|
||||
PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
|
||||
POSSIBILITY OF SUCH DAMAGES.
|
||||
|
||||
END OF TERMS AND CONDITIONS
|
||||
|
||||
How to Apply These Terms to Your New Programs
|
||||
|
||||
If you develop a new program, and you want it to be of the greatest
|
||||
possible use to the public, the best way to achieve this is to make it
|
||||
free software which everyone can redistribute and change under these terms.
|
||||
|
||||
To do so, attach the following notices to the program. It is safest
|
||||
to attach them to the start of each source file to most effectively
|
||||
convey the exclusion of warranty; and each file should have at least
|
||||
the "copyright" line and a pointer to where the full notice is found.
|
||||
|
||||
<one line to give the program's name and a brief idea of what it does.>
|
||||
Copyright (C) <year> <name of author>
|
||||
|
||||
This program is free software; you can redistribute it and/or modify
|
||||
it under the terms of the GNU General Public License as published by
|
||||
the Free Software Foundation; either version 2 of the License, or
|
||||
(at your option) any later version.
|
||||
|
||||
This program is distributed in the hope that it will be useful,
|
||||
but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
||||
GNU General Public License for more details.
|
||||
|
||||
You should have received a copy of the GNU General Public License
|
||||
along with this program; if not, write to the Free Software
|
||||
Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
|
||||
|
||||
|
||||
Also add information on how to contact you by electronic and paper mail.
|
||||
|
||||
If the program is interactive, make it output a short notice like this
|
||||
when it starts in an interactive mode:
|
||||
|
||||
Gnomovision version 69, Copyright (C) year name of author
|
||||
Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
|
||||
This is free software, and you are welcome to redistribute it
|
||||
under certain conditions; type `show c' for details.
|
||||
|
||||
The hypothetical commands `show w' and `show c' should show the appropriate
|
||||
parts of the General Public License. Of course, the commands you use may
|
||||
be called something other than `show w' and `show c'; they could even be
|
||||
mouse-clicks or menu items--whatever suits your program.
|
||||
|
||||
You should also get your employer (if you work as a programmer) or your
|
||||
school, if any, to sign a "copyright disclaimer" for the program, if
|
||||
necessary. Here is a sample; alter the names:
|
||||
|
||||
Yoyodyne, Inc., hereby disclaims all copyright interest in the program
|
||||
`Gnomovision' (which makes passes at compilers) written by James Hacker.
|
||||
|
||||
<signature of Ty Coon>, 1 April 1989
|
||||
Ty Coon, President of Vice
|
||||
|
||||
This General Public License does not permit incorporating your program into
|
||||
proprietary programs. If your program is a subroutine library, you may
|
||||
consider it more useful to permit linking proprietary applications with the
|
||||
library. If this is what you want to do, use the GNU Library General
|
||||
Public License instead of this License.
|
||||
302
Documentation/00-INDEX
Normal file
302
Documentation/00-INDEX
Normal file
@ -0,0 +1,302 @@
|
||||
|
||||
This is a brief list of all the files in ./linux/Documentation and what
|
||||
they contain. If you add a documentation file, please list it here in
|
||||
alphabetical order as well, or risk being hunted down like a rabid dog.
|
||||
Please try and keep the descriptions small enough to fit on one line.
|
||||
Thanks -- Paul G.
|
||||
|
||||
Following translations are available on the WWW:
|
||||
|
||||
- Japanese, maintained by the JF Project (JF@linux.or.jp), at
|
||||
http://www.linux.or.jp/JF/
|
||||
|
||||
00-INDEX
|
||||
- this file.
|
||||
BUG-HUNTING
|
||||
- brute force method of doing binary search of patches to find bug.
|
||||
Changes
|
||||
- list of changes that break older software packages.
|
||||
CodingStyle
|
||||
- how the boss likes the C code in the kernel to look.
|
||||
DMA-API.txt
|
||||
- DMA API, pci_ API & extensions for non-consistent memory machines.
|
||||
DMA-mapping.txt
|
||||
- info for PCI drivers using DMA portably across all platforms.
|
||||
DocBook/
|
||||
- directory with DocBook templates etc. for kernel documentation.
|
||||
HOWTO
|
||||
- The process and procedures of how to do Linux kernel development.
|
||||
IO-mapping.txt
|
||||
- how to access I/O mapped memory from within device drivers.
|
||||
IPMI.txt
|
||||
- info on Linux Intelligent Platform Management Interface (IPMI) Driver.
|
||||
IRQ-affinity.txt
|
||||
- how to select which CPU(s) handle which interrupt events on SMP.
|
||||
ManagementStyle
|
||||
- how to (attempt to) manage kernel hackers.
|
||||
MSI-HOWTO.txt
|
||||
- the Message Signaled Interrupts (MSI) Driver Guide HOWTO and FAQ.
|
||||
RCU/
|
||||
- directory with info on RCU (read-copy update).
|
||||
README.DAC960
|
||||
- info on Mylex DAC960/DAC1100 PCI RAID Controller Driver for Linux.
|
||||
SAK.txt
|
||||
- info on Secure Attention Keys.
|
||||
SubmittingDrivers
|
||||
- procedure to get a new driver source included into the kernel tree.
|
||||
SubmittingPatches
|
||||
- procedure to get a source patch included into the kernel tree.
|
||||
VGA-softcursor.txt
|
||||
- how to change your VGA cursor from a blinking underscore.
|
||||
applying-patches.txt
|
||||
- description of various trees and how to apply their patches.
|
||||
arm/
|
||||
- directory with info about Linux on the ARM architecture.
|
||||
basic_profiling.txt
|
||||
- basic instructions for those who wants to profile Linux kernel.
|
||||
binfmt_misc.txt
|
||||
- info on the kernel support for extra binary formats.
|
||||
block/
|
||||
- info on the Block I/O (BIO) layer.
|
||||
cachetlb.txt
|
||||
- describes the cache/TLB flushing interfaces Linux uses.
|
||||
cciss.txt
|
||||
- info, major/minor #'s for Compaq's SMART Array Controllers.
|
||||
cdrom/
|
||||
- directory with information on the CD-ROM drivers that Linux has.
|
||||
cli-sti-removal.txt
|
||||
- cli()/sti() removal guide.
|
||||
computone.txt
|
||||
- info on Computone Intelliport II/Plus Multiport Serial Driver.
|
||||
cpqarray.txt
|
||||
- info on using Compaq's SMART2 Intelligent Disk Array Controllers.
|
||||
cpu-freq/
|
||||
- info on CPU frequency and voltage scaling.
|
||||
cris/
|
||||
- directory with info about Linux on CRIS architecture.
|
||||
crypto/
|
||||
- directory with info on the Crypto API.
|
||||
debugging-modules.txt
|
||||
- some notes on debugging modules after Linux 2.6.3.
|
||||
device-mapper/
|
||||
- directory with info on Device Mapper.
|
||||
devices.txt
|
||||
- plain ASCII listing of all the nodes in /dev/ with major minor #'s.
|
||||
digiepca.txt
|
||||
- info on Digi Intl. {PC,PCI,EISA}Xx and Xem series cards.
|
||||
dnotify.txt
|
||||
- info about directory notification in Linux.
|
||||
driver-model/
|
||||
- directory with info about Linux driver model.
|
||||
dvb/
|
||||
- info on Linux Digital Video Broadcast (DVB) subsystem.
|
||||
early-userspace/
|
||||
- info about initramfs, klibc, and userspace early during boot.
|
||||
eisa.txt
|
||||
- info on EISA bus support.
|
||||
exception.txt
|
||||
- how Linux v2.2 handles exceptions without verify_area etc.
|
||||
fb/
|
||||
- directory with info on the frame buffer graphics abstraction layer.
|
||||
filesystems/
|
||||
- directory with info on the various filesystems that Linux supports.
|
||||
firmware_class/
|
||||
- request_firmware() hotplug interface info.
|
||||
floppy.txt
|
||||
- notes and driver options for the floppy disk driver.
|
||||
hayes-esp.txt
|
||||
- info on using the Hayes ESP serial driver.
|
||||
highuid.txt
|
||||
- notes on the change from 16 bit to 32 bit user/group IDs.
|
||||
hpet.txt
|
||||
- High Precision Event Timer Driver for Linux.
|
||||
hw_random.txt
|
||||
- info on Linux support for random number generator in i8xx chipsets.
|
||||
i2c/
|
||||
- directory with info about the I2C bus/protocol (2 wire, kHz speed).
|
||||
i2o/
|
||||
- directory with info about the Linux I2O subsystem.
|
||||
i386/
|
||||
- directory with info about Linux on Intel 32 bit architecture.
|
||||
ia64/
|
||||
- directory with info about Linux on Intel 64 bit architecture.
|
||||
ide.txt
|
||||
- important info for users of ATA devices (IDE/EIDE disks and CD-ROMS).
|
||||
initrd.txt
|
||||
- how to use the RAM disk as an initial/temporary root filesystem.
|
||||
input/
|
||||
- info on Linux input device support.
|
||||
io_ordering.txt
|
||||
- info on ordering I/O writes to memory-mapped addresses.
|
||||
ioctl-number.txt
|
||||
- how to implement and register device/driver ioctl calls.
|
||||
iostats.txt
|
||||
- info on I/O statistics Linux kernel provides.
|
||||
isapnp.txt
|
||||
- info on Linux ISA Plug & Play support.
|
||||
isdn/
|
||||
- directory with info on the Linux ISDN support, and supported cards.
|
||||
java.txt
|
||||
- info on the in-kernel binary support for Java(tm).
|
||||
kbuild/
|
||||
- directory with info about the kernel build process.
|
||||
kdumpt.txt
|
||||
- mini HowTo on getting the crash dump code to work.
|
||||
kernel-doc-nano-HOWTO.txt
|
||||
- mini HowTo on generation and location of kernel documentation files.
|
||||
kernel-docs.txt
|
||||
- listing of various WWW + books that document kernel internals.
|
||||
kernel-parameters.txt
|
||||
- summary listing of command line / boot prompt args for the kernel.
|
||||
kobject.txt
|
||||
- info of the kobject infrastructure of the Linux kernel.
|
||||
laptop-mode.txt
|
||||
- How to conserve battery power using laptop-mode.
|
||||
ldm.txt
|
||||
- a brief description of LDM (Windows Dynamic Disks).
|
||||
locks.txt
|
||||
- info on file locking implementations, flock() vs. fcntl(), etc.
|
||||
logo.gif
|
||||
- Full colour GIF image of Linux logo (penguin).
|
||||
logo.txt
|
||||
- Info on creator of above logo & site to get additional images from.
|
||||
m68k/
|
||||
- directory with info about Linux on Motorola 68k architecture.
|
||||
magic-number.txt
|
||||
- list of magic numbers used to mark/protect kernel data structures.
|
||||
mandatory.txt
|
||||
- info on the Linux implementation of Sys V mandatory file locking.
|
||||
mca.txt
|
||||
- info on supporting Micro Channel Architecture (e.g. PS/2) systems.
|
||||
md.txt
|
||||
- info on boot arguments for the multiple devices driver.
|
||||
memory.txt
|
||||
- info on typical Linux memory problems.
|
||||
mips/
|
||||
- directory with info about Linux on MIPS architecture.
|
||||
mono.txt
|
||||
- how to execute Mono-based .NET binaries with the help of BINFMT_MISC.
|
||||
moxa-smartio
|
||||
- info on installing/using Moxa multiport serial driver.
|
||||
mtrr.txt
|
||||
- how to use PPro Memory Type Range Registers to increase performance.
|
||||
nbd.txt
|
||||
- info on a TCP implementation of a network block device.
|
||||
netlabel/
|
||||
- directory with information on the NetLabel subsystem.
|
||||
networking/
|
||||
- directory with info on various aspects of networking with Linux.
|
||||
nfsroot.txt
|
||||
- short guide on setting up a diskless box with NFS root filesystem.
|
||||
nmi_watchdog.txt
|
||||
- info on NMI watchdog for SMP systems.
|
||||
numastat.txt
|
||||
- info on how to read Numa policy hit/miss statistics in sysfs.
|
||||
oops-tracing.txt
|
||||
- how to decode those nasty internal kernel error dump messages.
|
||||
paride.txt
|
||||
- information about the parallel port IDE subsystem.
|
||||
parisc/
|
||||
- directory with info on using Linux on PA-RISC architecture.
|
||||
parport.txt
|
||||
- how to use the parallel-port driver.
|
||||
parport-lowlevel.txt
|
||||
- description and usage of the low level parallel port functions.
|
||||
pci.txt
|
||||
- info on the PCI subsystem for device driver authors.
|
||||
pm.txt
|
||||
- info on Linux power management support.
|
||||
pnp.txt
|
||||
- Linux Plug and Play documentation.
|
||||
power/
|
||||
- directory with info on Linux PCI power management.
|
||||
powerpc/
|
||||
- directory with info on using Linux with the PowerPC.
|
||||
preempt-locking.txt
|
||||
- info on locking under a preemptive kernel.
|
||||
ramdisk.txt
|
||||
- short guide on how to set up and use the RAM disk.
|
||||
riscom8.txt
|
||||
- notes on using the RISCom/8 multi-port serial driver.
|
||||
rocket.txt
|
||||
- info on the Comtrol RocketPort multiport serial driver.
|
||||
rpc-cache.txt
|
||||
- introduction to the caching mechanisms in the sunrpc layer.
|
||||
rtc.txt
|
||||
- notes on how to use the Real Time Clock (aka CMOS clock) driver.
|
||||
s390/
|
||||
- directory with info on using Linux on the IBM S390.
|
||||
sched-coding.txt
|
||||
- reference for various scheduler-related methods in the O(1) scheduler.
|
||||
sched-design.txt
|
||||
- goals, design and implementation of the Linux O(1) scheduler.
|
||||
sched-domains.txt
|
||||
- information on scheduling domains.
|
||||
sched-stats.txt
|
||||
- information on schedstats (Linux Scheduler Statistics).
|
||||
scsi/
|
||||
- directory with info on Linux scsi support.
|
||||
serial/
|
||||
- directory with info on the low level serial API.
|
||||
serial-console.txt
|
||||
- how to set up Linux with a serial line console as the default.
|
||||
sgi-visws.txt
|
||||
- short blurb on the SGI Visual Workstations.
|
||||
sh/
|
||||
- directory with info on porting Linux to a new architecture.
|
||||
smart-config.txt
|
||||
- description of the Smart Config makefile feature.
|
||||
smp.txt
|
||||
- a few notes on symmetric multi-processing.
|
||||
sonypi.txt
|
||||
- info on Linux Sony Programmable I/O Device support.
|
||||
sound/
|
||||
- directory with info on sound card support.
|
||||
sparc/
|
||||
- directory with info on using Linux on Sparc architecture.
|
||||
specialix.txt
|
||||
- info on hardware/driver for specialix IO8+ multiport serial card.
|
||||
spinlocks.txt
|
||||
- info on using spinlocks to provide exclusive access in kernel.
|
||||
stable_api_nonsense.txt
|
||||
- info on why the kernel does not have a stable in-kernel api or abi.
|
||||
stable_kernel_rules.txt
|
||||
- rules and procedures for the -stable kernel releases.
|
||||
stallion.txt
|
||||
- info on using the Stallion multiport serial driver.
|
||||
svga.txt
|
||||
- short guide on selecting video modes at boot via VGA BIOS.
|
||||
sx.txt
|
||||
- info on the Specialix SX/SI multiport serial driver.
|
||||
sysctl/
|
||||
- directory with info on the /proc/sys/* files.
|
||||
sysrq.txt
|
||||
- info on the magic SysRq key.
|
||||
telephony/
|
||||
- directory with info on telephony (e.g. voice over IP) support.
|
||||
time_interpolators.txt
|
||||
- info on time interpolators.
|
||||
tipar.txt
|
||||
- information about Parallel link cable for Texas Instruments handhelds.
|
||||
tty.txt
|
||||
- guide to the locking policies of the tty layer.
|
||||
unicode.txt
|
||||
- info on the Unicode character/font mapping used in Linux.
|
||||
uml/
|
||||
- directory with information about User Mode Linux.
|
||||
usb/
|
||||
- directory with info regarding the Universal Serial Bus.
|
||||
video4linux/
|
||||
- directory with info regarding video/TV/radio cards and linux.
|
||||
vm/
|
||||
- directory with info on the Linux vm code.
|
||||
voyager.txt
|
||||
- guide to running Linux on the Voyager architecture.
|
||||
watchdog/
|
||||
- how to auto-reboot Linux if it has "fallen and can't get up". ;-)
|
||||
x86_64/
|
||||
- directory with info on Linux support for AMD x86-64 (Hammer) machines.
|
||||
xterm-linux.xpm
|
||||
- XPM image of penguin logo (see logo.txt) sitting on an xterm.
|
||||
zorro.txt
|
||||
- info on writing drivers for Zorro bus devices found on Amigas.
|
||||
77
Documentation/ABI/README
Normal file
77
Documentation/ABI/README
Normal file
@ -0,0 +1,77 @@
|
||||
This directory attempts to document the ABI between the Linux kernel and
|
||||
userspace, and the relative stability of these interfaces. Due to the
|
||||
everchanging nature of Linux, and the differing maturity levels, these
|
||||
interfaces should be used by userspace programs in different ways.
|
||||
|
||||
We have four different levels of ABI stability, as shown by the four
|
||||
different subdirectories in this location. Interfaces may change levels
|
||||
of stability according to the rules described below.
|
||||
|
||||
The different levels of stability are:
|
||||
|
||||
stable/
|
||||
This directory documents the interfaces that the developer has
|
||||
defined to be stable. Userspace programs are free to use these
|
||||
interfaces with no restrictions, and backward compatibility for
|
||||
them will be guaranteed for at least 2 years. Most interfaces
|
||||
(like syscalls) are expected to never change and always be
|
||||
available.
|
||||
|
||||
testing/
|
||||
This directory documents interfaces that are felt to be stable,
|
||||
as the main development of this interface has been completed.
|
||||
The interface can be changed to add new features, but the
|
||||
current interface will not break by doing this, unless grave
|
||||
errors or security problems are found in them. Userspace
|
||||
programs can start to rely on these interfaces, but they must be
|
||||
aware of changes that can occur before these interfaces move to
|
||||
be marked stable. Programs that use these interfaces are
|
||||
strongly encouraged to add their name to the description of
|
||||
these interfaces, so that the kernel developers can easily
|
||||
notify them if any changes occur (see the description of the
|
||||
layout of the files below for details on how to do this.)
|
||||
|
||||
obsolete/
|
||||
This directory documents interfaces that are still remaining in
|
||||
the kernel, but are marked to be removed at some later point in
|
||||
time. The description of the interface will document the reason
|
||||
why it is obsolete and when it can be expected to be removed.
|
||||
The file Documentation/feature-removal-schedule.txt may describe
|
||||
some of these interfaces, giving a schedule for when they will
|
||||
be removed.
|
||||
|
||||
removed/
|
||||
This directory contains a list of the old interfaces that have
|
||||
been removed from the kernel.
|
||||
|
||||
Every file in these directories will contain the following information:
|
||||
|
||||
What: Short description of the interface
|
||||
Date: Date created
|
||||
KernelVersion: Kernel version this feature first showed up in.
|
||||
Contact: Primary contact for this interface (may be a mailing list)
|
||||
Description: Long description of the interface and how to use it.
|
||||
Users: All users of this interface who wish to be notified when
|
||||
it changes. This is very important for interfaces in
|
||||
the "testing" stage, so that kernel developers can work
|
||||
with userspace developers to ensure that things do not
|
||||
break in ways that are unacceptable. It is also
|
||||
important to get feedback for these interfaces to make
|
||||
sure they are working in a proper way and do not need to
|
||||
be changed further.
|
||||
|
||||
|
||||
How things move between levels:
|
||||
|
||||
Interfaces in stable may move to obsolete, as long as the proper
|
||||
notification is given.
|
||||
|
||||
Interfaces may be removed from obsolete and the kernel as long as the
|
||||
documented amount of time has gone by.
|
||||
|
||||
Interfaces in the testing state can move to the stable state when the
|
||||
developers feel they are finished. They cannot be removed from the
|
||||
kernel tree without going through the obsolete state first.
|
||||
|
||||
It's up to the developer to place their interfaces in the category they
|
||||
wish for it to start out in.
|
||||
9
Documentation/ABI/obsolete/dv1394
Normal file
9
Documentation/ABI/obsolete/dv1394
Normal file
@ -0,0 +1,9 @@
|
||||
What: dv1394 (a.k.a. "OHCI-DV I/O support" for FireWire)
|
||||
Contact: linux1394-devel@lists.sourceforge.net
|
||||
Description:
|
||||
New application development should use raw1394 + userspace libraries
|
||||
instead, notably libiec61883 which is functionally equivalent.
|
||||
|
||||
Users:
|
||||
ffmpeg/libavformat (used by a variety of media players)
|
||||
dvgrab v1.x (replaced by dvgrab2 on top of raw1394 and resp. libraries)
|
||||
12
Documentation/ABI/removed/devfs
Normal file
12
Documentation/ABI/removed/devfs
Normal file
@ -0,0 +1,12 @@
|
||||
What: devfs
|
||||
Date: July 2005 (scheduled), finally removed in kernel v2.6.18
|
||||
Contact: Greg Kroah-Hartman <gregkh@suse.de>
|
||||
Description:
|
||||
devfs has been unmaintained for a number of years, has unfixable
|
||||
races, contains a naming policy within the kernel that is
|
||||
against the LSB, and can be replaced by using udev.
|
||||
The files fs/devfs/*, include/linux/devfs_fs*.h were removed,
|
||||
along with the the assorted devfs function calls throughout the
|
||||
kernel tree.
|
||||
|
||||
Users:
|
||||
10
Documentation/ABI/stable/syscalls
Normal file
10
Documentation/ABI/stable/syscalls
Normal file
@ -0,0 +1,10 @@
|
||||
What: The kernel syscall interface
|
||||
Description:
|
||||
This interface matches much of the POSIX interface and is based
|
||||
on it and other Unix based interfaces. It will only be added to
|
||||
over time, and not have things removed from it.
|
||||
|
||||
Note that this interface is different for every architecture
|
||||
that Linux supports. Please see the architecture-specific
|
||||
documentation for details on the syscall numbers that are to be
|
||||
mapped to each syscall.
|
||||
30
Documentation/ABI/stable/sysfs-module
Normal file
30
Documentation/ABI/stable/sysfs-module
Normal file
@ -0,0 +1,30 @@
|
||||
What: /sys/module
|
||||
Description:
|
||||
The /sys/module tree consists of the following structure:
|
||||
|
||||
/sys/module/MODULENAME
|
||||
The name of the module that is in the kernel. This
|
||||
module name will show up either if the module is built
|
||||
directly into the kernel, or if it is loaded as a
|
||||
dyanmic module.
|
||||
|
||||
/sys/module/MODULENAME/parameters
|
||||
This directory contains individual files that are each
|
||||
individual parameters of the module that are able to be
|
||||
changed at runtime. See the individual module
|
||||
documentation as to the contents of these parameters and
|
||||
what they accomplish.
|
||||
|
||||
Note: The individual parameter names and values are not
|
||||
considered stable, only the fact that they will be
|
||||
placed in this location within sysfs. See the
|
||||
individual driver documentation for details as to the
|
||||
stability of the different parameters.
|
||||
|
||||
/sys/module/MODULENAME/refcnt
|
||||
If the module is able to be unloaded from the kernel, this file
|
||||
will contain the current reference count of the module.
|
||||
|
||||
Note: If the module is built into the kernel, or if the
|
||||
CONFIG_MODULE_UNLOAD kernel configuration value is not enabled,
|
||||
this file will not be present.
|
||||
19
Documentation/ABI/testing/debugfs-pktcdvd
Normal file
19
Documentation/ABI/testing/debugfs-pktcdvd
Normal file
@ -0,0 +1,19 @@
|
||||
What: /debug/pktcdvd/pktcdvd[0-7]
|
||||
Date: Oct. 2006
|
||||
KernelVersion: 2.6.20
|
||||
Contact: Thomas Maier <balagi@justmail.de>
|
||||
Description:
|
||||
|
||||
debugfs interface
|
||||
-----------------
|
||||
|
||||
The pktcdvd module (packet writing driver) creates
|
||||
these files in debugfs:
|
||||
|
||||
/debug/pktcdvd/pktcdvd[0-7]/
|
||||
info (0444) Lots of driver statistics and infos.
|
||||
|
||||
Example:
|
||||
-------
|
||||
|
||||
cat /debug/pktcdvd/pktcdvd0/info
|
||||
16
Documentation/ABI/testing/sysfs-class
Normal file
16
Documentation/ABI/testing/sysfs-class
Normal file
@ -0,0 +1,16 @@
|
||||
What: /sys/class/
|
||||
Date: Febuary 2006
|
||||
Contact: Greg Kroah-Hartman <gregkh@suse.de>
|
||||
Description:
|
||||
The /sys/class directory will consist of a group of
|
||||
subdirectories describing individual classes of devices
|
||||
in the kernel. The individual directories will consist
|
||||
of either subdirectories, or symlinks to other
|
||||
directories.
|
||||
|
||||
All programs that use this directory tree must be able
|
||||
to handle both subdirectories or symlinks in order to
|
||||
work properly.
|
||||
|
||||
Users:
|
||||
udev <linux-hotplug-devel@lists.sourceforge.net>
|
||||
72
Documentation/ABI/testing/sysfs-class-pktcdvd
Normal file
72
Documentation/ABI/testing/sysfs-class-pktcdvd
Normal file
@ -0,0 +1,72 @@
|
||||
What: /sys/class/pktcdvd/
|
||||
Date: Oct. 2006
|
||||
KernelVersion: 2.6.20
|
||||
Contact: Thomas Maier <balagi@justmail.de>
|
||||
Description:
|
||||
|
||||
sysfs interface
|
||||
---------------
|
||||
|
||||
The pktcdvd module (packet writing driver) creates
|
||||
these files in the sysfs:
|
||||
(<devid> is in format major:minor )
|
||||
|
||||
/sys/class/pktcdvd/
|
||||
add (0200) Write a block device id (major:minor)
|
||||
to create a new pktcdvd device and map
|
||||
it to the block device.
|
||||
|
||||
remove (0200) Write the pktcdvd device id (major:minor)
|
||||
to it to remove the pktcdvd device.
|
||||
|
||||
device_map (0444) Shows the device mapping in format:
|
||||
pktcdvd[0-7] <pktdevid> <blkdevid>
|
||||
|
||||
/sys/class/pktcdvd/pktcdvd[0-7]/
|
||||
dev (0444) Device id
|
||||
uevent (0200) To send an uevent.
|
||||
|
||||
/sys/class/pktcdvd/pktcdvd[0-7]/stat/
|
||||
packets_started (0444) Number of started packets.
|
||||
packets_finished (0444) Number of finished packets.
|
||||
|
||||
kb_written (0444) kBytes written.
|
||||
kb_read (0444) kBytes read.
|
||||
kb_read_gather (0444) kBytes read to fill write packets.
|
||||
|
||||
reset (0200) Write any value to it to reset
|
||||
pktcdvd device statistic values, like
|
||||
bytes read/written.
|
||||
|
||||
/sys/class/pktcdvd/pktcdvd[0-7]/write_queue/
|
||||
size (0444) Contains the size of the bio write
|
||||
queue.
|
||||
|
||||
congestion_off (0644) If bio write queue size is below
|
||||
this mark, accept new bio requests
|
||||
from the block layer.
|
||||
|
||||
congestion_on (0644) If bio write queue size is higher
|
||||
as this mark, do no longer accept
|
||||
bio write requests from the block
|
||||
layer and wait till the pktcdvd
|
||||
device has processed enough bio's
|
||||
so that bio write queue size is
|
||||
below congestion off mark.
|
||||
A value of <= 0 disables congestion
|
||||
control.
|
||||
|
||||
|
||||
Example:
|
||||
--------
|
||||
To use the pktcdvd sysfs interface directly, you can do:
|
||||
|
||||
# create a new pktcdvd device mapped to /dev/hdc
|
||||
echo "22:0" >/sys/class/pktcdvd/add
|
||||
cat /sys/class/pktcdvd/device_map
|
||||
# assuming device pktcdvd0 was created, look at stat's
|
||||
cat /sys/class/pktcdvd/pktcdvd0/stat/kb_written
|
||||
# print the device id of the mapped block device
|
||||
fgrep pktcdvd0 /sys/class/pktcdvd/device_map
|
||||
# remove device, using pktcdvd0 device id 253:0
|
||||
echo "253:0" >/sys/class/pktcdvd/remove
|
||||
25
Documentation/ABI/testing/sysfs-devices
Normal file
25
Documentation/ABI/testing/sysfs-devices
Normal file
@ -0,0 +1,25 @@
|
||||
What: /sys/devices
|
||||
Date: February 2006
|
||||
Contact: Greg Kroah-Hartman <gregkh@suse.de>
|
||||
Description:
|
||||
The /sys/devices tree contains a snapshot of the
|
||||
internal state of the kernel device tree. Devices will
|
||||
be added and removed dynamically as the machine runs,
|
||||
and between different kernel versions, the layout of the
|
||||
devices within this tree will change.
|
||||
|
||||
Please do not rely on the format of this tree because of
|
||||
this. If a program wishes to find different things in
|
||||
the tree, please use the /sys/class structure and rely
|
||||
on the symlinks there to point to the proper location
|
||||
within the /sys/devices tree of the individual devices.
|
||||
Or rely on the uevent messages to notify programs of
|
||||
devices being added and removed from this tree to find
|
||||
the location of those devices.
|
||||
|
||||
Note that sometimes not all devices along the directory
|
||||
chain will have emitted uevent messages, so userspace
|
||||
programs must be able to handle such occurrences.
|
||||
|
||||
Users:
|
||||
udev <linux-hotplug-devel@lists.sourceforge.net>
|
||||
103
Documentation/ABI/testing/sysfs-power
Normal file
103
Documentation/ABI/testing/sysfs-power
Normal file
@ -0,0 +1,103 @@
|
||||
What: /sys/power/
|
||||
Date: August 2006
|
||||
Contact: Rafael J. Wysocki <rjw@sisk.pl>
|
||||
Description:
|
||||
The /sys/power directory will contain files that will
|
||||
provide a unified interface to the power management
|
||||
subsystem.
|
||||
|
||||
What: /sys/power/state
|
||||
Date: August 2006
|
||||
Contact: Rafael J. Wysocki <rjw@sisk.pl>
|
||||
Description:
|
||||
The /sys/power/state file controls the system power state.
|
||||
Reading from this file returns what states are supported,
|
||||
which is hard-coded to 'standby' (Power-On Suspend), 'mem'
|
||||
(Suspend-to-RAM), and 'disk' (Suspend-to-Disk).
|
||||
|
||||
Writing to this file one of these strings causes the system to
|
||||
transition into that state. Please see the file
|
||||
Documentation/power/states.txt for a description of each of
|
||||
these states.
|
||||
|
||||
What: /sys/power/disk
|
||||
Date: September 2006
|
||||
Contact: Rafael J. Wysocki <rjw@sisk.pl>
|
||||
Description:
|
||||
The /sys/power/disk file controls the operating mode of the
|
||||
suspend-to-disk mechanism. Reading from this file returns
|
||||
the name of the method by which the system will be put to
|
||||
sleep on the next suspend. There are four methods supported:
|
||||
'firmware' - means that the memory image will be saved to disk
|
||||
by some firmware, in which case we also assume that the
|
||||
firmware will handle the system suspend.
|
||||
'platform' - the memory image will be saved by the kernel and
|
||||
the system will be put to sleep by the platform driver (e.g.
|
||||
ACPI or other PM registers).
|
||||
'shutdown' - the memory image will be saved by the kernel and
|
||||
the system will be powered off.
|
||||
'reboot' - the memory image will be saved by the kernel and
|
||||
the system will be rebooted.
|
||||
|
||||
Additionally, /sys/power/disk can be used to turn on one of the
|
||||
two testing modes of the suspend-to-disk mechanism: 'testproc'
|
||||
or 'test'. If the suspend-to-disk mechanism is in the
|
||||
'testproc' mode, writing 'disk' to /sys/power/state will cause
|
||||
the kernel to disable nonboot CPUs and freeze tasks, wait for 5
|
||||
seconds, unfreeze tasks and enable nonboot CPUs. If it is in
|
||||
the 'test' mode, writing 'disk' to /sys/power/state will cause
|
||||
the kernel to disable nonboot CPUs and freeze tasks, shrink
|
||||
memory, suspend devices, wait for 5 seconds, resume devices,
|
||||
unfreeze tasks and enable nonboot CPUs. Then, we are able to
|
||||
look in the log messages and work out, for example, which code
|
||||
is being slow and which device drivers are misbehaving.
|
||||
|
||||
The suspend-to-disk method may be chosen by writing to this
|
||||
file one of the accepted strings:
|
||||
|
||||
'firmware'
|
||||
'platform'
|
||||
'shutdown'
|
||||
'reboot'
|
||||
'testproc'
|
||||
'test'
|
||||
|
||||
It will only change to 'firmware' or 'platform' if the system
|
||||
supports that.
|
||||
|
||||
What: /sys/power/image_size
|
||||
Date: August 2006
|
||||
Contact: Rafael J. Wysocki <rjw@sisk.pl>
|
||||
Description:
|
||||
The /sys/power/image_size file controls the size of the image
|
||||
created by the suspend-to-disk mechanism. It can be written a
|
||||
string representing a non-negative integer that will be used
|
||||
as an upper limit of the image size, in bytes. The kernel's
|
||||
suspend-to-disk code will do its best to ensure the image size
|
||||
will not exceed this number. However, if it turns out to be
|
||||
impossible, the kernel will try to suspend anyway using the
|
||||
smallest image possible. In particular, if "0" is written to
|
||||
this file, the suspend image will be as small as possible.
|
||||
|
||||
Reading from this file will display the current image size
|
||||
limit, which is set to 500 MB by default.
|
||||
|
||||
What: /sys/power/pm_trace
|
||||
Date: August 2006
|
||||
Contact: Rafael J. Wysocki <rjw@sisk.pl>
|
||||
Description:
|
||||
The /sys/power/pm_trace file controls the code which saves the
|
||||
last PM event point in the RTC across reboots, so that you can
|
||||
debug a machine that just hangs during suspend (or more
|
||||
commonly, during resume). Namely, the RTC is only used to save
|
||||
the last PM event point if this file contains '1'. Initially
|
||||
it contains '0' which may be changed to '1' by writing a
|
||||
string representing a nonzero integer into it.
|
||||
|
||||
To use this debugging feature you should attempt to suspend
|
||||
the machine, then reboot it and run
|
||||
|
||||
dmesg -s 1000000 | grep 'hash matches'
|
||||
|
||||
CAUTION: Using it will cause your machine's real-time (CMOS)
|
||||
clock to be set to a random invalid time after a resume.
|
||||
205
Documentation/BUG-HUNTING
Normal file
205
Documentation/BUG-HUNTING
Normal file
@ -0,0 +1,205 @@
|
||||
Table of contents
|
||||
=================
|
||||
|
||||
Last updated: 20 December 2005
|
||||
|
||||
Contents
|
||||
========
|
||||
|
||||
- Introduction
|
||||
- Devices not appearing
|
||||
- Finding patch that caused a bug
|
||||
-- Finding using git-bisect
|
||||
-- Finding it the old way
|
||||
- Fixing the bug
|
||||
|
||||
Introduction
|
||||
============
|
||||
|
||||
Always try the latest kernel from kernel.org and build from source. If you are
|
||||
not confident in doing that please report the bug to your distribution vendor
|
||||
instead of to a kernel developer.
|
||||
|
||||
Finding bugs is not always easy. Have a go though. If you can't find it don't
|
||||
give up. Report as much as you have found to the relevant maintainer. See
|
||||
MAINTAINERS for who that is for the subsystem you have worked on.
|
||||
|
||||
Before you submit a bug report read REPORTING-BUGS.
|
||||
|
||||
Devices not appearing
|
||||
=====================
|
||||
|
||||
Often this is caused by udev. Check that first before blaming it on the
|
||||
kernel.
|
||||
|
||||
Finding patch that caused a bug
|
||||
===============================
|
||||
|
||||
|
||||
|
||||
Finding using git-bisect
|
||||
------------------------
|
||||
|
||||
Using the provided tools with git makes finding bugs easy provided the bug is
|
||||
reproducible.
|
||||
|
||||
Steps to do it:
|
||||
- start using git for the kernel source
|
||||
- read the man page for git-bisect
|
||||
- have fun
|
||||
|
||||
Finding it the old way
|
||||
----------------------
|
||||
|
||||
[Sat Mar 2 10:32:33 PST 1996 KERNEL_BUG-HOWTO lm@sgi.com (Larry McVoy)]
|
||||
|
||||
This is how to track down a bug if you know nothing about kernel hacking.
|
||||
It's a brute force approach but it works pretty well.
|
||||
|
||||
You need:
|
||||
|
||||
. A reproducible bug - it has to happen predictably (sorry)
|
||||
. All the kernel tar files from a revision that worked to the
|
||||
revision that doesn't
|
||||
|
||||
You will then do:
|
||||
|
||||
. Rebuild a revision that you believe works, install, and verify that.
|
||||
. Do a binary search over the kernels to figure out which one
|
||||
introduced the bug. I.e., suppose 1.3.28 didn't have the bug, but
|
||||
you know that 1.3.69 does. Pick a kernel in the middle and build
|
||||
that, like 1.3.50. Build & test; if it works, pick the mid point
|
||||
between .50 and .69, else the mid point between .28 and .50.
|
||||
. You'll narrow it down to the kernel that introduced the bug. You
|
||||
can probably do better than this but it gets tricky.
|
||||
|
||||
. Narrow it down to a subdirectory
|
||||
|
||||
- Copy kernel that works into "test". Let's say that 3.62 works,
|
||||
but 3.63 doesn't. So you diff -r those two kernels and come
|
||||
up with a list of directories that changed. For each of those
|
||||
directories:
|
||||
|
||||
Copy the non-working directory next to the working directory
|
||||
as "dir.63".
|
||||
One directory at time, try moving the working directory to
|
||||
"dir.62" and mv dir.63 dir"time, try
|
||||
|
||||
mv dir dir.62
|
||||
mv dir.63 dir
|
||||
find dir -name '*.[oa]' -print | xargs rm -f
|
||||
|
||||
And then rebuild and retest. Assuming that all related
|
||||
changes were contained in the sub directory, this should
|
||||
isolate the change to a directory.
|
||||
|
||||
Problems: changes in header files may have occurred; I've
|
||||
found in my case that they were self explanatory - you may
|
||||
or may not want to give up when that happens.
|
||||
|
||||
. Narrow it down to a file
|
||||
|
||||
- You can apply the same technique to each file in the directory,
|
||||
hoping that the changes in that file are self contained.
|
||||
|
||||
. Narrow it down to a routine
|
||||
|
||||
- You can take the old file and the new file and manually create
|
||||
a merged file that has
|
||||
|
||||
#ifdef VER62
|
||||
routine()
|
||||
{
|
||||
...
|
||||
}
|
||||
#else
|
||||
routine()
|
||||
{
|
||||
...
|
||||
}
|
||||
#endif
|
||||
|
||||
And then walk through that file, one routine at a time and
|
||||
prefix it with
|
||||
|
||||
#define VER62
|
||||
/* both routines here */
|
||||
#undef VER62
|
||||
|
||||
Then recompile, retest, move the ifdefs until you find the one
|
||||
that makes the difference.
|
||||
|
||||
Finally, you take all the info that you have, kernel revisions, bug
|
||||
description, the extent to which you have narrowed it down, and pass
|
||||
that off to whomever you believe is the maintainer of that section.
|
||||
A post to linux.dev.kernel isn't such a bad idea if you've done some
|
||||
work to narrow it down.
|
||||
|
||||
If you get it down to a routine, you'll probably get a fix in 24 hours.
|
||||
|
||||
My apologies to Linus and the other kernel hackers for describing this
|
||||
brute force approach, it's hardly what a kernel hacker would do. However,
|
||||
it does work and it lets non-hackers help fix bugs. And it is cool
|
||||
because Linux snapshots will let you do this - something that you can't
|
||||
do with vendor supplied releases.
|
||||
|
||||
Fixing the bug
|
||||
==============
|
||||
|
||||
Nobody is going to tell you how to fix bugs. Seriously. You need to work it
|
||||
out. But below are some hints on how to use the tools.
|
||||
|
||||
To debug a kernel, use objdump and look for the hex offset from the crash
|
||||
output to find the valid line of code/assembler. Without debug symbols, you
|
||||
will see the assembler code for the routine shown, but if your kernel has
|
||||
debug symbols the C code will also be available. (Debug symbols can be enabled
|
||||
in the kernel hacking menu of the menu configuration.) For example:
|
||||
|
||||
objdump -r -S -l --disassemble net/dccp/ipv4.o
|
||||
|
||||
NB.: you need to be at the top level of the kernel tree for this to pick up
|
||||
your C files.
|
||||
|
||||
If you don't have access to the code you can also debug on some crash dumps
|
||||
e.g. crash dump output as shown by Dave Miller.
|
||||
|
||||
> EIP is at ip_queue_xmit+0x14/0x4c0
|
||||
> ...
|
||||
> Code: 44 24 04 e8 6f 05 00 00 e9 e8 fe ff ff 8d 76 00 8d bc 27 00 00
|
||||
> 00 00 55 57 56 53 81 ec bc 00 00 00 8b ac 24 d0 00 00 00 8b 5d 08
|
||||
> <8b> 83 3c 01 00 00 89 44 24 14 8b 45 28 85 c0 89 44 24 18 0f 85
|
||||
>
|
||||
> Put the bytes into a "foo.s" file like this:
|
||||
>
|
||||
> .text
|
||||
> .globl foo
|
||||
> foo:
|
||||
> .byte .... /* bytes from Code: part of OOPS dump */
|
||||
>
|
||||
> Compile it with "gcc -c -o foo.o foo.s" then look at the output of
|
||||
> "objdump --disassemble foo.o".
|
||||
>
|
||||
> Output:
|
||||
>
|
||||
> ip_queue_xmit:
|
||||
> push %ebp
|
||||
> push %edi
|
||||
> push %esi
|
||||
> push %ebx
|
||||
> sub $0xbc, %esp
|
||||
> mov 0xd0(%esp), %ebp ! %ebp = arg0 (skb)
|
||||
> mov 0x8(%ebp), %ebx ! %ebx = skb->sk
|
||||
> mov 0x13c(%ebx), %eax ! %eax = inet_sk(sk)->opt
|
||||
|
||||
Another very useful option of the Kernel Hacking section in menuconfig is
|
||||
Debug memory allocations. This will help you see whether data has been
|
||||
initialised and not set before use etc. To see the values that get assigned
|
||||
with this look at mm/slab.c and search for POISON_INUSE. When using this an
|
||||
Oops will often show the poisoned data instead of zero which is the default.
|
||||
|
||||
Once you have worked out a fix please submit it upstream. After all open
|
||||
source is about sharing what you do and don't you want to be recognised for
|
||||
your genius?
|
||||
|
||||
Please do read Documentation/SubmittingPatches though to help your code get
|
||||
accepted.
|
||||
395
Documentation/Changes
Normal file
395
Documentation/Changes
Normal file
@ -0,0 +1,395 @@
|
||||
Intro
|
||||
=====
|
||||
|
||||
This document is designed to provide a list of the minimum levels of
|
||||
software necessary to run the 2.6 kernels, as well as provide brief
|
||||
instructions regarding any other "Gotchas" users may encounter when
|
||||
trying life on the Bleeding Edge. If upgrading from a pre-2.4.x
|
||||
kernel, please consult the Changes file included with 2.4.x kernels for
|
||||
additional information; most of that information will not be repeated
|
||||
here. Basically, this document assumes that your system is already
|
||||
functional and running at least 2.4.x kernels.
|
||||
|
||||
This document is originally based on my "Changes" file for 2.0.x kernels
|
||||
and therefore owes credit to the same people as that file (Jared Mauch,
|
||||
Axel Boldt, Alessandro Sigala, and countless other users all over the
|
||||
'net).
|
||||
|
||||
Current Minimal Requirements
|
||||
============================
|
||||
|
||||
Upgrade to at *least* these software revisions before thinking you've
|
||||
encountered a bug! If you're unsure what version you're currently
|
||||
running, the suggested command should tell you.
|
||||
|
||||
Again, keep in mind that this list assumes you are already
|
||||
functionally running a Linux 2.4 kernel. Also, not all tools are
|
||||
necessary on all systems; obviously, if you don't have any ISDN
|
||||
hardware, for example, you probably needn't concern yourself with
|
||||
isdn4k-utils.
|
||||
|
||||
o Gnu C 3.2 # gcc --version
|
||||
o Gnu make 3.79.1 # make --version
|
||||
o binutils 2.12 # ld -v
|
||||
o util-linux 2.10o # fdformat --version
|
||||
o module-init-tools 0.9.10 # depmod -V
|
||||
o e2fsprogs 1.29 # tune2fs
|
||||
o jfsutils 1.1.3 # fsck.jfs -V
|
||||
o reiserfsprogs 3.6.3 # reiserfsck -V 2>&1|grep reiserfsprogs
|
||||
o xfsprogs 2.6.0 # xfs_db -V
|
||||
o pcmciautils 004 # pccardctl -V
|
||||
o quota-tools 3.09 # quota -V
|
||||
o PPP 2.4.0 # pppd --version
|
||||
o isdn4k-utils 3.1pre1 # isdnctrl 2>&1|grep version
|
||||
o nfs-utils 1.0.5 # showmount --version
|
||||
o procps 3.2.0 # ps --version
|
||||
o oprofile 0.9 # oprofiled --version
|
||||
o udev 081 # udevinfo -V
|
||||
|
||||
Kernel compilation
|
||||
==================
|
||||
|
||||
GCC
|
||||
---
|
||||
|
||||
The gcc version requirements may vary depending on the type of CPU in your
|
||||
computer.
|
||||
|
||||
Make
|
||||
----
|
||||
|
||||
You will need Gnu make 3.79.1 or later to build the kernel.
|
||||
|
||||
Binutils
|
||||
--------
|
||||
|
||||
Linux on IA-32 has recently switched from using as86 to using gas for
|
||||
assembling the 16-bit boot code, removing the need for as86 to compile
|
||||
your kernel. This change does, however, mean that you need a recent
|
||||
release of binutils.
|
||||
|
||||
System utilities
|
||||
================
|
||||
|
||||
Architectural changes
|
||||
---------------------
|
||||
|
||||
DevFS has been obsoleted in favour of udev
|
||||
(http://www.kernel.org/pub/linux/utils/kernel/hotplug/)
|
||||
|
||||
32-bit UID support is now in place. Have fun!
|
||||
|
||||
Linux documentation for functions is transitioning to inline
|
||||
documentation via specially-formatted comments near their
|
||||
definitions in the source. These comments can be combined with the
|
||||
SGML templates in the Documentation/DocBook directory to make DocBook
|
||||
files, which can then be converted by DocBook stylesheets to PostScript,
|
||||
HTML, PDF files, and several other formats. In order to convert from
|
||||
DocBook format to a format of your choice, you'll need to install Jade as
|
||||
well as the desired DocBook stylesheets.
|
||||
|
||||
Util-linux
|
||||
----------
|
||||
|
||||
New versions of util-linux provide *fdisk support for larger disks,
|
||||
support new options to mount, recognize more supported partition
|
||||
types, have a fdformat which works with 2.4 kernels, and similar goodies.
|
||||
You'll probably want to upgrade.
|
||||
|
||||
Ksymoops
|
||||
--------
|
||||
|
||||
If the unthinkable happens and your kernel oopses, you may need the
|
||||
ksymoops tool to decode it, but in most cases you don't.
|
||||
In the 2.6 kernel it is generally preferred to build the kernel with
|
||||
CONFIG_KALLSYMS so that it produces readable dumps that can be used as-is
|
||||
(this also produces better output than ksymoops).
|
||||
If for some reason your kernel is not build with CONFIG_KALLSYMS and
|
||||
you have no way to rebuild and reproduce the Oops with that option, then
|
||||
you can still decode that Oops with ksymoops.
|
||||
|
||||
Module-Init-Tools
|
||||
-----------------
|
||||
|
||||
A new module loader is now in the kernel that requires module-init-tools
|
||||
to use. It is backward compatible with the 2.4.x series kernels.
|
||||
|
||||
Mkinitrd
|
||||
--------
|
||||
|
||||
These changes to the /lib/modules file tree layout also require that
|
||||
mkinitrd be upgraded.
|
||||
|
||||
E2fsprogs
|
||||
---------
|
||||
|
||||
The latest version of e2fsprogs fixes several bugs in fsck and
|
||||
debugfs. Obviously, it's a good idea to upgrade.
|
||||
|
||||
JFSutils
|
||||
--------
|
||||
|
||||
The jfsutils package contains the utilities for the file system.
|
||||
The following utilities are available:
|
||||
o fsck.jfs - initiate replay of the transaction log, and check
|
||||
and repair a JFS formatted partition.
|
||||
o mkfs.jfs - create a JFS formatted partition.
|
||||
o other file system utilities are also available in this package.
|
||||
|
||||
Reiserfsprogs
|
||||
-------------
|
||||
|
||||
The reiserfsprogs package should be used for reiserfs-3.6.x
|
||||
(Linux kernels 2.4.x). It is a combined package and contains working
|
||||
versions of mkreiserfs, resize_reiserfs, debugreiserfs and
|
||||
reiserfsck. These utils work on both i386 and alpha platforms.
|
||||
|
||||
Xfsprogs
|
||||
--------
|
||||
|
||||
The latest version of xfsprogs contains mkfs.xfs, xfs_db, and the
|
||||
xfs_repair utilities, among others, for the XFS filesystem. It is
|
||||
architecture independent and any version from 2.0.0 onward should
|
||||
work correctly with this version of the XFS kernel code (2.6.0 or
|
||||
later is recommended, due to some significant improvements).
|
||||
|
||||
PCMCIAutils
|
||||
-----------
|
||||
|
||||
PCMCIAutils replaces pcmcia-cs (see below). It properly sets up
|
||||
PCMCIA sockets at system startup and loads the appropriate modules
|
||||
for 16-bit PCMCIA devices if the kernel is modularized and the hotplug
|
||||
subsystem is used.
|
||||
|
||||
Pcmcia-cs
|
||||
---------
|
||||
|
||||
PCMCIA (PC Card) support is now partially implemented in the main
|
||||
kernel source. The "pcmciautils" package (see above) replaces pcmcia-cs
|
||||
for newest kernels.
|
||||
|
||||
Quota-tools
|
||||
-----------
|
||||
|
||||
Support for 32 bit uid's and gid's is required if you want to use
|
||||
the newer version 2 quota format. Quota-tools version 3.07 and
|
||||
newer has this support. Use the recommended version or newer
|
||||
from the table above.
|
||||
|
||||
Intel IA32 microcode
|
||||
--------------------
|
||||
|
||||
A driver has been added to allow updating of Intel IA32 microcode,
|
||||
accessible as a normal (misc) character device. If you are not using
|
||||
udev you may need to:
|
||||
|
||||
mkdir /dev/cpu
|
||||
mknod /dev/cpu/microcode c 10 184
|
||||
chmod 0644 /dev/cpu/microcode
|
||||
|
||||
as root before you can use this. You'll probably also want to
|
||||
get the user-space microcode_ctl utility to use with this.
|
||||
|
||||
Powertweak
|
||||
----------
|
||||
|
||||
If you are running v0.1.17 or earlier, you should upgrade to
|
||||
version v0.99.0 or higher. Running old versions may cause problems
|
||||
with programs using shared memory.
|
||||
|
||||
udev
|
||||
----
|
||||
udev is a userspace application for populating /dev dynamically with
|
||||
only entries for devices actually present. udev replaces the basic
|
||||
functionality of devfs, while allowing persistent device naming for
|
||||
devices.
|
||||
|
||||
FUSE
|
||||
----
|
||||
|
||||
Needs libfuse 2.4.0 or later. Absolute minimum is 2.3.0 but mount
|
||||
options 'direct_io' and 'kernel_cache' won't work.
|
||||
|
||||
Networking
|
||||
==========
|
||||
|
||||
General changes
|
||||
---------------
|
||||
|
||||
If you have advanced network configuration needs, you should probably
|
||||
consider using the network tools from ip-route2.
|
||||
|
||||
Packet Filter / NAT
|
||||
-------------------
|
||||
The packet filtering and NAT code uses the same tools like the previous 2.4.x
|
||||
kernel series (iptables). It still includes backwards-compatibility modules
|
||||
for 2.2.x-style ipchains and 2.0.x-style ipfwadm.
|
||||
|
||||
PPP
|
||||
---
|
||||
|
||||
The PPP driver has been restructured to support multilink and to
|
||||
enable it to operate over diverse media layers. If you use PPP,
|
||||
upgrade pppd to at least 2.4.0.
|
||||
|
||||
If you are not using udev, you must have the device file /dev/ppp
|
||||
which can be made by:
|
||||
|
||||
mknod /dev/ppp c 108 0
|
||||
|
||||
as root.
|
||||
|
||||
Isdn4k-utils
|
||||
------------
|
||||
|
||||
Due to changes in the length of the phone number field, isdn4k-utils
|
||||
needs to be recompiled or (preferably) upgraded.
|
||||
|
||||
NFS-utils
|
||||
---------
|
||||
|
||||
In 2.4 and earlier kernels, the nfs server needed to know about any
|
||||
client that expected to be able to access files via NFS. This
|
||||
information would be given to the kernel by "mountd" when the client
|
||||
mounted the filesystem, or by "exportfs" at system startup. exportfs
|
||||
would take information about active clients from /var/lib/nfs/rmtab.
|
||||
|
||||
This approach is quite fragile as it depends on rmtab being correct
|
||||
which is not always easy, particularly when trying to implement
|
||||
fail-over. Even when the system is working well, rmtab suffers from
|
||||
getting lots of old entries that never get removed.
|
||||
|
||||
With 2.6 we have the option of having the kernel tell mountd when it
|
||||
gets a request from an unknown host, and mountd can give appropriate
|
||||
export information to the kernel. This removes the dependency on
|
||||
rmtab and means that the kernel only needs to know about currently
|
||||
active clients.
|
||||
|
||||
To enable this new functionality, you need to:
|
||||
|
||||
mount -t nfsd nfsd /proc/fs/nfsd
|
||||
|
||||
before running exportfs or mountd. It is recommended that all NFS
|
||||
services be protected from the internet-at-large by a firewall where
|
||||
that is possible.
|
||||
|
||||
Getting updated software
|
||||
========================
|
||||
|
||||
Kernel compilation
|
||||
******************
|
||||
|
||||
gcc
|
||||
---
|
||||
o <ftp://ftp.gnu.org/gnu/gcc/>
|
||||
|
||||
Make
|
||||
----
|
||||
o <ftp://ftp.gnu.org/gnu/make/>
|
||||
|
||||
Binutils
|
||||
--------
|
||||
o <ftp://ftp.kernel.org/pub/linux/devel/binutils/>
|
||||
|
||||
System utilities
|
||||
****************
|
||||
|
||||
Util-linux
|
||||
----------
|
||||
o <ftp://ftp.kernel.org/pub/linux/utils/util-linux/>
|
||||
|
||||
Ksymoops
|
||||
--------
|
||||
o <ftp://ftp.kernel.org/pub/linux/utils/kernel/ksymoops/v2.4/>
|
||||
|
||||
Module-Init-Tools
|
||||
-----------------
|
||||
o <ftp://ftp.kernel.org/pub/linux/kernel/people/rusty/modules/>
|
||||
|
||||
Mkinitrd
|
||||
--------
|
||||
o <ftp://rawhide.redhat.com/pub/rawhide/SRPMS/SRPMS/>
|
||||
|
||||
E2fsprogs
|
||||
---------
|
||||
o <http://prdownloads.sourceforge.net/e2fsprogs/e2fsprogs-1.29.tar.gz>
|
||||
|
||||
JFSutils
|
||||
--------
|
||||
o <http://jfs.sourceforge.net/>
|
||||
|
||||
Reiserfsprogs
|
||||
-------------
|
||||
o <http://www.namesys.com/pub/reiserfsprogs/reiserfsprogs-3.6.3.tar.gz>
|
||||
|
||||
Xfsprogs
|
||||
--------
|
||||
o <ftp://oss.sgi.com/projects/xfs/download/>
|
||||
|
||||
Pcmciautils
|
||||
-----------
|
||||
o <ftp://ftp.kernel.org/pub/linux/utils/kernel/pcmcia/>
|
||||
|
||||
Pcmcia-cs
|
||||
---------
|
||||
o <http://pcmcia-cs.sourceforge.net/>
|
||||
|
||||
Quota-tools
|
||||
----------
|
||||
o <http://sourceforge.net/projects/linuxquota/>
|
||||
|
||||
DocBook Stylesheets
|
||||
-------------------
|
||||
o <http://nwalsh.com/docbook/dsssl/>
|
||||
|
||||
XMLTO XSLT Frontend
|
||||
-------------------
|
||||
o <http://cyberelk.net/tim/xmlto/>
|
||||
|
||||
Intel P6 microcode
|
||||
------------------
|
||||
o <http://www.urbanmyth.org/microcode/>
|
||||
|
||||
Powertweak
|
||||
----------
|
||||
o <http://powertweak.sourceforge.net/>
|
||||
|
||||
udev
|
||||
----
|
||||
o <http://www.kernel.org/pub/linux/utils/kernel/hotplug/udev.html>
|
||||
|
||||
FUSE
|
||||
----
|
||||
o <http://sourceforge.net/projects/fuse>
|
||||
|
||||
Networking
|
||||
**********
|
||||
|
||||
PPP
|
||||
---
|
||||
o <ftp://ftp.samba.org/pub/ppp/ppp-2.4.0.tar.gz>
|
||||
|
||||
Isdn4k-utils
|
||||
------------
|
||||
o <ftp://ftp.isdn4linux.de/pub/isdn4linux/utils/isdn4k-utils.v3.1pre1.tar.gz>
|
||||
|
||||
NFS-utils
|
||||
---------
|
||||
o <http://sourceforge.net/project/showfiles.php?group_id=14>
|
||||
|
||||
Iptables
|
||||
--------
|
||||
o <http://www.iptables.org/downloads.html>
|
||||
|
||||
Ip-route2
|
||||
---------
|
||||
o <ftp://ftp.tux.org/pub/net/ip-routing/iproute2-2.2.4-now-ss991023.tar.gz>
|
||||
|
||||
OProfile
|
||||
--------
|
||||
o <http://oprofile.sf.net/download/>
|
||||
|
||||
NFS-Utils
|
||||
---------
|
||||
o <http://nfs.sourceforge.net/>
|
||||
|
||||
728
Documentation/CodingStyle
Normal file
728
Documentation/CodingStyle
Normal file
@ -0,0 +1,728 @@
|
||||
|
||||
Linux kernel coding style
|
||||
|
||||
This is a short document describing the preferred coding style for the
|
||||
linux kernel. Coding style is very personal, and I won't _force_ my
|
||||
views on anybody, but this is what goes for anything that I have to be
|
||||
able to maintain, and I'd prefer it for most other things too. Please
|
||||
at least consider the points made here.
|
||||
|
||||
First off, I'd suggest printing out a copy of the GNU coding standards,
|
||||
and NOT read it. Burn them, it's a great symbolic gesture.
|
||||
|
||||
Anyway, here goes:
|
||||
|
||||
|
||||
Chapter 1: Indentation
|
||||
|
||||
Tabs are 8 characters, and thus indentations are also 8 characters.
|
||||
There are heretic movements that try to make indentations 4 (or even 2!)
|
||||
characters deep, and that is akin to trying to define the value of PI to
|
||||
be 3.
|
||||
|
||||
Rationale: The whole idea behind indentation is to clearly define where
|
||||
a block of control starts and ends. Especially when you've been looking
|
||||
at your screen for 20 straight hours, you'll find it a lot easier to see
|
||||
how the indentation works if you have large indentations.
|
||||
|
||||
Now, some people will claim that having 8-character indentations makes
|
||||
the code move too far to the right, and makes it hard to read on a
|
||||
80-character terminal screen. The answer to that is that if you need
|
||||
more than 3 levels of indentation, you're screwed anyway, and should fix
|
||||
your program.
|
||||
|
||||
In short, 8-char indents make things easier to read, and have the added
|
||||
benefit of warning you when you're nesting your functions too deep.
|
||||
Heed that warning.
|
||||
|
||||
The preferred way to ease multiple indentation levels in a switch statement is
|
||||
to align the "switch" and its subordinate "case" labels in the same column
|
||||
instead of "double-indenting" the "case" labels. E.g.:
|
||||
|
||||
switch (suffix) {
|
||||
case 'G':
|
||||
case 'g':
|
||||
mem <<= 30;
|
||||
break;
|
||||
case 'M':
|
||||
case 'm':
|
||||
mem <<= 20;
|
||||
break;
|
||||
case 'K':
|
||||
case 'k':
|
||||
mem <<= 10;
|
||||
/* fall through */
|
||||
default:
|
||||
break;
|
||||
}
|
||||
|
||||
|
||||
Don't put multiple statements on a single line unless you have
|
||||
something to hide:
|
||||
|
||||
if (condition) do_this;
|
||||
do_something_everytime;
|
||||
|
||||
Don't put multiple assignments on a single line either. Kernel coding style
|
||||
is super simple. Avoid tricky expressions.
|
||||
|
||||
Outside of comments, documentation and except in Kconfig, spaces are never
|
||||
used for indentation, and the above example is deliberately broken.
|
||||
|
||||
Get a decent editor and don't leave whitespace at the end of lines.
|
||||
|
||||
|
||||
Chapter 2: Breaking long lines and strings
|
||||
|
||||
Coding style is all about readability and maintainability using commonly
|
||||
available tools.
|
||||
|
||||
The limit on the length of lines is 80 columns and this is a hard limit.
|
||||
|
||||
Statements longer than 80 columns will be broken into sensible chunks.
|
||||
Descendants are always substantially shorter than the parent and are placed
|
||||
substantially to the right. The same applies to function headers with a long
|
||||
argument list. Long strings are as well broken into shorter strings.
|
||||
|
||||
void fun(int a, int b, int c)
|
||||
{
|
||||
if (condition)
|
||||
printk(KERN_WARNING "Warning this is a long printk with "
|
||||
"3 parameters a: %u b: %u "
|
||||
"c: %u \n", a, b, c);
|
||||
else
|
||||
next_statement;
|
||||
}
|
||||
|
||||
Chapter 3: Placing Braces and Spaces
|
||||
|
||||
The other issue that always comes up in C styling is the placement of
|
||||
braces. Unlike the indent size, there are few technical reasons to
|
||||
choose one placement strategy over the other, but the preferred way, as
|
||||
shown to us by the prophets Kernighan and Ritchie, is to put the opening
|
||||
brace last on the line, and put the closing brace first, thusly:
|
||||
|
||||
if (x is true) {
|
||||
we do y
|
||||
}
|
||||
|
||||
This applies to all non-function statement blocks (if, switch, for,
|
||||
while, do). E.g.:
|
||||
|
||||
switch (action) {
|
||||
case KOBJ_ADD:
|
||||
return "add";
|
||||
case KOBJ_REMOVE:
|
||||
return "remove";
|
||||
case KOBJ_CHANGE:
|
||||
return "change";
|
||||
default:
|
||||
return NULL;
|
||||
}
|
||||
|
||||
However, there is one special case, namely functions: they have the
|
||||
opening brace at the beginning of the next line, thus:
|
||||
|
||||
int function(int x)
|
||||
{
|
||||
body of function
|
||||
}
|
||||
|
||||
Heretic people all over the world have claimed that this inconsistency
|
||||
is ... well ... inconsistent, but all right-thinking people know that
|
||||
(a) K&R are _right_ and (b) K&R are right. Besides, functions are
|
||||
special anyway (you can't nest them in C).
|
||||
|
||||
Note that the closing brace is empty on a line of its own, _except_ in
|
||||
the cases where it is followed by a continuation of the same statement,
|
||||
ie a "while" in a do-statement or an "else" in an if-statement, like
|
||||
this:
|
||||
|
||||
do {
|
||||
body of do-loop
|
||||
} while (condition);
|
||||
|
||||
and
|
||||
|
||||
if (x == y) {
|
||||
..
|
||||
} else if (x > y) {
|
||||
...
|
||||
} else {
|
||||
....
|
||||
}
|
||||
|
||||
Rationale: K&R.
|
||||
|
||||
Also, note that this brace-placement also minimizes the number of empty
|
||||
(or almost empty) lines, without any loss of readability. Thus, as the
|
||||
supply of new-lines on your screen is not a renewable resource (think
|
||||
25-line terminal screens here), you have more empty lines to put
|
||||
comments on.
|
||||
|
||||
3.1: Spaces
|
||||
|
||||
Linux kernel style for use of spaces depends (mostly) on
|
||||
function-versus-keyword usage. Use a space after (most) keywords. The
|
||||
notable exceptions are sizeof, typeof, alignof, and __attribute__, which look
|
||||
somewhat like functions (and are usually used with parentheses in Linux,
|
||||
although they are not required in the language, as in: "sizeof info" after
|
||||
"struct fileinfo info;" is declared).
|
||||
|
||||
So use a space after these keywords:
|
||||
if, switch, case, for, do, while
|
||||
but not with sizeof, typeof, alignof, or __attribute__. E.g.,
|
||||
s = sizeof(struct file);
|
||||
|
||||
Do not add spaces around (inside) parenthesized expressions. This example is
|
||||
*bad*:
|
||||
|
||||
s = sizeof( struct file );
|
||||
|
||||
When declaring pointer data or a function that returns a pointer type, the
|
||||
preferred use of '*' is adjacent to the data name or function name and not
|
||||
adjacent to the type name. Examples:
|
||||
|
||||
char *linux_banner;
|
||||
unsigned long long memparse(char *ptr, char **retptr);
|
||||
char *match_strdup(substring_t *s);
|
||||
|
||||
Use one space around (on each side of) most binary and ternary operators,
|
||||
such as any of these:
|
||||
|
||||
= + - < > * / % | & ^ <= >= == != ? :
|
||||
|
||||
but no space after unary operators:
|
||||
& * + - ~ ! sizeof typeof alignof __attribute__ defined
|
||||
|
||||
no space before the postfix increment & decrement unary operators:
|
||||
++ --
|
||||
|
||||
no space after the prefix increment & decrement unary operators:
|
||||
++ --
|
||||
|
||||
and no space around the '.' and "->" structure member operators.
|
||||
|
||||
|
||||
Chapter 4: Naming
|
||||
|
||||
C is a Spartan language, and so should your naming be. Unlike Modula-2
|
||||
and Pascal programmers, C programmers do not use cute names like
|
||||
ThisVariableIsATemporaryCounter. A C programmer would call that
|
||||
variable "tmp", which is much easier to write, and not the least more
|
||||
difficult to understand.
|
||||
|
||||
HOWEVER, while mixed-case names are frowned upon, descriptive names for
|
||||
global variables are a must. To call a global function "foo" is a
|
||||
shooting offense.
|
||||
|
||||
GLOBAL variables (to be used only if you _really_ need them) need to
|
||||
have descriptive names, as do global functions. If you have a function
|
||||
that counts the number of active users, you should call that
|
||||
"count_active_users()" or similar, you should _not_ call it "cntusr()".
|
||||
|
||||
Encoding the type of a function into the name (so-called Hungarian
|
||||
notation) is brain damaged - the compiler knows the types anyway and can
|
||||
check those, and it only confuses the programmer. No wonder MicroSoft
|
||||
makes buggy programs.
|
||||
|
||||
LOCAL variable names should be short, and to the point. If you have
|
||||
some random integer loop counter, it should probably be called "i".
|
||||
Calling it "loop_counter" is non-productive, if there is no chance of it
|
||||
being mis-understood. Similarly, "tmp" can be just about any type of
|
||||
variable that is used to hold a temporary value.
|
||||
|
||||
If you are afraid to mix up your local variable names, you have another
|
||||
problem, which is called the function-growth-hormone-imbalance syndrome.
|
||||
See chapter 6 (Functions).
|
||||
|
||||
|
||||
Chapter 5: Typedefs
|
||||
|
||||
Please don't use things like "vps_t".
|
||||
|
||||
It's a _mistake_ to use typedef for structures and pointers. When you see a
|
||||
|
||||
vps_t a;
|
||||
|
||||
in the source, what does it mean?
|
||||
|
||||
In contrast, if it says
|
||||
|
||||
struct virtual_container *a;
|
||||
|
||||
you can actually tell what "a" is.
|
||||
|
||||
Lots of people think that typedefs "help readability". Not so. They are
|
||||
useful only for:
|
||||
|
||||
(a) totally opaque objects (where the typedef is actively used to _hide_
|
||||
what the object is).
|
||||
|
||||
Example: "pte_t" etc. opaque objects that you can only access using
|
||||
the proper accessor functions.
|
||||
|
||||
NOTE! Opaqueness and "accessor functions" are not good in themselves.
|
||||
The reason we have them for things like pte_t etc. is that there
|
||||
really is absolutely _zero_ portably accessible information there.
|
||||
|
||||
(b) Clear integer types, where the abstraction _helps_ avoid confusion
|
||||
whether it is "int" or "long".
|
||||
|
||||
u8/u16/u32 are perfectly fine typedefs, although they fit into
|
||||
category (d) better than here.
|
||||
|
||||
NOTE! Again - there needs to be a _reason_ for this. If something is
|
||||
"unsigned long", then there's no reason to do
|
||||
|
||||
typedef unsigned long myflags_t;
|
||||
|
||||
but if there is a clear reason for why it under certain circumstances
|
||||
might be an "unsigned int" and under other configurations might be
|
||||
"unsigned long", then by all means go ahead and use a typedef.
|
||||
|
||||
(c) when you use sparse to literally create a _new_ type for
|
||||
type-checking.
|
||||
|
||||
(d) New types which are identical to standard C99 types, in certain
|
||||
exceptional circumstances.
|
||||
|
||||
Although it would only take a short amount of time for the eyes and
|
||||
brain to become accustomed to the standard types like 'uint32_t',
|
||||
some people object to their use anyway.
|
||||
|
||||
Therefore, the Linux-specific 'u8/u16/u32/u64' types and their
|
||||
signed equivalents which are identical to standard types are
|
||||
permitted -- although they are not mandatory in new code of your
|
||||
own.
|
||||
|
||||
When editing existing code which already uses one or the other set
|
||||
of types, you should conform to the existing choices in that code.
|
||||
|
||||
(e) Types safe for use in userspace.
|
||||
|
||||
In certain structures which are visible to userspace, we cannot
|
||||
require C99 types and cannot use the 'u32' form above. Thus, we
|
||||
use __u32 and similar types in all structures which are shared
|
||||
with userspace.
|
||||
|
||||
Maybe there are other cases too, but the rule should basically be to NEVER
|
||||
EVER use a typedef unless you can clearly match one of those rules.
|
||||
|
||||
In general, a pointer, or a struct that has elements that can reasonably
|
||||
be directly accessed should _never_ be a typedef.
|
||||
|
||||
|
||||
Chapter 6: Functions
|
||||
|
||||
Functions should be short and sweet, and do just one thing. They should
|
||||
fit on one or two screenfuls of text (the ISO/ANSI screen size is 80x24,
|
||||
as we all know), and do one thing and do that well.
|
||||
|
||||
The maximum length of a function is inversely proportional to the
|
||||
complexity and indentation level of that function. So, if you have a
|
||||
conceptually simple function that is just one long (but simple)
|
||||
case-statement, where you have to do lots of small things for a lot of
|
||||
different cases, it's OK to have a longer function.
|
||||
|
||||
However, if you have a complex function, and you suspect that a
|
||||
less-than-gifted first-year high-school student might not even
|
||||
understand what the function is all about, you should adhere to the
|
||||
maximum limits all the more closely. Use helper functions with
|
||||
descriptive names (you can ask the compiler to in-line them if you think
|
||||
it's performance-critical, and it will probably do a better job of it
|
||||
than you would have done).
|
||||
|
||||
Another measure of the function is the number of local variables. They
|
||||
shouldn't exceed 5-10, or you're doing something wrong. Re-think the
|
||||
function, and split it into smaller pieces. A human brain can
|
||||
generally easily keep track of about 7 different things, anything more
|
||||
and it gets confused. You know you're brilliant, but maybe you'd like
|
||||
to understand what you did 2 weeks from now.
|
||||
|
||||
In source files, separate functions with one blank line. If the function is
|
||||
exported, the EXPORT* macro for it should follow immediately after the closing
|
||||
function brace line. E.g.:
|
||||
|
||||
int system_is_up(void)
|
||||
{
|
||||
return system_state == SYSTEM_RUNNING;
|
||||
}
|
||||
EXPORT_SYMBOL(system_is_up);
|
||||
|
||||
In function prototypes, include parameter names with their data types.
|
||||
Although this is not required by the C language, it is preferred in Linux
|
||||
because it is a simple way to add valuable information for the reader.
|
||||
|
||||
|
||||
Chapter 7: Centralized exiting of functions
|
||||
|
||||
Albeit deprecated by some people, the equivalent of the goto statement is
|
||||
used frequently by compilers in form of the unconditional jump instruction.
|
||||
|
||||
The goto statement comes in handy when a function exits from multiple
|
||||
locations and some common work such as cleanup has to be done.
|
||||
|
||||
The rationale is:
|
||||
|
||||
- unconditional statements are easier to understand and follow
|
||||
- nesting is reduced
|
||||
- errors by not updating individual exit points when making
|
||||
modifications are prevented
|
||||
- saves the compiler work to optimize redundant code away ;)
|
||||
|
||||
int fun(int a)
|
||||
{
|
||||
int result = 0;
|
||||
char *buffer = kmalloc(SIZE);
|
||||
|
||||
if (buffer == NULL)
|
||||
return -ENOMEM;
|
||||
|
||||
if (condition1) {
|
||||
while (loop1) {
|
||||
...
|
||||
}
|
||||
result = 1;
|
||||
goto out;
|
||||
}
|
||||
...
|
||||
out:
|
||||
kfree(buffer);
|
||||
return result;
|
||||
}
|
||||
|
||||
Chapter 8: Commenting
|
||||
|
||||
Comments are good, but there is also a danger of over-commenting. NEVER
|
||||
try to explain HOW your code works in a comment: it's much better to
|
||||
write the code so that the _working_ is obvious, and it's a waste of
|
||||
time to explain badly written code.
|
||||
|
||||
Generally, you want your comments to tell WHAT your code does, not HOW.
|
||||
Also, try to avoid putting comments inside a function body: if the
|
||||
function is so complex that you need to separately comment parts of it,
|
||||
you should probably go back to chapter 6 for a while. You can make
|
||||
small comments to note or warn about something particularly clever (or
|
||||
ugly), but try to avoid excess. Instead, put the comments at the head
|
||||
of the function, telling people what it does, and possibly WHY it does
|
||||
it.
|
||||
|
||||
When commenting the kernel API functions, please use the kernel-doc format.
|
||||
See the files Documentation/kernel-doc-nano-HOWTO.txt and scripts/kernel-doc
|
||||
for details.
|
||||
|
||||
Linux style for comments is the C89 "/* ... */" style.
|
||||
Don't use C99-style "// ..." comments.
|
||||
|
||||
The preferred style for long (multi-line) comments is:
|
||||
|
||||
/*
|
||||
* This is the preferred style for multi-line
|
||||
* comments in the Linux kernel source code.
|
||||
* Please use it consistently.
|
||||
*
|
||||
* Description: A column of asterisks on the left side,
|
||||
* with beginning and ending almost-blank lines.
|
||||
*/
|
||||
|
||||
It's also important to comment data, whether they are basic types or derived
|
||||
types. To this end, use just one data declaration per line (no commas for
|
||||
multiple data declarations). This leaves you room for a small comment on each
|
||||
item, explaining its use.
|
||||
|
||||
|
||||
Chapter 9: You've made a mess of it
|
||||
|
||||
That's OK, we all do. You've probably been told by your long-time Unix
|
||||
user helper that "GNU emacs" automatically formats the C sources for
|
||||
you, and you've noticed that yes, it does do that, but the defaults it
|
||||
uses are less than desirable (in fact, they are worse than random
|
||||
typing - an infinite number of monkeys typing into GNU emacs would never
|
||||
make a good program).
|
||||
|
||||
So, you can either get rid of GNU emacs, or change it to use saner
|
||||
values. To do the latter, you can stick the following in your .emacs file:
|
||||
|
||||
(defun linux-c-mode ()
|
||||
"C mode with adjusted defaults for use with the Linux kernel."
|
||||
(interactive)
|
||||
(c-mode)
|
||||
(c-set-style "K&R")
|
||||
(setq tab-width 8)
|
||||
(setq indent-tabs-mode t)
|
||||
(setq c-basic-offset 8))
|
||||
|
||||
This will define the M-x linux-c-mode command. When hacking on a
|
||||
module, if you put the string -*- linux-c -*- somewhere on the first
|
||||
two lines, this mode will be automatically invoked. Also, you may want
|
||||
to add
|
||||
|
||||
(setq auto-mode-alist (cons '("/usr/src/linux.*/.*\\.[ch]$" . linux-c-mode)
|
||||
auto-mode-alist))
|
||||
|
||||
to your .emacs file if you want to have linux-c-mode switched on
|
||||
automagically when you edit source files under /usr/src/linux.
|
||||
|
||||
But even if you fail in getting emacs to do sane formatting, not
|
||||
everything is lost: use "indent".
|
||||
|
||||
Now, again, GNU indent has the same brain-dead settings that GNU emacs
|
||||
has, which is why you need to give it a few command line options.
|
||||
However, that's not too bad, because even the makers of GNU indent
|
||||
recognize the authority of K&R (the GNU people aren't evil, they are
|
||||
just severely misguided in this matter), so you just give indent the
|
||||
options "-kr -i8" (stands for "K&R, 8 character indents"), or use
|
||||
"scripts/Lindent", which indents in the latest style.
|
||||
|
||||
"indent" has a lot of options, and especially when it comes to comment
|
||||
re-formatting you may want to take a look at the man page. But
|
||||
remember: "indent" is not a fix for bad programming.
|
||||
|
||||
|
||||
Chapter 10: Configuration-files
|
||||
|
||||
For configuration options (arch/xxx/Kconfig, and all the Kconfig files),
|
||||
somewhat different indentation is used.
|
||||
|
||||
Help text is indented with 2 spaces.
|
||||
|
||||
if CONFIG_EXPERIMENTAL
|
||||
tristate CONFIG_BOOM
|
||||
default n
|
||||
help
|
||||
Apply nitroglycerine inside the keyboard (DANGEROUS)
|
||||
bool CONFIG_CHEER
|
||||
depends on CONFIG_BOOM
|
||||
default y
|
||||
help
|
||||
Output nice messages when you explode
|
||||
endif
|
||||
|
||||
Generally, CONFIG_EXPERIMENTAL should surround all options not considered
|
||||
stable. All options that are known to trash data (experimental write-
|
||||
support for file-systems, for instance) should be denoted (DANGEROUS), other
|
||||
experimental options should be denoted (EXPERIMENTAL).
|
||||
|
||||
|
||||
Chapter 11: Data structures
|
||||
|
||||
Data structures that have visibility outside the single-threaded
|
||||
environment they are created and destroyed in should always have
|
||||
reference counts. In the kernel, garbage collection doesn't exist (and
|
||||
outside the kernel garbage collection is slow and inefficient), which
|
||||
means that you absolutely _have_ to reference count all your uses.
|
||||
|
||||
Reference counting means that you can avoid locking, and allows multiple
|
||||
users to have access to the data structure in parallel - and not having
|
||||
to worry about the structure suddenly going away from under them just
|
||||
because they slept or did something else for a while.
|
||||
|
||||
Note that locking is _not_ a replacement for reference counting.
|
||||
Locking is used to keep data structures coherent, while reference
|
||||
counting is a memory management technique. Usually both are needed, and
|
||||
they are not to be confused with each other.
|
||||
|
||||
Many data structures can indeed have two levels of reference counting,
|
||||
when there are users of different "classes". The subclass count counts
|
||||
the number of subclass users, and decrements the global count just once
|
||||
when the subclass count goes to zero.
|
||||
|
||||
Examples of this kind of "multi-level-reference-counting" can be found in
|
||||
memory management ("struct mm_struct": mm_users and mm_count), and in
|
||||
filesystem code ("struct super_block": s_count and s_active).
|
||||
|
||||
Remember: if another thread can find your data structure, and you don't
|
||||
have a reference count on it, you almost certainly have a bug.
|
||||
|
||||
|
||||
Chapter 12: Macros, Enums and RTL
|
||||
|
||||
Names of macros defining constants and labels in enums are capitalized.
|
||||
|
||||
#define CONSTANT 0x12345
|
||||
|
||||
Enums are preferred when defining several related constants.
|
||||
|
||||
CAPITALIZED macro names are appreciated but macros resembling functions
|
||||
may be named in lower case.
|
||||
|
||||
Generally, inline functions are preferable to macros resembling functions.
|
||||
|
||||
Macros with multiple statements should be enclosed in a do - while block:
|
||||
|
||||
#define macrofun(a, b, c) \
|
||||
do { \
|
||||
if (a == 5) \
|
||||
do_this(b, c); \
|
||||
} while (0)
|
||||
|
||||
Things to avoid when using macros:
|
||||
|
||||
1) macros that affect control flow:
|
||||
|
||||
#define FOO(x) \
|
||||
do { \
|
||||
if (blah(x) < 0) \
|
||||
return -EBUGGERED; \
|
||||
} while(0)
|
||||
|
||||
is a _very_ bad idea. It looks like a function call but exits the "calling"
|
||||
function; don't break the internal parsers of those who will read the code.
|
||||
|
||||
2) macros that depend on having a local variable with a magic name:
|
||||
|
||||
#define FOO(val) bar(index, val)
|
||||
|
||||
might look like a good thing, but it's confusing as hell when one reads the
|
||||
code and it's prone to breakage from seemingly innocent changes.
|
||||
|
||||
3) macros with arguments that are used as l-values: FOO(x) = y; will
|
||||
bite you if somebody e.g. turns FOO into an inline function.
|
||||
|
||||
4) forgetting about precedence: macros defining constants using expressions
|
||||
must enclose the expression in parentheses. Beware of similar issues with
|
||||
macros using parameters.
|
||||
|
||||
#define CONSTANT 0x4000
|
||||
#define CONSTEXP (CONSTANT | 3)
|
||||
|
||||
The cpp manual deals with macros exhaustively. The gcc internals manual also
|
||||
covers RTL which is used frequently with assembly language in the kernel.
|
||||
|
||||
|
||||
Chapter 13: Printing kernel messages
|
||||
|
||||
Kernel developers like to be seen as literate. Do mind the spelling
|
||||
of kernel messages to make a good impression. Do not use crippled
|
||||
words like "dont" and use "do not" or "don't" instead.
|
||||
|
||||
Kernel messages do not have to be terminated with a period.
|
||||
|
||||
Printing numbers in parentheses (%d) adds no value and should be avoided.
|
||||
|
||||
|
||||
Chapter 14: Allocating memory
|
||||
|
||||
The kernel provides the following general purpose memory allocators:
|
||||
kmalloc(), kzalloc(), kcalloc(), and vmalloc(). Please refer to the API
|
||||
documentation for further information about them.
|
||||
|
||||
The preferred form for passing a size of a struct is the following:
|
||||
|
||||
p = kmalloc(sizeof(*p), ...);
|
||||
|
||||
The alternative form where struct name is spelled out hurts readability and
|
||||
introduces an opportunity for a bug when the pointer variable type is changed
|
||||
but the corresponding sizeof that is passed to a memory allocator is not.
|
||||
|
||||
Casting the return value which is a void pointer is redundant. The conversion
|
||||
from void pointer to any other pointer type is guaranteed by the C programming
|
||||
language.
|
||||
|
||||
|
||||
Chapter 15: The inline disease
|
||||
|
||||
There appears to be a common misperception that gcc has a magic "make me
|
||||
faster" speedup option called "inline". While the use of inlines can be
|
||||
appropriate (for example as a means of replacing macros, see Chapter 11), it
|
||||
very often is not. Abundant use of the inline keyword leads to a much bigger
|
||||
kernel, which in turn slows the system as a whole down, due to a bigger
|
||||
icache footprint for the CPU and simply because there is less memory
|
||||
available for the pagecache. Just think about it; a pagecache miss causes a
|
||||
disk seek, which easily takes 5 miliseconds. There are a LOT of cpu cycles
|
||||
that can go into these 5 miliseconds.
|
||||
|
||||
A reasonable rule of thumb is to not put inline at functions that have more
|
||||
than 3 lines of code in them. An exception to this rule are the cases where
|
||||
a parameter is known to be a compiletime constant, and as a result of this
|
||||
constantness you *know* the compiler will be able to optimize most of your
|
||||
function away at compile time. For a good example of this later case, see
|
||||
the kmalloc() inline function.
|
||||
|
||||
Often people argue that adding inline to functions that are static and used
|
||||
only once is always a win since there is no space tradeoff. While this is
|
||||
technically correct, gcc is capable of inlining these automatically without
|
||||
help, and the maintenance issue of removing the inline when a second user
|
||||
appears outweighs the potential value of the hint that tells gcc to do
|
||||
something it would have done anyway.
|
||||
|
||||
|
||||
Chapter 16: Function return values and names
|
||||
|
||||
Functions can return values of many different kinds, and one of the
|
||||
most common is a value indicating whether the function succeeded or
|
||||
failed. Such a value can be represented as an error-code integer
|
||||
(-Exxx = failure, 0 = success) or a "succeeded" boolean (0 = failure,
|
||||
non-zero = success).
|
||||
|
||||
Mixing up these two sorts of representations is a fertile source of
|
||||
difficult-to-find bugs. If the C language included a strong distinction
|
||||
between integers and booleans then the compiler would find these mistakes
|
||||
for us... but it doesn't. To help prevent such bugs, always follow this
|
||||
convention:
|
||||
|
||||
If the name of a function is an action or an imperative command,
|
||||
the function should return an error-code integer. If the name
|
||||
is a predicate, the function should return a "succeeded" boolean.
|
||||
|
||||
For example, "add work" is a command, and the add_work() function returns 0
|
||||
for success or -EBUSY for failure. In the same way, "PCI device present" is
|
||||
a predicate, and the pci_dev_present() function returns 1 if it succeeds in
|
||||
finding a matching device or 0 if it doesn't.
|
||||
|
||||
All EXPORTed functions must respect this convention, and so should all
|
||||
public functions. Private (static) functions need not, but it is
|
||||
recommended that they do.
|
||||
|
||||
Functions whose return value is the actual result of a computation, rather
|
||||
than an indication of whether the computation succeeded, are not subject to
|
||||
this rule. Generally they indicate failure by returning some out-of-range
|
||||
result. Typical examples would be functions that return pointers; they use
|
||||
NULL or the ERR_PTR mechanism to report failure.
|
||||
|
||||
|
||||
Chapter 17: Don't re-invent the kernel macros
|
||||
|
||||
The header file include/linux/kernel.h contains a number of macros that
|
||||
you should use, rather than explicitly coding some variant of them yourself.
|
||||
For example, if you need to calculate the length of an array, take advantage
|
||||
of the macro
|
||||
|
||||
#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
|
||||
|
||||
Similarly, if you need to calculate the size of some structure member, use
|
||||
|
||||
#define FIELD_SIZEOF(t, f) (sizeof(((t*)0)->f))
|
||||
|
||||
There are also min() and max() macros that do strict type checking if you
|
||||
need them. Feel free to peruse that header file to see what else is already
|
||||
defined that you shouldn't reproduce in your code.
|
||||
|
||||
|
||||
|
||||
Appendix I: References
|
||||
|
||||
The C Programming Language, Second Edition
|
||||
by Brian W. Kernighan and Dennis M. Ritchie.
|
||||
Prentice Hall, Inc., 1988.
|
||||
ISBN 0-13-110362-8 (paperback), 0-13-110370-9 (hardback).
|
||||
URL: http://cm.bell-labs.com/cm/cs/cbook/
|
||||
|
||||
The Practice of Programming
|
||||
by Brian W. Kernighan and Rob Pike.
|
||||
Addison-Wesley, Inc., 1999.
|
||||
ISBN 0-201-61586-X.
|
||||
URL: http://cm.bell-labs.com/cm/cs/tpop/
|
||||
|
||||
GNU manuals - where in compliance with K&R and this text - for cpp, gcc,
|
||||
gcc internals and indent, all available from http://www.gnu.org/manual/
|
||||
|
||||
WG14 is the international standardization working group for the programming
|
||||
language C, URL: http://www.open-std.org/JTC1/SC22/WG14/
|
||||
|
||||
Kernel CodingStyle, by greg@kroah.com at OLS 2002:
|
||||
http://www.kroah.com/linux/talks/ols_2002_kernel_codingstyle_talk/html/
|
||||
|
||||
--
|
||||
Last updated on 2006-December-06.
|
||||
549
Documentation/DMA-API.txt
Normal file
549
Documentation/DMA-API.txt
Normal file
@ -0,0 +1,549 @@
|
||||
Dynamic DMA mapping using the generic device
|
||||
============================================
|
||||
|
||||
James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
|
||||
|
||||
This document describes the DMA API. For a more gentle introduction
|
||||
phrased in terms of the pci_ equivalents (and actual examples) see
|
||||
DMA-mapping.txt
|
||||
|
||||
This API is split into two pieces. Part I describes the API and the
|
||||
corresponding pci_ API. Part II describes the extensions to the API
|
||||
for supporting non-consistent memory machines. Unless you know that
|
||||
your driver absolutely has to support non-consistent platforms (this
|
||||
is usually only legacy platforms) you should only use the API
|
||||
described in part I.
|
||||
|
||||
Part I - pci_ and dma_ Equivalent API
|
||||
-------------------------------------
|
||||
|
||||
To get the pci_ API, you must #include <linux/pci.h>
|
||||
To get the dma_ API, you must #include <linux/dma-mapping.h>
|
||||
|
||||
|
||||
Part Ia - Using large dma-coherent buffers
|
||||
------------------------------------------
|
||||
|
||||
void *
|
||||
dma_alloc_coherent(struct device *dev, size_t size,
|
||||
dma_addr_t *dma_handle, int flag)
|
||||
void *
|
||||
pci_alloc_consistent(struct pci_dev *dev, size_t size,
|
||||
dma_addr_t *dma_handle)
|
||||
|
||||
Consistent memory is memory for which a write by either the device or
|
||||
the processor can immediately be read by the processor or device
|
||||
without having to worry about caching effects. (You may however need
|
||||
to make sure to flush the processor's write buffers before telling
|
||||
devices to read that memory.)
|
||||
|
||||
This routine allocates a region of <size> bytes of consistent memory.
|
||||
it also returns a <dma_handle> which may be cast to an unsigned
|
||||
integer the same width as the bus and used as the physical address
|
||||
base of the region.
|
||||
|
||||
Returns: a pointer to the allocated region (in the processor's virtual
|
||||
address space) or NULL if the allocation failed.
|
||||
|
||||
Note: consistent memory can be expensive on some platforms, and the
|
||||
minimum allocation length may be as big as a page, so you should
|
||||
consolidate your requests for consistent memory as much as possible.
|
||||
The simplest way to do that is to use the dma_pool calls (see below).
|
||||
|
||||
The flag parameter (dma_alloc_coherent only) allows the caller to
|
||||
specify the GFP_ flags (see kmalloc) for the allocation (the
|
||||
implementation may chose to ignore flags that affect the location of
|
||||
the returned memory, like GFP_DMA). For pci_alloc_consistent, you
|
||||
must assume GFP_ATOMIC behaviour.
|
||||
|
||||
void
|
||||
dma_free_coherent(struct device *dev, size_t size, void *cpu_addr
|
||||
dma_addr_t dma_handle)
|
||||
void
|
||||
pci_free_consistent(struct pci_dev *dev, size_t size, void *cpu_addr
|
||||
dma_addr_t dma_handle)
|
||||
|
||||
Free the region of consistent memory you previously allocated. dev,
|
||||
size and dma_handle must all be the same as those passed into the
|
||||
consistent allocate. cpu_addr must be the virtual address returned by
|
||||
the consistent allocate
|
||||
|
||||
|
||||
Part Ib - Using small dma-coherent buffers
|
||||
------------------------------------------
|
||||
|
||||
To get this part of the dma_ API, you must #include <linux/dmapool.h>
|
||||
|
||||
Many drivers need lots of small dma-coherent memory regions for DMA
|
||||
descriptors or I/O buffers. Rather than allocating in units of a page
|
||||
or more using dma_alloc_coherent(), you can use DMA pools. These work
|
||||
much like a struct kmem_cache, except that they use the dma-coherent allocator
|
||||
not __get_free_pages(). Also, they understand common hardware constraints
|
||||
for alignment, like queue heads needing to be aligned on N byte boundaries.
|
||||
|
||||
|
||||
struct dma_pool *
|
||||
dma_pool_create(const char *name, struct device *dev,
|
||||
size_t size, size_t align, size_t alloc);
|
||||
|
||||
struct pci_pool *
|
||||
pci_pool_create(const char *name, struct pci_device *dev,
|
||||
size_t size, size_t align, size_t alloc);
|
||||
|
||||
The pool create() routines initialize a pool of dma-coherent buffers
|
||||
for use with a given device. It must be called in a context which
|
||||
can sleep.
|
||||
|
||||
The "name" is for diagnostics (like a struct kmem_cache name); dev and size
|
||||
are like what you'd pass to dma_alloc_coherent(). The device's hardware
|
||||
alignment requirement for this type of data is "align" (which is expressed
|
||||
in bytes, and must be a power of two). If your device has no boundary
|
||||
crossing restrictions, pass 0 for alloc; passing 4096 says memory allocated
|
||||
from this pool must not cross 4KByte boundaries.
|
||||
|
||||
|
||||
void *dma_pool_alloc(struct dma_pool *pool, int gfp_flags,
|
||||
dma_addr_t *dma_handle);
|
||||
|
||||
void *pci_pool_alloc(struct pci_pool *pool, int gfp_flags,
|
||||
dma_addr_t *dma_handle);
|
||||
|
||||
This allocates memory from the pool; the returned memory will meet the size
|
||||
and alignment requirements specified at creation time. Pass GFP_ATOMIC to
|
||||
prevent blocking, or if it's permitted (not in_interrupt, not holding SMP locks)
|
||||
pass GFP_KERNEL to allow blocking. Like dma_alloc_coherent(), this returns
|
||||
two values: an address usable by the cpu, and the dma address usable by the
|
||||
pool's device.
|
||||
|
||||
|
||||
void dma_pool_free(struct dma_pool *pool, void *vaddr,
|
||||
dma_addr_t addr);
|
||||
|
||||
void pci_pool_free(struct pci_pool *pool, void *vaddr,
|
||||
dma_addr_t addr);
|
||||
|
||||
This puts memory back into the pool. The pool is what was passed to
|
||||
the pool allocation routine; the cpu and dma addresses are what
|
||||
were returned when that routine allocated the memory being freed.
|
||||
|
||||
|
||||
void dma_pool_destroy(struct dma_pool *pool);
|
||||
|
||||
void pci_pool_destroy(struct pci_pool *pool);
|
||||
|
||||
The pool destroy() routines free the resources of the pool. They must be
|
||||
called in a context which can sleep. Make sure you've freed all allocated
|
||||
memory back to the pool before you destroy it.
|
||||
|
||||
|
||||
Part Ic - DMA addressing limitations
|
||||
------------------------------------
|
||||
|
||||
int
|
||||
dma_supported(struct device *dev, u64 mask)
|
||||
int
|
||||
pci_dma_supported(struct device *dev, u64 mask)
|
||||
|
||||
Checks to see if the device can support DMA to the memory described by
|
||||
mask.
|
||||
|
||||
Returns: 1 if it can and 0 if it can't.
|
||||
|
||||
Notes: This routine merely tests to see if the mask is possible. It
|
||||
won't change the current mask settings. It is more intended as an
|
||||
internal API for use by the platform than an external API for use by
|
||||
driver writers.
|
||||
|
||||
int
|
||||
dma_set_mask(struct device *dev, u64 mask)
|
||||
int
|
||||
pci_set_dma_mask(struct pci_device *dev, u64 mask)
|
||||
|
||||
Checks to see if the mask is possible and updates the device
|
||||
parameters if it is.
|
||||
|
||||
Returns: 0 if successful and a negative error if not.
|
||||
|
||||
u64
|
||||
dma_get_required_mask(struct device *dev)
|
||||
|
||||
After setting the mask with dma_set_mask(), this API returns the
|
||||
actual mask (within that already set) that the platform actually
|
||||
requires to operate efficiently. Usually this means the returned mask
|
||||
is the minimum required to cover all of memory. Examining the
|
||||
required mask gives drivers with variable descriptor sizes the
|
||||
opportunity to use smaller descriptors as necessary.
|
||||
|
||||
Requesting the required mask does not alter the current mask. If you
|
||||
wish to take advantage of it, you should issue another dma_set_mask()
|
||||
call to lower the mask again.
|
||||
|
||||
|
||||
Part Id - Streaming DMA mappings
|
||||
--------------------------------
|
||||
|
||||
dma_addr_t
|
||||
dma_map_single(struct device *dev, void *cpu_addr, size_t size,
|
||||
enum dma_data_direction direction)
|
||||
dma_addr_t
|
||||
pci_map_single(struct device *dev, void *cpu_addr, size_t size,
|
||||
int direction)
|
||||
|
||||
Maps a piece of processor virtual memory so it can be accessed by the
|
||||
device and returns the physical handle of the memory.
|
||||
|
||||
The direction for both api's may be converted freely by casting.
|
||||
However the dma_ API uses a strongly typed enumerator for its
|
||||
direction:
|
||||
|
||||
DMA_NONE = PCI_DMA_NONE no direction (used for
|
||||
debugging)
|
||||
DMA_TO_DEVICE = PCI_DMA_TODEVICE data is going from the
|
||||
memory to the device
|
||||
DMA_FROM_DEVICE = PCI_DMA_FROMDEVICE data is coming from
|
||||
the device to the
|
||||
memory
|
||||
DMA_BIDIRECTIONAL = PCI_DMA_BIDIRECTIONAL direction isn't known
|
||||
|
||||
Notes: Not all memory regions in a machine can be mapped by this
|
||||
API. Further, regions that appear to be physically contiguous in
|
||||
kernel virtual space may not be contiguous as physical memory. Since
|
||||
this API does not provide any scatter/gather capability, it will fail
|
||||
if the user tries to map a non physically contiguous piece of memory.
|
||||
For this reason, it is recommended that memory mapped by this API be
|
||||
obtained only from sources which guarantee to be physically contiguous
|
||||
(like kmalloc).
|
||||
|
||||
Further, the physical address of the memory must be within the
|
||||
dma_mask of the device (the dma_mask represents a bit mask of the
|
||||
addressable region for the device. i.e. if the physical address of
|
||||
the memory anded with the dma_mask is still equal to the physical
|
||||
address, then the device can perform DMA to the memory). In order to
|
||||
ensure that the memory allocated by kmalloc is within the dma_mask,
|
||||
the driver may specify various platform dependent flags to restrict
|
||||
the physical memory range of the allocation (e.g. on x86, GFP_DMA
|
||||
guarantees to be within the first 16Mb of available physical memory,
|
||||
as required by ISA devices).
|
||||
|
||||
Note also that the above constraints on physical contiguity and
|
||||
dma_mask may not apply if the platform has an IOMMU (a device which
|
||||
supplies a physical to virtual mapping between the I/O memory bus and
|
||||
the device). However, to be portable, device driver writers may *not*
|
||||
assume that such an IOMMU exists.
|
||||
|
||||
Warnings: Memory coherency operates at a granularity called the cache
|
||||
line width. In order for memory mapped by this API to operate
|
||||
correctly, the mapped region must begin exactly on a cache line
|
||||
boundary and end exactly on one (to prevent two separately mapped
|
||||
regions from sharing a single cache line). Since the cache line size
|
||||
may not be known at compile time, the API will not enforce this
|
||||
requirement. Therefore, it is recommended that driver writers who
|
||||
don't take special care to determine the cache line size at run time
|
||||
only map virtual regions that begin and end on page boundaries (which
|
||||
are guaranteed also to be cache line boundaries).
|
||||
|
||||
DMA_TO_DEVICE synchronisation must be done after the last modification
|
||||
of the memory region by the software and before it is handed off to
|
||||
the driver. Once this primitive is used. Memory covered by this
|
||||
primitive should be treated as read only by the device. If the device
|
||||
may write to it at any point, it should be DMA_BIDIRECTIONAL (see
|
||||
below).
|
||||
|
||||
DMA_FROM_DEVICE synchronisation must be done before the driver
|
||||
accesses data that may be changed by the device. This memory should
|
||||
be treated as read only by the driver. If the driver needs to write
|
||||
to it at any point, it should be DMA_BIDIRECTIONAL (see below).
|
||||
|
||||
DMA_BIDIRECTIONAL requires special handling: it means that the driver
|
||||
isn't sure if the memory was modified before being handed off to the
|
||||
device and also isn't sure if the device will also modify it. Thus,
|
||||
you must always sync bidirectional memory twice: once before the
|
||||
memory is handed off to the device (to make sure all memory changes
|
||||
are flushed from the processor) and once before the data may be
|
||||
accessed after being used by the device (to make sure any processor
|
||||
cache lines are updated with data that the device may have changed.
|
||||
|
||||
void
|
||||
dma_unmap_single(struct device *dev, dma_addr_t dma_addr, size_t size,
|
||||
enum dma_data_direction direction)
|
||||
void
|
||||
pci_unmap_single(struct pci_dev *hwdev, dma_addr_t dma_addr,
|
||||
size_t size, int direction)
|
||||
|
||||
Unmaps the region previously mapped. All the parameters passed in
|
||||
must be identical to those passed in (and returned) by the mapping
|
||||
API.
|
||||
|
||||
dma_addr_t
|
||||
dma_map_page(struct device *dev, struct page *page,
|
||||
unsigned long offset, size_t size,
|
||||
enum dma_data_direction direction)
|
||||
dma_addr_t
|
||||
pci_map_page(struct pci_dev *hwdev, struct page *page,
|
||||
unsigned long offset, size_t size, int direction)
|
||||
void
|
||||
dma_unmap_page(struct device *dev, dma_addr_t dma_address, size_t size,
|
||||
enum dma_data_direction direction)
|
||||
void
|
||||
pci_unmap_page(struct pci_dev *hwdev, dma_addr_t dma_address,
|
||||
size_t size, int direction)
|
||||
|
||||
API for mapping and unmapping for pages. All the notes and warnings
|
||||
for the other mapping APIs apply here. Also, although the <offset>
|
||||
and <size> parameters are provided to do partial page mapping, it is
|
||||
recommended that you never use these unless you really know what the
|
||||
cache width is.
|
||||
|
||||
int
|
||||
dma_mapping_error(dma_addr_t dma_addr)
|
||||
|
||||
int
|
||||
pci_dma_mapping_error(dma_addr_t dma_addr)
|
||||
|
||||
In some circumstances dma_map_single and dma_map_page will fail to create
|
||||
a mapping. A driver can check for these errors by testing the returned
|
||||
dma address with dma_mapping_error(). A non zero return value means the mapping
|
||||
could not be created and the driver should take appropriate action (eg
|
||||
reduce current DMA mapping usage or delay and try again later).
|
||||
|
||||
int
|
||||
dma_map_sg(struct device *dev, struct scatterlist *sg,
|
||||
int nents, enum dma_data_direction direction)
|
||||
int
|
||||
pci_map_sg(struct pci_dev *hwdev, struct scatterlist *sg,
|
||||
int nents, int direction)
|
||||
|
||||
Maps a scatter gather list from the block layer.
|
||||
|
||||
Returns: the number of physical segments mapped (this may be shorted
|
||||
than <nents> passed in if the block layer determines that some
|
||||
elements of the scatter/gather list are physically adjacent and thus
|
||||
may be mapped with a single entry).
|
||||
|
||||
Please note that the sg cannot be mapped again if it has been mapped once.
|
||||
The mapping process is allowed to destroy information in the sg.
|
||||
|
||||
As with the other mapping interfaces, dma_map_sg can fail. When it
|
||||
does, 0 is returned and a driver must take appropriate action. It is
|
||||
critical that the driver do something, in the case of a block driver
|
||||
aborting the request or even oopsing is better than doing nothing and
|
||||
corrupting the filesystem.
|
||||
|
||||
With scatterlists, you use the resulting mapping like this:
|
||||
|
||||
int i, count = dma_map_sg(dev, sglist, nents, direction);
|
||||
struct scatterlist *sg;
|
||||
|
||||
for (i = 0, sg = sglist; i < count; i++, sg++) {
|
||||
hw_address[i] = sg_dma_address(sg);
|
||||
hw_len[i] = sg_dma_len(sg);
|
||||
}
|
||||
|
||||
where nents is the number of entries in the sglist.
|
||||
|
||||
The implementation is free to merge several consecutive sglist entries
|
||||
into one (e.g. with an IOMMU, or if several pages just happen to be
|
||||
physically contiguous) and returns the actual number of sg entries it
|
||||
mapped them to. On failure 0, is returned.
|
||||
|
||||
Then you should loop count times (note: this can be less than nents times)
|
||||
and use sg_dma_address() and sg_dma_len() macros where you previously
|
||||
accessed sg->address and sg->length as shown above.
|
||||
|
||||
void
|
||||
dma_unmap_sg(struct device *dev, struct scatterlist *sg,
|
||||
int nhwentries, enum dma_data_direction direction)
|
||||
void
|
||||
pci_unmap_sg(struct pci_dev *hwdev, struct scatterlist *sg,
|
||||
int nents, int direction)
|
||||
|
||||
unmap the previously mapped scatter/gather list. All the parameters
|
||||
must be the same as those and passed in to the scatter/gather mapping
|
||||
API.
|
||||
|
||||
Note: <nents> must be the number you passed in, *not* the number of
|
||||
physical entries returned.
|
||||
|
||||
void
|
||||
dma_sync_single(struct device *dev, dma_addr_t dma_handle, size_t size,
|
||||
enum dma_data_direction direction)
|
||||
void
|
||||
pci_dma_sync_single(struct pci_dev *hwdev, dma_addr_t dma_handle,
|
||||
size_t size, int direction)
|
||||
void
|
||||
dma_sync_sg(struct device *dev, struct scatterlist *sg, int nelems,
|
||||
enum dma_data_direction direction)
|
||||
void
|
||||
pci_dma_sync_sg(struct pci_dev *hwdev, struct scatterlist *sg,
|
||||
int nelems, int direction)
|
||||
|
||||
synchronise a single contiguous or scatter/gather mapping. All the
|
||||
parameters must be the same as those passed into the single mapping
|
||||
API.
|
||||
|
||||
Notes: You must do this:
|
||||
|
||||
- Before reading values that have been written by DMA from the device
|
||||
(use the DMA_FROM_DEVICE direction)
|
||||
- After writing values that will be written to the device using DMA
|
||||
(use the DMA_TO_DEVICE) direction
|
||||
- before *and* after handing memory to the device if the memory is
|
||||
DMA_BIDIRECTIONAL
|
||||
|
||||
See also dma_map_single().
|
||||
|
||||
|
||||
Part II - Advanced dma_ usage
|
||||
-----------------------------
|
||||
|
||||
Warning: These pieces of the DMA API have no PCI equivalent. They
|
||||
should also not be used in the majority of cases, since they cater for
|
||||
unlikely corner cases that don't belong in usual drivers.
|
||||
|
||||
If you don't understand how cache line coherency works between a
|
||||
processor and an I/O device, you should not be using this part of the
|
||||
API at all.
|
||||
|
||||
void *
|
||||
dma_alloc_noncoherent(struct device *dev, size_t size,
|
||||
dma_addr_t *dma_handle, int flag)
|
||||
|
||||
Identical to dma_alloc_coherent() except that the platform will
|
||||
choose to return either consistent or non-consistent memory as it sees
|
||||
fit. By using this API, you are guaranteeing to the platform that you
|
||||
have all the correct and necessary sync points for this memory in the
|
||||
driver should it choose to return non-consistent memory.
|
||||
|
||||
Note: where the platform can return consistent memory, it will
|
||||
guarantee that the sync points become nops.
|
||||
|
||||
Warning: Handling non-consistent memory is a real pain. You should
|
||||
only ever use this API if you positively know your driver will be
|
||||
required to work on one of the rare (usually non-PCI) architectures
|
||||
that simply cannot make consistent memory.
|
||||
|
||||
void
|
||||
dma_free_noncoherent(struct device *dev, size_t size, void *cpu_addr,
|
||||
dma_addr_t dma_handle)
|
||||
|
||||
free memory allocated by the nonconsistent API. All parameters must
|
||||
be identical to those passed in (and returned by
|
||||
dma_alloc_noncoherent()).
|
||||
|
||||
int
|
||||
dma_is_consistent(struct device *dev, dma_addr_t dma_handle)
|
||||
|
||||
returns true if the device dev is performing consistent DMA on the memory
|
||||
area pointed to by the dma_handle.
|
||||
|
||||
int
|
||||
dma_get_cache_alignment(void)
|
||||
|
||||
returns the processor cache alignment. This is the absolute minimum
|
||||
alignment *and* width that you must observe when either mapping
|
||||
memory or doing partial flushes.
|
||||
|
||||
Notes: This API may return a number *larger* than the actual cache
|
||||
line, but it will guarantee that one or more cache lines fit exactly
|
||||
into the width returned by this call. It will also always be a power
|
||||
of two for easy alignment
|
||||
|
||||
void
|
||||
dma_sync_single_range(struct device *dev, dma_addr_t dma_handle,
|
||||
unsigned long offset, size_t size,
|
||||
enum dma_data_direction direction)
|
||||
|
||||
does a partial sync. starting at offset and continuing for size. You
|
||||
must be careful to observe the cache alignment and width when doing
|
||||
anything like this. You must also be extra careful about accessing
|
||||
memory you intend to sync partially.
|
||||
|
||||
void
|
||||
dma_cache_sync(struct device *dev, void *vaddr, size_t size,
|
||||
enum dma_data_direction direction)
|
||||
|
||||
Do a partial sync of memory that was allocated by
|
||||
dma_alloc_noncoherent(), starting at virtual address vaddr and
|
||||
continuing on for size. Again, you *must* observe the cache line
|
||||
boundaries when doing this.
|
||||
|
||||
int
|
||||
dma_declare_coherent_memory(struct device *dev, dma_addr_t bus_addr,
|
||||
dma_addr_t device_addr, size_t size, int
|
||||
flags)
|
||||
|
||||
|
||||
Declare region of memory to be handed out by dma_alloc_coherent when
|
||||
it's asked for coherent memory for this device.
|
||||
|
||||
bus_addr is the physical address to which the memory is currently
|
||||
assigned in the bus responding region (this will be used by the
|
||||
platform to perform the mapping)
|
||||
|
||||
device_addr is the physical address the device needs to be programmed
|
||||
with actually to address this memory (this will be handed out as the
|
||||
dma_addr_t in dma_alloc_coherent())
|
||||
|
||||
size is the size of the area (must be multiples of PAGE_SIZE).
|
||||
|
||||
flags can be or'd together and are
|
||||
|
||||
DMA_MEMORY_MAP - request that the memory returned from
|
||||
dma_alloc_coherent() be directly writable.
|
||||
|
||||
DMA_MEMORY_IO - request that the memory returned from
|
||||
dma_alloc_coherent() be addressable using read/write/memcpy_toio etc.
|
||||
|
||||
One or both of these flags must be present
|
||||
|
||||
DMA_MEMORY_INCLUDES_CHILDREN - make the declared memory be allocated by
|
||||
dma_alloc_coherent of any child devices of this one (for memory residing
|
||||
on a bridge).
|
||||
|
||||
DMA_MEMORY_EXCLUSIVE - only allocate memory from the declared regions.
|
||||
Do not allow dma_alloc_coherent() to fall back to system memory when
|
||||
it's out of memory in the declared region.
|
||||
|
||||
The return value will be either DMA_MEMORY_MAP or DMA_MEMORY_IO and
|
||||
must correspond to a passed in flag (i.e. no returning DMA_MEMORY_IO
|
||||
if only DMA_MEMORY_MAP were passed in) for success or zero for
|
||||
failure.
|
||||
|
||||
Note, for DMA_MEMORY_IO returns, all subsequent memory returned by
|
||||
dma_alloc_coherent() may no longer be accessed directly, but instead
|
||||
must be accessed using the correct bus functions. If your driver
|
||||
isn't prepared to handle this contingency, it should not specify
|
||||
DMA_MEMORY_IO in the input flags.
|
||||
|
||||
As a simplification for the platforms, only *one* such region of
|
||||
memory may be declared per device.
|
||||
|
||||
For reasons of efficiency, most platforms choose to track the declared
|
||||
region only at the granularity of a page. For smaller allocations,
|
||||
you should use the dma_pool() API.
|
||||
|
||||
void
|
||||
dma_release_declared_memory(struct device *dev)
|
||||
|
||||
Remove the memory region previously declared from the system. This
|
||||
API performs *no* in-use checking for this region and will return
|
||||
unconditionally having removed all the required structures. It is the
|
||||
drivers job to ensure that no parts of this memory region are
|
||||
currently in use.
|
||||
|
||||
void *
|
||||
dma_mark_declared_memory_occupied(struct device *dev,
|
||||
dma_addr_t device_addr, size_t size)
|
||||
|
||||
This is used to occupy specific regions of the declared space
|
||||
(dma_alloc_coherent() will hand out the first free region it finds).
|
||||
|
||||
device_addr is the *device* address of the region requested
|
||||
|
||||
size is the size (and should be a page sized multiple).
|
||||
|
||||
The return value will be either a pointer to the processor virtual
|
||||
address of the memory, or an error (via PTR_ERR()) if any part of the
|
||||
region is occupied.
|
||||
|
||||
|
||||
151
Documentation/DMA-ISA-LPC.txt
Normal file
151
Documentation/DMA-ISA-LPC.txt
Normal file
@ -0,0 +1,151 @@
|
||||
DMA with ISA and LPC devices
|
||||
============================
|
||||
|
||||
Pierre Ossman <drzeus@drzeus.cx>
|
||||
|
||||
This document describes how to do DMA transfers using the old ISA DMA
|
||||
controller. Even though ISA is more or less dead today the LPC bus
|
||||
uses the same DMA system so it will be around for quite some time.
|
||||
|
||||
Part I - Headers and dependencies
|
||||
---------------------------------
|
||||
|
||||
To do ISA style DMA you need to include two headers:
|
||||
|
||||
#include <linux/dma-mapping.h>
|
||||
#include <asm/dma.h>
|
||||
|
||||
The first is the generic DMA API used to convert virtual addresses to
|
||||
physical addresses (see Documentation/DMA-API.txt for details).
|
||||
|
||||
The second contains the routines specific to ISA DMA transfers. Since
|
||||
this is not present on all platforms make sure you construct your
|
||||
Kconfig to be dependent on ISA_DMA_API (not ISA) so that nobody tries
|
||||
to build your driver on unsupported platforms.
|
||||
|
||||
Part II - Buffer allocation
|
||||
---------------------------
|
||||
|
||||
The ISA DMA controller has some very strict requirements on which
|
||||
memory it can access so extra care must be taken when allocating
|
||||
buffers.
|
||||
|
||||
(You usually need a special buffer for DMA transfers instead of
|
||||
transferring directly to and from your normal data structures.)
|
||||
|
||||
The DMA-able address space is the lowest 16 MB of _physical_ memory.
|
||||
Also the transfer block may not cross page boundaries (which are 64
|
||||
or 128 KiB depending on which channel you use).
|
||||
|
||||
In order to allocate a piece of memory that satisfies all these
|
||||
requirements you pass the flag GFP_DMA to kmalloc.
|
||||
|
||||
Unfortunately the memory available for ISA DMA is scarce so unless you
|
||||
allocate the memory during boot-up it's a good idea to also pass
|
||||
__GFP_REPEAT and __GFP_NOWARN to make the allocater try a bit harder.
|
||||
|
||||
(This scarcity also means that you should allocate the buffer as
|
||||
early as possible and not release it until the driver is unloaded.)
|
||||
|
||||
Part III - Address translation
|
||||
------------------------------
|
||||
|
||||
To translate the virtual address to a physical use the normal DMA
|
||||
API. Do _not_ use isa_virt_to_phys() even though it does the same
|
||||
thing. The reason for this is that the function isa_virt_to_phys()
|
||||
will require a Kconfig dependency to ISA, not just ISA_DMA_API which
|
||||
is really all you need. Remember that even though the DMA controller
|
||||
has its origins in ISA it is used elsewhere.
|
||||
|
||||
Note: x86_64 had a broken DMA API when it came to ISA but has since
|
||||
been fixed. If your arch has problems then fix the DMA API instead of
|
||||
reverting to the ISA functions.
|
||||
|
||||
Part IV - Channels
|
||||
------------------
|
||||
|
||||
A normal ISA DMA controller has 8 channels. The lower four are for
|
||||
8-bit transfers and the upper four are for 16-bit transfers.
|
||||
|
||||
(Actually the DMA controller is really two separate controllers where
|
||||
channel 4 is used to give DMA access for the second controller (0-3).
|
||||
This means that of the four 16-bits channels only three are usable.)
|
||||
|
||||
You allocate these in a similar fashion as all basic resources:
|
||||
|
||||
extern int request_dma(unsigned int dmanr, const char * device_id);
|
||||
extern void free_dma(unsigned int dmanr);
|
||||
|
||||
The ability to use 16-bit or 8-bit transfers is _not_ up to you as a
|
||||
driver author but depends on what the hardware supports. Check your
|
||||
specs or test different channels.
|
||||
|
||||
Part V - Transfer data
|
||||
----------------------
|
||||
|
||||
Now for the good stuff, the actual DMA transfer. :)
|
||||
|
||||
Before you use any ISA DMA routines you need to claim the DMA lock
|
||||
using claim_dma_lock(). The reason is that some DMA operations are
|
||||
not atomic so only one driver may fiddle with the registers at a
|
||||
time.
|
||||
|
||||
The first time you use the DMA controller you should call
|
||||
clear_dma_ff(). This clears an internal register in the DMA
|
||||
controller that is used for the non-atomic operations. As long as you
|
||||
(and everyone else) uses the locking functions then you only need to
|
||||
reset this once.
|
||||
|
||||
Next, you tell the controller in which direction you intend to do the
|
||||
transfer using set_dma_mode(). Currently you have the options
|
||||
DMA_MODE_READ and DMA_MODE_WRITE.
|
||||
|
||||
Set the address from where the transfer should start (this needs to
|
||||
be 16-bit aligned for 16-bit transfers) and how many bytes to
|
||||
transfer. Note that it's _bytes_. The DMA routines will do all the
|
||||
required translation to values that the DMA controller understands.
|
||||
|
||||
The final step is enabling the DMA channel and releasing the DMA
|
||||
lock.
|
||||
|
||||
Once the DMA transfer is finished (or timed out) you should disable
|
||||
the channel again. You should also check get_dma_residue() to make
|
||||
sure that all data has been transferred.
|
||||
|
||||
Example:
|
||||
|
||||
int flags, residue;
|
||||
|
||||
flags = claim_dma_lock();
|
||||
|
||||
clear_dma_ff();
|
||||
|
||||
set_dma_mode(channel, DMA_MODE_WRITE);
|
||||
set_dma_addr(channel, phys_addr);
|
||||
set_dma_count(channel, num_bytes);
|
||||
|
||||
dma_enable(channel);
|
||||
|
||||
release_dma_lock(flags);
|
||||
|
||||
while (!device_done());
|
||||
|
||||
flags = claim_dma_lock();
|
||||
|
||||
dma_disable(channel);
|
||||
|
||||
residue = dma_get_residue(channel);
|
||||
if (residue != 0)
|
||||
printk(KERN_ERR "driver: Incomplete DMA transfer!"
|
||||
" %d bytes left!\n", residue);
|
||||
|
||||
release_dma_lock(flags);
|
||||
|
||||
Part VI - Suspend/resume
|
||||
------------------------
|
||||
|
||||
It is the driver's responsibility to make sure that the machine isn't
|
||||
suspended while a DMA transfer is in progress. Also, all DMA settings
|
||||
are lost when the system suspends so if your driver relies on the DMA
|
||||
controller being in a certain state then you have to restore these
|
||||
registers upon resume.
|
||||
889
Documentation/DMA-mapping.txt
Normal file
889
Documentation/DMA-mapping.txt
Normal file
@ -0,0 +1,889 @@
|
||||
Dynamic DMA mapping
|
||||
===================
|
||||
|
||||
David S. Miller <davem@redhat.com>
|
||||
Richard Henderson <rth@cygnus.com>
|
||||
Jakub Jelinek <jakub@redhat.com>
|
||||
|
||||
This document describes the DMA mapping system in terms of the pci_
|
||||
API. For a similar API that works for generic devices, see
|
||||
DMA-API.txt.
|
||||
|
||||
Most of the 64bit platforms have special hardware that translates bus
|
||||
addresses (DMA addresses) into physical addresses. This is similar to
|
||||
how page tables and/or a TLB translates virtual addresses to physical
|
||||
addresses on a CPU. This is needed so that e.g. PCI devices can
|
||||
access with a Single Address Cycle (32bit DMA address) any page in the
|
||||
64bit physical address space. Previously in Linux those 64bit
|
||||
platforms had to set artificial limits on the maximum RAM size in the
|
||||
system, so that the virt_to_bus() static scheme works (the DMA address
|
||||
translation tables were simply filled on bootup to map each bus
|
||||
address to the physical page __pa(bus_to_virt())).
|
||||
|
||||
So that Linux can use the dynamic DMA mapping, it needs some help from the
|
||||
drivers, namely it has to take into account that DMA addresses should be
|
||||
mapped only for the time they are actually used and unmapped after the DMA
|
||||
transfer.
|
||||
|
||||
The following API will work of course even on platforms where no such
|
||||
hardware exists, see e.g. include/asm-i386/pci.h for how it is implemented on
|
||||
top of the virt_to_bus interface.
|
||||
|
||||
First of all, you should make sure
|
||||
|
||||
#include <linux/pci.h>
|
||||
|
||||
is in your driver. This file will obtain for you the definition of the
|
||||
dma_addr_t (which can hold any valid DMA address for the platform)
|
||||
type which should be used everywhere you hold a DMA (bus) address
|
||||
returned from the DMA mapping functions.
|
||||
|
||||
What memory is DMA'able?
|
||||
|
||||
The first piece of information you must know is what kernel memory can
|
||||
be used with the DMA mapping facilities. There has been an unwritten
|
||||
set of rules regarding this, and this text is an attempt to finally
|
||||
write them down.
|
||||
|
||||
If you acquired your memory via the page allocator
|
||||
(i.e. __get_free_page*()) or the generic memory allocators
|
||||
(i.e. kmalloc() or kmem_cache_alloc()) then you may DMA to/from
|
||||
that memory using the addresses returned from those routines.
|
||||
|
||||
This means specifically that you may _not_ use the memory/addresses
|
||||
returned from vmalloc() for DMA. It is possible to DMA to the
|
||||
_underlying_ memory mapped into a vmalloc() area, but this requires
|
||||
walking page tables to get the physical addresses, and then
|
||||
translating each of those pages back to a kernel address using
|
||||
something like __va(). [ EDIT: Update this when we integrate
|
||||
Gerd Knorr's generic code which does this. ]
|
||||
|
||||
This rule also means that you may use neither kernel image addresses
|
||||
(items in data/text/bss segments), nor module image addresses, nor
|
||||
stack addresses for DMA. These could all be mapped somewhere entirely
|
||||
different than the rest of physical memory. Even if those classes of
|
||||
memory could physically work with DMA, you'd need to ensure the I/O
|
||||
buffers were cacheline-aligned. Without that, you'd see cacheline
|
||||
sharing problems (data corruption) on CPUs with DMA-incoherent caches.
|
||||
(The CPU could write to one word, DMA would write to a different one
|
||||
in the same cache line, and one of them could be overwritten.)
|
||||
|
||||
Also, this means that you cannot take the return of a kmap()
|
||||
call and DMA to/from that. This is similar to vmalloc().
|
||||
|
||||
What about block I/O and networking buffers? The block I/O and
|
||||
networking subsystems make sure that the buffers they use are valid
|
||||
for you to DMA from/to.
|
||||
|
||||
DMA addressing limitations
|
||||
|
||||
Does your device have any DMA addressing limitations? For example, is
|
||||
your device only capable of driving the low order 24-bits of address
|
||||
on the PCI bus for SAC DMA transfers? If so, you need to inform the
|
||||
PCI layer of this fact.
|
||||
|
||||
By default, the kernel assumes that your device can address the full
|
||||
32-bits in a SAC cycle. For a 64-bit DAC capable device, this needs
|
||||
to be increased. And for a device with limitations, as discussed in
|
||||
the previous paragraph, it needs to be decreased.
|
||||
|
||||
pci_alloc_consistent() by default will return 32-bit DMA addresses.
|
||||
PCI-X specification requires PCI-X devices to support 64-bit
|
||||
addressing (DAC) for all transactions. And at least one platform (SGI
|
||||
SN2) requires 64-bit consistent allocations to operate correctly when
|
||||
the IO bus is in PCI-X mode. Therefore, like with pci_set_dma_mask(),
|
||||
it's good practice to call pci_set_consistent_dma_mask() to set the
|
||||
appropriate mask even if your device only supports 32-bit DMA
|
||||
(default) and especially if it's a PCI-X device.
|
||||
|
||||
For correct operation, you must interrogate the PCI layer in your
|
||||
device probe routine to see if the PCI controller on the machine can
|
||||
properly support the DMA addressing limitation your device has. It is
|
||||
good style to do this even if your device holds the default setting,
|
||||
because this shows that you did think about these issues wrt. your
|
||||
device.
|
||||
|
||||
The query is performed via a call to pci_set_dma_mask():
|
||||
|
||||
int pci_set_dma_mask(struct pci_dev *pdev, u64 device_mask);
|
||||
|
||||
The query for consistent allocations is performed via a call to
|
||||
pci_set_consistent_dma_mask():
|
||||
|
||||
int pci_set_consistent_dma_mask(struct pci_dev *pdev, u64 device_mask);
|
||||
|
||||
Here, pdev is a pointer to the PCI device struct of your device, and
|
||||
device_mask is a bit mask describing which bits of a PCI address your
|
||||
device supports. It returns zero if your card can perform DMA
|
||||
properly on the machine given the address mask you provided.
|
||||
|
||||
If it returns non-zero, your device cannot perform DMA properly on
|
||||
this platform, and attempting to do so will result in undefined
|
||||
behavior. You must either use a different mask, or not use DMA.
|
||||
|
||||
This means that in the failure case, you have three options:
|
||||
|
||||
1) Use another DMA mask, if possible (see below).
|
||||
2) Use some non-DMA mode for data transfer, if possible.
|
||||
3) Ignore this device and do not initialize it.
|
||||
|
||||
It is recommended that your driver print a kernel KERN_WARNING message
|
||||
when you end up performing either #2 or #3. In this manner, if a user
|
||||
of your driver reports that performance is bad or that the device is not
|
||||
even detected, you can ask them for the kernel messages to find out
|
||||
exactly why.
|
||||
|
||||
The standard 32-bit addressing PCI device would do something like
|
||||
this:
|
||||
|
||||
if (pci_set_dma_mask(pdev, DMA_32BIT_MASK)) {
|
||||
printk(KERN_WARNING
|
||||
"mydev: No suitable DMA available.\n");
|
||||
goto ignore_this_device;
|
||||
}
|
||||
|
||||
Another common scenario is a 64-bit capable device. The approach
|
||||
here is to try for 64-bit DAC addressing, but back down to a
|
||||
32-bit mask should that fail. The PCI platform code may fail the
|
||||
64-bit mask not because the platform is not capable of 64-bit
|
||||
addressing. Rather, it may fail in this case simply because
|
||||
32-bit SAC addressing is done more efficiently than DAC addressing.
|
||||
Sparc64 is one platform which behaves in this way.
|
||||
|
||||
Here is how you would handle a 64-bit capable device which can drive
|
||||
all 64-bits when accessing streaming DMA:
|
||||
|
||||
int using_dac;
|
||||
|
||||
if (!pci_set_dma_mask(pdev, DMA_64BIT_MASK)) {
|
||||
using_dac = 1;
|
||||
} else if (!pci_set_dma_mask(pdev, DMA_32BIT_MASK)) {
|
||||
using_dac = 0;
|
||||
} else {
|
||||
printk(KERN_WARNING
|
||||
"mydev: No suitable DMA available.\n");
|
||||
goto ignore_this_device;
|
||||
}
|
||||
|
||||
If a card is capable of using 64-bit consistent allocations as well,
|
||||
the case would look like this:
|
||||
|
||||
int using_dac, consistent_using_dac;
|
||||
|
||||
if (!pci_set_dma_mask(pdev, DMA_64BIT_MASK)) {
|
||||
using_dac = 1;
|
||||
consistent_using_dac = 1;
|
||||
pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK);
|
||||
} else if (!pci_set_dma_mask(pdev, DMA_32BIT_MASK)) {
|
||||
using_dac = 0;
|
||||
consistent_using_dac = 0;
|
||||
pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK);
|
||||
} else {
|
||||
printk(KERN_WARNING
|
||||
"mydev: No suitable DMA available.\n");
|
||||
goto ignore_this_device;
|
||||
}
|
||||
|
||||
pci_set_consistent_dma_mask() will always be able to set the same or a
|
||||
smaller mask as pci_set_dma_mask(). However for the rare case that a
|
||||
device driver only uses consistent allocations, one would have to
|
||||
check the return value from pci_set_consistent_dma_mask().
|
||||
|
||||
If your 64-bit device is going to be an enormous consumer of DMA
|
||||
mappings, this can be problematic since the DMA mappings are a
|
||||
finite resource on many platforms. Please see the "DAC Addressing
|
||||
for Address Space Hungry Devices" section near the end of this
|
||||
document for how to handle this case.
|
||||
|
||||
Finally, if your device can only drive the low 24-bits of
|
||||
address during PCI bus mastering you might do something like:
|
||||
|
||||
if (pci_set_dma_mask(pdev, DMA_24BIT_MASK)) {
|
||||
printk(KERN_WARNING
|
||||
"mydev: 24-bit DMA addressing not available.\n");
|
||||
goto ignore_this_device;
|
||||
}
|
||||
[Better use DMA_24BIT_MASK instead of 0x00ffffff.
|
||||
See linux/include/dma-mapping.h for reference.]
|
||||
|
||||
When pci_set_dma_mask() is successful, and returns zero, the PCI layer
|
||||
saves away this mask you have provided. The PCI layer will use this
|
||||
information later when you make DMA mappings.
|
||||
|
||||
There is a case which we are aware of at this time, which is worth
|
||||
mentioning in this documentation. If your device supports multiple
|
||||
functions (for example a sound card provides playback and record
|
||||
functions) and the various different functions have _different_
|
||||
DMA addressing limitations, you may wish to probe each mask and
|
||||
only provide the functionality which the machine can handle. It
|
||||
is important that the last call to pci_set_dma_mask() be for the
|
||||
most specific mask.
|
||||
|
||||
Here is pseudo-code showing how this might be done:
|
||||
|
||||
#define PLAYBACK_ADDRESS_BITS DMA_32BIT_MASK
|
||||
#define RECORD_ADDRESS_BITS 0x00ffffff
|
||||
|
||||
struct my_sound_card *card;
|
||||
struct pci_dev *pdev;
|
||||
|
||||
...
|
||||
if (!pci_set_dma_mask(pdev, PLAYBACK_ADDRESS_BITS)) {
|
||||
card->playback_enabled = 1;
|
||||
} else {
|
||||
card->playback_enabled = 0;
|
||||
printk(KERN_WARN "%s: Playback disabled due to DMA limitations.\n",
|
||||
card->name);
|
||||
}
|
||||
if (!pci_set_dma_mask(pdev, RECORD_ADDRESS_BITS)) {
|
||||
card->record_enabled = 1;
|
||||
} else {
|
||||
card->record_enabled = 0;
|
||||
printk(KERN_WARN "%s: Record disabled due to DMA limitations.\n",
|
||||
card->name);
|
||||
}
|
||||
|
||||
A sound card was used as an example here because this genre of PCI
|
||||
devices seems to be littered with ISA chips given a PCI front end,
|
||||
and thus retaining the 16MB DMA addressing limitations of ISA.
|
||||
|
||||
Types of DMA mappings
|
||||
|
||||
There are two types of DMA mappings:
|
||||
|
||||
- Consistent DMA mappings which are usually mapped at driver
|
||||
initialization, unmapped at the end and for which the hardware should
|
||||
guarantee that the device and the CPU can access the data
|
||||
in parallel and will see updates made by each other without any
|
||||
explicit software flushing.
|
||||
|
||||
Think of "consistent" as "synchronous" or "coherent".
|
||||
|
||||
The current default is to return consistent memory in the low 32
|
||||
bits of the PCI bus space. However, for future compatibility you
|
||||
should set the consistent mask even if this default is fine for your
|
||||
driver.
|
||||
|
||||
Good examples of what to use consistent mappings for are:
|
||||
|
||||
- Network card DMA ring descriptors.
|
||||
- SCSI adapter mailbox command data structures.
|
||||
- Device firmware microcode executed out of
|
||||
main memory.
|
||||
|
||||
The invariant these examples all require is that any CPU store
|
||||
to memory is immediately visible to the device, and vice
|
||||
versa. Consistent mappings guarantee this.
|
||||
|
||||
IMPORTANT: Consistent DMA memory does not preclude the usage of
|
||||
proper memory barriers. The CPU may reorder stores to
|
||||
consistent memory just as it may normal memory. Example:
|
||||
if it is important for the device to see the first word
|
||||
of a descriptor updated before the second, you must do
|
||||
something like:
|
||||
|
||||
desc->word0 = address;
|
||||
wmb();
|
||||
desc->word1 = DESC_VALID;
|
||||
|
||||
in order to get correct behavior on all platforms.
|
||||
|
||||
Also, on some platforms your driver may need to flush CPU write
|
||||
buffers in much the same way as it needs to flush write buffers
|
||||
found in PCI bridges (such as by reading a register's value
|
||||
after writing it).
|
||||
|
||||
- Streaming DMA mappings which are usually mapped for one DMA transfer,
|
||||
unmapped right after it (unless you use pci_dma_sync_* below) and for which
|
||||
hardware can optimize for sequential accesses.
|
||||
|
||||
This of "streaming" as "asynchronous" or "outside the coherency
|
||||
domain".
|
||||
|
||||
Good examples of what to use streaming mappings for are:
|
||||
|
||||
- Networking buffers transmitted/received by a device.
|
||||
- Filesystem buffers written/read by a SCSI device.
|
||||
|
||||
The interfaces for using this type of mapping were designed in
|
||||
such a way that an implementation can make whatever performance
|
||||
optimizations the hardware allows. To this end, when using
|
||||
such mappings you must be explicit about what you want to happen.
|
||||
|
||||
Neither type of DMA mapping has alignment restrictions that come
|
||||
from PCI, although some devices may have such restrictions.
|
||||
Also, systems with caches that aren't DMA-coherent will work better
|
||||
when the underlying buffers don't share cache lines with other data.
|
||||
|
||||
|
||||
Using Consistent DMA mappings.
|
||||
|
||||
To allocate and map large (PAGE_SIZE or so) consistent DMA regions,
|
||||
you should do:
|
||||
|
||||
dma_addr_t dma_handle;
|
||||
|
||||
cpu_addr = pci_alloc_consistent(dev, size, &dma_handle);
|
||||
|
||||
where dev is a struct pci_dev *. You should pass NULL for PCI like buses
|
||||
where devices don't have struct pci_dev (like ISA, EISA). This may be
|
||||
called in interrupt context.
|
||||
|
||||
This argument is needed because the DMA translations may be bus
|
||||
specific (and often is private to the bus which the device is attached
|
||||
to).
|
||||
|
||||
Size is the length of the region you want to allocate, in bytes.
|
||||
|
||||
This routine will allocate RAM for that region, so it acts similarly to
|
||||
__get_free_pages (but takes size instead of a page order). If your
|
||||
driver needs regions sized smaller than a page, you may prefer using
|
||||
the pci_pool interface, described below.
|
||||
|
||||
The consistent DMA mapping interfaces, for non-NULL dev, will by
|
||||
default return a DMA address which is SAC (Single Address Cycle)
|
||||
addressable. Even if the device indicates (via PCI dma mask) that it
|
||||
may address the upper 32-bits and thus perform DAC cycles, consistent
|
||||
allocation will only return > 32-bit PCI addresses for DMA if the
|
||||
consistent dma mask has been explicitly changed via
|
||||
pci_set_consistent_dma_mask(). This is true of the pci_pool interface
|
||||
as well.
|
||||
|
||||
pci_alloc_consistent returns two values: the virtual address which you
|
||||
can use to access it from the CPU and dma_handle which you pass to the
|
||||
card.
|
||||
|
||||
The cpu return address and the DMA bus master address are both
|
||||
guaranteed to be aligned to the smallest PAGE_SIZE order which
|
||||
is greater than or equal to the requested size. This invariant
|
||||
exists (for example) to guarantee that if you allocate a chunk
|
||||
which is smaller than or equal to 64 kilobytes, the extent of the
|
||||
buffer you receive will not cross a 64K boundary.
|
||||
|
||||
To unmap and free such a DMA region, you call:
|
||||
|
||||
pci_free_consistent(dev, size, cpu_addr, dma_handle);
|
||||
|
||||
where dev, size are the same as in the above call and cpu_addr and
|
||||
dma_handle are the values pci_alloc_consistent returned to you.
|
||||
This function may not be called in interrupt context.
|
||||
|
||||
If your driver needs lots of smaller memory regions, you can write
|
||||
custom code to subdivide pages returned by pci_alloc_consistent,
|
||||
or you can use the pci_pool API to do that. A pci_pool is like
|
||||
a kmem_cache, but it uses pci_alloc_consistent not __get_free_pages.
|
||||
Also, it understands common hardware constraints for alignment,
|
||||
like queue heads needing to be aligned on N byte boundaries.
|
||||
|
||||
Create a pci_pool like this:
|
||||
|
||||
struct pci_pool *pool;
|
||||
|
||||
pool = pci_pool_create(name, dev, size, align, alloc);
|
||||
|
||||
The "name" is for diagnostics (like a kmem_cache name); dev and size
|
||||
are as above. The device's hardware alignment requirement for this
|
||||
type of data is "align" (which is expressed in bytes, and must be a
|
||||
power of two). If your device has no boundary crossing restrictions,
|
||||
pass 0 for alloc; passing 4096 says memory allocated from this pool
|
||||
must not cross 4KByte boundaries (but at that time it may be better to
|
||||
go for pci_alloc_consistent directly instead).
|
||||
|
||||
Allocate memory from a pci pool like this:
|
||||
|
||||
cpu_addr = pci_pool_alloc(pool, flags, &dma_handle);
|
||||
|
||||
flags are SLAB_KERNEL if blocking is permitted (not in_interrupt nor
|
||||
holding SMP locks), SLAB_ATOMIC otherwise. Like pci_alloc_consistent,
|
||||
this returns two values, cpu_addr and dma_handle.
|
||||
|
||||
Free memory that was allocated from a pci_pool like this:
|
||||
|
||||
pci_pool_free(pool, cpu_addr, dma_handle);
|
||||
|
||||
where pool is what you passed to pci_pool_alloc, and cpu_addr and
|
||||
dma_handle are the values pci_pool_alloc returned. This function
|
||||
may be called in interrupt context.
|
||||
|
||||
Destroy a pci_pool by calling:
|
||||
|
||||
pci_pool_destroy(pool);
|
||||
|
||||
Make sure you've called pci_pool_free for all memory allocated
|
||||
from a pool before you destroy the pool. This function may not
|
||||
be called in interrupt context.
|
||||
|
||||
DMA Direction
|
||||
|
||||
The interfaces described in subsequent portions of this document
|
||||
take a DMA direction argument, which is an integer and takes on
|
||||
one of the following values:
|
||||
|
||||
PCI_DMA_BIDIRECTIONAL
|
||||
PCI_DMA_TODEVICE
|
||||
PCI_DMA_FROMDEVICE
|
||||
PCI_DMA_NONE
|
||||
|
||||
One should provide the exact DMA direction if you know it.
|
||||
|
||||
PCI_DMA_TODEVICE means "from main memory to the PCI device"
|
||||
PCI_DMA_FROMDEVICE means "from the PCI device to main memory"
|
||||
It is the direction in which the data moves during the DMA
|
||||
transfer.
|
||||
|
||||
You are _strongly_ encouraged to specify this as precisely
|
||||
as you possibly can.
|
||||
|
||||
If you absolutely cannot know the direction of the DMA transfer,
|
||||
specify PCI_DMA_BIDIRECTIONAL. It means that the DMA can go in
|
||||
either direction. The platform guarantees that you may legally
|
||||
specify this, and that it will work, but this may be at the
|
||||
cost of performance for example.
|
||||
|
||||
The value PCI_DMA_NONE is to be used for debugging. One can
|
||||
hold this in a data structure before you come to know the
|
||||
precise direction, and this will help catch cases where your
|
||||
direction tracking logic has failed to set things up properly.
|
||||
|
||||
Another advantage of specifying this value precisely (outside of
|
||||
potential platform-specific optimizations of such) is for debugging.
|
||||
Some platforms actually have a write permission boolean which DMA
|
||||
mappings can be marked with, much like page protections in the user
|
||||
program address space. Such platforms can and do report errors in the
|
||||
kernel logs when the PCI controller hardware detects violation of the
|
||||
permission setting.
|
||||
|
||||
Only streaming mappings specify a direction, consistent mappings
|
||||
implicitly have a direction attribute setting of
|
||||
PCI_DMA_BIDIRECTIONAL.
|
||||
|
||||
The SCSI subsystem tells you the direction to use in the
|
||||
'sc_data_direction' member of the SCSI command your driver is
|
||||
working on.
|
||||
|
||||
For Networking drivers, it's a rather simple affair. For transmit
|
||||
packets, map/unmap them with the PCI_DMA_TODEVICE direction
|
||||
specifier. For receive packets, just the opposite, map/unmap them
|
||||
with the PCI_DMA_FROMDEVICE direction specifier.
|
||||
|
||||
Using Streaming DMA mappings
|
||||
|
||||
The streaming DMA mapping routines can be called from interrupt
|
||||
context. There are two versions of each map/unmap, one which will
|
||||
map/unmap a single memory region, and one which will map/unmap a
|
||||
scatterlist.
|
||||
|
||||
To map a single region, you do:
|
||||
|
||||
struct pci_dev *pdev = mydev->pdev;
|
||||
dma_addr_t dma_handle;
|
||||
void *addr = buffer->ptr;
|
||||
size_t size = buffer->len;
|
||||
|
||||
dma_handle = pci_map_single(dev, addr, size, direction);
|
||||
|
||||
and to unmap it:
|
||||
|
||||
pci_unmap_single(dev, dma_handle, size, direction);
|
||||
|
||||
You should call pci_unmap_single when the DMA activity is finished, e.g.
|
||||
from the interrupt which told you that the DMA transfer is done.
|
||||
|
||||
Using cpu pointers like this for single mappings has a disadvantage,
|
||||
you cannot reference HIGHMEM memory in this way. Thus, there is a
|
||||
map/unmap interface pair akin to pci_{map,unmap}_single. These
|
||||
interfaces deal with page/offset pairs instead of cpu pointers.
|
||||
Specifically:
|
||||
|
||||
struct pci_dev *pdev = mydev->pdev;
|
||||
dma_addr_t dma_handle;
|
||||
struct page *page = buffer->page;
|
||||
unsigned long offset = buffer->offset;
|
||||
size_t size = buffer->len;
|
||||
|
||||
dma_handle = pci_map_page(dev, page, offset, size, direction);
|
||||
|
||||
...
|
||||
|
||||
pci_unmap_page(dev, dma_handle, size, direction);
|
||||
|
||||
Here, "offset" means byte offset within the given page.
|
||||
|
||||
With scatterlists, you map a region gathered from several regions by:
|
||||
|
||||
int i, count = pci_map_sg(dev, sglist, nents, direction);
|
||||
struct scatterlist *sg;
|
||||
|
||||
for (i = 0, sg = sglist; i < count; i++, sg++) {
|
||||
hw_address[i] = sg_dma_address(sg);
|
||||
hw_len[i] = sg_dma_len(sg);
|
||||
}
|
||||
|
||||
where nents is the number of entries in the sglist.
|
||||
|
||||
The implementation is free to merge several consecutive sglist entries
|
||||
into one (e.g. if DMA mapping is done with PAGE_SIZE granularity, any
|
||||
consecutive sglist entries can be merged into one provided the first one
|
||||
ends and the second one starts on a page boundary - in fact this is a huge
|
||||
advantage for cards which either cannot do scatter-gather or have very
|
||||
limited number of scatter-gather entries) and returns the actual number
|
||||
of sg entries it mapped them to. On failure 0 is returned.
|
||||
|
||||
Then you should loop count times (note: this can be less than nents times)
|
||||
and use sg_dma_address() and sg_dma_len() macros where you previously
|
||||
accessed sg->address and sg->length as shown above.
|
||||
|
||||
To unmap a scatterlist, just call:
|
||||
|
||||
pci_unmap_sg(dev, sglist, nents, direction);
|
||||
|
||||
Again, make sure DMA activity has already finished.
|
||||
|
||||
PLEASE NOTE: The 'nents' argument to the pci_unmap_sg call must be
|
||||
the _same_ one you passed into the pci_map_sg call,
|
||||
it should _NOT_ be the 'count' value _returned_ from the
|
||||
pci_map_sg call.
|
||||
|
||||
Every pci_map_{single,sg} call should have its pci_unmap_{single,sg}
|
||||
counterpart, because the bus address space is a shared resource (although
|
||||
in some ports the mapping is per each BUS so less devices contend for the
|
||||
same bus address space) and you could render the machine unusable by eating
|
||||
all bus addresses.
|
||||
|
||||
If you need to use the same streaming DMA region multiple times and touch
|
||||
the data in between the DMA transfers, the buffer needs to be synced
|
||||
properly in order for the cpu and device to see the most uptodate and
|
||||
correct copy of the DMA buffer.
|
||||
|
||||
So, firstly, just map it with pci_map_{single,sg}, and after each DMA
|
||||
transfer call either:
|
||||
|
||||
pci_dma_sync_single_for_cpu(dev, dma_handle, size, direction);
|
||||
|
||||
or:
|
||||
|
||||
pci_dma_sync_sg_for_cpu(dev, sglist, nents, direction);
|
||||
|
||||
as appropriate.
|
||||
|
||||
Then, if you wish to let the device get at the DMA area again,
|
||||
finish accessing the data with the cpu, and then before actually
|
||||
giving the buffer to the hardware call either:
|
||||
|
||||
pci_dma_sync_single_for_device(dev, dma_handle, size, direction);
|
||||
|
||||
or:
|
||||
|
||||
pci_dma_sync_sg_for_device(dev, sglist, nents, direction);
|
||||
|
||||
as appropriate.
|
||||
|
||||
After the last DMA transfer call one of the DMA unmap routines
|
||||
pci_unmap_{single,sg}. If you don't touch the data from the first pci_map_*
|
||||
call till pci_unmap_*, then you don't have to call the pci_dma_sync_*
|
||||
routines at all.
|
||||
|
||||
Here is pseudo code which shows a situation in which you would need
|
||||
to use the pci_dma_sync_*() interfaces.
|
||||
|
||||
my_card_setup_receive_buffer(struct my_card *cp, char *buffer, int len)
|
||||
{
|
||||
dma_addr_t mapping;
|
||||
|
||||
mapping = pci_map_single(cp->pdev, buffer, len, PCI_DMA_FROMDEVICE);
|
||||
|
||||
cp->rx_buf = buffer;
|
||||
cp->rx_len = len;
|
||||
cp->rx_dma = mapping;
|
||||
|
||||
give_rx_buf_to_card(cp);
|
||||
}
|
||||
|
||||
...
|
||||
|
||||
my_card_interrupt_handler(int irq, void *devid, struct pt_regs *regs)
|
||||
{
|
||||
struct my_card *cp = devid;
|
||||
|
||||
...
|
||||
if (read_card_status(cp) == RX_BUF_TRANSFERRED) {
|
||||
struct my_card_header *hp;
|
||||
|
||||
/* Examine the header to see if we wish
|
||||
* to accept the data. But synchronize
|
||||
* the DMA transfer with the CPU first
|
||||
* so that we see updated contents.
|
||||
*/
|
||||
pci_dma_sync_single_for_cpu(cp->pdev, cp->rx_dma,
|
||||
cp->rx_len,
|
||||
PCI_DMA_FROMDEVICE);
|
||||
|
||||
/* Now it is safe to examine the buffer. */
|
||||
hp = (struct my_card_header *) cp->rx_buf;
|
||||
if (header_is_ok(hp)) {
|
||||
pci_unmap_single(cp->pdev, cp->rx_dma, cp->rx_len,
|
||||
PCI_DMA_FROMDEVICE);
|
||||
pass_to_upper_layers(cp->rx_buf);
|
||||
make_and_setup_new_rx_buf(cp);
|
||||
} else {
|
||||
/* Just sync the buffer and give it back
|
||||
* to the card.
|
||||
*/
|
||||
pci_dma_sync_single_for_device(cp->pdev,
|
||||
cp->rx_dma,
|
||||
cp->rx_len,
|
||||
PCI_DMA_FROMDEVICE);
|
||||
give_rx_buf_to_card(cp);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
Drivers converted fully to this interface should not use virt_to_bus any
|
||||
longer, nor should they use bus_to_virt. Some drivers have to be changed a
|
||||
little bit, because there is no longer an equivalent to bus_to_virt in the
|
||||
dynamic DMA mapping scheme - you have to always store the DMA addresses
|
||||
returned by the pci_alloc_consistent, pci_pool_alloc, and pci_map_single
|
||||
calls (pci_map_sg stores them in the scatterlist itself if the platform
|
||||
supports dynamic DMA mapping in hardware) in your driver structures and/or
|
||||
in the card registers.
|
||||
|
||||
All PCI drivers should be using these interfaces with no exceptions.
|
||||
It is planned to completely remove virt_to_bus() and bus_to_virt() as
|
||||
they are entirely deprecated. Some ports already do not provide these
|
||||
as it is impossible to correctly support them.
|
||||
|
||||
64-bit DMA and DAC cycle support
|
||||
|
||||
Do you understand all of the text above? Great, then you already
|
||||
know how to use 64-bit DMA addressing under Linux. Simply make
|
||||
the appropriate pci_set_dma_mask() calls based upon your cards
|
||||
capabilities, then use the mapping APIs above.
|
||||
|
||||
It is that simple.
|
||||
|
||||
Well, not for some odd devices. See the next section for information
|
||||
about that.
|
||||
|
||||
DAC Addressing for Address Space Hungry Devices
|
||||
|
||||
There exists a class of devices which do not mesh well with the PCI
|
||||
DMA mapping API. By definition these "mappings" are a finite
|
||||
resource. The number of total available mappings per bus is platform
|
||||
specific, but there will always be a reasonable amount.
|
||||
|
||||
What is "reasonable"? Reasonable means that networking and block I/O
|
||||
devices need not worry about using too many mappings.
|
||||
|
||||
As an example of a problematic device, consider compute cluster cards.
|
||||
They can potentially need to access gigabytes of memory at once via
|
||||
DMA. Dynamic mappings are unsuitable for this kind of access pattern.
|
||||
|
||||
To this end we've provided a small API by which a device driver
|
||||
may use DAC cycles to directly address all of physical memory.
|
||||
Not all platforms support this, but most do. It is easy to determine
|
||||
whether the platform will work properly at probe time.
|
||||
|
||||
First, understand that there may be a SEVERE performance penalty for
|
||||
using these interfaces on some platforms. Therefore, you MUST only
|
||||
use these interfaces if it is absolutely required. %99 of devices can
|
||||
use the normal APIs without any problems.
|
||||
|
||||
Note that for streaming type mappings you must either use these
|
||||
interfaces, or the dynamic mapping interfaces above. You may not mix
|
||||
usage of both for the same device. Such an act is illegal and is
|
||||
guaranteed to put a banana in your tailpipe.
|
||||
|
||||
However, consistent mappings may in fact be used in conjunction with
|
||||
these interfaces. Remember that, as defined, consistent mappings are
|
||||
always going to be SAC addressable.
|
||||
|
||||
The first thing your driver needs to do is query the PCI platform
|
||||
layer if it is capable of handling your devices DAC addressing
|
||||
capabilities:
|
||||
|
||||
int pci_dac_dma_supported(struct pci_dev *hwdev, u64 mask);
|
||||
|
||||
You may not use the following interfaces if this routine fails.
|
||||
|
||||
Next, DMA addresses using this API are kept track of using the
|
||||
dma64_addr_t type. It is guaranteed to be big enough to hold any
|
||||
DAC address the platform layer will give to you from the following
|
||||
routines. If you have consistent mappings as well, you still
|
||||
use plain dma_addr_t to keep track of those.
|
||||
|
||||
All mappings obtained here will be direct. The mappings are not
|
||||
translated, and this is the purpose of this dialect of the DMA API.
|
||||
|
||||
All routines work with page/offset pairs. This is the _ONLY_ way to
|
||||
portably refer to any piece of memory. If you have a cpu pointer
|
||||
(which may be validly DMA'd too) you may easily obtain the page
|
||||
and offset using something like this:
|
||||
|
||||
struct page *page = virt_to_page(ptr);
|
||||
unsigned long offset = offset_in_page(ptr);
|
||||
|
||||
Here are the interfaces:
|
||||
|
||||
dma64_addr_t pci_dac_page_to_dma(struct pci_dev *pdev,
|
||||
struct page *page,
|
||||
unsigned long offset,
|
||||
int direction);
|
||||
|
||||
The DAC address for the tuple PAGE/OFFSET are returned. The direction
|
||||
argument is the same as for pci_{map,unmap}_single(). The same rules
|
||||
for cpu/device access apply here as for the streaming mapping
|
||||
interfaces. To reiterate:
|
||||
|
||||
The cpu may touch the buffer before pci_dac_page_to_dma.
|
||||
The device may touch the buffer after pci_dac_page_to_dma
|
||||
is made, but the cpu may NOT.
|
||||
|
||||
When the DMA transfer is complete, invoke:
|
||||
|
||||
void pci_dac_dma_sync_single_for_cpu(struct pci_dev *pdev,
|
||||
dma64_addr_t dma_addr,
|
||||
size_t len, int direction);
|
||||
|
||||
This must be done before the CPU looks at the buffer again.
|
||||
This interface behaves identically to pci_dma_sync_{single,sg}_for_cpu().
|
||||
|
||||
And likewise, if you wish to let the device get back at the buffer after
|
||||
the cpu has read/written it, invoke:
|
||||
|
||||
void pci_dac_dma_sync_single_for_device(struct pci_dev *pdev,
|
||||
dma64_addr_t dma_addr,
|
||||
size_t len, int direction);
|
||||
|
||||
before letting the device access the DMA area again.
|
||||
|
||||
If you need to get back to the PAGE/OFFSET tuple from a dma64_addr_t
|
||||
the following interfaces are provided:
|
||||
|
||||
struct page *pci_dac_dma_to_page(struct pci_dev *pdev,
|
||||
dma64_addr_t dma_addr);
|
||||
unsigned long pci_dac_dma_to_offset(struct pci_dev *pdev,
|
||||
dma64_addr_t dma_addr);
|
||||
|
||||
This is possible with the DAC interfaces purely because they are
|
||||
not translated in any way.
|
||||
|
||||
Optimizing Unmap State Space Consumption
|
||||
|
||||
On many platforms, pci_unmap_{single,page}() is simply a nop.
|
||||
Therefore, keeping track of the mapping address and length is a waste
|
||||
of space. Instead of filling your drivers up with ifdefs and the like
|
||||
to "work around" this (which would defeat the whole purpose of a
|
||||
portable API) the following facilities are provided.
|
||||
|
||||
Actually, instead of describing the macros one by one, we'll
|
||||
transform some example code.
|
||||
|
||||
1) Use DECLARE_PCI_UNMAP_{ADDR,LEN} in state saving structures.
|
||||
Example, before:
|
||||
|
||||
struct ring_state {
|
||||
struct sk_buff *skb;
|
||||
dma_addr_t mapping;
|
||||
__u32 len;
|
||||
};
|
||||
|
||||
after:
|
||||
|
||||
struct ring_state {
|
||||
struct sk_buff *skb;
|
||||
DECLARE_PCI_UNMAP_ADDR(mapping)
|
||||
DECLARE_PCI_UNMAP_LEN(len)
|
||||
};
|
||||
|
||||
NOTE: DO NOT put a semicolon at the end of the DECLARE_*()
|
||||
macro.
|
||||
|
||||
2) Use pci_unmap_{addr,len}_set to set these values.
|
||||
Example, before:
|
||||
|
||||
ringp->mapping = FOO;
|
||||
ringp->len = BAR;
|
||||
|
||||
after:
|
||||
|
||||
pci_unmap_addr_set(ringp, mapping, FOO);
|
||||
pci_unmap_len_set(ringp, len, BAR);
|
||||
|
||||
3) Use pci_unmap_{addr,len} to access these values.
|
||||
Example, before:
|
||||
|
||||
pci_unmap_single(pdev, ringp->mapping, ringp->len,
|
||||
PCI_DMA_FROMDEVICE);
|
||||
|
||||
after:
|
||||
|
||||
pci_unmap_single(pdev,
|
||||
pci_unmap_addr(ringp, mapping),
|
||||
pci_unmap_len(ringp, len),
|
||||
PCI_DMA_FROMDEVICE);
|
||||
|
||||
It really should be self-explanatory. We treat the ADDR and LEN
|
||||
separately, because it is possible for an implementation to only
|
||||
need the address in order to perform the unmap operation.
|
||||
|
||||
Platform Issues
|
||||
|
||||
If you are just writing drivers for Linux and do not maintain
|
||||
an architecture port for the kernel, you can safely skip down
|
||||
to "Closing".
|
||||
|
||||
1) Struct scatterlist requirements.
|
||||
|
||||
Struct scatterlist must contain, at a minimum, the following
|
||||
members:
|
||||
|
||||
struct page *page;
|
||||
unsigned int offset;
|
||||
unsigned int length;
|
||||
|
||||
The base address is specified by a "page+offset" pair.
|
||||
|
||||
Previous versions of struct scatterlist contained a "void *address"
|
||||
field that was sometimes used instead of page+offset. As of Linux
|
||||
2.5., page+offset is always used, and the "address" field has been
|
||||
deleted.
|
||||
|
||||
2) More to come...
|
||||
|
||||
Handling Errors
|
||||
|
||||
DMA address space is limited on some architectures and an allocation
|
||||
failure can be determined by:
|
||||
|
||||
- checking if pci_alloc_consistent returns NULL or pci_map_sg returns 0
|
||||
|
||||
- checking the returned dma_addr_t of pci_map_single and pci_map_page
|
||||
by using pci_dma_mapping_error():
|
||||
|
||||
dma_addr_t dma_handle;
|
||||
|
||||
dma_handle = pci_map_single(dev, addr, size, direction);
|
||||
if (pci_dma_mapping_error(dma_handle)) {
|
||||
/*
|
||||
* reduce current DMA mapping usage,
|
||||
* delay and try again later or
|
||||
* reset driver.
|
||||
*/
|
||||
}
|
||||
|
||||
Closing
|
||||
|
||||
This document, and the API itself, would not be in it's current
|
||||
form without the feedback and suggestions from numerous individuals.
|
||||
We would like to specifically mention, in no particular order, the
|
||||
following people:
|
||||
|
||||
Russell King <rmk@arm.linux.org.uk>
|
||||
Leo Dagum <dagum@barrel.engr.sgi.com>
|
||||
Ralf Baechle <ralf@oss.sgi.com>
|
||||
Grant Grundler <grundler@cup.hp.com>
|
||||
Jay Estabrook <Jay.Estabrook@compaq.com>
|
||||
Thomas Sailer <sailer@ife.ee.ethz.ch>
|
||||
Andrea Arcangeli <andrea@suse.de>
|
||||
Jens Axboe <axboe@suse.de>
|
||||
David Mosberger-Tang <davidm@hpl.hp.com>
|
||||
224
Documentation/DocBook/Makefile
Normal file
224
Documentation/DocBook/Makefile
Normal file
@ -0,0 +1,224 @@
|
||||
###
|
||||
# This makefile is used to generate the kernel documentation,
|
||||
# primarily based on in-line comments in various source files.
|
||||
# See Documentation/kernel-doc-nano-HOWTO.txt for instruction in how
|
||||
# to document the SRC - and how to read it.
|
||||
# To add a new book the only step required is to add the book to the
|
||||
# list of DOCBOOKS.
|
||||
|
||||
DOCBOOKS := wanbook.xml z8530book.xml mcabook.xml videobook.xml \
|
||||
kernel-hacking.xml kernel-locking.xml deviceiobook.xml \
|
||||
procfs-guide.xml writing_usb_driver.xml \
|
||||
kernel-api.xml filesystems.xml lsm.xml usb.xml \
|
||||
gadget.xml libata.xml mtdnand.xml librs.xml rapidio.xml \
|
||||
genericirq.xml
|
||||
|
||||
###
|
||||
# The build process is as follows (targets):
|
||||
# (xmldocs)
|
||||
# file.tmpl --> file.xml +--> file.ps (psdocs)
|
||||
# +--> file.pdf (pdfdocs)
|
||||
# +--> DIR=file (htmldocs)
|
||||
# +--> man/ (mandocs)
|
||||
|
||||
|
||||
# for PDF and PS output you can choose between xmlto and docbook-utils tools
|
||||
PDF_METHOD = $(prefer-db2x)
|
||||
PS_METHOD = $(prefer-db2x)
|
||||
|
||||
|
||||
###
|
||||
# The targets that may be used.
|
||||
PHONY += xmldocs sgmldocs psdocs pdfdocs htmldocs mandocs installmandocs
|
||||
|
||||
BOOKS := $(addprefix $(obj)/,$(DOCBOOKS))
|
||||
xmldocs: $(BOOKS)
|
||||
sgmldocs: xmldocs
|
||||
|
||||
PS := $(patsubst %.xml, %.ps, $(BOOKS))
|
||||
psdocs: $(PS)
|
||||
|
||||
PDF := $(patsubst %.xml, %.pdf, $(BOOKS))
|
||||
pdfdocs: $(PDF)
|
||||
|
||||
HTML := $(patsubst %.xml, %.html, $(BOOKS))
|
||||
htmldocs: $(HTML)
|
||||
|
||||
MAN := $(patsubst %.xml, %.9, $(BOOKS))
|
||||
mandocs: $(MAN)
|
||||
|
||||
installmandocs: mandocs
|
||||
mkdir -p /usr/local/man/man9/
|
||||
install Documentation/DocBook/man/*.9.gz /usr/local/man/man9/
|
||||
|
||||
###
|
||||
#External programs used
|
||||
KERNELDOC = $(srctree)/scripts/kernel-doc
|
||||
DOCPROC = $(objtree)/scripts/basic/docproc
|
||||
|
||||
XMLTOFLAGS = -m $(srctree)/Documentation/DocBook/stylesheet.xsl
|
||||
#XMLTOFLAGS += --skip-validation
|
||||
|
||||
###
|
||||
# DOCPROC is used for two purposes:
|
||||
# 1) To generate a dependency list for a .tmpl file
|
||||
# 2) To preprocess a .tmpl file and call kernel-doc with
|
||||
# appropriate parameters.
|
||||
# The following rules are used to generate the .xml documentation
|
||||
# required to generate the final targets. (ps, pdf, html).
|
||||
quiet_cmd_docproc = DOCPROC $@
|
||||
cmd_docproc = SRCTREE=$(srctree)/ $(DOCPROC) doc $< >$@
|
||||
define rule_docproc
|
||||
set -e; \
|
||||
$(if $($(quiet)cmd_$(1)),echo ' $($(quiet)cmd_$(1))';) \
|
||||
$(cmd_$(1)); \
|
||||
( \
|
||||
echo 'cmd_$@ := $(cmd_$(1))'; \
|
||||
echo $@: `SRCTREE=$(srctree) $(DOCPROC) depend $<`; \
|
||||
) > $(dir $@).$(notdir $@).cmd
|
||||
endef
|
||||
|
||||
%.xml: %.tmpl FORCE
|
||||
$(call if_changed_rule,docproc)
|
||||
|
||||
###
|
||||
#Read in all saved dependency files
|
||||
cmd_files := $(wildcard $(foreach f,$(BOOKS),$(dir $(f)).$(notdir $(f)).cmd))
|
||||
|
||||
ifneq ($(cmd_files),)
|
||||
include $(cmd_files)
|
||||
endif
|
||||
|
||||
###
|
||||
# Changes in kernel-doc force a rebuild of all documentation
|
||||
$(BOOKS): $(KERNELDOC)
|
||||
|
||||
###
|
||||
# procfs guide uses a .c file as example code.
|
||||
# This requires an explicit dependency
|
||||
C-procfs-example = procfs_example.xml
|
||||
C-procfs-example2 = $(addprefix $(obj)/,$(C-procfs-example))
|
||||
$(obj)/procfs-guide.xml: $(C-procfs-example2)
|
||||
|
||||
notfoundtemplate = echo "*** You have to install docbook-utils or xmlto ***"; \
|
||||
exit 1
|
||||
db2xtemplate = db2TYPE -o $(dir $@) $<
|
||||
xmltotemplate = xmlto TYPE $(XMLTOFLAGS) -o $(dir $@) $<
|
||||
|
||||
# determine which methods are available
|
||||
ifeq ($(shell which db2ps >/dev/null 2>&1 && echo found),found)
|
||||
use-db2x = db2x
|
||||
prefer-db2x = db2x
|
||||
else
|
||||
use-db2x = notfound
|
||||
prefer-db2x = $(use-xmlto)
|
||||
endif
|
||||
ifeq ($(shell which xmlto >/dev/null 2>&1 && echo found),found)
|
||||
use-xmlto = xmlto
|
||||
prefer-xmlto = xmlto
|
||||
else
|
||||
use-xmlto = notfound
|
||||
prefer-xmlto = $(use-db2x)
|
||||
endif
|
||||
|
||||
# the commands, generated from the chosen template
|
||||
quiet_cmd_db2ps = PS $@
|
||||
cmd_db2ps = $(subst TYPE,ps, $($(PS_METHOD)template))
|
||||
%.ps : %.xml
|
||||
$(call cmd,db2ps)
|
||||
|
||||
quiet_cmd_db2pdf = PDF $@
|
||||
cmd_db2pdf = $(subst TYPE,pdf, $($(PDF_METHOD)template))
|
||||
%.pdf : %.xml
|
||||
$(call cmd,db2pdf)
|
||||
|
||||
quiet_cmd_db2html = HTML $@
|
||||
cmd_db2html = xmlto xhtml $(XMLTOFLAGS) -o $(patsubst %.html,%,$@) $< && \
|
||||
echo '<a HREF="$(patsubst %.html,%,$(notdir $@))/index.html"> \
|
||||
Goto $(patsubst %.html,%,$(notdir $@))</a><p>' > $@
|
||||
|
||||
%.html: %.xml
|
||||
@(which xmlto > /dev/null 2>&1) || \
|
||||
(echo "*** You need to install xmlto ***"; \
|
||||
exit 1)
|
||||
@rm -rf $@ $(patsubst %.html,%,$@)
|
||||
$(call cmd,db2html)
|
||||
@if [ ! -z "$(PNG-$(basename $(notdir $@)))" ]; then \
|
||||
cp $(PNG-$(basename $(notdir $@))) $(patsubst %.html,%,$@); fi
|
||||
|
||||
quiet_cmd_db2man = MAN $@
|
||||
cmd_db2man = if grep -q refentry $<; then xmlto man $(XMLTOFLAGS) -o $(obj)/man $< ; gzip -f $(obj)/man/*.9; fi
|
||||
%.9 : %.xml
|
||||
@(which xmlto > /dev/null 2>&1) || \
|
||||
(echo "*** You need to install xmlto ***"; \
|
||||
exit 1)
|
||||
$(call cmd,db2man)
|
||||
@touch $@
|
||||
|
||||
###
|
||||
# Rules to generate postscripts and PNG imgages from .fig format files
|
||||
quiet_cmd_fig2eps = FIG2EPS $@
|
||||
cmd_fig2eps = fig2dev -Leps $< $@
|
||||
|
||||
%.eps: %.fig
|
||||
@(which fig2dev > /dev/null 2>&1) || \
|
||||
(echo "*** You need to install transfig ***"; \
|
||||
exit 1)
|
||||
$(call cmd,fig2eps)
|
||||
|
||||
quiet_cmd_fig2png = FIG2PNG $@
|
||||
cmd_fig2png = fig2dev -Lpng $< $@
|
||||
|
||||
%.png: %.fig
|
||||
@(which fig2dev > /dev/null 2>&1) || \
|
||||
(echo "*** You need to install transfig ***"; \
|
||||
exit 1)
|
||||
$(call cmd,fig2png)
|
||||
|
||||
###
|
||||
# Rule to convert a .c file to inline XML documentation
|
||||
%.xml: %.c
|
||||
@echo ' GEN $@'
|
||||
@( \
|
||||
echo "<programlisting>"; \
|
||||
expand --tabs=8 < $< | \
|
||||
sed -e "s/&/\\&/g" \
|
||||
-e "s/</\\</g" \
|
||||
-e "s/>/\\>/g"; \
|
||||
echo "</programlisting>") > $@
|
||||
|
||||
###
|
||||
# Help targets as used by the top-level makefile
|
||||
dochelp:
|
||||
@echo ' Linux kernel internal documentation in different formats:'
|
||||
@echo ' htmldocs - HTML'
|
||||
@echo ' installmandocs - install man pages generated by mandocs'
|
||||
@echo ' mandocs - man pages'
|
||||
@echo ' pdfdocs - PDF'
|
||||
@echo ' psdocs - Postscript'
|
||||
@echo ' xmldocs - XML DocBook'
|
||||
|
||||
###
|
||||
# Temporary files left by various tools
|
||||
clean-files := $(DOCBOOKS) \
|
||||
$(patsubst %.xml, %.dvi, $(DOCBOOKS)) \
|
||||
$(patsubst %.xml, %.aux, $(DOCBOOKS)) \
|
||||
$(patsubst %.xml, %.tex, $(DOCBOOKS)) \
|
||||
$(patsubst %.xml, %.log, $(DOCBOOKS)) \
|
||||
$(patsubst %.xml, %.out, $(DOCBOOKS)) \
|
||||
$(patsubst %.xml, %.ps, $(DOCBOOKS)) \
|
||||
$(patsubst %.xml, %.pdf, $(DOCBOOKS)) \
|
||||
$(patsubst %.xml, %.html, $(DOCBOOKS)) \
|
||||
$(patsubst %.xml, %.9, $(DOCBOOKS)) \
|
||||
$(C-procfs-example)
|
||||
|
||||
clean-dirs := $(patsubst %.xml,%,$(DOCBOOKS))
|
||||
|
||||
#man put files in man subdir - traverse down
|
||||
subdir- := man/
|
||||
|
||||
|
||||
# Declare the contents of the .PHONY variable as phony. We keep that
|
||||
# information in a variable se we can use it in if_changed and friends.
|
||||
|
||||
.PHONY: $(PHONY)
|
||||
322
Documentation/DocBook/deviceiobook.tmpl
Normal file
322
Documentation/DocBook/deviceiobook.tmpl
Normal file
@ -0,0 +1,322 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
|
||||
"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []>
|
||||
|
||||
<book id="DoingIO">
|
||||
<bookinfo>
|
||||
<title>Bus-Independent Device Accesses</title>
|
||||
|
||||
<authorgroup>
|
||||
<author>
|
||||
<firstname>Matthew</firstname>
|
||||
<surname>Wilcox</surname>
|
||||
<affiliation>
|
||||
<address>
|
||||
<email>matthew@wil.cx</email>
|
||||
</address>
|
||||
</affiliation>
|
||||
</author>
|
||||
</authorgroup>
|
||||
|
||||
<authorgroup>
|
||||
<author>
|
||||
<firstname>Alan</firstname>
|
||||
<surname>Cox</surname>
|
||||
<affiliation>
|
||||
<address>
|
||||
<email>alan@redhat.com</email>
|
||||
</address>
|
||||
</affiliation>
|
||||
</author>
|
||||
</authorgroup>
|
||||
|
||||
<copyright>
|
||||
<year>2001</year>
|
||||
<holder>Matthew Wilcox</holder>
|
||||
</copyright>
|
||||
|
||||
<legalnotice>
|
||||
<para>
|
||||
This documentation is free software; you can redistribute
|
||||
it and/or modify it under the terms of the GNU General Public
|
||||
License as published by the Free Software Foundation; either
|
||||
version 2 of the License, or (at your option) any later
|
||||
version.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
This program is distributed in the hope that it will be
|
||||
useful, but WITHOUT ANY WARRANTY; without even the implied
|
||||
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
|
||||
See the GNU General Public License for more details.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
You should have received a copy of the GNU General Public
|
||||
License along with this program; if not, write to the Free
|
||||
Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
|
||||
MA 02111-1307 USA
|
||||
</para>
|
||||
|
||||
<para>
|
||||
For more details see the file COPYING in the source
|
||||
distribution of Linux.
|
||||
</para>
|
||||
</legalnotice>
|
||||
</bookinfo>
|
||||
|
||||
<toc></toc>
|
||||
|
||||
<chapter id="intro">
|
||||
<title>Introduction</title>
|
||||
<para>
|
||||
Linux provides an API which abstracts performing IO across all busses
|
||||
and devices, allowing device drivers to be written independently of
|
||||
bus type.
|
||||
</para>
|
||||
</chapter>
|
||||
|
||||
<chapter id="bugs">
|
||||
<title>Known Bugs And Assumptions</title>
|
||||
<para>
|
||||
None.
|
||||
</para>
|
||||
</chapter>
|
||||
|
||||
<chapter id="mmio">
|
||||
<title>Memory Mapped IO</title>
|
||||
<sect1>
|
||||
<title>Getting Access to the Device</title>
|
||||
<para>
|
||||
The most widely supported form of IO is memory mapped IO.
|
||||
That is, a part of the CPU's address space is interpreted
|
||||
not as accesses to memory, but as accesses to a device. Some
|
||||
architectures define devices to be at a fixed address, but most
|
||||
have some method of discovering devices. The PCI bus walk is a
|
||||
good example of such a scheme. This document does not cover how
|
||||
to receive such an address, but assumes you are starting with one.
|
||||
Physical addresses are of type unsigned long.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
This address should not be used directly. Instead, to get an
|
||||
address suitable for passing to the accessor functions described
|
||||
below, you should call <function>ioremap</function>.
|
||||
An address suitable for accessing the device will be returned to you.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
After you've finished using the device (say, in your module's
|
||||
exit routine), call <function>iounmap</function> in order to return
|
||||
the address space to the kernel. Most architectures allocate new
|
||||
address space each time you call <function>ioremap</function>, and
|
||||
they can run out unless you call <function>iounmap</function>.
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
<sect1>
|
||||
<title>Accessing the device</title>
|
||||
<para>
|
||||
The part of the interface most used by drivers is reading and
|
||||
writing memory-mapped registers on the device. Linux provides
|
||||
interfaces to read and write 8-bit, 16-bit, 32-bit and 64-bit
|
||||
quantities. Due to a historical accident, these are named byte,
|
||||
word, long and quad accesses. Both read and write accesses are
|
||||
supported; there is no prefetch support at this time.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
The functions are named <function>readb</function>,
|
||||
<function>readw</function>, <function>readl</function>,
|
||||
<function>readq</function>, <function>readb_relaxed</function>,
|
||||
<function>readw_relaxed</function>, <function>readl_relaxed</function>,
|
||||
<function>readq_relaxed</function>, <function>writeb</function>,
|
||||
<function>writew</function>, <function>writel</function> and
|
||||
<function>writeq</function>.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Some devices (such as framebuffers) would like to use larger
|
||||
transfers than 8 bytes at a time. For these devices, the
|
||||
<function>memcpy_toio</function>, <function>memcpy_fromio</function>
|
||||
and <function>memset_io</function> functions are provided.
|
||||
Do not use memset or memcpy on IO addresses; they
|
||||
are not guaranteed to copy data in order.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
The read and write functions are defined to be ordered. That is the
|
||||
compiler is not permitted to reorder the I/O sequence. When the
|
||||
ordering can be compiler optimised, you can use <function>
|
||||
__readb</function> and friends to indicate the relaxed ordering. Use
|
||||
this with care.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
While the basic functions are defined to be synchronous with respect
|
||||
to each other and ordered with respect to each other the busses the
|
||||
devices sit on may themselves have asynchronicity. In particular many
|
||||
authors are burned by the fact that PCI bus writes are posted
|
||||
asynchronously. A driver author must issue a read from the same
|
||||
device to ensure that writes have occurred in the specific cases the
|
||||
author cares. This kind of property cannot be hidden from driver
|
||||
writers in the API. In some cases, the read used to flush the device
|
||||
may be expected to fail (if the card is resetting, for example). In
|
||||
that case, the read should be done from config space, which is
|
||||
guaranteed to soft-fail if the card doesn't respond.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
The following is an example of flushing a write to a device when
|
||||
the driver would like to ensure the write's effects are visible prior
|
||||
to continuing execution.
|
||||
</para>
|
||||
|
||||
<programlisting>
|
||||
static inline void
|
||||
qla1280_disable_intrs(struct scsi_qla_host *ha)
|
||||
{
|
||||
struct device_reg *reg;
|
||||
|
||||
reg = ha->iobase;
|
||||
/* disable risc and host interrupts */
|
||||
WRT_REG_WORD(&reg->ictrl, 0);
|
||||
/*
|
||||
* The following read will ensure that the above write
|
||||
* has been received by the device before we return from this
|
||||
* function.
|
||||
*/
|
||||
RD_REG_WORD(&reg->ictrl);
|
||||
ha->flags.ints_enabled = 0;
|
||||
}
|
||||
</programlisting>
|
||||
|
||||
<para>
|
||||
In addition to write posting, on some large multiprocessing systems
|
||||
(e.g. SGI Challenge, Origin and Altix machines) posted writes won't
|
||||
be strongly ordered coming from different CPUs. Thus it's important
|
||||
to properly protect parts of your driver that do memory-mapped writes
|
||||
with locks and use the <function>mmiowb</function> to make sure they
|
||||
arrive in the order intended. Issuing a regular <function>readX
|
||||
</function> will also ensure write ordering, but should only be used
|
||||
when the driver has to be sure that the write has actually arrived
|
||||
at the device (not that it's simply ordered with respect to other
|
||||
writes), since a full <function>readX</function> is a relatively
|
||||
expensive operation.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Generally, one should use <function>mmiowb</function> prior to
|
||||
releasing a spinlock that protects regions using <function>writeb
|
||||
</function> or similar functions that aren't surrounded by <function>
|
||||
readb</function> calls, which will ensure ordering and flushing. The
|
||||
following pseudocode illustrates what might occur if write ordering
|
||||
isn't guaranteed via <function>mmiowb</function> or one of the
|
||||
<function>readX</function> functions.
|
||||
</para>
|
||||
|
||||
<programlisting>
|
||||
CPU A: spin_lock_irqsave(&dev_lock, flags)
|
||||
CPU A: ...
|
||||
CPU A: writel(newval, ring_ptr);
|
||||
CPU A: spin_unlock_irqrestore(&dev_lock, flags)
|
||||
...
|
||||
CPU B: spin_lock_irqsave(&dev_lock, flags)
|
||||
CPU B: writel(newval2, ring_ptr);
|
||||
CPU B: ...
|
||||
CPU B: spin_unlock_irqrestore(&dev_lock, flags)
|
||||
</programlisting>
|
||||
|
||||
<para>
|
||||
In the case above, newval2 could be written to ring_ptr before
|
||||
newval. Fixing it is easy though:
|
||||
</para>
|
||||
|
||||
<programlisting>
|
||||
CPU A: spin_lock_irqsave(&dev_lock, flags)
|
||||
CPU A: ...
|
||||
CPU A: writel(newval, ring_ptr);
|
||||
CPU A: mmiowb(); /* ensure no other writes beat us to the device */
|
||||
CPU A: spin_unlock_irqrestore(&dev_lock, flags)
|
||||
...
|
||||
CPU B: spin_lock_irqsave(&dev_lock, flags)
|
||||
CPU B: writel(newval2, ring_ptr);
|
||||
CPU B: ...
|
||||
CPU B: mmiowb();
|
||||
CPU B: spin_unlock_irqrestore(&dev_lock, flags)
|
||||
</programlisting>
|
||||
|
||||
<para>
|
||||
See tg3.c for a real world example of how to use <function>mmiowb
|
||||
</function>
|
||||
</para>
|
||||
|
||||
<para>
|
||||
PCI ordering rules also guarantee that PIO read responses arrive
|
||||
after any outstanding DMA writes from that bus, since for some devices
|
||||
the result of a <function>readb</function> call may signal to the
|
||||
driver that a DMA transaction is complete. In many cases, however,
|
||||
the driver may want to indicate that the next
|
||||
<function>readb</function> call has no relation to any previous DMA
|
||||
writes performed by the device. The driver can use
|
||||
<function>readb_relaxed</function> for these cases, although only
|
||||
some platforms will honor the relaxed semantics. Using the relaxed
|
||||
read functions will provide significant performance benefits on
|
||||
platforms that support it. The qla2xxx driver provides examples
|
||||
of how to use <function>readX_relaxed</function>. In many cases,
|
||||
a majority of the driver's <function>readX</function> calls can
|
||||
safely be converted to <function>readX_relaxed</function> calls, since
|
||||
only a few will indicate or depend on DMA completion.
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
</chapter>
|
||||
|
||||
<chapter>
|
||||
<title>Port Space Accesses</title>
|
||||
<sect1>
|
||||
<title>Port Space Explained</title>
|
||||
|
||||
<para>
|
||||
Another form of IO commonly supported is Port Space. This is a
|
||||
range of addresses separate to the normal memory address space.
|
||||
Access to these addresses is generally not as fast as accesses
|
||||
to the memory mapped addresses, and it also has a potentially
|
||||
smaller address space.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Unlike memory mapped IO, no preparation is required
|
||||
to access port space.
|
||||
</para>
|
||||
|
||||
</sect1>
|
||||
<sect1>
|
||||
<title>Accessing Port Space</title>
|
||||
<para>
|
||||
Accesses to this space are provided through a set of functions
|
||||
which allow 8-bit, 16-bit and 32-bit accesses; also
|
||||
known as byte, word and long. These functions are
|
||||
<function>inb</function>, <function>inw</function>,
|
||||
<function>inl</function>, <function>outb</function>,
|
||||
<function>outw</function> and <function>outl</function>.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Some variants are provided for these functions. Some devices
|
||||
require that accesses to their ports are slowed down. This
|
||||
functionality is provided by appending a <function>_p</function>
|
||||
to the end of the function. There are also equivalents to memcpy.
|
||||
The <function>ins</function> and <function>outs</function>
|
||||
functions copy bytes, words or longs to the given port.
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
</chapter>
|
||||
|
||||
<chapter id="pubfunctions">
|
||||
<title>Public Functions Provided</title>
|
||||
!Einclude/asm-i386/io.h
|
||||
</chapter>
|
||||
|
||||
</book>
|
||||
401
Documentation/DocBook/filesystems.tmpl
Normal file
401
Documentation/DocBook/filesystems.tmpl
Normal file
@ -0,0 +1,401 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
|
||||
"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []>
|
||||
|
||||
<book id="Linux-filesystems-API">
|
||||
<bookinfo>
|
||||
<title>Linux Filesystems API</title>
|
||||
|
||||
<legalnotice>
|
||||
<para>
|
||||
This documentation is free software; you can redistribute
|
||||
it and/or modify it under the terms of the GNU General Public
|
||||
License as published by the Free Software Foundation; either
|
||||
version 2 of the License, or (at your option) any later
|
||||
version.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
This program is distributed in the hope that it will be
|
||||
useful, but WITHOUT ANY WARRANTY; without even the implied
|
||||
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
|
||||
See the GNU General Public License for more details.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
You should have received a copy of the GNU General Public
|
||||
License along with this program; if not, write to the Free
|
||||
Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
|
||||
MA 02111-1307 USA
|
||||
</para>
|
||||
|
||||
<para>
|
||||
For more details see the file COPYING in the source
|
||||
distribution of Linux.
|
||||
</para>
|
||||
</legalnotice>
|
||||
</bookinfo>
|
||||
|
||||
<toc></toc>
|
||||
|
||||
<chapter id="vfs">
|
||||
<title>The Linux VFS</title>
|
||||
<sect1><title>The Filesystem types</title>
|
||||
!Iinclude/linux/fs.h
|
||||
</sect1>
|
||||
<sect1><title>The Directory Cache</title>
|
||||
!Efs/dcache.c
|
||||
!Iinclude/linux/dcache.h
|
||||
</sect1>
|
||||
<sect1><title>Inode Handling</title>
|
||||
!Efs/inode.c
|
||||
!Efs/bad_inode.c
|
||||
</sect1>
|
||||
<sect1><title>Registration and Superblocks</title>
|
||||
!Efs/super.c
|
||||
</sect1>
|
||||
<sect1><title>File Locks</title>
|
||||
!Efs/locks.c
|
||||
!Ifs/locks.c
|
||||
</sect1>
|
||||
<sect1><title>Other Functions</title>
|
||||
!Efs/mpage.c
|
||||
!Efs/namei.c
|
||||
!Efs/buffer.c
|
||||
!Efs/bio.c
|
||||
!Efs/seq_file.c
|
||||
!Efs/filesystems.c
|
||||
!Efs/fs-writeback.c
|
||||
!Efs/block_dev.c
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
||||
<chapter id="proc">
|
||||
<title>The proc filesystem</title>
|
||||
|
||||
<sect1><title>sysctl interface</title>
|
||||
!Ekernel/sysctl.c
|
||||
</sect1>
|
||||
|
||||
<sect1><title>proc filesystem interface</title>
|
||||
!Ifs/proc/base.c
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
||||
<chapter id="sysfs">
|
||||
<title>The Filesystem for Exporting Kernel Objects</title>
|
||||
!Efs/sysfs/file.c
|
||||
!Efs/sysfs/symlink.c
|
||||
!Efs/sysfs/bin.c
|
||||
</chapter>
|
||||
|
||||
<chapter id="debugfs">
|
||||
<title>The debugfs filesystem</title>
|
||||
|
||||
<sect1><title>debugfs interface</title>
|
||||
!Efs/debugfs/inode.c
|
||||
!Efs/debugfs/file.c
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
||||
<chapter id="LinuxJDBAPI">
|
||||
<chapterinfo>
|
||||
<title>The Linux Journalling API</title>
|
||||
|
||||
<authorgroup>
|
||||
<author>
|
||||
<firstname>Roger</firstname>
|
||||
<surname>Gammans</surname>
|
||||
<affiliation>
|
||||
<address>
|
||||
<email>rgammans@computer-surgery.co.uk</email>
|
||||
</address>
|
||||
</affiliation>
|
||||
</author>
|
||||
</authorgroup>
|
||||
|
||||
<authorgroup>
|
||||
<author>
|
||||
<firstname>Stephen</firstname>
|
||||
<surname>Tweedie</surname>
|
||||
<affiliation>
|
||||
<address>
|
||||
<email>sct@redhat.com</email>
|
||||
</address>
|
||||
</affiliation>
|
||||
</author>
|
||||
</authorgroup>
|
||||
|
||||
<copyright>
|
||||
<year>2002</year>
|
||||
<holder>Roger Gammans</holder>
|
||||
</copyright>
|
||||
</chapterinfo>
|
||||
|
||||
<title>The Linux Journalling API</title>
|
||||
|
||||
<sect1>
|
||||
<title>Overview</title>
|
||||
<sect2>
|
||||
<title>Details</title>
|
||||
<para>
|
||||
The journalling layer is easy to use. You need to
|
||||
first of all create a journal_t data structure. There are
|
||||
two calls to do this dependent on how you decide to allocate the physical
|
||||
media on which the journal resides. The journal_init_inode() call
|
||||
is for journals stored in filesystem inodes, or the journal_init_dev()
|
||||
call can be use for journal stored on a raw device (in a continuous range
|
||||
of blocks). A journal_t is a typedef for a struct pointer, so when
|
||||
you are finally finished make sure you call journal_destroy() on it
|
||||
to free up any used kernel memory.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Once you have got your journal_t object you need to 'mount' or load the journal
|
||||
file, unless of course you haven't initialised it yet - in which case you
|
||||
need to call journal_create().
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Most of the time however your journal file will already have been created, but
|
||||
before you load it you must call journal_wipe() to empty the journal file.
|
||||
Hang on, you say , what if the filesystem wasn't cleanly umount()'d . Well, it is the
|
||||
job of the client file system to detect this and skip the call to journal_wipe().
|
||||
</para>
|
||||
|
||||
<para>
|
||||
In either case the next call should be to journal_load() which prepares the
|
||||
journal file for use. Note that journal_wipe(..,0) calls journal_skip_recovery()
|
||||
for you if it detects any outstanding transactions in the journal and similarly
|
||||
journal_load() will call journal_recover() if necessary.
|
||||
I would advise reading fs/ext3/super.c for examples on this stage.
|
||||
[RGG: Why is the journal_wipe() call necessary - doesn't this needlessly
|
||||
complicate the API. Or isn't a good idea for the journal layer to hide
|
||||
dirty mounts from the client fs]
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Now you can go ahead and start modifying the underlying
|
||||
filesystem. Almost.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
|
||||
You still need to actually journal your filesystem changes, this
|
||||
is done by wrapping them into transactions. Additionally you
|
||||
also need to wrap the modification of each of the buffers
|
||||
with calls to the journal layer, so it knows what the modifications
|
||||
you are actually making are. To do this use journal_start() which
|
||||
returns a transaction handle.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
journal_start()
|
||||
and its counterpart journal_stop(), which indicates the end of a transaction
|
||||
are nestable calls, so you can reenter a transaction if necessary,
|
||||
but remember you must call journal_stop() the same number of times as
|
||||
journal_start() before the transaction is completed (or more accurately
|
||||
leaves the update phase). Ext3/VFS makes use of this feature to simplify
|
||||
quota support.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Inside each transaction you need to wrap the modifications to the
|
||||
individual buffers (blocks). Before you start to modify a buffer you
|
||||
need to call journal_get_{create,write,undo}_access() as appropriate,
|
||||
this allows the journalling layer to copy the unmodified data if it
|
||||
needs to. After all the buffer may be part of a previously uncommitted
|
||||
transaction.
|
||||
At this point you are at last ready to modify a buffer, and once
|
||||
you are have done so you need to call journal_dirty_{meta,}data().
|
||||
Or if you've asked for access to a buffer you now know is now longer
|
||||
required to be pushed back on the device you can call journal_forget()
|
||||
in much the same way as you might have used bforget() in the past.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
A journal_flush() may be called at any time to commit and checkpoint
|
||||
all your transactions.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Then at umount time , in your put_super() (2.4) or write_super() (2.5)
|
||||
you can then call journal_destroy() to clean up your in-core journal object.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Unfortunately there a couple of ways the journal layer can cause a deadlock.
|
||||
The first thing to note is that each task can only have
|
||||
a single outstanding transaction at any one time, remember nothing
|
||||
commits until the outermost journal_stop(). This means
|
||||
you must complete the transaction at the end of each file/inode/address
|
||||
etc. operation you perform, so that the journalling system isn't re-entered
|
||||
on another journal. Since transactions can't be nested/batched
|
||||
across differing journals, and another filesystem other than
|
||||
yours (say ext3) may be modified in a later syscall.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
The second case to bear in mind is that journal_start() can
|
||||
block if there isn't enough space in the journal for your transaction
|
||||
(based on the passed nblocks param) - when it blocks it merely(!) needs to
|
||||
wait for transactions to complete and be committed from other tasks,
|
||||
so essentially we are waiting for journal_stop(). So to avoid
|
||||
deadlocks you must treat journal_start/stop() as if they
|
||||
were semaphores and include them in your semaphore ordering rules to prevent
|
||||
deadlocks. Note that journal_extend() has similar blocking behaviour to
|
||||
journal_start() so you can deadlock here just as easily as on journal_start().
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Try to reserve the right number of blocks the first time. ;-). This will
|
||||
be the maximum number of blocks you are going to touch in this transaction.
|
||||
I advise having a look at at least ext3_jbd.h to see the basis on which
|
||||
ext3 uses to make these decisions.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Another wriggle to watch out for is your on-disk block allocation strategy.
|
||||
why? Because, if you undo a delete, you need to ensure you haven't reused any
|
||||
of the freed blocks in a later transaction. One simple way of doing this
|
||||
is make sure any blocks you allocate only have checkpointed transactions
|
||||
listed against them. Ext3 does this in ext3_test_allocatable().
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Lock is also providing through journal_{un,}lock_updates(),
|
||||
ext3 uses this when it wants a window with a clean and stable fs for a moment.
|
||||
eg.
|
||||
</para>
|
||||
|
||||
<programlisting>
|
||||
|
||||
journal_lock_updates() //stop new stuff happening..
|
||||
journal_flush() // checkpoint everything.
|
||||
..do stuff on stable fs
|
||||
journal_unlock_updates() // carry on with filesystem use.
|
||||
</programlisting>
|
||||
|
||||
<para>
|
||||
The opportunities for abuse and DOS attacks with this should be obvious,
|
||||
if you allow unprivileged userspace to trigger codepaths containing these
|
||||
calls.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
A new feature of jbd since 2.5.25 is commit callbacks with the new
|
||||
journal_callback_set() function you can now ask the journalling layer
|
||||
to call you back when the transaction is finally committed to disk, so that
|
||||
you can do some of your own management. The key to this is the journal_callback
|
||||
struct, this maintains the internal callback information but you can
|
||||
extend it like this:-
|
||||
</para>
|
||||
<programlisting>
|
||||
struct myfs_callback_s {
|
||||
//Data structure element required by jbd..
|
||||
struct journal_callback for_jbd;
|
||||
// Stuff for myfs allocated together.
|
||||
myfs_inode* i_commited;
|
||||
|
||||
}
|
||||
</programlisting>
|
||||
|
||||
<para>
|
||||
this would be useful if you needed to know when data was committed to a
|
||||
particular inode.
|
||||
</para>
|
||||
|
||||
</sect2>
|
||||
|
||||
<sect2>
|
||||
<title>Summary</title>
|
||||
<para>
|
||||
Using the journal is a matter of wrapping the different context changes,
|
||||
being each mount, each modification (transaction) and each changed buffer
|
||||
to tell the journalling layer about them.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Here is a some pseudo code to give you an idea of how it works, as
|
||||
an example.
|
||||
</para>
|
||||
|
||||
<programlisting>
|
||||
journal_t* my_jnrl = journal_create();
|
||||
journal_init_{dev,inode}(jnrl,...)
|
||||
if (clean) journal_wipe();
|
||||
journal_load();
|
||||
|
||||
foreach(transaction) { /*transactions must be
|
||||
completed before
|
||||
a syscall returns to
|
||||
userspace*/
|
||||
|
||||
handle_t * xct=journal_start(my_jnrl);
|
||||
foreach(bh) {
|
||||
journal_get_{create,write,undo}_access(xact,bh);
|
||||
if ( myfs_modify(bh) ) { /* returns true
|
||||
if makes changes */
|
||||
journal_dirty_{meta,}data(xact,bh);
|
||||
} else {
|
||||
journal_forget(bh);
|
||||
}
|
||||
}
|
||||
journal_stop(xct);
|
||||
}
|
||||
journal_destroy(my_jrnl);
|
||||
</programlisting>
|
||||
</sect2>
|
||||
|
||||
</sect1>
|
||||
|
||||
<sect1>
|
||||
<title>Data Types</title>
|
||||
<para>
|
||||
The journalling layer uses typedefs to 'hide' the concrete definitions
|
||||
of the structures used. As a client of the JBD layer you can
|
||||
just rely on the using the pointer as a magic cookie of some sort.
|
||||
|
||||
Obviously the hiding is not enforced as this is 'C'.
|
||||
</para>
|
||||
<sect2><title>Structures</title>
|
||||
!Iinclude/linux/jbd.h
|
||||
</sect2>
|
||||
</sect1>
|
||||
|
||||
<sect1>
|
||||
<title>Functions</title>
|
||||
<para>
|
||||
The functions here are split into two groups those that
|
||||
affect a journal as a whole, and those which are used to
|
||||
manage transactions
|
||||
</para>
|
||||
<sect2><title>Journal Level</title>
|
||||
!Efs/jbd/journal.c
|
||||
!Ifs/jbd/recovery.c
|
||||
</sect2>
|
||||
<sect2><title>Transasction Level</title>
|
||||
!Efs/jbd/transaction.c
|
||||
</sect2>
|
||||
</sect1>
|
||||
<sect1>
|
||||
<title>See also</title>
|
||||
<para>
|
||||
<citation>
|
||||
<ulink url="ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/journal-design.ps.gz">
|
||||
Journaling the Linux ext2fs Filesystem, LinuxExpo 98, Stephen Tweedie
|
||||
</ulink>
|
||||
</citation>
|
||||
</para>
|
||||
<para>
|
||||
<citation>
|
||||
<ulink url="http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html">
|
||||
Ext3 Journalling FileSystem, OLS 2000, Dr. Stephen Tweedie
|
||||
</ulink>
|
||||
</citation>
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
</chapter>
|
||||
|
||||
</book>
|
||||
752
Documentation/DocBook/gadget.tmpl
Normal file
752
Documentation/DocBook/gadget.tmpl
Normal file
@ -0,0 +1,752 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
|
||||
"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []>
|
||||
|
||||
<book id="USB-Gadget-API">
|
||||
<bookinfo>
|
||||
<title>USB Gadget API for Linux</title>
|
||||
<date>20 August 2004</date>
|
||||
<edition>20 August 2004</edition>
|
||||
|
||||
<legalnotice>
|
||||
<para>
|
||||
This documentation is free software; you can redistribute
|
||||
it and/or modify it under the terms of the GNU General Public
|
||||
License as published by the Free Software Foundation; either
|
||||
version 2 of the License, or (at your option) any later
|
||||
version.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
This program is distributed in the hope that it will be
|
||||
useful, but WITHOUT ANY WARRANTY; without even the implied
|
||||
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
|
||||
See the GNU General Public License for more details.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
You should have received a copy of the GNU General Public
|
||||
License along with this program; if not, write to the Free
|
||||
Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
|
||||
MA 02111-1307 USA
|
||||
</para>
|
||||
|
||||
<para>
|
||||
For more details see the file COPYING in the source
|
||||
distribution of Linux.
|
||||
</para>
|
||||
</legalnotice>
|
||||
<copyright>
|
||||
<year>2003-2004</year>
|
||||
<holder>David Brownell</holder>
|
||||
</copyright>
|
||||
|
||||
<author>
|
||||
<firstname>David</firstname>
|
||||
<surname>Brownell</surname>
|
||||
<affiliation>
|
||||
<address><email>dbrownell@users.sourceforge.net</email></address>
|
||||
</affiliation>
|
||||
</author>
|
||||
</bookinfo>
|
||||
|
||||
<toc></toc>
|
||||
|
||||
<chapter><title>Introduction</title>
|
||||
|
||||
<para>This document presents a Linux-USB "Gadget"
|
||||
kernel mode
|
||||
API, for use within peripherals and other USB devices
|
||||
that embed Linux.
|
||||
It provides an overview of the API structure,
|
||||
and shows how that fits into a system development project.
|
||||
This is the first such API released on Linux to address
|
||||
a number of important problems, including: </para>
|
||||
|
||||
<itemizedlist>
|
||||
<listitem><para>Supports USB 2.0, for high speed devices which
|
||||
can stream data at several dozen megabytes per second.
|
||||
</para></listitem>
|
||||
<listitem><para>Handles devices with dozens of endpoints just as
|
||||
well as ones with just two fixed-function ones. Gadget drivers
|
||||
can be written so they're easy to port to new hardware.
|
||||
</para></listitem>
|
||||
<listitem><para>Flexible enough to expose more complex USB device
|
||||
capabilities such as multiple configurations, multiple interfaces,
|
||||
composite devices,
|
||||
and alternate interface settings.
|
||||
</para></listitem>
|
||||
<listitem><para>USB "On-The-Go" (OTG) support, in conjunction
|
||||
with updates to the Linux-USB host side.
|
||||
</para></listitem>
|
||||
<listitem><para>Sharing data structures and API models with the
|
||||
Linux-USB host side API. This helps the OTG support, and
|
||||
looks forward to more-symmetric frameworks (where the same
|
||||
I/O model is used by both host and device side drivers).
|
||||
</para></listitem>
|
||||
<listitem><para>Minimalist, so it's easier to support new device
|
||||
controller hardware. I/O processing doesn't imply large
|
||||
demands for memory or CPU resources.
|
||||
</para></listitem>
|
||||
</itemizedlist>
|
||||
|
||||
|
||||
<para>Most Linux developers will not be able to use this API, since they
|
||||
have USB "host" hardware in a PC, workstation, or server.
|
||||
Linux users with embedded systems are more likely to
|
||||
have USB peripheral hardware.
|
||||
To distinguish drivers running inside such hardware from the
|
||||
more familiar Linux "USB device drivers",
|
||||
which are host side proxies for the real USB devices,
|
||||
a different term is used:
|
||||
the drivers inside the peripherals are "USB gadget drivers".
|
||||
In USB protocol interactions, the device driver is the master
|
||||
(or "client driver")
|
||||
and the gadget driver is the slave (or "function driver").
|
||||
</para>
|
||||
|
||||
<para>The gadget API resembles the host side Linux-USB API in that both
|
||||
use queues of request objects to package I/O buffers, and those requests
|
||||
may be submitted or canceled.
|
||||
They share common definitions for the standard USB
|
||||
<emphasis>Chapter 9</emphasis> messages, structures, and constants.
|
||||
Also, both APIs bind and unbind drivers to devices.
|
||||
The APIs differ in detail, since the host side's current
|
||||
URB framework exposes a number of implementation details
|
||||
and assumptions that are inappropriate for a gadget API.
|
||||
While the model for control transfers and configuration
|
||||
management is necessarily different (one side is a hardware-neutral master,
|
||||
the other is a hardware-aware slave), the endpoint I/0 API used here
|
||||
should also be usable for an overhead-reduced host side API.
|
||||
</para>
|
||||
|
||||
</chapter>
|
||||
|
||||
<chapter id="structure"><title>Structure of Gadget Drivers</title>
|
||||
|
||||
<para>A system running inside a USB peripheral
|
||||
normally has at least three layers inside the kernel to handle
|
||||
USB protocol processing, and may have additional layers in
|
||||
user space code.
|
||||
The "gadget" API is used by the middle layer to interact
|
||||
with the lowest level (which directly handles hardware).
|
||||
</para>
|
||||
|
||||
<para>In Linux, from the bottom up, these layers are:
|
||||
</para>
|
||||
|
||||
<variablelist>
|
||||
|
||||
<varlistentry>
|
||||
<term><emphasis>USB Controller Driver</emphasis></term>
|
||||
|
||||
<listitem>
|
||||
<para>This is the lowest software level.
|
||||
It is the only layer that talks to hardware,
|
||||
through registers, fifos, dma, irqs, and the like.
|
||||
The <filename><linux/usb_gadget.h></filename> API abstracts
|
||||
the peripheral controller endpoint hardware.
|
||||
That hardware is exposed through endpoint objects, which accept
|
||||
streams of IN/OUT buffers, and through callbacks that interact
|
||||
with gadget drivers.
|
||||
Since normal USB devices only have one upstream
|
||||
port, they only have one of these drivers.
|
||||
The controller driver can support any number of different
|
||||
gadget drivers, but only one of them can be used at a time.
|
||||
</para>
|
||||
|
||||
<para>Examples of such controller hardware include
|
||||
the PCI-based NetChip 2280 USB 2.0 high speed controller,
|
||||
the SA-11x0 or PXA-25x UDC (found within many PDAs),
|
||||
and a variety of other products.
|
||||
</para>
|
||||
|
||||
</listitem></varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
<term><emphasis>Gadget Driver</emphasis></term>
|
||||
|
||||
<listitem>
|
||||
<para>The lower boundary of this driver implements hardware-neutral
|
||||
USB functions, using calls to the controller driver.
|
||||
Because such hardware varies widely in capabilities and restrictions,
|
||||
and is used in embedded environments where space is at a premium,
|
||||
the gadget driver is often configured at compile time
|
||||
to work with endpoints supported by one particular controller.
|
||||
Gadget drivers may be portable to several different controllers,
|
||||
using conditional compilation.
|
||||
(Recent kernels substantially simplify the work involved in
|
||||
supporting new hardware, by <emphasis>autoconfiguring</emphasis>
|
||||
endpoints automatically for many bulk-oriented drivers.)
|
||||
Gadget driver responsibilities include:
|
||||
</para>
|
||||
<itemizedlist>
|
||||
<listitem><para>handling setup requests (ep0 protocol responses)
|
||||
possibly including class-specific functionality
|
||||
</para></listitem>
|
||||
<listitem><para>returning configuration and string descriptors
|
||||
</para></listitem>
|
||||
<listitem><para>(re)setting configurations and interface
|
||||
altsettings, including enabling and configuring endpoints
|
||||
</para></listitem>
|
||||
<listitem><para>handling life cycle events, such as managing
|
||||
bindings to hardware,
|
||||
USB suspend/resume, remote wakeup,
|
||||
and disconnection from the USB host.
|
||||
</para></listitem>
|
||||
<listitem><para>managing IN and OUT transfers on all currently
|
||||
enabled endpoints
|
||||
</para></listitem>
|
||||
</itemizedlist>
|
||||
|
||||
<para>
|
||||
Such drivers may be modules of proprietary code, although
|
||||
that approach is discouraged in the Linux community.
|
||||
</para>
|
||||
</listitem></varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
<term><emphasis>Upper Level</emphasis></term>
|
||||
|
||||
<listitem>
|
||||
<para>Most gadget drivers have an upper boundary that connects
|
||||
to some Linux driver or framework in Linux.
|
||||
Through that boundary flows the data which the gadget driver
|
||||
produces and/or consumes through protocol transfers over USB.
|
||||
Examples include:
|
||||
</para>
|
||||
<itemizedlist>
|
||||
<listitem><para>user mode code, using generic (gadgetfs)
|
||||
or application specific files in
|
||||
<filename>/dev</filename>
|
||||
</para></listitem>
|
||||
<listitem><para>networking subsystem (for network gadgets,
|
||||
like the CDC Ethernet Model gadget driver)
|
||||
</para></listitem>
|
||||
<listitem><para>data capture drivers, perhaps video4Linux or
|
||||
a scanner driver; or test and measurement hardware.
|
||||
</para></listitem>
|
||||
<listitem><para>input subsystem (for HID gadgets)
|
||||
</para></listitem>
|
||||
<listitem><para>sound subsystem (for audio gadgets)
|
||||
</para></listitem>
|
||||
<listitem><para>file system (for PTP gadgets)
|
||||
</para></listitem>
|
||||
<listitem><para>block i/o subsystem (for usb-storage gadgets)
|
||||
</para></listitem>
|
||||
<listitem><para>... and more </para></listitem>
|
||||
</itemizedlist>
|
||||
</listitem></varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
<term><emphasis>Additional Layers</emphasis></term>
|
||||
|
||||
<listitem>
|
||||
<para>Other layers may exist.
|
||||
These could include kernel layers, such as network protocol stacks,
|
||||
as well as user mode applications building on standard POSIX
|
||||
system call APIs such as
|
||||
<emphasis>open()</emphasis>, <emphasis>close()</emphasis>,
|
||||
<emphasis>read()</emphasis> and <emphasis>write()</emphasis>.
|
||||
On newer systems, POSIX Async I/O calls may be an option.
|
||||
Such user mode code will not necessarily be subject to
|
||||
the GNU General Public License (GPL).
|
||||
</para>
|
||||
</listitem></varlistentry>
|
||||
|
||||
|
||||
</variablelist>
|
||||
|
||||
<para>OTG-capable systems will also need to include a standard Linux-USB
|
||||
host side stack,
|
||||
with <emphasis>usbcore</emphasis>,
|
||||
one or more <emphasis>Host Controller Drivers</emphasis> (HCDs),
|
||||
<emphasis>USB Device Drivers</emphasis> to support
|
||||
the OTG "Targeted Peripheral List",
|
||||
and so forth.
|
||||
There will also be an <emphasis>OTG Controller Driver</emphasis>,
|
||||
which is visible to gadget and device driver developers only indirectly.
|
||||
That helps the host and device side USB controllers implement the
|
||||
two new OTG protocols (HNP and SRP).
|
||||
Roles switch (host to peripheral, or vice versa) using HNP
|
||||
during USB suspend processing, and SRP can be viewed as a
|
||||
more battery-friendly kind of device wakeup protocol.
|
||||
</para>
|
||||
|
||||
<para>Over time, reusable utilities are evolving to help make some
|
||||
gadget driver tasks simpler.
|
||||
For example, building configuration descriptors from vectors of
|
||||
descriptors for the configurations interfaces and endpoints is
|
||||
now automated, and many drivers now use autoconfiguration to
|
||||
choose hardware endpoints and initialize their descriptors.
|
||||
|
||||
A potential example of particular interest
|
||||
is code implementing standard USB-IF protocols for
|
||||
HID, networking, storage, or audio classes.
|
||||
Some developers are interested in KDB or KGDB hooks, to let
|
||||
target hardware be remotely debugged.
|
||||
Most such USB protocol code doesn't need to be hardware-specific,
|
||||
any more than network protocols like X11, HTTP, or NFS are.
|
||||
Such gadget-side interface drivers should eventually be combined,
|
||||
to implement composite devices.
|
||||
</para>
|
||||
|
||||
</chapter>
|
||||
|
||||
|
||||
<chapter id="api"><title>Kernel Mode Gadget API</title>
|
||||
|
||||
<para>Gadget drivers declare themselves through a
|
||||
<emphasis>struct usb_gadget_driver</emphasis>, which is responsible for
|
||||
most parts of enumeration for a <emphasis>struct usb_gadget</emphasis>.
|
||||
The response to a set_configuration usually involves
|
||||
enabling one or more of the <emphasis>struct usb_ep</emphasis> objects
|
||||
exposed by the gadget, and submitting one or more
|
||||
<emphasis>struct usb_request</emphasis> buffers to transfer data.
|
||||
Understand those four data types, and their operations, and
|
||||
you will understand how this API works.
|
||||
</para>
|
||||
|
||||
<note><title>Incomplete Data Type Descriptions</title>
|
||||
|
||||
<para>This documentation was prepared using the standard Linux
|
||||
kernel <filename>docproc</filename> tool, which turns text
|
||||
and in-code comments into SGML DocBook and then into usable
|
||||
formats such as HTML or PDF.
|
||||
Other than the "Chapter 9" data types, most of the significant
|
||||
data types and functions are described here.
|
||||
</para>
|
||||
|
||||
<para>However, docproc does not understand all the C constructs
|
||||
that are used, so some relevant information is likely omitted from
|
||||
what you are reading.
|
||||
One example of such information is endpoint autoconfiguration.
|
||||
You'll have to read the header file, and use example source
|
||||
code (such as that for "Gadget Zero"), to fully understand the API.
|
||||
</para>
|
||||
|
||||
<para>The part of the API implementing some basic
|
||||
driver capabilities is specific to the version of the
|
||||
Linux kernel that's in use.
|
||||
The 2.6 kernel includes a <emphasis>driver model</emphasis>
|
||||
framework that has no analogue on earlier kernels;
|
||||
so those parts of the gadget API are not fully portable.
|
||||
(They are implemented on 2.4 kernels, but in a different way.)
|
||||
The driver model state is another part of this API that is
|
||||
ignored by the kerneldoc tools.
|
||||
</para>
|
||||
</note>
|
||||
|
||||
<para>The core API does not expose
|
||||
every possible hardware feature, only the most widely available ones.
|
||||
There are significant hardware features, such as device-to-device DMA
|
||||
(without temporary storage in a memory buffer)
|
||||
that would be added using hardware-specific APIs.
|
||||
</para>
|
||||
|
||||
<para>This API allows drivers to use conditional compilation to handle
|
||||
endpoint capabilities of different hardware, but doesn't require that.
|
||||
Hardware tends to have arbitrary restrictions, relating to
|
||||
transfer types, addressing, packet sizes, buffering, and availability.
|
||||
As a rule, such differences only matter for "endpoint zero" logic
|
||||
that handles device configuration and management.
|
||||
The API supports limited run-time
|
||||
detection of capabilities, through naming conventions for endpoints.
|
||||
Many drivers will be able to at least partially autoconfigure
|
||||
themselves.
|
||||
In particular, driver init sections will often have endpoint
|
||||
autoconfiguration logic that scans the hardware's list of endpoints
|
||||
to find ones matching the driver requirements
|
||||
(relying on those conventions), to eliminate some of the most
|
||||
common reasons for conditional compilation.
|
||||
</para>
|
||||
|
||||
<para>Like the Linux-USB host side API, this API exposes
|
||||
the "chunky" nature of USB messages: I/O requests are in terms
|
||||
of one or more "packets", and packet boundaries are visible to drivers.
|
||||
Compared to RS-232 serial protocols, USB resembles
|
||||
synchronous protocols like HDLC
|
||||
(N bytes per frame, multipoint addressing, host as the primary
|
||||
station and devices as secondary stations)
|
||||
more than asynchronous ones
|
||||
(tty style: 8 data bits per frame, no parity, one stop bit).
|
||||
So for example the controller drivers won't buffer
|
||||
two single byte writes into a single two-byte USB IN packet,
|
||||
although gadget drivers may do so when they implement
|
||||
protocols where packet boundaries (and "short packets")
|
||||
are not significant.
|
||||
</para>
|
||||
|
||||
<sect1 id="lifecycle"><title>Driver Life Cycle</title>
|
||||
|
||||
<para>Gadget drivers make endpoint I/O requests to hardware without
|
||||
needing to know many details of the hardware, but driver
|
||||
setup/configuration code needs to handle some differences.
|
||||
Use the API like this:
|
||||
</para>
|
||||
|
||||
<orderedlist numeration='arabic'>
|
||||
|
||||
<listitem><para>Register a driver for the particular device side
|
||||
usb controller hardware,
|
||||
such as the net2280 on PCI (USB 2.0),
|
||||
sa11x0 or pxa25x as found in Linux PDAs,
|
||||
and so on.
|
||||
At this point the device is logically in the USB ch9 initial state
|
||||
("attached"), drawing no power and not usable
|
||||
(since it does not yet support enumeration).
|
||||
Any host should not see the device, since it's not
|
||||
activated the data line pullup used by the host to
|
||||
detect a device, even if VBUS power is available.
|
||||
</para></listitem>
|
||||
|
||||
<listitem><para>Register a gadget driver that implements some higher level
|
||||
device function. That will then bind() to a usb_gadget, which
|
||||
activates the data line pullup sometime after detecting VBUS.
|
||||
</para></listitem>
|
||||
|
||||
<listitem><para>The hardware driver can now start enumerating.
|
||||
The steps it handles are to accept USB power and set_address requests.
|
||||
Other steps are handled by the gadget driver.
|
||||
If the gadget driver module is unloaded before the host starts to
|
||||
enumerate, steps before step 7 are skipped.
|
||||
</para></listitem>
|
||||
|
||||
<listitem><para>The gadget driver's setup() call returns usb descriptors,
|
||||
based both on what the bus interface hardware provides and on the
|
||||
functionality being implemented.
|
||||
That can involve alternate settings or configurations,
|
||||
unless the hardware prevents such operation.
|
||||
For OTG devices, each configuration descriptor includes
|
||||
an OTG descriptor.
|
||||
</para></listitem>
|
||||
|
||||
<listitem><para>The gadget driver handles the last step of enumeration,
|
||||
when the USB host issues a set_configuration call.
|
||||
It enables all endpoints used in that configuration,
|
||||
with all interfaces in their default settings.
|
||||
That involves using a list of the hardware's endpoints, enabling each
|
||||
endpoint according to its descriptor.
|
||||
It may also involve using <function>usb_gadget_vbus_draw</function>
|
||||
to let more power be drawn from VBUS, as allowed by that configuration.
|
||||
For OTG devices, setting a configuration may also involve reporting
|
||||
HNP capabilities through a user interface.
|
||||
</para></listitem>
|
||||
|
||||
<listitem><para>Do real work and perform data transfers, possibly involving
|
||||
changes to interface settings or switching to new configurations, until the
|
||||
device is disconnect()ed from the host.
|
||||
Queue any number of transfer requests to each endpoint.
|
||||
It may be suspended and resumed several times before being disconnected.
|
||||
On disconnect, the drivers go back to step 3 (above).
|
||||
</para></listitem>
|
||||
|
||||
<listitem><para>When the gadget driver module is being unloaded,
|
||||
the driver unbind() callback is issued. That lets the controller
|
||||
driver be unloaded.
|
||||
</para></listitem>
|
||||
|
||||
</orderedlist>
|
||||
|
||||
<para>Drivers will normally be arranged so that just loading the
|
||||
gadget driver module (or statically linking it into a Linux kernel)
|
||||
allows the peripheral device to be enumerated, but some drivers
|
||||
will defer enumeration until some higher level component (like
|
||||
a user mode daemon) enables it.
|
||||
Note that at this lowest level there are no policies about how
|
||||
ep0 configuration logic is implemented,
|
||||
except that it should obey USB specifications.
|
||||
Such issues are in the domain of gadget drivers,
|
||||
including knowing about implementation constraints
|
||||
imposed by some USB controllers
|
||||
or understanding that composite devices might happen to
|
||||
be built by integrating reusable components.
|
||||
</para>
|
||||
|
||||
<para>Note that the lifecycle above can be slightly different
|
||||
for OTG devices.
|
||||
Other than providing an additional OTG descriptor in each
|
||||
configuration, only the HNP-related differences are particularly
|
||||
visible to driver code.
|
||||
They involve reporting requirements during the SET_CONFIGURATION
|
||||
request, and the option to invoke HNP during some suspend callbacks.
|
||||
Also, SRP changes the semantics of
|
||||
<function>usb_gadget_wakeup</function>
|
||||
slightly.
|
||||
</para>
|
||||
|
||||
</sect1>
|
||||
|
||||
<sect1 id="ch9"><title>USB 2.0 Chapter 9 Types and Constants</title>
|
||||
|
||||
<para>Gadget drivers
|
||||
rely on common USB structures and constants
|
||||
defined in the
|
||||
<filename><linux/usb/ch9.h></filename>
|
||||
header file, which is standard in Linux 2.6 kernels.
|
||||
These are the same types and constants used by host
|
||||
side drivers (and usbcore).
|
||||
</para>
|
||||
|
||||
!Iinclude/linux/usb/ch9.h
|
||||
</sect1>
|
||||
|
||||
<sect1 id="core"><title>Core Objects and Methods</title>
|
||||
|
||||
<para>These are declared in
|
||||
<filename><linux/usb_gadget.h></filename>,
|
||||
and are used by gadget drivers to interact with
|
||||
USB peripheral controller drivers.
|
||||
</para>
|
||||
|
||||
<!-- yeech, this is ugly in nsgmls PDF output.
|
||||
|
||||
the PDF bookmark and refentry output nesting is wrong,
|
||||
and the member/argument documentation indents ugly.
|
||||
|
||||
plus something (docproc?) adds whitespace before the
|
||||
descriptive paragraph text, so it can't line up right
|
||||
unless the explanations are trivial.
|
||||
-->
|
||||
|
||||
!Iinclude/linux/usb_gadget.h
|
||||
</sect1>
|
||||
|
||||
<sect1 id="utils"><title>Optional Utilities</title>
|
||||
|
||||
<para>The core API is sufficient for writing a USB Gadget Driver,
|
||||
but some optional utilities are provided to simplify common tasks.
|
||||
These utilities include endpoint autoconfiguration.
|
||||
</para>
|
||||
|
||||
!Edrivers/usb/gadget/usbstring.c
|
||||
!Edrivers/usb/gadget/config.c
|
||||
<!-- !Edrivers/usb/gadget/epautoconf.c -->
|
||||
</sect1>
|
||||
|
||||
</chapter>
|
||||
|
||||
<chapter id="controllers"><title>Peripheral Controller Drivers</title>
|
||||
|
||||
<para>The first hardware supporting this API was the NetChip 2280
|
||||
controller, which supports USB 2.0 high speed and is based on PCI.
|
||||
This is the <filename>net2280</filename> driver module.
|
||||
The driver supports Linux kernel versions 2.4 and 2.6;
|
||||
contact NetChip Technologies for development boards and product
|
||||
information.
|
||||
</para>
|
||||
|
||||
<para>Other hardware working in the "gadget" framework includes:
|
||||
Intel's PXA 25x and IXP42x series processors
|
||||
(<filename>pxa2xx_udc</filename>),
|
||||
Toshiba TC86c001 "Goku-S" (<filename>goku_udc</filename>),
|
||||
Renesas SH7705/7727 (<filename>sh_udc</filename>),
|
||||
MediaQ 11xx (<filename>mq11xx_udc</filename>),
|
||||
Hynix HMS30C7202 (<filename>h7202_udc</filename>),
|
||||
National 9303/4 (<filename>n9604_udc</filename>),
|
||||
Texas Instruments OMAP (<filename>omap_udc</filename>),
|
||||
Sharp LH7A40x (<filename>lh7a40x_udc</filename>),
|
||||
and more.
|
||||
Most of those are full speed controllers.
|
||||
</para>
|
||||
|
||||
<para>At this writing, there are people at work on drivers in
|
||||
this framework for several other USB device controllers,
|
||||
with plans to make many of them be widely available.
|
||||
</para>
|
||||
|
||||
<!-- !Edrivers/usb/gadget/net2280.c -->
|
||||
|
||||
<para>A partial USB simulator,
|
||||
the <filename>dummy_hcd</filename> driver, is available.
|
||||
It can act like a net2280, a pxa25x, or an sa11x0 in terms
|
||||
of available endpoints and device speeds; and it simulates
|
||||
control, bulk, and to some extent interrupt transfers.
|
||||
That lets you develop some parts of a gadget driver on a normal PC,
|
||||
without any special hardware, and perhaps with the assistance
|
||||
of tools such as GDB running with User Mode Linux.
|
||||
At least one person has expressed interest in adapting that
|
||||
approach, hooking it up to a simulator for a microcontroller.
|
||||
Such simulators can help debug subsystems where the runtime hardware
|
||||
is unfriendly to software development, or is not yet available.
|
||||
</para>
|
||||
|
||||
<para>Support for other controllers is expected to be developed
|
||||
and contributed
|
||||
over time, as this driver framework evolves.
|
||||
</para>
|
||||
|
||||
</chapter>
|
||||
|
||||
<chapter id="gadget"><title>Gadget Drivers</title>
|
||||
|
||||
<para>In addition to <emphasis>Gadget Zero</emphasis>
|
||||
(used primarily for testing and development with drivers
|
||||
for usb controller hardware), other gadget drivers exist.
|
||||
</para>
|
||||
|
||||
<para>There's an <emphasis>ethernet</emphasis> gadget
|
||||
driver, which implements one of the most useful
|
||||
<emphasis>Communications Device Class</emphasis> (CDC) models.
|
||||
One of the standards for cable modem interoperability even
|
||||
specifies the use of this ethernet model as one of two
|
||||
mandatory options.
|
||||
Gadgets using this code look to a USB host as if they're
|
||||
an Ethernet adapter.
|
||||
It provides access to a network where the gadget's CPU is one host,
|
||||
which could easily be bridging, routing, or firewalling
|
||||
access to other networks.
|
||||
Since some hardware can't fully implement the CDC Ethernet
|
||||
requirements, this driver also implements a "good parts only"
|
||||
subset of CDC Ethernet.
|
||||
(That subset doesn't advertise itself as CDC Ethernet,
|
||||
to avoid creating problems.)
|
||||
</para>
|
||||
|
||||
<para>Support for Microsoft's <emphasis>RNDIS</emphasis>
|
||||
protocol has been contributed by Pengutronix and Auerswald GmbH.
|
||||
This is like CDC Ethernet, but it runs on more slightly USB hardware
|
||||
(but less than the CDC subset).
|
||||
However, its main claim to fame is being able to connect directly to
|
||||
recent versions of Windows, using drivers that Microsoft bundles
|
||||
and supports, making it much simpler to network with Windows.
|
||||
</para>
|
||||
|
||||
<para>There is also support for user mode gadget drivers,
|
||||
using <emphasis>gadgetfs</emphasis>.
|
||||
This provides a <emphasis>User Mode API</emphasis> that presents
|
||||
each endpoint as a single file descriptor. I/O is done using
|
||||
normal <emphasis>read()</emphasis> and <emphasis>read()</emphasis> calls.
|
||||
Familiar tools like GDB and pthreads can be used to
|
||||
develop and debug user mode drivers, so that once a robust
|
||||
controller driver is available many applications for it
|
||||
won't require new kernel mode software.
|
||||
Linux 2.6 <emphasis>Async I/O (AIO)</emphasis>
|
||||
support is available, so that user mode software
|
||||
can stream data with only slightly more overhead
|
||||
than a kernel driver.
|
||||
</para>
|
||||
|
||||
<para>There's a USB Mass Storage class driver, which provides
|
||||
a different solution for interoperability with systems such
|
||||
as MS-Windows and MacOS.
|
||||
That <emphasis>File-backed Storage</emphasis> driver uses a
|
||||
file or block device as backing store for a drive,
|
||||
like the <filename>loop</filename> driver.
|
||||
The USB host uses the BBB, CB, or CBI versions of the mass
|
||||
storage class specification, using transparent SCSI commands
|
||||
to access the data from the backing store.
|
||||
</para>
|
||||
|
||||
<para>There's a "serial line" driver, useful for TTY style
|
||||
operation over USB.
|
||||
The latest version of that driver supports CDC ACM style
|
||||
operation, like a USB modem, and so on most hardware it can
|
||||
interoperate easily with MS-Windows.
|
||||
One interesting use of that driver is in boot firmware (like a BIOS),
|
||||
which can sometimes use that model with very small systems without
|
||||
real serial lines.
|
||||
</para>
|
||||
|
||||
<para>Support for other kinds of gadget is expected to
|
||||
be developed and contributed
|
||||
over time, as this driver framework evolves.
|
||||
</para>
|
||||
|
||||
</chapter>
|
||||
|
||||
<chapter id="otg"><title>USB On-The-GO (OTG)</title>
|
||||
|
||||
<para>USB OTG support on Linux 2.6 was initially developed
|
||||
by Texas Instruments for
|
||||
<ulink url="http://www.omap.com">OMAP</ulink> 16xx and 17xx
|
||||
series processors.
|
||||
Other OTG systems should work in similar ways, but the
|
||||
hardware level details could be very different.
|
||||
</para>
|
||||
|
||||
<para>Systems need specialized hardware support to implement OTG,
|
||||
notably including a special <emphasis>Mini-AB</emphasis> jack
|
||||
and associated transciever to support <emphasis>Dual-Role</emphasis>
|
||||
operation:
|
||||
they can act either as a host, using the standard
|
||||
Linux-USB host side driver stack,
|
||||
or as a peripheral, using this "gadget" framework.
|
||||
To do that, the system software relies on small additions
|
||||
to those programming interfaces,
|
||||
and on a new internal component (here called an "OTG Controller")
|
||||
affecting which driver stack connects to the OTG port.
|
||||
In each role, the system can re-use the existing pool of
|
||||
hardware-neutral drivers, layered on top of the controller
|
||||
driver interfaces (<emphasis>usb_bus</emphasis> or
|
||||
<emphasis>usb_gadget</emphasis>).
|
||||
Such drivers need at most minor changes, and most of the calls
|
||||
added to support OTG can also benefit non-OTG products.
|
||||
</para>
|
||||
|
||||
<itemizedlist>
|
||||
<listitem><para>Gadget drivers test the <emphasis>is_otg</emphasis>
|
||||
flag, and use it to determine whether or not to include
|
||||
an OTG descriptor in each of their configurations.
|
||||
</para></listitem>
|
||||
<listitem><para>Gadget drivers may need changes to support the
|
||||
two new OTG protocols, exposed in new gadget attributes
|
||||
such as <emphasis>b_hnp_enable</emphasis> flag.
|
||||
HNP support should be reported through a user interface
|
||||
(two LEDs could suffice), and is triggered in some cases
|
||||
when the host suspends the peripheral.
|
||||
SRP support can be user-initiated just like remote wakeup,
|
||||
probably by pressing the same button.
|
||||
</para></listitem>
|
||||
<listitem><para>On the host side, USB device drivers need
|
||||
to be taught to trigger HNP at appropriate moments, using
|
||||
<function>usb_suspend_device()</function>.
|
||||
That also conserves battery power, which is useful even
|
||||
for non-OTG configurations.
|
||||
</para></listitem>
|
||||
<listitem><para>Also on the host side, a driver must support the
|
||||
OTG "Targeted Peripheral List". That's just a whitelist,
|
||||
used to reject peripherals not supported with a given
|
||||
Linux OTG host.
|
||||
<emphasis>This whitelist is product-specific;
|
||||
each product must modify <filename>otg_whitelist.h</filename>
|
||||
to match its interoperability specification.
|
||||
</emphasis>
|
||||
</para>
|
||||
<para>Non-OTG Linux hosts, like PCs and workstations,
|
||||
normally have some solution for adding drivers, so that
|
||||
peripherals that aren't recognized can eventually be supported.
|
||||
That approach is unreasonable for consumer products that may
|
||||
never have their firmware upgraded, and where it's usually
|
||||
unrealistic to expect traditional PC/workstation/server kinds
|
||||
of support model to work.
|
||||
For example, it's often impractical to change device firmware
|
||||
once the product has been distributed, so driver bugs can't
|
||||
normally be fixed if they're found after shipment.
|
||||
</para></listitem>
|
||||
</itemizedlist>
|
||||
|
||||
<para>
|
||||
Additional changes are needed below those hardware-neutral
|
||||
<emphasis>usb_bus</emphasis> and <emphasis>usb_gadget</emphasis>
|
||||
driver interfaces; those aren't discussed here in any detail.
|
||||
Those affect the hardware-specific code for each USB Host or Peripheral
|
||||
controller, and how the HCD initializes (since OTG can be active only
|
||||
on a single port).
|
||||
They also involve what may be called an <emphasis>OTG Controller
|
||||
Driver</emphasis>, managing the OTG transceiver and the OTG state
|
||||
machine logic as well as much of the root hub behavior for the
|
||||
OTG port.
|
||||
The OTG controller driver needs to activate and deactivate USB
|
||||
controllers depending on the relevant device role.
|
||||
Some related changes were needed inside usbcore, so that it
|
||||
can identify OTG-capable devices and respond appropriately
|
||||
to HNP or SRP protocols.
|
||||
</para>
|
||||
|
||||
</chapter>
|
||||
|
||||
</book>
|
||||
<!--
|
||||
vim:syntax=sgml:sw=4
|
||||
-->
|
||||
474
Documentation/DocBook/genericirq.tmpl
Normal file
474
Documentation/DocBook/genericirq.tmpl
Normal file
@ -0,0 +1,474 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
|
||||
"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []>
|
||||
|
||||
<book id="Generic-IRQ-Guide">
|
||||
<bookinfo>
|
||||
<title>Linux generic IRQ handling</title>
|
||||
|
||||
<authorgroup>
|
||||
<author>
|
||||
<firstname>Thomas</firstname>
|
||||
<surname>Gleixner</surname>
|
||||
<affiliation>
|
||||
<address>
|
||||
<email>tglx@linutronix.de</email>
|
||||
</address>
|
||||
</affiliation>
|
||||
</author>
|
||||
<author>
|
||||
<firstname>Ingo</firstname>
|
||||
<surname>Molnar</surname>
|
||||
<affiliation>
|
||||
<address>
|
||||
<email>mingo@elte.hu</email>
|
||||
</address>
|
||||
</affiliation>
|
||||
</author>
|
||||
</authorgroup>
|
||||
|
||||
<copyright>
|
||||
<year>2005-2006</year>
|
||||
<holder>Thomas Gleixner</holder>
|
||||
</copyright>
|
||||
<copyright>
|
||||
<year>2005-2006</year>
|
||||
<holder>Ingo Molnar</holder>
|
||||
</copyright>
|
||||
|
||||
<legalnotice>
|
||||
<para>
|
||||
This documentation is free software; you can redistribute
|
||||
it and/or modify it under the terms of the GNU General Public
|
||||
License version 2 as published by the Free Software Foundation.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
This program is distributed in the hope that it will be
|
||||
useful, but WITHOUT ANY WARRANTY; without even the implied
|
||||
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
|
||||
See the GNU General Public License for more details.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
You should have received a copy of the GNU General Public
|
||||
License along with this program; if not, write to the Free
|
||||
Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
|
||||
MA 02111-1307 USA
|
||||
</para>
|
||||
|
||||
<para>
|
||||
For more details see the file COPYING in the source
|
||||
distribution of Linux.
|
||||
</para>
|
||||
</legalnotice>
|
||||
</bookinfo>
|
||||
|
||||
<toc></toc>
|
||||
|
||||
<chapter id="intro">
|
||||
<title>Introduction</title>
|
||||
<para>
|
||||
The generic interrupt handling layer is designed to provide a
|
||||
complete abstraction of interrupt handling for device drivers.
|
||||
It is able to handle all the different types of interrupt controller
|
||||
hardware. Device drivers use generic API functions to request, enable,
|
||||
disable and free interrupts. The drivers do not have to know anything
|
||||
about interrupt hardware details, so they can be used on different
|
||||
platforms without code changes.
|
||||
</para>
|
||||
<para>
|
||||
This documentation is provided to developers who want to implement
|
||||
an interrupt subsystem based for their architecture, with the help
|
||||
of the generic IRQ handling layer.
|
||||
</para>
|
||||
</chapter>
|
||||
|
||||
<chapter id="rationale">
|
||||
<title>Rationale</title>
|
||||
<para>
|
||||
The original implementation of interrupt handling in Linux is using
|
||||
the __do_IRQ() super-handler, which is able to deal with every
|
||||
type of interrupt logic.
|
||||
</para>
|
||||
<para>
|
||||
Originally, Russell King identified different types of handlers to
|
||||
build a quite universal set for the ARM interrupt handler
|
||||
implementation in Linux 2.5/2.6. He distinguished between:
|
||||
<itemizedlist>
|
||||
<listitem><para>Level type</para></listitem>
|
||||
<listitem><para>Edge type</para></listitem>
|
||||
<listitem><para>Simple type</para></listitem>
|
||||
</itemizedlist>
|
||||
In the SMP world of the __do_IRQ() super-handler another type
|
||||
was identified:
|
||||
<itemizedlist>
|
||||
<listitem><para>Per CPU type</para></listitem>
|
||||
</itemizedlist>
|
||||
</para>
|
||||
<para>
|
||||
This split implementation of highlevel IRQ handlers allows us to
|
||||
optimize the flow of the interrupt handling for each specific
|
||||
interrupt type. This reduces complexity in that particular codepath
|
||||
and allows the optimized handling of a given type.
|
||||
</para>
|
||||
<para>
|
||||
The original general IRQ implementation used hw_interrupt_type
|
||||
structures and their ->ack(), ->end() [etc.] callbacks to
|
||||
differentiate the flow control in the super-handler. This leads to
|
||||
a mix of flow logic and lowlevel hardware logic, and it also leads
|
||||
to unnecessary code duplication: for example in i386, there is a
|
||||
ioapic_level_irq and a ioapic_edge_irq irq-type which share many
|
||||
of the lowlevel details but have different flow handling.
|
||||
</para>
|
||||
<para>
|
||||
A more natural abstraction is the clean separation of the
|
||||
'irq flow' and the 'chip details'.
|
||||
</para>
|
||||
<para>
|
||||
Analysing a couple of architecture's IRQ subsystem implementations
|
||||
reveals that most of them can use a generic set of 'irq flow'
|
||||
methods and only need to add the chip level specific code.
|
||||
The separation is also valuable for (sub)architectures
|
||||
which need specific quirks in the irq flow itself but not in the
|
||||
chip-details - and thus provides a more transparent IRQ subsystem
|
||||
design.
|
||||
</para>
|
||||
<para>
|
||||
Each interrupt descriptor is assigned its own highlevel flow
|
||||
handler, which is normally one of the generic
|
||||
implementations. (This highlevel flow handler implementation also
|
||||
makes it simple to provide demultiplexing handlers which can be
|
||||
found in embedded platforms on various architectures.)
|
||||
</para>
|
||||
<para>
|
||||
The separation makes the generic interrupt handling layer more
|
||||
flexible and extensible. For example, an (sub)architecture can
|
||||
use a generic irq-flow implementation for 'level type' interrupts
|
||||
and add a (sub)architecture specific 'edge type' implementation.
|
||||
</para>
|
||||
<para>
|
||||
To make the transition to the new model easier and prevent the
|
||||
breakage of existing implementations, the __do_IRQ() super-handler
|
||||
is still available. This leads to a kind of duality for the time
|
||||
being. Over time the new model should be used in more and more
|
||||
architectures, as it enables smaller and cleaner IRQ subsystems.
|
||||
</para>
|
||||
</chapter>
|
||||
<chapter id="bugs">
|
||||
<title>Known Bugs And Assumptions</title>
|
||||
<para>
|
||||
None (knock on wood).
|
||||
</para>
|
||||
</chapter>
|
||||
|
||||
<chapter id="Abstraction">
|
||||
<title>Abstraction layers</title>
|
||||
<para>
|
||||
There are three main levels of abstraction in the interrupt code:
|
||||
<orderedlist>
|
||||
<listitem><para>Highlevel driver API</para></listitem>
|
||||
<listitem><para>Highlevel IRQ flow handlers</para></listitem>
|
||||
<listitem><para>Chiplevel hardware encapsulation</para></listitem>
|
||||
</orderedlist>
|
||||
</para>
|
||||
<sect1>
|
||||
<title>Interrupt control flow</title>
|
||||
<para>
|
||||
Each interrupt is described by an interrupt descriptor structure
|
||||
irq_desc. The interrupt is referenced by an 'unsigned int' numeric
|
||||
value which selects the corresponding interrupt decription structure
|
||||
in the descriptor structures array.
|
||||
The descriptor structure contains status information and pointers
|
||||
to the interrupt flow method and the interrupt chip structure
|
||||
which are assigned to this interrupt.
|
||||
</para>
|
||||
<para>
|
||||
Whenever an interrupt triggers, the lowlevel arch code calls into
|
||||
the generic interrupt code by calling desc->handle_irq().
|
||||
This highlevel IRQ handling function only uses desc->chip primitives
|
||||
referenced by the assigned chip descriptor structure.
|
||||
</para>
|
||||
</sect1>
|
||||
<sect1>
|
||||
<title>Highlevel Driver API</title>
|
||||
<para>
|
||||
The highlevel Driver API consists of following functions:
|
||||
<itemizedlist>
|
||||
<listitem><para>request_irq()</para></listitem>
|
||||
<listitem><para>free_irq()</para></listitem>
|
||||
<listitem><para>disable_irq()</para></listitem>
|
||||
<listitem><para>enable_irq()</para></listitem>
|
||||
<listitem><para>disable_irq_nosync() (SMP only)</para></listitem>
|
||||
<listitem><para>synchronize_irq() (SMP only)</para></listitem>
|
||||
<listitem><para>set_irq_type()</para></listitem>
|
||||
<listitem><para>set_irq_wake()</para></listitem>
|
||||
<listitem><para>set_irq_data()</para></listitem>
|
||||
<listitem><para>set_irq_chip()</para></listitem>
|
||||
<listitem><para>set_irq_chip_data()</para></listitem>
|
||||
</itemizedlist>
|
||||
See the autogenerated function documentation for details.
|
||||
</para>
|
||||
</sect1>
|
||||
<sect1>
|
||||
<title>Highlevel IRQ flow handlers</title>
|
||||
<para>
|
||||
The generic layer provides a set of pre-defined irq-flow methods:
|
||||
<itemizedlist>
|
||||
<listitem><para>handle_level_irq</para></listitem>
|
||||
<listitem><para>handle_edge_irq</para></listitem>
|
||||
<listitem><para>handle_simple_irq</para></listitem>
|
||||
<listitem><para>handle_percpu_irq</para></listitem>
|
||||
</itemizedlist>
|
||||
The interrupt flow handlers (either predefined or architecture
|
||||
specific) are assigned to specific interrupts by the architecture
|
||||
either during bootup or during device initialization.
|
||||
</para>
|
||||
<sect2>
|
||||
<title>Default flow implementations</title>
|
||||
<sect3>
|
||||
<title>Helper functions</title>
|
||||
<para>
|
||||
The helper functions call the chip primitives and
|
||||
are used by the default flow implementations.
|
||||
The following helper functions are implemented (simplified excerpt):
|
||||
<programlisting>
|
||||
default_enable(irq)
|
||||
{
|
||||
desc->chip->unmask(irq);
|
||||
}
|
||||
|
||||
default_disable(irq)
|
||||
{
|
||||
if (!delay_disable(irq))
|
||||
desc->chip->mask(irq);
|
||||
}
|
||||
|
||||
default_ack(irq)
|
||||
{
|
||||
chip->ack(irq);
|
||||
}
|
||||
|
||||
default_mask_ack(irq)
|
||||
{
|
||||
if (chip->mask_ack) {
|
||||
chip->mask_ack(irq);
|
||||
} else {
|
||||
chip->mask(irq);
|
||||
chip->ack(irq);
|
||||
}
|
||||
}
|
||||
|
||||
noop(irq)
|
||||
{
|
||||
}
|
||||
|
||||
</programlisting>
|
||||
</para>
|
||||
</sect3>
|
||||
</sect2>
|
||||
<sect2>
|
||||
<title>Default flow handler implementations</title>
|
||||
<sect3>
|
||||
<title>Default Level IRQ flow handler</title>
|
||||
<para>
|
||||
handle_level_irq provides a generic implementation
|
||||
for level-triggered interrupts.
|
||||
</para>
|
||||
<para>
|
||||
The following control flow is implemented (simplified excerpt):
|
||||
<programlisting>
|
||||
desc->chip->start();
|
||||
handle_IRQ_event(desc->action);
|
||||
desc->chip->end();
|
||||
</programlisting>
|
||||
</para>
|
||||
</sect3>
|
||||
<sect3>
|
||||
<title>Default Edge IRQ flow handler</title>
|
||||
<para>
|
||||
handle_edge_irq provides a generic implementation
|
||||
for edge-triggered interrupts.
|
||||
</para>
|
||||
<para>
|
||||
The following control flow is implemented (simplified excerpt):
|
||||
<programlisting>
|
||||
if (desc->status & running) {
|
||||
desc->chip->hold();
|
||||
desc->status |= pending | masked;
|
||||
return;
|
||||
}
|
||||
desc->chip->start();
|
||||
desc->status |= running;
|
||||
do {
|
||||
if (desc->status & masked)
|
||||
desc->chip->enable();
|
||||
desc->status &= ~pending;
|
||||
handle_IRQ_event(desc->action);
|
||||
} while (status & pending);
|
||||
desc->status &= ~running;
|
||||
desc->chip->end();
|
||||
</programlisting>
|
||||
</para>
|
||||
</sect3>
|
||||
<sect3>
|
||||
<title>Default simple IRQ flow handler</title>
|
||||
<para>
|
||||
handle_simple_irq provides a generic implementation
|
||||
for simple interrupts.
|
||||
</para>
|
||||
<para>
|
||||
Note: The simple flow handler does not call any
|
||||
handler/chip primitives.
|
||||
</para>
|
||||
<para>
|
||||
The following control flow is implemented (simplified excerpt):
|
||||
<programlisting>
|
||||
handle_IRQ_event(desc->action);
|
||||
</programlisting>
|
||||
</para>
|
||||
</sect3>
|
||||
<sect3>
|
||||
<title>Default per CPU flow handler</title>
|
||||
<para>
|
||||
handle_percpu_irq provides a generic implementation
|
||||
for per CPU interrupts.
|
||||
</para>
|
||||
<para>
|
||||
Per CPU interrupts are only available on SMP and
|
||||
the handler provides a simplified version without
|
||||
locking.
|
||||
</para>
|
||||
<para>
|
||||
The following control flow is implemented (simplified excerpt):
|
||||
<programlisting>
|
||||
desc->chip->start();
|
||||
handle_IRQ_event(desc->action);
|
||||
desc->chip->end();
|
||||
</programlisting>
|
||||
</para>
|
||||
</sect3>
|
||||
</sect2>
|
||||
<sect2>
|
||||
<title>Quirks and optimizations</title>
|
||||
<para>
|
||||
The generic functions are intended for 'clean' architectures and chips,
|
||||
which have no platform-specific IRQ handling quirks. If an architecture
|
||||
needs to implement quirks on the 'flow' level then it can do so by
|
||||
overriding the highlevel irq-flow handler.
|
||||
</para>
|
||||
</sect2>
|
||||
<sect2>
|
||||
<title>Delayed interrupt disable</title>
|
||||
<para>
|
||||
This per interrupt selectable feature, which was introduced by Russell
|
||||
King in the ARM interrupt implementation, does not mask an interrupt
|
||||
at the hardware level when disable_irq() is called. The interrupt is
|
||||
kept enabled and is masked in the flow handler when an interrupt event
|
||||
happens. This prevents losing edge interrupts on hardware which does
|
||||
not store an edge interrupt event while the interrupt is disabled at
|
||||
the hardware level. When an interrupt arrives while the IRQ_DISABLED
|
||||
flag is set, then the interrupt is masked at the hardware level and
|
||||
the IRQ_PENDING bit is set. When the interrupt is re-enabled by
|
||||
enable_irq() the pending bit is checked and if it is set, the
|
||||
interrupt is resent either via hardware or by a software resend
|
||||
mechanism. (It's necessary to enable CONFIG_HARDIRQS_SW_RESEND when
|
||||
you want to use the delayed interrupt disable feature and your
|
||||
hardware is not capable of retriggering an interrupt.)
|
||||
The delayed interrupt disable can be runtime enabled, per interrupt,
|
||||
by setting the IRQ_DELAYED_DISABLE flag in the irq_desc status field.
|
||||
</para>
|
||||
</sect2>
|
||||
</sect1>
|
||||
<sect1>
|
||||
<title>Chiplevel hardware encapsulation</title>
|
||||
<para>
|
||||
The chip level hardware descriptor structure irq_chip
|
||||
contains all the direct chip relevant functions, which
|
||||
can be utilized by the irq flow implementations.
|
||||
<itemizedlist>
|
||||
<listitem><para>ack()</para></listitem>
|
||||
<listitem><para>mask_ack() - Optional, recommended for performance</para></listitem>
|
||||
<listitem><para>mask()</para></listitem>
|
||||
<listitem><para>unmask()</para></listitem>
|
||||
<listitem><para>retrigger() - Optional</para></listitem>
|
||||
<listitem><para>set_type() - Optional</para></listitem>
|
||||
<listitem><para>set_wake() - Optional</para></listitem>
|
||||
</itemizedlist>
|
||||
These primitives are strictly intended to mean what they say: ack means
|
||||
ACK, masking means masking of an IRQ line, etc. It is up to the flow
|
||||
handler(s) to use these basic units of lowlevel functionality.
|
||||
</para>
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
||||
<chapter id="doirq">
|
||||
<title>__do_IRQ entry point</title>
|
||||
<para>
|
||||
The original implementation __do_IRQ() is an alternative entry
|
||||
point for all types of interrupts.
|
||||
</para>
|
||||
<para>
|
||||
This handler turned out to be not suitable for all
|
||||
interrupt hardware and was therefore reimplemented with split
|
||||
functionality for egde/level/simple/percpu interrupts. This is not
|
||||
only a functional optimization. It also shortens code paths for
|
||||
interrupts.
|
||||
</para>
|
||||
<para>
|
||||
To make use of the split implementation, replace the call to
|
||||
__do_IRQ by a call to desc->chip->handle_irq() and associate
|
||||
the appropriate handler function to desc->chip->handle_irq().
|
||||
In most cases the generic handler implementations should
|
||||
be sufficient.
|
||||
</para>
|
||||
</chapter>
|
||||
|
||||
<chapter id="locking">
|
||||
<title>Locking on SMP</title>
|
||||
<para>
|
||||
The locking of chip registers is up to the architecture that
|
||||
defines the chip primitives. There is a chip->lock field that can be used
|
||||
for serialization, but the generic layer does not touch it. The per-irq
|
||||
structure is protected via desc->lock, by the generic layer.
|
||||
</para>
|
||||
</chapter>
|
||||
<chapter id="structs">
|
||||
<title>Structures</title>
|
||||
<para>
|
||||
This chapter contains the autogenerated documentation of the structures which are
|
||||
used in the generic IRQ layer.
|
||||
</para>
|
||||
!Iinclude/linux/irq.h
|
||||
</chapter>
|
||||
|
||||
<chapter id="pubfunctions">
|
||||
<title>Public Functions Provided</title>
|
||||
<para>
|
||||
This chapter contains the autogenerated documentation of the kernel API functions
|
||||
which are exported.
|
||||
</para>
|
||||
!Ekernel/irq/manage.c
|
||||
!Ekernel/irq/chip.c
|
||||
</chapter>
|
||||
|
||||
<chapter id="intfunctions">
|
||||
<title>Internal Functions Provided</title>
|
||||
<para>
|
||||
This chapter contains the autogenerated documentation of the internal functions.
|
||||
</para>
|
||||
!Ikernel/irq/handle.c
|
||||
!Ikernel/irq/chip.c
|
||||
</chapter>
|
||||
|
||||
<chapter id="credits">
|
||||
<title>Credits</title>
|
||||
<para>
|
||||
The following people have contributed to this document:
|
||||
<orderedlist>
|
||||
<listitem><para>Thomas Gleixner<email>tglx@linutronix.de</email></para></listitem>
|
||||
<listitem><para>Ingo Molnar<email>mingo@elte.hu</email></para></listitem>
|
||||
</orderedlist>
|
||||
</para>
|
||||
</chapter>
|
||||
</book>
|
||||
573
Documentation/DocBook/kernel-api.tmpl
Normal file
573
Documentation/DocBook/kernel-api.tmpl
Normal file
@ -0,0 +1,573 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
|
||||
"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []>
|
||||
|
||||
<book id="LinuxKernelAPI">
|
||||
<bookinfo>
|
||||
<title>The Linux Kernel API</title>
|
||||
|
||||
<legalnotice>
|
||||
<para>
|
||||
This documentation is free software; you can redistribute
|
||||
it and/or modify it under the terms of the GNU General Public
|
||||
License as published by the Free Software Foundation; either
|
||||
version 2 of the License, or (at your option) any later
|
||||
version.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
This program is distributed in the hope that it will be
|
||||
useful, but WITHOUT ANY WARRANTY; without even the implied
|
||||
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
|
||||
See the GNU General Public License for more details.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
You should have received a copy of the GNU General Public
|
||||
License along with this program; if not, write to the Free
|
||||
Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
|
||||
MA 02111-1307 USA
|
||||
</para>
|
||||
|
||||
<para>
|
||||
For more details see the file COPYING in the source
|
||||
distribution of Linux.
|
||||
</para>
|
||||
</legalnotice>
|
||||
</bookinfo>
|
||||
|
||||
<toc></toc>
|
||||
|
||||
<chapter id="Basics">
|
||||
<title>Driver Basics</title>
|
||||
<sect1><title>Driver Entry and Exit points</title>
|
||||
!Iinclude/linux/init.h
|
||||
</sect1>
|
||||
|
||||
<sect1><title>Atomic and pointer manipulation</title>
|
||||
!Iinclude/asm-i386/atomic.h
|
||||
!Iinclude/asm-i386/unaligned.h
|
||||
</sect1>
|
||||
|
||||
<sect1><title>Delaying, scheduling, and timer routines</title>
|
||||
!Iinclude/linux/sched.h
|
||||
!Ekernel/sched.c
|
||||
!Ekernel/timer.c
|
||||
</sect1>
|
||||
<sect1><title>High-resolution timers</title>
|
||||
!Iinclude/linux/ktime.h
|
||||
!Iinclude/linux/hrtimer.h
|
||||
!Ekernel/hrtimer.c
|
||||
</sect1>
|
||||
<sect1><title>Workqueues and Kevents</title>
|
||||
!Ekernel/workqueue.c
|
||||
</sect1>
|
||||
<sect1><title>Internal Functions</title>
|
||||
!Ikernel/exit.c
|
||||
!Ikernel/signal.c
|
||||
!Iinclude/linux/kthread.h
|
||||
!Ekernel/kthread.c
|
||||
</sect1>
|
||||
|
||||
<sect1><title>Kernel objects manipulation</title>
|
||||
<!--
|
||||
X!Iinclude/linux/kobject.h
|
||||
-->
|
||||
!Elib/kobject.c
|
||||
</sect1>
|
||||
|
||||
<sect1><title>Kernel utility functions</title>
|
||||
!Iinclude/linux/kernel.h
|
||||
!Ekernel/printk.c
|
||||
!Ekernel/panic.c
|
||||
!Ekernel/sys.c
|
||||
!Ekernel/rcupdate.c
|
||||
</sect1>
|
||||
|
||||
</chapter>
|
||||
|
||||
<chapter id="adt">
|
||||
<title>Data Types</title>
|
||||
<sect1><title>Doubly Linked Lists</title>
|
||||
!Iinclude/linux/list.h
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
||||
<chapter id="libc">
|
||||
<title>Basic C Library Functions</title>
|
||||
|
||||
<para>
|
||||
When writing drivers, you cannot in general use routines which are
|
||||
from the C Library. Some of the functions have been found generally
|
||||
useful and they are listed below. The behaviour of these functions
|
||||
may vary slightly from those defined by ANSI, and these deviations
|
||||
are noted in the text.
|
||||
</para>
|
||||
|
||||
<sect1><title>String Conversions</title>
|
||||
!Ilib/vsprintf.c
|
||||
!Elib/vsprintf.c
|
||||
</sect1>
|
||||
<sect1><title>String Manipulation</title>
|
||||
<!-- All functions are exported at now
|
||||
X!Ilib/string.c
|
||||
-->
|
||||
!Elib/string.c
|
||||
</sect1>
|
||||
<sect1><title>Bit Operations</title>
|
||||
!Iinclude/asm-i386/bitops.h
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
||||
<chapter id="kernel-lib">
|
||||
<title>Basic Kernel Library Functions</title>
|
||||
|
||||
<para>
|
||||
The Linux kernel provides more basic utility functions.
|
||||
</para>
|
||||
|
||||
<sect1><title>Bitmap Operations</title>
|
||||
!Elib/bitmap.c
|
||||
!Ilib/bitmap.c
|
||||
</sect1>
|
||||
|
||||
<sect1><title>Command-line Parsing</title>
|
||||
!Elib/cmdline.c
|
||||
</sect1>
|
||||
|
||||
<sect1><title>CRC Functions</title>
|
||||
!Elib/crc16.c
|
||||
!Elib/crc32.c
|
||||
!Elib/crc-ccitt.c
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
||||
<chapter id="mm">
|
||||
<title>Memory Management in Linux</title>
|
||||
<sect1><title>The Slab Cache</title>
|
||||
!Iinclude/linux/slab.h
|
||||
!Emm/slab.c
|
||||
</sect1>
|
||||
<sect1><title>User Space Memory Access</title>
|
||||
!Iinclude/asm-i386/uaccess.h
|
||||
!Earch/i386/lib/usercopy.c
|
||||
</sect1>
|
||||
<sect1><title>More Memory Management Functions</title>
|
||||
!Iinclude/linux/rmap.h
|
||||
!Emm/readahead.c
|
||||
!Emm/filemap.c
|
||||
!Emm/memory.c
|
||||
!Emm/vmalloc.c
|
||||
!Imm/page_alloc.c
|
||||
!Emm/mempool.c
|
||||
!Emm/page-writeback.c
|
||||
!Emm/truncate.c
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
||||
|
||||
<chapter id="ipc">
|
||||
<title>Kernel IPC facilities</title>
|
||||
|
||||
<sect1><title>IPC utilities</title>
|
||||
!Iipc/util.c
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
||||
<chapter id="kfifo">
|
||||
<title>FIFO Buffer</title>
|
||||
<sect1><title>kfifo interface</title>
|
||||
!Iinclude/linux/kfifo.h
|
||||
!Ekernel/kfifo.c
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
||||
<chapter id="relayfs">
|
||||
<title>relay interface support</title>
|
||||
|
||||
<para>
|
||||
Relay interface support
|
||||
is designed to provide an efficient mechanism for tools and
|
||||
facilities to relay large amounts of data from kernel space to
|
||||
user space.
|
||||
</para>
|
||||
|
||||
<sect1><title>relay interface</title>
|
||||
!Ekernel/relay.c
|
||||
!Ikernel/relay.c
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
||||
<chapter id="netcore">
|
||||
<title>Linux Networking</title>
|
||||
<sect1><title>Networking Base Types</title>
|
||||
!Iinclude/linux/net.h
|
||||
</sect1>
|
||||
<sect1><title>Socket Buffer Functions</title>
|
||||
!Iinclude/linux/skbuff.h
|
||||
!Iinclude/net/sock.h
|
||||
!Enet/socket.c
|
||||
!Enet/core/skbuff.c
|
||||
!Enet/core/sock.c
|
||||
!Enet/core/datagram.c
|
||||
!Enet/core/stream.c
|
||||
</sect1>
|
||||
<sect1><title>Socket Filter</title>
|
||||
!Enet/core/filter.c
|
||||
</sect1>
|
||||
<sect1><title>Generic Network Statistics</title>
|
||||
!Iinclude/linux/gen_stats.h
|
||||
!Enet/core/gen_stats.c
|
||||
!Enet/core/gen_estimator.c
|
||||
</sect1>
|
||||
<sect1><title>SUN RPC subsystem</title>
|
||||
<!-- The !D functionality is not perfect, garbage has to be protected by comments
|
||||
!Dnet/sunrpc/sunrpc_syms.c
|
||||
-->
|
||||
!Enet/sunrpc/xdr.c
|
||||
!Enet/sunrpc/svcsock.c
|
||||
!Enet/sunrpc/sched.c
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
||||
<chapter id="netdev">
|
||||
<title>Network device support</title>
|
||||
<sect1><title>Driver Support</title>
|
||||
!Enet/core/dev.c
|
||||
!Enet/ethernet/eth.c
|
||||
!Iinclude/linux/etherdevice.h
|
||||
<!-- FIXME: Removed for now since no structured comments in source
|
||||
X!Enet/core/wireless.c
|
||||
-->
|
||||
</sect1>
|
||||
<sect1><title>Synchronous PPP</title>
|
||||
!Edrivers/net/wan/syncppp.c
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
||||
<chapter id="modload">
|
||||
<title>Module Support</title>
|
||||
<sect1><title>Module Loading</title>
|
||||
!Ekernel/kmod.c
|
||||
</sect1>
|
||||
<sect1><title>Inter Module support</title>
|
||||
<para>
|
||||
Refer to the file kernel/module.c for more information.
|
||||
</para>
|
||||
<!-- FIXME: Removed for now since no structured comments in source
|
||||
X!Ekernel/module.c
|
||||
-->
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
||||
<chapter id="hardware">
|
||||
<title>Hardware Interfaces</title>
|
||||
<sect1><title>Interrupt Handling</title>
|
||||
!Ekernel/irq/manage.c
|
||||
</sect1>
|
||||
|
||||
<sect1><title>DMA Channels</title>
|
||||
!Ekernel/dma.c
|
||||
</sect1>
|
||||
|
||||
<sect1><title>Resources Management</title>
|
||||
!Ikernel/resource.c
|
||||
!Ekernel/resource.c
|
||||
</sect1>
|
||||
|
||||
<sect1><title>MTRR Handling</title>
|
||||
!Earch/i386/kernel/cpu/mtrr/main.c
|
||||
</sect1>
|
||||
|
||||
<sect1><title>PCI Support Library</title>
|
||||
!Edrivers/pci/pci.c
|
||||
!Edrivers/pci/pci-driver.c
|
||||
!Edrivers/pci/remove.c
|
||||
!Edrivers/pci/pci-acpi.c
|
||||
!Edrivers/pci/search.c
|
||||
!Edrivers/pci/msi.c
|
||||
!Edrivers/pci/bus.c
|
||||
<!-- FIXME: Removed for now since no structured comments in source
|
||||
X!Edrivers/pci/hotplug.c
|
||||
-->
|
||||
!Edrivers/pci/probe.c
|
||||
!Edrivers/pci/rom.c
|
||||
</sect1>
|
||||
<sect1><title>PCI Hotplug Support Library</title>
|
||||
!Edrivers/pci/hotplug/pci_hotplug_core.c
|
||||
</sect1>
|
||||
<sect1><title>MCA Architecture</title>
|
||||
<sect2><title>MCA Device Functions</title>
|
||||
<para>
|
||||
Refer to the file arch/i386/kernel/mca.c for more information.
|
||||
</para>
|
||||
<!-- FIXME: Removed for now since no structured comments in source
|
||||
X!Earch/i386/kernel/mca.c
|
||||
-->
|
||||
</sect2>
|
||||
<sect2><title>MCA Bus DMA</title>
|
||||
!Iinclude/asm-i386/mca_dma.h
|
||||
</sect2>
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
||||
<chapter id="firmware">
|
||||
<title>Firmware Interfaces</title>
|
||||
<sect1><title>DMI Interfaces</title>
|
||||
!Edrivers/firmware/dmi_scan.c
|
||||
</sect1>
|
||||
<sect1><title>EDD Interfaces</title>
|
||||
!Idrivers/firmware/edd.c
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
||||
<chapter id="security">
|
||||
<title>Security Framework</title>
|
||||
!Esecurity/security.c
|
||||
</chapter>
|
||||
|
||||
<chapter id="audit">
|
||||
<title>Audit Interfaces</title>
|
||||
!Ekernel/audit.c
|
||||
!Ikernel/auditsc.c
|
||||
!Ikernel/auditfilter.c
|
||||
</chapter>
|
||||
|
||||
<chapter id="accounting">
|
||||
<title>Accounting Framework</title>
|
||||
!Ikernel/acct.c
|
||||
</chapter>
|
||||
|
||||
<chapter id="pmfuncs">
|
||||
<title>Power Management</title>
|
||||
!Ekernel/power/pm.c
|
||||
</chapter>
|
||||
|
||||
<chapter id="devdrivers">
|
||||
<title>Device drivers infrastructure</title>
|
||||
<sect1><title>Device Drivers Base</title>
|
||||
<!--
|
||||
X!Iinclude/linux/device.h
|
||||
-->
|
||||
!Edrivers/base/driver.c
|
||||
!Edrivers/base/core.c
|
||||
!Edrivers/base/class.c
|
||||
!Edrivers/base/firmware_class.c
|
||||
!Edrivers/base/transport_class.c
|
||||
!Edrivers/base/dmapool.c
|
||||
<!-- Cannot be included, because
|
||||
attribute_container_add_class_device_adapter
|
||||
and attribute_container_classdev_to_container
|
||||
exceed allowed 44 characters maximum
|
||||
X!Edrivers/base/attribute_container.c
|
||||
-->
|
||||
!Edrivers/base/sys.c
|
||||
<!--
|
||||
X!Edrivers/base/interface.c
|
||||
-->
|
||||
!Edrivers/base/platform.c
|
||||
!Edrivers/base/bus.c
|
||||
</sect1>
|
||||
<sect1><title>Device Drivers Power Management</title>
|
||||
!Edrivers/base/power/main.c
|
||||
!Edrivers/base/power/resume.c
|
||||
!Edrivers/base/power/suspend.c
|
||||
</sect1>
|
||||
<sect1><title>Device Drivers ACPI Support</title>
|
||||
<!-- Internal functions only
|
||||
X!Edrivers/acpi/sleep/main.c
|
||||
X!Edrivers/acpi/sleep/wakeup.c
|
||||
X!Edrivers/acpi/motherboard.c
|
||||
X!Edrivers/acpi/bus.c
|
||||
-->
|
||||
!Edrivers/acpi/scan.c
|
||||
!Idrivers/acpi/scan.c
|
||||
<!-- No correct structured comments
|
||||
X!Edrivers/acpi/pci_bind.c
|
||||
-->
|
||||
</sect1>
|
||||
<sect1><title>Device drivers PnP support</title>
|
||||
!Edrivers/pnp/core.c
|
||||
<!-- No correct structured comments
|
||||
X!Edrivers/pnp/system.c
|
||||
-->
|
||||
!Edrivers/pnp/card.c
|
||||
!Edrivers/pnp/driver.c
|
||||
!Edrivers/pnp/manager.c
|
||||
!Edrivers/pnp/support.c
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
||||
<chapter id="blkdev">
|
||||
<title>Block Devices</title>
|
||||
!Eblock/ll_rw_blk.c
|
||||
</chapter>
|
||||
|
||||
<chapter id="chrdev">
|
||||
<title>Char devices</title>
|
||||
!Efs/char_dev.c
|
||||
</chapter>
|
||||
|
||||
<chapter id="miscdev">
|
||||
<title>Miscellaneous Devices</title>
|
||||
!Edrivers/char/misc.c
|
||||
</chapter>
|
||||
|
||||
<chapter id="parportdev">
|
||||
<title>Parallel Port Devices</title>
|
||||
!Iinclude/linux/parport.h
|
||||
!Edrivers/parport/ieee1284.c
|
||||
!Edrivers/parport/share.c
|
||||
!Idrivers/parport/daisy.c
|
||||
</chapter>
|
||||
|
||||
<chapter id="message_devices">
|
||||
<title>Message-based devices</title>
|
||||
<sect1><title>Fusion message devices</title>
|
||||
!Edrivers/message/fusion/mptbase.c
|
||||
!Idrivers/message/fusion/mptbase.c
|
||||
!Edrivers/message/fusion/mptscsih.c
|
||||
!Idrivers/message/fusion/mptscsih.c
|
||||
!Idrivers/message/fusion/mptctl.c
|
||||
!Idrivers/message/fusion/mptspi.c
|
||||
!Idrivers/message/fusion/mptfc.c
|
||||
!Idrivers/message/fusion/mptlan.c
|
||||
</sect1>
|
||||
<sect1><title>I2O message devices</title>
|
||||
!Iinclude/linux/i2o.h
|
||||
!Idrivers/message/i2o/core.h
|
||||
!Edrivers/message/i2o/iop.c
|
||||
!Idrivers/message/i2o/iop.c
|
||||
!Idrivers/message/i2o/config-osm.c
|
||||
!Edrivers/message/i2o/exec-osm.c
|
||||
!Idrivers/message/i2o/exec-osm.c
|
||||
!Idrivers/message/i2o/bus-osm.c
|
||||
!Edrivers/message/i2o/device.c
|
||||
!Idrivers/message/i2o/device.c
|
||||
!Idrivers/message/i2o/driver.c
|
||||
!Idrivers/message/i2o/pci.c
|
||||
!Idrivers/message/i2o/i2o_block.c
|
||||
!Idrivers/message/i2o/i2o_scsi.c
|
||||
!Idrivers/message/i2o/i2o_proc.c
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
||||
<chapter id="snddev">
|
||||
<title>Sound Devices</title>
|
||||
!Iinclude/sound/core.h
|
||||
!Esound/sound_core.c
|
||||
!Iinclude/sound/pcm.h
|
||||
!Esound/core/pcm.c
|
||||
!Esound/core/device.c
|
||||
!Esound/core/info.c
|
||||
!Esound/core/rawmidi.c
|
||||
!Esound/core/sound.c
|
||||
!Esound/core/memory.c
|
||||
!Esound/core/pcm_memory.c
|
||||
!Esound/core/init.c
|
||||
!Esound/core/isadma.c
|
||||
!Esound/core/control.c
|
||||
!Esound/core/pcm_lib.c
|
||||
!Esound/core/hwdep.c
|
||||
!Esound/core/pcm_native.c
|
||||
!Esound/core/memalloc.c
|
||||
<!-- FIXME: Removed for now since no structured comments in source
|
||||
X!Isound/sound_firmware.c
|
||||
-->
|
||||
</chapter>
|
||||
|
||||
<chapter id="uart16x50">
|
||||
<title>16x50 UART Driver</title>
|
||||
!Iinclude/linux/serial_core.h
|
||||
!Edrivers/serial/serial_core.c
|
||||
!Edrivers/serial/8250.c
|
||||
</chapter>
|
||||
|
||||
<chapter id="z85230">
|
||||
<title>Z85230 Support Library</title>
|
||||
!Edrivers/net/wan/z85230.c
|
||||
</chapter>
|
||||
|
||||
<chapter id="fbdev">
|
||||
<title>Frame Buffer Library</title>
|
||||
|
||||
<para>
|
||||
The frame buffer drivers depend heavily on four data structures.
|
||||
These structures are declared in include/linux/fb.h. They are
|
||||
fb_info, fb_var_screeninfo, fb_fix_screeninfo and fb_monospecs.
|
||||
The last three can be made available to and from userland.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
fb_info defines the current state of a particular video card.
|
||||
Inside fb_info, there exists a fb_ops structure which is a
|
||||
collection of needed functions to make fbdev and fbcon work.
|
||||
fb_info is only visible to the kernel.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
fb_var_screeninfo is used to describe the features of a video card
|
||||
that are user defined. With fb_var_screeninfo, things such as
|
||||
depth and the resolution may be defined.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
The next structure is fb_fix_screeninfo. This defines the
|
||||
properties of a card that are created when a mode is set and can't
|
||||
be changed otherwise. A good example of this is the start of the
|
||||
frame buffer memory. This "locks" the address of the frame buffer
|
||||
memory, so that it cannot be changed or moved.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
The last structure is fb_monospecs. In the old API, there was
|
||||
little importance for fb_monospecs. This allowed for forbidden things
|
||||
such as setting a mode of 800x600 on a fix frequency monitor. With
|
||||
the new API, fb_monospecs prevents such things, and if used
|
||||
correctly, can prevent a monitor from being cooked. fb_monospecs
|
||||
will not be useful until kernels 2.5.x.
|
||||
</para>
|
||||
|
||||
<sect1><title>Frame Buffer Memory</title>
|
||||
!Edrivers/video/fbmem.c
|
||||
</sect1>
|
||||
<!--
|
||||
<sect1><title>Frame Buffer Console</title>
|
||||
X!Edrivers/video/console/fbcon.c
|
||||
</sect1>
|
||||
-->
|
||||
<sect1><title>Frame Buffer Colormap</title>
|
||||
!Edrivers/video/fbcmap.c
|
||||
</sect1>
|
||||
<!-- FIXME:
|
||||
drivers/video/fbgen.c has no docs, which stuffs up the sgml. Comment
|
||||
out until somebody adds docs. KAO
|
||||
<sect1><title>Frame Buffer Generic Functions</title>
|
||||
X!Idrivers/video/fbgen.c
|
||||
</sect1>
|
||||
KAO -->
|
||||
<sect1><title>Frame Buffer Video Mode Database</title>
|
||||
!Idrivers/video/modedb.c
|
||||
!Edrivers/video/modedb.c
|
||||
</sect1>
|
||||
<sect1><title>Frame Buffer Macintosh Video Mode Database</title>
|
||||
!Edrivers/video/macmodes.c
|
||||
</sect1>
|
||||
<sect1><title>Frame Buffer Fonts</title>
|
||||
<para>
|
||||
Refer to the file drivers/video/console/fonts.c for more information.
|
||||
</para>
|
||||
<!-- FIXME: Removed for now since no structured comments in source
|
||||
X!Idrivers/video/console/fonts.c
|
||||
-->
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
||||
<chapter id="input_subsystem">
|
||||
<title>Input Subsystem</title>
|
||||
!Iinclude/linux/input.h
|
||||
!Edrivers/input/input.c
|
||||
!Edrivers/input/ff-core.c
|
||||
!Edrivers/input/ff-memless.c
|
||||
</chapter>
|
||||
</book>
|
||||
1327
Documentation/DocBook/kernel-hacking.tmpl
Normal file
1327
Documentation/DocBook/kernel-hacking.tmpl
Normal file
File diff suppressed because it is too large
Load Diff
2094
Documentation/DocBook/kernel-locking.tmpl
Normal file
2094
Documentation/DocBook/kernel-locking.tmpl
Normal file
File diff suppressed because it is too large
Load Diff
1631
Documentation/DocBook/libata.tmpl
Normal file
1631
Documentation/DocBook/libata.tmpl
Normal file
File diff suppressed because it is too large
Load Diff
289
Documentation/DocBook/librs.tmpl
Normal file
289
Documentation/DocBook/librs.tmpl
Normal file
@ -0,0 +1,289 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
|
||||
"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []>
|
||||
|
||||
<book id="Reed-Solomon-Library-Guide">
|
||||
<bookinfo>
|
||||
<title>Reed-Solomon Library Programming Interface</title>
|
||||
|
||||
<authorgroup>
|
||||
<author>
|
||||
<firstname>Thomas</firstname>
|
||||
<surname>Gleixner</surname>
|
||||
<affiliation>
|
||||
<address>
|
||||
<email>tglx@linutronix.de</email>
|
||||
</address>
|
||||
</affiliation>
|
||||
</author>
|
||||
</authorgroup>
|
||||
|
||||
<copyright>
|
||||
<year>2004</year>
|
||||
<holder>Thomas Gleixner</holder>
|
||||
</copyright>
|
||||
|
||||
<legalnotice>
|
||||
<para>
|
||||
This documentation is free software; you can redistribute
|
||||
it and/or modify it under the terms of the GNU General Public
|
||||
License version 2 as published by the Free Software Foundation.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
This program is distributed in the hope that it will be
|
||||
useful, but WITHOUT ANY WARRANTY; without even the implied
|
||||
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
|
||||
See the GNU General Public License for more details.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
You should have received a copy of the GNU General Public
|
||||
License along with this program; if not, write to the Free
|
||||
Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
|
||||
MA 02111-1307 USA
|
||||
</para>
|
||||
|
||||
<para>
|
||||
For more details see the file COPYING in the source
|
||||
distribution of Linux.
|
||||
</para>
|
||||
</legalnotice>
|
||||
</bookinfo>
|
||||
|
||||
<toc></toc>
|
||||
|
||||
<chapter id="intro">
|
||||
<title>Introduction</title>
|
||||
<para>
|
||||
The generic Reed-Solomon Library provides encoding, decoding
|
||||
and error correction functions.
|
||||
</para>
|
||||
<para>
|
||||
Reed-Solomon codes are used in communication and storage
|
||||
applications to ensure data integrity.
|
||||
</para>
|
||||
<para>
|
||||
This documentation is provided for developers who want to utilize
|
||||
the functions provided by the library.
|
||||
</para>
|
||||
</chapter>
|
||||
|
||||
<chapter id="bugs">
|
||||
<title>Known Bugs And Assumptions</title>
|
||||
<para>
|
||||
None.
|
||||
</para>
|
||||
</chapter>
|
||||
|
||||
<chapter id="usage">
|
||||
<title>Usage</title>
|
||||
<para>
|
||||
This chapter provides examples how to use the library.
|
||||
</para>
|
||||
<sect1>
|
||||
<title>Initializing</title>
|
||||
<para>
|
||||
The init function init_rs returns a pointer to a
|
||||
rs decoder structure, which holds the necessary
|
||||
information for encoding, decoding and error correction
|
||||
with the given polynomial. It either uses an existing
|
||||
matching decoder or creates a new one. On creation all
|
||||
the lookup tables for fast en/decoding are created.
|
||||
The function may take a while, so make sure not to
|
||||
call it in critical code paths.
|
||||
</para>
|
||||
<programlisting>
|
||||
/* the Reed Solomon control structure */
|
||||
static struct rs_control *rs_decoder;
|
||||
|
||||
/* Symbolsize is 10 (bits)
|
||||
* Primitve polynomial is x^10+x^3+1
|
||||
* first consecutive root is 0
|
||||
* primitve element to generate roots = 1
|
||||
* generator polinomial degree (number of roots) = 6
|
||||
*/
|
||||
rs_decoder = init_rs (10, 0x409, 0, 1, 6);
|
||||
</programlisting>
|
||||
</sect1>
|
||||
<sect1>
|
||||
<title>Encoding</title>
|
||||
<para>
|
||||
The encoder calculates the Reed-Solomon code over
|
||||
the given data length and stores the result in
|
||||
the parity buffer. Note that the parity buffer must
|
||||
be initialized before calling the encoder.
|
||||
</para>
|
||||
<para>
|
||||
The expanded data can be inverted on the fly by
|
||||
providing a non zero inversion mask. The expanded data is
|
||||
XOR'ed with the mask. This is used e.g. for FLASH
|
||||
ECC, where the all 0xFF is inverted to an all 0x00.
|
||||
The Reed-Solomon code for all 0x00 is all 0x00. The
|
||||
code is inverted before storing to FLASH so it is 0xFF
|
||||
too. This prevent's that reading from an erased FLASH
|
||||
results in ECC errors.
|
||||
</para>
|
||||
<para>
|
||||
The databytes are expanded to the given symbol size
|
||||
on the fly. There is no support for encoding continuous
|
||||
bitstreams with a symbol size != 8 at the moment. If
|
||||
it is necessary it should be not a big deal to implement
|
||||
such functionality.
|
||||
</para>
|
||||
<programlisting>
|
||||
/* Parity buffer. Size = number of roots */
|
||||
uint16_t par[6];
|
||||
/* Initialize the parity buffer */
|
||||
memset(par, 0, sizeof(par));
|
||||
/* Encode 512 byte in data8. Store parity in buffer par */
|
||||
encode_rs8 (rs_decoder, data8, 512, par, 0);
|
||||
</programlisting>
|
||||
</sect1>
|
||||
<sect1>
|
||||
<title>Decoding</title>
|
||||
<para>
|
||||
The decoder calculates the syndrome over
|
||||
the given data length and the received parity symbols
|
||||
and corrects errors in the data.
|
||||
</para>
|
||||
<para>
|
||||
If a syndrome is available from a hardware decoder
|
||||
then the syndrome calculation is skipped.
|
||||
</para>
|
||||
<para>
|
||||
The correction of the data buffer can be suppressed
|
||||
by providing a correction pattern buffer and an error
|
||||
location buffer to the decoder. The decoder stores the
|
||||
calculated error location and the correction bitmask
|
||||
in the given buffers. This is useful for hardware
|
||||
decoders which use a weird bit ordering scheme.
|
||||
</para>
|
||||
<para>
|
||||
The databytes are expanded to the given symbol size
|
||||
on the fly. There is no support for decoding continuous
|
||||
bitstreams with a symbolsize != 8 at the moment. If
|
||||
it is necessary it should be not a big deal to implement
|
||||
such functionality.
|
||||
</para>
|
||||
|
||||
<sect2>
|
||||
<title>
|
||||
Decoding with syndrome calculation, direct data correction
|
||||
</title>
|
||||
<programlisting>
|
||||
/* Parity buffer. Size = number of roots */
|
||||
uint16_t par[6];
|
||||
uint8_t data[512];
|
||||
int numerr;
|
||||
/* Receive data */
|
||||
.....
|
||||
/* Receive parity */
|
||||
.....
|
||||
/* Decode 512 byte in data8.*/
|
||||
numerr = decode_rs8 (rs_decoder, data8, par, 512, NULL, 0, NULL, 0, NULL);
|
||||
</programlisting>
|
||||
</sect2>
|
||||
|
||||
<sect2>
|
||||
<title>
|
||||
Decoding with syndrome given by hardware decoder, direct data correction
|
||||
</title>
|
||||
<programlisting>
|
||||
/* Parity buffer. Size = number of roots */
|
||||
uint16_t par[6], syn[6];
|
||||
uint8_t data[512];
|
||||
int numerr;
|
||||
/* Receive data */
|
||||
.....
|
||||
/* Receive parity */
|
||||
.....
|
||||
/* Get syndrome from hardware decoder */
|
||||
.....
|
||||
/* Decode 512 byte in data8.*/
|
||||
numerr = decode_rs8 (rs_decoder, data8, par, 512, syn, 0, NULL, 0, NULL);
|
||||
</programlisting>
|
||||
</sect2>
|
||||
|
||||
<sect2>
|
||||
<title>
|
||||
Decoding with syndrome given by hardware decoder, no direct data correction.
|
||||
</title>
|
||||
<para>
|
||||
Note: It's not necessary to give data and received parity to the decoder.
|
||||
</para>
|
||||
<programlisting>
|
||||
/* Parity buffer. Size = number of roots */
|
||||
uint16_t par[6], syn[6], corr[8];
|
||||
uint8_t data[512];
|
||||
int numerr, errpos[8];
|
||||
/* Receive data */
|
||||
.....
|
||||
/* Receive parity */
|
||||
.....
|
||||
/* Get syndrome from hardware decoder */
|
||||
.....
|
||||
/* Decode 512 byte in data8.*/
|
||||
numerr = decode_rs8 (rs_decoder, NULL, NULL, 512, syn, 0, errpos, 0, corr);
|
||||
for (i = 0; i < numerr; i++) {
|
||||
do_error_correction_in_your_buffer(errpos[i], corr[i]);
|
||||
}
|
||||
</programlisting>
|
||||
</sect2>
|
||||
</sect1>
|
||||
<sect1>
|
||||
<title>Cleanup</title>
|
||||
<para>
|
||||
The function free_rs frees the allocated resources,
|
||||
if the caller is the last user of the decoder.
|
||||
</para>
|
||||
<programlisting>
|
||||
/* Release resources */
|
||||
free_rs(rs_decoder);
|
||||
</programlisting>
|
||||
</sect1>
|
||||
|
||||
</chapter>
|
||||
|
||||
<chapter id="structs">
|
||||
<title>Structures</title>
|
||||
<para>
|
||||
This chapter contains the autogenerated documentation of the structures which are
|
||||
used in the Reed-Solomon Library and are relevant for a developer.
|
||||
</para>
|
||||
!Iinclude/linux/rslib.h
|
||||
</chapter>
|
||||
|
||||
<chapter id="pubfunctions">
|
||||
<title>Public Functions Provided</title>
|
||||
<para>
|
||||
This chapter contains the autogenerated documentation of the Reed-Solomon functions
|
||||
which are exported.
|
||||
</para>
|
||||
!Elib/reed_solomon/reed_solomon.c
|
||||
</chapter>
|
||||
|
||||
<chapter id="credits">
|
||||
<title>Credits</title>
|
||||
<para>
|
||||
The library code for encoding and decoding was written by Phil Karn.
|
||||
</para>
|
||||
<programlisting>
|
||||
Copyright 2002, Phil Karn, KA9Q
|
||||
May be used under the terms of the GNU General Public License (GPL)
|
||||
</programlisting>
|
||||
<para>
|
||||
The wrapper functions and interfaces are written by Thomas Gleixner
|
||||
</para>
|
||||
<para>
|
||||
Many users have provided bugfixes, improvements and helping hands for testing.
|
||||
Thanks a lot.
|
||||
</para>
|
||||
<para>
|
||||
The following people have contributed to this document:
|
||||
</para>
|
||||
<para>
|
||||
Thomas Gleixner<email>tglx@linutronix.de</email>
|
||||
</para>
|
||||
</chapter>
|
||||
</book>
|
||||
265
Documentation/DocBook/lsm.tmpl
Normal file
265
Documentation/DocBook/lsm.tmpl
Normal file
@ -0,0 +1,265 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
|
||||
"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []>
|
||||
|
||||
<article class="whitepaper" id="LinuxSecurityModule" lang="en">
|
||||
<articleinfo>
|
||||
<title>Linux Security Modules: General Security Hooks for Linux</title>
|
||||
<authorgroup>
|
||||
<author>
|
||||
<firstname>Stephen</firstname>
|
||||
<surname>Smalley</surname>
|
||||
<affiliation>
|
||||
<orgname>NAI Labs</orgname>
|
||||
<address><email>ssmalley@nai.com</email></address>
|
||||
</affiliation>
|
||||
</author>
|
||||
<author>
|
||||
<firstname>Timothy</firstname>
|
||||
<surname>Fraser</surname>
|
||||
<affiliation>
|
||||
<orgname>NAI Labs</orgname>
|
||||
<address><email>tfraser@nai.com</email></address>
|
||||
</affiliation>
|
||||
</author>
|
||||
<author>
|
||||
<firstname>Chris</firstname>
|
||||
<surname>Vance</surname>
|
||||
<affiliation>
|
||||
<orgname>NAI Labs</orgname>
|
||||
<address><email>cvance@nai.com</email></address>
|
||||
</affiliation>
|
||||
</author>
|
||||
</authorgroup>
|
||||
</articleinfo>
|
||||
|
||||
<sect1><title>Introduction</title>
|
||||
|
||||
<para>
|
||||
In March 2001, the National Security Agency (NSA) gave a presentation
|
||||
about Security-Enhanced Linux (SELinux) at the 2.5 Linux Kernel
|
||||
Summit. SELinux is an implementation of flexible and fine-grained
|
||||
nondiscretionary access controls in the Linux kernel, originally
|
||||
implemented as its own particular kernel patch. Several other
|
||||
security projects (e.g. RSBAC, Medusa) have also developed flexible
|
||||
access control architectures for the Linux kernel, and various
|
||||
projects have developed particular access control models for Linux
|
||||
(e.g. LIDS, DTE, SubDomain). Each project has developed and
|
||||
maintained its own kernel patch to support its security needs.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
In response to the NSA presentation, Linus Torvalds made a set of
|
||||
remarks that described a security framework he would be willing to
|
||||
consider for inclusion in the mainstream Linux kernel. He described a
|
||||
general framework that would provide a set of security hooks to
|
||||
control operations on kernel objects and a set of opaque security
|
||||
fields in kernel data structures for maintaining security attributes.
|
||||
This framework could then be used by loadable kernel modules to
|
||||
implement any desired model of security. Linus also suggested the
|
||||
possibility of migrating the Linux capabilities code into such a
|
||||
module.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
The Linux Security Modules (LSM) project was started by WireX to
|
||||
develop such a framework. LSM is a joint development effort by
|
||||
several security projects, including Immunix, SELinux, SGI and Janus,
|
||||
and several individuals, including Greg Kroah-Hartman and James
|
||||
Morris, to develop a Linux kernel patch that implements this
|
||||
framework. The patch is currently tracking the 2.4 series and is
|
||||
targeted for integration into the 2.5 development series. This
|
||||
technical report provides an overview of the framework and the example
|
||||
capabilities security module provided by the LSM kernel patch.
|
||||
</para>
|
||||
|
||||
</sect1>
|
||||
|
||||
<sect1 id="framework"><title>LSM Framework</title>
|
||||
|
||||
<para>
|
||||
The LSM kernel patch provides a general kernel framework to support
|
||||
security modules. In particular, the LSM framework is primarily
|
||||
focused on supporting access control modules, although future
|
||||
development is likely to address other security needs such as
|
||||
auditing. By itself, the framework does not provide any additional
|
||||
security; it merely provides the infrastructure to support security
|
||||
modules. The LSM kernel patch also moves most of the capabilities
|
||||
logic into an optional security module, with the system defaulting
|
||||
to the traditional superuser logic. This capabilities module
|
||||
is discussed further in <xref linkend="cap"/>.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
The LSM kernel patch adds security fields to kernel data structures
|
||||
and inserts calls to hook functions at critical points in the kernel
|
||||
code to manage the security fields and to perform access control. It
|
||||
also adds functions for registering and unregistering security
|
||||
modules, and adds a general <function>security</function> system call
|
||||
to support new system calls for security-aware applications.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
The LSM security fields are simply <type>void*</type> pointers. For
|
||||
process and program execution security information, security fields
|
||||
were added to <structname>struct task_struct</structname> and
|
||||
<structname>struct linux_binprm</structname>. For filesystem security
|
||||
information, a security field was added to
|
||||
<structname>struct super_block</structname>. For pipe, file, and socket
|
||||
security information, security fields were added to
|
||||
<structname>struct inode</structname> and
|
||||
<structname>struct file</structname>. For packet and network device security
|
||||
information, security fields were added to
|
||||
<structname>struct sk_buff</structname> and
|
||||
<structname>struct net_device</structname>. For System V IPC security
|
||||
information, security fields were added to
|
||||
<structname>struct kern_ipc_perm</structname> and
|
||||
<structname>struct msg_msg</structname>; additionally, the definitions
|
||||
for <structname>struct msg_msg</structname>, <structname>struct
|
||||
msg_queue</structname>, and <structname>struct
|
||||
shmid_kernel</structname> were moved to header files
|
||||
(<filename>include/linux/msg.h</filename> and
|
||||
<filename>include/linux/shm.h</filename> as appropriate) to allow
|
||||
the security modules to use these definitions.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Each LSM hook is a function pointer in a global table,
|
||||
security_ops. This table is a
|
||||
<structname>security_operations</structname> structure as defined by
|
||||
<filename>include/linux/security.h</filename>. Detailed documentation
|
||||
for each hook is included in this header file. At present, this
|
||||
structure consists of a collection of substructures that group related
|
||||
hooks based on the kernel object (e.g. task, inode, file, sk_buff,
|
||||
etc) as well as some top-level hook function pointers for system
|
||||
operations. This structure is likely to be flattened in the future
|
||||
for performance. The placement of the hook calls in the kernel code
|
||||
is described by the "called:" lines in the per-hook documentation in
|
||||
the header file. The hook calls can also be easily found in the
|
||||
kernel code by looking for the string "security_ops->".
|
||||
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Linus mentioned per-process security hooks in his original remarks as a
|
||||
possible alternative to global security hooks. However, if LSM were
|
||||
to start from the perspective of per-process hooks, then the base
|
||||
framework would have to deal with how to handle operations that
|
||||
involve multiple processes (e.g. kill), since each process might have
|
||||
its own hook for controlling the operation. This would require a
|
||||
general mechanism for composing hooks in the base framework.
|
||||
Additionally, LSM would still need global hooks for operations that
|
||||
have no process context (e.g. network input operations).
|
||||
Consequently, LSM provides global security hooks, but a security
|
||||
module is free to implement per-process hooks (where that makes sense)
|
||||
by storing a security_ops table in each process' security field and
|
||||
then invoking these per-process hooks from the global hooks.
|
||||
The problem of composition is thus deferred to the module.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
The global security_ops table is initialized to a set of hook
|
||||
functions provided by a dummy security module that provides
|
||||
traditional superuser logic. A <function>register_security</function>
|
||||
function (in <filename>security/security.c</filename>) is provided to
|
||||
allow a security module to set security_ops to refer to its own hook
|
||||
functions, and an <function>unregister_security</function> function is
|
||||
provided to revert security_ops to the dummy module hooks. This
|
||||
mechanism is used to set the primary security module, which is
|
||||
responsible for making the final decision for each hook.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
LSM also provides a simple mechanism for stacking additional security
|
||||
modules with the primary security module. It defines
|
||||
<function>register_security</function> and
|
||||
<function>unregister_security</function> hooks in the
|
||||
<structname>security_operations</structname> structure and provides
|
||||
<function>mod_reg_security</function> and
|
||||
<function>mod_unreg_security</function> functions that invoke these
|
||||
hooks after performing some sanity checking. A security module can
|
||||
call these functions in order to stack with other modules. However,
|
||||
the actual details of how this stacking is handled are deferred to the
|
||||
module, which can implement these hooks in any way it wishes
|
||||
(including always returning an error if it does not wish to support
|
||||
stacking). In this manner, LSM again defers the problem of
|
||||
composition to the module.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Although the LSM hooks are organized into substructures based on
|
||||
kernel object, all of the hooks can be viewed as falling into two
|
||||
major categories: hooks that are used to manage the security fields
|
||||
and hooks that are used to perform access control. Examples of the
|
||||
first category of hooks include the
|
||||
<function>alloc_security</function> and
|
||||
<function>free_security</function> hooks defined for each kernel data
|
||||
structure that has a security field. These hooks are used to allocate
|
||||
and free security structures for kernel objects. The first category
|
||||
of hooks also includes hooks that set information in the security
|
||||
field after allocation, such as the <function>post_lookup</function>
|
||||
hook in <structname>struct inode_security_ops</structname>. This hook
|
||||
is used to set security information for inodes after successful lookup
|
||||
operations. An example of the second category of hooks is the
|
||||
<function>permission</function> hook in
|
||||
<structname>struct inode_security_ops</structname>. This hook checks
|
||||
permission when accessing an inode.
|
||||
</para>
|
||||
|
||||
</sect1>
|
||||
|
||||
<sect1 id="cap"><title>LSM Capabilities Module</title>
|
||||
|
||||
<para>
|
||||
The LSM kernel patch moves most of the existing POSIX.1e capabilities
|
||||
logic into an optional security module stored in the file
|
||||
<filename>security/capability.c</filename>. This change allows
|
||||
users who do not want to use capabilities to omit this code entirely
|
||||
from their kernel, instead using the dummy module for traditional
|
||||
superuser logic or any other module that they desire. This change
|
||||
also allows the developers of the capabilities logic to maintain and
|
||||
enhance their code more freely, without needing to integrate patches
|
||||
back into the base kernel.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
In addition to moving the capabilities logic, the LSM kernel patch
|
||||
could move the capability-related fields from the kernel data
|
||||
structures into the new security fields managed by the security
|
||||
modules. However, at present, the LSM kernel patch leaves the
|
||||
capability fields in the kernel data structures. In his original
|
||||
remarks, Linus suggested that this might be preferable so that other
|
||||
security modules can be easily stacked with the capabilities module
|
||||
without needing to chain multiple security structures on the security field.
|
||||
It also avoids imposing extra overhead on the capabilities module
|
||||
to manage the security fields. However, the LSM framework could
|
||||
certainly support such a move if it is determined to be desirable,
|
||||
with only a few additional changes described below.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
At present, the capabilities logic for computing process capabilities
|
||||
on <function>execve</function> and <function>set*uid</function>,
|
||||
checking capabilities for a particular process, saving and checking
|
||||
capabilities for netlink messages, and handling the
|
||||
<function>capget</function> and <function>capset</function> system
|
||||
calls have been moved into the capabilities module. There are still a
|
||||
few locations in the base kernel where capability-related fields are
|
||||
directly examined or modified, but the current version of the LSM
|
||||
patch does allow a security module to completely replace the
|
||||
assignment and testing of capabilities. These few locations would
|
||||
need to be changed if the capability-related fields were moved into
|
||||
the security field. The following is a list of known locations that
|
||||
still perform such direct examination or modification of
|
||||
capability-related fields:
|
||||
<itemizedlist>
|
||||
<listitem><para><filename>fs/open.c</filename>:<function>sys_access</function></para></listitem>
|
||||
<listitem><para><filename>fs/lockd/host.c</filename>:<function>nlm_bind_host</function></para></listitem>
|
||||
<listitem><para><filename>fs/nfsd/auth.c</filename>:<function>nfsd_setuser</function></para></listitem>
|
||||
<listitem><para><filename>fs/proc/array.c</filename>:<function>task_cap</function></para></listitem>
|
||||
</itemizedlist>
|
||||
</para>
|
||||
|
||||
</sect1>
|
||||
|
||||
</article>
|
||||
3
Documentation/DocBook/man/Makefile
Normal file
3
Documentation/DocBook/man/Makefile
Normal file
@ -0,0 +1,3 @@
|
||||
# Rules are put in Documentation/DocBook
|
||||
|
||||
clean-files := *.9.gz *.sgml manpage.links manpage.refs
|
||||
107
Documentation/DocBook/mcabook.tmpl
Normal file
107
Documentation/DocBook/mcabook.tmpl
Normal file
@ -0,0 +1,107 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
|
||||
"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []>
|
||||
|
||||
<book id="MCAGuide">
|
||||
<bookinfo>
|
||||
<title>MCA Driver Programming Interface</title>
|
||||
|
||||
<authorgroup>
|
||||
<author>
|
||||
<firstname>Alan</firstname>
|
||||
<surname>Cox</surname>
|
||||
<affiliation>
|
||||
<address>
|
||||
<email>alan@redhat.com</email>
|
||||
</address>
|
||||
</affiliation>
|
||||
</author>
|
||||
<author>
|
||||
<firstname>David</firstname>
|
||||
<surname>Weinehall</surname>
|
||||
</author>
|
||||
<author>
|
||||
<firstname>Chris</firstname>
|
||||
<surname>Beauregard</surname>
|
||||
</author>
|
||||
</authorgroup>
|
||||
|
||||
<copyright>
|
||||
<year>2000</year>
|
||||
<holder>Alan Cox</holder>
|
||||
<holder>David Weinehall</holder>
|
||||
<holder>Chris Beauregard</holder>
|
||||
</copyright>
|
||||
|
||||
<legalnotice>
|
||||
<para>
|
||||
This documentation is free software; you can redistribute
|
||||
it and/or modify it under the terms of the GNU General Public
|
||||
License as published by the Free Software Foundation; either
|
||||
version 2 of the License, or (at your option) any later
|
||||
version.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
This program is distributed in the hope that it will be
|
||||
useful, but WITHOUT ANY WARRANTY; without even the implied
|
||||
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
|
||||
See the GNU General Public License for more details.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
You should have received a copy of the GNU General Public
|
||||
License along with this program; if not, write to the Free
|
||||
Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
|
||||
MA 02111-1307 USA
|
||||
</para>
|
||||
|
||||
<para>
|
||||
For more details see the file COPYING in the source
|
||||
distribution of Linux.
|
||||
</para>
|
||||
</legalnotice>
|
||||
</bookinfo>
|
||||
|
||||
<toc></toc>
|
||||
|
||||
<chapter id="intro">
|
||||
<title>Introduction</title>
|
||||
<para>
|
||||
The MCA bus functions provide a generalised interface to find MCA
|
||||
bus cards, to claim them for a driver, and to read and manipulate POS
|
||||
registers without being aware of the motherboard internals or
|
||||
certain deep magic specific to onboard devices.
|
||||
</para>
|
||||
<para>
|
||||
The basic interface to the MCA bus devices is the slot. Each slot
|
||||
is numbered and virtual slot numbers are assigned to the internal
|
||||
devices. Using a pci_dev as other busses do does not really make
|
||||
sense in the MCA context as the MCA bus resources require card
|
||||
specific interpretation.
|
||||
</para>
|
||||
<para>
|
||||
Finally the MCA bus functions provide a parallel set of DMA
|
||||
functions mimicing the ISA bus DMA functions as closely as possible,
|
||||
although also supporting the additional DMA functionality on the
|
||||
MCA bus controllers.
|
||||
</para>
|
||||
</chapter>
|
||||
<chapter id="bugs">
|
||||
<title>Known Bugs And Assumptions</title>
|
||||
<para>
|
||||
None.
|
||||
</para>
|
||||
</chapter>
|
||||
|
||||
<chapter id="pubfunctions">
|
||||
<title>Public Functions Provided</title>
|
||||
!Edrivers/mca/mca-legacy.c
|
||||
</chapter>
|
||||
|
||||
<chapter id="dmafunctions">
|
||||
<title>DMA Functions Provided</title>
|
||||
!Iinclude/asm-i386/mca_dma.h
|
||||
</chapter>
|
||||
|
||||
</book>
|
||||
1321
Documentation/DocBook/mtdnand.tmpl
Normal file
1321
Documentation/DocBook/mtdnand.tmpl
Normal file
File diff suppressed because it is too large
Load Diff
591
Documentation/DocBook/procfs-guide.tmpl
Normal file
591
Documentation/DocBook/procfs-guide.tmpl
Normal file
@ -0,0 +1,591 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
|
||||
"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" [
|
||||
<!ENTITY procfsexample SYSTEM "procfs_example.xml">
|
||||
]>
|
||||
|
||||
<book id="LKProcfsGuide">
|
||||
<bookinfo>
|
||||
<title>Linux Kernel Procfs Guide</title>
|
||||
|
||||
<authorgroup>
|
||||
<author>
|
||||
<firstname>Erik</firstname>
|
||||
<othername>(J.A.K.)</othername>
|
||||
<surname>Mouw</surname>
|
||||
<affiliation>
|
||||
<orgname>Delft University of Technology</orgname>
|
||||
<orgdiv>Faculty of Information Technology and Systems</orgdiv>
|
||||
<address>
|
||||
<email>J.A.K.Mouw@its.tudelft.nl</email>
|
||||
<pob>PO BOX 5031</pob>
|
||||
<postcode>2600 GA</postcode>
|
||||
<city>Delft</city>
|
||||
<country>The Netherlands</country>
|
||||
</address>
|
||||
</affiliation>
|
||||
</author>
|
||||
</authorgroup>
|
||||
|
||||
<revhistory>
|
||||
<revision>
|
||||
<revnumber>1.0 </revnumber>
|
||||
<date>May 30, 2001</date>
|
||||
<revremark>Initial revision posted to linux-kernel</revremark>
|
||||
</revision>
|
||||
<revision>
|
||||
<revnumber>1.1 </revnumber>
|
||||
<date>June 3, 2001</date>
|
||||
<revremark>Revised after comments from linux-kernel</revremark>
|
||||
</revision>
|
||||
</revhistory>
|
||||
|
||||
<copyright>
|
||||
<year>2001</year>
|
||||
<holder>Erik Mouw</holder>
|
||||
</copyright>
|
||||
|
||||
|
||||
<legalnotice>
|
||||
<para>
|
||||
This documentation is free software; you can redistribute it
|
||||
and/or modify it under the terms of the GNU General Public
|
||||
License as published by the Free Software Foundation; either
|
||||
version 2 of the License, or (at your option) any later
|
||||
version.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
This documentation is distributed in the hope that it will be
|
||||
useful, but WITHOUT ANY WARRANTY; without even the implied
|
||||
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
|
||||
PURPOSE. See the GNU General Public License for more details.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
You should have received a copy of the GNU General Public
|
||||
License along with this program; if not, write to the Free
|
||||
Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
|
||||
MA 02111-1307 USA
|
||||
</para>
|
||||
|
||||
<para>
|
||||
For more details see the file COPYING in the source
|
||||
distribution of Linux.
|
||||
</para>
|
||||
</legalnotice>
|
||||
</bookinfo>
|
||||
|
||||
|
||||
|
||||
|
||||
<toc>
|
||||
</toc>
|
||||
|
||||
|
||||
|
||||
|
||||
<preface>
|
||||
<title>Preface</title>
|
||||
|
||||
<para>
|
||||
This guide describes the use of the procfs file system from
|
||||
within the Linux kernel. The idea to write this guide came up on
|
||||
the #kernelnewbies IRC channel (see <ulink
|
||||
url="http://www.kernelnewbies.org/">http://www.kernelnewbies.org/</ulink>),
|
||||
when Jeff Garzik explained the use of procfs and forwarded me a
|
||||
message Alexander Viro wrote to the linux-kernel mailing list. I
|
||||
agreed to write it up nicely, so here it is.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
I'd like to thank Jeff Garzik
|
||||
<email>jgarzik@pobox.com</email> and Alexander Viro
|
||||
<email>viro@parcelfarce.linux.theplanet.co.uk</email> for their input,
|
||||
Tim Waugh <email>twaugh@redhat.com</email> for his <ulink
|
||||
url="http://people.redhat.com/twaugh/docbook/selfdocbook/">Selfdocbook</ulink>,
|
||||
and Marc Joosen <email>marcj@historia.et.tudelft.nl</email> for
|
||||
proofreading.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
This documentation was written while working on the LART
|
||||
computing board (<ulink
|
||||
url="http://www.lart.tudelft.nl/">http://www.lart.tudelft.nl/</ulink>),
|
||||
which is sponsored by the Mobile Multi-media Communications
|
||||
(<ulink
|
||||
url="http://www.mmc.tudelft.nl/">http://www.mmc.tudelft.nl/</ulink>)
|
||||
and Ubiquitous Communications (<ulink
|
||||
url="http://www.ubicom.tudelft.nl/">http://www.ubicom.tudelft.nl/</ulink>)
|
||||
projects.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Erik
|
||||
</para>
|
||||
</preface>
|
||||
|
||||
|
||||
|
||||
|
||||
<chapter id="intro">
|
||||
<title>Introduction</title>
|
||||
|
||||
<para>
|
||||
The <filename class="directory">/proc</filename> file system
|
||||
(procfs) is a special file system in the linux kernel. It's a
|
||||
virtual file system: it is not associated with a block device
|
||||
but exists only in memory. The files in the procfs are there to
|
||||
allow userland programs access to certain information from the
|
||||
kernel (like process information in <filename
|
||||
class="directory">/proc/[0-9]+/</filename>), but also for debug
|
||||
purposes (like <filename>/proc/ksyms</filename>).
|
||||
</para>
|
||||
|
||||
<para>
|
||||
This guide describes the use of the procfs file system from
|
||||
within the Linux kernel. It starts by introducing all relevant
|
||||
functions to manage the files within the file system. After that
|
||||
it shows how to communicate with userland, and some tips and
|
||||
tricks will be pointed out. Finally a complete example will be
|
||||
shown.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Note that the files in <filename
|
||||
class="directory">/proc/sys</filename> are sysctl files: they
|
||||
don't belong to procfs and are governed by a completely
|
||||
different API described in the Kernel API book.
|
||||
</para>
|
||||
</chapter>
|
||||
|
||||
|
||||
|
||||
|
||||
<chapter id="managing">
|
||||
<title>Managing procfs entries</title>
|
||||
|
||||
<para>
|
||||
This chapter describes the functions that various kernel
|
||||
components use to populate the procfs with files, symlinks,
|
||||
device nodes, and directories.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
A minor note before we start: if you want to use any of the
|
||||
procfs functions, be sure to include the correct header file!
|
||||
This should be one of the first lines in your code:
|
||||
</para>
|
||||
|
||||
<programlisting>
|
||||
#include <linux/proc_fs.h>
|
||||
</programlisting>
|
||||
|
||||
|
||||
|
||||
|
||||
<sect1 id="regularfile">
|
||||
<title>Creating a regular file</title>
|
||||
|
||||
<funcsynopsis>
|
||||
<funcprototype>
|
||||
<funcdef>struct proc_dir_entry* <function>create_proc_entry</function></funcdef>
|
||||
<paramdef>const char* <parameter>name</parameter></paramdef>
|
||||
<paramdef>mode_t <parameter>mode</parameter></paramdef>
|
||||
<paramdef>struct proc_dir_entry* <parameter>parent</parameter></paramdef>
|
||||
</funcprototype>
|
||||
</funcsynopsis>
|
||||
|
||||
<para>
|
||||
This function creates a regular file with the name
|
||||
<parameter>name</parameter>, file mode
|
||||
<parameter>mode</parameter> in the directory
|
||||
<parameter>parent</parameter>. To create a file in the root of
|
||||
the procfs, use <constant>NULL</constant> as
|
||||
<parameter>parent</parameter> parameter. When successful, the
|
||||
function will return a pointer to the freshly created
|
||||
<structname>struct proc_dir_entry</structname>; otherwise it
|
||||
will return <constant>NULL</constant>. <xref
|
||||
linkend="userland"/> describes how to do something useful with
|
||||
regular files.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Note that it is specifically supported that you can pass a
|
||||
path that spans multiple directories. For example
|
||||
<function>create_proc_entry</function>(<parameter>"drivers/via0/info"</parameter>)
|
||||
will create the <filename class="directory">via0</filename>
|
||||
directory if necessary, with standard
|
||||
<constant>0755</constant> permissions.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
If you only want to be able to read the file, the function
|
||||
<function>create_proc_read_entry</function> described in <xref
|
||||
linkend="convenience"/> may be used to create and initialise
|
||||
the procfs entry in one single call.
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
|
||||
|
||||
|
||||
<sect1>
|
||||
<title>Creating a symlink</title>
|
||||
|
||||
<funcsynopsis>
|
||||
<funcprototype>
|
||||
<funcdef>struct proc_dir_entry*
|
||||
<function>proc_symlink</function></funcdef> <paramdef>const
|
||||
char* <parameter>name</parameter></paramdef>
|
||||
<paramdef>struct proc_dir_entry*
|
||||
<parameter>parent</parameter></paramdef> <paramdef>const
|
||||
char* <parameter>dest</parameter></paramdef>
|
||||
</funcprototype>
|
||||
</funcsynopsis>
|
||||
|
||||
<para>
|
||||
This creates a symlink in the procfs directory
|
||||
<parameter>parent</parameter> that points from
|
||||
<parameter>name</parameter> to
|
||||
<parameter>dest</parameter>. This translates in userland to
|
||||
<literal>ln -s</literal> <parameter>dest</parameter>
|
||||
<parameter>name</parameter>.
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
<sect1>
|
||||
<title>Creating a directory</title>
|
||||
|
||||
<funcsynopsis>
|
||||
<funcprototype>
|
||||
<funcdef>struct proc_dir_entry* <function>proc_mkdir</function></funcdef>
|
||||
<paramdef>const char* <parameter>name</parameter></paramdef>
|
||||
<paramdef>struct proc_dir_entry* <parameter>parent</parameter></paramdef>
|
||||
</funcprototype>
|
||||
</funcsynopsis>
|
||||
|
||||
<para>
|
||||
Create a directory <parameter>name</parameter> in the procfs
|
||||
directory <parameter>parent</parameter>.
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
|
||||
|
||||
|
||||
<sect1>
|
||||
<title>Removing an entry</title>
|
||||
|
||||
<funcsynopsis>
|
||||
<funcprototype>
|
||||
<funcdef>void <function>remove_proc_entry</function></funcdef>
|
||||
<paramdef>const char* <parameter>name</parameter></paramdef>
|
||||
<paramdef>struct proc_dir_entry* <parameter>parent</parameter></paramdef>
|
||||
</funcprototype>
|
||||
</funcsynopsis>
|
||||
|
||||
<para>
|
||||
Removes the entry <parameter>name</parameter> in the directory
|
||||
<parameter>parent</parameter> from the procfs. Entries are
|
||||
removed by their <emphasis>name</emphasis>, not by the
|
||||
<structname>struct proc_dir_entry</structname> returned by the
|
||||
various create functions. Note that this function doesn't
|
||||
recursively remove entries.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Be sure to free the <structfield>data</structfield> entry from
|
||||
the <structname>struct proc_dir_entry</structname> before
|
||||
<function>remove_proc_entry</function> is called (that is: if
|
||||
there was some <structfield>data</structfield> allocated, of
|
||||
course). See <xref linkend="usingdata"/> for more information
|
||||
on using the <structfield>data</structfield> entry.
|
||||
</para>
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
||||
|
||||
|
||||
|
||||
<chapter id="userland">
|
||||
<title>Communicating with userland</title>
|
||||
|
||||
<para>
|
||||
Instead of reading (or writing) information directly from
|
||||
kernel memory, procfs works with <emphasis>call back
|
||||
functions</emphasis> for files: functions that are called when
|
||||
a specific file is being read or written. Such functions have
|
||||
to be initialised after the procfs file is created by setting
|
||||
the <structfield>read_proc</structfield> and/or
|
||||
<structfield>write_proc</structfield> fields in the
|
||||
<structname>struct proc_dir_entry*</structname> that the
|
||||
function <function>create_proc_entry</function> returned:
|
||||
</para>
|
||||
|
||||
<programlisting>
|
||||
struct proc_dir_entry* entry;
|
||||
|
||||
entry->read_proc = read_proc_foo;
|
||||
entry->write_proc = write_proc_foo;
|
||||
</programlisting>
|
||||
|
||||
<para>
|
||||
If you only want to use a the
|
||||
<structfield>read_proc</structfield>, the function
|
||||
<function>create_proc_read_entry</function> described in <xref
|
||||
linkend="convenience"/> may be used to create and initialise the
|
||||
procfs entry in one single call.
|
||||
</para>
|
||||
|
||||
|
||||
|
||||
<sect1>
|
||||
<title>Reading data</title>
|
||||
|
||||
<para>
|
||||
The read function is a call back function that allows userland
|
||||
processes to read data from the kernel. The read function
|
||||
should have the following format:
|
||||
</para>
|
||||
|
||||
<funcsynopsis>
|
||||
<funcprototype>
|
||||
<funcdef>int <function>read_func</function></funcdef>
|
||||
<paramdef>char* <parameter>page</parameter></paramdef>
|
||||
<paramdef>char** <parameter>start</parameter></paramdef>
|
||||
<paramdef>off_t <parameter>off</parameter></paramdef>
|
||||
<paramdef>int <parameter>count</parameter></paramdef>
|
||||
<paramdef>int* <parameter>eof</parameter></paramdef>
|
||||
<paramdef>void* <parameter>data</parameter></paramdef>
|
||||
</funcprototype>
|
||||
</funcsynopsis>
|
||||
|
||||
<para>
|
||||
The read function should write its information into the
|
||||
<parameter>page</parameter>. For proper use, the function
|
||||
should start writing at an offset of
|
||||
<parameter>off</parameter> in <parameter>page</parameter> and
|
||||
write at most <parameter>count</parameter> bytes, but because
|
||||
most read functions are quite simple and only return a small
|
||||
amount of information, these two parameters are usually
|
||||
ignored (it breaks pagers like <literal>more</literal> and
|
||||
<literal>less</literal>, but <literal>cat</literal> still
|
||||
works).
|
||||
</para>
|
||||
|
||||
<para>
|
||||
If the <parameter>off</parameter> and
|
||||
<parameter>count</parameter> parameters are properly used,
|
||||
<parameter>eof</parameter> should be used to signal that the
|
||||
end of the file has been reached by writing
|
||||
<literal>1</literal> to the memory location
|
||||
<parameter>eof</parameter> points to.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
The parameter <parameter>start</parameter> doesn't seem to be
|
||||
used anywhere in the kernel. The <parameter>data</parameter>
|
||||
parameter can be used to create a single call back function for
|
||||
several files, see <xref linkend="usingdata"/>.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
The <function>read_func</function> function must return the
|
||||
number of bytes written into the <parameter>page</parameter>.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
<xref linkend="example"/> shows how to use a read call back
|
||||
function.
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
|
||||
|
||||
|
||||
<sect1>
|
||||
<title>Writing data</title>
|
||||
|
||||
<para>
|
||||
The write call back function allows a userland process to write
|
||||
data to the kernel, so it has some kind of control over the
|
||||
kernel. The write function should have the following format:
|
||||
</para>
|
||||
|
||||
<funcsynopsis>
|
||||
<funcprototype>
|
||||
<funcdef>int <function>write_func</function></funcdef>
|
||||
<paramdef>struct file* <parameter>file</parameter></paramdef>
|
||||
<paramdef>const char* <parameter>buffer</parameter></paramdef>
|
||||
<paramdef>unsigned long <parameter>count</parameter></paramdef>
|
||||
<paramdef>void* <parameter>data</parameter></paramdef>
|
||||
</funcprototype>
|
||||
</funcsynopsis>
|
||||
|
||||
<para>
|
||||
The write function should read <parameter>count</parameter>
|
||||
bytes at maximum from the <parameter>buffer</parameter>. Note
|
||||
that the <parameter>buffer</parameter> doesn't live in the
|
||||
kernel's memory space, so it should first be copied to kernel
|
||||
space with <function>copy_from_user</function>. The
|
||||
<parameter>file</parameter> parameter is usually
|
||||
ignored. <xref linkend="usingdata"/> shows how to use the
|
||||
<parameter>data</parameter> parameter.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Again, <xref linkend="example"/> shows how to use this call back
|
||||
function.
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
|
||||
|
||||
|
||||
<sect1 id="usingdata">
|
||||
<title>A single call back for many files</title>
|
||||
|
||||
<para>
|
||||
When a large number of almost identical files is used, it's
|
||||
quite inconvenient to use a separate call back function for
|
||||
each file. A better approach is to have a single call back
|
||||
function that distinguishes between the files by using the
|
||||
<structfield>data</structfield> field in <structname>struct
|
||||
proc_dir_entry</structname>. First of all, the
|
||||
<structfield>data</structfield> field has to be initialised:
|
||||
</para>
|
||||
|
||||
<programlisting>
|
||||
struct proc_dir_entry* entry;
|
||||
struct my_file_data *file_data;
|
||||
|
||||
file_data = kmalloc(sizeof(struct my_file_data), GFP_KERNEL);
|
||||
entry->data = file_data;
|
||||
</programlisting>
|
||||
|
||||
<para>
|
||||
The <structfield>data</structfield> field is a <type>void
|
||||
*</type>, so it can be initialised with anything.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Now that the <structfield>data</structfield> field is set, the
|
||||
<function>read_proc</function> and
|
||||
<function>write_proc</function> can use it to distinguish
|
||||
between files because they get it passed into their
|
||||
<parameter>data</parameter> parameter:
|
||||
</para>
|
||||
|
||||
<programlisting>
|
||||
int foo_read_func(char *page, char **start, off_t off,
|
||||
int count, int *eof, void *data)
|
||||
{
|
||||
int len;
|
||||
|
||||
if(data == file_data) {
|
||||
/* special case for this file */
|
||||
} else {
|
||||
/* normal processing */
|
||||
}
|
||||
|
||||
return len;
|
||||
}
|
||||
</programlisting>
|
||||
|
||||
<para>
|
||||
Be sure to free the <structfield>data</structfield> data field
|
||||
when removing the procfs entry.
|
||||
</para>
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
||||
|
||||
|
||||
|
||||
<chapter id="tips">
|
||||
<title>Tips and tricks</title>
|
||||
|
||||
|
||||
|
||||
|
||||
<sect1 id="convenience">
|
||||
<title>Convenience functions</title>
|
||||
|
||||
<funcsynopsis>
|
||||
<funcprototype>
|
||||
<funcdef>struct proc_dir_entry* <function>create_proc_read_entry</function></funcdef>
|
||||
<paramdef>const char* <parameter>name</parameter></paramdef>
|
||||
<paramdef>mode_t <parameter>mode</parameter></paramdef>
|
||||
<paramdef>struct proc_dir_entry* <parameter>parent</parameter></paramdef>
|
||||
<paramdef>read_proc_t* <parameter>read_proc</parameter></paramdef>
|
||||
<paramdef>void* <parameter>data</parameter></paramdef>
|
||||
</funcprototype>
|
||||
</funcsynopsis>
|
||||
|
||||
<para>
|
||||
This function creates a regular file in exactly the same way
|
||||
as <function>create_proc_entry</function> from <xref
|
||||
linkend="regularfile"/> does, but also allows to set the read
|
||||
function <parameter>read_proc</parameter> in one call. This
|
||||
function can set the <parameter>data</parameter> as well, like
|
||||
explained in <xref linkend="usingdata"/>.
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
|
||||
|
||||
<sect1>
|
||||
<title>Modules</title>
|
||||
|
||||
<para>
|
||||
If procfs is being used from within a module, be sure to set
|
||||
the <structfield>owner</structfield> field in the
|
||||
<structname>struct proc_dir_entry</structname> to
|
||||
<constant>THIS_MODULE</constant>.
|
||||
</para>
|
||||
|
||||
<programlisting>
|
||||
struct proc_dir_entry* entry;
|
||||
|
||||
entry->owner = THIS_MODULE;
|
||||
</programlisting>
|
||||
</sect1>
|
||||
|
||||
|
||||
|
||||
|
||||
<sect1>
|
||||
<title>Mode and ownership</title>
|
||||
|
||||
<para>
|
||||
Sometimes it is useful to change the mode and/or ownership of
|
||||
a procfs entry. Here is an example that shows how to achieve
|
||||
that:
|
||||
</para>
|
||||
|
||||
<programlisting>
|
||||
struct proc_dir_entry* entry;
|
||||
|
||||
entry->mode = S_IWUSR |S_IRUSR | S_IRGRP | S_IROTH;
|
||||
entry->uid = 0;
|
||||
entry->gid = 100;
|
||||
</programlisting>
|
||||
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
||||
|
||||
|
||||
|
||||
<chapter id="example">
|
||||
<title>Example</title>
|
||||
|
||||
<!-- be careful with the example code: it shouldn't be wider than
|
||||
approx. 60 columns, or otherwise it won't fit properly on a page
|
||||
-->
|
||||
|
||||
&procfsexample;
|
||||
|
||||
</chapter>
|
||||
</book>
|
||||
224
Documentation/DocBook/procfs_example.c
Normal file
224
Documentation/DocBook/procfs_example.c
Normal file
@ -0,0 +1,224 @@
|
||||
/*
|
||||
* procfs_example.c: an example proc interface
|
||||
*
|
||||
* Copyright (C) 2001, Erik Mouw (J.A.K.Mouw@its.tudelft.nl)
|
||||
*
|
||||
* This file accompanies the procfs-guide in the Linux kernel
|
||||
* source. Its main use is to demonstrate the concepts and
|
||||
* functions described in the guide.
|
||||
*
|
||||
* This software has been developed while working on the LART
|
||||
* computing board (http://www.lart.tudelft.nl/), which is
|
||||
* sponsored by the Mobile Multi-media Communications
|
||||
* (http://www.mmc.tudelft.nl/) and Ubiquitous Communications
|
||||
* (http://www.ubicom.tudelft.nl/) projects.
|
||||
*
|
||||
* The author can be reached at:
|
||||
*
|
||||
* Erik Mouw
|
||||
* Information and Communication Theory Group
|
||||
* Faculty of Information Technology and Systems
|
||||
* Delft University of Technology
|
||||
* P.O. Box 5031
|
||||
* 2600 GA Delft
|
||||
* The Netherlands
|
||||
*
|
||||
*
|
||||
* This program is free software; you can redistribute
|
||||
* it and/or modify it under the terms of the GNU General
|
||||
* Public License as published by the Free Software
|
||||
* Foundation; either version 2 of the License, or (at your
|
||||
* option) any later version.
|
||||
*
|
||||
* This program is distributed in the hope that it will be
|
||||
* useful, but WITHOUT ANY WARRANTY; without even the implied
|
||||
* warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
|
||||
* PURPOSE. See the GNU General Public License for more
|
||||
* details.
|
||||
*
|
||||
* You should have received a copy of the GNU General Public
|
||||
* License along with this program; if not, write to the
|
||||
* Free Software Foundation, Inc., 59 Temple Place,
|
||||
* Suite 330, Boston, MA 02111-1307 USA
|
||||
*
|
||||
*/
|
||||
|
||||
#include <linux/module.h>
|
||||
#include <linux/kernel.h>
|
||||
#include <linux/init.h>
|
||||
#include <linux/proc_fs.h>
|
||||
#include <linux/jiffies.h>
|
||||
#include <asm/uaccess.h>
|
||||
|
||||
|
||||
#define MODULE_VERS "1.0"
|
||||
#define MODULE_NAME "procfs_example"
|
||||
|
||||
#define FOOBAR_LEN 8
|
||||
|
||||
struct fb_data_t {
|
||||
char name[FOOBAR_LEN + 1];
|
||||
char value[FOOBAR_LEN + 1];
|
||||
};
|
||||
|
||||
|
||||
static struct proc_dir_entry *example_dir, *foo_file,
|
||||
*bar_file, *jiffies_file, *symlink;
|
||||
|
||||
|
||||
struct fb_data_t foo_data, bar_data;
|
||||
|
||||
|
||||
static int proc_read_jiffies(char *page, char **start,
|
||||
off_t off, int count,
|
||||
int *eof, void *data)
|
||||
{
|
||||
int len;
|
||||
|
||||
len = sprintf(page, "jiffies = %ld\n",
|
||||
jiffies);
|
||||
|
||||
return len;
|
||||
}
|
||||
|
||||
|
||||
static int proc_read_foobar(char *page, char **start,
|
||||
off_t off, int count,
|
||||
int *eof, void *data)
|
||||
{
|
||||
int len;
|
||||
struct fb_data_t *fb_data = (struct fb_data_t *)data;
|
||||
|
||||
/* DON'T DO THAT - buffer overruns are bad */
|
||||
len = sprintf(page, "%s = '%s'\n",
|
||||
fb_data->name, fb_data->value);
|
||||
|
||||
return len;
|
||||
}
|
||||
|
||||
|
||||
static int proc_write_foobar(struct file *file,
|
||||
const char *buffer,
|
||||
unsigned long count,
|
||||
void *data)
|
||||
{
|
||||
int len;
|
||||
struct fb_data_t *fb_data = (struct fb_data_t *)data;
|
||||
|
||||
if(count > FOOBAR_LEN)
|
||||
len = FOOBAR_LEN;
|
||||
else
|
||||
len = count;
|
||||
|
||||
if(copy_from_user(fb_data->value, buffer, len))
|
||||
return -EFAULT;
|
||||
|
||||
fb_data->value[len] = '\0';
|
||||
|
||||
return len;
|
||||
}
|
||||
|
||||
|
||||
static int __init init_procfs_example(void)
|
||||
{
|
||||
int rv = 0;
|
||||
|
||||
/* create directory */
|
||||
example_dir = proc_mkdir(MODULE_NAME, NULL);
|
||||
if(example_dir == NULL) {
|
||||
rv = -ENOMEM;
|
||||
goto out;
|
||||
}
|
||||
|
||||
example_dir->owner = THIS_MODULE;
|
||||
|
||||
/* create jiffies using convenience function */
|
||||
jiffies_file = create_proc_read_entry("jiffies",
|
||||
0444, example_dir,
|
||||
proc_read_jiffies,
|
||||
NULL);
|
||||
if(jiffies_file == NULL) {
|
||||
rv = -ENOMEM;
|
||||
goto no_jiffies;
|
||||
}
|
||||
|
||||
jiffies_file->owner = THIS_MODULE;
|
||||
|
||||
/* create foo and bar files using same callback
|
||||
* functions
|
||||
*/
|
||||
foo_file = create_proc_entry("foo", 0644, example_dir);
|
||||
if(foo_file == NULL) {
|
||||
rv = -ENOMEM;
|
||||
goto no_foo;
|
||||
}
|
||||
|
||||
strcpy(foo_data.name, "foo");
|
||||
strcpy(foo_data.value, "foo");
|
||||
foo_file->data = &foo_data;
|
||||
foo_file->read_proc = proc_read_foobar;
|
||||
foo_file->write_proc = proc_write_foobar;
|
||||
foo_file->owner = THIS_MODULE;
|
||||
|
||||
bar_file = create_proc_entry("bar", 0644, example_dir);
|
||||
if(bar_file == NULL) {
|
||||
rv = -ENOMEM;
|
||||
goto no_bar;
|
||||
}
|
||||
|
||||
strcpy(bar_data.name, "bar");
|
||||
strcpy(bar_data.value, "bar");
|
||||
bar_file->data = &bar_data;
|
||||
bar_file->read_proc = proc_read_foobar;
|
||||
bar_file->write_proc = proc_write_foobar;
|
||||
bar_file->owner = THIS_MODULE;
|
||||
|
||||
/* create symlink */
|
||||
symlink = proc_symlink("jiffies_too", example_dir,
|
||||
"jiffies");
|
||||
if(symlink == NULL) {
|
||||
rv = -ENOMEM;
|
||||
goto no_symlink;
|
||||
}
|
||||
|
||||
symlink->owner = THIS_MODULE;
|
||||
|
||||
/* everything OK */
|
||||
printk(KERN_INFO "%s %s initialised\n",
|
||||
MODULE_NAME, MODULE_VERS);
|
||||
return 0;
|
||||
|
||||
no_symlink:
|
||||
remove_proc_entry("tty", example_dir);
|
||||
no_tty:
|
||||
remove_proc_entry("bar", example_dir);
|
||||
no_bar:
|
||||
remove_proc_entry("foo", example_dir);
|
||||
no_foo:
|
||||
remove_proc_entry("jiffies", example_dir);
|
||||
no_jiffies:
|
||||
remove_proc_entry(MODULE_NAME, NULL);
|
||||
out:
|
||||
return rv;
|
||||
}
|
||||
|
||||
|
||||
static void __exit cleanup_procfs_example(void)
|
||||
{
|
||||
remove_proc_entry("jiffies_too", example_dir);
|
||||
remove_proc_entry("tty", example_dir);
|
||||
remove_proc_entry("bar", example_dir);
|
||||
remove_proc_entry("foo", example_dir);
|
||||
remove_proc_entry("jiffies", example_dir);
|
||||
remove_proc_entry(MODULE_NAME, NULL);
|
||||
|
||||
printk(KERN_INFO "%s %s removed\n",
|
||||
MODULE_NAME, MODULE_VERS);
|
||||
}
|
||||
|
||||
|
||||
module_init(init_procfs_example);
|
||||
module_exit(cleanup_procfs_example);
|
||||
|
||||
MODULE_AUTHOR("Erik Mouw");
|
||||
MODULE_DESCRIPTION("procfs examples");
|
||||
160
Documentation/DocBook/rapidio.tmpl
Normal file
160
Documentation/DocBook/rapidio.tmpl
Normal file
@ -0,0 +1,160 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
|
||||
"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" [
|
||||
<!ENTITY rapidio SYSTEM "rapidio.xml">
|
||||
]>
|
||||
|
||||
<book id="RapidIO-Guide">
|
||||
<bookinfo>
|
||||
<title>RapidIO Subsystem Guide</title>
|
||||
|
||||
<authorgroup>
|
||||
<author>
|
||||
<firstname>Matt</firstname>
|
||||
<surname>Porter</surname>
|
||||
<affiliation>
|
||||
<address>
|
||||
<email>mporter@kernel.crashing.org</email>
|
||||
<email>mporter@mvista.com</email>
|
||||
</address>
|
||||
</affiliation>
|
||||
</author>
|
||||
</authorgroup>
|
||||
|
||||
<copyright>
|
||||
<year>2005</year>
|
||||
<holder>MontaVista Software, Inc.</holder>
|
||||
</copyright>
|
||||
|
||||
<legalnotice>
|
||||
<para>
|
||||
This documentation is free software; you can redistribute
|
||||
it and/or modify it under the terms of the GNU General Public
|
||||
License version 2 as published by the Free Software Foundation.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
This program is distributed in the hope that it will be
|
||||
useful, but WITHOUT ANY WARRANTY; without even the implied
|
||||
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
|
||||
See the GNU General Public License for more details.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
You should have received a copy of the GNU General Public
|
||||
License along with this program; if not, write to the Free
|
||||
Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
|
||||
MA 02111-1307 USA
|
||||
</para>
|
||||
|
||||
<para>
|
||||
For more details see the file COPYING in the source
|
||||
distribution of Linux.
|
||||
</para>
|
||||
</legalnotice>
|
||||
</bookinfo>
|
||||
|
||||
<toc></toc>
|
||||
|
||||
<chapter id="intro">
|
||||
<title>Introduction</title>
|
||||
<para>
|
||||
RapidIO is a high speed switched fabric interconnect with
|
||||
features aimed at the embedded market. RapidIO provides
|
||||
support for memory-mapped I/O as well as message-based
|
||||
transactions over the switched fabric network. RapidIO has
|
||||
a standardized discovery mechanism not unlike the PCI bus
|
||||
standard that allows simple detection of devices in a
|
||||
network.
|
||||
</para>
|
||||
<para>
|
||||
This documentation is provided for developers intending
|
||||
to support RapidIO on new architectures, write new drivers,
|
||||
or to understand the subsystem internals.
|
||||
</para>
|
||||
</chapter>
|
||||
|
||||
<chapter id="bugs">
|
||||
<title>Known Bugs and Limitations</title>
|
||||
|
||||
<sect1>
|
||||
<title>Bugs</title>
|
||||
<para>None. ;)</para>
|
||||
</sect1>
|
||||
<sect1>
|
||||
<title>Limitations</title>
|
||||
<para>
|
||||
<orderedlist>
|
||||
<listitem><para>Access/management of RapidIO memory regions is not supported</para></listitem>
|
||||
<listitem><para>Multiple host enumeration is not supported</para></listitem>
|
||||
</orderedlist>
|
||||
</para>
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
||||
<chapter id="drivers">
|
||||
<title>RapidIO driver interface</title>
|
||||
<para>
|
||||
Drivers are provided a set of calls in order
|
||||
to interface with the subsystem to gather info
|
||||
on devices, request/map memory region resources,
|
||||
and manage mailboxes/doorbells.
|
||||
</para>
|
||||
<sect1>
|
||||
<title>Functions</title>
|
||||
!Iinclude/linux/rio_drv.h
|
||||
!Edrivers/rapidio/rio-driver.c
|
||||
!Edrivers/rapidio/rio.c
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
||||
<chapter id="internals">
|
||||
<title>Internals</title>
|
||||
|
||||
<para>
|
||||
This chapter contains the autogenerated documentation of the RapidIO
|
||||
subsystem.
|
||||
</para>
|
||||
|
||||
<sect1><title>Structures</title>
|
||||
!Iinclude/linux/rio.h
|
||||
</sect1>
|
||||
<sect1><title>Enumeration and Discovery</title>
|
||||
!Idrivers/rapidio/rio-scan.c
|
||||
</sect1>
|
||||
<sect1><title>Driver functionality</title>
|
||||
!Idrivers/rapidio/rio.c
|
||||
!Idrivers/rapidio/rio-access.c
|
||||
</sect1>
|
||||
<sect1><title>Device model support</title>
|
||||
!Idrivers/rapidio/rio-driver.c
|
||||
</sect1>
|
||||
<sect1><title>Sysfs support</title>
|
||||
!Idrivers/rapidio/rio-sysfs.c
|
||||
</sect1>
|
||||
<sect1><title>PPC32 support</title>
|
||||
!Iarch/ppc/kernel/rio.c
|
||||
!Earch/ppc/syslib/ppc85xx_rio.c
|
||||
!Iarch/ppc/syslib/ppc85xx_rio.c
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
||||
<chapter id="credits">
|
||||
<title>Credits</title>
|
||||
<para>
|
||||
The following people have contributed to the RapidIO
|
||||
subsystem directly or indirectly:
|
||||
<orderedlist>
|
||||
<listitem><para>Matt Porter<email>mporter@kernel.crashing.org</email></para></listitem>
|
||||
<listitem><para>Randy Vinson<email>rvinson@mvista.com</email></para></listitem>
|
||||
<listitem><para>Dan Malek<email>dan@embeddedalley.com</email></para></listitem>
|
||||
</orderedlist>
|
||||
</para>
|
||||
<para>
|
||||
The following people have contributed to this document:
|
||||
<orderedlist>
|
||||
<listitem><para>Matt Porter<email>mporter@kernel.crashing.org</email></para></listitem>
|
||||
</orderedlist>
|
||||
</para>
|
||||
</chapter>
|
||||
</book>
|
||||
8
Documentation/DocBook/stylesheet.xsl
Normal file
8
Documentation/DocBook/stylesheet.xsl
Normal file
@ -0,0 +1,8 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<stylesheet xmlns="http://www.w3.org/1999/XSL/Transform" version="1.0">
|
||||
<param name="chunk.quietly">1</param>
|
||||
<param name="funcsynopsis.style">ansi</param>
|
||||
<param name="funcsynopsis.tabular.threshold">80</param>
|
||||
<!-- <param name="paper.type">A4</param> -->
|
||||
<param name="generate.section.toc.level">2</param>
|
||||
</stylesheet>
|
||||
980
Documentation/DocBook/usb.tmpl
Normal file
980
Documentation/DocBook/usb.tmpl
Normal file
@ -0,0 +1,980 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
|
||||
"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []>
|
||||
|
||||
<book id="Linux-USB-API">
|
||||
<bookinfo>
|
||||
<title>The Linux-USB Host Side API</title>
|
||||
|
||||
<legalnotice>
|
||||
<para>
|
||||
This documentation is free software; you can redistribute
|
||||
it and/or modify it under the terms of the GNU General Public
|
||||
License as published by the Free Software Foundation; either
|
||||
version 2 of the License, or (at your option) any later
|
||||
version.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
This program is distributed in the hope that it will be
|
||||
useful, but WITHOUT ANY WARRANTY; without even the implied
|
||||
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
|
||||
See the GNU General Public License for more details.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
You should have received a copy of the GNU General Public
|
||||
License along with this program; if not, write to the Free
|
||||
Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
|
||||
MA 02111-1307 USA
|
||||
</para>
|
||||
|
||||
<para>
|
||||
For more details see the file COPYING in the source
|
||||
distribution of Linux.
|
||||
</para>
|
||||
</legalnotice>
|
||||
</bookinfo>
|
||||
|
||||
<toc></toc>
|
||||
|
||||
<chapter id="intro">
|
||||
<title>Introduction to USB on Linux</title>
|
||||
|
||||
<para>A Universal Serial Bus (USB) is used to connect a host,
|
||||
such as a PC or workstation, to a number of peripheral
|
||||
devices. USB uses a tree structure, with the host as the
|
||||
root (the system's master), hubs as interior nodes, and
|
||||
peripherals as leaves (and slaves).
|
||||
Modern PCs support several such trees of USB devices, usually
|
||||
one USB 2.0 tree (480 Mbit/sec each) with
|
||||
a few USB 1.1 trees (12 Mbit/sec each) that are used when you
|
||||
connect a USB 1.1 device directly to the machine's "root hub".
|
||||
</para>
|
||||
|
||||
<para>That master/slave asymmetry was designed-in for a number of
|
||||
reasons, one being ease of use. It is not physically possible to
|
||||
assemble (legal) USB cables incorrectly: all upstream "to the host"
|
||||
connectors are the rectangular type (matching the sockets on
|
||||
root hubs), and all downstream connectors are the squarish type
|
||||
(or they are built into the peripheral).
|
||||
Also, the host software doesn't need to deal with distributed
|
||||
auto-configuration since the pre-designated master node manages all that.
|
||||
And finally, at the electrical level, bus protocol overhead is reduced by
|
||||
eliminating arbitration and moving scheduling into the host software.
|
||||
</para>
|
||||
|
||||
<para>USB 1.0 was announced in January 1996 and was revised
|
||||
as USB 1.1 (with improvements in hub specification and
|
||||
support for interrupt-out transfers) in September 1998.
|
||||
USB 2.0 was released in April 2000, adding high-speed
|
||||
transfers and transaction-translating hubs (used for USB 1.1
|
||||
and 1.0 backward compatibility).
|
||||
</para>
|
||||
|
||||
<para>Kernel developers added USB support to Linux early in the 2.2 kernel
|
||||
series, shortly before 2.3 development forked. Updates from 2.3 were
|
||||
regularly folded back into 2.2 releases, which improved reliability and
|
||||
brought <filename>/sbin/hotplug</filename> support as well more drivers.
|
||||
Such improvements were continued in the 2.5 kernel series, where they added
|
||||
USB 2.0 support, improved performance, and made the host controller drivers
|
||||
(HCDs) more consistent. They also simplified the API (to make bugs less
|
||||
likely) and added internal "kerneldoc" documentation.
|
||||
</para>
|
||||
|
||||
<para>Linux can run inside USB devices as well as on
|
||||
the hosts that control the devices.
|
||||
But USB device drivers running inside those peripherals
|
||||
don't do the same things as the ones running inside hosts,
|
||||
so they've been given a different name:
|
||||
<emphasis>gadget drivers</emphasis>.
|
||||
This document does not cover gadget drivers.
|
||||
</para>
|
||||
|
||||
</chapter>
|
||||
|
||||
<chapter id="host">
|
||||
<title>USB Host-Side API Model</title>
|
||||
|
||||
<para>Host-side drivers for USB devices talk to the "usbcore" APIs.
|
||||
There are two. One is intended for
|
||||
<emphasis>general-purpose</emphasis> drivers (exposed through
|
||||
driver frameworks), and the other is for drivers that are
|
||||
<emphasis>part of the core</emphasis>.
|
||||
Such core drivers include the <emphasis>hub</emphasis> driver
|
||||
(which manages trees of USB devices) and several different kinds
|
||||
of <emphasis>host controller drivers</emphasis>,
|
||||
which control individual busses.
|
||||
</para>
|
||||
|
||||
<para>The device model seen by USB drivers is relatively complex.
|
||||
</para>
|
||||
|
||||
<itemizedlist>
|
||||
|
||||
<listitem><para>USB supports four kinds of data transfers
|
||||
(control, bulk, interrupt, and isochronous). Two of them (control
|
||||
and bulk) use bandwidth as it's available,
|
||||
while the other two (interrupt and isochronous)
|
||||
are scheduled to provide guaranteed bandwidth.
|
||||
</para></listitem>
|
||||
|
||||
<listitem><para>The device description model includes one or more
|
||||
"configurations" per device, only one of which is active at a time.
|
||||
Devices that are capable of high-speed operation must also support
|
||||
full-speed configurations, along with a way to ask about the
|
||||
"other speed" configurations which might be used.
|
||||
</para></listitem>
|
||||
|
||||
<listitem><para>Configurations have one or more "interfaces", each
|
||||
of which may have "alternate settings". Interfaces may be
|
||||
standardized by USB "Class" specifications, or may be specific to
|
||||
a vendor or device.</para>
|
||||
|
||||
<para>USB device drivers actually bind to interfaces, not devices.
|
||||
Think of them as "interface drivers", though you
|
||||
may not see many devices where the distinction is important.
|
||||
<emphasis>Most USB devices are simple, with only one configuration,
|
||||
one interface, and one alternate setting.</emphasis>
|
||||
</para></listitem>
|
||||
|
||||
<listitem><para>Interfaces have one or more "endpoints", each of
|
||||
which supports one type and direction of data transfer such as
|
||||
"bulk out" or "interrupt in". The entire configuration may have
|
||||
up to sixteen endpoints in each direction, allocated as needed
|
||||
among all the interfaces.
|
||||
</para></listitem>
|
||||
|
||||
<listitem><para>Data transfer on USB is packetized; each endpoint
|
||||
has a maximum packet size.
|
||||
Drivers must often be aware of conventions such as flagging the end
|
||||
of bulk transfers using "short" (including zero length) packets.
|
||||
</para></listitem>
|
||||
|
||||
<listitem><para>The Linux USB API supports synchronous calls for
|
||||
control and bulk messages.
|
||||
It also supports asynchnous calls for all kinds of data transfer,
|
||||
using request structures called "URBs" (USB Request Blocks).
|
||||
</para></listitem>
|
||||
|
||||
</itemizedlist>
|
||||
|
||||
<para>Accordingly, the USB Core API exposed to device drivers
|
||||
covers quite a lot of territory. You'll probably need to consult
|
||||
the USB 2.0 specification, available online from www.usb.org at
|
||||
no cost, as well as class or device specifications.
|
||||
</para>
|
||||
|
||||
<para>The only host-side drivers that actually touch hardware
|
||||
(reading/writing registers, handling IRQs, and so on) are the HCDs.
|
||||
In theory, all HCDs provide the same functionality through the same
|
||||
API. In practice, that's becoming more true on the 2.5 kernels,
|
||||
but there are still differences that crop up especially with
|
||||
fault handling. Different controllers don't necessarily report
|
||||
the same aspects of failures, and recovery from faults (including
|
||||
software-induced ones like unlinking an URB) isn't yet fully
|
||||
consistent.
|
||||
Device driver authors should make a point of doing disconnect
|
||||
testing (while the device is active) with each different host
|
||||
controller driver, to make sure drivers don't have bugs of
|
||||
their own as well as to make sure they aren't relying on some
|
||||
HCD-specific behavior.
|
||||
(You will need external USB 1.1 and/or
|
||||
USB 2.0 hubs to perform all those tests.)
|
||||
</para>
|
||||
|
||||
</chapter>
|
||||
|
||||
<chapter><title>USB-Standard Types</title>
|
||||
|
||||
<para>In <filename><linux/usb/ch9.h></filename> you will find
|
||||
the USB data types defined in chapter 9 of the USB specification.
|
||||
These data types are used throughout USB, and in APIs including
|
||||
this host side API, gadget APIs, and usbfs.
|
||||
</para>
|
||||
|
||||
!Iinclude/linux/usb/ch9.h
|
||||
|
||||
</chapter>
|
||||
|
||||
<chapter><title>Host-Side Data Types and Macros</title>
|
||||
|
||||
<para>The host side API exposes several layers to drivers, some of
|
||||
which are more necessary than others.
|
||||
These support lifecycle models for host side drivers
|
||||
and devices, and support passing buffers through usbcore to
|
||||
some HCD that performs the I/O for the device driver.
|
||||
</para>
|
||||
|
||||
|
||||
!Iinclude/linux/usb.h
|
||||
|
||||
</chapter>
|
||||
|
||||
<chapter><title>USB Core APIs</title>
|
||||
|
||||
<para>There are two basic I/O models in the USB API.
|
||||
The most elemental one is asynchronous: drivers submit requests
|
||||
in the form of an URB, and the URB's completion callback
|
||||
handle the next step.
|
||||
All USB transfer types support that model, although there
|
||||
are special cases for control URBs (which always have setup
|
||||
and status stages, but may not have a data stage) and
|
||||
isochronous URBs (which allow large packets and include
|
||||
per-packet fault reports).
|
||||
Built on top of that is synchronous API support, where a
|
||||
driver calls a routine that allocates one or more URBs,
|
||||
submits them, and waits until they complete.
|
||||
There are synchronous wrappers for single-buffer control
|
||||
and bulk transfers (which are awkward to use in some
|
||||
driver disconnect scenarios), and for scatterlist based
|
||||
streaming i/o (bulk or interrupt).
|
||||
</para>
|
||||
|
||||
<para>USB drivers need to provide buffers that can be
|
||||
used for DMA, although they don't necessarily need to
|
||||
provide the DMA mapping themselves.
|
||||
There are APIs to use used when allocating DMA buffers,
|
||||
which can prevent use of bounce buffers on some systems.
|
||||
In some cases, drivers may be able to rely on 64bit DMA
|
||||
to eliminate another kind of bounce buffer.
|
||||
</para>
|
||||
|
||||
!Edrivers/usb/core/urb.c
|
||||
!Edrivers/usb/core/message.c
|
||||
!Edrivers/usb/core/file.c
|
||||
!Edrivers/usb/core/driver.c
|
||||
!Edrivers/usb/core/usb.c
|
||||
!Edrivers/usb/core/hub.c
|
||||
</chapter>
|
||||
|
||||
<chapter><title>Host Controller APIs</title>
|
||||
|
||||
<para>These APIs are only for use by host controller drivers,
|
||||
most of which implement standard register interfaces such as
|
||||
EHCI, OHCI, or UHCI.
|
||||
UHCI was one of the first interfaces, designed by Intel and
|
||||
also used by VIA; it doesn't do much in hardware.
|
||||
OHCI was designed later, to have the hardware do more work
|
||||
(bigger transfers, tracking protocol state, and so on).
|
||||
EHCI was designed with USB 2.0; its design has features that
|
||||
resemble OHCI (hardware does much more work) as well as
|
||||
UHCI (some parts of ISO support, TD list processing).
|
||||
</para>
|
||||
|
||||
<para>There are host controllers other than the "big three",
|
||||
although most PCI based controllers (and a few non-PCI based
|
||||
ones) use one of those interfaces.
|
||||
Not all host controllers use DMA; some use PIO, and there
|
||||
is also a simulator.
|
||||
</para>
|
||||
|
||||
<para>The same basic APIs are available to drivers for all
|
||||
those controllers.
|
||||
For historical reasons they are in two layers:
|
||||
<structname>struct usb_bus</structname> is a rather thin
|
||||
layer that became available in the 2.2 kernels, while
|
||||
<structname>struct usb_hcd</structname> is a more featureful
|
||||
layer (available in later 2.4 kernels and in 2.5) that
|
||||
lets HCDs share common code, to shrink driver size
|
||||
and significantly reduce hcd-specific behaviors.
|
||||
</para>
|
||||
|
||||
!Edrivers/usb/core/hcd.c
|
||||
!Edrivers/usb/core/hcd-pci.c
|
||||
!Idrivers/usb/core/buffer.c
|
||||
</chapter>
|
||||
|
||||
<chapter>
|
||||
<title>The USB Filesystem (usbfs)</title>
|
||||
|
||||
<para>This chapter presents the Linux <emphasis>usbfs</emphasis>.
|
||||
You may prefer to avoid writing new kernel code for your
|
||||
USB driver; that's the problem that usbfs set out to solve.
|
||||
User mode device drivers are usually packaged as applications
|
||||
or libraries, and may use usbfs through some programming library
|
||||
that wraps it. Such libraries include
|
||||
<ulink url="http://libusb.sourceforge.net">libusb</ulink>
|
||||
for C/C++, and
|
||||
<ulink url="http://jUSB.sourceforge.net">jUSB</ulink> for Java.
|
||||
</para>
|
||||
|
||||
<note><title>Unfinished</title>
|
||||
<para>This particular documentation is incomplete,
|
||||
especially with respect to the asynchronous mode.
|
||||
As of kernel 2.5.66 the code and this (new) documentation
|
||||
need to be cross-reviewed.
|
||||
</para>
|
||||
</note>
|
||||
|
||||
<para>Configure usbfs into Linux kernels by enabling the
|
||||
<emphasis>USB filesystem</emphasis> option (CONFIG_USB_DEVICEFS),
|
||||
and you get basic support for user mode USB device drivers.
|
||||
Until relatively recently it was often (confusingly) called
|
||||
<emphasis>usbdevfs</emphasis> although it wasn't solving what
|
||||
<emphasis>devfs</emphasis> was.
|
||||
Every USB device will appear in usbfs, regardless of whether or
|
||||
not it has a kernel driver.
|
||||
</para>
|
||||
|
||||
<sect1>
|
||||
<title>What files are in "usbfs"?</title>
|
||||
|
||||
<para>Conventionally mounted at
|
||||
<filename>/proc/bus/usb</filename>, usbfs
|
||||
features include:
|
||||
<itemizedlist>
|
||||
<listitem><para><filename>/proc/bus/usb/devices</filename>
|
||||
... a text file
|
||||
showing each of the USB devices on known to the kernel,
|
||||
and their configuration descriptors.
|
||||
You can also poll() this to learn about new devices.
|
||||
</para></listitem>
|
||||
<listitem><para><filename>/proc/bus/usb/BBB/DDD</filename>
|
||||
... magic files
|
||||
exposing the each device's configuration descriptors, and
|
||||
supporting a series of ioctls for making device requests,
|
||||
including I/O to devices. (Purely for access by programs.)
|
||||
</para></listitem>
|
||||
</itemizedlist>
|
||||
</para>
|
||||
|
||||
<para> Each bus is given a number (BBB) based on when it was
|
||||
enumerated; within each bus, each device is given a similar
|
||||
number (DDD).
|
||||
Those BBB/DDD paths are not "stable" identifiers;
|
||||
expect them to change even if you always leave the devices
|
||||
plugged in to the same hub port.
|
||||
<emphasis>Don't even think of saving these in application
|
||||
configuration files.</emphasis>
|
||||
Stable identifiers are available, for user mode applications
|
||||
that want to use them. HID and networking devices expose
|
||||
these stable IDs, so that for example you can be sure that
|
||||
you told the right UPS to power down its second server.
|
||||
"usbfs" doesn't (yet) expose those IDs.
|
||||
</para>
|
||||
|
||||
</sect1>
|
||||
|
||||
<sect1>
|
||||
<title>Mounting and Access Control</title>
|
||||
|
||||
<para>There are a number of mount options for usbfs, which will
|
||||
be of most interest to you if you need to override the default
|
||||
access control policy.
|
||||
That policy is that only root may read or write device files
|
||||
(<filename>/proc/bus/BBB/DDD</filename>) although anyone may read
|
||||
the <filename>devices</filename>
|
||||
or <filename>drivers</filename> files.
|
||||
I/O requests to the device also need the CAP_SYS_RAWIO capability,
|
||||
</para>
|
||||
|
||||
<para>The significance of that is that by default, all user mode
|
||||
device drivers need super-user privileges.
|
||||
You can change modes or ownership in a driver setup
|
||||
when the device hotplugs, or maye just start the
|
||||
driver right then, as a privileged server (or some activity
|
||||
within one).
|
||||
That's the most secure approach for multi-user systems,
|
||||
but for single user systems ("trusted" by that user)
|
||||
it's more convenient just to grant everyone all access
|
||||
(using the <emphasis>devmode=0666</emphasis> option)
|
||||
so the driver can start whenever it's needed.
|
||||
</para>
|
||||
|
||||
<para>The mount options for usbfs, usable in /etc/fstab or
|
||||
in command line invocations of <emphasis>mount</emphasis>, are:
|
||||
|
||||
<variablelist>
|
||||
<varlistentry>
|
||||
<term><emphasis>busgid</emphasis>=NNNNN</term>
|
||||
<listitem><para>Controls the GID used for the
|
||||
/proc/bus/usb/BBB
|
||||
directories. (Default: 0)</para></listitem></varlistentry>
|
||||
<varlistentry><term><emphasis>busmode</emphasis>=MMM</term>
|
||||
<listitem><para>Controls the file mode used for the
|
||||
/proc/bus/usb/BBB
|
||||
directories. (Default: 0555)
|
||||
</para></listitem></varlistentry>
|
||||
<varlistentry><term><emphasis>busuid</emphasis>=NNNNN</term>
|
||||
<listitem><para>Controls the UID used for the
|
||||
/proc/bus/usb/BBB
|
||||
directories. (Default: 0)</para></listitem></varlistentry>
|
||||
|
||||
<varlistentry><term><emphasis>devgid</emphasis>=NNNNN</term>
|
||||
<listitem><para>Controls the GID used for the
|
||||
/proc/bus/usb/BBB/DDD
|
||||
files. (Default: 0)</para></listitem></varlistentry>
|
||||
<varlistentry><term><emphasis>devmode</emphasis>=MMM</term>
|
||||
<listitem><para>Controls the file mode used for the
|
||||
/proc/bus/usb/BBB/DDD
|
||||
files. (Default: 0644)</para></listitem></varlistentry>
|
||||
<varlistentry><term><emphasis>devuid</emphasis>=NNNNN</term>
|
||||
<listitem><para>Controls the UID used for the
|
||||
/proc/bus/usb/BBB/DDD
|
||||
files. (Default: 0)</para></listitem></varlistentry>
|
||||
|
||||
<varlistentry><term><emphasis>listgid</emphasis>=NNNNN</term>
|
||||
<listitem><para>Controls the GID used for the
|
||||
/proc/bus/usb/devices and drivers files.
|
||||
(Default: 0)</para></listitem></varlistentry>
|
||||
<varlistentry><term><emphasis>listmode</emphasis>=MMM</term>
|
||||
<listitem><para>Controls the file mode used for the
|
||||
/proc/bus/usb/devices and drivers files.
|
||||
(Default: 0444)</para></listitem></varlistentry>
|
||||
<varlistentry><term><emphasis>listuid</emphasis>=NNNNN</term>
|
||||
<listitem><para>Controls the UID used for the
|
||||
/proc/bus/usb/devices and drivers files.
|
||||
(Default: 0)</para></listitem></varlistentry>
|
||||
</variablelist>
|
||||
|
||||
</para>
|
||||
|
||||
<para>Note that many Linux distributions hard-wire the mount options
|
||||
for usbfs in their init scripts, such as
|
||||
<filename>/etc/rc.d/rc.sysinit</filename>,
|
||||
rather than making it easy to set this per-system
|
||||
policy in <filename>/etc/fstab</filename>.
|
||||
</para>
|
||||
|
||||
</sect1>
|
||||
|
||||
<sect1>
|
||||
<title>/proc/bus/usb/devices</title>
|
||||
|
||||
<para>This file is handy for status viewing tools in user
|
||||
mode, which can scan the text format and ignore most of it.
|
||||
More detailed device status (including class and vendor
|
||||
status) is available from device-specific files.
|
||||
For information about the current format of this file,
|
||||
see the
|
||||
<filename>Documentation/usb/proc_usb_info.txt</filename>
|
||||
file in your Linux kernel sources.
|
||||
</para>
|
||||
|
||||
<para>This file, in combination with the poll() system call, can
|
||||
also be used to detect when devices are added or removed:
|
||||
<programlisting>int fd;
|
||||
struct pollfd pfd;
|
||||
|
||||
fd = open("/proc/bus/usb/devices", O_RDONLY);
|
||||
pfd = { fd, POLLIN, 0 };
|
||||
for (;;) {
|
||||
/* The first time through, this call will return immediately. */
|
||||
poll(&pfd, 1, -1);
|
||||
|
||||
/* To see what's changed, compare the file's previous and current
|
||||
contents or scan the filesystem. (Scanning is more precise.) */
|
||||
}</programlisting>
|
||||
Note that this behavior is intended to be used for informational
|
||||
and debug purposes. It would be more appropriate to use programs
|
||||
such as udev or HAL to initialize a device or start a user-mode
|
||||
helper program, for instance.
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
<sect1>
|
||||
<title>/proc/bus/usb/BBB/DDD</title>
|
||||
|
||||
<para>Use these files in one of these basic ways:
|
||||
</para>
|
||||
|
||||
<para><emphasis>They can be read,</emphasis>
|
||||
producing first the device descriptor
|
||||
(18 bytes) and then the descriptors for the current configuration.
|
||||
See the USB 2.0 spec for details about those binary data formats.
|
||||
You'll need to convert most multibyte values from little endian
|
||||
format to your native host byte order, although a few of the
|
||||
fields in the device descriptor (both of the BCD-encoded fields,
|
||||
and the vendor and product IDs) will be byteswapped for you.
|
||||
Note that configuration descriptors include descriptors for
|
||||
interfaces, altsettings, endpoints, and maybe additional
|
||||
class descriptors.
|
||||
</para>
|
||||
|
||||
<para><emphasis>Perform USB operations</emphasis> using
|
||||
<emphasis>ioctl()</emphasis> requests to make endpoint I/O
|
||||
requests (synchronously or asynchronously) or manage
|
||||
the device.
|
||||
These requests need the CAP_SYS_RAWIO capability,
|
||||
as well as filesystem access permissions.
|
||||
Only one ioctl request can be made on one of these
|
||||
device files at a time.
|
||||
This means that if you are synchronously reading an endpoint
|
||||
from one thread, you won't be able to write to a different
|
||||
endpoint from another thread until the read completes.
|
||||
This works for <emphasis>half duplex</emphasis> protocols,
|
||||
but otherwise you'd use asynchronous i/o requests.
|
||||
</para>
|
||||
|
||||
</sect1>
|
||||
|
||||
|
||||
<sect1>
|
||||
<title>Life Cycle of User Mode Drivers</title>
|
||||
|
||||
<para>Such a driver first needs to find a device file
|
||||
for a device it knows how to handle.
|
||||
Maybe it was told about it because a
|
||||
<filename>/sbin/hotplug</filename> event handling agent
|
||||
chose that driver to handle the new device.
|
||||
Or maybe it's an application that scans all the
|
||||
/proc/bus/usb device files, and ignores most devices.
|
||||
In either case, it should <function>read()</function> all
|
||||
the descriptors from the device file,
|
||||
and check them against what it knows how to handle.
|
||||
It might just reject everything except a particular
|
||||
vendor and product ID, or need a more complex policy.
|
||||
</para>
|
||||
|
||||
<para>Never assume there will only be one such device
|
||||
on the system at a time!
|
||||
If your code can't handle more than one device at
|
||||
a time, at least detect when there's more than one, and
|
||||
have your users choose which device to use.
|
||||
</para>
|
||||
|
||||
<para>Once your user mode driver knows what device to use,
|
||||
it interacts with it in either of two styles.
|
||||
The simple style is to make only control requests; some
|
||||
devices don't need more complex interactions than those.
|
||||
(An example might be software using vendor-specific control
|
||||
requests for some initialization or configuration tasks,
|
||||
with a kernel driver for the rest.)
|
||||
</para>
|
||||
|
||||
<para>More likely, you need a more complex style driver:
|
||||
one using non-control endpoints, reading or writing data
|
||||
and claiming exclusive use of an interface.
|
||||
<emphasis>Bulk</emphasis> transfers are easiest to use,
|
||||
but only their sibling <emphasis>interrupt</emphasis> transfers
|
||||
work with low speed devices.
|
||||
Both interrupt and <emphasis>isochronous</emphasis> transfers
|
||||
offer service guarantees because their bandwidth is reserved.
|
||||
Such "periodic" transfers are awkward to use through usbfs,
|
||||
unless you're using the asynchronous calls. However, interrupt
|
||||
transfers can also be used in a synchronous "one shot" style.
|
||||
</para>
|
||||
|
||||
<para>Your user-mode driver should never need to worry
|
||||
about cleaning up request state when the device is
|
||||
disconnected, although it should close its open file
|
||||
descriptors as soon as it starts seeing the ENODEV
|
||||
errors.
|
||||
</para>
|
||||
|
||||
</sect1>
|
||||
|
||||
<sect1><title>The ioctl() Requests</title>
|
||||
|
||||
<para>To use these ioctls, you need to include the following
|
||||
headers in your userspace program:
|
||||
<programlisting>#include <linux/usb.h>
|
||||
#include <linux/usbdevice_fs.h>
|
||||
#include <asm/byteorder.h></programlisting>
|
||||
The standard USB device model requests, from "Chapter 9" of
|
||||
the USB 2.0 specification, are automatically included from
|
||||
the <filename><linux/usb/ch9.h></filename> header.
|
||||
</para>
|
||||
|
||||
<para>Unless noted otherwise, the ioctl requests
|
||||
described here will
|
||||
update the modification time on the usbfs file to which
|
||||
they are applied (unless they fail).
|
||||
A return of zero indicates success; otherwise, a
|
||||
standard USB error code is returned. (These are
|
||||
documented in
|
||||
<filename>Documentation/usb/error-codes.txt</filename>
|
||||
in your kernel sources.)
|
||||
</para>
|
||||
|
||||
<para>Each of these files multiplexes access to several
|
||||
I/O streams, one per endpoint.
|
||||
Each device has one control endpoint (endpoint zero)
|
||||
which supports a limited RPC style RPC access.
|
||||
Devices are configured
|
||||
by khubd (in the kernel) setting a device-wide
|
||||
<emphasis>configuration</emphasis> that affects things
|
||||
like power consumption and basic functionality.
|
||||
The endpoints are part of USB <emphasis>interfaces</emphasis>,
|
||||
which may have <emphasis>altsettings</emphasis>
|
||||
affecting things like which endpoints are available.
|
||||
Many devices only have a single configuration and interface,
|
||||
so drivers for them will ignore configurations and altsettings.
|
||||
</para>
|
||||
|
||||
|
||||
<sect2>
|
||||
<title>Management/Status Requests</title>
|
||||
|
||||
<para>A number of usbfs requests don't deal very directly
|
||||
with device I/O.
|
||||
They mostly relate to device management and status.
|
||||
These are all synchronous requests.
|
||||
</para>
|
||||
|
||||
<variablelist>
|
||||
|
||||
<varlistentry><term>USBDEVFS_CLAIMINTERFACE</term>
|
||||
<listitem><para>This is used to force usbfs to
|
||||
claim a specific interface,
|
||||
which has not previously been claimed by usbfs or any other
|
||||
kernel driver.
|
||||
The ioctl parameter is an integer holding the number of
|
||||
the interface (bInterfaceNumber from descriptor).
|
||||
</para><para>
|
||||
Note that if your driver doesn't claim an interface
|
||||
before trying to use one of its endpoints, and no
|
||||
other driver has bound to it, then the interface is
|
||||
automatically claimed by usbfs.
|
||||
</para><para>
|
||||
This claim will be released by a RELEASEINTERFACE ioctl,
|
||||
or by closing the file descriptor.
|
||||
File modification time is not updated by this request.
|
||||
</para></listitem></varlistentry>
|
||||
|
||||
<varlistentry><term>USBDEVFS_CONNECTINFO</term>
|
||||
<listitem><para>Says whether the device is lowspeed.
|
||||
The ioctl parameter points to a structure like this:
|
||||
<programlisting>struct usbdevfs_connectinfo {
|
||||
unsigned int devnum;
|
||||
unsigned char slow;
|
||||
}; </programlisting>
|
||||
File modification time is not updated by this request.
|
||||
</para><para>
|
||||
<emphasis>You can't tell whether a "not slow"
|
||||
device is connected at high speed (480 MBit/sec)
|
||||
or just full speed (12 MBit/sec).</emphasis>
|
||||
You should know the devnum value already,
|
||||
it's the DDD value of the device file name.
|
||||
</para></listitem></varlistentry>
|
||||
|
||||
<varlistentry><term>USBDEVFS_GETDRIVER</term>
|
||||
<listitem><para>Returns the name of the kernel driver
|
||||
bound to a given interface (a string). Parameter
|
||||
is a pointer to this structure, which is modified:
|
||||
<programlisting>struct usbdevfs_getdriver {
|
||||
unsigned int interface;
|
||||
char driver[USBDEVFS_MAXDRIVERNAME + 1];
|
||||
};</programlisting>
|
||||
File modification time is not updated by this request.
|
||||
</para></listitem></varlistentry>
|
||||
|
||||
<varlistentry><term>USBDEVFS_IOCTL</term>
|
||||
<listitem><para>Passes a request from userspace through
|
||||
to a kernel driver that has an ioctl entry in the
|
||||
<emphasis>struct usb_driver</emphasis> it registered.
|
||||
<programlisting>struct usbdevfs_ioctl {
|
||||
int ifno;
|
||||
int ioctl_code;
|
||||
void *data;
|
||||
};
|
||||
|
||||
/* user mode call looks like this.
|
||||
* 'request' becomes the driver->ioctl() 'code' parameter.
|
||||
* the size of 'param' is encoded in 'request', and that data
|
||||
* is copied to or from the driver->ioctl() 'buf' parameter.
|
||||
*/
|
||||
static int
|
||||
usbdev_ioctl (int fd, int ifno, unsigned request, void *param)
|
||||
{
|
||||
struct usbdevfs_ioctl wrapper;
|
||||
|
||||
wrapper.ifno = ifno;
|
||||
wrapper.ioctl_code = request;
|
||||
wrapper.data = param;
|
||||
|
||||
return ioctl (fd, USBDEVFS_IOCTL, &wrapper);
|
||||
} </programlisting>
|
||||
File modification time is not updated by this request.
|
||||
</para><para>
|
||||
This request lets kernel drivers talk to user mode code
|
||||
through filesystem operations even when they don't create
|
||||
a charactor or block special device.
|
||||
It's also been used to do things like ask devices what
|
||||
device special file should be used.
|
||||
Two pre-defined ioctls are used
|
||||
to disconnect and reconnect kernel drivers, so
|
||||
that user mode code can completely manage binding
|
||||
and configuration of devices.
|
||||
</para></listitem></varlistentry>
|
||||
|
||||
<varlistentry><term>USBDEVFS_RELEASEINTERFACE</term>
|
||||
<listitem><para>This is used to release the claim usbfs
|
||||
made on interface, either implicitly or because of a
|
||||
USBDEVFS_CLAIMINTERFACE call, before the file
|
||||
descriptor is closed.
|
||||
The ioctl parameter is an integer holding the number of
|
||||
the interface (bInterfaceNumber from descriptor);
|
||||
File modification time is not updated by this request.
|
||||
</para><warning><para>
|
||||
<emphasis>No security check is made to ensure
|
||||
that the task which made the claim is the one
|
||||
which is releasing it.
|
||||
This means that user mode driver may interfere
|
||||
other ones. </emphasis>
|
||||
</para></warning></listitem></varlistentry>
|
||||
|
||||
<varlistentry><term>USBDEVFS_RESETEP</term>
|
||||
<listitem><para>Resets the data toggle value for an endpoint
|
||||
(bulk or interrupt) to DATA0.
|
||||
The ioctl parameter is an integer endpoint number
|
||||
(1 to 15, as identified in the endpoint descriptor),
|
||||
with USB_DIR_IN added if the device's endpoint sends
|
||||
data to the host.
|
||||
</para><warning><para>
|
||||
<emphasis>Avoid using this request.
|
||||
It should probably be removed.</emphasis>
|
||||
Using it typically means the device and driver will lose
|
||||
toggle synchronization. If you really lost synchronization,
|
||||
you likely need to completely handshake with the device,
|
||||
using a request like CLEAR_HALT
|
||||
or SET_INTERFACE.
|
||||
</para></warning></listitem></varlistentry>
|
||||
|
||||
</variablelist>
|
||||
|
||||
</sect2>
|
||||
|
||||
<sect2>
|
||||
<title>Synchronous I/O Support</title>
|
||||
|
||||
<para>Synchronous requests involve the kernel blocking
|
||||
until the user mode request completes, either by
|
||||
finishing successfully or by reporting an error.
|
||||
In most cases this is the simplest way to use usbfs,
|
||||
although as noted above it does prevent performing I/O
|
||||
to more than one endpoint at a time.
|
||||
</para>
|
||||
|
||||
<variablelist>
|
||||
|
||||
<varlistentry><term>USBDEVFS_BULK</term>
|
||||
<listitem><para>Issues a bulk read or write request to the
|
||||
device.
|
||||
The ioctl parameter is a pointer to this structure:
|
||||
<programlisting>struct usbdevfs_bulktransfer {
|
||||
unsigned int ep;
|
||||
unsigned int len;
|
||||
unsigned int timeout; /* in milliseconds */
|
||||
void *data;
|
||||
};</programlisting>
|
||||
</para><para>The "ep" value identifies a
|
||||
bulk endpoint number (1 to 15, as identified in an endpoint
|
||||
descriptor),
|
||||
masked with USB_DIR_IN when referring to an endpoint which
|
||||
sends data to the host from the device.
|
||||
The length of the data buffer is identified by "len";
|
||||
Recent kernels support requests up to about 128KBytes.
|
||||
<emphasis>FIXME say how read length is returned,
|
||||
and how short reads are handled.</emphasis>.
|
||||
</para></listitem></varlistentry>
|
||||
|
||||
<varlistentry><term>USBDEVFS_CLEAR_HALT</term>
|
||||
<listitem><para>Clears endpoint halt (stall) and
|
||||
resets the endpoint toggle. This is only
|
||||
meaningful for bulk or interrupt endpoints.
|
||||
The ioctl parameter is an integer endpoint number
|
||||
(1 to 15, as identified in an endpoint descriptor),
|
||||
masked with USB_DIR_IN when referring to an endpoint which
|
||||
sends data to the host from the device.
|
||||
</para><para>
|
||||
Use this on bulk or interrupt endpoints which have
|
||||
stalled, returning <emphasis>-EPIPE</emphasis> status
|
||||
to a data transfer request.
|
||||
Do not issue the control request directly, since
|
||||
that could invalidate the host's record of the
|
||||
data toggle.
|
||||
</para></listitem></varlistentry>
|
||||
|
||||
<varlistentry><term>USBDEVFS_CONTROL</term>
|
||||
<listitem><para>Issues a control request to the device.
|
||||
The ioctl parameter points to a structure like this:
|
||||
<programlisting>struct usbdevfs_ctrltransfer {
|
||||
__u8 bRequestType;
|
||||
__u8 bRequest;
|
||||
__u16 wValue;
|
||||
__u16 wIndex;
|
||||
__u16 wLength;
|
||||
__u32 timeout; /* in milliseconds */
|
||||
void *data;
|
||||
};</programlisting>
|
||||
</para><para>
|
||||
The first eight bytes of this structure are the contents
|
||||
of the SETUP packet to be sent to the device; see the
|
||||
USB 2.0 specification for details.
|
||||
The bRequestType value is composed by combining a
|
||||
USB_TYPE_* value, a USB_DIR_* value, and a
|
||||
USB_RECIP_* value (from
|
||||
<emphasis><linux/usb.h></emphasis>).
|
||||
If wLength is nonzero, it describes the length of the data
|
||||
buffer, which is either written to the device
|
||||
(USB_DIR_OUT) or read from the device (USB_DIR_IN).
|
||||
</para><para>
|
||||
At this writing, you can't transfer more than 4 KBytes
|
||||
of data to or from a device; usbfs has a limit, and
|
||||
some host controller drivers have a limit.
|
||||
(That's not usually a problem.)
|
||||
<emphasis>Also</emphasis> there's no way to say it's
|
||||
not OK to get a short read back from the device.
|
||||
</para></listitem></varlistentry>
|
||||
|
||||
<varlistentry><term>USBDEVFS_RESET</term>
|
||||
<listitem><para>Does a USB level device reset.
|
||||
The ioctl parameter is ignored.
|
||||
After the reset, this rebinds all device interfaces.
|
||||
File modification time is not updated by this request.
|
||||
</para><warning><para>
|
||||
<emphasis>Avoid using this call</emphasis>
|
||||
until some usbcore bugs get fixed,
|
||||
since it does not fully synchronize device, interface,
|
||||
and driver (not just usbfs) state.
|
||||
</para></warning></listitem></varlistentry>
|
||||
|
||||
<varlistentry><term>USBDEVFS_SETINTERFACE</term>
|
||||
<listitem><para>Sets the alternate setting for an
|
||||
interface. The ioctl parameter is a pointer to a
|
||||
structure like this:
|
||||
<programlisting>struct usbdevfs_setinterface {
|
||||
unsigned int interface;
|
||||
unsigned int altsetting;
|
||||
}; </programlisting>
|
||||
File modification time is not updated by this request.
|
||||
</para><para>
|
||||
Those struct members are from some interface descriptor
|
||||
applying to the current configuration.
|
||||
The interface number is the bInterfaceNumber value, and
|
||||
the altsetting number is the bAlternateSetting value.
|
||||
(This resets each endpoint in the interface.)
|
||||
</para></listitem></varlistentry>
|
||||
|
||||
<varlistentry><term>USBDEVFS_SETCONFIGURATION</term>
|
||||
<listitem><para>Issues the
|
||||
<function>usb_set_configuration</function> call
|
||||
for the device.
|
||||
The parameter is an integer holding the number of
|
||||
a configuration (bConfigurationValue from descriptor).
|
||||
File modification time is not updated by this request.
|
||||
</para><warning><para>
|
||||
<emphasis>Avoid using this call</emphasis>
|
||||
until some usbcore bugs get fixed,
|
||||
since it does not fully synchronize device, interface,
|
||||
and driver (not just usbfs) state.
|
||||
</para></warning></listitem></varlistentry>
|
||||
|
||||
</variablelist>
|
||||
</sect2>
|
||||
|
||||
<sect2>
|
||||
<title>Asynchronous I/O Support</title>
|
||||
|
||||
<para>As mentioned above, there are situations where it may be
|
||||
important to initiate concurrent operations from user mode code.
|
||||
This is particularly important for periodic transfers
|
||||
(interrupt and isochronous), but it can be used for other
|
||||
kinds of USB requests too.
|
||||
In such cases, the asynchronous requests described here
|
||||
are essential. Rather than submitting one request and having
|
||||
the kernel block until it completes, the blocking is separate.
|
||||
</para>
|
||||
|
||||
<para>These requests are packaged into a structure that
|
||||
resembles the URB used by kernel device drivers.
|
||||
(No POSIX Async I/O support here, sorry.)
|
||||
It identifies the endpoint type (USBDEVFS_URB_TYPE_*),
|
||||
endpoint (number, masked with USB_DIR_IN as appropriate),
|
||||
buffer and length, and a user "context" value serving to
|
||||
uniquely identify each request.
|
||||
(It's usually a pointer to per-request data.)
|
||||
Flags can modify requests (not as many as supported for
|
||||
kernel drivers).
|
||||
</para>
|
||||
|
||||
<para>Each request can specify a realtime signal number
|
||||
(between SIGRTMIN and SIGRTMAX, inclusive) to request a
|
||||
signal be sent when the request completes.
|
||||
</para>
|
||||
|
||||
<para>When usbfs returns these urbs, the status value
|
||||
is updated, and the buffer may have been modified.
|
||||
Except for isochronous transfers, the actual_length is
|
||||
updated to say how many bytes were transferred; if the
|
||||
USBDEVFS_URB_DISABLE_SPD flag is set
|
||||
("short packets are not OK"), if fewer bytes were read
|
||||
than were requested then you get an error report.
|
||||
</para>
|
||||
|
||||
<programlisting>struct usbdevfs_iso_packet_desc {
|
||||
unsigned int length;
|
||||
unsigned int actual_length;
|
||||
unsigned int status;
|
||||
};
|
||||
|
||||
struct usbdevfs_urb {
|
||||
unsigned char type;
|
||||
unsigned char endpoint;
|
||||
int status;
|
||||
unsigned int flags;
|
||||
void *buffer;
|
||||
int buffer_length;
|
||||
int actual_length;
|
||||
int start_frame;
|
||||
int number_of_packets;
|
||||
int error_count;
|
||||
unsigned int signr;
|
||||
void *usercontext;
|
||||
struct usbdevfs_iso_packet_desc iso_frame_desc[];
|
||||
};</programlisting>
|
||||
|
||||
<para> For these asynchronous requests, the file modification
|
||||
time reflects when the request was initiated.
|
||||
This contrasts with their use with the synchronous requests,
|
||||
where it reflects when requests complete.
|
||||
</para>
|
||||
|
||||
<variablelist>
|
||||
|
||||
<varlistentry><term>USBDEVFS_DISCARDURB</term>
|
||||
<listitem><para>
|
||||
<emphasis>TBS</emphasis>
|
||||
File modification time is not updated by this request.
|
||||
</para><para>
|
||||
</para></listitem></varlistentry>
|
||||
|
||||
<varlistentry><term>USBDEVFS_DISCSIGNAL</term>
|
||||
<listitem><para>
|
||||
<emphasis>TBS</emphasis>
|
||||
File modification time is not updated by this request.
|
||||
</para><para>
|
||||
</para></listitem></varlistentry>
|
||||
|
||||
<varlistentry><term>USBDEVFS_REAPURB</term>
|
||||
<listitem><para>
|
||||
<emphasis>TBS</emphasis>
|
||||
File modification time is not updated by this request.
|
||||
</para><para>
|
||||
</para></listitem></varlistentry>
|
||||
|
||||
<varlistentry><term>USBDEVFS_REAPURBNDELAY</term>
|
||||
<listitem><para>
|
||||
<emphasis>TBS</emphasis>
|
||||
File modification time is not updated by this request.
|
||||
</para><para>
|
||||
</para></listitem></varlistentry>
|
||||
|
||||
<varlistentry><term>USBDEVFS_SUBMITURB</term>
|
||||
<listitem><para>
|
||||
<emphasis>TBS</emphasis>
|
||||
</para><para>
|
||||
</para></listitem></varlistentry>
|
||||
|
||||
</variablelist>
|
||||
</sect2>
|
||||
|
||||
</sect1>
|
||||
|
||||
</chapter>
|
||||
|
||||
</book>
|
||||
<!-- vim:syntax=sgml:sw=4
|
||||
-->
|
||||
1663
Documentation/DocBook/videobook.tmpl
Normal file
1663
Documentation/DocBook/videobook.tmpl
Normal file
File diff suppressed because it is too large
Load Diff
99
Documentation/DocBook/wanbook.tmpl
Normal file
99
Documentation/DocBook/wanbook.tmpl
Normal file
@ -0,0 +1,99 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
|
||||
"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []>
|
||||
|
||||
<book id="WANGuide">
|
||||
<bookinfo>
|
||||
<title>Synchronous PPP and Cisco HDLC Programming Guide</title>
|
||||
|
||||
<authorgroup>
|
||||
<author>
|
||||
<firstname>Alan</firstname>
|
||||
<surname>Cox</surname>
|
||||
<affiliation>
|
||||
<address>
|
||||
<email>alan@redhat.com</email>
|
||||
</address>
|
||||
</affiliation>
|
||||
</author>
|
||||
</authorgroup>
|
||||
|
||||
<copyright>
|
||||
<year>2000</year>
|
||||
<holder>Alan Cox</holder>
|
||||
</copyright>
|
||||
|
||||
<legalnotice>
|
||||
<para>
|
||||
This documentation is free software; you can redistribute
|
||||
it and/or modify it under the terms of the GNU General Public
|
||||
License as published by the Free Software Foundation; either
|
||||
version 2 of the License, or (at your option) any later
|
||||
version.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
This program is distributed in the hope that it will be
|
||||
useful, but WITHOUT ANY WARRANTY; without even the implied
|
||||
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
|
||||
See the GNU General Public License for more details.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
You should have received a copy of the GNU General Public
|
||||
License along with this program; if not, write to the Free
|
||||
Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
|
||||
MA 02111-1307 USA
|
||||
</para>
|
||||
|
||||
<para>
|
||||
For more details see the file COPYING in the source
|
||||
distribution of Linux.
|
||||
</para>
|
||||
</legalnotice>
|
||||
</bookinfo>
|
||||
|
||||
<toc></toc>
|
||||
|
||||
<chapter id="intro">
|
||||
<title>Introduction</title>
|
||||
<para>
|
||||
The syncppp drivers in Linux provide a fairly complete
|
||||
implementation of Cisco HDLC and a minimal implementation of
|
||||
PPP. The longer term goal is to switch the PPP layer to the
|
||||
generic PPP interface that is new in Linux 2.3.x. The API should
|
||||
remain unchanged when this is done, but support will then be
|
||||
available for IPX, compression and other PPP features
|
||||
</para>
|
||||
</chapter>
|
||||
<chapter id="bugs">
|
||||
<title>Known Bugs And Assumptions</title>
|
||||
<para>
|
||||
<variablelist>
|
||||
<varlistentry><term>PPP is minimal</term>
|
||||
<listitem>
|
||||
<para>
|
||||
The current PPP implementation is very basic, although sufficient
|
||||
for most wan usages.
|
||||
</para>
|
||||
</listitem></varlistentry>
|
||||
|
||||
<varlistentry><term>Cisco HDLC Quirks</term>
|
||||
<listitem>
|
||||
<para>
|
||||
Currently we do not end all packets with the correct Cisco multicast
|
||||
or unicast flags. Nothing appears to mind too much but this should
|
||||
be corrected.
|
||||
</para>
|
||||
</listitem></varlistentry>
|
||||
</variablelist>
|
||||
|
||||
</para>
|
||||
</chapter>
|
||||
|
||||
<chapter id="pubfunctions">
|
||||
<title>Public Functions Provided</title>
|
||||
!Edrivers/net/wan/syncppp.c
|
||||
</chapter>
|
||||
|
||||
</book>
|
||||
412
Documentation/DocBook/writing_usb_driver.tmpl
Normal file
412
Documentation/DocBook/writing_usb_driver.tmpl
Normal file
@ -0,0 +1,412 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
|
||||
"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []>
|
||||
|
||||
<book id="USBDeviceDriver">
|
||||
<bookinfo>
|
||||
<title>Writing USB Device Drivers</title>
|
||||
|
||||
<authorgroup>
|
||||
<author>
|
||||
<firstname>Greg</firstname>
|
||||
<surname>Kroah-Hartman</surname>
|
||||
<affiliation>
|
||||
<address>
|
||||
<email>greg@kroah.com</email>
|
||||
</address>
|
||||
</affiliation>
|
||||
</author>
|
||||
</authorgroup>
|
||||
|
||||
<copyright>
|
||||
<year>2001-2002</year>
|
||||
<holder>Greg Kroah-Hartman</holder>
|
||||
</copyright>
|
||||
|
||||
<legalnotice>
|
||||
<para>
|
||||
This documentation is free software; you can redistribute
|
||||
it and/or modify it under the terms of the GNU General Public
|
||||
License as published by the Free Software Foundation; either
|
||||
version 2 of the License, or (at your option) any later
|
||||
version.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
This program is distributed in the hope that it will be
|
||||
useful, but WITHOUT ANY WARRANTY; without even the implied
|
||||
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
|
||||
See the GNU General Public License for more details.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
You should have received a copy of the GNU General Public
|
||||
License along with this program; if not, write to the Free
|
||||
Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
|
||||
MA 02111-1307 USA
|
||||
</para>
|
||||
|
||||
<para>
|
||||
For more details see the file COPYING in the source
|
||||
distribution of Linux.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
This documentation is based on an article published in
|
||||
Linux Journal Magazine, October 2001, Issue 90.
|
||||
</para>
|
||||
</legalnotice>
|
||||
</bookinfo>
|
||||
|
||||
<toc></toc>
|
||||
|
||||
<chapter id="intro">
|
||||
<title>Introduction</title>
|
||||
<para>
|
||||
The Linux USB subsystem has grown from supporting only two different
|
||||
types of devices in the 2.2.7 kernel (mice and keyboards), to over 20
|
||||
different types of devices in the 2.4 kernel. Linux currently supports
|
||||
almost all USB class devices (standard types of devices like keyboards,
|
||||
mice, modems, printers and speakers) and an ever-growing number of
|
||||
vendor-specific devices (such as USB to serial converters, digital
|
||||
cameras, Ethernet devices and MP3 players). For a full list of the
|
||||
different USB devices currently supported, see Resources.
|
||||
</para>
|
||||
<para>
|
||||
The remaining kinds of USB devices that do not have support on Linux are
|
||||
almost all vendor-specific devices. Each vendor decides to implement a
|
||||
custom protocol to talk to their device, so a custom driver usually needs
|
||||
to be created. Some vendors are open with their USB protocols and help
|
||||
with the creation of Linux drivers, while others do not publish them, and
|
||||
developers are forced to reverse-engineer. See Resources for some links
|
||||
to handy reverse-engineering tools.
|
||||
</para>
|
||||
<para>
|
||||
Because each different protocol causes a new driver to be created, I have
|
||||
written a generic USB driver skeleton, modeled after the pci-skeleton.c
|
||||
file in the kernel source tree upon which many PCI network drivers have
|
||||
been based. This USB skeleton can be found at drivers/usb/usb-skeleton.c
|
||||
in the kernel source tree. In this article I will walk through the basics
|
||||
of the skeleton driver, explaining the different pieces and what needs to
|
||||
be done to customize it to your specific device.
|
||||
</para>
|
||||
</chapter>
|
||||
|
||||
<chapter id="basics">
|
||||
<title>Linux USB Basics</title>
|
||||
<para>
|
||||
If you are going to write a Linux USB driver, please become familiar with
|
||||
the USB protocol specification. It can be found, along with many other
|
||||
useful documents, at the USB home page (see Resources). An excellent
|
||||
introduction to the Linux USB subsystem can be found at the USB Working
|
||||
Devices List (see Resources). It explains how the Linux USB subsystem is
|
||||
structured and introduces the reader to the concept of USB urbs, which
|
||||
are essential to USB drivers.
|
||||
</para>
|
||||
<para>
|
||||
The first thing a Linux USB driver needs to do is register itself with
|
||||
the Linux USB subsystem, giving it some information about which devices
|
||||
the driver supports and which functions to call when a device supported
|
||||
by the driver is inserted or removed from the system. All of this
|
||||
information is passed to the USB subsystem in the usb_driver structure.
|
||||
The skeleton driver declares a usb_driver as:
|
||||
</para>
|
||||
<programlisting>
|
||||
static struct usb_driver skel_driver = {
|
||||
.name = "skeleton",
|
||||
.probe = skel_probe,
|
||||
.disconnect = skel_disconnect,
|
||||
.fops = &skel_fops,
|
||||
.minor = USB_SKEL_MINOR_BASE,
|
||||
.id_table = skel_table,
|
||||
};
|
||||
</programlisting>
|
||||
<para>
|
||||
The variable name is a string that describes the driver. It is used in
|
||||
informational messages printed to the system log. The probe and
|
||||
disconnect function pointers are called when a device that matches the
|
||||
information provided in the id_table variable is either seen or removed.
|
||||
</para>
|
||||
<para>
|
||||
The fops and minor variables are optional. Most USB drivers hook into
|
||||
another kernel subsystem, such as the SCSI, network or TTY subsystem.
|
||||
These types of drivers register themselves with the other kernel
|
||||
subsystem, and any user-space interactions are provided through that
|
||||
interface. But for drivers that do not have a matching kernel subsystem,
|
||||
such as MP3 players or scanners, a method of interacting with user space
|
||||
is needed. The USB subsystem provides a way to register a minor device
|
||||
number and a set of file_operations function pointers that enable this
|
||||
user-space interaction. The skeleton driver needs this kind of interface,
|
||||
so it provides a minor starting number and a pointer to its
|
||||
file_operations functions.
|
||||
</para>
|
||||
<para>
|
||||
The USB driver is then registered with a call to usb_register, usually in
|
||||
the driver's init function, as shown here:
|
||||
</para>
|
||||
<programlisting>
|
||||
static int __init usb_skel_init(void)
|
||||
{
|
||||
int result;
|
||||
|
||||
/* register this driver with the USB subsystem */
|
||||
result = usb_register(&skel_driver);
|
||||
if (result < 0) {
|
||||
err("usb_register failed for the "__FILE__ "driver."
|
||||
"Error number %d", result);
|
||||
return -1;
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
module_init(usb_skel_init);
|
||||
</programlisting>
|
||||
<para>
|
||||
When the driver is unloaded from the system, it needs to unregister
|
||||
itself with the USB subsystem. This is done with the usb_unregister
|
||||
function:
|
||||
</para>
|
||||
<programlisting>
|
||||
static void __exit usb_skel_exit(void)
|
||||
{
|
||||
/* deregister this driver with the USB subsystem */
|
||||
usb_deregister(&skel_driver);
|
||||
}
|
||||
module_exit(usb_skel_exit);
|
||||
</programlisting>
|
||||
<para>
|
||||
To enable the linux-hotplug system to load the driver automatically when
|
||||
the device is plugged in, you need to create a MODULE_DEVICE_TABLE. The
|
||||
following code tells the hotplug scripts that this module supports a
|
||||
single device with a specific vendor and product ID:
|
||||
</para>
|
||||
<programlisting>
|
||||
/* table of devices that work with this driver */
|
||||
static struct usb_device_id skel_table [] = {
|
||||
{ USB_DEVICE(USB_SKEL_VENDOR_ID, USB_SKEL_PRODUCT_ID) },
|
||||
{ } /* Terminating entry */
|
||||
};
|
||||
MODULE_DEVICE_TABLE (usb, skel_table);
|
||||
</programlisting>
|
||||
<para>
|
||||
There are other macros that can be used in describing a usb_device_id for
|
||||
drivers that support a whole class of USB drivers. See usb.h for more
|
||||
information on this.
|
||||
</para>
|
||||
</chapter>
|
||||
|
||||
<chapter id="device">
|
||||
<title>Device operation</title>
|
||||
<para>
|
||||
When a device is plugged into the USB bus that matches the device ID
|
||||
pattern that your driver registered with the USB core, the probe function
|
||||
is called. The usb_device structure, interface number and the interface ID
|
||||
are passed to the function:
|
||||
</para>
|
||||
<programlisting>
|
||||
static int skel_probe(struct usb_interface *interface,
|
||||
const struct usb_device_id *id)
|
||||
</programlisting>
|
||||
<para>
|
||||
The driver now needs to verify that this device is actually one that it
|
||||
can accept. If so, it returns 0.
|
||||
If not, or if any error occurs during initialization, an errorcode
|
||||
(such as <literal>-ENOMEM</literal> or <literal>-ENODEV</literal>)
|
||||
is returned from the probe function.
|
||||
</para>
|
||||
<para>
|
||||
In the skeleton driver, we determine what end points are marked as bulk-in
|
||||
and bulk-out. We create buffers to hold the data that will be sent and
|
||||
received from the device, and a USB urb to write data to the device is
|
||||
initialized.
|
||||
</para>
|
||||
<para>
|
||||
Conversely, when the device is removed from the USB bus, the disconnect
|
||||
function is called with the device pointer. The driver needs to clean any
|
||||
private data that has been allocated at this time and to shut down any
|
||||
pending urbs that are in the USB system.
|
||||
</para>
|
||||
<para>
|
||||
Now that the device is plugged into the system and the driver is bound to
|
||||
the device, any of the functions in the file_operations structure that
|
||||
were passed to the USB subsystem will be called from a user program trying
|
||||
to talk to the device. The first function called will be open, as the
|
||||
program tries to open the device for I/O. We increment our private usage
|
||||
count and save off a pointer to our internal structure in the file
|
||||
structure. This is done so that future calls to file operations will
|
||||
enable the driver to determine which device the user is addressing. All
|
||||
of this is done with the following code:
|
||||
</para>
|
||||
<programlisting>
|
||||
/* increment our usage count for the module */
|
||||
++skel->open_count;
|
||||
|
||||
/* save our object in the file's private structure */
|
||||
file->private_data = dev;
|
||||
</programlisting>
|
||||
<para>
|
||||
After the open function is called, the read and write functions are called
|
||||
to receive and send data to the device. In the skel_write function, we
|
||||
receive a pointer to some data that the user wants to send to the device
|
||||
and the size of the data. The function determines how much data it can
|
||||
send to the device based on the size of the write urb it has created (this
|
||||
size depends on the size of the bulk out end point that the device has).
|
||||
Then it copies the data from user space to kernel space, points the urb to
|
||||
the data and submits the urb to the USB subsystem. This can be shown in
|
||||
he following code:
|
||||
</para>
|
||||
<programlisting>
|
||||
/* we can only write as much as 1 urb will hold */
|
||||
bytes_written = (count > skel->bulk_out_size) ? skel->bulk_out_size : count;
|
||||
|
||||
/* copy the data from user space into our urb */
|
||||
copy_from_user(skel->write_urb->transfer_buffer, buffer, bytes_written);
|
||||
|
||||
/* set up our urb */
|
||||
usb_fill_bulk_urb(skel->write_urb,
|
||||
skel->dev,
|
||||
usb_sndbulkpipe(skel->dev, skel->bulk_out_endpointAddr),
|
||||
skel->write_urb->transfer_buffer,
|
||||
bytes_written,
|
||||
skel_write_bulk_callback,
|
||||
skel);
|
||||
|
||||
/* send the data out the bulk port */
|
||||
result = usb_submit_urb(skel->write_urb);
|
||||
if (result) {
|
||||
err("Failed submitting write urb, error %d", result);
|
||||
}
|
||||
</programlisting>
|
||||
<para>
|
||||
When the write urb is filled up with the proper information using the
|
||||
usb_fill_bulk_urb function, we point the urb's completion callback to call our
|
||||
own skel_write_bulk_callback function. This function is called when the
|
||||
urb is finished by the USB subsystem. The callback function is called in
|
||||
interrupt context, so caution must be taken not to do very much processing
|
||||
at that time. Our implementation of skel_write_bulk_callback merely
|
||||
reports if the urb was completed successfully or not and then returns.
|
||||
</para>
|
||||
<para>
|
||||
The read function works a bit differently from the write function in that
|
||||
we do not use an urb to transfer data from the device to the driver.
|
||||
Instead we call the usb_bulk_msg function, which can be used to send or
|
||||
receive data from a device without having to create urbs and handle
|
||||
urb completion callback functions. We call the usb_bulk_msg function,
|
||||
giving it a buffer into which to place any data received from the device
|
||||
and a timeout value. If the timeout period expires without receiving any
|
||||
data from the device, the function will fail and return an error message.
|
||||
This can be shown with the following code:
|
||||
</para>
|
||||
<programlisting>
|
||||
/* do an immediate bulk read to get data from the device */
|
||||
retval = usb_bulk_msg (skel->dev,
|
||||
usb_rcvbulkpipe (skel->dev,
|
||||
skel->bulk_in_endpointAddr),
|
||||
skel->bulk_in_buffer,
|
||||
skel->bulk_in_size,
|
||||
&count, HZ*10);
|
||||
/* if the read was successful, copy the data to user space */
|
||||
if (!retval) {
|
||||
if (copy_to_user (buffer, skel->bulk_in_buffer, count))
|
||||
retval = -EFAULT;
|
||||
else
|
||||
retval = count;
|
||||
}
|
||||
</programlisting>
|
||||
<para>
|
||||
The usb_bulk_msg function can be very useful for doing single reads or
|
||||
writes to a device; however, if you need to read or write constantly to a
|
||||
device, it is recommended to set up your own urbs and submit them to the
|
||||
USB subsystem.
|
||||
</para>
|
||||
<para>
|
||||
When the user program releases the file handle that it has been using to
|
||||
talk to the device, the release function in the driver is called. In this
|
||||
function we decrement our private usage count and wait for possible
|
||||
pending writes:
|
||||
</para>
|
||||
<programlisting>
|
||||
/* decrement our usage count for the device */
|
||||
--skel->open_count;
|
||||
</programlisting>
|
||||
<para>
|
||||
One of the more difficult problems that USB drivers must be able to handle
|
||||
smoothly is the fact that the USB device may be removed from the system at
|
||||
any point in time, even if a program is currently talking to it. It needs
|
||||
to be able to shut down any current reads and writes and notify the
|
||||
user-space programs that the device is no longer there. The following
|
||||
code (function <function>skel_delete</function>)
|
||||
is an example of how to do this: </para>
|
||||
<programlisting>
|
||||
static inline void skel_delete (struct usb_skel *dev)
|
||||
{
|
||||
kfree (dev->bulk_in_buffer);
|
||||
if (dev->bulk_out_buffer != NULL)
|
||||
usb_buffer_free (dev->udev, dev->bulk_out_size,
|
||||
dev->bulk_out_buffer,
|
||||
dev->write_urb->transfer_dma);
|
||||
usb_free_urb (dev->write_urb);
|
||||
kfree (dev);
|
||||
}
|
||||
</programlisting>
|
||||
<para>
|
||||
If a program currently has an open handle to the device, we reset the flag
|
||||
<literal>device_present</literal>. For
|
||||
every read, write, release and other functions that expect a device to be
|
||||
present, the driver first checks this flag to see if the device is
|
||||
still present. If not, it releases that the device has disappeared, and a
|
||||
-ENODEV error is returned to the user-space program. When the release
|
||||
function is eventually called, it determines if there is no device
|
||||
and if not, it does the cleanup that the skel_disconnect
|
||||
function normally does if there are no open files on the device (see
|
||||
Listing 5).
|
||||
</para>
|
||||
</chapter>
|
||||
|
||||
<chapter id="iso">
|
||||
<title>Isochronous Data</title>
|
||||
<para>
|
||||
This usb-skeleton driver does not have any examples of interrupt or
|
||||
isochronous data being sent to or from the device. Interrupt data is sent
|
||||
almost exactly as bulk data is, with a few minor exceptions. Isochronous
|
||||
data works differently with continuous streams of data being sent to or
|
||||
from the device. The audio and video camera drivers are very good examples
|
||||
of drivers that handle isochronous data and will be useful if you also
|
||||
need to do this.
|
||||
</para>
|
||||
</chapter>
|
||||
|
||||
<chapter id="Conclusion">
|
||||
<title>Conclusion</title>
|
||||
<para>
|
||||
Writing Linux USB device drivers is not a difficult task as the
|
||||
usb-skeleton driver shows. This driver, combined with the other current
|
||||
USB drivers, should provide enough examples to help a beginning author
|
||||
create a working driver in a minimal amount of time. The linux-usb-devel
|
||||
mailing list archives also contain a lot of helpful information.
|
||||
</para>
|
||||
</chapter>
|
||||
|
||||
<chapter id="resources">
|
||||
<title>Resources</title>
|
||||
<para>
|
||||
The Linux USB Project: <ulink url="http://www.linux-usb.org">http://www.linux-usb.org/</ulink>
|
||||
</para>
|
||||
<para>
|
||||
Linux Hotplug Project: <ulink url="http://linux-hotplug.sourceforge.net">http://linux-hotplug.sourceforge.net/</ulink>
|
||||
</para>
|
||||
<para>
|
||||
Linux USB Working Devices List: <ulink url="http://www.qbik.ch/usb/devices">http://www.qbik.ch/usb/devices/</ulink>
|
||||
</para>
|
||||
<para>
|
||||
linux-usb-devel Mailing List Archives: <ulink url="http://marc.theaimsgroup.com/?l=linux-usb-devel">http://marc.theaimsgroup.com/?l=linux-usb-devel</ulink>
|
||||
</para>
|
||||
<para>
|
||||
Programming Guide for Linux USB Device Drivers: <ulink url="http://usb.cs.tum.edu/usbdoc">http://usb.cs.tum.edu/usbdoc</ulink>
|
||||
</para>
|
||||
<para>
|
||||
USB Home Page: <ulink url="http://www.usb.org">http://www.usb.org</ulink>
|
||||
</para>
|
||||
</chapter>
|
||||
|
||||
</book>
|
||||
385
Documentation/DocBook/z8530book.tmpl
Normal file
385
Documentation/DocBook/z8530book.tmpl
Normal file
@ -0,0 +1,385 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
|
||||
"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []>
|
||||
|
||||
<book id="Z85230Guide">
|
||||
<bookinfo>
|
||||
<title>Z8530 Programming Guide</title>
|
||||
|
||||
<authorgroup>
|
||||
<author>
|
||||
<firstname>Alan</firstname>
|
||||
<surname>Cox</surname>
|
||||
<affiliation>
|
||||
<address>
|
||||
<email>alan@redhat.com</email>
|
||||
</address>
|
||||
</affiliation>
|
||||
</author>
|
||||
</authorgroup>
|
||||
|
||||
<copyright>
|
||||
<year>2000</year>
|
||||
<holder>Alan Cox</holder>
|
||||
</copyright>
|
||||
|
||||
<legalnotice>
|
||||
<para>
|
||||
This documentation is free software; you can redistribute
|
||||
it and/or modify it under the terms of the GNU General Public
|
||||
License as published by the Free Software Foundation; either
|
||||
version 2 of the License, or (at your option) any later
|
||||
version.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
This program is distributed in the hope that it will be
|
||||
useful, but WITHOUT ANY WARRANTY; without even the implied
|
||||
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
|
||||
See the GNU General Public License for more details.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
You should have received a copy of the GNU General Public
|
||||
License along with this program; if not, write to the Free
|
||||
Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
|
||||
MA 02111-1307 USA
|
||||
</para>
|
||||
|
||||
<para>
|
||||
For more details see the file COPYING in the source
|
||||
distribution of Linux.
|
||||
</para>
|
||||
</legalnotice>
|
||||
</bookinfo>
|
||||
|
||||
<toc></toc>
|
||||
|
||||
<chapter id="intro">
|
||||
<title>Introduction</title>
|
||||
<para>
|
||||
The Z85x30 family synchronous/asynchronous controller chips are
|
||||
used on a large number of cheap network interface cards. The
|
||||
kernel provides a core interface layer that is designed to make
|
||||
it easy to provide WAN services using this chip.
|
||||
</para>
|
||||
<para>
|
||||
The current driver only support synchronous operation. Merging the
|
||||
asynchronous driver support into this code to allow any Z85x30
|
||||
device to be used as both a tty interface and as a synchronous
|
||||
controller is a project for Linux post the 2.4 release
|
||||
</para>
|
||||
<para>
|
||||
The support code handles most common card configurations and
|
||||
supports running both Cisco HDLC and Synchronous PPP. With extra
|
||||
glue the frame relay and X.25 protocols can also be used with this
|
||||
driver.
|
||||
</para>
|
||||
</chapter>
|
||||
|
||||
<chapter>
|
||||
<title>Driver Modes</title>
|
||||
<para>
|
||||
The Z85230 driver layer can drive Z8530, Z85C30 and Z85230 devices
|
||||
in three different modes. Each mode can be applied to an individual
|
||||
channel on the chip (each chip has two channels).
|
||||
</para>
|
||||
<para>
|
||||
The PIO synchronous mode supports the most common Z8530 wiring. Here
|
||||
the chip is interface to the I/O and interrupt facilities of the
|
||||
host machine but not to the DMA subsystem. When running PIO the
|
||||
Z8530 has extremely tight timing requirements. Doing high speeds,
|
||||
even with a Z85230 will be tricky. Typically you should expect to
|
||||
achieve at best 9600 baud with a Z8C530 and 64Kbits with a Z85230.
|
||||
</para>
|
||||
<para>
|
||||
The DMA mode supports the chip when it is configured to use dual DMA
|
||||
channels on an ISA bus. The better cards tend to support this mode
|
||||
of operation for a single channel. With DMA running the Z85230 tops
|
||||
out when it starts to hit ISA DMA constraints at about 512Kbits. It
|
||||
is worth noting here that many PC machines hang or crash when the
|
||||
chip is driven fast enough to hold the ISA bus solid.
|
||||
</para>
|
||||
<para>
|
||||
Transmit DMA mode uses a single DMA channel. The DMA channel is used
|
||||
for transmission as the transmit FIFO is smaller than the receive
|
||||
FIFO. it gives better performance than pure PIO mode but is nowhere
|
||||
near as ideal as pure DMA mode.
|
||||
</para>
|
||||
</chapter>
|
||||
|
||||
<chapter>
|
||||
<title>Using the Z85230 driver</title>
|
||||
<para>
|
||||
The Z85230 driver provides the back end interface to your board. To
|
||||
configure a Z8530 interface you need to detect the board and to
|
||||
identify its ports and interrupt resources. It is also your problem
|
||||
to verify the resources are available.
|
||||
</para>
|
||||
<para>
|
||||
Having identified the chip you need to fill in a struct z8530_dev,
|
||||
which describes each chip. This object must exist until you finally
|
||||
shutdown the board. Firstly zero the active field. This ensures
|
||||
nothing goes off without you intending it. The irq field should
|
||||
be set to the interrupt number of the chip. (Each chip has a single
|
||||
interrupt source rather than each channel). You are responsible
|
||||
for allocating the interrupt line. The interrupt handler should be
|
||||
set to <function>z8530_interrupt</function>. The device id should
|
||||
be set to the z8530_dev structure pointer. Whether the interrupt can
|
||||
be shared or not is board dependent, and up to you to initialise.
|
||||
</para>
|
||||
<para>
|
||||
The structure holds two channel structures.
|
||||
Initialise chanA.ctrlio and chanA.dataio with the address of the
|
||||
control and data ports. You can or this with Z8530_PORT_SLEEP to
|
||||
indicate your interface needs the 5uS delay for chip settling done
|
||||
in software. The PORT_SLEEP option is architecture specific. Other
|
||||
flags may become available on future platforms, eg for MMIO.
|
||||
Initialise the chanA.irqs to &z8530_nop to start the chip up
|
||||
as disabled and discarding interrupt events. This ensures that
|
||||
stray interrupts will be mopped up and not hang the bus. Set
|
||||
chanA.dev to point to the device structure itself. The
|
||||
private and name field you may use as you wish. The private field
|
||||
is unused by the Z85230 layer. The name is used for error reporting
|
||||
and it may thus make sense to make it match the network name.
|
||||
</para>
|
||||
<para>
|
||||
Repeat the same operation with the B channel if your chip has
|
||||
both channels wired to something useful. This isn't always the
|
||||
case. If it is not wired then the I/O values do not matter, but
|
||||
you must initialise chanB.dev.
|
||||
</para>
|
||||
<para>
|
||||
If your board has DMA facilities then initialise the txdma and
|
||||
rxdma fields for the relevant channels. You must also allocate the
|
||||
ISA DMA channels and do any necessary board level initialisation
|
||||
to configure them. The low level driver will do the Z8530 and
|
||||
DMA controller programming but not board specific magic.
|
||||
</para>
|
||||
<para>
|
||||
Having initialised the device you can then call
|
||||
<function>z8530_init</function>. This will probe the chip and
|
||||
reset it into a known state. An identification sequence is then
|
||||
run to identify the chip type. If the checks fail to pass the
|
||||
function returns a non zero error code. Typically this indicates
|
||||
that the port given is not valid. After this call the
|
||||
type field of the z8530_dev structure is initialised to either
|
||||
Z8530, Z85C30 or Z85230 according to the chip found.
|
||||
</para>
|
||||
<para>
|
||||
Once you have called z8530_init you can also make use of the utility
|
||||
function <function>z8530_describe</function>. This provides a
|
||||
consistent reporting format for the Z8530 devices, and allows all
|
||||
the drivers to provide consistent reporting.
|
||||
</para>
|
||||
</chapter>
|
||||
|
||||
<chapter>
|
||||
<title>Attaching Network Interfaces</title>
|
||||
<para>
|
||||
If you wish to use the network interface facilities of the driver,
|
||||
then you need to attach a network device to each channel that is
|
||||
present and in use. In addition to use the SyncPPP and Cisco HDLC
|
||||
you need to follow some additional plumbing rules. They may seem
|
||||
complex but a look at the example hostess_sv11 driver should
|
||||
reassure you.
|
||||
</para>
|
||||
<para>
|
||||
The network device used for each channel should be pointed to by
|
||||
the netdevice field of each channel. The dev-> priv field of the
|
||||
network device points to your private data - you will need to be
|
||||
able to find your ppp device from this. In addition to use the
|
||||
sync ppp layer the private data must start with a void * pointer
|
||||
to the syncppp structures.
|
||||
</para>
|
||||
<para>
|
||||
The way most drivers approach this particular problem is to
|
||||
create a structure holding the Z8530 device definition and
|
||||
put that and the syncppp pointer into the private field of
|
||||
the network device. The network device fields of the channels
|
||||
then point back to the network devices. The ppp_device can also
|
||||
be put in the private structure conveniently.
|
||||
</para>
|
||||
<para>
|
||||
If you wish to use the synchronous ppp then you need to attach
|
||||
the syncppp layer to the network device. You should do this before
|
||||
you register the network device. The
|
||||
<function>sppp_attach</function> requires that the first void *
|
||||
pointer in your private data is pointing to an empty struct
|
||||
ppp_device. The function fills in the initial data for the
|
||||
ppp/hdlc layer.
|
||||
</para>
|
||||
<para>
|
||||
Before you register your network device you will also need to
|
||||
provide suitable handlers for most of the network device callbacks.
|
||||
See the network device documentation for more details on this.
|
||||
</para>
|
||||
</chapter>
|
||||
|
||||
<chapter>
|
||||
<title>Configuring And Activating The Port</title>
|
||||
<para>
|
||||
The Z85230 driver provides helper functions and tables to load the
|
||||
port registers on the Z8530 chips. When programming the register
|
||||
settings for a channel be aware that the documentation recommends
|
||||
initialisation orders. Strange things happen when these are not
|
||||
followed.
|
||||
</para>
|
||||
<para>
|
||||
<function>z8530_channel_load</function> takes an array of
|
||||
pairs of initialisation values in an array of u8 type. The first
|
||||
value is the Z8530 register number. Add 16 to indicate the alternate
|
||||
register bank on the later chips. The array is terminated by a 255.
|
||||
</para>
|
||||
<para>
|
||||
The driver provides a pair of public tables. The
|
||||
z8530_hdlc_kilostream table is for the UK 'Kilostream' service and
|
||||
also happens to cover most other end host configurations. The
|
||||
z8530_hdlc_kilostream_85230 table is the same configuration using
|
||||
the enhancements of the 85230 chip. The configuration loaded is
|
||||
standard NRZ encoded synchronous data with HDLC bitstuffing. All
|
||||
of the timing is taken from the other end of the link.
|
||||
</para>
|
||||
<para>
|
||||
When writing your own tables be aware that the driver internally
|
||||
tracks register values. It may need to reload values. You should
|
||||
therefore be sure to set registers 1-7, 9-11, 14 and 15 in all
|
||||
configurations. Where the register settings depend on DMA selection
|
||||
the driver will update the bits itself when you open or close.
|
||||
Loading a new table with the interface open is not recommended.
|
||||
</para>
|
||||
<para>
|
||||
There are three standard configurations supported by the core
|
||||
code. In PIO mode the interface is programmed up to use
|
||||
interrupt driven PIO. This places high demands on the host processor
|
||||
to avoid latency. The driver is written to take account of latency
|
||||
issues but it cannot avoid latencies caused by other drivers,
|
||||
notably IDE in PIO mode. Because the drivers allocate buffers you
|
||||
must also prevent MTU changes while the port is open.
|
||||
</para>
|
||||
<para>
|
||||
Once the port is open it will call the rx_function of each channel
|
||||
whenever a completed packet arrived. This is invoked from
|
||||
interrupt context and passes you the channel and a network
|
||||
buffer (struct sk_buff) holding the data. The data includes
|
||||
the CRC bytes so most users will want to trim the last two
|
||||
bytes before processing the data. This function is very timing
|
||||
critical. When you wish to simply discard data the support
|
||||
code provides the function <function>z8530_null_rx</function>
|
||||
to discard the data.
|
||||
</para>
|
||||
<para>
|
||||
To active PIO mode sending and receiving the <function>
|
||||
z8530_sync_open</function> is called. This expects to be passed
|
||||
the network device and the channel. Typically this is called from
|
||||
your network device open callback. On a failure a non zero error
|
||||
status is returned. The <function>z8530_sync_close</function>
|
||||
function shuts down a PIO channel. This must be done before the
|
||||
channel is opened again and before the driver shuts down
|
||||
and unloads.
|
||||
</para>
|
||||
<para>
|
||||
The ideal mode of operation is dual channel DMA mode. Here the
|
||||
kernel driver will configure the board for DMA in both directions.
|
||||
The driver also handles ISA DMA issues such as controller
|
||||
programming and the memory range limit for you. This mode is
|
||||
activated by calling the <function>z8530_sync_dma_open</function>
|
||||
function. On failure a non zero error value is returned.
|
||||
Once this mode is activated it can be shut down by calling the
|
||||
<function>z8530_sync_dma_close</function>. You must call the close
|
||||
function matching the open mode you used.
|
||||
</para>
|
||||
<para>
|
||||
The final supported mode uses a single DMA channel to drive the
|
||||
transmit side. As the Z85C30 has a larger FIFO on the receive
|
||||
channel this tends to increase the maximum speed a little.
|
||||
This is activated by calling the <function>z8530_sync_txdma_open
|
||||
</function>. This returns a non zero error code on failure. The
|
||||
<function>z8530_sync_txdma_close</function> function closes down
|
||||
the Z8530 interface from this mode.
|
||||
</para>
|
||||
</chapter>
|
||||
|
||||
<chapter>
|
||||
<title>Network Layer Functions</title>
|
||||
<para>
|
||||
The Z8530 layer provides functions to queue packets for
|
||||
transmission. The driver internally buffers the frame currently
|
||||
being transmitted and one further frame (in order to keep back
|
||||
to back transmission running). Any further buffering is up to
|
||||
the caller.
|
||||
</para>
|
||||
<para>
|
||||
The function <function>z8530_queue_xmit</function> takes a network
|
||||
buffer in sk_buff format and queues it for transmission. The
|
||||
caller must provide the entire packet with the exception of the
|
||||
bitstuffing and CRC. This is normally done by the caller via
|
||||
the syncppp interface layer. It returns 0 if the buffer has been
|
||||
queued and non zero values for queue full. If the function accepts
|
||||
the buffer it becomes property of the Z8530 layer and the caller
|
||||
should not free it.
|
||||
</para>
|
||||
<para>
|
||||
The function <function>z8530_get_stats</function> returns a pointer
|
||||
to an internally maintained per interface statistics block. This
|
||||
provides most of the interface code needed to implement the network
|
||||
layer get_stats callback.
|
||||
</para>
|
||||
</chapter>
|
||||
|
||||
<chapter>
|
||||
<title>Porting The Z8530 Driver</title>
|
||||
<para>
|
||||
The Z8530 driver is written to be portable. In DMA mode it makes
|
||||
assumptions about the use of ISA DMA. These are probably warranted
|
||||
in most cases as the Z85230 in particular was designed to glue to PC
|
||||
type machines. The PIO mode makes no real assumptions.
|
||||
</para>
|
||||
<para>
|
||||
Should you need to retarget the Z8530 driver to another architecture
|
||||
the only code that should need changing are the port I/O functions.
|
||||
At the moment these assume PC I/O port accesses. This may not be
|
||||
appropriate for all platforms. Replacing
|
||||
<function>z8530_read_port</function> and <function>z8530_write_port
|
||||
</function> is intended to be all that is required to port this
|
||||
driver layer.
|
||||
</para>
|
||||
</chapter>
|
||||
|
||||
<chapter id="bugs">
|
||||
<title>Known Bugs And Assumptions</title>
|
||||
<para>
|
||||
<variablelist>
|
||||
<varlistentry><term>Interrupt Locking</term>
|
||||
<listitem>
|
||||
<para>
|
||||
The locking in the driver is done via the global cli/sti lock. This
|
||||
makes for relatively poor SMP performance. Switching this to use a
|
||||
per device spin lock would probably materially improve performance.
|
||||
</para>
|
||||
</listitem></varlistentry>
|
||||
|
||||
<varlistentry><term>Occasional Failures</term>
|
||||
<listitem>
|
||||
<para>
|
||||
We have reports of occasional failures when run for very long
|
||||
periods of time and the driver starts to receive junk frames. At
|
||||
the moment the cause of this is not clear.
|
||||
</para>
|
||||
</listitem></varlistentry>
|
||||
</variablelist>
|
||||
|
||||
</para>
|
||||
</chapter>
|
||||
|
||||
<chapter id="pubfunctions">
|
||||
<title>Public Functions Provided</title>
|
||||
!Edrivers/net/wan/z85230.c
|
||||
</chapter>
|
||||
|
||||
<chapter id="intfunctions">
|
||||
<title>Internal Functions</title>
|
||||
!Idrivers/net/wan/z85230.c
|
||||
</chapter>
|
||||
|
||||
</book>
|
||||
661
Documentation/HOWTO
Normal file
661
Documentation/HOWTO
Normal file
@ -0,0 +1,661 @@
|
||||
HOWTO do Linux kernel development
|
||||
---------------------------------
|
||||
|
||||
This is the be-all, end-all document on this topic. It contains
|
||||
instructions on how to become a Linux kernel developer and how to learn
|
||||
to work with the Linux kernel development community. It tries to not
|
||||
contain anything related to the technical aspects of kernel programming,
|
||||
but will help point you in the right direction for that.
|
||||
|
||||
If anything in this document becomes out of date, please send in patches
|
||||
to the maintainer of this file, who is listed at the bottom of the
|
||||
document.
|
||||
|
||||
|
||||
Introduction
|
||||
------------
|
||||
|
||||
So, you want to learn how to become a Linux kernel developer? Or you
|
||||
have been told by your manager, "Go write a Linux driver for this
|
||||
device." This document's goal is to teach you everything you need to
|
||||
know to achieve this by describing the process you need to go through,
|
||||
and hints on how to work with the community. It will also try to
|
||||
explain some of the reasons why the community works like it does.
|
||||
|
||||
The kernel is written mostly in C, with some architecture-dependent
|
||||
parts written in assembly. A good understanding of C is required for
|
||||
kernel development. Assembly (any architecture) is not required unless
|
||||
you plan to do low-level development for that architecture. Though they
|
||||
are not a good substitute for a solid C education and/or years of
|
||||
experience, the following books are good for, if anything, reference:
|
||||
- "The C Programming Language" by Kernighan and Ritchie [Prentice Hall]
|
||||
- "Practical C Programming" by Steve Oualline [O'Reilly]
|
||||
- "C: A Reference Manual" by Harbison and Steele [Prentice Hall]
|
||||
|
||||
The kernel is written using GNU C and the GNU toolchain. While it
|
||||
adheres to the ISO C89 standard, it uses a number of extensions that are
|
||||
not featured in the standard. The kernel is a freestanding C
|
||||
environment, with no reliance on the standard C library, so some
|
||||
portions of the C standard are not supported. Arbitrary long long
|
||||
divisions and floating point are not allowed. It can sometimes be
|
||||
difficult to understand the assumptions the kernel has on the toolchain
|
||||
and the extensions that it uses, and unfortunately there is no
|
||||
definitive reference for them. Please check the gcc info pages (`info
|
||||
gcc`) for some information on them.
|
||||
|
||||
Please remember that you are trying to learn how to work with the
|
||||
existing development community. It is a diverse group of people, with
|
||||
high standards for coding, style and procedure. These standards have
|
||||
been created over time based on what they have found to work best for
|
||||
such a large and geographically dispersed team. Try to learn as much as
|
||||
possible about these standards ahead of time, as they are well
|
||||
documented; do not expect people to adapt to you or your company's way
|
||||
of doing things.
|
||||
|
||||
|
||||
Legal Issues
|
||||
------------
|
||||
|
||||
The Linux kernel source code is released under the GPL. Please see the
|
||||
file, COPYING, in the main directory of the source tree, for details on
|
||||
the license. If you have further questions about the license, please
|
||||
contact a lawyer, and do not ask on the Linux kernel mailing list. The
|
||||
people on the mailing lists are not lawyers, and you should not rely on
|
||||
their statements on legal matters.
|
||||
|
||||
For common questions and answers about the GPL, please see:
|
||||
http://www.gnu.org/licenses/gpl-faq.html
|
||||
|
||||
|
||||
Documentation
|
||||
------------
|
||||
|
||||
The Linux kernel source tree has a large range of documents that are
|
||||
invaluable for learning how to interact with the kernel community. When
|
||||
new features are added to the kernel, it is recommended that new
|
||||
documentation files are also added which explain how to use the feature.
|
||||
When a kernel change causes the interface that the kernel exposes to
|
||||
userspace to change, it is recommended that you send the information or
|
||||
a patch to the manual pages explaining the change to the manual pages
|
||||
maintainer at mtk-manpages@gmx.net.
|
||||
|
||||
Here is a list of files that are in the kernel source tree that are
|
||||
required reading:
|
||||
README
|
||||
This file gives a short background on the Linux kernel and describes
|
||||
what is necessary to do to configure and build the kernel. People
|
||||
who are new to the kernel should start here.
|
||||
|
||||
Documentation/Changes
|
||||
This file gives a list of the minimum levels of various software
|
||||
packages that are necessary to build and run the kernel
|
||||
successfully.
|
||||
|
||||
Documentation/CodingStyle
|
||||
This describes the Linux kernel coding style, and some of the
|
||||
rationale behind it. All new code is expected to follow the
|
||||
guidelines in this document. Most maintainers will only accept
|
||||
patches if these rules are followed, and many people will only
|
||||
review code if it is in the proper style.
|
||||
|
||||
Documentation/SubmittingPatches
|
||||
Documentation/SubmittingDrivers
|
||||
These files describe in explicit detail how to successfully create
|
||||
and send a patch, including (but not limited to):
|
||||
- Email contents
|
||||
- Email format
|
||||
- Who to send it to
|
||||
Following these rules will not guarantee success (as all patches are
|
||||
subject to scrutiny for content and style), but not following them
|
||||
will almost always prevent it.
|
||||
|
||||
Other excellent descriptions of how to create patches properly are:
|
||||
"The Perfect Patch"
|
||||
http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt
|
||||
"Linux kernel patch submission format"
|
||||
http://linux.yyz.us/patch-format.html
|
||||
|
||||
Documentation/stable_api_nonsense.txt
|
||||
This file describes the rationale behind the conscious decision to
|
||||
not have a stable API within the kernel, including things like:
|
||||
- Subsystem shim-layers (for compatibility?)
|
||||
- Driver portability between Operating Systems.
|
||||
- Mitigating rapid change within the kernel source tree (or
|
||||
preventing rapid change)
|
||||
This document is crucial for understanding the Linux development
|
||||
philosophy and is very important for people moving to Linux from
|
||||
development on other Operating Systems.
|
||||
|
||||
Documentation/SecurityBugs
|
||||
If you feel you have found a security problem in the Linux kernel,
|
||||
please follow the steps in this document to help notify the kernel
|
||||
developers, and help solve the issue.
|
||||
|
||||
Documentation/ManagementStyle
|
||||
This document describes how Linux kernel maintainers operate and the
|
||||
shared ethos behind their methodologies. This is important reading
|
||||
for anyone new to kernel development (or anyone simply curious about
|
||||
it), as it resolves a lot of common misconceptions and confusion
|
||||
about the unique behavior of kernel maintainers.
|
||||
|
||||
Documentation/stable_kernel_rules.txt
|
||||
This file describes the rules on how the stable kernel releases
|
||||
happen, and what to do if you want to get a change into one of these
|
||||
releases.
|
||||
|
||||
Documentation/kernel-docs.txt
|
||||
A list of external documentation that pertains to kernel
|
||||
development. Please consult this list if you do not find what you
|
||||
are looking for within the in-kernel documentation.
|
||||
|
||||
Documentation/applying-patches.txt
|
||||
A good introduction describing exactly what a patch is and how to
|
||||
apply it to the different development branches of the kernel.
|
||||
|
||||
The kernel also has a large number of documents that can be
|
||||
automatically generated from the source code itself. This includes a
|
||||
full description of the in-kernel API, and rules on how to handle
|
||||
locking properly. The documents will be created in the
|
||||
Documentation/DocBook/ directory and can be generated as PDF,
|
||||
Postscript, HTML, and man pages by running:
|
||||
make pdfdocs
|
||||
make psdocs
|
||||
make htmldocs
|
||||
make mandocs
|
||||
respectively from the main kernel source directory.
|
||||
|
||||
|
||||
Becoming A Kernel Developer
|
||||
---------------------------
|
||||
|
||||
If you do not know anything about Linux kernel development, you should
|
||||
look at the Linux KernelNewbies project:
|
||||
http://kernelnewbies.org
|
||||
It consists of a helpful mailing list where you can ask almost any type
|
||||
of basic kernel development question (make sure to search the archives
|
||||
first, before asking something that has already been answered in the
|
||||
past.) It also has an IRC channel that you can use to ask questions in
|
||||
real-time, and a lot of helpful documentation that is useful for
|
||||
learning about Linux kernel development.
|
||||
|
||||
The website has basic information about code organization, subsystems,
|
||||
and current projects (both in-tree and out-of-tree). It also describes
|
||||
some basic logistical information, like how to compile a kernel and
|
||||
apply a patch.
|
||||
|
||||
If you do not know where you want to start, but you want to look for
|
||||
some task to start doing to join into the kernel development community,
|
||||
go to the Linux Kernel Janitor's project:
|
||||
http://janitor.kernelnewbies.org/
|
||||
It is a great place to start. It describes a list of relatively simple
|
||||
problems that need to be cleaned up and fixed within the Linux kernel
|
||||
source tree. Working with the developers in charge of this project, you
|
||||
will learn the basics of getting your patch into the Linux kernel tree,
|
||||
and possibly be pointed in the direction of what to go work on next, if
|
||||
you do not already have an idea.
|
||||
|
||||
If you already have a chunk of code that you want to put into the kernel
|
||||
tree, but need some help getting it in the proper form, the
|
||||
kernel-mentors project was created to help you out with this. It is a
|
||||
mailing list, and can be found at:
|
||||
http://selenic.com/mailman/listinfo/kernel-mentors
|
||||
|
||||
Before making any actual modifications to the Linux kernel code, it is
|
||||
imperative to understand how the code in question works. For this
|
||||
purpose, nothing is better than reading through it directly (most tricky
|
||||
bits are commented well), perhaps even with the help of specialized
|
||||
tools. One such tool that is particularly recommended is the Linux
|
||||
Cross-Reference project, which is able to present source code in a
|
||||
self-referential, indexed webpage format. An excellent up-to-date
|
||||
repository of the kernel code may be found at:
|
||||
http://sosdg.org/~coywolf/lxr/
|
||||
|
||||
|
||||
The development process
|
||||
-----------------------
|
||||
|
||||
Linux kernel development process currently consists of a few different
|
||||
main kernel "branches" and lots of different subsystem-specific kernel
|
||||
branches. These different branches are:
|
||||
- main 2.6.x kernel tree
|
||||
- 2.6.x.y -stable kernel tree
|
||||
- 2.6.x -git kernel patches
|
||||
- 2.6.x -mm kernel patches
|
||||
- subsystem specific kernel trees and patches
|
||||
|
||||
2.6.x kernel tree
|
||||
-----------------
|
||||
2.6.x kernels are maintained by Linus Torvalds, and can be found on
|
||||
kernel.org in the pub/linux/kernel/v2.6/ directory. Its development
|
||||
process is as follows:
|
||||
- As soon as a new kernel is released a two weeks window is open,
|
||||
during this period of time maintainers can submit big diffs to
|
||||
Linus, usually the patches that have already been included in the
|
||||
-mm kernel for a few weeks. The preferred way to submit big changes
|
||||
is using git (the kernel's source management tool, more information
|
||||
can be found at http://git.or.cz/) but plain patches are also just
|
||||
fine.
|
||||
- After two weeks a -rc1 kernel is released it is now possible to push
|
||||
only patches that do not include new features that could affect the
|
||||
stability of the whole kernel. Please note that a whole new driver
|
||||
(or filesystem) might be accepted after -rc1 because there is no
|
||||
risk of causing regressions with such a change as long as the change
|
||||
is self-contained and does not affect areas outside of the code that
|
||||
is being added. git can be used to send patches to Linus after -rc1
|
||||
is released, but the patches need to also be sent to a public
|
||||
mailing list for review.
|
||||
- A new -rc is released whenever Linus deems the current git tree to
|
||||
be in a reasonably sane state adequate for testing. The goal is to
|
||||
release a new -rc kernel every week.
|
||||
- Process continues until the kernel is considered "ready", the
|
||||
process should last around 6 weeks.
|
||||
|
||||
It is worth mentioning what Andrew Morton wrote on the linux-kernel
|
||||
mailing list about kernel releases:
|
||||
"Nobody knows when a kernel will be released, because it's
|
||||
released according to perceived bug status, not according to a
|
||||
preconceived timeline."
|
||||
|
||||
2.6.x.y -stable kernel tree
|
||||
---------------------------
|
||||
Kernels with 4 digit versions are -stable kernels. They contain
|
||||
relatively small and critical fixes for security problems or significant
|
||||
regressions discovered in a given 2.6.x kernel.
|
||||
|
||||
This is the recommended branch for users who want the most recent stable
|
||||
kernel and are not interested in helping test development/experimental
|
||||
versions.
|
||||
|
||||
If no 2.6.x.y kernel is available, then the highest numbered 2.6.x
|
||||
kernel is the current stable kernel.
|
||||
|
||||
2.6.x.y are maintained by the "stable" team <stable@kernel.org>, and are
|
||||
released almost every other week.
|
||||
|
||||
The file Documentation/stable_kernel_rules.txt in the kernel tree
|
||||
documents what kinds of changes are acceptable for the -stable tree, and
|
||||
how the release process works.
|
||||
|
||||
2.6.x -git patches
|
||||
------------------
|
||||
These are daily snapshots of Linus' kernel tree which are managed in a
|
||||
git repository (hence the name.) These patches are usually released
|
||||
daily and represent the current state of Linus' tree. They are more
|
||||
experimental than -rc kernels since they are generated automatically
|
||||
without even a cursory glance to see if they are sane.
|
||||
|
||||
2.6.x -mm kernel patches
|
||||
------------------------
|
||||
These are experimental kernel patches released by Andrew Morton. Andrew
|
||||
takes all of the different subsystem kernel trees and patches and mushes
|
||||
them together, along with a lot of patches that have been plucked from
|
||||
the linux-kernel mailing list. This tree serves as a proving ground for
|
||||
new features and patches. Once a patch has proved its worth in -mm for
|
||||
a while Andrew or the subsystem maintainer pushes it on to Linus for
|
||||
inclusion in mainline.
|
||||
|
||||
It is heavily encouraged that all new patches get tested in the -mm tree
|
||||
before they are sent to Linus for inclusion in the main kernel tree.
|
||||
|
||||
These kernels are not appropriate for use on systems that are supposed
|
||||
to be stable and they are more risky to run than any of the other
|
||||
branches.
|
||||
|
||||
If you wish to help out with the kernel development process, please test
|
||||
and use these kernel releases and provide feedback to the linux-kernel
|
||||
mailing list if you have any problems, and if everything works properly.
|
||||
|
||||
In addition to all the other experimental patches, these kernels usually
|
||||
also contain any changes in the mainline -git kernels available at the
|
||||
time of release.
|
||||
|
||||
The -mm kernels are not released on a fixed schedule, but usually a few
|
||||
-mm kernels are released in between each -rc kernel (1 to 3 is common).
|
||||
|
||||
Subsystem Specific kernel trees and patches
|
||||
-------------------------------------------
|
||||
A number of the different kernel subsystem developers expose their
|
||||
development trees so that others can see what is happening in the
|
||||
different areas of the kernel. These trees are pulled into the -mm
|
||||
kernel releases as described above.
|
||||
|
||||
Here is a list of some of the different kernel trees available:
|
||||
git trees:
|
||||
- Kbuild development tree, Sam Ravnborg <sam@ravnborg.org>
|
||||
kernel.org:/pub/scm/linux/kernel/git/sam/kbuild.git
|
||||
|
||||
- ACPI development tree, Len Brown <len.brown@intel.com>
|
||||
kernel.org:/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6.git
|
||||
|
||||
- Block development tree, Jens Axboe <axboe@suse.de>
|
||||
kernel.org:/pub/scm/linux/kernel/git/axboe/linux-2.6-block.git
|
||||
|
||||
- DRM development tree, Dave Airlie <airlied@linux.ie>
|
||||
kernel.org:/pub/scm/linux/kernel/git/airlied/drm-2.6.git
|
||||
|
||||
- ia64 development tree, Tony Luck <tony.luck@intel.com>
|
||||
kernel.org:/pub/scm/linux/kernel/git/aegl/linux-2.6.git
|
||||
|
||||
- ieee1394 development tree, Jody McIntyre <scjody@modernduck.com>
|
||||
kernel.org:/pub/scm/linux/kernel/git/scjody/ieee1394.git
|
||||
|
||||
- infiniband, Roland Dreier <rolandd@cisco.com>
|
||||
kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git
|
||||
|
||||
- libata, Jeff Garzik <jgarzik@pobox.com>
|
||||
kernel.org:/pub/scm/linux/kernel/git/jgarzik/libata-dev.git
|
||||
|
||||
- network drivers, Jeff Garzik <jgarzik@pobox.com>
|
||||
kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git
|
||||
|
||||
- pcmcia, Dominik Brodowski <linux@dominikbrodowski.net>
|
||||
kernel.org:/pub/scm/linux/kernel/git/brodo/pcmcia-2.6.git
|
||||
|
||||
- SCSI, James Bottomley <James.Bottomley@SteelEye.com>
|
||||
kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6.git
|
||||
|
||||
Other git kernel trees can be found listed at http://kernel.org/git
|
||||
|
||||
quilt trees:
|
||||
- USB, PCI, Driver Core, and I2C, Greg Kroah-Hartman <gregkh@suse.de>
|
||||
kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/
|
||||
- x86-64, partly i386, Andi Kleen <ak@suse.de>
|
||||
ftp.firstfloor.org:/pub/ak/x86_64/quilt/
|
||||
|
||||
Bug Reporting
|
||||
-------------
|
||||
|
||||
bugzilla.kernel.org is where the Linux kernel developers track kernel
|
||||
bugs. Users are encouraged to report all bugs that they find in this
|
||||
tool. For details on how to use the kernel bugzilla, please see:
|
||||
http://test.kernel.org/bugzilla/faq.html
|
||||
|
||||
The file REPORTING-BUGS in the main kernel source directory has a good
|
||||
template for how to report a possible kernel bug, and details what kind
|
||||
of information is needed by the kernel developers to help track down the
|
||||
problem.
|
||||
|
||||
|
||||
Managing bug reports
|
||||
--------------------
|
||||
|
||||
One of the best ways to put into practice your hacking skills is by fixing
|
||||
bugs reported by other people. Not only you will help to make the kernel
|
||||
more stable, you'll learn to fix real world problems and you will improve
|
||||
your skills, and other developers will be aware of your presence. Fixing
|
||||
bugs is one of the best ways to earn merit amongst the developers, because
|
||||
not many people like wasting time fixing other people's bugs.
|
||||
|
||||
To work in the already reported bug reports, go to http://bugzilla.kernel.org.
|
||||
If you want to be advised of the future bug reports, you can subscribe to the
|
||||
bugme-new mailing list (only new bug reports are mailed here) or to the
|
||||
bugme-janitor mailing list (every change in the bugzilla is mailed here)
|
||||
|
||||
http://lists.osdl.org/mailman/listinfo/bugme-new
|
||||
http://lists.osdl.org/mailman/listinfo/bugme-janitors
|
||||
|
||||
|
||||
|
||||
Managing bug reports
|
||||
--------------------
|
||||
|
||||
One of the best ways to put into practice your hacking skills is by fixing
|
||||
bugs reported by other people. Not only you will help to make the kernel
|
||||
more stable, you'll learn to fix real world problems and you will improve
|
||||
your skills, and other developers will be aware of your presence. Fixing
|
||||
bugs is one of the best ways to get merits among other developers, because
|
||||
not many people like wasting time fixing other people's bugs.
|
||||
|
||||
To work in the already reported bug reports, go to http://bugzilla.kernel.org.
|
||||
If you want to be advised of the future bug reports, you can subscribe to the
|
||||
bugme-new mailing list (only new bug reports are mailed here) or to the
|
||||
bugme-janitor mailing list (every change in the bugzilla is mailed here)
|
||||
|
||||
http://lists.osdl.org/mailman/listinfo/bugme-new
|
||||
http://lists.osdl.org/mailman/listinfo/bugme-janitors
|
||||
|
||||
|
||||
|
||||
Mailing lists
|
||||
-------------
|
||||
|
||||
As some of the above documents describe, the majority of the core kernel
|
||||
developers participate on the Linux Kernel Mailing list. Details on how
|
||||
to subscribe and unsubscribe from the list can be found at:
|
||||
http://vger.kernel.org/vger-lists.html#linux-kernel
|
||||
There are archives of the mailing list on the web in many different
|
||||
places. Use a search engine to find these archives. For example:
|
||||
http://dir.gmane.org/gmane.linux.kernel
|
||||
It is highly recommended that you search the archives about the topic
|
||||
you want to bring up, before you post it to the list. A lot of things
|
||||
already discussed in detail are only recorded at the mailing list
|
||||
archives.
|
||||
|
||||
Most of the individual kernel subsystems also have their own separate
|
||||
mailing list where they do their development efforts. See the
|
||||
MAINTAINERS file for a list of what these lists are for the different
|
||||
groups.
|
||||
|
||||
Many of the lists are hosted on kernel.org. Information on them can be
|
||||
found at:
|
||||
http://vger.kernel.org/vger-lists.html
|
||||
|
||||
Please remember to follow good behavioral habits when using the lists.
|
||||
Though a bit cheesy, the following URL has some simple guidelines for
|
||||
interacting with the list (or any list):
|
||||
http://www.albion.com/netiquette/
|
||||
|
||||
If multiple people respond to your mail, the CC: list of recipients may
|
||||
get pretty large. Don't remove anybody from the CC: list without a good
|
||||
reason, or don't reply only to the list address. Get used to receiving the
|
||||
mail twice, one from the sender and the one from the list, and don't try
|
||||
to tune that by adding fancy mail-headers, people will not like it.
|
||||
|
||||
Remember to keep the context and the attribution of your replies intact,
|
||||
keep the "John Kernelhacker wrote ...:" lines at the top of your reply, and
|
||||
add your statements between the individual quoted sections instead of
|
||||
writing at the top of the mail.
|
||||
|
||||
If you add patches to your mail, make sure they are plain readable text
|
||||
as stated in Documentation/SubmittingPatches. Kernel developers don't
|
||||
want to deal with attachments or compressed patches; they may want
|
||||
to comment on individual lines of your patch, which works only that way.
|
||||
Make sure you use a mail program that does not mangle spaces and tab
|
||||
characters. A good first test is to send the mail to yourself and try
|
||||
to apply your own patch by yourself. If that doesn't work, get your
|
||||
mail program fixed or change it until it works.
|
||||
|
||||
Above all, please remember to show respect to other subscribers.
|
||||
|
||||
|
||||
Working with the community
|
||||
--------------------------
|
||||
|
||||
The goal of the kernel community is to provide the best possible kernel
|
||||
there is. When you submit a patch for acceptance, it will be reviewed
|
||||
on its technical merits and those alone. So, what should you be
|
||||
expecting?
|
||||
- criticism
|
||||
- comments
|
||||
- requests for change
|
||||
- requests for justification
|
||||
- silence
|
||||
|
||||
Remember, this is part of getting your patch into the kernel. You have
|
||||
to be able to take criticism and comments about your patches, evaluate
|
||||
them at a technical level and either rework your patches or provide
|
||||
clear and concise reasoning as to why those changes should not be made.
|
||||
If there are no responses to your posting, wait a few days and try
|
||||
again, sometimes things get lost in the huge volume.
|
||||
|
||||
What should you not do?
|
||||
- expect your patch to be accepted without question
|
||||
- become defensive
|
||||
- ignore comments
|
||||
- resubmit the patch without making any of the requested changes
|
||||
|
||||
In a community that is looking for the best technical solution possible,
|
||||
there will always be differing opinions on how beneficial a patch is.
|
||||
You have to be cooperative, and willing to adapt your idea to fit within
|
||||
the kernel. Or at least be willing to prove your idea is worth it.
|
||||
Remember, being wrong is acceptable as long as you are willing to work
|
||||
toward a solution that is right.
|
||||
|
||||
It is normal that the answers to your first patch might simply be a list
|
||||
of a dozen things you should correct. This does _not_ imply that your
|
||||
patch will not be accepted, and it is _not_ meant against you
|
||||
personally. Simply correct all issues raised against your patch and
|
||||
resend it.
|
||||
|
||||
|
||||
Differences between the kernel community and corporate structures
|
||||
-----------------------------------------------------------------
|
||||
|
||||
The kernel community works differently than most traditional corporate
|
||||
development environments. Here are a list of things that you can try to
|
||||
do to try to avoid problems:
|
||||
Good things to say regarding your proposed changes:
|
||||
- "This solves multiple problems."
|
||||
- "This deletes 2000 lines of code."
|
||||
- "Here is a patch that explains what I am trying to describe."
|
||||
- "I tested it on 5 different architectures..."
|
||||
- "Here is a series of small patches that..."
|
||||
- "This increases performance on typical machines..."
|
||||
|
||||
Bad things you should avoid saying:
|
||||
- "We did it this way in AIX/ptx/Solaris, so therefore it must be
|
||||
good..."
|
||||
- "I've being doing this for 20 years, so..."
|
||||
- "This is required for my company to make money"
|
||||
- "This is for our Enterprise product line."
|
||||
- "Here is my 1000 page design document that describes my idea"
|
||||
- "I've been working on this for 6 months..."
|
||||
- "Here's a 5000 line patch that..."
|
||||
- "I rewrote all of the current mess, and here it is..."
|
||||
- "I have a deadline, and this patch needs to be applied now."
|
||||
|
||||
Another way the kernel community is different than most traditional
|
||||
software engineering work environments is the faceless nature of
|
||||
interaction. One benefit of using email and irc as the primary forms of
|
||||
communication is the lack of discrimination based on gender or race.
|
||||
The Linux kernel work environment is accepting of women and minorities
|
||||
because all you are is an email address. The international aspect also
|
||||
helps to level the playing field because you can't guess gender based on
|
||||
a person's name. A man may be named Andrea and a woman may be named Pat.
|
||||
Most women who have worked in the Linux kernel and have expressed an
|
||||
opinion have had positive experiences.
|
||||
|
||||
The language barrier can cause problems for some people who are not
|
||||
comfortable with English. A good grasp of the language can be needed in
|
||||
order to get ideas across properly on mailing lists, so it is
|
||||
recommended that you check your emails to make sure they make sense in
|
||||
English before sending them.
|
||||
|
||||
|
||||
Break up your changes
|
||||
---------------------
|
||||
|
||||
The Linux kernel community does not gladly accept large chunks of code
|
||||
dropped on it all at once. The changes need to be properly introduced,
|
||||
discussed, and broken up into tiny, individual portions. This is almost
|
||||
the exact opposite of what companies are used to doing. Your proposal
|
||||
should also be introduced very early in the development process, so that
|
||||
you can receive feedback on what you are doing. It also lets the
|
||||
community feel that you are working with them, and not simply using them
|
||||
as a dumping ground for your feature. However, don't send 50 emails at
|
||||
one time to a mailing list, your patch series should be smaller than
|
||||
that almost all of the time.
|
||||
|
||||
The reasons for breaking things up are the following:
|
||||
|
||||
1) Small patches increase the likelihood that your patches will be
|
||||
applied, since they don't take much time or effort to verify for
|
||||
correctness. A 5 line patch can be applied by a maintainer with
|
||||
barely a second glance. However, a 500 line patch may take hours to
|
||||
review for correctness (the time it takes is exponentially
|
||||
proportional to the size of the patch, or something).
|
||||
|
||||
Small patches also make it very easy to debug when something goes
|
||||
wrong. It's much easier to back out patches one by one than it is
|
||||
to dissect a very large patch after it's been applied (and broken
|
||||
something).
|
||||
|
||||
2) It's important not only to send small patches, but also to rewrite
|
||||
and simplify (or simply re-order) patches before submitting them.
|
||||
|
||||
Here is an analogy from kernel developer Al Viro:
|
||||
"Think of a teacher grading homework from a math student. The
|
||||
teacher does not want to see the student's trials and errors
|
||||
before they came up with the solution. They want to see the
|
||||
cleanest, most elegant answer. A good student knows this, and
|
||||
would never submit her intermediate work before the final
|
||||
solution."
|
||||
|
||||
The same is true of kernel development. The maintainers and
|
||||
reviewers do not want to see the thought process behind the
|
||||
solution to the problem one is solving. They want to see a
|
||||
simple and elegant solution."
|
||||
|
||||
It may be challenging to keep the balance between presenting an elegant
|
||||
solution and working together with the community and discussing your
|
||||
unfinished work. Therefore it is good to get early in the process to
|
||||
get feedback to improve your work, but also keep your changes in small
|
||||
chunks that they may get already accepted, even when your whole task is
|
||||
not ready for inclusion now.
|
||||
|
||||
Also realize that it is not acceptable to send patches for inclusion
|
||||
that are unfinished and will be "fixed up later."
|
||||
|
||||
|
||||
Justify your change
|
||||
-------------------
|
||||
|
||||
Along with breaking up your patches, it is very important for you to let
|
||||
the Linux community know why they should add this change. New features
|
||||
must be justified as being needed and useful.
|
||||
|
||||
|
||||
Document your change
|
||||
--------------------
|
||||
|
||||
When sending in your patches, pay special attention to what you say in
|
||||
the text in your email. This information will become the ChangeLog
|
||||
information for the patch, and will be preserved for everyone to see for
|
||||
all time. It should describe the patch completely, containing:
|
||||
- why the change is necessary
|
||||
- the overall design approach in the patch
|
||||
- implementation details
|
||||
- testing results
|
||||
|
||||
For more details on what this should all look like, please see the
|
||||
ChangeLog section of the document:
|
||||
"The Perfect Patch"
|
||||
http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt
|
||||
|
||||
|
||||
|
||||
|
||||
All of these things are sometimes very hard to do. It can take years to
|
||||
perfect these practices (if at all). It's a continuous process of
|
||||
improvement that requires a lot of patience and determination. But
|
||||
don't give up, it's possible. Many have done it before, and each had to
|
||||
start exactly where you are now.
|
||||
|
||||
|
||||
|
||||
|
||||
----------
|
||||
Thanks to Paolo Ciarrocchi who allowed the "Development Process"
|
||||
(http://linux.tar.bz/articles/2.6-development_process) section
|
||||
to be based on text he had written, and to Randy Dunlap and Gerrit
|
||||
Huizenga for some of the list of things you should and should not say.
|
||||
Also thanks to Pat Mochel, Hanna Linder, Randy Dunlap, Kay Sievers,
|
||||
Vojtech Pavlik, Jan Kara, Josh Boyer, Kees Cook, Andrew Morton, Andi
|
||||
Kleen, Vadim Lobanov, Jesper Juhl, Adrian Bunk, Keri Harris, Frans Pop,
|
||||
David A. Wheeler, Junio Hamano, Michael Kerrisk, and Alex Shepard for
|
||||
their review, comments, and contributions. Without their help, this
|
||||
document would not have been possible.
|
||||
|
||||
|
||||
|
||||
Maintainer: Greg Kroah-Hartman <greg@kroah.com>
|
||||
208
Documentation/IO-mapping.txt
Normal file
208
Documentation/IO-mapping.txt
Normal file
@ -0,0 +1,208 @@
|
||||
[ NOTE: The virt_to_bus() and bus_to_virt() functions have been
|
||||
superseded by the functionality provided by the PCI DMA
|
||||
interface (see Documentation/DMA-mapping.txt). They continue
|
||||
to be documented below for historical purposes, but new code
|
||||
must not use them. --davidm 00/12/12 ]
|
||||
|
||||
[ This is a mail message in response to a query on IO mapping, thus the
|
||||
strange format for a "document" ]
|
||||
|
||||
The AHA-1542 is a bus-master device, and your patch makes the driver give the
|
||||
controller the physical address of the buffers, which is correct on x86
|
||||
(because all bus master devices see the physical memory mappings directly).
|
||||
|
||||
However, on many setups, there are actually _three_ different ways of looking
|
||||
at memory addresses, and in this case we actually want the third, the
|
||||
so-called "bus address".
|
||||
|
||||
Essentially, the three ways of addressing memory are (this is "real memory",
|
||||
that is, normal RAM--see later about other details):
|
||||
|
||||
- CPU untranslated. This is the "physical" address. Physical address
|
||||
0 is what the CPU sees when it drives zeroes on the memory bus.
|
||||
|
||||
- CPU translated address. This is the "virtual" address, and is
|
||||
completely internal to the CPU itself with the CPU doing the appropriate
|
||||
translations into "CPU untranslated".
|
||||
|
||||
- bus address. This is the address of memory as seen by OTHER devices,
|
||||
not the CPU. Now, in theory there could be many different bus
|
||||
addresses, with each device seeing memory in some device-specific way, but
|
||||
happily most hardware designers aren't actually actively trying to make
|
||||
things any more complex than necessary, so you can assume that all
|
||||
external hardware sees the memory the same way.
|
||||
|
||||
Now, on normal PCs the bus address is exactly the same as the physical
|
||||
address, and things are very simple indeed. However, they are that simple
|
||||
because the memory and the devices share the same address space, and that is
|
||||
not generally necessarily true on other PCI/ISA setups.
|
||||
|
||||
Now, just as an example, on the PReP (PowerPC Reference Platform), the
|
||||
CPU sees a memory map something like this (this is from memory):
|
||||
|
||||
0-2 GB "real memory"
|
||||
2 GB-3 GB "system IO" (inb/out and similar accesses on x86)
|
||||
3 GB-4 GB "IO memory" (shared memory over the IO bus)
|
||||
|
||||
Now, that looks simple enough. However, when you look at the same thing from
|
||||
the viewpoint of the devices, you have the reverse, and the physical memory
|
||||
address 0 actually shows up as address 2 GB for any IO master.
|
||||
|
||||
So when the CPU wants any bus master to write to physical memory 0, it
|
||||
has to give the master address 0x80000000 as the memory address.
|
||||
|
||||
So, for example, depending on how the kernel is actually mapped on the
|
||||
PPC, you can end up with a setup like this:
|
||||
|
||||
physical address: 0
|
||||
virtual address: 0xC0000000
|
||||
bus address: 0x80000000
|
||||
|
||||
where all the addresses actually point to the same thing. It's just seen
|
||||
through different translations..
|
||||
|
||||
Similarly, on the Alpha, the normal translation is
|
||||
|
||||
physical address: 0
|
||||
virtual address: 0xfffffc0000000000
|
||||
bus address: 0x40000000
|
||||
|
||||
(but there are also Alphas where the physical address and the bus address
|
||||
are the same).
|
||||
|
||||
Anyway, the way to look up all these translations, you do
|
||||
|
||||
#include <asm/io.h>
|
||||
|
||||
phys_addr = virt_to_phys(virt_addr);
|
||||
virt_addr = phys_to_virt(phys_addr);
|
||||
bus_addr = virt_to_bus(virt_addr);
|
||||
virt_addr = bus_to_virt(bus_addr);
|
||||
|
||||
Now, when do you need these?
|
||||
|
||||
You want the _virtual_ address when you are actually going to access that
|
||||
pointer from the kernel. So you can have something like this:
|
||||
|
||||
/*
|
||||
* this is the hardware "mailbox" we use to communicate with
|
||||
* the controller. The controller sees this directly.
|
||||
*/
|
||||
struct mailbox {
|
||||
__u32 status;
|
||||
__u32 bufstart;
|
||||
__u32 buflen;
|
||||
..
|
||||
} mbox;
|
||||
|
||||
unsigned char * retbuffer;
|
||||
|
||||
/* get the address from the controller */
|
||||
retbuffer = bus_to_virt(mbox.bufstart);
|
||||
switch (retbuffer[0]) {
|
||||
case STATUS_OK:
|
||||
...
|
||||
|
||||
on the other hand, you want the bus address when you have a buffer that
|
||||
you want to give to the controller:
|
||||
|
||||
/* ask the controller to read the sense status into "sense_buffer" */
|
||||
mbox.bufstart = virt_to_bus(&sense_buffer);
|
||||
mbox.buflen = sizeof(sense_buffer);
|
||||
mbox.status = 0;
|
||||
notify_controller(&mbox);
|
||||
|
||||
And you generally _never_ want to use the physical address, because you can't
|
||||
use that from the CPU (the CPU only uses translated virtual addresses), and
|
||||
you can't use it from the bus master.
|
||||
|
||||
So why do we care about the physical address at all? We do need the physical
|
||||
address in some cases, it's just not very often in normal code. The physical
|
||||
address is needed if you use memory mappings, for example, because the
|
||||
"remap_pfn_range()" mm function wants the physical address of the memory to
|
||||
be remapped as measured in units of pages, a.k.a. the pfn (the memory
|
||||
management layer doesn't know about devices outside the CPU, so it
|
||||
shouldn't need to know about "bus addresses" etc).
|
||||
|
||||
NOTE NOTE NOTE! The above is only one part of the whole equation. The above
|
||||
only talks about "real memory", that is, CPU memory (RAM).
|
||||
|
||||
There is a completely different type of memory too, and that's the "shared
|
||||
memory" on the PCI or ISA bus. That's generally not RAM (although in the case
|
||||
of a video graphics card it can be normal DRAM that is just used for a frame
|
||||
buffer), but can be things like a packet buffer in a network card etc.
|
||||
|
||||
This memory is called "PCI memory" or "shared memory" or "IO memory" or
|
||||
whatever, and there is only one way to access it: the readb/writeb and
|
||||
related functions. You should never take the address of such memory, because
|
||||
there is really nothing you can do with such an address: it's not
|
||||
conceptually in the same memory space as "real memory" at all, so you cannot
|
||||
just dereference a pointer. (Sadly, on x86 it _is_ in the same memory space,
|
||||
so on x86 it actually works to just deference a pointer, but it's not
|
||||
portable).
|
||||
|
||||
For such memory, you can do things like
|
||||
|
||||
- reading:
|
||||
/*
|
||||
* read first 32 bits from ISA memory at 0xC0000, aka
|
||||
* C000:0000 in DOS terms
|
||||
*/
|
||||
unsigned int signature = isa_readl(0xC0000);
|
||||
|
||||
- remapping and writing:
|
||||
/*
|
||||
* remap framebuffer PCI memory area at 0xFC000000,
|
||||
* size 1MB, so that we can access it: We can directly
|
||||
* access only the 640k-1MB area, so anything else
|
||||
* has to be remapped.
|
||||
*/
|
||||
char * baseptr = ioremap(0xFC000000, 1024*1024);
|
||||
|
||||
/* write a 'A' to the offset 10 of the area */
|
||||
writeb('A',baseptr+10);
|
||||
|
||||
/* unmap when we unload the driver */
|
||||
iounmap(baseptr);
|
||||
|
||||
- copying and clearing:
|
||||
/* get the 6-byte Ethernet address at ISA address E000:0040 */
|
||||
memcpy_fromio(kernel_buffer, 0xE0040, 6);
|
||||
/* write a packet to the driver */
|
||||
memcpy_toio(0xE1000, skb->data, skb->len);
|
||||
/* clear the frame buffer */
|
||||
memset_io(0xA0000, 0, 0x10000);
|
||||
|
||||
OK, that just about covers the basics of accessing IO portably. Questions?
|
||||
Comments? You may think that all the above is overly complex, but one day you
|
||||
might find yourself with a 500 MHz Alpha in front of you, and then you'll be
|
||||
happy that your driver works ;)
|
||||
|
||||
Note that kernel versions 2.0.x (and earlier) mistakenly called the
|
||||
ioremap() function "vremap()". ioremap() is the proper name, but I
|
||||
didn't think straight when I wrote it originally. People who have to
|
||||
support both can do something like:
|
||||
|
||||
/* support old naming silliness */
|
||||
#if LINUX_VERSION_CODE < 0x020100
|
||||
#define ioremap vremap
|
||||
#define iounmap vfree
|
||||
#endif
|
||||
|
||||
at the top of their source files, and then they can use the right names
|
||||
even on 2.0.x systems.
|
||||
|
||||
And the above sounds worse than it really is. Most real drivers really
|
||||
don't do all that complex things (or rather: the complexity is not so
|
||||
much in the actual IO accesses as in error handling and timeouts etc).
|
||||
It's generally not hard to fix drivers, and in many cases the code
|
||||
actually looks better afterwards:
|
||||
|
||||
unsigned long signature = *(unsigned int *) 0xC0000;
|
||||
vs
|
||||
unsigned long signature = readl(0xC0000);
|
||||
|
||||
I think the second version actually is more readable, no?
|
||||
|
||||
Linus
|
||||
|
||||
660
Documentation/IPMI.txt
Normal file
660
Documentation/IPMI.txt
Normal file
@ -0,0 +1,660 @@
|
||||
|
||||
The Linux IPMI Driver
|
||||
---------------------
|
||||
Corey Minyard
|
||||
<minyard@mvista.com>
|
||||
<minyard@acm.org>
|
||||
|
||||
The Intelligent Platform Management Interface, or IPMI, is a
|
||||
standard for controlling intelligent devices that monitor a system.
|
||||
It provides for dynamic discovery of sensors in the system and the
|
||||
ability to monitor the sensors and be informed when the sensor's
|
||||
values change or go outside certain boundaries. It also has a
|
||||
standardized database for field-replaceable units (FRUs) and a watchdog
|
||||
timer.
|
||||
|
||||
To use this, you need an interface to an IPMI controller in your
|
||||
system (called a Baseboard Management Controller, or BMC) and
|
||||
management software that can use the IPMI system.
|
||||
|
||||
This document describes how to use the IPMI driver for Linux. If you
|
||||
are not familiar with IPMI itself, see the web site at
|
||||
http://www.intel.com/design/servers/ipmi/index.htm. IPMI is a big
|
||||
subject and I can't cover it all here!
|
||||
|
||||
Configuration
|
||||
-------------
|
||||
|
||||
The Linux IPMI driver is modular, which means you have to pick several
|
||||
things to have it work right depending on your hardware. Most of
|
||||
these are available in the 'Character Devices' menu then the IPMI
|
||||
menu.
|
||||
|
||||
No matter what, you must pick 'IPMI top-level message handler' to use
|
||||
IPMI. What you do beyond that depends on your needs and hardware.
|
||||
|
||||
The message handler does not provide any user-level interfaces.
|
||||
Kernel code (like the watchdog) can still use it. If you need access
|
||||
from userland, you need to select 'Device interface for IPMI' if you
|
||||
want access through a device driver.
|
||||
|
||||
The driver interface depends on your hardware. If your system
|
||||
properly provides the SMBIOS info for IPMI, the driver will detect it
|
||||
and just work. If you have a board with a standard interface (These
|
||||
will generally be either "KCS", "SMIC", or "BT", consult your hardware
|
||||
manual), choose the 'IPMI SI handler' option. A driver also exists
|
||||
for direct I2C access to the IPMI management controller. Some boards
|
||||
support this, but it is unknown if it will work on every board. For
|
||||
this, choose 'IPMI SMBus handler', but be ready to try to do some
|
||||
figuring to see if it will work on your system if the SMBIOS/APCI
|
||||
information is wrong or not present. It is fairly safe to have both
|
||||
these enabled and let the drivers auto-detect what is present.
|
||||
|
||||
You should generally enable ACPI on your system, as systems with IPMI
|
||||
can have ACPI tables describing them.
|
||||
|
||||
If you have a standard interface and the board manufacturer has done
|
||||
their job correctly, the IPMI controller should be automatically
|
||||
detected (via ACPI or SMBIOS tables) and should just work. Sadly,
|
||||
many boards do not have this information. The driver attempts
|
||||
standard defaults, but they may not work. If you fall into this
|
||||
situation, you need to read the section below named 'The SI Driver' or
|
||||
"The SMBus Driver" on how to hand-configure your system.
|
||||
|
||||
IPMI defines a standard watchdog timer. You can enable this with the
|
||||
'IPMI Watchdog Timer' config option. If you compile the driver into
|
||||
the kernel, then via a kernel command-line option you can have the
|
||||
watchdog timer start as soon as it initializes. It also have a lot
|
||||
of other options, see the 'Watchdog' section below for more details.
|
||||
Note that you can also have the watchdog continue to run if it is
|
||||
closed (by default it is disabled on close). Go into the 'Watchdog
|
||||
Cards' menu, enable 'Watchdog Timer Support', and enable the option
|
||||
'Disable watchdog shutdown on close'.
|
||||
|
||||
IPMI systems can often be powered off using IPMI commands. Select
|
||||
'IPMI Poweroff' to do this. The driver will auto-detect if the system
|
||||
can be powered off by IPMI. It is safe to enable this even if your
|
||||
system doesn't support this option. This works on ATCA systems, the
|
||||
Radisys CPI1 card, and any IPMI system that supports standard chassis
|
||||
management commands.
|
||||
|
||||
If you want the driver to put an event into the event log on a panic,
|
||||
enable the 'Generate a panic event to all BMCs on a panic' option. If
|
||||
you want the whole panic string put into the event log using OEM
|
||||
events, enable the 'Generate OEM events containing the panic string'
|
||||
option.
|
||||
|
||||
Basic Design
|
||||
------------
|
||||
|
||||
The Linux IPMI driver is designed to be very modular and flexible, you
|
||||
only need to take the pieces you need and you can use it in many
|
||||
different ways. Because of that, it's broken into many chunks of
|
||||
code. These chunks (by module name) are:
|
||||
|
||||
ipmi_msghandler - This is the central piece of software for the IPMI
|
||||
system. It handles all messages, message timing, and responses. The
|
||||
IPMI users tie into this, and the IPMI physical interfaces (called
|
||||
System Management Interfaces, or SMIs) also tie in here. This
|
||||
provides the kernelland interface for IPMI, but does not provide an
|
||||
interface for use by application processes.
|
||||
|
||||
ipmi_devintf - This provides a userland IOCTL interface for the IPMI
|
||||
driver, each open file for this device ties in to the message handler
|
||||
as an IPMI user.
|
||||
|
||||
ipmi_si - A driver for various system interfaces. This supports KCS,
|
||||
SMIC, and BT interfaces. Unless you have an SMBus interface or your
|
||||
own custom interface, you probably need to use this.
|
||||
|
||||
ipmi_smb - A driver for accessing BMCs on the SMBus. It uses the
|
||||
I2C kernel driver's SMBus interfaces to send and receive IPMI messages
|
||||
over the SMBus.
|
||||
|
||||
ipmi_watchdog - IPMI requires systems to have a very capable watchdog
|
||||
timer. This driver implements the standard Linux watchdog timer
|
||||
interface on top of the IPMI message handler.
|
||||
|
||||
ipmi_poweroff - Some systems support the ability to be turned off via
|
||||
IPMI commands.
|
||||
|
||||
These are all individually selectable via configuration options.
|
||||
|
||||
Note that the KCS-only interface has been removed. The af_ipmi driver
|
||||
is no longer supported and has been removed because it was impossible
|
||||
to do 32 bit emulation on 64-bit kernels with it.
|
||||
|
||||
Much documentation for the interface is in the include files. The
|
||||
IPMI include files are:
|
||||
|
||||
net/af_ipmi.h - Contains the socket interface.
|
||||
|
||||
linux/ipmi.h - Contains the user interface and IOCTL interface for IPMI.
|
||||
|
||||
linux/ipmi_smi.h - Contains the interface for system management interfaces
|
||||
(things that interface to IPMI controllers) to use.
|
||||
|
||||
linux/ipmi_msgdefs.h - General definitions for base IPMI messaging.
|
||||
|
||||
|
||||
Addressing
|
||||
----------
|
||||
|
||||
The IPMI addressing works much like IP addresses, you have an overlay
|
||||
to handle the different address types. The overlay is:
|
||||
|
||||
struct ipmi_addr
|
||||
{
|
||||
int addr_type;
|
||||
short channel;
|
||||
char data[IPMI_MAX_ADDR_SIZE];
|
||||
};
|
||||
|
||||
The addr_type determines what the address really is. The driver
|
||||
currently understands two different types of addresses.
|
||||
|
||||
"System Interface" addresses are defined as:
|
||||
|
||||
struct ipmi_system_interface_addr
|
||||
{
|
||||
int addr_type;
|
||||
short channel;
|
||||
};
|
||||
|
||||
and the type is IPMI_SYSTEM_INTERFACE_ADDR_TYPE. This is used for talking
|
||||
straight to the BMC on the current card. The channel must be
|
||||
IPMI_BMC_CHANNEL.
|
||||
|
||||
Messages that are destined to go out on the IPMB bus use the
|
||||
IPMI_IPMB_ADDR_TYPE address type. The format is
|
||||
|
||||
struct ipmi_ipmb_addr
|
||||
{
|
||||
int addr_type;
|
||||
short channel;
|
||||
unsigned char slave_addr;
|
||||
unsigned char lun;
|
||||
};
|
||||
|
||||
The "channel" here is generally zero, but some devices support more
|
||||
than one channel, it corresponds to the channel as defined in the IPMI
|
||||
spec.
|
||||
|
||||
|
||||
Messages
|
||||
--------
|
||||
|
||||
Messages are defined as:
|
||||
|
||||
struct ipmi_msg
|
||||
{
|
||||
unsigned char netfn;
|
||||
unsigned char lun;
|
||||
unsigned char cmd;
|
||||
unsigned char *data;
|
||||
int data_len;
|
||||
};
|
||||
|
||||
The driver takes care of adding/stripping the header information. The
|
||||
data portion is just the data to be send (do NOT put addressing info
|
||||
here) or the response. Note that the completion code of a response is
|
||||
the first item in "data", it is not stripped out because that is how
|
||||
all the messages are defined in the spec (and thus makes counting the
|
||||
offsets a little easier :-).
|
||||
|
||||
When using the IOCTL interface from userland, you must provide a block
|
||||
of data for "data", fill it, and set data_len to the length of the
|
||||
block of data, even when receiving messages. Otherwise the driver
|
||||
will have no place to put the message.
|
||||
|
||||
Messages coming up from the message handler in kernelland will come in
|
||||
as:
|
||||
|
||||
struct ipmi_recv_msg
|
||||
{
|
||||
struct list_head link;
|
||||
|
||||
/* The type of message as defined in the "Receive Types"
|
||||
defines above. */
|
||||
int recv_type;
|
||||
|
||||
ipmi_user_t *user;
|
||||
struct ipmi_addr addr;
|
||||
long msgid;
|
||||
struct ipmi_msg msg;
|
||||
|
||||
/* Call this when done with the message. It will presumably free
|
||||
the message and do any other necessary cleanup. */
|
||||
void (*done)(struct ipmi_recv_msg *msg);
|
||||
|
||||
/* Place-holder for the data, don't make any assumptions about
|
||||
the size or existence of this, since it may change. */
|
||||
unsigned char msg_data[IPMI_MAX_MSG_LENGTH];
|
||||
};
|
||||
|
||||
You should look at the receive type and handle the message
|
||||
appropriately.
|
||||
|
||||
|
||||
The Upper Layer Interface (Message Handler)
|
||||
-------------------------------------------
|
||||
|
||||
The upper layer of the interface provides the users with a consistent
|
||||
view of the IPMI interfaces. It allows multiple SMI interfaces to be
|
||||
addressed (because some boards actually have multiple BMCs on them)
|
||||
and the user should not have to care what type of SMI is below them.
|
||||
|
||||
|
||||
Creating the User
|
||||
|
||||
To user the message handler, you must first create a user using
|
||||
ipmi_create_user. The interface number specifies which SMI you want
|
||||
to connect to, and you must supply callback functions to be called
|
||||
when data comes in. The callback function can run at interrupt level,
|
||||
so be careful using the callbacks. This also allows to you pass in a
|
||||
piece of data, the handler_data, that will be passed back to you on
|
||||
all calls.
|
||||
|
||||
Once you are done, call ipmi_destroy_user() to get rid of the user.
|
||||
|
||||
From userland, opening the device automatically creates a user, and
|
||||
closing the device automatically destroys the user.
|
||||
|
||||
|
||||
Messaging
|
||||
|
||||
To send a message from kernel-land, the ipmi_request() call does
|
||||
pretty much all message handling. Most of the parameter are
|
||||
self-explanatory. However, it takes a "msgid" parameter. This is NOT
|
||||
the sequence number of messages. It is simply a long value that is
|
||||
passed back when the response for the message is returned. You may
|
||||
use it for anything you like.
|
||||
|
||||
Responses come back in the function pointed to by the ipmi_recv_hndl
|
||||
field of the "handler" that you passed in to ipmi_create_user().
|
||||
Remember again, these may be running at interrupt level. Remember to
|
||||
look at the receive type, too.
|
||||
|
||||
From userland, you fill out an ipmi_req_t structure and use the
|
||||
IPMICTL_SEND_COMMAND ioctl. For incoming stuff, you can use select()
|
||||
or poll() to wait for messages to come in. However, you cannot use
|
||||
read() to get them, you must call the IPMICTL_RECEIVE_MSG with the
|
||||
ipmi_recv_t structure to actually get the message. Remember that you
|
||||
must supply a pointer to a block of data in the msg.data field, and
|
||||
you must fill in the msg.data_len field with the size of the data.
|
||||
This gives the receiver a place to actually put the message.
|
||||
|
||||
If the message cannot fit into the data you provide, you will get an
|
||||
EMSGSIZE error and the driver will leave the data in the receive
|
||||
queue. If you want to get it and have it truncate the message, us
|
||||
the IPMICTL_RECEIVE_MSG_TRUNC ioctl.
|
||||
|
||||
When you send a command (which is defined by the lowest-order bit of
|
||||
the netfn per the IPMI spec) on the IPMB bus, the driver will
|
||||
automatically assign the sequence number to the command and save the
|
||||
command. If the response is not receive in the IPMI-specified 5
|
||||
seconds, it will generate a response automatically saying the command
|
||||
timed out. If an unsolicited response comes in (if it was after 5
|
||||
seconds, for instance), that response will be ignored.
|
||||
|
||||
In kernelland, after you receive a message and are done with it, you
|
||||
MUST call ipmi_free_recv_msg() on it, or you will leak messages. Note
|
||||
that you should NEVER mess with the "done" field of a message, that is
|
||||
required to properly clean up the message.
|
||||
|
||||
Note that when sending, there is an ipmi_request_supply_msgs() call
|
||||
that lets you supply the smi and receive message. This is useful for
|
||||
pieces of code that need to work even if the system is out of buffers
|
||||
(the watchdog timer uses this, for instance). You supply your own
|
||||
buffer and own free routines. This is not recommended for normal use,
|
||||
though, since it is tricky to manage your own buffers.
|
||||
|
||||
|
||||
Events and Incoming Commands
|
||||
|
||||
The driver takes care of polling for IPMI events and receiving
|
||||
commands (commands are messages that are not responses, they are
|
||||
commands that other things on the IPMB bus have sent you). To receive
|
||||
these, you must register for them, they will not automatically be sent
|
||||
to you.
|
||||
|
||||
To receive events, you must call ipmi_set_gets_events() and set the
|
||||
"val" to non-zero. Any events that have been received by the driver
|
||||
since startup will immediately be delivered to the first user that
|
||||
registers for events. After that, if multiple users are registered
|
||||
for events, they will all receive all events that come in.
|
||||
|
||||
For receiving commands, you have to individually register commands you
|
||||
want to receive. Call ipmi_register_for_cmd() and supply the netfn
|
||||
and command name for each command you want to receive. You also
|
||||
specify a bitmask of the channels you want to receive the command from
|
||||
(or use IPMI_CHAN_ALL for all channels if you don't care). Only one
|
||||
user may be registered for each netfn/cmd/channel, but different users
|
||||
may register for different commands, or the same command if the
|
||||
channel bitmasks do not overlap.
|
||||
|
||||
From userland, equivalent IOCTLs are provided to do these functions.
|
||||
|
||||
|
||||
The Lower Layer (SMI) Interface
|
||||
-------------------------------
|
||||
|
||||
As mentioned before, multiple SMI interfaces may be registered to the
|
||||
message handler, each of these is assigned an interface number when
|
||||
they register with the message handler. They are generally assigned
|
||||
in the order they register, although if an SMI unregisters and then
|
||||
another one registers, all bets are off.
|
||||
|
||||
The ipmi_smi.h defines the interface for management interfaces, see
|
||||
that for more details.
|
||||
|
||||
|
||||
The SI Driver
|
||||
-------------
|
||||
|
||||
The SI driver allows up to 4 KCS or SMIC interfaces to be configured
|
||||
in the system. By default, scan the ACPI tables for interfaces, and
|
||||
if it doesn't find any the driver will attempt to register one KCS
|
||||
interface at the spec-specified I/O port 0xca2 without interrupts.
|
||||
You can change this at module load time (for a module) with:
|
||||
|
||||
modprobe ipmi_si.o type=<type1>,<type2>....
|
||||
ports=<port1>,<port2>... addrs=<addr1>,<addr2>...
|
||||
irqs=<irq1>,<irq2>... trydefaults=[0|1]
|
||||
regspacings=<sp1>,<sp2>,... regsizes=<size1>,<size2>,...
|
||||
regshifts=<shift1>,<shift2>,...
|
||||
slave_addrs=<addr1>,<addr2>,...
|
||||
force_kipmid=<enable1>,<enable2>,...
|
||||
unload_when_empty=[0|1]
|
||||
|
||||
Each of these except si_trydefaults is a list, the first item for the
|
||||
first interface, second item for the second interface, etc.
|
||||
|
||||
The si_type may be either "kcs", "smic", or "bt". If you leave it blank, it
|
||||
defaults to "kcs".
|
||||
|
||||
If you specify si_addrs as non-zero for an interface, the driver will
|
||||
use the memory address given as the address of the device. This
|
||||
overrides si_ports.
|
||||
|
||||
If you specify si_ports as non-zero for an interface, the driver will
|
||||
use the I/O port given as the device address.
|
||||
|
||||
If you specify si_irqs as non-zero for an interface, the driver will
|
||||
attempt to use the given interrupt for the device.
|
||||
|
||||
si_trydefaults sets whether the standard IPMI interface at 0xca2 and
|
||||
any interfaces specified by ACPE are tried. By default, the driver
|
||||
tries it, set this value to zero to turn this off.
|
||||
|
||||
The next three parameters have to do with register layout. The
|
||||
registers used by the interfaces may not appear at successive
|
||||
locations and they may not be in 8-bit registers. These parameters
|
||||
allow the layout of the data in the registers to be more precisely
|
||||
specified.
|
||||
|
||||
The regspacings parameter give the number of bytes between successive
|
||||
register start addresses. For instance, if the regspacing is set to 4
|
||||
and the start address is 0xca2, then the address for the second
|
||||
register would be 0xca6. This defaults to 1.
|
||||
|
||||
The regsizes parameter gives the size of a register, in bytes. The
|
||||
data used by IPMI is 8-bits wide, but it may be inside a larger
|
||||
register. This parameter allows the read and write type to specified.
|
||||
It may be 1, 2, 4, or 8. The default is 1.
|
||||
|
||||
Since the register size may be larger than 32 bits, the IPMI data may not
|
||||
be in the lower 8 bits. The regshifts parameter give the amount to shift
|
||||
the data to get to the actual IPMI data.
|
||||
|
||||
The slave_addrs specifies the IPMI address of the local BMC. This is
|
||||
usually 0x20 and the driver defaults to that, but in case it's not, it
|
||||
can be specified when the driver starts up.
|
||||
|
||||
The force_ipmid parameter forcefully enables (if set to 1) or disables
|
||||
(if set to 0) the kernel IPMI daemon. Normally this is auto-detected
|
||||
by the driver, but systems with broken interrupts might need an enable,
|
||||
or users that don't want the daemon (don't need the performance, don't
|
||||
want the CPU hit) can disable it.
|
||||
|
||||
If unload_when_empty is set to 1, the driver will be unloaded if it
|
||||
doesn't find any interfaces or all the interfaces fail to work. The
|
||||
default is one. Setting to 0 is useful with the hotmod, but is
|
||||
obviously only useful for modules.
|
||||
|
||||
When compiled into the kernel, the parameters can be specified on the
|
||||
kernel command line as:
|
||||
|
||||
ipmi_si.type=<type1>,<type2>...
|
||||
ipmi_si.ports=<port1>,<port2>... ipmi_si.addrs=<addr1>,<addr2>...
|
||||
ipmi_si.irqs=<irq1>,<irq2>... ipmi_si.trydefaults=[0|1]
|
||||
ipmi_si.regspacings=<sp1>,<sp2>,...
|
||||
ipmi_si.regsizes=<size1>,<size2>,...
|
||||
ipmi_si.regshifts=<shift1>,<shift2>,...
|
||||
ipmi_si.slave_addrs=<addr1>,<addr2>,...
|
||||
ipmi_si.force_kipmid=<enable1>,<enable2>,...
|
||||
|
||||
It works the same as the module parameters of the same names.
|
||||
|
||||
By default, the driver will attempt to detect any device specified by
|
||||
ACPI, and if none of those then a KCS device at the spec-specified
|
||||
0xca2. If you want to turn this off, set the "trydefaults" option to
|
||||
false.
|
||||
|
||||
If you have high-res timers compiled into the kernel, the driver will
|
||||
use them to provide much better performance. Note that if you do not
|
||||
have high-res timers enabled in the kernel and you don't have
|
||||
interrupts enabled, the driver will run VERY slowly. Don't blame me,
|
||||
these interfaces suck.
|
||||
|
||||
The driver supports a hot add and remove of interfaces. This way,
|
||||
interfaces can be added or removed after the kernel is up and running.
|
||||
This is done using /sys/modules/ipmi_si/hotmod, which is a write-only
|
||||
parameter. You write a string to this interface. The string has the
|
||||
format:
|
||||
<op1>[:op2[:op3...]]
|
||||
The "op"s are:
|
||||
add|remove,kcs|bt|smic,mem|i/o,<address>[,<opt1>[,<opt2>[,...]]]
|
||||
You can specify more than one interface on the line. The "opt"s are:
|
||||
rsp=<regspacing>
|
||||
rsi=<regsize>
|
||||
rsh=<regshift>
|
||||
irq=<irq>
|
||||
ipmb=<ipmb slave addr>
|
||||
and these have the same meanings as discussed above. Note that you
|
||||
can also use this on the kernel command line for a more compact format
|
||||
for specifying an interface. Note that when removing an interface,
|
||||
only the first three parameters (si type, address type, and address)
|
||||
are used for the comparison. Any options are ignored for removing.
|
||||
|
||||
The SMBus Driver
|
||||
----------------
|
||||
|
||||
The SMBus driver allows up to 4 SMBus devices to be configured in the
|
||||
system. By default, the driver will register any SMBus interfaces it finds
|
||||
in the I2C address range of 0x20 to 0x4f on any adapter. You can change this
|
||||
at module load time (for a module) with:
|
||||
|
||||
modprobe ipmi_smb.o
|
||||
addr=<adapter1>,<i2caddr1>[,<adapter2>,<i2caddr2>[,...]]
|
||||
dbg=<flags1>,<flags2>...
|
||||
[defaultprobe=1] [dbg_probe=1]
|
||||
|
||||
The addresses are specified in pairs, the first is the adapter ID and the
|
||||
second is the I2C address on that adapter.
|
||||
|
||||
The debug flags are bit flags for each BMC found, they are:
|
||||
IPMI messages: 1, driver state: 2, timing: 4, I2C probe: 8
|
||||
|
||||
Setting smb_defaultprobe to zero disabled the default probing of SMBus
|
||||
interfaces at address range 0x20 to 0x4f. This means that only the
|
||||
BMCs specified on the smb_addr line will be detected.
|
||||
|
||||
Setting smb_dbg_probe to 1 will enable debugging of the probing and
|
||||
detection process for BMCs on the SMBusses.
|
||||
|
||||
Discovering the IPMI compliant BMC on the SMBus can cause devices
|
||||
on the I2C bus to fail. The SMBus driver writes a "Get Device ID" IPMI
|
||||
message as a block write to the I2C bus and waits for a response.
|
||||
This action can be detrimental to some I2C devices. It is highly recommended
|
||||
that the known I2c address be given to the SMBus driver in the smb_addr
|
||||
parameter. The default address range will not be used when a smb_addr
|
||||
parameter is provided.
|
||||
|
||||
When compiled into the kernel, the addresses can be specified on the
|
||||
kernel command line as:
|
||||
|
||||
ipmb_smb.addr=<adapter1>,<i2caddr1>[,<adapter2>,<i2caddr2>[,...]]
|
||||
ipmi_smb.dbg=<flags1>,<flags2>...
|
||||
ipmi_smb.defaultprobe=0 ipmi_smb.dbg_probe=1
|
||||
|
||||
These are the same options as on the module command line.
|
||||
|
||||
Note that you might need some I2C changes if CONFIG_IPMI_PANIC_EVENT
|
||||
is enabled along with this, so the I2C driver knows to run to
|
||||
completion during sending a panic event.
|
||||
|
||||
|
||||
Other Pieces
|
||||
------------
|
||||
|
||||
Watchdog
|
||||
--------
|
||||
|
||||
A watchdog timer is provided that implements the Linux-standard
|
||||
watchdog timer interface. It has three module parameters that can be
|
||||
used to control it:
|
||||
|
||||
modprobe ipmi_watchdog timeout=<t> pretimeout=<t> action=<action type>
|
||||
preaction=<preaction type> preop=<preop type> start_now=x
|
||||
nowayout=x ifnum_to_use=n
|
||||
|
||||
ifnum_to_use specifies which interface the watchdog timer should use.
|
||||
The default is -1, which means to pick the first one registered.
|
||||
|
||||
The timeout is the number of seconds to the action, and the pretimeout
|
||||
is the amount of seconds before the reset that the pre-timeout panic will
|
||||
occur (if pretimeout is zero, then pretimeout will not be enabled). Note
|
||||
that the pretimeout is the time before the final timeout. So if the
|
||||
timeout is 50 seconds and the pretimeout is 10 seconds, then the pretimeout
|
||||
will occur in 40 second (10 seconds before the timeout).
|
||||
|
||||
The action may be "reset", "power_cycle", or "power_off", and
|
||||
specifies what to do when the timer times out, and defaults to
|
||||
"reset".
|
||||
|
||||
The preaction may be "pre_smi" for an indication through the SMI
|
||||
interface, "pre_int" for an indication through the SMI with an
|
||||
interrupts, and "pre_nmi" for a NMI on a preaction. This is how
|
||||
the driver is informed of the pretimeout.
|
||||
|
||||
The preop may be set to "preop_none" for no operation on a pretimeout,
|
||||
"preop_panic" to set the preoperation to panic, or "preop_give_data"
|
||||
to provide data to read from the watchdog device when the pretimeout
|
||||
occurs. A "pre_nmi" setting CANNOT be used with "preop_give_data"
|
||||
because you can't do data operations from an NMI.
|
||||
|
||||
When preop is set to "preop_give_data", one byte comes ready to read
|
||||
on the device when the pretimeout occurs. Select and fasync work on
|
||||
the device, as well.
|
||||
|
||||
If start_now is set to 1, the watchdog timer will start running as
|
||||
soon as the driver is loaded.
|
||||
|
||||
If nowayout is set to 1, the watchdog timer will not stop when the
|
||||
watchdog device is closed. The default value of nowayout is true
|
||||
if the CONFIG_WATCHDOG_NOWAYOUT option is enabled, or false if not.
|
||||
|
||||
When compiled into the kernel, the kernel command line is available
|
||||
for configuring the watchdog:
|
||||
|
||||
ipmi_watchdog.timeout=<t> ipmi_watchdog.pretimeout=<t>
|
||||
ipmi_watchdog.action=<action type>
|
||||
ipmi_watchdog.preaction=<preaction type>
|
||||
ipmi_watchdog.preop=<preop type>
|
||||
ipmi_watchdog.start_now=x
|
||||
ipmi_watchdog.nowayout=x
|
||||
|
||||
The options are the same as the module parameter options.
|
||||
|
||||
The watchdog will panic and start a 120 second reset timeout if it
|
||||
gets a pre-action. During a panic or a reboot, the watchdog will
|
||||
start a 120 timer if it is running to make sure the reboot occurs.
|
||||
|
||||
Note that if you use the NMI preaction for the watchdog, you MUST
|
||||
NOT use nmi watchdog mode 1. If you use the NMI watchdog, you
|
||||
must use mode 2.
|
||||
|
||||
Once you open the watchdog timer, you must write a 'V' character to the
|
||||
device to close it, or the timer will not stop. This is a new semantic
|
||||
for the driver, but makes it consistent with the rest of the watchdog
|
||||
drivers in Linux.
|
||||
|
||||
|
||||
Panic Timeouts
|
||||
--------------
|
||||
|
||||
The OpenIPMI driver supports the ability to put semi-custom and custom
|
||||
events in the system event log if a panic occurs. if you enable the
|
||||
'Generate a panic event to all BMCs on a panic' option, you will get
|
||||
one event on a panic in a standard IPMI event format. If you enable
|
||||
the 'Generate OEM events containing the panic string' option, you will
|
||||
also get a bunch of OEM events holding the panic string.
|
||||
|
||||
|
||||
The field settings of the events are:
|
||||
* Generator ID: 0x21 (kernel)
|
||||
* EvM Rev: 0x03 (this event is formatting in IPMI 1.0 format)
|
||||
* Sensor Type: 0x20 (OS critical stop sensor)
|
||||
* Sensor #: The first byte of the panic string (0 if no panic string)
|
||||
* Event Dir | Event Type: 0x6f (Assertion, sensor-specific event info)
|
||||
* Event Data 1: 0xa1 (Runtime stop in OEM bytes 2 and 3)
|
||||
* Event data 2: second byte of panic string
|
||||
* Event data 3: third byte of panic string
|
||||
See the IPMI spec for the details of the event layout. This event is
|
||||
always sent to the local management controller. It will handle routing
|
||||
the message to the right place
|
||||
|
||||
Other OEM events have the following format:
|
||||
Record ID (bytes 0-1): Set by the SEL.
|
||||
Record type (byte 2): 0xf0 (OEM non-timestamped)
|
||||
byte 3: The slave address of the card saving the panic
|
||||
byte 4: A sequence number (starting at zero)
|
||||
The rest of the bytes (11 bytes) are the panic string. If the panic string
|
||||
is longer than 11 bytes, multiple messages will be sent with increasing
|
||||
sequence numbers.
|
||||
|
||||
Because you cannot send OEM events using the standard interface, this
|
||||
function will attempt to find an SEL and add the events there. It
|
||||
will first query the capabilities of the local management controller.
|
||||
If it has an SEL, then they will be stored in the SEL of the local
|
||||
management controller. If not, and the local management controller is
|
||||
an event generator, the event receiver from the local management
|
||||
controller will be queried and the events sent to the SEL on that
|
||||
device. Otherwise, the events go nowhere since there is nowhere to
|
||||
send them.
|
||||
|
||||
|
||||
Poweroff
|
||||
--------
|
||||
|
||||
If the poweroff capability is selected, the IPMI driver will install
|
||||
a shutdown function into the standard poweroff function pointer. This
|
||||
is in the ipmi_poweroff module. When the system requests a powerdown,
|
||||
it will send the proper IPMI commands to do this. This is supported on
|
||||
several platforms.
|
||||
|
||||
There is a module parameter named "poweroff_powercycle" that may
|
||||
either be zero (do a power down) or non-zero (do a power cycle, power
|
||||
the system off, then power it on in a few seconds). Setting
|
||||
ipmi_poweroff.poweroff_control=x will do the same thing on the kernel
|
||||
command line. The parameter is also available via the proc filesystem
|
||||
in /proc/sys/dev/ipmi/poweroff_powercycle. Note that if the system
|
||||
does not support power cycling, it will always do the power off.
|
||||
|
||||
The "ifnum_to_use" parameter specifies which interface the poweroff
|
||||
code should use. The default is -1, which means to pick the first one
|
||||
registered.
|
||||
|
||||
Note that if you have ACPI enabled, the system will prefer using ACPI to
|
||||
power off.
|
||||
37
Documentation/IRQ-affinity.txt
Normal file
37
Documentation/IRQ-affinity.txt
Normal file
@ -0,0 +1,37 @@
|
||||
|
||||
SMP IRQ affinity, started by Ingo Molnar <mingo@redhat.com>
|
||||
|
||||
|
||||
/proc/irq/IRQ#/smp_affinity specifies which target CPUs are permitted
|
||||
for a given IRQ source. It's a bitmask of allowed CPUs. It's not allowed
|
||||
to turn off all CPUs, and if an IRQ controller does not support IRQ
|
||||
affinity then the value will not change from the default 0xffffffff.
|
||||
|
||||
Here is an example of restricting IRQ44 (eth1) to CPU0-3 then restricting
|
||||
the IRQ to CPU4-7 (this is an 8-CPU SMP box):
|
||||
|
||||
[root@moon 44]# cat smp_affinity
|
||||
ffffffff
|
||||
[root@moon 44]# echo 0f > smp_affinity
|
||||
[root@moon 44]# cat smp_affinity
|
||||
0000000f
|
||||
[root@moon 44]# ping -f h
|
||||
PING hell (195.4.7.3): 56 data bytes
|
||||
...
|
||||
--- hell ping statistics ---
|
||||
6029 packets transmitted, 6027 packets received, 0% packet loss
|
||||
round-trip min/avg/max = 0.1/0.1/0.4 ms
|
||||
[root@moon 44]# cat /proc/interrupts | grep 44:
|
||||
44: 0 1785 1785 1783 1783 1
|
||||
1 0 IO-APIC-level eth1
|
||||
[root@moon 44]# echo f0 > smp_affinity
|
||||
[root@moon 44]# ping -f h
|
||||
PING hell (195.4.7.3): 56 data bytes
|
||||
..
|
||||
--- hell ping statistics ---
|
||||
2779 packets transmitted, 2777 packets received, 0% packet loss
|
||||
round-trip min/avg/max = 0.1/0.5/585.4 ms
|
||||
[root@moon 44]# cat /proc/interrupts | grep 44:
|
||||
44: 1068 1785 1785 1784 1784 1069 1070 1069 IO-APIC-level eth1
|
||||
[root@moon 44]#
|
||||
|
||||
22
Documentation/IRQ.txt
Normal file
22
Documentation/IRQ.txt
Normal file
@ -0,0 +1,22 @@
|
||||
What is an IRQ?
|
||||
|
||||
An IRQ is an interrupt request from a device.
|
||||
Currently they can come in over a pin, or over a packet.
|
||||
Several devices may be connected to the same pin thus
|
||||
sharing an IRQ.
|
||||
|
||||
An IRQ number is a kernel identifier used to talk about a hardware
|
||||
interrupt source. Typically this is an index into the global irq_desc
|
||||
array, but except for what linux/interrupt.h implements the details
|
||||
are architecture specific.
|
||||
|
||||
An IRQ number is an enumeration of the possible interrupt sources on a
|
||||
machine. Typically what is enumerated is the number of input pins on
|
||||
all of the interrupt controller in the system. In the case of ISA
|
||||
what is enumerated are the 16 input pins on the two i8259 interrupt
|
||||
controllers.
|
||||
|
||||
Architectures can assign additional meaning to the IRQ numbers, and
|
||||
are encouraged to in the case where there is any manual configuration
|
||||
of the hardware involved. The ISA IRQs are a classic example of
|
||||
assigning this kind of additional meaning.
|
||||
572
Documentation/MSI-HOWTO.txt
Normal file
572
Documentation/MSI-HOWTO.txt
Normal file
@ -0,0 +1,572 @@
|
||||
The MSI Driver Guide HOWTO
|
||||
Tom L Nguyen tom.l.nguyen@intel.com
|
||||
10/03/2003
|
||||
Revised Feb 12, 2004 by Martine Silbermann
|
||||
email: Martine.Silbermann@hp.com
|
||||
Revised Jun 25, 2004 by Tom L Nguyen
|
||||
|
||||
1. About this guide
|
||||
|
||||
This guide describes the basics of Message Signaled Interrupts (MSI),
|
||||
the advantages of using MSI over traditional interrupt mechanisms,
|
||||
and how to enable your driver to use MSI or MSI-X. Also included is
|
||||
a Frequently Asked Questions (FAQ) section.
|
||||
|
||||
1.1 Terminology
|
||||
|
||||
PCI devices can be single-function or multi-function. In either case,
|
||||
when this text talks about enabling or disabling MSI on a "device
|
||||
function," it is referring to one specific PCI device and function and
|
||||
not to all functions on a PCI device (unless the PCI device has only
|
||||
one function).
|
||||
|
||||
2. Copyright 2003 Intel Corporation
|
||||
|
||||
3. What is MSI/MSI-X?
|
||||
|
||||
Message Signaled Interrupt (MSI), as described in the PCI Local Bus
|
||||
Specification Revision 2.3 or later, is an optional feature, and a
|
||||
required feature for PCI Express devices. MSI enables a device function
|
||||
to request service by sending an Inbound Memory Write on its PCI bus to
|
||||
the FSB as a Message Signal Interrupt transaction. Because MSI is
|
||||
generated in the form of a Memory Write, all transaction conditions,
|
||||
such as a Retry, Master-Abort, Target-Abort or normal completion, are
|
||||
supported.
|
||||
|
||||
A PCI device that supports MSI must also support pin IRQ assertion
|
||||
interrupt mechanism to provide backward compatibility for systems that
|
||||
do not support MSI. In systems which support MSI, the bus driver is
|
||||
responsible for initializing the message address and message data of
|
||||
the device function's MSI/MSI-X capability structure during device
|
||||
initial configuration.
|
||||
|
||||
An MSI capable device function indicates MSI support by implementing
|
||||
the MSI/MSI-X capability structure in its PCI capability list. The
|
||||
device function may implement both the MSI capability structure and
|
||||
the MSI-X capability structure; however, the bus driver should not
|
||||
enable both.
|
||||
|
||||
The MSI capability structure contains Message Control register,
|
||||
Message Address register and Message Data register. These registers
|
||||
provide the bus driver control over MSI. The Message Control register
|
||||
indicates the MSI capability supported by the device. The Message
|
||||
Address register specifies the target address and the Message Data
|
||||
register specifies the characteristics of the message. To request
|
||||
service, the device function writes the content of the Message Data
|
||||
register to the target address. The device and its software driver
|
||||
are prohibited from writing to these registers.
|
||||
|
||||
The MSI-X capability structure is an optional extension to MSI. It
|
||||
uses an independent and separate capability structure. There are
|
||||
some key advantages to implementing the MSI-X capability structure
|
||||
over the MSI capability structure as described below.
|
||||
|
||||
- Support a larger maximum number of vectors per function.
|
||||
|
||||
- Provide the ability for system software to configure
|
||||
each vector with an independent message address and message
|
||||
data, specified by a table that resides in Memory Space.
|
||||
|
||||
- MSI and MSI-X both support per-vector masking. Per-vector
|
||||
masking is an optional extension of MSI but a required
|
||||
feature for MSI-X. Per-vector masking provides the kernel the
|
||||
ability to mask/unmask a single MSI while running its
|
||||
interrupt service routine. If per-vector masking is
|
||||
not supported, then the device driver should provide the
|
||||
hardware/software synchronization to ensure that the device
|
||||
generates MSI when the driver wants it to do so.
|
||||
|
||||
4. Why use MSI?
|
||||
|
||||
As a benefit to the simplification of board design, MSI allows board
|
||||
designers to remove out-of-band interrupt routing. MSI is another
|
||||
step towards a legacy-free environment.
|
||||
|
||||
Due to increasing pressure on chipset and processor packages to
|
||||
reduce pin count, the need for interrupt pins is expected to
|
||||
diminish over time. Devices, due to pin constraints, may implement
|
||||
messages to increase performance.
|
||||
|
||||
PCI Express endpoints uses INTx emulation (in-band messages) instead
|
||||
of IRQ pin assertion. Using INTx emulation requires interrupt
|
||||
sharing among devices connected to the same node (PCI bridge) while
|
||||
MSI is unique (non-shared) and does not require BIOS configuration
|
||||
support. As a result, the PCI Express technology requires MSI
|
||||
support for better interrupt performance.
|
||||
|
||||
Using MSI enables the device functions to support two or more
|
||||
vectors, which can be configured to target different CPUs to
|
||||
increase scalability.
|
||||
|
||||
5. Configuring a driver to use MSI/MSI-X
|
||||
|
||||
By default, the kernel will not enable MSI/MSI-X on all devices that
|
||||
support this capability. The CONFIG_PCI_MSI kernel option
|
||||
must be selected to enable MSI/MSI-X support.
|
||||
|
||||
5.1 Including MSI/MSI-X support into the kernel
|
||||
|
||||
To allow MSI/MSI-X capable device drivers to selectively enable
|
||||
MSI/MSI-X (using pci_enable_msi()/pci_enable_msix() as described
|
||||
below), the VECTOR based scheme needs to be enabled by setting
|
||||
CONFIG_PCI_MSI during kernel config.
|
||||
|
||||
Since the target of the inbound message is the local APIC, providing
|
||||
CONFIG_X86_LOCAL_APIC must be enabled as well as CONFIG_PCI_MSI.
|
||||
|
||||
5.2 Configuring for MSI support
|
||||
|
||||
Due to the non-contiguous fashion in vector assignment of the
|
||||
existing Linux kernel, this version does not support multiple
|
||||
messages regardless of a device function is capable of supporting
|
||||
more than one vector. To enable MSI on a device function's MSI
|
||||
capability structure requires a device driver to call the function
|
||||
pci_enable_msi() explicitly.
|
||||
|
||||
5.2.1 API pci_enable_msi
|
||||
|
||||
int pci_enable_msi(struct pci_dev *dev)
|
||||
|
||||
With this new API, a device driver that wants to have MSI
|
||||
enabled on its device function must call this API to enable MSI.
|
||||
A successful call will initialize the MSI capability structure
|
||||
with ONE vector, regardless of whether a device function is
|
||||
capable of supporting multiple messages. This vector replaces the
|
||||
pre-assigned dev->irq with a new MSI vector. To avoid a conflict
|
||||
of the new assigned vector with existing pre-assigned vector requires
|
||||
a device driver to call this API before calling request_irq().
|
||||
|
||||
5.2.2 API pci_disable_msi
|
||||
|
||||
void pci_disable_msi(struct pci_dev *dev)
|
||||
|
||||
This API should always be used to undo the effect of pci_enable_msi()
|
||||
when a device driver is unloading. This API restores dev->irq with
|
||||
the pre-assigned IOAPIC vector and switches a device's interrupt
|
||||
mode to PCI pin-irq assertion/INTx emulation mode.
|
||||
|
||||
Note that a device driver should always call free_irq() on the MSI vector
|
||||
that it has done request_irq() on before calling this API. Failure to do
|
||||
so results in a BUG_ON() and a device will be left with MSI enabled and
|
||||
leaks its vector.
|
||||
|
||||
5.2.3 MSI mode vs. legacy mode diagram
|
||||
|
||||
The below diagram shows the events which switch the interrupt
|
||||
mode on the MSI-capable device function between MSI mode and
|
||||
PIN-IRQ assertion mode.
|
||||
|
||||
------------ pci_enable_msi ------------------------
|
||||
| | <=============== | |
|
||||
| MSI MODE | | PIN-IRQ ASSERTION MODE |
|
||||
| | ===============> | |
|
||||
------------ pci_disable_msi ------------------------
|
||||
|
||||
|
||||
Figure 1. MSI Mode vs. Legacy Mode
|
||||
|
||||
In Figure 1, a device operates by default in legacy mode. Legacy
|
||||
in this context means PCI pin-irq assertion or PCI-Express INTx
|
||||
emulation. A successful MSI request (using pci_enable_msi()) switches
|
||||
a device's interrupt mode to MSI mode. A pre-assigned IOAPIC vector
|
||||
stored in dev->irq will be saved by the PCI subsystem and a new
|
||||
assigned MSI vector will replace dev->irq.
|
||||
|
||||
To return back to its default mode, a device driver should always call
|
||||
pci_disable_msi() to undo the effect of pci_enable_msi(). Note that a
|
||||
device driver should always call free_irq() on the MSI vector it has
|
||||
done request_irq() on before calling pci_disable_msi(). Failure to do
|
||||
so results in a BUG_ON() and a device will be left with MSI enabled and
|
||||
leaks its vector. Otherwise, the PCI subsystem restores a device's
|
||||
dev->irq with a pre-assigned IOAPIC vector and marks the released
|
||||
MSI vector as unused.
|
||||
|
||||
Once being marked as unused, there is no guarantee that the PCI
|
||||
subsystem will reserve this MSI vector for a device. Depending on
|
||||
the availability of current PCI vector resources and the number of
|
||||
MSI/MSI-X requests from other drivers, this MSI may be re-assigned.
|
||||
|
||||
For the case where the PCI subsystem re-assigns this MSI vector to
|
||||
another driver, a request to switch back to MSI mode may result
|
||||
in being assigned a different MSI vector or a failure if no more
|
||||
vectors are available.
|
||||
|
||||
5.3 Configuring for MSI-X support
|
||||
|
||||
Due to the ability of the system software to configure each vector of
|
||||
the MSI-X capability structure with an independent message address
|
||||
and message data, the non-contiguous fashion in vector assignment of
|
||||
the existing Linux kernel has no impact on supporting multiple
|
||||
messages on an MSI-X capable device functions. To enable MSI-X on
|
||||
a device function's MSI-X capability structure requires its device
|
||||
driver to call the function pci_enable_msix() explicitly.
|
||||
|
||||
The function pci_enable_msix(), once invoked, enables either
|
||||
all or nothing, depending on the current availability of PCI vector
|
||||
resources. If the PCI vector resources are available for the number
|
||||
of vectors requested by a device driver, this function will configure
|
||||
the MSI-X table of the MSI-X capability structure of a device with
|
||||
requested messages. To emphasize this reason, for example, a device
|
||||
may be capable for supporting the maximum of 32 vectors while its
|
||||
software driver usually may request 4 vectors. It is recommended
|
||||
that the device driver should call this function once during the
|
||||
initialization phase of the device driver.
|
||||
|
||||
Unlike the function pci_enable_msi(), the function pci_enable_msix()
|
||||
does not replace the pre-assigned IOAPIC dev->irq with a new MSI
|
||||
vector because the PCI subsystem writes the 1:1 vector-to-entry mapping
|
||||
into the field vector of each element contained in a second argument.
|
||||
Note that the pre-assigned IOAPIC dev->irq is valid only if the device
|
||||
operates in PIN-IRQ assertion mode. In MSI-X mode, any attempt at
|
||||
using dev->irq by the device driver to request for interrupt service
|
||||
may result in unpredictable behavior.
|
||||
|
||||
For each MSI-X vector granted, a device driver is responsible for calling
|
||||
other functions like request_irq(), enable_irq(), etc. to enable
|
||||
this vector with its corresponding interrupt service handler. It is
|
||||
a device driver's choice to assign all vectors with the same
|
||||
interrupt service handler or each vector with a unique interrupt
|
||||
service handler.
|
||||
|
||||
5.3.1 Handling MMIO address space of MSI-X Table
|
||||
|
||||
The PCI 3.0 specification has implementation notes that MMIO address
|
||||
space for a device's MSI-X structure should be isolated so that the
|
||||
software system can set different pages for controlling accesses to the
|
||||
MSI-X structure. The implementation of MSI support requires the PCI
|
||||
subsystem, not a device driver, to maintain full control of the MSI-X
|
||||
table/MSI-X PBA (Pending Bit Array) and MMIO address space of the MSI-X
|
||||
table/MSI-X PBA. A device driver is prohibited from requesting the MMIO
|
||||
address space of the MSI-X table/MSI-X PBA. Otherwise, the PCI subsystem
|
||||
will fail enabling MSI-X on its hardware device when it calls the function
|
||||
pci_enable_msix().
|
||||
|
||||
5.3.2 Handling MSI-X allocation
|
||||
|
||||
Determining the number of MSI-X vectors allocated to a function is
|
||||
dependent on the number of MSI capable devices and MSI-X capable
|
||||
devices populated in the system. The policy of allocating MSI-X
|
||||
vectors to a function is defined as the following:
|
||||
|
||||
#of MSI-X vectors allocated to a function = (x - y)/z where
|
||||
|
||||
x = The number of available PCI vector resources by the time
|
||||
the device driver calls pci_enable_msix(). The PCI vector
|
||||
resources is the sum of the number of unassigned vectors
|
||||
(new) and the number of released vectors when any MSI/MSI-X
|
||||
device driver switches its hardware device back to a legacy
|
||||
mode or is hot-removed. The number of unassigned vectors
|
||||
may exclude some vectors reserved, as defined in parameter
|
||||
NR_HP_RESERVED_VECTORS, for the case where the system is
|
||||
capable of supporting hot-add/hot-remove operations. Users
|
||||
may change the value defined in NR_HR_RESERVED_VECTORS to
|
||||
meet their specific needs.
|
||||
|
||||
y = The number of MSI capable devices populated in the system.
|
||||
This policy ensures that each MSI capable device has its
|
||||
vector reserved to avoid the case where some MSI-X capable
|
||||
drivers may attempt to claim all available vector resources.
|
||||
|
||||
z = The number of MSI-X capable devices populated in the system.
|
||||
This policy ensures that maximum (x - y) is distributed
|
||||
evenly among MSI-X capable devices.
|
||||
|
||||
Note that the PCI subsystem scans y and z during a bus enumeration.
|
||||
When the PCI subsystem completes configuring MSI/MSI-X capability
|
||||
structure of a device as requested by its device driver, y/z is
|
||||
decremented accordingly.
|
||||
|
||||
5.3.3 Handling MSI-X shortages
|
||||
|
||||
For the case where fewer MSI-X vectors are allocated to a function
|
||||
than requested, the function pci_enable_msix() will return the
|
||||
maximum number of MSI-X vectors available to the caller. A device
|
||||
driver may re-send its request with fewer or equal vectors indicated
|
||||
in the return. For example, if a device driver requests 5 vectors, but
|
||||
the number of available vectors is 3 vectors, a value of 3 will be
|
||||
returned as a result of pci_enable_msix() call. A function could be
|
||||
designed for its driver to use only 3 MSI-X table entries as
|
||||
different combinations as ABC--, A-B-C, A--CB, etc. Note that this
|
||||
patch does not support multiple entries with the same vector. Such
|
||||
attempt by a device driver to use 5 MSI-X table entries with 3 vectors
|
||||
as ABBCC, AABCC, BCCBA, etc will result as a failure by the function
|
||||
pci_enable_msix(). Below are the reasons why supporting multiple
|
||||
entries with the same vector is an undesirable solution.
|
||||
|
||||
- The PCI subsystem cannot determine the entry that
|
||||
generated the message to mask/unmask MSI while handling
|
||||
software driver ISR. Attempting to walk through all MSI-X
|
||||
table entries (2048 max) to mask/unmask any match vector
|
||||
is an undesirable solution.
|
||||
|
||||
- Walking through all MSI-X table entries (2048 max) to handle
|
||||
SMP affinity of any match vector is an undesirable solution.
|
||||
|
||||
5.3.4 API pci_enable_msix
|
||||
|
||||
int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
|
||||
|
||||
This API enables a device driver to request the PCI subsystem
|
||||
to enable MSI-X messages on its hardware device. Depending on
|
||||
the availability of PCI vectors resources, the PCI subsystem enables
|
||||
either all or none of the requested vectors.
|
||||
|
||||
Argument 'dev' points to the device (pci_dev) structure.
|
||||
|
||||
Argument 'entries' is a pointer to an array of msix_entry structs.
|
||||
The number of entries is indicated in argument 'nvec'.
|
||||
struct msix_entry is defined in /driver/pci/msi.h:
|
||||
|
||||
struct msix_entry {
|
||||
u16 vector; /* kernel uses to write alloc vector */
|
||||
u16 entry; /* driver uses to specify entry */
|
||||
};
|
||||
|
||||
A device driver is responsible for initializing the field 'entry' of
|
||||
each element with a unique entry supported by MSI-X table. Otherwise,
|
||||
-EINVAL will be returned as a result. A successful return of zero
|
||||
indicates the PCI subsystem completed initializing each of the requested
|
||||
entries of the MSI-X table with message address and message data.
|
||||
Last but not least, the PCI subsystem will write the 1:1
|
||||
vector-to-entry mapping into the field 'vector' of each element. A
|
||||
device driver is responsible for keeping track of allocated MSI-X
|
||||
vectors in its internal data structure.
|
||||
|
||||
A return of zero indicates that the number of MSI-X vectors was
|
||||
successfully allocated. A return of greater than zero indicates
|
||||
MSI-X vector shortage. Or a return of less than zero indicates
|
||||
a failure. This failure may be a result of duplicate entries
|
||||
specified in second argument, or a result of no available vector,
|
||||
or a result of failing to initialize MSI-X table entries.
|
||||
|
||||
5.3.5 API pci_disable_msix
|
||||
|
||||
void pci_disable_msix(struct pci_dev *dev)
|
||||
|
||||
This API should always be used to undo the effect of pci_enable_msix()
|
||||
when a device driver is unloading. Note that a device driver should
|
||||
always call free_irq() on all MSI-X vectors it has done request_irq()
|
||||
on before calling this API. Failure to do so results in a BUG_ON() and
|
||||
a device will be left with MSI-X enabled and leaks its vectors.
|
||||
|
||||
5.3.6 MSI-X mode vs. legacy mode diagram
|
||||
|
||||
The below diagram shows the events which switch the interrupt
|
||||
mode on the MSI-X capable device function between MSI-X mode and
|
||||
PIN-IRQ assertion mode (legacy).
|
||||
|
||||
------------ pci_enable_msix(,,n) ------------------------
|
||||
| | <=============== | |
|
||||
| MSI-X MODE | | PIN-IRQ ASSERTION MODE |
|
||||
| | ===============> | |
|
||||
------------ pci_disable_msix ------------------------
|
||||
|
||||
Figure 2. MSI-X Mode vs. Legacy Mode
|
||||
|
||||
In Figure 2, a device operates by default in legacy mode. A
|
||||
successful MSI-X request (using pci_enable_msix()) switches a
|
||||
device's interrupt mode to MSI-X mode. A pre-assigned IOAPIC vector
|
||||
stored in dev->irq will be saved by the PCI subsystem; however,
|
||||
unlike MSI mode, the PCI subsystem will not replace dev->irq with
|
||||
assigned MSI-X vector because the PCI subsystem already writes the 1:1
|
||||
vector-to-entry mapping into the field 'vector' of each element
|
||||
specified in second argument.
|
||||
|
||||
To return back to its default mode, a device driver should always call
|
||||
pci_disable_msix() to undo the effect of pci_enable_msix(). Note that
|
||||
a device driver should always call free_irq() on all MSI-X vectors it
|
||||
has done request_irq() on before calling pci_disable_msix(). Failure
|
||||
to do so results in a BUG_ON() and a device will be left with MSI-X
|
||||
enabled and leaks its vectors. Otherwise, the PCI subsystem switches a
|
||||
device function's interrupt mode from MSI-X mode to legacy mode and
|
||||
marks all allocated MSI-X vectors as unused.
|
||||
|
||||
Once being marked as unused, there is no guarantee that the PCI
|
||||
subsystem will reserve these MSI-X vectors for a device. Depending on
|
||||
the availability of current PCI vector resources and the number of
|
||||
MSI/MSI-X requests from other drivers, these MSI-X vectors may be
|
||||
re-assigned.
|
||||
|
||||
For the case where the PCI subsystem re-assigned these MSI-X vectors
|
||||
to other drivers, a request to switch back to MSI-X mode may result
|
||||
being assigned with another set of MSI-X vectors or a failure if no
|
||||
more vectors are available.
|
||||
|
||||
5.4 Handling function implementing both MSI and MSI-X capabilities
|
||||
|
||||
For the case where a function implements both MSI and MSI-X
|
||||
capabilities, the PCI subsystem enables a device to run either in MSI
|
||||
mode or MSI-X mode but not both. A device driver determines whether it
|
||||
wants MSI or MSI-X enabled on its hardware device. Once a device
|
||||
driver requests for MSI, for example, it is prohibited from requesting
|
||||
MSI-X; in other words, a device driver is not permitted to ping-pong
|
||||
between MSI mod MSI-X mode during a run-time.
|
||||
|
||||
5.5 Hardware requirements for MSI/MSI-X support
|
||||
|
||||
MSI/MSI-X support requires support from both system hardware and
|
||||
individual hardware device functions.
|
||||
|
||||
5.5.1 System hardware support
|
||||
|
||||
Since the target of MSI address is the local APIC CPU, enabling
|
||||
MSI/MSI-X support in the Linux kernel is dependent on whether existing
|
||||
system hardware supports local APIC. Users should verify that their
|
||||
system supports local APIC operation by testing that it runs when
|
||||
CONFIG_X86_LOCAL_APIC=y.
|
||||
|
||||
In SMP environment, CONFIG_X86_LOCAL_APIC is automatically set;
|
||||
however, in UP environment, users must manually set
|
||||
CONFIG_X86_LOCAL_APIC. Once CONFIG_X86_LOCAL_APIC=y, setting
|
||||
CONFIG_PCI_MSI enables the VECTOR based scheme and the option for
|
||||
MSI-capable device drivers to selectively enable MSI/MSI-X.
|
||||
|
||||
Note that CONFIG_X86_IO_APIC setting is irrelevant because MSI/MSI-X
|
||||
vector is allocated new during runtime and MSI/MSI-X support does not
|
||||
depend on BIOS support. This key independency enables MSI/MSI-X
|
||||
support on future IOxAPIC free platforms.
|
||||
|
||||
5.5.2 Device hardware support
|
||||
|
||||
The hardware device function supports MSI by indicating the
|
||||
MSI/MSI-X capability structure on its PCI capability list. By
|
||||
default, this capability structure will not be initialized by
|
||||
the kernel to enable MSI during the system boot. In other words,
|
||||
the device function is running on its default pin assertion mode.
|
||||
Note that in many cases the hardware supporting MSI have bugs,
|
||||
which may result in system hangs. The software driver of specific
|
||||
MSI-capable hardware is responsible for deciding whether to call
|
||||
pci_enable_msi or not. A return of zero indicates the kernel
|
||||
successfully initialized the MSI/MSI-X capability structure of the
|
||||
device function. The device function is now running on MSI/MSI-X mode.
|
||||
|
||||
5.6 How to tell whether MSI/MSI-X is enabled on device function
|
||||
|
||||
At the driver level, a return of zero from the function call of
|
||||
pci_enable_msi()/pci_enable_msix() indicates to a device driver that
|
||||
its device function is initialized successfully and ready to run in
|
||||
MSI/MSI-X mode.
|
||||
|
||||
At the user level, users can use the command 'cat /proc/interrupts'
|
||||
to display the vectors allocated for devices and their interrupt
|
||||
MSI/MSI-X modes ("PCI-MSI"/"PCI-MSI-X"). Below shows MSI mode is
|
||||
enabled on a SCSI Adaptec 39320D Ultra320 controller.
|
||||
|
||||
CPU0 CPU1
|
||||
0: 324639 0 IO-APIC-edge timer
|
||||
1: 1186 0 IO-APIC-edge i8042
|
||||
2: 0 0 XT-PIC cascade
|
||||
12: 2797 0 IO-APIC-edge i8042
|
||||
14: 6543 0 IO-APIC-edge ide0
|
||||
15: 1 0 IO-APIC-edge ide1
|
||||
169: 0 0 IO-APIC-level uhci-hcd
|
||||
185: 0 0 IO-APIC-level uhci-hcd
|
||||
193: 138 10 PCI-MSI aic79xx
|
||||
201: 30 0 PCI-MSI aic79xx
|
||||
225: 30 0 IO-APIC-level aic7xxx
|
||||
233: 30 0 IO-APIC-level aic7xxx
|
||||
NMI: 0 0
|
||||
LOC: 324553 325068
|
||||
ERR: 0
|
||||
MIS: 0
|
||||
|
||||
6. MSI quirks
|
||||
|
||||
Several PCI chipsets or devices are known to not support MSI.
|
||||
The PCI stack provides 3 possible levels of MSI disabling:
|
||||
* on a single device
|
||||
* on all devices behind a specific bridge
|
||||
* globally
|
||||
|
||||
6.1. Disabling MSI on a single device
|
||||
|
||||
Under some circumstances, it might be required to disable MSI on a
|
||||
single device, It may be achived by either not calling pci_enable_msi()
|
||||
or all, or setting the pci_dev->no_msi flag before (most of the time
|
||||
in a quirk).
|
||||
|
||||
6.2. Disabling MSI below a bridge
|
||||
|
||||
The vast majority of MSI quirks are required by PCI bridges not
|
||||
being able to route MSI between busses. In this case, MSI have to be
|
||||
disabled on all devices behind this bridge. It is achieves by setting
|
||||
the PCI_BUS_FLAGS_NO_MSI flag in the pci_bus->bus_flags of the bridge
|
||||
subordinate bus. There is no need to set the same flag on bridges that
|
||||
are below the broken brigde. When pci_enable_msi() is called to enable
|
||||
MSI on a device, pci_msi_supported() takes care of checking the NO_MSI
|
||||
flag in all parent busses of the device.
|
||||
|
||||
Some bridges actually support dynamic MSI support enabling/disabling
|
||||
by changing some bits in their PCI configuration space (especially
|
||||
the Hypertransport chipsets such as the nVidia nForce and Serverworks
|
||||
HT2000). It may then be required to update the NO_MSI flag on the
|
||||
corresponding devices in the sysfs hierarchy. To enable MSI support
|
||||
on device "0000:00:0e", do:
|
||||
|
||||
echo 1 > /sys/bus/pci/devices/0000:00:0e/msi_bus
|
||||
|
||||
To disable MSI support, echo 0 instead of 1. Note that it should be
|
||||
used with caution since changing this value might break interrupts.
|
||||
|
||||
6.3. Disabling MSI globally
|
||||
|
||||
Some extreme cases may require to disable MSI globally on the system.
|
||||
For now, the only known case is a Serverworks PCI-X chipsets (MSI are
|
||||
not supported on several busses that are not all connected to the
|
||||
chipset in the Linux PCI hierarchy). In the vast majority of other
|
||||
cases, disabling only behind a specific bridge is enough.
|
||||
|
||||
For debugging purpose, the user may also pass pci=nomsi on the kernel
|
||||
command-line to explicitly disable MSI globally. But, once the appro-
|
||||
priate quirks are added to the kernel, this option should not be
|
||||
required anymore.
|
||||
|
||||
6.4. Finding why MSI cannot be enabled on a device
|
||||
|
||||
Assuming that MSI are not enabled on a device, you should look at
|
||||
dmesg to find messages that quirks may output when disabling MSI
|
||||
on some devices, some bridges or even globally.
|
||||
Then, lspci -t gives the list of bridges above a device. Reading
|
||||
/sys/bus/pci/devices/0000:00:0e/msi_bus will tell you whether MSI
|
||||
are enabled (1) or disabled (0). In 0 is found in a single bridge
|
||||
msi_bus file above the device, MSI cannot be enabled.
|
||||
|
||||
7. FAQ
|
||||
|
||||
Q1. Are there any limitations on using the MSI?
|
||||
|
||||
A1. If the PCI device supports MSI and conforms to the
|
||||
specification and the platform supports the APIC local bus,
|
||||
then using MSI should work.
|
||||
|
||||
Q2. Will it work on all the Pentium processors (P3, P4, Xeon,
|
||||
AMD processors)? In P3 IPI's are transmitted on the APIC local
|
||||
bus and in P4 and Xeon they are transmitted on the system
|
||||
bus. Are there any implications with this?
|
||||
|
||||
A2. MSI support enables a PCI device sending an inbound
|
||||
memory write (0xfeexxxxx as target address) on its PCI bus
|
||||
directly to the FSB. Since the message address has a
|
||||
redirection hint bit cleared, it should work.
|
||||
|
||||
Q3. The target address 0xfeexxxxx will be translated by the
|
||||
Host Bridge into an interrupt message. Are there any
|
||||
limitations on the chipsets such as Intel 8xx, Intel e7xxx,
|
||||
or VIA?
|
||||
|
||||
A3. If these chipsets support an inbound memory write with
|
||||
target address set as 0xfeexxxxx, as conformed to PCI
|
||||
specification 2.3 or latest, then it should work.
|
||||
|
||||
Q4. From the driver point of view, if the MSI is lost because
|
||||
of errors occurring during inbound memory write, then it may
|
||||
wait forever. Is there a mechanism for it to recover?
|
||||
|
||||
A4. Since the target of the transaction is an inbound memory
|
||||
write, all transaction termination conditions (Retry,
|
||||
Master-Abort, Target-Abort, or normal completion) are
|
||||
supported. A device sending an MSI must abide by all the PCI
|
||||
rules and conditions regarding that inbound memory write. So,
|
||||
if a retry is signaled it must retry, etc... We believe that
|
||||
the recommendation for Abort is also a retry (refer to PCI
|
||||
specification 2.3 or latest).
|
||||
276
Documentation/ManagementStyle
Normal file
276
Documentation/ManagementStyle
Normal file
@ -0,0 +1,276 @@
|
||||
|
||||
Linux kernel management style
|
||||
|
||||
This is a short document describing the preferred (or made up, depending
|
||||
on who you ask) management style for the linux kernel. It's meant to
|
||||
mirror the CodingStyle document to some degree, and mainly written to
|
||||
avoid answering (*) the same (or similar) questions over and over again.
|
||||
|
||||
Management style is very personal and much harder to quantify than
|
||||
simple coding style rules, so this document may or may not have anything
|
||||
to do with reality. It started as a lark, but that doesn't mean that it
|
||||
might not actually be true. You'll have to decide for yourself.
|
||||
|
||||
Btw, when talking about "kernel manager", it's all about the technical
|
||||
lead persons, not the people who do traditional management inside
|
||||
companies. If you sign purchase orders or you have any clue about the
|
||||
budget of your group, you're almost certainly not a kernel manager.
|
||||
These suggestions may or may not apply to you.
|
||||
|
||||
First off, I'd suggest buying "Seven Habits of Highly Successful
|
||||
People", and NOT read it. Burn it, it's a great symbolic gesture.
|
||||
|
||||
(*) This document does so not so much by answering the question, but by
|
||||
making it painfully obvious to the questioner that we don't have a clue
|
||||
to what the answer is.
|
||||
|
||||
Anyway, here goes:
|
||||
|
||||
|
||||
Chapter 1: Decisions
|
||||
|
||||
Everybody thinks managers make decisions, and that decision-making is
|
||||
important. The bigger and more painful the decision, the bigger the
|
||||
manager must be to make it. That's very deep and obvious, but it's not
|
||||
actually true.
|
||||
|
||||
The name of the game is to _avoid_ having to make a decision. In
|
||||
particular, if somebody tells you "choose (a) or (b), we really need you
|
||||
to decide on this", you're in trouble as a manager. The people you
|
||||
manage had better know the details better than you, so if they come to
|
||||
you for a technical decision, you're screwed. You're clearly not
|
||||
competent to make that decision for them.
|
||||
|
||||
(Corollary:if the people you manage don't know the details better than
|
||||
you, you're also screwed, although for a totally different reason.
|
||||
Namely that you are in the wrong job, and that _they_ should be managing
|
||||
your brilliance instead).
|
||||
|
||||
So the name of the game is to _avoid_ decisions, at least the big and
|
||||
painful ones. Making small and non-consequential decisions is fine, and
|
||||
makes you look like you know what you're doing, so what a kernel manager
|
||||
needs to do is to turn the big and painful ones into small things where
|
||||
nobody really cares.
|
||||
|
||||
It helps to realize that the key difference between a big decision and a
|
||||
small one is whether you can fix your decision afterwards. Any decision
|
||||
can be made small by just always making sure that if you were wrong (and
|
||||
you _will_ be wrong), you can always undo the damage later by
|
||||
backtracking. Suddenly, you get to be doubly managerial for making
|
||||
_two_ inconsequential decisions - the wrong one _and_ the right one.
|
||||
|
||||
And people will even see that as true leadership (*cough* bullshit
|
||||
*cough*).
|
||||
|
||||
Thus the key to avoiding big decisions becomes to just avoiding to do
|
||||
things that can't be undone. Don't get ushered into a corner from which
|
||||
you cannot escape. A cornered rat may be dangerous - a cornered manager
|
||||
is just pitiful.
|
||||
|
||||
It turns out that since nobody would be stupid enough to ever really let
|
||||
a kernel manager have huge fiscal responsibility _anyway_, it's usually
|
||||
fairly easy to backtrack. Since you're not going to be able to waste
|
||||
huge amounts of money that you might not be able to repay, the only
|
||||
thing you can backtrack on is a technical decision, and there
|
||||
back-tracking is very easy: just tell everybody that you were an
|
||||
incompetent nincompoop, say you're sorry, and undo all the worthless
|
||||
work you had people work on for the last year. Suddenly the decision
|
||||
you made a year ago wasn't a big decision after all, since it could be
|
||||
easily undone.
|
||||
|
||||
It turns out that some people have trouble with this approach, for two
|
||||
reasons:
|
||||
- admitting you were an idiot is harder than it looks. We all like to
|
||||
maintain appearances, and coming out in public to say that you were
|
||||
wrong is sometimes very hard indeed.
|
||||
- having somebody tell you that what you worked on for the last year
|
||||
wasn't worthwhile after all can be hard on the poor lowly engineers
|
||||
too, and while the actual _work_ was easy enough to undo by just
|
||||
deleting it, you may have irrevocably lost the trust of that
|
||||
engineer. And remember: "irrevocable" was what we tried to avoid in
|
||||
the first place, and your decision ended up being a big one after
|
||||
all.
|
||||
|
||||
Happily, both of these reasons can be mitigated effectively by just
|
||||
admitting up-front that you don't have a friggin' clue, and telling
|
||||
people ahead of the fact that your decision is purely preliminary, and
|
||||
might be the wrong thing. You should always reserve the right to change
|
||||
your mind, and make people very _aware_ of that. And it's much easier
|
||||
to admit that you are stupid when you haven't _yet_ done the really
|
||||
stupid thing.
|
||||
|
||||
Then, when it really does turn out to be stupid, people just roll their
|
||||
eyes and say "Oops, he did it again".
|
||||
|
||||
This preemptive admission of incompetence might also make the people who
|
||||
actually do the work also think twice about whether it's worth doing or
|
||||
not. After all, if _they_ aren't certain whether it's a good idea, you
|
||||
sure as hell shouldn't encourage them by promising them that what they
|
||||
work on will be included. Make them at least think twice before they
|
||||
embark on a big endeavor.
|
||||
|
||||
Remember: they'd better know more about the details than you do, and
|
||||
they usually already think they have the answer to everything. The best
|
||||
thing you can do as a manager is not to instill confidence, but rather a
|
||||
healthy dose of critical thinking on what they do.
|
||||
|
||||
Btw, another way to avoid a decision is to plaintively just whine "can't
|
||||
we just do both?" and look pitiful. Trust me, it works. If it's not
|
||||
clear which approach is better, they'll eventually figure it out. The
|
||||
answer may end up being that both teams get so frustrated by the
|
||||
situation that they just give up.
|
||||
|
||||
That may sound like a failure, but it's usually a sign that there was
|
||||
something wrong with both projects, and the reason the people involved
|
||||
couldn't decide was that they were both wrong. You end up coming up
|
||||
smelling like roses, and you avoided yet another decision that you could
|
||||
have screwed up on.
|
||||
|
||||
|
||||
Chapter 2: People
|
||||
|
||||
Most people are idiots, and being a manager means you'll have to deal
|
||||
with it, and perhaps more importantly, that _they_ have to deal with
|
||||
_you_.
|
||||
|
||||
It turns out that while it's easy to undo technical mistakes, it's not
|
||||
as easy to undo personality disorders. You just have to live with
|
||||
theirs - and yours.
|
||||
|
||||
However, in order to prepare yourself as a kernel manager, it's best to
|
||||
remember not to burn any bridges, bomb any innocent villagers, or
|
||||
alienate too many kernel developers. It turns out that alienating people
|
||||
is fairly easy, and un-alienating them is hard. Thus "alienating"
|
||||
immediately falls under the heading of "not reversible", and becomes a
|
||||
no-no according to Chapter 1.
|
||||
|
||||
There's just a few simple rules here:
|
||||
(1) don't call people d*ckheads (at least not in public)
|
||||
(2) learn how to apologize when you forgot rule (1)
|
||||
|
||||
The problem with #1 is that it's very easy to do, since you can say
|
||||
"you're a d*ckhead" in millions of different ways (*), sometimes without
|
||||
even realizing it, and almost always with a white-hot conviction that
|
||||
you are right.
|
||||
|
||||
And the more convinced you are that you are right (and let's face it,
|
||||
you can call just about _anybody_ a d*ckhead, and you often _will_ be
|
||||
right), the harder it ends up being to apologize afterwards.
|
||||
|
||||
To solve this problem, you really only have two options:
|
||||
- get really good at apologies
|
||||
- spread the "love" out so evenly that nobody really ends up feeling
|
||||
like they get unfairly targeted. Make it inventive enough, and they
|
||||
might even be amused.
|
||||
|
||||
The option of being unfailingly polite really doesn't exist. Nobody will
|
||||
trust somebody who is so clearly hiding his true character.
|
||||
|
||||
(*) Paul Simon sang "Fifty Ways to Lose Your Lover", because quite
|
||||
frankly, "A Million Ways to Tell a Developer He Is a D*ckhead" doesn't
|
||||
scan nearly as well. But I'm sure he thought about it.
|
||||
|
||||
|
||||
Chapter 3: People II - the Good Kind
|
||||
|
||||
While it turns out that most people are idiots, the corollary to that is
|
||||
sadly that you are one too, and that while we can all bask in the secure
|
||||
knowledge that we're better than the average person (let's face it,
|
||||
nobody ever believes that they're average or below-average), we should
|
||||
also admit that we're not the sharpest knife around, and there will be
|
||||
other people that are less of an idiot that you are.
|
||||
|
||||
Some people react badly to smart people. Others take advantage of them.
|
||||
|
||||
Make sure that you, as a kernel maintainer, are in the second group.
|
||||
Suck up to them, because they are the people who will make your job
|
||||
easier. In particular, they'll be able to make your decisions for you,
|
||||
which is what the game is all about.
|
||||
|
||||
So when you find somebody smarter than you are, just coast along. Your
|
||||
management responsibilities largely become ones of saying "Sounds like a
|
||||
good idea - go wild", or "That sounds good, but what about xxx?". The
|
||||
second version in particular is a great way to either learn something
|
||||
new about "xxx" or seem _extra_ managerial by pointing out something the
|
||||
smarter person hadn't thought about. In either case, you win.
|
||||
|
||||
One thing to look out for is to realize that greatness in one area does
|
||||
not necessarily translate to other areas. So you might prod people in
|
||||
specific directions, but let's face it, they might be good at what they
|
||||
do, and suck at everything else. The good news is that people tend to
|
||||
naturally gravitate back to what they are good at, so it's not like you
|
||||
are doing something irreversible when you _do_ prod them in some
|
||||
direction, just don't push too hard.
|
||||
|
||||
|
||||
Chapter 4: Placing blame
|
||||
|
||||
Things will go wrong, and people want somebody to blame. Tag, you're it.
|
||||
|
||||
It's not actually that hard to accept the blame, especially if people
|
||||
kind of realize that it wasn't _all_ your fault. Which brings us to the
|
||||
best way of taking the blame: do it for another guy. You'll feel good
|
||||
for taking the fall, he'll feel good about not getting blamed, and the
|
||||
guy who lost his whole 36GB porn-collection because of your incompetence
|
||||
will grudgingly admit that you at least didn't try to weasel out of it.
|
||||
|
||||
Then make the developer who really screwed up (if you can find him) know
|
||||
_in_private_ that he screwed up. Not just so he can avoid it in the
|
||||
future, but so that he knows he owes you one. And, perhaps even more
|
||||
importantly, he's also likely the person who can fix it. Because, let's
|
||||
face it, it sure ain't you.
|
||||
|
||||
Taking the blame is also why you get to be manager in the first place.
|
||||
It's part of what makes people trust you, and allow you the potential
|
||||
glory, because you're the one who gets to say "I screwed up". And if
|
||||
you've followed the previous rules, you'll be pretty good at saying that
|
||||
by now.
|
||||
|
||||
|
||||
Chapter 5: Things to avoid
|
||||
|
||||
There's one thing people hate even more than being called "d*ckhead",
|
||||
and that is being called a "d*ckhead" in a sanctimonious voice. The
|
||||
first you can apologize for, the second one you won't really get the
|
||||
chance. They likely will no longer be listening even if you otherwise
|
||||
do a good job.
|
||||
|
||||
We all think we're better than anybody else, which means that when
|
||||
somebody else puts on airs, it _really_ rubs us the wrong way. You may
|
||||
be morally and intellectually superior to everybody around you, but
|
||||
don't try to make it too obvious unless you really _intend_ to irritate
|
||||
somebody (*).
|
||||
|
||||
Similarly, don't be too polite or subtle about things. Politeness easily
|
||||
ends up going overboard and hiding the problem, and as they say, "On the
|
||||
internet, nobody can hear you being subtle". Use a big blunt object to
|
||||
hammer the point in, because you can't really depend on people getting
|
||||
your point otherwise.
|
||||
|
||||
Some humor can help pad both the bluntness and the moralizing. Going
|
||||
overboard to the point of being ridiculous can drive a point home
|
||||
without making it painful to the recipient, who just thinks you're being
|
||||
silly. It can thus help get through the personal mental block we all
|
||||
have about criticism.
|
||||
|
||||
(*) Hint: internet newsgroups that are not directly related to your work
|
||||
are great ways to take out your frustrations at other people. Write
|
||||
insulting posts with a sneer just to get into a good flame every once in
|
||||
a while, and you'll feel cleansed. Just don't crap too close to home.
|
||||
|
||||
|
||||
Chapter 6: Why me?
|
||||
|
||||
Since your main responsibility seems to be to take the blame for other
|
||||
peoples mistakes, and make it painfully obvious to everybody else that
|
||||
you're incompetent, the obvious question becomes one of why do it in the
|
||||
first place?
|
||||
|
||||
First off, while you may or may not get screaming teenage girls (or
|
||||
boys, let's not be judgmental or sexist here) knocking on your dressing
|
||||
room door, you _will_ get an immense feeling of personal accomplishment
|
||||
for being "in charge". Never mind the fact that you're really leading
|
||||
by trying to keep up with everybody else and running after them as fast
|
||||
as you can. Everybody will still think you're the person in charge.
|
||||
|
||||
It's a great job if you can hack it.
|
||||
217
Documentation/PCIEBUS-HOWTO.txt
Normal file
217
Documentation/PCIEBUS-HOWTO.txt
Normal file
@ -0,0 +1,217 @@
|
||||
The PCI Express Port Bus Driver Guide HOWTO
|
||||
Tom L Nguyen tom.l.nguyen@intel.com
|
||||
11/03/2004
|
||||
|
||||
1. About this guide
|
||||
|
||||
This guide describes the basics of the PCI Express Port Bus driver
|
||||
and provides information on how to enable the service drivers to
|
||||
register/unregister with the PCI Express Port Bus Driver.
|
||||
|
||||
2. Copyright 2004 Intel Corporation
|
||||
|
||||
3. What is the PCI Express Port Bus Driver
|
||||
|
||||
A PCI Express Port is a logical PCI-PCI Bridge structure. There
|
||||
are two types of PCI Express Port: the Root Port and the Switch
|
||||
Port. The Root Port originates a PCI Express link from a PCI Express
|
||||
Root Complex and the Switch Port connects PCI Express links to
|
||||
internal logical PCI buses. The Switch Port, which has its secondary
|
||||
bus representing the switch's internal routing logic, is called the
|
||||
switch's Upstream Port. The switch's Downstream Port is bridging from
|
||||
switch's internal routing bus to a bus representing the downstream
|
||||
PCI Express link from the PCI Express Switch.
|
||||
|
||||
A PCI Express Port can provide up to four distinct functions,
|
||||
referred to in this document as services, depending on its port type.
|
||||
PCI Express Port's services include native hotplug support (HP),
|
||||
power management event support (PME), advanced error reporting
|
||||
support (AER), and virtual channel support (VC). These services may
|
||||
be handled by a single complex driver or be individually distributed
|
||||
and handled by corresponding service drivers.
|
||||
|
||||
4. Why use the PCI Express Port Bus Driver?
|
||||
|
||||
In existing Linux kernels, the Linux Device Driver Model allows a
|
||||
physical device to be handled by only a single driver. The PCI
|
||||
Express Port is a PCI-PCI Bridge device with multiple distinct
|
||||
services. To maintain a clean and simple solution each service
|
||||
may have its own software service driver. In this case several
|
||||
service drivers will compete for a single PCI-PCI Bridge device.
|
||||
For example, if the PCI Express Root Port native hotplug service
|
||||
driver is loaded first, it claims a PCI-PCI Bridge Root Port. The
|
||||
kernel therefore does not load other service drivers for that Root
|
||||
Port. In other words, it is impossible to have multiple service
|
||||
drivers load and run on a PCI-PCI Bridge device simultaneously
|
||||
using the current driver model.
|
||||
|
||||
To enable multiple service drivers running simultaneously requires
|
||||
having a PCI Express Port Bus driver, which manages all populated
|
||||
PCI Express Ports and distributes all provided service requests
|
||||
to the corresponding service drivers as required. Some key
|
||||
advantages of using the PCI Express Port Bus driver are listed below:
|
||||
|
||||
- Allow multiple service drivers to run simultaneously on
|
||||
a PCI-PCI Bridge Port device.
|
||||
|
||||
- Allow service drivers implemented in an independent
|
||||
staged approach.
|
||||
|
||||
- Allow one service driver to run on multiple PCI-PCI Bridge
|
||||
Port devices.
|
||||
|
||||
- Manage and distribute resources of a PCI-PCI Bridge Port
|
||||
device to requested service drivers.
|
||||
|
||||
5. Configuring the PCI Express Port Bus Driver vs. Service Drivers
|
||||
|
||||
5.1 Including the PCI Express Port Bus Driver Support into the Kernel
|
||||
|
||||
Including the PCI Express Port Bus driver depends on whether the PCI
|
||||
Express support is included in the kernel config. The kernel will
|
||||
automatically include the PCI Express Port Bus driver as a kernel
|
||||
driver when the PCI Express support is enabled in the kernel.
|
||||
|
||||
5.2 Enabling Service Driver Support
|
||||
|
||||
PCI device drivers are implemented based on Linux Device Driver Model.
|
||||
All service drivers are PCI device drivers. As discussed above, it is
|
||||
impossible to load any service driver once the kernel has loaded the
|
||||
PCI Express Port Bus Driver. To meet the PCI Express Port Bus Driver
|
||||
Model requires some minimal changes on existing service drivers that
|
||||
imposes no impact on the functionality of existing service drivers.
|
||||
|
||||
A service driver is required to use the two APIs shown below to
|
||||
register its service with the PCI Express Port Bus driver (see
|
||||
section 5.2.1 & 5.2.2). It is important that a service driver
|
||||
initializes the pcie_port_service_driver data structure, included in
|
||||
header file /include/linux/pcieport_if.h, before calling these APIs.
|
||||
Failure to do so will result an identity mismatch, which prevents
|
||||
the PCI Express Port Bus driver from loading a service driver.
|
||||
|
||||
5.2.1 pcie_port_service_register
|
||||
|
||||
int pcie_port_service_register(struct pcie_port_service_driver *new)
|
||||
|
||||
This API replaces the Linux Driver Model's pci_module_init API. A
|
||||
service driver should always calls pcie_port_service_register at
|
||||
module init. Note that after service driver being loaded, calls
|
||||
such as pci_enable_device(dev) and pci_set_master(dev) are no longer
|
||||
necessary since these calls are executed by the PCI Port Bus driver.
|
||||
|
||||
5.2.2 pcie_port_service_unregister
|
||||
|
||||
void pcie_port_service_unregister(struct pcie_port_service_driver *new)
|
||||
|
||||
pcie_port_service_unregister replaces the Linux Driver Model's
|
||||
pci_unregister_driver. It's always called by service driver when a
|
||||
module exits.
|
||||
|
||||
5.2.3 Sample Code
|
||||
|
||||
Below is sample service driver code to initialize the port service
|
||||
driver data structure.
|
||||
|
||||
static struct pcie_port_service_id service_id[] = { {
|
||||
.vendor = PCI_ANY_ID,
|
||||
.device = PCI_ANY_ID,
|
||||
.port_type = PCIE_RC_PORT,
|
||||
.service_type = PCIE_PORT_SERVICE_AER,
|
||||
}, { /* end: all zeroes */ }
|
||||
};
|
||||
|
||||
static struct pcie_port_service_driver root_aerdrv = {
|
||||
.name = (char *)device_name,
|
||||
.id_table = &service_id[0],
|
||||
|
||||
.probe = aerdrv_load,
|
||||
.remove = aerdrv_unload,
|
||||
|
||||
.suspend = aerdrv_suspend,
|
||||
.resume = aerdrv_resume,
|
||||
};
|
||||
|
||||
Below is a sample code for registering/unregistering a service
|
||||
driver.
|
||||
|
||||
static int __init aerdrv_service_init(void)
|
||||
{
|
||||
int retval = 0;
|
||||
|
||||
retval = pcie_port_service_register(&root_aerdrv);
|
||||
if (!retval) {
|
||||
/*
|
||||
* FIX ME
|
||||
*/
|
||||
}
|
||||
return retval;
|
||||
}
|
||||
|
||||
static void __exit aerdrv_service_exit(void)
|
||||
{
|
||||
pcie_port_service_unregister(&root_aerdrv);
|
||||
}
|
||||
|
||||
module_init(aerdrv_service_init);
|
||||
module_exit(aerdrv_service_exit);
|
||||
|
||||
6. Possible Resource Conflicts
|
||||
|
||||
Since all service drivers of a PCI-PCI Bridge Port device are
|
||||
allowed to run simultaneously, below lists a few of possible resource
|
||||
conflicts with proposed solutions.
|
||||
|
||||
6.1 MSI Vector Resource
|
||||
|
||||
The MSI capability structure enables a device software driver to call
|
||||
pci_enable_msi to request MSI based interrupts. Once MSI interrupts
|
||||
are enabled on a device, it stays in this mode until a device driver
|
||||
calls pci_disable_msi to disable MSI interrupts and revert back to
|
||||
INTx emulation mode. Since service drivers of the same PCI-PCI Bridge
|
||||
port share the same physical device, if an individual service driver
|
||||
calls pci_enable_msi/pci_disable_msi it may result unpredictable
|
||||
behavior. For example, two service drivers run simultaneously on the
|
||||
same physical Root Port. Both service drivers call pci_enable_msi to
|
||||
request MSI based interrupts. A service driver may not know whether
|
||||
any other service drivers have run on this Root Port. If either one
|
||||
of them calls pci_disable_msi, it puts the other service driver
|
||||
in a wrong interrupt mode.
|
||||
|
||||
To avoid this situation all service drivers are not permitted to
|
||||
switch interrupt mode on its device. The PCI Express Port Bus driver
|
||||
is responsible for determining the interrupt mode and this should be
|
||||
transparent to service drivers. Service drivers need to know only
|
||||
the vector IRQ assigned to the field irq of struct pcie_device, which
|
||||
is passed in when the PCI Express Port Bus driver probes each service
|
||||
driver. Service drivers should use (struct pcie_device*)dev->irq to
|
||||
call request_irq/free_irq. In addition, the interrupt mode is stored
|
||||
in the field interrupt_mode of struct pcie_device.
|
||||
|
||||
6.2 MSI-X Vector Resources
|
||||
|
||||
Similar to the MSI a device driver for an MSI-X capable device can
|
||||
call pci_enable_msix to request MSI-X interrupts. All service drivers
|
||||
are not permitted to switch interrupt mode on its device. The PCI
|
||||
Express Port Bus driver is responsible for determining the interrupt
|
||||
mode and this should be transparent to service drivers. Any attempt
|
||||
by service driver to call pci_enable_msix/pci_disable_msix may
|
||||
result unpredictable behavior. Service drivers should use
|
||||
(struct pcie_device*)dev->irq and call request_irq/free_irq.
|
||||
|
||||
6.3 PCI Memory/IO Mapped Regions
|
||||
|
||||
Service drivers for PCI Express Power Management (PME), Advanced
|
||||
Error Reporting (AER), Hot-Plug (HP) and Virtual Channel (VC) access
|
||||
PCI configuration space on the PCI Express port. In all cases the
|
||||
registers accessed are independent of each other. This patch assumes
|
||||
that all service drivers will be well behaved and not overwrite
|
||||
other service driver's configuration settings.
|
||||
|
||||
6.4 PCI Config Registers
|
||||
|
||||
Each service driver runs its PCI config operations on its own
|
||||
capability structure except the PCI Express capability structure, in
|
||||
which Root Control register and Device Control register are shared
|
||||
between PME and AER. This patch assumes that all service drivers
|
||||
will be well behaved and not overwrite other service driver's
|
||||
configuration settings.
|
||||
112
Documentation/RCU/NMI-RCU.txt
Normal file
112
Documentation/RCU/NMI-RCU.txt
Normal file
@ -0,0 +1,112 @@
|
||||
Using RCU to Protect Dynamic NMI Handlers
|
||||
|
||||
|
||||
Although RCU is usually used to protect read-mostly data structures,
|
||||
it is possible to use RCU to provide dynamic non-maskable interrupt
|
||||
handlers, as well as dynamic irq handlers. This document describes
|
||||
how to do this, drawing loosely from Zwane Mwaikambo's NMI-timer
|
||||
work in "arch/i386/oprofile/nmi_timer_int.c" and in
|
||||
"arch/i386/kernel/traps.c".
|
||||
|
||||
The relevant pieces of code are listed below, each followed by a
|
||||
brief explanation.
|
||||
|
||||
static int dummy_nmi_callback(struct pt_regs *regs, int cpu)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
|
||||
The dummy_nmi_callback() function is a "dummy" NMI handler that does
|
||||
nothing, but returns zero, thus saying that it did nothing, allowing
|
||||
the NMI handler to take the default machine-specific action.
|
||||
|
||||
static nmi_callback_t nmi_callback = dummy_nmi_callback;
|
||||
|
||||
This nmi_callback variable is a global function pointer to the current
|
||||
NMI handler.
|
||||
|
||||
fastcall void do_nmi(struct pt_regs * regs, long error_code)
|
||||
{
|
||||
int cpu;
|
||||
|
||||
nmi_enter();
|
||||
|
||||
cpu = smp_processor_id();
|
||||
++nmi_count(cpu);
|
||||
|
||||
if (!rcu_dereference(nmi_callback)(regs, cpu))
|
||||
default_do_nmi(regs);
|
||||
|
||||
nmi_exit();
|
||||
}
|
||||
|
||||
The do_nmi() function processes each NMI. It first disables preemption
|
||||
in the same way that a hardware irq would, then increments the per-CPU
|
||||
count of NMIs. It then invokes the NMI handler stored in the nmi_callback
|
||||
function pointer. If this handler returns zero, do_nmi() invokes the
|
||||
default_do_nmi() function to handle a machine-specific NMI. Finally,
|
||||
preemption is restored.
|
||||
|
||||
Strictly speaking, rcu_dereference() is not needed, since this code runs
|
||||
only on i386, which does not need rcu_dereference() anyway. However,
|
||||
it is a good documentation aid, particularly for anyone attempting to
|
||||
do something similar on Alpha.
|
||||
|
||||
Quick Quiz: Why might the rcu_dereference() be necessary on Alpha,
|
||||
given that the code referenced by the pointer is read-only?
|
||||
|
||||
|
||||
Back to the discussion of NMI and RCU...
|
||||
|
||||
void set_nmi_callback(nmi_callback_t callback)
|
||||
{
|
||||
rcu_assign_pointer(nmi_callback, callback);
|
||||
}
|
||||
|
||||
The set_nmi_callback() function registers an NMI handler. Note that any
|
||||
data that is to be used by the callback must be initialized up -before-
|
||||
the call to set_nmi_callback(). On architectures that do not order
|
||||
writes, the rcu_assign_pointer() ensures that the NMI handler sees the
|
||||
initialized values.
|
||||
|
||||
void unset_nmi_callback(void)
|
||||
{
|
||||
rcu_assign_pointer(nmi_callback, dummy_nmi_callback);
|
||||
}
|
||||
|
||||
This function unregisters an NMI handler, restoring the original
|
||||
dummy_nmi_handler(). However, there may well be an NMI handler
|
||||
currently executing on some other CPU. We therefore cannot free
|
||||
up any data structures used by the old NMI handler until execution
|
||||
of it completes on all other CPUs.
|
||||
|
||||
One way to accomplish this is via synchronize_sched(), perhaps as
|
||||
follows:
|
||||
|
||||
unset_nmi_callback();
|
||||
synchronize_sched();
|
||||
kfree(my_nmi_data);
|
||||
|
||||
This works because synchronize_sched() blocks until all CPUs complete
|
||||
any preemption-disabled segments of code that they were executing.
|
||||
Since NMI handlers disable preemption, synchronize_sched() is guaranteed
|
||||
not to return until all ongoing NMI handlers exit. It is therefore safe
|
||||
to free up the handler's data as soon as synchronize_sched() returns.
|
||||
|
||||
|
||||
Answer to Quick Quiz
|
||||
|
||||
Why might the rcu_dereference() be necessary on Alpha, given
|
||||
that the code referenced by the pointer is read-only?
|
||||
|
||||
Answer: The caller to set_nmi_callback() might well have
|
||||
initialized some data that is to be used by the
|
||||
new NMI handler. In this case, the rcu_dereference()
|
||||
would be needed, because otherwise a CPU that received
|
||||
an NMI just after the new handler was set might see
|
||||
the pointer to the new NMI handler, but the old
|
||||
pre-initialized version of the handler's data.
|
||||
|
||||
More important, the rcu_dereference() makes it clear
|
||||
to someone reading the code that the pointer is being
|
||||
protected by RCU.
|
||||
449
Documentation/RCU/RTFP.txt
Normal file
449
Documentation/RCU/RTFP.txt
Normal file
@ -0,0 +1,449 @@
|
||||
Read the F-ing Papers!
|
||||
|
||||
|
||||
This document describes RCU-related publications, and is followed by
|
||||
the corresponding bibtex entries. A number of the publications may
|
||||
be found at http://www.rdrop.com/users/paulmck/RCU/.
|
||||
|
||||
The first thing resembling RCU was published in 1980, when Kung and Lehman
|
||||
[Kung80] recommended use of a garbage collector to defer destruction
|
||||
of nodes in a parallel binary search tree in order to simplify its
|
||||
implementation. This works well in environments that have garbage
|
||||
collectors, but current production garbage collectors incur significant
|
||||
read-side overhead.
|
||||
|
||||
In 1982, Manber and Ladner [Manber82,Manber84] recommended deferring
|
||||
destruction until all threads running at that time have terminated, again
|
||||
for a parallel binary search tree. This approach works well in systems
|
||||
with short-lived threads, such as the K42 research operating system.
|
||||
However, Linux has long-lived tasks, so more is needed.
|
||||
|
||||
In 1986, Hennessy, Osisek, and Seigh [Hennessy89] introduced passive
|
||||
serialization, which is an RCU-like mechanism that relies on the presence
|
||||
of "quiescent states" in the VM/XA hypervisor that are guaranteed not
|
||||
to be referencing the data structure. However, this mechanism was not
|
||||
optimized for modern computer systems, which is not surprising given
|
||||
that these overheads were not so expensive in the mid-80s. Nonetheless,
|
||||
passive serialization appears to be the first deferred-destruction
|
||||
mechanism to be used in production. Furthermore, the relevant patent has
|
||||
lapsed, so this approach may be used in non-GPL software, if desired.
|
||||
(In contrast, use of RCU is permitted only in software licensed under
|
||||
GPL. Sorry!!!)
|
||||
|
||||
In 1990, Pugh [Pugh90] noted that explicitly tracking which threads
|
||||
were reading a given data structure permitted deferred free to operate
|
||||
in the presence of non-terminating threads. However, this explicit
|
||||
tracking imposes significant read-side overhead, which is undesirable
|
||||
in read-mostly situations. This algorithm does take pains to avoid
|
||||
write-side contention and parallelize the other write-side overheads by
|
||||
providing a fine-grained locking design, however, it would be interesting
|
||||
to see how much of the performance advantage reported in 1990 remains
|
||||
in 2004.
|
||||
|
||||
At about this same time, Adams [Adams91] described ``chaotic relaxation'',
|
||||
where the normal barriers between successive iterations of convergent
|
||||
numerical algorithms are relaxed, so that iteration $n$ might use
|
||||
data from iteration $n-1$ or even $n-2$. This introduces error,
|
||||
which typically slows convergence and thus increases the number of
|
||||
iterations required. However, this increase is sometimes more than made
|
||||
up for by a reduction in the number of expensive barrier operations,
|
||||
which are otherwise required to synchronize the threads at the end
|
||||
of each iteration. Unfortunately, chaotic relaxation requires highly
|
||||
structured data, such as the matrices used in scientific programs, and
|
||||
is thus inapplicable to most data structures in operating-system kernels.
|
||||
|
||||
In 1993, Jacobson [Jacobson93] verbally described what is perhaps the
|
||||
simplest deferred-free technique: simply waiting a fixed amount of time
|
||||
before freeing blocks awaiting deferred free. Jacobson did not describe
|
||||
any write-side changes he might have made in this work using SGI's Irix
|
||||
kernel. Aju John published a similar technique in 1995 [AjuJohn95].
|
||||
This works well if there is a well-defined upper bound on the length of
|
||||
time that reading threads can hold references, as there might well be in
|
||||
hard real-time systems. However, if this time is exceeded, perhaps due
|
||||
to preemption, excessive interrupts, or larger-than-anticipated load,
|
||||
memory corruption can ensue, with no reasonable means of diagnosis.
|
||||
Jacobson's technique is therefore inappropriate for use in production
|
||||
operating-system kernels, except when such kernels can provide hard
|
||||
real-time response guarantees for all operations.
|
||||
|
||||
Also in 1995, Pu et al. [Pu95a] applied a technique similar to that of Pugh's
|
||||
read-side-tracking to permit replugging of algorithms within a commercial
|
||||
Unix operating system. However, this replugging permitted only a single
|
||||
reader at a time. The following year, this same group of researchers
|
||||
extended their technique to allow for multiple readers [Cowan96a].
|
||||
Their approach requires memory barriers (and thus pipeline stalls),
|
||||
but reduces memory latency, contention, and locking overheads.
|
||||
|
||||
1995 also saw the first publication of DYNIX/ptx's RCU mechanism
|
||||
[Slingwine95], which was optimized for modern CPU architectures,
|
||||
and was successfully applied to a number of situations within the
|
||||
DYNIX/ptx kernel. The corresponding conference paper appeared in 1998
|
||||
[McKenney98].
|
||||
|
||||
In 1999, the Tornado and K42 groups described their "generations"
|
||||
mechanism, which quite similar to RCU [Gamsa99]. These operating systems
|
||||
made pervasive use of RCU in place of "existence locks", which greatly
|
||||
simplifies locking hierarchies.
|
||||
|
||||
2001 saw the first RCU presentation involving Linux [McKenney01a]
|
||||
at OLS. The resulting abundance of RCU patches was presented the
|
||||
following year [McKenney02a], and use of RCU in dcache was first
|
||||
described that same year [Linder02a].
|
||||
|
||||
Also in 2002, Michael [Michael02b,Michael02a] presented "hazard-pointer"
|
||||
techniques that defer the destruction of data structures to simplify
|
||||
non-blocking synchronization (wait-free synchronization, lock-free
|
||||
synchronization, and obstruction-free synchronization are all examples of
|
||||
non-blocking synchronization). In particular, this technique eliminates
|
||||
locking, reduces contention, reduces memory latency for readers, and
|
||||
parallelizes pipeline stalls and memory latency for writers. However,
|
||||
these techniques still impose significant read-side overhead in the
|
||||
form of memory barriers. Researchers at Sun worked along similar lines
|
||||
in the same timeframe [HerlihyLM02,HerlihyLMS03]. These techniques
|
||||
can be thought of as inside-out reference counts, where the count is
|
||||
represented by the number of hazard pointers referencing a given data
|
||||
structure (rather than the more conventional counter field within the
|
||||
data structure itself).
|
||||
|
||||
In 2003, the K42 group described how RCU could be used to create
|
||||
hot-pluggable implementations of operating-system functions. Later that
|
||||
year saw a paper describing an RCU implementation of System V IPC
|
||||
[Arcangeli03], and an introduction to RCU in Linux Journal [McKenney03a].
|
||||
|
||||
2004 has seen a Linux-Journal article on use of RCU in dcache
|
||||
[McKenney04a], a performance comparison of locking to RCU on several
|
||||
different CPUs [McKenney04b], a dissertation describing use of RCU in a
|
||||
number of operating-system kernels [PaulEdwardMcKenneyPhD], a paper
|
||||
describing how to make RCU safe for soft-realtime applications [Sarma04c],
|
||||
and a paper describing SELinux performance with RCU [JamesMorris04b].
|
||||
|
||||
2005 has seen further adaptation of RCU to realtime use, permitting
|
||||
preemption of RCU realtime critical sections [PaulMcKenney05a,
|
||||
PaulMcKenney05b].
|
||||
|
||||
Bibtex Entries
|
||||
|
||||
@article{Kung80
|
||||
,author="H. T. Kung and Q. Lehman"
|
||||
,title="Concurrent Maintenance of Binary Search Trees"
|
||||
,Year="1980"
|
||||
,Month="September"
|
||||
,journal="ACM Transactions on Database Systems"
|
||||
,volume="5"
|
||||
,number="3"
|
||||
,pages="354-382"
|
||||
}
|
||||
|
||||
@techreport{Manber82
|
||||
,author="Udi Manber and Richard E. Ladner"
|
||||
,title="Concurrency Control in a Dynamic Search Structure"
|
||||
,institution="Department of Computer Science, University of Washington"
|
||||
,address="Seattle, Washington"
|
||||
,year="1982"
|
||||
,number="82-01-01"
|
||||
,month="January"
|
||||
,pages="28"
|
||||
}
|
||||
|
||||
@article{Manber84
|
||||
,author="Udi Manber and Richard E. Ladner"
|
||||
,title="Concurrency Control in a Dynamic Search Structure"
|
||||
,Year="1984"
|
||||
,Month="September"
|
||||
,journal="ACM Transactions on Database Systems"
|
||||
,volume="9"
|
||||
,number="3"
|
||||
,pages="439-455"
|
||||
}
|
||||
|
||||
@techreport{Hennessy89
|
||||
,author="James P. Hennessy and Damian L. Osisek and Joseph W. {Seigh II}"
|
||||
,title="Passive Serialization in a Multitasking Environment"
|
||||
,institution="US Patent and Trademark Office"
|
||||
,address="Washington, DC"
|
||||
,year="1989"
|
||||
,number="US Patent 4,809,168 (lapsed)"
|
||||
,month="February"
|
||||
,pages="11"
|
||||
}
|
||||
|
||||
@techreport{Pugh90
|
||||
,author="William Pugh"
|
||||
,title="Concurrent Maintenance of Skip Lists"
|
||||
,institution="Institute of Advanced Computer Science Studies, Department of Computer Science, University of Maryland"
|
||||
,address="College Park, Maryland"
|
||||
,year="1990"
|
||||
,number="CS-TR-2222.1"
|
||||
,month="June"
|
||||
}
|
||||
|
||||
@Book{Adams91
|
||||
,Author="Gregory R. Adams"
|
||||
,title="Concurrent Programming, Principles, and Practices"
|
||||
,Publisher="Benjamin Cummins"
|
||||
,Year="1991"
|
||||
}
|
||||
|
||||
@unpublished{Jacobson93
|
||||
,author="Van Jacobson"
|
||||
,title="Avoid Read-Side Locking Via Delayed Free"
|
||||
,year="1993"
|
||||
,month="September"
|
||||
,note="Verbal discussion"
|
||||
}
|
||||
|
||||
@Conference{AjuJohn95
|
||||
,Author="Aju John"
|
||||
,Title="Dynamic vnodes -- Design and Implementation"
|
||||
,Booktitle="{USENIX Winter 1995}"
|
||||
,Publisher="USENIX Association"
|
||||
,Month="January"
|
||||
,Year="1995"
|
||||
,pages="11-23"
|
||||
,Address="New Orleans, LA"
|
||||
}
|
||||
|
||||
@techreport{Slingwine95
|
||||
,author="John D. Slingwine and Paul E. McKenney"
|
||||
,title="Apparatus and Method for Achieving Reduced Overhead Mutual
|
||||
Exclusion and Maintaining Coherency in a Multiprocessor System
|
||||
Utilizing Execution History and Thread Monitoring"
|
||||
,institution="US Patent and Trademark Office"
|
||||
,address="Washington, DC"
|
||||
,year="1995"
|
||||
,number="US Patent 5,442,758 (contributed under GPL)"
|
||||
,month="August"
|
||||
}
|
||||
|
||||
@techreport{Slingwine97
|
||||
,author="John D. Slingwine and Paul E. McKenney"
|
||||
,title="Method for maintaining data coherency using thread
|
||||
activity summaries in a multicomputer system"
|
||||
,institution="US Patent and Trademark Office"
|
||||
,address="Washington, DC"
|
||||
,year="1997"
|
||||
,number="US Patent 5,608,893 (contributed under GPL)"
|
||||
,month="March"
|
||||
}
|
||||
|
||||
@techreport{Slingwine98
|
||||
,author="John D. Slingwine and Paul E. McKenney"
|
||||
,title="Apparatus and method for achieving reduced overhead
|
||||
mutual exclusion and maintaining coherency in a multiprocessor
|
||||
system utilizing execution history and thread monitoring"
|
||||
,institution="US Patent and Trademark Office"
|
||||
,address="Washington, DC"
|
||||
,year="1998"
|
||||
,number="US Patent 5,727,209 (contributed under GPL)"
|
||||
,month="March"
|
||||
}
|
||||
|
||||
@Conference{McKenney98
|
||||
,Author="Paul E. McKenney and John D. Slingwine"
|
||||
,Title="Read-Copy Update: Using Execution History to Solve Concurrency
|
||||
Problems"
|
||||
,Booktitle="{Parallel and Distributed Computing and Systems}"
|
||||
,Month="October"
|
||||
,Year="1998"
|
||||
,pages="509-518"
|
||||
,Address="Las Vegas, NV"
|
||||
}
|
||||
|
||||
@Conference{Gamsa99
|
||||
,Author="Ben Gamsa and Orran Krieger and Jonathan Appavoo and Michael Stumm"
|
||||
,Title="Tornado: Maximizing Locality and Concurrency in a Shared Memory
|
||||
Multiprocessor Operating System"
|
||||
,Booktitle="{Proceedings of the 3\textsuperscript{rd} Symposium on
|
||||
Operating System Design and Implementation}"
|
||||
,Month="February"
|
||||
,Year="1999"
|
||||
,pages="87-100"
|
||||
,Address="New Orleans, LA"
|
||||
}
|
||||
|
||||
@techreport{Slingwine01
|
||||
,author="John D. Slingwine and Paul E. McKenney"
|
||||
,title="Apparatus and method for achieving reduced overhead
|
||||
mutual exclusion and maintaining coherency in a multiprocessor
|
||||
system utilizing execution history and thread monitoring"
|
||||
,institution="US Patent and Trademark Office"
|
||||
,address="Washington, DC"
|
||||
,year="2001"
|
||||
,number="US Patent 5,219,690 (contributed under GPL)"
|
||||
,month="April"
|
||||
}
|
||||
|
||||
@Conference{McKenney01a
|
||||
,Author="Paul E. McKenney and Jonathan Appavoo and Andi Kleen and
|
||||
Orran Krieger and Rusty Russell and Dipankar Sarma and Maneesh Soni"
|
||||
,Title="Read-Copy Update"
|
||||
,Booktitle="{Ottawa Linux Symposium}"
|
||||
,Month="July"
|
||||
,Year="2001"
|
||||
,note="Available:
|
||||
\url{http://www.linuxsymposium.org/2001/abstracts/readcopy.php}
|
||||
\url{http://www.rdrop.com/users/paulmck/rclock/rclock_OLS.2001.05.01c.pdf}
|
||||
[Viewed June 23, 2004]"
|
||||
annotation="
|
||||
Described RCU, and presented some patches implementing and using it in
|
||||
the Linux kernel.
|
||||
"
|
||||
}
|
||||
|
||||
@Conference{Linder02a
|
||||
,Author="Hanna Linder and Dipankar Sarma and Maneesh Soni"
|
||||
,Title="Scalability of the Directory Entry Cache"
|
||||
,Booktitle="{Ottawa Linux Symposium}"
|
||||
,Month="June"
|
||||
,Year="2002"
|
||||
,pages="289-300"
|
||||
}
|
||||
|
||||
@Conference{McKenney02a
|
||||
,Author="Paul E. McKenney and Dipankar Sarma and
|
||||
Andrea Arcangeli and Andi Kleen and Orran Krieger and Rusty Russell"
|
||||
,Title="Read-Copy Update"
|
||||
,Booktitle="{Ottawa Linux Symposium}"
|
||||
,Month="June"
|
||||
,Year="2002"
|
||||
,pages="338-367"
|
||||
,note="Available:
|
||||
\url{http://www.linux.org.uk/~ajh/ols2002_proceedings.pdf.gz}
|
||||
[Viewed June 23, 2004]"
|
||||
}
|
||||
|
||||
@article{Appavoo03a
|
||||
,author="J. Appavoo and K. Hui and C. A. N. Soules and R. W. Wisniewski and
|
||||
D. M. {Da Silva} and O. Krieger and M. A. Auslander and D. J. Edelsohn and
|
||||
B. Gamsa and G. R. Ganger and P. McKenney and M. Ostrowski and
|
||||
B. Rosenburg and M. Stumm and J. Xenidis"
|
||||
,title="Enabling Autonomic Behavior in Systems Software With Hot Swapping"
|
||||
,Year="2003"
|
||||
,Month="January"
|
||||
,journal="IBM Systems Journal"
|
||||
,volume="42"
|
||||
,number="1"
|
||||
,pages="60-76"
|
||||
}
|
||||
|
||||
@Conference{Arcangeli03
|
||||
,Author="Andrea Arcangeli and Mingming Cao and Paul E. McKenney and
|
||||
Dipankar Sarma"
|
||||
,Title="Using Read-Copy Update Techniques for {System V IPC} in the
|
||||
{Linux} 2.5 Kernel"
|
||||
,Booktitle="Proceedings of the 2003 USENIX Annual Technical Conference
|
||||
(FREENIX Track)"
|
||||
,Publisher="USENIX Association"
|
||||
,year="2003"
|
||||
,month="June"
|
||||
,pages="297-310"
|
||||
}
|
||||
|
||||
@article{McKenney03a
|
||||
,author="Paul E. McKenney"
|
||||
,title="Using {RCU} in the {Linux} 2.5 Kernel"
|
||||
,Year="2003"
|
||||
,Month="October"
|
||||
,journal="Linux Journal"
|
||||
,volume="1"
|
||||
,number="114"
|
||||
,pages="18-26"
|
||||
}
|
||||
|
||||
@techreport{Friedberg03a
|
||||
,author="Stuart A. Friedberg"
|
||||
,title="Lock-Free Wild Card Search Data Structure and Method"
|
||||
,institution="US Patent and Trademark Office"
|
||||
,address="Washington, DC"
|
||||
,year="2003"
|
||||
,number="US Patent 6,662,184 (contributed under GPL)"
|
||||
,month="December"
|
||||
,pages="112"
|
||||
}
|
||||
|
||||
@article{McKenney04a
|
||||
,author="Paul E. McKenney and Dipankar Sarma and Maneesh Soni"
|
||||
,title="Scaling dcache with {RCU}"
|
||||
,Year="2004"
|
||||
,Month="January"
|
||||
,journal="Linux Journal"
|
||||
,volume="1"
|
||||
,number="118"
|
||||
,pages="38-46"
|
||||
}
|
||||
|
||||
@Conference{McKenney04b
|
||||
,Author="Paul E. McKenney"
|
||||
,Title="{RCU} vs. Locking Performance on Different {CPUs}"
|
||||
,Booktitle="{linux.conf.au}"
|
||||
,Month="January"
|
||||
,Year="2004"
|
||||
,Address="Adelaide, Australia"
|
||||
,note="Available:
|
||||
\url{http://www.linux.org.au/conf/2004/abstracts.html#90}
|
||||
\url{http://www.rdrop.com/users/paulmck/rclock/lockperf.2004.01.17a.pdf}
|
||||
[Viewed June 23, 2004]"
|
||||
}
|
||||
|
||||
@phdthesis{PaulEdwardMcKenneyPhD
|
||||
,author="Paul E. McKenney"
|
||||
,title="Exploiting Deferred Destruction:
|
||||
An Analysis of Read-Copy-Update Techniques
|
||||
in Operating System Kernels"
|
||||
,school="OGI School of Science and Engineering at
|
||||
Oregon Health and Sciences University"
|
||||
,year="2004"
|
||||
,note="Available:
|
||||
\url{http://www.rdrop.com/users/paulmck/RCU/RCUdissertation.2004.07.14e1.pdf}
|
||||
[Viewed October 15, 2004]"
|
||||
}
|
||||
|
||||
@Conference{Sarma04c
|
||||
,Author="Dipankar Sarma and Paul E. McKenney"
|
||||
,Title="Making RCU Safe for Deep Sub-Millisecond Response Realtime Applications"
|
||||
,Booktitle="Proceedings of the 2004 USENIX Annual Technical Conference
|
||||
(FREENIX Track)"
|
||||
,Publisher="USENIX Association"
|
||||
,year="2004"
|
||||
,month="June"
|
||||
,pages="182-191"
|
||||
}
|
||||
|
||||
@unpublished{JamesMorris04b
|
||||
,Author="James Morris"
|
||||
,Title="Recent Developments in {SELinux} Kernel Performance"
|
||||
,month="December"
|
||||
,year="2004"
|
||||
,note="Available:
|
||||
\url{http://www.livejournal.com/users/james_morris/2153.html}
|
||||
[Viewed December 10, 2004]"
|
||||
}
|
||||
|
||||
@unpublished{PaulMcKenney05a
|
||||
,Author="Paul E. McKenney"
|
||||
,Title="{[RFC]} {RCU} and {CONFIG\_PREEMPT\_RT} progress"
|
||||
,month="May"
|
||||
,year="2005"
|
||||
,note="Available:
|
||||
\url{http://lkml.org/lkml/2005/5/9/185}
|
||||
[Viewed May 13, 2005]"
|
||||
,annotation="
|
||||
First publication of working lock-based deferred free patches
|
||||
for the CONFIG_PREEMPT_RT environment.
|
||||
"
|
||||
}
|
||||
|
||||
@conference{PaulMcKenney05b
|
||||
,Author="Paul E. McKenney and Dipankar Sarma"
|
||||
,Title="Towards Hard Realtime Response from the Linux Kernel on SMP Hardware"
|
||||
,Booktitle="linux.conf.au 2005"
|
||||
,month="April"
|
||||
,year="2005"
|
||||
,address="Canberra, Australia"
|
||||
,note="Available:
|
||||
\url{http://www.rdrop.com/users/paulmck/RCU/realtimeRCU.2005.04.23a.pdf}
|
||||
[Viewed May 13, 2005]"
|
||||
,annotation="
|
||||
Realtime turns into making RCU yet more realtime friendly.
|
||||
"
|
||||
}
|
||||
119
Documentation/RCU/UP.txt
Normal file
119
Documentation/RCU/UP.txt
Normal file
@ -0,0 +1,119 @@
|
||||
RCU on Uniprocessor Systems
|
||||
|
||||
|
||||
A common misconception is that, on UP systems, the call_rcu() primitive
|
||||
may immediately invoke its function, and that the synchronize_rcu()
|
||||
primitive may return immediately. The basis of this misconception
|
||||
is that since there is only one CPU, it should not be necessary to
|
||||
wait for anything else to get done, since there are no other CPUs for
|
||||
anything else to be happening on. Although this approach will -sort- -of-
|
||||
work a surprising amount of the time, it is a very bad idea in general.
|
||||
This document presents three examples that demonstrate exactly how bad an
|
||||
idea this is.
|
||||
|
||||
|
||||
Example 1: softirq Suicide
|
||||
|
||||
Suppose that an RCU-based algorithm scans a linked list containing
|
||||
elements A, B, and C in process context, and can delete elements from
|
||||
this same list in softirq context. Suppose that the process-context scan
|
||||
is referencing element B when it is interrupted by softirq processing,
|
||||
which deletes element B, and then invokes call_rcu() to free element B
|
||||
after a grace period.
|
||||
|
||||
Now, if call_rcu() were to directly invoke its arguments, then upon return
|
||||
from softirq, the list scan would find itself referencing a newly freed
|
||||
element B. This situation can greatly decrease the life expectancy of
|
||||
your kernel.
|
||||
|
||||
This same problem can occur if call_rcu() is invoked from a hardware
|
||||
interrupt handler.
|
||||
|
||||
|
||||
Example 2: Function-Call Fatality
|
||||
|
||||
Of course, one could avert the suicide described in the preceding example
|
||||
by having call_rcu() directly invoke its arguments only if it was called
|
||||
from process context. However, this can fail in a similar manner.
|
||||
|
||||
Suppose that an RCU-based algorithm again scans a linked list containing
|
||||
elements A, B, and C in process contexts, but that it invokes a function
|
||||
on each element as it is scanned. Suppose further that this function
|
||||
deletes element B from the list, then passes it to call_rcu() for deferred
|
||||
freeing. This may be a bit unconventional, but it is perfectly legal
|
||||
RCU usage, since call_rcu() must wait for a grace period to elapse.
|
||||
Therefore, in this case, allowing call_rcu() to immediately invoke
|
||||
its arguments would cause it to fail to make the fundamental guarantee
|
||||
underlying RCU, namely that call_rcu() defers invoking its arguments until
|
||||
all RCU read-side critical sections currently executing have completed.
|
||||
|
||||
Quick Quiz #1: why is it -not- legal to invoke synchronize_rcu() in
|
||||
this case?
|
||||
|
||||
|
||||
Example 3: Death by Deadlock
|
||||
|
||||
Suppose that call_rcu() is invoked while holding a lock, and that the
|
||||
callback function must acquire this same lock. In this case, if
|
||||
call_rcu() were to directly invoke the callback, the result would
|
||||
be self-deadlock.
|
||||
|
||||
In some cases, it would possible to restructure to code so that
|
||||
the call_rcu() is delayed until after the lock is released. However,
|
||||
there are cases where this can be quite ugly:
|
||||
|
||||
1. If a number of items need to be passed to call_rcu() within
|
||||
the same critical section, then the code would need to create
|
||||
a list of them, then traverse the list once the lock was
|
||||
released.
|
||||
|
||||
2. In some cases, the lock will be held across some kernel API,
|
||||
so that delaying the call_rcu() until the lock is released
|
||||
requires that the data item be passed up via a common API.
|
||||
It is far better to guarantee that callbacks are invoked
|
||||
with no locks held than to have to modify such APIs to allow
|
||||
arbitrary data items to be passed back up through them.
|
||||
|
||||
If call_rcu() directly invokes the callback, painful locking restrictions
|
||||
or API changes would be required.
|
||||
|
||||
Quick Quiz #2: What locking restriction must RCU callbacks respect?
|
||||
|
||||
|
||||
Summary
|
||||
|
||||
Permitting call_rcu() to immediately invoke its arguments or permitting
|
||||
synchronize_rcu() to immediately return breaks RCU, even on a UP system.
|
||||
So do not do it! Even on a UP system, the RCU infrastructure -must-
|
||||
respect grace periods, and -must- invoke callbacks from a known environment
|
||||
in which no locks are held.
|
||||
|
||||
|
||||
Answer to Quick Quiz #1:
|
||||
Why is it -not- legal to invoke synchronize_rcu() in this case?
|
||||
|
||||
Because the calling function is scanning an RCU-protected linked
|
||||
list, and is therefore within an RCU read-side critical section.
|
||||
Therefore, the called function has been invoked within an RCU
|
||||
read-side critical section, and is not permitted to block.
|
||||
|
||||
Answer to Quick Quiz #2:
|
||||
What locking restriction must RCU callbacks respect?
|
||||
|
||||
Any lock that is acquired within an RCU callback must be
|
||||
acquired elsewhere using an _irq variant of the spinlock
|
||||
primitive. For example, if "mylock" is acquired by an
|
||||
RCU callback, then a process-context acquisition of this
|
||||
lock must use something like spin_lock_irqsave() to
|
||||
acquire the lock.
|
||||
|
||||
If the process-context code were to simply use spin_lock(),
|
||||
then, since RCU callbacks can be invoked from softirq context,
|
||||
the callback might be called from a softirq that interrupted
|
||||
the process-context critical section. This would result in
|
||||
self-deadlock.
|
||||
|
||||
This restriction might seem gratuitous, since very few RCU
|
||||
callbacks acquire locks directly. However, a great many RCU
|
||||
callbacks do acquire locks -indirectly-, for example, via
|
||||
the kfree() primitive.
|
||||
141
Documentation/RCU/arrayRCU.txt
Normal file
141
Documentation/RCU/arrayRCU.txt
Normal file
@ -0,0 +1,141 @@
|
||||
Using RCU to Protect Read-Mostly Arrays
|
||||
|
||||
|
||||
Although RCU is more commonly used to protect linked lists, it can
|
||||
also be used to protect arrays. Three situations are as follows:
|
||||
|
||||
1. Hash Tables
|
||||
|
||||
2. Static Arrays
|
||||
|
||||
3. Resizeable Arrays
|
||||
|
||||
Each of these situations are discussed below.
|
||||
|
||||
|
||||
Situation 1: Hash Tables
|
||||
|
||||
Hash tables are often implemented as an array, where each array entry
|
||||
has a linked-list hash chain. Each hash chain can be protected by RCU
|
||||
as described in the listRCU.txt document. This approach also applies
|
||||
to other array-of-list situations, such as radix trees.
|
||||
|
||||
|
||||
Situation 2: Static Arrays
|
||||
|
||||
Static arrays, where the data (rather than a pointer to the data) is
|
||||
located in each array element, and where the array is never resized,
|
||||
have not been used with RCU. Rik van Riel recommends using seqlock in
|
||||
this situation, which would also have minimal read-side overhead as long
|
||||
as updates are rare.
|
||||
|
||||
Quick Quiz: Why is it so important that updates be rare when
|
||||
using seqlock?
|
||||
|
||||
|
||||
Situation 3: Resizeable Arrays
|
||||
|
||||
Use of RCU for resizeable arrays is demonstrated by the grow_ary()
|
||||
function used by the System V IPC code. The array is used to map from
|
||||
semaphore, message-queue, and shared-memory IDs to the data structure
|
||||
that represents the corresponding IPC construct. The grow_ary()
|
||||
function does not acquire any locks; instead its caller must hold the
|
||||
ids->sem semaphore.
|
||||
|
||||
The grow_ary() function, shown below, does some limit checks, allocates a
|
||||
new ipc_id_ary, copies the old to the new portion of the new, initializes
|
||||
the remainder of the new, updates the ids->entries pointer to point to
|
||||
the new array, and invokes ipc_rcu_putref() to free up the old array.
|
||||
Note that rcu_assign_pointer() is used to update the ids->entries pointer,
|
||||
which includes any memory barriers required on whatever architecture
|
||||
you are running on.
|
||||
|
||||
static int grow_ary(struct ipc_ids* ids, int newsize)
|
||||
{
|
||||
struct ipc_id_ary* new;
|
||||
struct ipc_id_ary* old;
|
||||
int i;
|
||||
int size = ids->entries->size;
|
||||
|
||||
if(newsize > IPCMNI)
|
||||
newsize = IPCMNI;
|
||||
if(newsize <= size)
|
||||
return newsize;
|
||||
|
||||
new = ipc_rcu_alloc(sizeof(struct kern_ipc_perm *)*newsize +
|
||||
sizeof(struct ipc_id_ary));
|
||||
if(new == NULL)
|
||||
return size;
|
||||
new->size = newsize;
|
||||
memcpy(new->p, ids->entries->p,
|
||||
sizeof(struct kern_ipc_perm *)*size +
|
||||
sizeof(struct ipc_id_ary));
|
||||
for(i=size;i<newsize;i++) {
|
||||
new->p[i] = NULL;
|
||||
}
|
||||
old = ids->entries;
|
||||
|
||||
/*
|
||||
* Use rcu_assign_pointer() to make sure the memcpyed
|
||||
* contents of the new array are visible before the new
|
||||
* array becomes visible.
|
||||
*/
|
||||
rcu_assign_pointer(ids->entries, new);
|
||||
|
||||
ipc_rcu_putref(old);
|
||||
return newsize;
|
||||
}
|
||||
|
||||
The ipc_rcu_putref() function decrements the array's reference count
|
||||
and then, if the reference count has dropped to zero, uses call_rcu()
|
||||
to free the array after a grace period has elapsed.
|
||||
|
||||
The array is traversed by the ipc_lock() function. This function
|
||||
indexes into the array under the protection of rcu_read_lock(),
|
||||
using rcu_dereference() to pick up the pointer to the array so
|
||||
that it may later safely be dereferenced -- memory barriers are
|
||||
required on the Alpha CPU. Since the size of the array is stored
|
||||
with the array itself, there can be no array-size mismatches, so
|
||||
a simple check suffices. The pointer to the structure corresponding
|
||||
to the desired IPC object is placed in "out", with NULL indicating
|
||||
a non-existent entry. After acquiring "out->lock", the "out->deleted"
|
||||
flag indicates whether the IPC object is in the process of being
|
||||
deleted, and, if not, the pointer is returned.
|
||||
|
||||
struct kern_ipc_perm* ipc_lock(struct ipc_ids* ids, int id)
|
||||
{
|
||||
struct kern_ipc_perm* out;
|
||||
int lid = id % SEQ_MULTIPLIER;
|
||||
struct ipc_id_ary* entries;
|
||||
|
||||
rcu_read_lock();
|
||||
entries = rcu_dereference(ids->entries);
|
||||
if(lid >= entries->size) {
|
||||
rcu_read_unlock();
|
||||
return NULL;
|
||||
}
|
||||
out = entries->p[lid];
|
||||
if(out == NULL) {
|
||||
rcu_read_unlock();
|
||||
return NULL;
|
||||
}
|
||||
spin_lock(&out->lock);
|
||||
|
||||
/* ipc_rmid() may have already freed the ID while ipc_lock
|
||||
* was spinning: here verify that the structure is still valid
|
||||
*/
|
||||
if (out->deleted) {
|
||||
spin_unlock(&out->lock);
|
||||
rcu_read_unlock();
|
||||
return NULL;
|
||||
}
|
||||
return out;
|
||||
}
|
||||
|
||||
|
||||
Answer to Quick Quiz:
|
||||
|
||||
The reason that it is important that updates be rare when
|
||||
using seqlock is that frequent updates can livelock readers.
|
||||
One way to avoid this problem is to assign a seqlock for
|
||||
each array entry rather than to the entire array.
|
||||
261
Documentation/RCU/checklist.txt
Normal file
261
Documentation/RCU/checklist.txt
Normal file
@ -0,0 +1,261 @@
|
||||
Review Checklist for RCU Patches
|
||||
|
||||
|
||||
This document contains a checklist for producing and reviewing patches
|
||||
that make use of RCU. Violating any of the rules listed below will
|
||||
result in the same sorts of problems that leaving out a locking primitive
|
||||
would cause. This list is based on experiences reviewing such patches
|
||||
over a rather long period of time, but improvements are always welcome!
|
||||
|
||||
0. Is RCU being applied to a read-mostly situation? If the data
|
||||
structure is updated more than about 10% of the time, then
|
||||
you should strongly consider some other approach, unless
|
||||
detailed performance measurements show that RCU is nonetheless
|
||||
the right tool for the job.
|
||||
|
||||
The other exception would be where performance is not an issue,
|
||||
and RCU provides a simpler implementation. An example of this
|
||||
situation is the dynamic NMI code in the Linux 2.6 kernel,
|
||||
at least on architectures where NMIs are rare.
|
||||
|
||||
1. Does the update code have proper mutual exclusion?
|
||||
|
||||
RCU does allow -readers- to run (almost) naked, but -writers- must
|
||||
still use some sort of mutual exclusion, such as:
|
||||
|
||||
a. locking,
|
||||
b. atomic operations, or
|
||||
c. restricting updates to a single task.
|
||||
|
||||
If you choose #b, be prepared to describe how you have handled
|
||||
memory barriers on weakly ordered machines (pretty much all of
|
||||
them -- even x86 allows reads to be reordered), and be prepared
|
||||
to explain why this added complexity is worthwhile. If you
|
||||
choose #c, be prepared to explain how this single task does not
|
||||
become a major bottleneck on big multiprocessor machines (for
|
||||
example, if the task is updating information relating to itself
|
||||
that other tasks can read, there by definition can be no
|
||||
bottleneck).
|
||||
|
||||
2. Do the RCU read-side critical sections make proper use of
|
||||
rcu_read_lock() and friends? These primitives are needed
|
||||
to suppress preemption (or bottom halves, in the case of
|
||||
rcu_read_lock_bh()) in the read-side critical sections,
|
||||
and are also an excellent aid to readability.
|
||||
|
||||
As a rough rule of thumb, any dereference of an RCU-protected
|
||||
pointer must be covered by rcu_read_lock() or rcu_read_lock_bh()
|
||||
or by the appropriate update-side lock.
|
||||
|
||||
3. Does the update code tolerate concurrent accesses?
|
||||
|
||||
The whole point of RCU is to permit readers to run without
|
||||
any locks or atomic operations. This means that readers will
|
||||
be running while updates are in progress. There are a number
|
||||
of ways to handle this concurrency, depending on the situation:
|
||||
|
||||
a. Make updates appear atomic to readers. For example,
|
||||
pointer updates to properly aligned fields will appear
|
||||
atomic, as will individual atomic primitives. Operations
|
||||
performed under a lock and sequences of multiple atomic
|
||||
primitives will -not- appear to be atomic.
|
||||
|
||||
This is almost always the best approach.
|
||||
|
||||
b. Carefully order the updates and the reads so that
|
||||
readers see valid data at all phases of the update.
|
||||
This is often more difficult than it sounds, especially
|
||||
given modern CPUs' tendency to reorder memory references.
|
||||
One must usually liberally sprinkle memory barriers
|
||||
(smp_wmb(), smp_rmb(), smp_mb()) through the code,
|
||||
making it difficult to understand and to test.
|
||||
|
||||
It is usually better to group the changing data into
|
||||
a separate structure, so that the change may be made
|
||||
to appear atomic by updating a pointer to reference
|
||||
a new structure containing updated values.
|
||||
|
||||
4. Weakly ordered CPUs pose special challenges. Almost all CPUs
|
||||
are weakly ordered -- even i386 CPUs allow reads to be reordered.
|
||||
RCU code must take all of the following measures to prevent
|
||||
memory-corruption problems:
|
||||
|
||||
a. Readers must maintain proper ordering of their memory
|
||||
accesses. The rcu_dereference() primitive ensures that
|
||||
the CPU picks up the pointer before it picks up the data
|
||||
that the pointer points to. This really is necessary
|
||||
on Alpha CPUs. If you don't believe me, see:
|
||||
|
||||
http://www.openvms.compaq.com/wizard/wiz_2637.html
|
||||
|
||||
The rcu_dereference() primitive is also an excellent
|
||||
documentation aid, letting the person reading the code
|
||||
know exactly which pointers are protected by RCU.
|
||||
|
||||
The rcu_dereference() primitive is used by the various
|
||||
"_rcu()" list-traversal primitives, such as the
|
||||
list_for_each_entry_rcu(). Note that it is perfectly
|
||||
legal (if redundant) for update-side code to use
|
||||
rcu_dereference() and the "_rcu()" list-traversal
|
||||
primitives. This is particularly useful in code
|
||||
that is common to readers and updaters.
|
||||
|
||||
b. If the list macros are being used, the list_add_tail_rcu()
|
||||
and list_add_rcu() primitives must be used in order
|
||||
to prevent weakly ordered machines from misordering
|
||||
structure initialization and pointer planting.
|
||||
Similarly, if the hlist macros are being used, the
|
||||
hlist_add_head_rcu() primitive is required.
|
||||
|
||||
c. If the list macros are being used, the list_del_rcu()
|
||||
primitive must be used to keep list_del()'s pointer
|
||||
poisoning from inflicting toxic effects on concurrent
|
||||
readers. Similarly, if the hlist macros are being used,
|
||||
the hlist_del_rcu() primitive is required.
|
||||
|
||||
The list_replace_rcu() primitive may be used to
|
||||
replace an old structure with a new one in an
|
||||
RCU-protected list.
|
||||
|
||||
d. Updates must ensure that initialization of a given
|
||||
structure happens before pointers to that structure are
|
||||
publicized. Use the rcu_assign_pointer() primitive
|
||||
when publicizing a pointer to a structure that can
|
||||
be traversed by an RCU read-side critical section.
|
||||
|
||||
5. If call_rcu(), or a related primitive such as call_rcu_bh(),
|
||||
is used, the callback function must be written to be called
|
||||
from softirq context. In particular, it cannot block.
|
||||
|
||||
6. Since synchronize_rcu() can block, it cannot be called from
|
||||
any sort of irq context.
|
||||
|
||||
7. If the updater uses call_rcu(), then the corresponding readers
|
||||
must use rcu_read_lock() and rcu_read_unlock(). If the updater
|
||||
uses call_rcu_bh(), then the corresponding readers must use
|
||||
rcu_read_lock_bh() and rcu_read_unlock_bh(). Mixing things up
|
||||
will result in confusion and broken kernels.
|
||||
|
||||
One exception to this rule: rcu_read_lock() and rcu_read_unlock()
|
||||
may be substituted for rcu_read_lock_bh() and rcu_read_unlock_bh()
|
||||
in cases where local bottom halves are already known to be
|
||||
disabled, for example, in irq or softirq context. Commenting
|
||||
such cases is a must, of course! And the jury is still out on
|
||||
whether the increased speed is worth it.
|
||||
|
||||
8. Although synchronize_rcu() is a bit slower than is call_rcu(),
|
||||
it usually results in simpler code. So, unless update
|
||||
performance is critically important or the updaters cannot block,
|
||||
synchronize_rcu() should be used in preference to call_rcu().
|
||||
|
||||
An especially important property of the synchronize_rcu()
|
||||
primitive is that it automatically self-limits: if grace periods
|
||||
are delayed for whatever reason, then the synchronize_rcu()
|
||||
primitive will correspondingly delay updates. In contrast,
|
||||
code using call_rcu() should explicitly limit update rate in
|
||||
cases where grace periods are delayed, as failing to do so can
|
||||
result in excessive realtime latencies or even OOM conditions.
|
||||
|
||||
Ways of gaining this self-limiting property when using call_rcu()
|
||||
include:
|
||||
|
||||
a. Keeping a count of the number of data-structure elements
|
||||
used by the RCU-protected data structure, including those
|
||||
waiting for a grace period to elapse. Enforce a limit
|
||||
on this number, stalling updates as needed to allow
|
||||
previously deferred frees to complete.
|
||||
|
||||
Alternatively, limit only the number awaiting deferred
|
||||
free rather than the total number of elements.
|
||||
|
||||
b. Limiting update rate. For example, if updates occur only
|
||||
once per hour, then no explicit rate limiting is required,
|
||||
unless your system is already badly broken. The dcache
|
||||
subsystem takes this approach -- updates are guarded
|
||||
by a global lock, limiting their rate.
|
||||
|
||||
c. Trusted update -- if updates can only be done manually by
|
||||
superuser or some other trusted user, then it might not
|
||||
be necessary to automatically limit them. The theory
|
||||
here is that superuser already has lots of ways to crash
|
||||
the machine.
|
||||
|
||||
d. Use call_rcu_bh() rather than call_rcu(), in order to take
|
||||
advantage of call_rcu_bh()'s faster grace periods.
|
||||
|
||||
e. Periodically invoke synchronize_rcu(), permitting a limited
|
||||
number of updates per grace period.
|
||||
|
||||
9. All RCU list-traversal primitives, which include
|
||||
list_for_each_rcu(), list_for_each_entry_rcu(),
|
||||
list_for_each_continue_rcu(), and list_for_each_safe_rcu(),
|
||||
must be within an RCU read-side critical section. RCU
|
||||
read-side critical sections are delimited by rcu_read_lock()
|
||||
and rcu_read_unlock(), or by similar primitives such as
|
||||
rcu_read_lock_bh() and rcu_read_unlock_bh().
|
||||
|
||||
Use of the _rcu() list-traversal primitives outside of an
|
||||
RCU read-side critical section causes no harm other than
|
||||
a slight performance degradation on Alpha CPUs. It can
|
||||
also be quite helpful in reducing code bloat when common
|
||||
code is shared between readers and updaters.
|
||||
|
||||
10. Conversely, if you are in an RCU read-side critical section,
|
||||
you -must- use the "_rcu()" variants of the list macros.
|
||||
Failing to do so will break Alpha and confuse people reading
|
||||
your code.
|
||||
|
||||
11. Note that synchronize_rcu() -only- guarantees to wait until
|
||||
all currently executing rcu_read_lock()-protected RCU read-side
|
||||
critical sections complete. It does -not- necessarily guarantee
|
||||
that all currently running interrupts, NMIs, preempt_disable()
|
||||
code, or idle loops will complete. Therefore, if you do not have
|
||||
rcu_read_lock()-protected read-side critical sections, do -not-
|
||||
use synchronize_rcu().
|
||||
|
||||
If you want to wait for some of these other things, you might
|
||||
instead need to use synchronize_irq() or synchronize_sched().
|
||||
|
||||
12. Any lock acquired by an RCU callback must be acquired elsewhere
|
||||
with irq disabled, e.g., via spin_lock_irqsave(). Failing to
|
||||
disable irq on a given acquisition of that lock will result in
|
||||
deadlock as soon as the RCU callback happens to interrupt that
|
||||
acquisition's critical section.
|
||||
|
||||
13. SRCU (srcu_read_lock(), srcu_read_unlock(), and synchronize_srcu())
|
||||
may only be invoked from process context. Unlike other forms of
|
||||
RCU, it -is- permissible to block in an SRCU read-side critical
|
||||
section (demarked by srcu_read_lock() and srcu_read_unlock()),
|
||||
hence the "SRCU": "sleepable RCU". Please note that if you
|
||||
don't need to sleep in read-side critical sections, you should
|
||||
be using RCU rather than SRCU, because RCU is almost always
|
||||
faster and easier to use than is SRCU.
|
||||
|
||||
Also unlike other forms of RCU, explicit initialization
|
||||
and cleanup is required via init_srcu_struct() and
|
||||
cleanup_srcu_struct(). These are passed a "struct srcu_struct"
|
||||
that defines the scope of a given SRCU domain. Once initialized,
|
||||
the srcu_struct is passed to srcu_read_lock(), srcu_read_unlock()
|
||||
and synchronize_srcu(). A given synchronize_srcu() waits only
|
||||
for SRCU read-side critical sections governed by srcu_read_lock()
|
||||
and srcu_read_unlock() calls that have been passd the same
|
||||
srcu_struct. This property is what makes sleeping read-side
|
||||
critical sections tolerable -- a given subsystem delays only
|
||||
its own updates, not those of other subsystems using SRCU.
|
||||
Therefore, SRCU is less prone to OOM the system than RCU would
|
||||
be if RCU's read-side critical sections were permitted to
|
||||
sleep.
|
||||
|
||||
The ability to sleep in read-side critical sections does not
|
||||
come for free. First, corresponding srcu_read_lock() and
|
||||
srcu_read_unlock() calls must be passed the same srcu_struct.
|
||||
Second, grace-period-detection overhead is amortized only
|
||||
over those updates sharing a given srcu_struct, rather than
|
||||
being globally amortized as they are for other forms of RCU.
|
||||
Therefore, SRCU should be used in preference to rw_semaphore
|
||||
only in extremely read-intensive situations, or in situations
|
||||
requiring SRCU's read-side deadlock immunity or low read-side
|
||||
realtime latency.
|
||||
|
||||
Note that, rcu_assign_pointer() and rcu_dereference() relate to
|
||||
SRCU just as they do to other forms of RCU.
|
||||
315
Documentation/RCU/listRCU.txt
Normal file
315
Documentation/RCU/listRCU.txt
Normal file
@ -0,0 +1,315 @@
|
||||
Using RCU to Protect Read-Mostly Linked Lists
|
||||
|
||||
|
||||
One of the best applications of RCU is to protect read-mostly linked lists
|
||||
("struct list_head" in list.h). One big advantage of this approach
|
||||
is that all of the required memory barriers are included for you in
|
||||
the list macros. This document describes several applications of RCU,
|
||||
with the best fits first.
|
||||
|
||||
|
||||
Example 1: Read-Side Action Taken Outside of Lock, No In-Place Updates
|
||||
|
||||
The best applications are cases where, if reader-writer locking were
|
||||
used, the read-side lock would be dropped before taking any action
|
||||
based on the results of the search. The most celebrated example is
|
||||
the routing table. Because the routing table is tracking the state of
|
||||
equipment outside of the computer, it will at times contain stale data.
|
||||
Therefore, once the route has been computed, there is no need to hold
|
||||
the routing table static during transmission of the packet. After all,
|
||||
you can hold the routing table static all you want, but that won't keep
|
||||
the external Internet from changing, and it is the state of the external
|
||||
Internet that really matters. In addition, routing entries are typically
|
||||
added or deleted, rather than being modified in place.
|
||||
|
||||
A straightforward example of this use of RCU may be found in the
|
||||
system-call auditing support. For example, a reader-writer locked
|
||||
implementation of audit_filter_task() might be as follows:
|
||||
|
||||
static enum audit_state audit_filter_task(struct task_struct *tsk)
|
||||
{
|
||||
struct audit_entry *e;
|
||||
enum audit_state state;
|
||||
|
||||
read_lock(&auditsc_lock);
|
||||
/* Note: audit_netlink_sem held by caller. */
|
||||
list_for_each_entry(e, &audit_tsklist, list) {
|
||||
if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
|
||||
read_unlock(&auditsc_lock);
|
||||
return state;
|
||||
}
|
||||
}
|
||||
read_unlock(&auditsc_lock);
|
||||
return AUDIT_BUILD_CONTEXT;
|
||||
}
|
||||
|
||||
Here the list is searched under the lock, but the lock is dropped before
|
||||
the corresponding value is returned. By the time that this value is acted
|
||||
on, the list may well have been modified. This makes sense, since if
|
||||
you are turning auditing off, it is OK to audit a few extra system calls.
|
||||
|
||||
This means that RCU can be easily applied to the read side, as follows:
|
||||
|
||||
static enum audit_state audit_filter_task(struct task_struct *tsk)
|
||||
{
|
||||
struct audit_entry *e;
|
||||
enum audit_state state;
|
||||
|
||||
rcu_read_lock();
|
||||
/* Note: audit_netlink_sem held by caller. */
|
||||
list_for_each_entry_rcu(e, &audit_tsklist, list) {
|
||||
if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
|
||||
rcu_read_unlock();
|
||||
return state;
|
||||
}
|
||||
}
|
||||
rcu_read_unlock();
|
||||
return AUDIT_BUILD_CONTEXT;
|
||||
}
|
||||
|
||||
The read_lock() and read_unlock() calls have become rcu_read_lock()
|
||||
and rcu_read_unlock(), respectively, and the list_for_each_entry() has
|
||||
become list_for_each_entry_rcu(). The _rcu() list-traversal primitives
|
||||
insert the read-side memory barriers that are required on DEC Alpha CPUs.
|
||||
|
||||
The changes to the update side are also straightforward. A reader-writer
|
||||
lock might be used as follows for deletion and insertion:
|
||||
|
||||
static inline int audit_del_rule(struct audit_rule *rule,
|
||||
struct list_head *list)
|
||||
{
|
||||
struct audit_entry *e;
|
||||
|
||||
write_lock(&auditsc_lock);
|
||||
list_for_each_entry(e, list, list) {
|
||||
if (!audit_compare_rule(rule, &e->rule)) {
|
||||
list_del(&e->list);
|
||||
write_unlock(&auditsc_lock);
|
||||
return 0;
|
||||
}
|
||||
}
|
||||
write_unlock(&auditsc_lock);
|
||||
return -EFAULT; /* No matching rule */
|
||||
}
|
||||
|
||||
static inline int audit_add_rule(struct audit_entry *entry,
|
||||
struct list_head *list)
|
||||
{
|
||||
write_lock(&auditsc_lock);
|
||||
if (entry->rule.flags & AUDIT_PREPEND) {
|
||||
entry->rule.flags &= ~AUDIT_PREPEND;
|
||||
list_add(&entry->list, list);
|
||||
} else {
|
||||
list_add_tail(&entry->list, list);
|
||||
}
|
||||
write_unlock(&auditsc_lock);
|
||||
return 0;
|
||||
}
|
||||
|
||||
Following are the RCU equivalents for these two functions:
|
||||
|
||||
static inline int audit_del_rule(struct audit_rule *rule,
|
||||
struct list_head *list)
|
||||
{
|
||||
struct audit_entry *e;
|
||||
|
||||
/* Do not use the _rcu iterator here, since this is the only
|
||||
* deletion routine. */
|
||||
list_for_each_entry(e, list, list) {
|
||||
if (!audit_compare_rule(rule, &e->rule)) {
|
||||
list_del_rcu(&e->list);
|
||||
call_rcu(&e->rcu, audit_free_rule, e);
|
||||
return 0;
|
||||
}
|
||||
}
|
||||
return -EFAULT; /* No matching rule */
|
||||
}
|
||||
|
||||
static inline int audit_add_rule(struct audit_entry *entry,
|
||||
struct list_head *list)
|
||||
{
|
||||
if (entry->rule.flags & AUDIT_PREPEND) {
|
||||
entry->rule.flags &= ~AUDIT_PREPEND;
|
||||
list_add_rcu(&entry->list, list);
|
||||
} else {
|
||||
list_add_tail_rcu(&entry->list, list);
|
||||
}
|
||||
return 0;
|
||||
}
|
||||
|
||||
Normally, the write_lock() and write_unlock() would be replaced by
|
||||
a spin_lock() and a spin_unlock(), but in this case, all callers hold
|
||||
audit_netlink_sem, so no additional locking is required. The auditsc_lock
|
||||
can therefore be eliminated, since use of RCU eliminates the need for
|
||||
writers to exclude readers. Normally, the write_lock() calls would
|
||||
be converted into spin_lock() calls.
|
||||
|
||||
The list_del(), list_add(), and list_add_tail() primitives have been
|
||||
replaced by list_del_rcu(), list_add_rcu(), and list_add_tail_rcu().
|
||||
The _rcu() list-manipulation primitives add memory barriers that are
|
||||
needed on weakly ordered CPUs (most of them!). The list_del_rcu()
|
||||
primitive omits the pointer poisoning debug-assist code that would
|
||||
otherwise cause concurrent readers to fail spectacularly.
|
||||
|
||||
So, when readers can tolerate stale data and when entries are either added
|
||||
or deleted, without in-place modification, it is very easy to use RCU!
|
||||
|
||||
|
||||
Example 2: Handling In-Place Updates
|
||||
|
||||
The system-call auditing code does not update auditing rules in place.
|
||||
However, if it did, reader-writer-locked code to do so might look as
|
||||
follows (presumably, the field_count is only permitted to decrease,
|
||||
otherwise, the added fields would need to be filled in):
|
||||
|
||||
static inline int audit_upd_rule(struct audit_rule *rule,
|
||||
struct list_head *list,
|
||||
__u32 newaction,
|
||||
__u32 newfield_count)
|
||||
{
|
||||
struct audit_entry *e;
|
||||
struct audit_newentry *ne;
|
||||
|
||||
write_lock(&auditsc_lock);
|
||||
/* Note: audit_netlink_sem held by caller. */
|
||||
list_for_each_entry(e, list, list) {
|
||||
if (!audit_compare_rule(rule, &e->rule)) {
|
||||
e->rule.action = newaction;
|
||||
e->rule.file_count = newfield_count;
|
||||
write_unlock(&auditsc_lock);
|
||||
return 0;
|
||||
}
|
||||
}
|
||||
write_unlock(&auditsc_lock);
|
||||
return -EFAULT; /* No matching rule */
|
||||
}
|
||||
|
||||
The RCU version creates a copy, updates the copy, then replaces the old
|
||||
entry with the newly updated entry. This sequence of actions, allowing
|
||||
concurrent reads while doing a copy to perform an update, is what gives
|
||||
RCU ("read-copy update") its name. The RCU code is as follows:
|
||||
|
||||
static inline int audit_upd_rule(struct audit_rule *rule,
|
||||
struct list_head *list,
|
||||
__u32 newaction,
|
||||
__u32 newfield_count)
|
||||
{
|
||||
struct audit_entry *e;
|
||||
struct audit_newentry *ne;
|
||||
|
||||
list_for_each_entry(e, list, list) {
|
||||
if (!audit_compare_rule(rule, &e->rule)) {
|
||||
ne = kmalloc(sizeof(*entry), GFP_ATOMIC);
|
||||
if (ne == NULL)
|
||||
return -ENOMEM;
|
||||
audit_copy_rule(&ne->rule, &e->rule);
|
||||
ne->rule.action = newaction;
|
||||
ne->rule.file_count = newfield_count;
|
||||
list_replace_rcu(e, ne);
|
||||
call_rcu(&e->rcu, audit_free_rule, e);
|
||||
return 0;
|
||||
}
|
||||
}
|
||||
return -EFAULT; /* No matching rule */
|
||||
}
|
||||
|
||||
Again, this assumes that the caller holds audit_netlink_sem. Normally,
|
||||
the reader-writer lock would become a spinlock in this sort of code.
|
||||
|
||||
|
||||
Example 3: Eliminating Stale Data
|
||||
|
||||
The auditing examples above tolerate stale data, as do most algorithms
|
||||
that are tracking external state. Because there is a delay from the
|
||||
time the external state changes before Linux becomes aware of the change,
|
||||
additional RCU-induced staleness is normally not a problem.
|
||||
|
||||
However, there are many examples where stale data cannot be tolerated.
|
||||
One example in the Linux kernel is the System V IPC (see the ipc_lock()
|
||||
function in ipc/util.c). This code checks a "deleted" flag under a
|
||||
per-entry spinlock, and, if the "deleted" flag is set, pretends that the
|
||||
entry does not exist. For this to be helpful, the search function must
|
||||
return holding the per-entry spinlock, as ipc_lock() does in fact do.
|
||||
|
||||
Quick Quiz: Why does the search function need to return holding the
|
||||
per-entry lock for this deleted-flag technique to be helpful?
|
||||
|
||||
If the system-call audit module were to ever need to reject stale data,
|
||||
one way to accomplish this would be to add a "deleted" flag and a "lock"
|
||||
spinlock to the audit_entry structure, and modify audit_filter_task()
|
||||
as follows:
|
||||
|
||||
static enum audit_state audit_filter_task(struct task_struct *tsk)
|
||||
{
|
||||
struct audit_entry *e;
|
||||
enum audit_state state;
|
||||
|
||||
rcu_read_lock();
|
||||
list_for_each_entry_rcu(e, &audit_tsklist, list) {
|
||||
if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
|
||||
spin_lock(&e->lock);
|
||||
if (e->deleted) {
|
||||
spin_unlock(&e->lock);
|
||||
rcu_read_unlock();
|
||||
return AUDIT_BUILD_CONTEXT;
|
||||
}
|
||||
rcu_read_unlock();
|
||||
return state;
|
||||
}
|
||||
}
|
||||
rcu_read_unlock();
|
||||
return AUDIT_BUILD_CONTEXT;
|
||||
}
|
||||
|
||||
Note that this example assumes that entries are only added and deleted.
|
||||
Additional mechanism is required to deal correctly with the
|
||||
update-in-place performed by audit_upd_rule(). For one thing,
|
||||
audit_upd_rule() would need additional memory barriers to ensure
|
||||
that the list_add_rcu() was really executed before the list_del_rcu().
|
||||
|
||||
The audit_del_rule() function would need to set the "deleted"
|
||||
flag under the spinlock as follows:
|
||||
|
||||
static inline int audit_del_rule(struct audit_rule *rule,
|
||||
struct list_head *list)
|
||||
{
|
||||
struct audit_entry *e;
|
||||
|
||||
/* Do not need to use the _rcu iterator here, since this
|
||||
* is the only deletion routine. */
|
||||
list_for_each_entry(e, list, list) {
|
||||
if (!audit_compare_rule(rule, &e->rule)) {
|
||||
spin_lock(&e->lock);
|
||||
list_del_rcu(&e->list);
|
||||
e->deleted = 1;
|
||||
spin_unlock(&e->lock);
|
||||
call_rcu(&e->rcu, audit_free_rule, e);
|
||||
return 0;
|
||||
}
|
||||
}
|
||||
return -EFAULT; /* No matching rule */
|
||||
}
|
||||
|
||||
|
||||
Summary
|
||||
|
||||
Read-mostly list-based data structures that can tolerate stale data are
|
||||
the most amenable to use of RCU. The simplest case is where entries are
|
||||
either added or deleted from the data structure (or atomically modified
|
||||
in place), but non-atomic in-place modifications can be handled by making
|
||||
a copy, updating the copy, then replacing the original with the copy.
|
||||
If stale data cannot be tolerated, then a "deleted" flag may be used
|
||||
in conjunction with a per-entry spinlock in order to allow the search
|
||||
function to reject newly deleted data.
|
||||
|
||||
|
||||
Answer to Quick Quiz
|
||||
Why does the search function need to return holding the per-entry
|
||||
lock for this deleted-flag technique to be helpful?
|
||||
|
||||
If the search function drops the per-entry lock before returning,
|
||||
then the caller will be processing stale data in any case. If it
|
||||
is really OK to be processing stale data, then you don't need a
|
||||
"deleted" flag. If processing stale data really is a problem,
|
||||
then you need to hold the per-entry lock across all of the code
|
||||
that uses the value that was returned.
|
||||
123
Documentation/RCU/rcu.txt
Normal file
123
Documentation/RCU/rcu.txt
Normal file
@ -0,0 +1,123 @@
|
||||
RCU Concepts
|
||||
|
||||
|
||||
The basic idea behind RCU (read-copy update) is to split destructive
|
||||
operations into two parts, one that prevents anyone from seeing the data
|
||||
item being destroyed, and one that actually carries out the destruction.
|
||||
A "grace period" must elapse between the two parts, and this grace period
|
||||
must be long enough that any readers accessing the item being deleted have
|
||||
since dropped their references. For example, an RCU-protected deletion
|
||||
from a linked list would first remove the item from the list, wait for
|
||||
a grace period to elapse, then free the element. See the listRCU.txt
|
||||
file for more information on using RCU with linked lists.
|
||||
|
||||
|
||||
Frequently Asked Questions
|
||||
|
||||
o Why would anyone want to use RCU?
|
||||
|
||||
The advantage of RCU's two-part approach is that RCU readers need
|
||||
not acquire any locks, perform any atomic instructions, write to
|
||||
shared memory, or (on CPUs other than Alpha) execute any memory
|
||||
barriers. The fact that these operations are quite expensive
|
||||
on modern CPUs is what gives RCU its performance advantages
|
||||
in read-mostly situations. The fact that RCU readers need not
|
||||
acquire locks can also greatly simplify deadlock-avoidance code.
|
||||
|
||||
o How can the updater tell when a grace period has completed
|
||||
if the RCU readers give no indication when they are done?
|
||||
|
||||
Just as with spinlocks, RCU readers are not permitted to
|
||||
block, switch to user-mode execution, or enter the idle loop.
|
||||
Therefore, as soon as a CPU is seen passing through any of these
|
||||
three states, we know that that CPU has exited any previous RCU
|
||||
read-side critical sections. So, if we remove an item from a
|
||||
linked list, and then wait until all CPUs have switched context,
|
||||
executed in user mode, or executed in the idle loop, we can
|
||||
safely free up that item.
|
||||
|
||||
o If I am running on a uniprocessor kernel, which can only do one
|
||||
thing at a time, why should I wait for a grace period?
|
||||
|
||||
See the UP.txt file in this directory.
|
||||
|
||||
o How can I see where RCU is currently used in the Linux kernel?
|
||||
|
||||
Search for "rcu_read_lock", "rcu_read_unlock", "call_rcu",
|
||||
"rcu_read_lock_bh", "rcu_read_unlock_bh", "call_rcu_bh",
|
||||
"srcu_read_lock", "srcu_read_unlock", "synchronize_rcu",
|
||||
"synchronize_net", and "synchronize_srcu".
|
||||
|
||||
o What guidelines should I follow when writing code that uses RCU?
|
||||
|
||||
See the checklist.txt file in this directory.
|
||||
|
||||
o Why the name "RCU"?
|
||||
|
||||
"RCU" stands for "read-copy update". The file listRCU.txt has
|
||||
more information on where this name came from, search for
|
||||
"read-copy update" to find it.
|
||||
|
||||
o I hear that RCU is patented? What is with that?
|
||||
|
||||
Yes, it is. There are several known patents related to RCU,
|
||||
search for the string "Patent" in RTFP.txt to find them.
|
||||
Of these, one was allowed to lapse by the assignee, and the
|
||||
others have been contributed to the Linux kernel under GPL.
|
||||
|
||||
o I hear that RCU needs work in order to support realtime kernels?
|
||||
|
||||
Yes, work in progress.
|
||||
|
||||
o Where can I find more information on RCU?
|
||||
|
||||
See the RTFP.txt file in this directory.
|
||||
Or point your browser at http://www.rdrop.com/users/paulmck/RCU/.
|
||||
|
||||
o What are all these files in this directory?
|
||||
|
||||
|
||||
NMI-RCU.txt
|
||||
|
||||
Describes how to use RCU to implement dynamic
|
||||
NMI handlers, which can be revectored on the fly,
|
||||
without rebooting.
|
||||
|
||||
RTFP.txt
|
||||
|
||||
List of RCU-related publications and web sites.
|
||||
|
||||
UP.txt
|
||||
|
||||
Discussion of RCU usage in UP kernels.
|
||||
|
||||
arrayRCU.txt
|
||||
|
||||
Describes how to use RCU to protect arrays, with
|
||||
resizeable arrays whose elements reference other
|
||||
data structures being of the most interest.
|
||||
|
||||
checklist.txt
|
||||
|
||||
Lists things to check for when inspecting code that
|
||||
uses RCU.
|
||||
|
||||
listRCU.txt
|
||||
|
||||
Describes how to use RCU to protect linked lists.
|
||||
This is the simplest and most common use of RCU
|
||||
in the Linux kernel.
|
||||
|
||||
rcu.txt
|
||||
|
||||
You are reading it!
|
||||
|
||||
rcuref.txt
|
||||
|
||||
Describes how to combine use of reference counts
|
||||
with RCU.
|
||||
|
||||
whatisRCU.txt
|
||||
|
||||
Overview of how the RCU implementation works. Along
|
||||
the way, presents a conceptual view of RCU.
|
||||
66
Documentation/RCU/rcuref.txt
Normal file
66
Documentation/RCU/rcuref.txt
Normal file
@ -0,0 +1,66 @@
|
||||
Reference-count design for elements of lists/arrays protected by RCU.
|
||||
|
||||
Reference counting on elements of lists which are protected by traditional
|
||||
reader/writer spinlocks or semaphores are straightforward:
|
||||
|
||||
1. 2.
|
||||
add() search_and_reference()
|
||||
{ {
|
||||
alloc_object read_lock(&list_lock);
|
||||
... search_for_element
|
||||
atomic_set(&el->rc, 1); atomic_inc(&el->rc);
|
||||
write_lock(&list_lock); ...
|
||||
add_element read_unlock(&list_lock);
|
||||
... ...
|
||||
write_unlock(&list_lock); }
|
||||
}
|
||||
|
||||
3. 4.
|
||||
release_referenced() delete()
|
||||
{ {
|
||||
... write_lock(&list_lock);
|
||||
atomic_dec(&el->rc, relfunc) ...
|
||||
... delete_element
|
||||
} write_unlock(&list_lock);
|
||||
...
|
||||
if (atomic_dec_and_test(&el->rc))
|
||||
kfree(el);
|
||||
...
|
||||
}
|
||||
|
||||
If this list/array is made lock free using RCU as in changing the
|
||||
write_lock() in add() and delete() to spin_lock and changing read_lock
|
||||
in search_and_reference to rcu_read_lock(), the atomic_get in
|
||||
search_and_reference could potentially hold reference to an element which
|
||||
has already been deleted from the list/array. Use atomic_inc_not_zero()
|
||||
in this scenario as follows:
|
||||
|
||||
1. 2.
|
||||
add() search_and_reference()
|
||||
{ {
|
||||
alloc_object rcu_read_lock();
|
||||
... search_for_element
|
||||
atomic_set(&el->rc, 1); if (atomic_inc_not_zero(&el->rc)) {
|
||||
write_lock(&list_lock); rcu_read_unlock();
|
||||
return FAIL;
|
||||
add_element }
|
||||
... ...
|
||||
write_unlock(&list_lock); rcu_read_unlock();
|
||||
} }
|
||||
3. 4.
|
||||
release_referenced() delete()
|
||||
{ {
|
||||
... write_lock(&list_lock);
|
||||
if (atomic_dec_and_test(&el->rc)) ...
|
||||
call_rcu(&el->head, el_free); delete_element
|
||||
... write_unlock(&list_lock);
|
||||
} ...
|
||||
if (atomic_dec_and_test(&el->rc))
|
||||
call_rcu(&el->head, el_free);
|
||||
...
|
||||
}
|
||||
|
||||
Sometimes, a reference to the element needs to be obtained in the
|
||||
update (write) stream. In such cases, atomic_inc_not_zero() might be
|
||||
overkill, since we hold the update-side spinlock. One might instead
|
||||
use atomic_inc() in such cases.
|
||||
163
Documentation/RCU/torture.txt
Normal file
163
Documentation/RCU/torture.txt
Normal file
@ -0,0 +1,163 @@
|
||||
RCU Torture Test Operation
|
||||
|
||||
|
||||
CONFIG_RCU_TORTURE_TEST
|
||||
|
||||
The CONFIG_RCU_TORTURE_TEST config option is available for all RCU
|
||||
implementations. It creates an rcutorture kernel module that can
|
||||
be loaded to run a torture test. The test periodically outputs
|
||||
status messages via printk(), which can be examined via the dmesg
|
||||
command (perhaps grepping for "torture"). The test is started
|
||||
when the module is loaded, and stops when the module is unloaded.
|
||||
|
||||
However, actually setting this config option to "y" results in the system
|
||||
running the test immediately upon boot, and ending only when the system
|
||||
is taken down. Normally, one will instead want to build the system
|
||||
with CONFIG_RCU_TORTURE_TEST=m and to use modprobe and rmmod to control
|
||||
the test, perhaps using a script similar to the one shown at the end of
|
||||
this document. Note that you will need CONFIG_MODULE_UNLOAD in order
|
||||
to be able to end the test.
|
||||
|
||||
|
||||
MODULE PARAMETERS
|
||||
|
||||
This module has the following parameters:
|
||||
|
||||
nreaders This is the number of RCU reading threads supported.
|
||||
The default is twice the number of CPUs. Why twice?
|
||||
To properly exercise RCU implementations with preemptible
|
||||
read-side critical sections.
|
||||
|
||||
nfakewriters This is the number of RCU fake writer threads to run. Fake
|
||||
writer threads repeatedly use the synchronous "wait for
|
||||
current readers" function of the interface selected by
|
||||
torture_type, with a delay between calls to allow for various
|
||||
different numbers of writers running in parallel.
|
||||
nfakewriters defaults to 4, which provides enough parallelism
|
||||
to trigger special cases caused by multiple writers, such as
|
||||
the synchronize_srcu() early return optimization.
|
||||
|
||||
stat_interval The number of seconds between output of torture
|
||||
statistics (via printk()). Regardless of the interval,
|
||||
statistics are printed when the module is unloaded.
|
||||
Setting the interval to zero causes the statistics to
|
||||
be printed -only- when the module is unloaded, and this
|
||||
is the default.
|
||||
|
||||
shuffle_interval
|
||||
The number of seconds to keep the test threads affinitied
|
||||
to a particular subset of the CPUs. Used in conjunction
|
||||
with test_no_idle_hz.
|
||||
|
||||
test_no_idle_hz Whether or not to test the ability of RCU to operate in
|
||||
a kernel that disables the scheduling-clock interrupt to
|
||||
idle CPUs. Boolean parameter, "1" to test, "0" otherwise.
|
||||
|
||||
torture_type The type of RCU to test: "rcu" for the rcu_read_lock() API,
|
||||
"rcu_sync" for rcu_read_lock() with synchronous reclamation,
|
||||
"rcu_bh" for the rcu_read_lock_bh() API, "rcu_bh_sync" for
|
||||
rcu_read_lock_bh() with synchronous reclamation, "srcu" for
|
||||
the "srcu_read_lock()" API, and "sched" for the use of
|
||||
preempt_disable() together with synchronize_sched().
|
||||
|
||||
verbose Enable debug printk()s. Default is disabled.
|
||||
|
||||
|
||||
OUTPUT
|
||||
|
||||
The statistics output is as follows:
|
||||
|
||||
rcu-torture: --- Start of test: nreaders=16 stat_interval=0 verbose=0
|
||||
rcu-torture: rtc: 0000000000000000 ver: 1916 tfle: 0 rta: 1916 rtaf: 0 rtf: 1915
|
||||
rcu-torture: Reader Pipe: 1466408 9747 0 0 0 0 0 0 0 0 0
|
||||
rcu-torture: Reader Batch: 1464477 11678 0 0 0 0 0 0 0 0
|
||||
rcu-torture: Free-Block Circulation: 1915 1915 1915 1915 1915 1915 1915 1915 1915 1915 0
|
||||
rcu-torture: --- End of test
|
||||
|
||||
The command "dmesg | grep torture:" will extract this information on
|
||||
most systems. On more esoteric configurations, it may be necessary to
|
||||
use other commands to access the output of the printk()s used by
|
||||
the RCU torture test. The printk()s use KERN_ALERT, so they should
|
||||
be evident. ;-)
|
||||
|
||||
The entries are as follows:
|
||||
|
||||
o "ggp": The number of counter flips (or batches) since boot.
|
||||
|
||||
o "rtc": The hexadecimal address of the structure currently visible
|
||||
to readers.
|
||||
|
||||
o "ver": The number of times since boot that the rcutw writer task
|
||||
has changed the structure visible to readers.
|
||||
|
||||
o "tfle": If non-zero, indicates that the "torture freelist"
|
||||
containing structure to be placed into the "rtc" area is empty.
|
||||
This condition is important, since it can fool you into thinking
|
||||
that RCU is working when it is not. :-/
|
||||
|
||||
o "rta": Number of structures allocated from the torture freelist.
|
||||
|
||||
o "rtaf": Number of allocations from the torture freelist that have
|
||||
failed due to the list being empty.
|
||||
|
||||
o "rtf": Number of frees into the torture freelist.
|
||||
|
||||
o "Reader Pipe": Histogram of "ages" of structures seen by readers.
|
||||
If any entries past the first two are non-zero, RCU is broken.
|
||||
And rcutorture prints the error flag string "!!!" to make sure
|
||||
you notice. The age of a newly allocated structure is zero,
|
||||
it becomes one when removed from reader visibility, and is
|
||||
incremented once per grace period subsequently -- and is freed
|
||||
after passing through (RCU_TORTURE_PIPE_LEN-2) grace periods.
|
||||
|
||||
The output displayed above was taken from a correctly working
|
||||
RCU. If you want to see what it looks like when broken, break
|
||||
it yourself. ;-)
|
||||
|
||||
o "Reader Batch": Another histogram of "ages" of structures seen
|
||||
by readers, but in terms of counter flips (or batches) rather
|
||||
than in terms of grace periods. The legal number of non-zero
|
||||
entries is again two. The reason for this separate view is
|
||||
that it is easier to get the third entry to show up in the
|
||||
"Reader Batch" list than in the "Reader Pipe" list.
|
||||
|
||||
o "Free-Block Circulation": Shows the number of torture structures
|
||||
that have reached a given point in the pipeline. The first element
|
||||
should closely correspond to the number of structures allocated,
|
||||
the second to the number that have been removed from reader view,
|
||||
and all but the last remaining to the corresponding number of
|
||||
passes through a grace period. The last entry should be zero,
|
||||
as it is only incremented if a torture structure's counter
|
||||
somehow gets incremented farther than it should.
|
||||
|
||||
Different implementations of RCU can provide implementation-specific
|
||||
additional information. For example, SRCU provides the following:
|
||||
|
||||
srcu-torture: rtc: f8cf46a8 ver: 355 tfle: 0 rta: 356 rtaf: 0 rtf: 346 rtmbe: 0
|
||||
srcu-torture: Reader Pipe: 559738 939 0 0 0 0 0 0 0 0 0
|
||||
srcu-torture: Reader Batch: 560434 243 0 0 0 0 0 0 0 0
|
||||
srcu-torture: Free-Block Circulation: 355 354 353 352 351 350 349 348 347 346 0
|
||||
srcu-torture: per-CPU(idx=1): 0(0,1) 1(0,1) 2(0,0) 3(0,1)
|
||||
|
||||
The first four lines are similar to those for RCU. The last line shows
|
||||
the per-CPU counter state. The numbers in parentheses are the values
|
||||
of the "old" and "current" counters for the corresponding CPU. The
|
||||
"idx" value maps the "old" and "current" values to the underlying array,
|
||||
and is useful for debugging.
|
||||
|
||||
|
||||
USAGE
|
||||
|
||||
The following script may be used to torture RCU:
|
||||
|
||||
#!/bin/sh
|
||||
|
||||
modprobe rcutorture
|
||||
sleep 100
|
||||
rmmod rcutorture
|
||||
dmesg | grep torture:
|
||||
|
||||
The output can be manually inspected for the error flag of "!!!".
|
||||
One could of course create a more elaborate script that automatically
|
||||
checked for such errors. The "rmmod" command forces a "SUCCESS" or
|
||||
"FAILURE" indication to be printk()ed.
|
||||
918
Documentation/RCU/whatisRCU.txt
Normal file
918
Documentation/RCU/whatisRCU.txt
Normal file
@ -0,0 +1,918 @@
|
||||
What is RCU?
|
||||
|
||||
RCU is a synchronization mechanism that was added to the Linux kernel
|
||||
during the 2.5 development effort that is optimized for read-mostly
|
||||
situations. Although RCU is actually quite simple once you understand it,
|
||||
getting there can sometimes be a challenge. Part of the problem is that
|
||||
most of the past descriptions of RCU have been written with the mistaken
|
||||
assumption that there is "one true way" to describe RCU. Instead,
|
||||
the experience has been that different people must take different paths
|
||||
to arrive at an understanding of RCU. This document provides several
|
||||
different paths, as follows:
|
||||
|
||||
1. RCU OVERVIEW
|
||||
2. WHAT IS RCU'S CORE API?
|
||||
3. WHAT ARE SOME EXAMPLE USES OF CORE RCU API?
|
||||
4. WHAT IF MY UPDATING THREAD CANNOT BLOCK?
|
||||
5. WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU?
|
||||
6. ANALOGY WITH READER-WRITER LOCKING
|
||||
7. FULL LIST OF RCU APIs
|
||||
8. ANSWERS TO QUICK QUIZZES
|
||||
|
||||
People who prefer starting with a conceptual overview should focus on
|
||||
Section 1, though most readers will profit by reading this section at
|
||||
some point. People who prefer to start with an API that they can then
|
||||
experiment with should focus on Section 2. People who prefer to start
|
||||
with example uses should focus on Sections 3 and 4. People who need to
|
||||
understand the RCU implementation should focus on Section 5, then dive
|
||||
into the kernel source code. People who reason best by analogy should
|
||||
focus on Section 6. Section 7 serves as an index to the docbook API
|
||||
documentation, and Section 8 is the traditional answer key.
|
||||
|
||||
So, start with the section that makes the most sense to you and your
|
||||
preferred method of learning. If you need to know everything about
|
||||
everything, feel free to read the whole thing -- but if you are really
|
||||
that type of person, you have perused the source code and will therefore
|
||||
never need this document anyway. ;-)
|
||||
|
||||
|
||||
1. RCU OVERVIEW
|
||||
|
||||
The basic idea behind RCU is to split updates into "removal" and
|
||||
"reclamation" phases. The removal phase removes references to data items
|
||||
within a data structure (possibly by replacing them with references to
|
||||
new versions of these data items), and can run concurrently with readers.
|
||||
The reason that it is safe to run the removal phase concurrently with
|
||||
readers is the semantics of modern CPUs guarantee that readers will see
|
||||
either the old or the new version of the data structure rather than a
|
||||
partially updated reference. The reclamation phase does the work of reclaiming
|
||||
(e.g., freeing) the data items removed from the data structure during the
|
||||
removal phase. Because reclaiming data items can disrupt any readers
|
||||
concurrently referencing those data items, the reclamation phase must
|
||||
not start until readers no longer hold references to those data items.
|
||||
|
||||
Splitting the update into removal and reclamation phases permits the
|
||||
updater to perform the removal phase immediately, and to defer the
|
||||
reclamation phase until all readers active during the removal phase have
|
||||
completed, either by blocking until they finish or by registering a
|
||||
callback that is invoked after they finish. Only readers that are active
|
||||
during the removal phase need be considered, because any reader starting
|
||||
after the removal phase will be unable to gain a reference to the removed
|
||||
data items, and therefore cannot be disrupted by the reclamation phase.
|
||||
|
||||
So the typical RCU update sequence goes something like the following:
|
||||
|
||||
a. Remove pointers to a data structure, so that subsequent
|
||||
readers cannot gain a reference to it.
|
||||
|
||||
b. Wait for all previous readers to complete their RCU read-side
|
||||
critical sections.
|
||||
|
||||
c. At this point, there cannot be any readers who hold references
|
||||
to the data structure, so it now may safely be reclaimed
|
||||
(e.g., kfree()d).
|
||||
|
||||
Step (b) above is the key idea underlying RCU's deferred destruction.
|
||||
The ability to wait until all readers are done allows RCU readers to
|
||||
use much lighter-weight synchronization, in some cases, absolutely no
|
||||
synchronization at all. In contrast, in more conventional lock-based
|
||||
schemes, readers must use heavy-weight synchronization in order to
|
||||
prevent an updater from deleting the data structure out from under them.
|
||||
This is because lock-based updaters typically update data items in place,
|
||||
and must therefore exclude readers. In contrast, RCU-based updaters
|
||||
typically take advantage of the fact that writes to single aligned
|
||||
pointers are atomic on modern CPUs, allowing atomic insertion, removal,
|
||||
and replacement of data items in a linked structure without disrupting
|
||||
readers. Concurrent RCU readers can then continue accessing the old
|
||||
versions, and can dispense with the atomic operations, memory barriers,
|
||||
and communications cache misses that are so expensive on present-day
|
||||
SMP computer systems, even in absence of lock contention.
|
||||
|
||||
In the three-step procedure shown above, the updater is performing both
|
||||
the removal and the reclamation step, but it is often helpful for an
|
||||
entirely different thread to do the reclamation, as is in fact the case
|
||||
in the Linux kernel's directory-entry cache (dcache). Even if the same
|
||||
thread performs both the update step (step (a) above) and the reclamation
|
||||
step (step (c) above), it is often helpful to think of them separately.
|
||||
For example, RCU readers and updaters need not communicate at all,
|
||||
but RCU provides implicit low-overhead communication between readers
|
||||
and reclaimers, namely, in step (b) above.
|
||||
|
||||
So how the heck can a reclaimer tell when a reader is done, given
|
||||
that readers are not doing any sort of synchronization operations???
|
||||
Read on to learn about how RCU's API makes this easy.
|
||||
|
||||
|
||||
2. WHAT IS RCU'S CORE API?
|
||||
|
||||
The core RCU API is quite small:
|
||||
|
||||
a. rcu_read_lock()
|
||||
b. rcu_read_unlock()
|
||||
c. synchronize_rcu() / call_rcu()
|
||||
d. rcu_assign_pointer()
|
||||
e. rcu_dereference()
|
||||
|
||||
There are many other members of the RCU API, but the rest can be
|
||||
expressed in terms of these five, though most implementations instead
|
||||
express synchronize_rcu() in terms of the call_rcu() callback API.
|
||||
|
||||
The five core RCU APIs are described below, the other 18 will be enumerated
|
||||
later. See the kernel docbook documentation for more info, or look directly
|
||||
at the function header comments.
|
||||
|
||||
rcu_read_lock()
|
||||
|
||||
void rcu_read_lock(void);
|
||||
|
||||
Used by a reader to inform the reclaimer that the reader is
|
||||
entering an RCU read-side critical section. It is illegal
|
||||
to block while in an RCU read-side critical section, though
|
||||
kernels built with CONFIG_PREEMPT_RCU can preempt RCU read-side
|
||||
critical sections. Any RCU-protected data structure accessed
|
||||
during an RCU read-side critical section is guaranteed to remain
|
||||
unreclaimed for the full duration of that critical section.
|
||||
Reference counts may be used in conjunction with RCU to maintain
|
||||
longer-term references to data structures.
|
||||
|
||||
rcu_read_unlock()
|
||||
|
||||
void rcu_read_unlock(void);
|
||||
|
||||
Used by a reader to inform the reclaimer that the reader is
|
||||
exiting an RCU read-side critical section. Note that RCU
|
||||
read-side critical sections may be nested and/or overlapping.
|
||||
|
||||
synchronize_rcu()
|
||||
|
||||
void synchronize_rcu(void);
|
||||
|
||||
Marks the end of updater code and the beginning of reclaimer
|
||||
code. It does this by blocking until all pre-existing RCU
|
||||
read-side critical sections on all CPUs have completed.
|
||||
Note that synchronize_rcu() will -not- necessarily wait for
|
||||
any subsequent RCU read-side critical sections to complete.
|
||||
For example, consider the following sequence of events:
|
||||
|
||||
CPU 0 CPU 1 CPU 2
|
||||
----------------- ------------------------- ---------------
|
||||
1. rcu_read_lock()
|
||||
2. enters synchronize_rcu()
|
||||
3. rcu_read_lock()
|
||||
4. rcu_read_unlock()
|
||||
5. exits synchronize_rcu()
|
||||
6. rcu_read_unlock()
|
||||
|
||||
To reiterate, synchronize_rcu() waits only for ongoing RCU
|
||||
read-side critical sections to complete, not necessarily for
|
||||
any that begin after synchronize_rcu() is invoked.
|
||||
|
||||
Of course, synchronize_rcu() does not necessarily return
|
||||
-immediately- after the last pre-existing RCU read-side critical
|
||||
section completes. For one thing, there might well be scheduling
|
||||
delays. For another thing, many RCU implementations process
|
||||
requests in batches in order to improve efficiencies, which can
|
||||
further delay synchronize_rcu().
|
||||
|
||||
Since synchronize_rcu() is the API that must figure out when
|
||||
readers are done, its implementation is key to RCU. For RCU
|
||||
to be useful in all but the most read-intensive situations,
|
||||
synchronize_rcu()'s overhead must also be quite small.
|
||||
|
||||
The call_rcu() API is a callback form of synchronize_rcu(),
|
||||
and is described in more detail in a later section. Instead of
|
||||
blocking, it registers a function and argument which are invoked
|
||||
after all ongoing RCU read-side critical sections have completed.
|
||||
This callback variant is particularly useful in situations where
|
||||
it is illegal to block or where update-side performance is
|
||||
critically important.
|
||||
|
||||
However, the call_rcu() API should not be used lightly, as use
|
||||
of the synchronize_rcu() API generally results in simpler code.
|
||||
In addition, the synchronize_rcu() API has the nice property
|
||||
of automatically limiting update rate should grace periods
|
||||
be delayed. This property results in system resilience in face
|
||||
of denial-of-service attacks. Code using call_rcu() should limit
|
||||
update rate in order to gain this same sort of resilience. See
|
||||
checklist.txt for some approaches to limiting the update rate.
|
||||
|
||||
rcu_assign_pointer()
|
||||
|
||||
typeof(p) rcu_assign_pointer(p, typeof(p) v);
|
||||
|
||||
Yes, rcu_assign_pointer() -is- implemented as a macro, though it
|
||||
would be cool to be able to declare a function in this manner.
|
||||
(Compiler experts will no doubt disagree.)
|
||||
|
||||
The updater uses this function to assign a new value to an
|
||||
RCU-protected pointer, in order to safely communicate the change
|
||||
in value from the updater to the reader. This function returns
|
||||
the new value, and also executes any memory-barrier instructions
|
||||
required for a given CPU architecture.
|
||||
|
||||
Perhaps just as important, it serves to document (1) which
|
||||
pointers are protected by RCU and (2) the point at which a
|
||||
given structure becomes accessible to other CPUs. That said,
|
||||
rcu_assign_pointer() is most frequently used indirectly, via
|
||||
the _rcu list-manipulation primitives such as list_add_rcu().
|
||||
|
||||
rcu_dereference()
|
||||
|
||||
typeof(p) rcu_dereference(p);
|
||||
|
||||
Like rcu_assign_pointer(), rcu_dereference() must be implemented
|
||||
as a macro.
|
||||
|
||||
The reader uses rcu_dereference() to fetch an RCU-protected
|
||||
pointer, which returns a value that may then be safely
|
||||
dereferenced. Note that rcu_deference() does not actually
|
||||
dereference the pointer, instead, it protects the pointer for
|
||||
later dereferencing. It also executes any needed memory-barrier
|
||||
instructions for a given CPU architecture. Currently, only Alpha
|
||||
needs memory barriers within rcu_dereference() -- on other CPUs,
|
||||
it compiles to nothing, not even a compiler directive.
|
||||
|
||||
Common coding practice uses rcu_dereference() to copy an
|
||||
RCU-protected pointer to a local variable, then dereferences
|
||||
this local variable, for example as follows:
|
||||
|
||||
p = rcu_dereference(head.next);
|
||||
return p->data;
|
||||
|
||||
However, in this case, one could just as easily combine these
|
||||
into one statement:
|
||||
|
||||
return rcu_dereference(head.next)->data;
|
||||
|
||||
If you are going to be fetching multiple fields from the
|
||||
RCU-protected structure, using the local variable is of
|
||||
course preferred. Repeated rcu_dereference() calls look
|
||||
ugly and incur unnecessary overhead on Alpha CPUs.
|
||||
|
||||
Note that the value returned by rcu_dereference() is valid
|
||||
only within the enclosing RCU read-side critical section.
|
||||
For example, the following is -not- legal:
|
||||
|
||||
rcu_read_lock();
|
||||
p = rcu_dereference(head.next);
|
||||
rcu_read_unlock();
|
||||
x = p->address;
|
||||
rcu_read_lock();
|
||||
y = p->data;
|
||||
rcu_read_unlock();
|
||||
|
||||
Holding a reference from one RCU read-side critical section
|
||||
to another is just as illegal as holding a reference from
|
||||
one lock-based critical section to another! Similarly,
|
||||
using a reference outside of the critical section in which
|
||||
it was acquired is just as illegal as doing so with normal
|
||||
locking.
|
||||
|
||||
As with rcu_assign_pointer(), an important function of
|
||||
rcu_dereference() is to document which pointers are protected by
|
||||
RCU, in particular, flagging a pointer that is subject to changing
|
||||
at any time, including immediately after the rcu_dereference().
|
||||
And, again like rcu_assign_pointer(), rcu_dereference() is
|
||||
typically used indirectly, via the _rcu list-manipulation
|
||||
primitives, such as list_for_each_entry_rcu().
|
||||
|
||||
The following diagram shows how each API communicates among the
|
||||
reader, updater, and reclaimer.
|
||||
|
||||
|
||||
rcu_assign_pointer()
|
||||
+--------+
|
||||
+---------------------->| reader |---------+
|
||||
| +--------+ |
|
||||
| | |
|
||||
| | | Protect:
|
||||
| | | rcu_read_lock()
|
||||
| | | rcu_read_unlock()
|
||||
| rcu_dereference() | |
|
||||
+---------+ | |
|
||||
| updater |<---------------------+ |
|
||||
+---------+ V
|
||||
| +-----------+
|
||||
+----------------------------------->| reclaimer |
|
||||
+-----------+
|
||||
Defer:
|
||||
synchronize_rcu() & call_rcu()
|
||||
|
||||
|
||||
The RCU infrastructure observes the time sequence of rcu_read_lock(),
|
||||
rcu_read_unlock(), synchronize_rcu(), and call_rcu() invocations in
|
||||
order to determine when (1) synchronize_rcu() invocations may return
|
||||
to their callers and (2) call_rcu() callbacks may be invoked. Efficient
|
||||
implementations of the RCU infrastructure make heavy use of batching in
|
||||
order to amortize their overhead over many uses of the corresponding APIs.
|
||||
|
||||
There are no fewer than three RCU mechanisms in the Linux kernel; the
|
||||
diagram above shows the first one, which is by far the most commonly used.
|
||||
The rcu_dereference() and rcu_assign_pointer() primitives are used for
|
||||
all three mechanisms, but different defer and protect primitives are
|
||||
used as follows:
|
||||
|
||||
Defer Protect
|
||||
|
||||
a. synchronize_rcu() rcu_read_lock() / rcu_read_unlock()
|
||||
call_rcu()
|
||||
|
||||
b. call_rcu_bh() rcu_read_lock_bh() / rcu_read_unlock_bh()
|
||||
|
||||
c. synchronize_sched() preempt_disable() / preempt_enable()
|
||||
local_irq_save() / local_irq_restore()
|
||||
hardirq enter / hardirq exit
|
||||
NMI enter / NMI exit
|
||||
|
||||
These three mechanisms are used as follows:
|
||||
|
||||
a. RCU applied to normal data structures.
|
||||
|
||||
b. RCU applied to networking data structures that may be subjected
|
||||
to remote denial-of-service attacks.
|
||||
|
||||
c. RCU applied to scheduler and interrupt/NMI-handler tasks.
|
||||
|
||||
Again, most uses will be of (a). The (b) and (c) cases are important
|
||||
for specialized uses, but are relatively uncommon.
|
||||
|
||||
|
||||
3. WHAT ARE SOME EXAMPLE USES OF CORE RCU API?
|
||||
|
||||
This section shows a simple use of the core RCU API to protect a
|
||||
global pointer to a dynamically allocated structure. More-typical
|
||||
uses of RCU may be found in listRCU.txt, arrayRCU.txt, and NMI-RCU.txt.
|
||||
|
||||
struct foo {
|
||||
int a;
|
||||
char b;
|
||||
long c;
|
||||
};
|
||||
DEFINE_SPINLOCK(foo_mutex);
|
||||
|
||||
struct foo *gbl_foo;
|
||||
|
||||
/*
|
||||
* Create a new struct foo that is the same as the one currently
|
||||
* pointed to by gbl_foo, except that field "a" is replaced
|
||||
* with "new_a". Points gbl_foo to the new structure, and
|
||||
* frees up the old structure after a grace period.
|
||||
*
|
||||
* Uses rcu_assign_pointer() to ensure that concurrent readers
|
||||
* see the initialized version of the new structure.
|
||||
*
|
||||
* Uses synchronize_rcu() to ensure that any readers that might
|
||||
* have references to the old structure complete before freeing
|
||||
* the old structure.
|
||||
*/
|
||||
void foo_update_a(int new_a)
|
||||
{
|
||||
struct foo *new_fp;
|
||||
struct foo *old_fp;
|
||||
|
||||
new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
|
||||
spin_lock(&foo_mutex);
|
||||
old_fp = gbl_foo;
|
||||
*new_fp = *old_fp;
|
||||
new_fp->a = new_a;
|
||||
rcu_assign_pointer(gbl_foo, new_fp);
|
||||
spin_unlock(&foo_mutex);
|
||||
synchronize_rcu();
|
||||
kfree(old_fp);
|
||||
}
|
||||
|
||||
/*
|
||||
* Return the value of field "a" of the current gbl_foo
|
||||
* structure. Use rcu_read_lock() and rcu_read_unlock()
|
||||
* to ensure that the structure does not get deleted out
|
||||
* from under us, and use rcu_dereference() to ensure that
|
||||
* we see the initialized version of the structure (important
|
||||
* for DEC Alpha and for people reading the code).
|
||||
*/
|
||||
int foo_get_a(void)
|
||||
{
|
||||
int retval;
|
||||
|
||||
rcu_read_lock();
|
||||
retval = rcu_dereference(gbl_foo)->a;
|
||||
rcu_read_unlock();
|
||||
return retval;
|
||||
}
|
||||
|
||||
So, to sum up:
|
||||
|
||||
o Use rcu_read_lock() and rcu_read_unlock() to guard RCU
|
||||
read-side critical sections.
|
||||
|
||||
o Within an RCU read-side critical section, use rcu_dereference()
|
||||
to dereference RCU-protected pointers.
|
||||
|
||||
o Use some solid scheme (such as locks or semaphores) to
|
||||
keep concurrent updates from interfering with each other.
|
||||
|
||||
o Use rcu_assign_pointer() to update an RCU-protected pointer.
|
||||
This primitive protects concurrent readers from the updater,
|
||||
-not- concurrent updates from each other! You therefore still
|
||||
need to use locking (or something similar) to keep concurrent
|
||||
rcu_assign_pointer() primitives from interfering with each other.
|
||||
|
||||
o Use synchronize_rcu() -after- removing a data element from an
|
||||
RCU-protected data structure, but -before- reclaiming/freeing
|
||||
the data element, in order to wait for the completion of all
|
||||
RCU read-side critical sections that might be referencing that
|
||||
data item.
|
||||
|
||||
See checklist.txt for additional rules to follow when using RCU.
|
||||
And again, more-typical uses of RCU may be found in listRCU.txt,
|
||||
arrayRCU.txt, and NMI-RCU.txt.
|
||||
|
||||
|
||||
4. WHAT IF MY UPDATING THREAD CANNOT BLOCK?
|
||||
|
||||
In the example above, foo_update_a() blocks until a grace period elapses.
|
||||
This is quite simple, but in some cases one cannot afford to wait so
|
||||
long -- there might be other high-priority work to be done.
|
||||
|
||||
In such cases, one uses call_rcu() rather than synchronize_rcu().
|
||||
The call_rcu() API is as follows:
|
||||
|
||||
void call_rcu(struct rcu_head * head,
|
||||
void (*func)(struct rcu_head *head));
|
||||
|
||||
This function invokes func(head) after a grace period has elapsed.
|
||||
This invocation might happen from either softirq or process context,
|
||||
so the function is not permitted to block. The foo struct needs to
|
||||
have an rcu_head structure added, perhaps as follows:
|
||||
|
||||
struct foo {
|
||||
int a;
|
||||
char b;
|
||||
long c;
|
||||
struct rcu_head rcu;
|
||||
};
|
||||
|
||||
The foo_update_a() function might then be written as follows:
|
||||
|
||||
/*
|
||||
* Create a new struct foo that is the same as the one currently
|
||||
* pointed to by gbl_foo, except that field "a" is replaced
|
||||
* with "new_a". Points gbl_foo to the new structure, and
|
||||
* frees up the old structure after a grace period.
|
||||
*
|
||||
* Uses rcu_assign_pointer() to ensure that concurrent readers
|
||||
* see the initialized version of the new structure.
|
||||
*
|
||||
* Uses call_rcu() to ensure that any readers that might have
|
||||
* references to the old structure complete before freeing the
|
||||
* old structure.
|
||||
*/
|
||||
void foo_update_a(int new_a)
|
||||
{
|
||||
struct foo *new_fp;
|
||||
struct foo *old_fp;
|
||||
|
||||
new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
|
||||
spin_lock(&foo_mutex);
|
||||
old_fp = gbl_foo;
|
||||
*new_fp = *old_fp;
|
||||
new_fp->a = new_a;
|
||||
rcu_assign_pointer(gbl_foo, new_fp);
|
||||
spin_unlock(&foo_mutex);
|
||||
call_rcu(&old_fp->rcu, foo_reclaim);
|
||||
}
|
||||
|
||||
The foo_reclaim() function might appear as follows:
|
||||
|
||||
void foo_reclaim(struct rcu_head *rp)
|
||||
{
|
||||
struct foo *fp = container_of(rp, struct foo, rcu);
|
||||
|
||||
kfree(fp);
|
||||
}
|
||||
|
||||
The container_of() primitive is a macro that, given a pointer into a
|
||||
struct, the type of the struct, and the pointed-to field within the
|
||||
struct, returns a pointer to the beginning of the struct.
|
||||
|
||||
The use of call_rcu() permits the caller of foo_update_a() to
|
||||
immediately regain control, without needing to worry further about the
|
||||
old version of the newly updated element. It also clearly shows the
|
||||
RCU distinction between updater, namely foo_update_a(), and reclaimer,
|
||||
namely foo_reclaim().
|
||||
|
||||
The summary of advice is the same as for the previous section, except
|
||||
that we are now using call_rcu() rather than synchronize_rcu():
|
||||
|
||||
o Use call_rcu() -after- removing a data element from an
|
||||
RCU-protected data structure in order to register a callback
|
||||
function that will be invoked after the completion of all RCU
|
||||
read-side critical sections that might be referencing that
|
||||
data item.
|
||||
|
||||
Again, see checklist.txt for additional rules governing the use of RCU.
|
||||
|
||||
|
||||
5. WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU?
|
||||
|
||||
One of the nice things about RCU is that it has extremely simple "toy"
|
||||
implementations that are a good first step towards understanding the
|
||||
production-quality implementations in the Linux kernel. This section
|
||||
presents two such "toy" implementations of RCU, one that is implemented
|
||||
in terms of familiar locking primitives, and another that more closely
|
||||
resembles "classic" RCU. Both are way too simple for real-world use,
|
||||
lacking both functionality and performance. However, they are useful
|
||||
in getting a feel for how RCU works. See kernel/rcupdate.c for a
|
||||
production-quality implementation, and see:
|
||||
|
||||
http://www.rdrop.com/users/paulmck/RCU
|
||||
|
||||
for papers describing the Linux kernel RCU implementation. The OLS'01
|
||||
and OLS'02 papers are a good introduction, and the dissertation provides
|
||||
more details on the current implementation as of early 2004.
|
||||
|
||||
|
||||
5A. "TOY" IMPLEMENTATION #1: LOCKING
|
||||
|
||||
This section presents a "toy" RCU implementation that is based on
|
||||
familiar locking primitives. Its overhead makes it a non-starter for
|
||||
real-life use, as does its lack of scalability. It is also unsuitable
|
||||
for realtime use, since it allows scheduling latency to "bleed" from
|
||||
one read-side critical section to another.
|
||||
|
||||
However, it is probably the easiest implementation to relate to, so is
|
||||
a good starting point.
|
||||
|
||||
It is extremely simple:
|
||||
|
||||
static DEFINE_RWLOCK(rcu_gp_mutex);
|
||||
|
||||
void rcu_read_lock(void)
|
||||
{
|
||||
read_lock(&rcu_gp_mutex);
|
||||
}
|
||||
|
||||
void rcu_read_unlock(void)
|
||||
{
|
||||
read_unlock(&rcu_gp_mutex);
|
||||
}
|
||||
|
||||
void synchronize_rcu(void)
|
||||
{
|
||||
write_lock(&rcu_gp_mutex);
|
||||
write_unlock(&rcu_gp_mutex);
|
||||
}
|
||||
|
||||
[You can ignore rcu_assign_pointer() and rcu_dereference() without
|
||||
missing much. But here they are anyway. And whatever you do, don't
|
||||
forget about them when submitting patches making use of RCU!]
|
||||
|
||||
#define rcu_assign_pointer(p, v) ({ \
|
||||
smp_wmb(); \
|
||||
(p) = (v); \
|
||||
})
|
||||
|
||||
#define rcu_dereference(p) ({ \
|
||||
typeof(p) _________p1 = p; \
|
||||
smp_read_barrier_depends(); \
|
||||
(_________p1); \
|
||||
})
|
||||
|
||||
|
||||
The rcu_read_lock() and rcu_read_unlock() primitive read-acquire
|
||||
and release a global reader-writer lock. The synchronize_rcu()
|
||||
primitive write-acquires this same lock, then immediately releases
|
||||
it. This means that once synchronize_rcu() exits, all RCU read-side
|
||||
critical sections that were in progress before synchronize_rcu() was
|
||||
called are guaranteed to have completed -- there is no way that
|
||||
synchronize_rcu() would have been able to write-acquire the lock
|
||||
otherwise.
|
||||
|
||||
It is possible to nest rcu_read_lock(), since reader-writer locks may
|
||||
be recursively acquired. Note also that rcu_read_lock() is immune
|
||||
from deadlock (an important property of RCU). The reason for this is
|
||||
that the only thing that can block rcu_read_lock() is a synchronize_rcu().
|
||||
But synchronize_rcu() does not acquire any locks while holding rcu_gp_mutex,
|
||||
so there can be no deadlock cycle.
|
||||
|
||||
Quick Quiz #1: Why is this argument naive? How could a deadlock
|
||||
occur when using this algorithm in a real-world Linux
|
||||
kernel? How could this deadlock be avoided?
|
||||
|
||||
|
||||
5B. "TOY" EXAMPLE #2: CLASSIC RCU
|
||||
|
||||
This section presents a "toy" RCU implementation that is based on
|
||||
"classic RCU". It is also short on performance (but only for updates) and
|
||||
on features such as hotplug CPU and the ability to run in CONFIG_PREEMPT
|
||||
kernels. The definitions of rcu_dereference() and rcu_assign_pointer()
|
||||
are the same as those shown in the preceding section, so they are omitted.
|
||||
|
||||
void rcu_read_lock(void) { }
|
||||
|
||||
void rcu_read_unlock(void) { }
|
||||
|
||||
void synchronize_rcu(void)
|
||||
{
|
||||
int cpu;
|
||||
|
||||
for_each_possible_cpu(cpu)
|
||||
run_on(cpu);
|
||||
}
|
||||
|
||||
Note that rcu_read_lock() and rcu_read_unlock() do absolutely nothing.
|
||||
This is the great strength of classic RCU in a non-preemptive kernel:
|
||||
read-side overhead is precisely zero, at least on non-Alpha CPUs.
|
||||
And there is absolutely no way that rcu_read_lock() can possibly
|
||||
participate in a deadlock cycle!
|
||||
|
||||
The implementation of synchronize_rcu() simply schedules itself on each
|
||||
CPU in turn. The run_on() primitive can be implemented straightforwardly
|
||||
in terms of the sched_setaffinity() primitive. Of course, a somewhat less
|
||||
"toy" implementation would restore the affinity upon completion rather
|
||||
than just leaving all tasks running on the last CPU, but when I said
|
||||
"toy", I meant -toy-!
|
||||
|
||||
So how the heck is this supposed to work???
|
||||
|
||||
Remember that it is illegal to block while in an RCU read-side critical
|
||||
section. Therefore, if a given CPU executes a context switch, we know
|
||||
that it must have completed all preceding RCU read-side critical sections.
|
||||
Once -all- CPUs have executed a context switch, then -all- preceding
|
||||
RCU read-side critical sections will have completed.
|
||||
|
||||
So, suppose that we remove a data item from its structure and then invoke
|
||||
synchronize_rcu(). Once synchronize_rcu() returns, we are guaranteed
|
||||
that there are no RCU read-side critical sections holding a reference
|
||||
to that data item, so we can safely reclaim it.
|
||||
|
||||
Quick Quiz #2: Give an example where Classic RCU's read-side
|
||||
overhead is -negative-.
|
||||
|
||||
Quick Quiz #3: If it is illegal to block in an RCU read-side
|
||||
critical section, what the heck do you do in
|
||||
PREEMPT_RT, where normal spinlocks can block???
|
||||
|
||||
|
||||
6. ANALOGY WITH READER-WRITER LOCKING
|
||||
|
||||
Although RCU can be used in many different ways, a very common use of
|
||||
RCU is analogous to reader-writer locking. The following unified
|
||||
diff shows how closely related RCU and reader-writer locking can be.
|
||||
|
||||
@@ -13,15 +14,15 @@
|
||||
struct list_head *lp;
|
||||
struct el *p;
|
||||
|
||||
- read_lock();
|
||||
- list_for_each_entry(p, head, lp) {
|
||||
+ rcu_read_lock();
|
||||
+ list_for_each_entry_rcu(p, head, lp) {
|
||||
if (p->key == key) {
|
||||
*result = p->data;
|
||||
- read_unlock();
|
||||
+ rcu_read_unlock();
|
||||
return 1;
|
||||
}
|
||||
}
|
||||
- read_unlock();
|
||||
+ rcu_read_unlock();
|
||||
return 0;
|
||||
}
|
||||
|
||||
@@ -29,15 +30,16 @@
|
||||
{
|
||||
struct el *p;
|
||||
|
||||
- write_lock(&listmutex);
|
||||
+ spin_lock(&listmutex);
|
||||
list_for_each_entry(p, head, lp) {
|
||||
if (p->key == key) {
|
||||
- list_del(&p->list);
|
||||
- write_unlock(&listmutex);
|
||||
+ list_del_rcu(&p->list);
|
||||
+ spin_unlock(&listmutex);
|
||||
+ synchronize_rcu();
|
||||
kfree(p);
|
||||
return 1;
|
||||
}
|
||||
}
|
||||
- write_unlock(&listmutex);
|
||||
+ spin_unlock(&listmutex);
|
||||
return 0;
|
||||
}
|
||||
|
||||
Or, for those who prefer a side-by-side listing:
|
||||
|
||||
1 struct el { 1 struct el {
|
||||
2 struct list_head list; 2 struct list_head list;
|
||||
3 long key; 3 long key;
|
||||
4 spinlock_t mutex; 4 spinlock_t mutex;
|
||||
5 int data; 5 int data;
|
||||
6 /* Other data fields */ 6 /* Other data fields */
|
||||
7 }; 7 };
|
||||
8 spinlock_t listmutex; 8 spinlock_t listmutex;
|
||||
9 struct el head; 9 struct el head;
|
||||
|
||||
1 int search(long key, int *result) 1 int search(long key, int *result)
|
||||
2 { 2 {
|
||||
3 struct list_head *lp; 3 struct list_head *lp;
|
||||
4 struct el *p; 4 struct el *p;
|
||||
5 5
|
||||
6 read_lock(); 6 rcu_read_lock();
|
||||
7 list_for_each_entry(p, head, lp) { 7 list_for_each_entry_rcu(p, head, lp) {
|
||||
8 if (p->key == key) { 8 if (p->key == key) {
|
||||
9 *result = p->data; 9 *result = p->data;
|
||||
10 read_unlock(); 10 rcu_read_unlock();
|
||||
11 return 1; 11 return 1;
|
||||
12 } 12 }
|
||||
13 } 13 }
|
||||
14 read_unlock(); 14 rcu_read_unlock();
|
||||
15 return 0; 15 return 0;
|
||||
16 } 16 }
|
||||
|
||||
1 int delete(long key) 1 int delete(long key)
|
||||
2 { 2 {
|
||||
3 struct el *p; 3 struct el *p;
|
||||
4 4
|
||||
5 write_lock(&listmutex); 5 spin_lock(&listmutex);
|
||||
6 list_for_each_entry(p, head, lp) { 6 list_for_each_entry(p, head, lp) {
|
||||
7 if (p->key == key) { 7 if (p->key == key) {
|
||||
8 list_del(&p->list); 8 list_del_rcu(&p->list);
|
||||
9 write_unlock(&listmutex); 9 spin_unlock(&listmutex);
|
||||
10 synchronize_rcu();
|
||||
10 kfree(p); 11 kfree(p);
|
||||
11 return 1; 12 return 1;
|
||||
12 } 13 }
|
||||
13 } 14 }
|
||||
14 write_unlock(&listmutex); 15 spin_unlock(&listmutex);
|
||||
15 return 0; 16 return 0;
|
||||
16 } 17 }
|
||||
|
||||
Either way, the differences are quite small. Read-side locking moves
|
||||
to rcu_read_lock() and rcu_read_unlock, update-side locking moves from
|
||||
a reader-writer lock to a simple spinlock, and a synchronize_rcu()
|
||||
precedes the kfree().
|
||||
|
||||
However, there is one potential catch: the read-side and update-side
|
||||
critical sections can now run concurrently. In many cases, this will
|
||||
not be a problem, but it is necessary to check carefully regardless.
|
||||
For example, if multiple independent list updates must be seen as
|
||||
a single atomic update, converting to RCU will require special care.
|
||||
|
||||
Also, the presence of synchronize_rcu() means that the RCU version of
|
||||
delete() can now block. If this is a problem, there is a callback-based
|
||||
mechanism that never blocks, namely call_rcu(), that can be used in
|
||||
place of synchronize_rcu().
|
||||
|
||||
|
||||
7. FULL LIST OF RCU APIs
|
||||
|
||||
The RCU APIs are documented in docbook-format header comments in the
|
||||
Linux-kernel source code, but it helps to have a full list of the
|
||||
APIs, since there does not appear to be a way to categorize them
|
||||
in docbook. Here is the list, by category.
|
||||
|
||||
Markers for RCU read-side critical sections:
|
||||
|
||||
rcu_read_lock
|
||||
rcu_read_unlock
|
||||
rcu_read_lock_bh
|
||||
rcu_read_unlock_bh
|
||||
srcu_read_lock
|
||||
srcu_read_unlock
|
||||
|
||||
RCU pointer/list traversal:
|
||||
|
||||
rcu_dereference
|
||||
list_for_each_rcu (to be deprecated in favor of
|
||||
list_for_each_entry_rcu)
|
||||
list_for_each_entry_rcu
|
||||
list_for_each_continue_rcu (to be deprecated in favor of new
|
||||
list_for_each_entry_continue_rcu)
|
||||
hlist_for_each_entry_rcu
|
||||
|
||||
RCU pointer update:
|
||||
|
||||
rcu_assign_pointer
|
||||
list_add_rcu
|
||||
list_add_tail_rcu
|
||||
list_del_rcu
|
||||
list_replace_rcu
|
||||
hlist_del_rcu
|
||||
hlist_add_head_rcu
|
||||
|
||||
RCU grace period:
|
||||
|
||||
synchronize_net
|
||||
synchronize_sched
|
||||
synchronize_rcu
|
||||
synchronize_srcu
|
||||
call_rcu
|
||||
call_rcu_bh
|
||||
|
||||
See the comment headers in the source code (or the docbook generated
|
||||
from them) for more information.
|
||||
|
||||
|
||||
8. ANSWERS TO QUICK QUIZZES
|
||||
|
||||
Quick Quiz #1: Why is this argument naive? How could a deadlock
|
||||
occur when using this algorithm in a real-world Linux
|
||||
kernel? [Referring to the lock-based "toy" RCU
|
||||
algorithm.]
|
||||
|
||||
Answer: Consider the following sequence of events:
|
||||
|
||||
1. CPU 0 acquires some unrelated lock, call it
|
||||
"problematic_lock", disabling irq via
|
||||
spin_lock_irqsave().
|
||||
|
||||
2. CPU 1 enters synchronize_rcu(), write-acquiring
|
||||
rcu_gp_mutex.
|
||||
|
||||
3. CPU 0 enters rcu_read_lock(), but must wait
|
||||
because CPU 1 holds rcu_gp_mutex.
|
||||
|
||||
4. CPU 1 is interrupted, and the irq handler
|
||||
attempts to acquire problematic_lock.
|
||||
|
||||
The system is now deadlocked.
|
||||
|
||||
One way to avoid this deadlock is to use an approach like
|
||||
that of CONFIG_PREEMPT_RT, where all normal spinlocks
|
||||
become blocking locks, and all irq handlers execute in
|
||||
the context of special tasks. In this case, in step 4
|
||||
above, the irq handler would block, allowing CPU 1 to
|
||||
release rcu_gp_mutex, avoiding the deadlock.
|
||||
|
||||
Even in the absence of deadlock, this RCU implementation
|
||||
allows latency to "bleed" from readers to other
|
||||
readers through synchronize_rcu(). To see this,
|
||||
consider task A in an RCU read-side critical section
|
||||
(thus read-holding rcu_gp_mutex), task B blocked
|
||||
attempting to write-acquire rcu_gp_mutex, and
|
||||
task C blocked in rcu_read_lock() attempting to
|
||||
read_acquire rcu_gp_mutex. Task A's RCU read-side
|
||||
latency is holding up task C, albeit indirectly via
|
||||
task B.
|
||||
|
||||
Realtime RCU implementations therefore use a counter-based
|
||||
approach where tasks in RCU read-side critical sections
|
||||
cannot be blocked by tasks executing synchronize_rcu().
|
||||
|
||||
Quick Quiz #2: Give an example where Classic RCU's read-side
|
||||
overhead is -negative-.
|
||||
|
||||
Answer: Imagine a single-CPU system with a non-CONFIG_PREEMPT
|
||||
kernel where a routing table is used by process-context
|
||||
code, but can be updated by irq-context code (for example,
|
||||
by an "ICMP REDIRECT" packet). The usual way of handling
|
||||
this would be to have the process-context code disable
|
||||
interrupts while searching the routing table. Use of
|
||||
RCU allows such interrupt-disabling to be dispensed with.
|
||||
Thus, without RCU, you pay the cost of disabling interrupts,
|
||||
and with RCU you don't.
|
||||
|
||||
One can argue that the overhead of RCU in this
|
||||
case is negative with respect to the single-CPU
|
||||
interrupt-disabling approach. Others might argue that
|
||||
the overhead of RCU is merely zero, and that replacing
|
||||
the positive overhead of the interrupt-disabling scheme
|
||||
with the zero-overhead RCU scheme does not constitute
|
||||
negative overhead.
|
||||
|
||||
In real life, of course, things are more complex. But
|
||||
even the theoretical possibility of negative overhead for
|
||||
a synchronization primitive is a bit unexpected. ;-)
|
||||
|
||||
Quick Quiz #3: If it is illegal to block in an RCU read-side
|
||||
critical section, what the heck do you do in
|
||||
PREEMPT_RT, where normal spinlocks can block???
|
||||
|
||||
Answer: Just as PREEMPT_RT permits preemption of spinlock
|
||||
critical sections, it permits preemption of RCU
|
||||
read-side critical sections. It also permits
|
||||
spinlocks blocking while in RCU read-side critical
|
||||
sections.
|
||||
|
||||
Why the apparent inconsistency? Because it is it
|
||||
possible to use priority boosting to keep the RCU
|
||||
grace periods short if need be (for example, if running
|
||||
short of memory). In contrast, if blocking waiting
|
||||
for (say) network reception, there is no way to know
|
||||
what should be boosted. Especially given that the
|
||||
process we need to boost might well be a human being
|
||||
who just went out for a pizza or something. And although
|
||||
a computer-operated cattle prod might arouse serious
|
||||
interest, it might also provoke serious objections.
|
||||
Besides, how does the computer know what pizza parlor
|
||||
the human being went to???
|
||||
|
||||
|
||||
ACKNOWLEDGEMENTS
|
||||
|
||||
My thanks to the people who helped make this human-readable, including
|
||||
Jon Walpole, Josh Triplett, Serge Hallyn, Suzanne Wood, and Alan Stern.
|
||||
|
||||
|
||||
For more information, see http://www.rdrop.com/users/paulmck/RCU.
|
||||
756
Documentation/README.DAC960
Normal file
756
Documentation/README.DAC960
Normal file
@ -0,0 +1,756 @@
|
||||
Linux Driver for Mylex DAC960/AcceleRAID/eXtremeRAID PCI RAID Controllers
|
||||
|
||||
Version 2.2.11 for Linux 2.2.19
|
||||
Version 2.4.11 for Linux 2.4.12
|
||||
|
||||
PRODUCTION RELEASE
|
||||
|
||||
11 October 2001
|
||||
|
||||
Leonard N. Zubkoff
|
||||
Dandelion Digital
|
||||
lnz@dandelion.com
|
||||
|
||||
Copyright 1998-2001 by Leonard N. Zubkoff <lnz@dandelion.com>
|
||||
|
||||
|
||||
INTRODUCTION
|
||||
|
||||
Mylex, Inc. designs and manufactures a variety of high performance PCI RAID
|
||||
controllers. Mylex Corporation is located at 34551 Ardenwood Blvd., Fremont,
|
||||
California 94555, USA and can be reached at 510.796.6100 or on the World Wide
|
||||
Web at http://www.mylex.com. Mylex Technical Support can be reached by
|
||||
electronic mail at mylexsup@us.ibm.com, by voice at 510.608.2400, or by FAX at
|
||||
510.745.7715. Contact information for offices in Europe and Japan is available
|
||||
on their Web site.
|
||||
|
||||
The latest information on Linux support for DAC960 PCI RAID Controllers, as
|
||||
well as the most recent release of this driver, will always be available from
|
||||
my Linux Home Page at URL "http://www.dandelion.com/Linux/". The Linux DAC960
|
||||
driver supports all current Mylex PCI RAID controllers including the new
|
||||
eXtremeRAID 2000/3000 and AcceleRAID 352/170/160 models which have an entirely
|
||||
new firmware interface from the older eXtremeRAID 1100, AcceleRAID 150/200/250,
|
||||
and DAC960PJ/PG/PU/PD/PL. See below for a complete controller list as well as
|
||||
minimum firmware version requirements. For simplicity, in most places this
|
||||
documentation refers to DAC960 generically rather than explicitly listing all
|
||||
the supported models.
|
||||
|
||||
Driver bug reports should be sent via electronic mail to "lnz@dandelion.com".
|
||||
Please include with the bug report the complete configuration messages reported
|
||||
by the driver at startup, along with any subsequent system messages relevant to
|
||||
the controller's operation, and a detailed description of your system's
|
||||
hardware configuration. Driver bugs are actually quite rare; if you encounter
|
||||
problems with disks being marked offline, for example, please contact Mylex
|
||||
Technical Support as the problem is related to the hardware configuration
|
||||
rather than the Linux driver.
|
||||
|
||||
Please consult the RAID controller documentation for detailed information
|
||||
regarding installation and configuration of the controllers. This document
|
||||
primarily provides information specific to the Linux support.
|
||||
|
||||
|
||||
DRIVER FEATURES
|
||||
|
||||
The DAC960 RAID controllers are supported solely as high performance RAID
|
||||
controllers, not as interfaces to arbitrary SCSI devices. The Linux DAC960
|
||||
driver operates at the block device level, the same level as the SCSI and IDE
|
||||
drivers. Unlike other RAID controllers currently supported on Linux, the
|
||||
DAC960 driver is not dependent on the SCSI subsystem, and hence avoids all the
|
||||
complexity and unnecessary code that would be associated with an implementation
|
||||
as a SCSI driver. The DAC960 driver is designed for as high a performance as
|
||||
possible with no compromises or extra code for compatibility with lower
|
||||
performance devices. The DAC960 driver includes extensive error logging and
|
||||
online configuration management capabilities. Except for initial configuration
|
||||
of the controller and adding new disk drives, most everything can be handled
|
||||
from Linux while the system is operational.
|
||||
|
||||
The DAC960 driver is architected to support up to 8 controllers per system.
|
||||
Each DAC960 parallel SCSI controller can support up to 15 disk drives per
|
||||
channel, for a maximum of 60 drives on a four channel controller; the fibre
|
||||
channel eXtremeRAID 3000 controller supports up to 125 disk drives per loop for
|
||||
a total of 250 drives. The drives installed on a controller are divided into
|
||||
one or more "Drive Groups", and then each Drive Group is subdivided further
|
||||
into 1 to 32 "Logical Drives". Each Logical Drive has a specific RAID Level
|
||||
and caching policy associated with it, and it appears to Linux as a single
|
||||
block device. Logical Drives are further subdivided into up to 7 partitions
|
||||
through the normal Linux and PC disk partitioning schemes. Logical Drives are
|
||||
also known as "System Drives", and Drive Groups are also called "Packs". Both
|
||||
terms are in use in the Mylex documentation; I have chosen to standardize on
|
||||
the more generic "Logical Drive" and "Drive Group".
|
||||
|
||||
DAC960 RAID disk devices are named in the style of the obsolete Device File
|
||||
System (DEVFS). The device corresponding to Logical Drive D on Controller C
|
||||
is referred to as /dev/rd/cCdD, and the partitions are called /dev/rd/cCdDp1
|
||||
through /dev/rd/cCdDp7. For example, partition 3 of Logical Drive 5 on
|
||||
Controller 2 is referred to as /dev/rd/c2d5p3. Note that unlike with SCSI
|
||||
disks the device names will not change in the event of a disk drive failure.
|
||||
The DAC960 driver is assigned major numbers 48 - 55 with one major number per
|
||||
controller. The 8 bits of minor number are divided into 5 bits for the Logical
|
||||
Drive and 3 bits for the partition.
|
||||
|
||||
|
||||
SUPPORTED DAC960/AcceleRAID/eXtremeRAID PCI RAID CONTROLLERS
|
||||
|
||||
The following list comprises the supported DAC960, AcceleRAID, and eXtremeRAID
|
||||
PCI RAID Controllers as of the date of this document. It is recommended that
|
||||
anyone purchasing a Mylex PCI RAID Controller not in the following table
|
||||
contact the author beforehand to verify that it is or will be supported.
|
||||
|
||||
eXtremeRAID 3000
|
||||
1 Wide Ultra-2/LVD SCSI channel
|
||||
2 External Fibre FC-AL channels
|
||||
233MHz StrongARM SA 110 Processor
|
||||
64 Bit 33MHz PCI (backward compatible with 32 Bit PCI slots)
|
||||
32MB/64MB ECC SDRAM Memory
|
||||
|
||||
eXtremeRAID 2000
|
||||
4 Wide Ultra-160 LVD SCSI channels
|
||||
233MHz StrongARM SA 110 Processor
|
||||
64 Bit 33MHz PCI (backward compatible with 32 Bit PCI slots)
|
||||
32MB/64MB ECC SDRAM Memory
|
||||
|
||||
AcceleRAID 352
|
||||
2 Wide Ultra-160 LVD SCSI channels
|
||||
100MHz Intel i960RN RISC Processor
|
||||
64 Bit 33MHz PCI (backward compatible with 32 Bit PCI slots)
|
||||
32MB/64MB ECC SDRAM Memory
|
||||
|
||||
AcceleRAID 170
|
||||
1 Wide Ultra-160 LVD SCSI channel
|
||||
100MHz Intel i960RM RISC Processor
|
||||
16MB/32MB/64MB ECC SDRAM Memory
|
||||
|
||||
AcceleRAID 160 (AcceleRAID 170LP)
|
||||
1 Wide Ultra-160 LVD SCSI channel
|
||||
100MHz Intel i960RS RISC Processor
|
||||
Built in 16M ECC SDRAM Memory
|
||||
PCI Low Profile Form Factor - fit for 2U height
|
||||
|
||||
eXtremeRAID 1100 (DAC1164P)
|
||||
3 Wide Ultra-2/LVD SCSI channels
|
||||
233MHz StrongARM SA 110 Processor
|
||||
64 Bit 33MHz PCI (backward compatible with 32 Bit PCI slots)
|
||||
16MB/32MB/64MB Parity SDRAM Memory with Battery Backup
|
||||
|
||||
AcceleRAID 250 (DAC960PTL1)
|
||||
Uses onboard Symbios SCSI chips on certain motherboards
|
||||
Also includes one onboard Wide Ultra-2/LVD SCSI Channel
|
||||
66MHz Intel i960RD RISC Processor
|
||||
4MB/8MB/16MB/32MB/64MB/128MB ECC EDO Memory
|
||||
|
||||
AcceleRAID 200 (DAC960PTL0)
|
||||
Uses onboard Symbios SCSI chips on certain motherboards
|
||||
Includes no onboard SCSI Channels
|
||||
66MHz Intel i960RD RISC Processor
|
||||
4MB/8MB/16MB/32MB/64MB/128MB ECC EDO Memory
|
||||
|
||||
AcceleRAID 150 (DAC960PRL)
|
||||
Uses onboard Symbios SCSI chips on certain motherboards
|
||||
Also includes one onboard Wide Ultra-2/LVD SCSI Channel
|
||||
33MHz Intel i960RP RISC Processor
|
||||
4MB Parity EDO Memory
|
||||
|
||||
DAC960PJ 1/2/3 Wide Ultra SCSI-3 Channels
|
||||
66MHz Intel i960RD RISC Processor
|
||||
4MB/8MB/16MB/32MB/64MB/128MB ECC EDO Memory
|
||||
|
||||
DAC960PG 1/2/3 Wide Ultra SCSI-3 Channels
|
||||
33MHz Intel i960RP RISC Processor
|
||||
4MB/8MB ECC EDO Memory
|
||||
|
||||
DAC960PU 1/2/3 Wide Ultra SCSI-3 Channels
|
||||
Intel i960CF RISC Processor
|
||||
4MB/8MB EDRAM or 2MB/4MB/8MB/16MB/32MB DRAM Memory
|
||||
|
||||
DAC960PD 1/2/3 Wide Fast SCSI-2 Channels
|
||||
Intel i960CF RISC Processor
|
||||
4MB/8MB EDRAM or 2MB/4MB/8MB/16MB/32MB DRAM Memory
|
||||
|
||||
DAC960PL 1/2/3 Wide Fast SCSI-2 Channels
|
||||
Intel i960 RISC Processor
|
||||
2MB/4MB/8MB/16MB/32MB DRAM Memory
|
||||
|
||||
DAC960P 1/2/3 Wide Fast SCSI-2 Channels
|
||||
Intel i960 RISC Processor
|
||||
2MB/4MB/8MB/16MB/32MB DRAM Memory
|
||||
|
||||
For the eXtremeRAID 2000/3000 and AcceleRAID 352/170/160, firmware version
|
||||
6.00-01 or above is required.
|
||||
|
||||
For the eXtremeRAID 1100, firmware version 5.06-0-52 or above is required.
|
||||
|
||||
For the AcceleRAID 250, 200, and 150, firmware version 4.06-0-57 or above is
|
||||
required.
|
||||
|
||||
For the DAC960PJ and DAC960PG, firmware version 4.06-0-00 or above is required.
|
||||
|
||||
For the DAC960PU, DAC960PD, DAC960PL, and DAC960P, either firmware version
|
||||
3.51-0-04 or above is required (for dual Flash ROM controllers), or firmware
|
||||
version 2.73-0-00 or above is required (for single Flash ROM controllers)
|
||||
|
||||
Please note that not all SCSI disk drives are suitable for use with DAC960
|
||||
controllers, and only particular firmware versions of any given model may
|
||||
actually function correctly. Similarly, not all motherboards have a BIOS that
|
||||
properly initializes the AcceleRAID 250, AcceleRAID 200, AcceleRAID 150,
|
||||
DAC960PJ, and DAC960PG because the Intel i960RD/RP is a multi-function device.
|
||||
If in doubt, contact Mylex RAID Technical Support (mylexsup@us.ibm.com) to
|
||||
verify compatibility. Mylex makes available a hard disk compatibility list at
|
||||
http://www.mylex.com/support/hdcomp/hd-lists.html.
|
||||
|
||||
|
||||
DRIVER INSTALLATION
|
||||
|
||||
This distribution was prepared for Linux kernel version 2.2.19 or 2.4.12.
|
||||
|
||||
To install the DAC960 RAID driver, you may use the following commands,
|
||||
replacing "/usr/src" with wherever you keep your Linux kernel source tree:
|
||||
|
||||
cd /usr/src
|
||||
tar -xvzf DAC960-2.2.11.tar.gz (or DAC960-2.4.11.tar.gz)
|
||||
mv README.DAC960 linux/Documentation
|
||||
mv DAC960.[ch] linux/drivers/block
|
||||
patch -p0 < DAC960.patch (if DAC960.patch is included)
|
||||
cd linux
|
||||
make config
|
||||
make bzImage (or zImage)
|
||||
|
||||
Then install "arch/i386/boot/bzImage" or "arch/i386/boot/zImage" as your
|
||||
standard kernel, run lilo if appropriate, and reboot.
|
||||
|
||||
To create the necessary devices in /dev, the "make_rd" script included in
|
||||
"DAC960-Utilities.tar.gz" from http://www.dandelion.com/Linux/ may be used.
|
||||
LILO 21 and FDISK v2.9 include DAC960 support; also included in this archive
|
||||
are patches to LILO 20 and FDISK v2.8 that add DAC960 support, along with
|
||||
statically linked executables of LILO and FDISK. This modified version of LILO
|
||||
will allow booting from a DAC960 controller and/or mounting the root file
|
||||
system from a DAC960.
|
||||
|
||||
Red Hat Linux 6.0 and SuSE Linux 6.1 include support for Mylex PCI RAID
|
||||
controllers. Installing directly onto a DAC960 may be problematic from other
|
||||
Linux distributions until their installation utilities are updated.
|
||||
|
||||
|
||||
INSTALLATION NOTES
|
||||
|
||||
Before installing Linux or adding DAC960 logical drives to an existing Linux
|
||||
system, the controller must first be configured to provide one or more logical
|
||||
drives using the BIOS Configuration Utility or DACCF. Please note that since
|
||||
there are only at most 6 usable partitions on each logical drive, systems
|
||||
requiring more partitions should subdivide a drive group into multiple logical
|
||||
drives, each of which can have up to 6 usable partitions. Also, note that with
|
||||
large disk arrays it is advisable to enable the 8GB BIOS Geometry (255/63)
|
||||
rather than accepting the default 2GB BIOS Geometry (128/32); failing to so do
|
||||
will cause the logical drive geometry to have more than 65535 cylinders which
|
||||
will make it impossible for FDISK to be used properly. The 8GB BIOS Geometry
|
||||
can be enabled by configuring the DAC960 BIOS, which is accessible via Alt-M
|
||||
during the BIOS initialization sequence.
|
||||
|
||||
For maximum performance and the most efficient E2FSCK performance, it is
|
||||
recommended that EXT2 file systems be built with a 4KB block size and 16 block
|
||||
stride to match the DAC960 controller's 64KB default stripe size. The command
|
||||
"mke2fs -b 4096 -R stride=16 <device>" is appropriate. Unless there will be a
|
||||
large number of small files on the file systems, it is also beneficial to add
|
||||
the "-i 16384" option to increase the bytes per inode parameter thereby
|
||||
reducing the file system metadata. Finally, on systems that will only be run
|
||||
with Linux 2.2 or later kernels it is beneficial to enable sparse superblocks
|
||||
with the "-s 1" option.
|
||||
|
||||
|
||||
DAC960 ANNOUNCEMENTS MAILING LIST
|
||||
|
||||
The DAC960 Announcements Mailing List provides a forum for informing Linux
|
||||
users of new driver releases and other announcements regarding Linux support
|
||||
for DAC960 PCI RAID Controllers. To join the mailing list, send a message to
|
||||
"dac960-announce-request@dandelion.com" with the line "subscribe" in the
|
||||
message body.
|
||||
|
||||
|
||||
CONTROLLER CONFIGURATION AND STATUS MONITORING
|
||||
|
||||
The DAC960 RAID controllers running firmware 4.06 or above include a Background
|
||||
Initialization facility so that system downtime is minimized both for initial
|
||||
installation and subsequent configuration of additional storage. The BIOS
|
||||
Configuration Utility (accessible via Alt-R during the BIOS initialization
|
||||
sequence) is used to quickly configure the controller, and then the logical
|
||||
drives that have been created are available for immediate use even while they
|
||||
are still being initialized by the controller. The primary need for online
|
||||
configuration and status monitoring is then to avoid system downtime when disk
|
||||
drives fail and must be replaced. Mylex's online monitoring and configuration
|
||||
utilities are being ported to Linux and will become available at some point in
|
||||
the future. Note that with a SAF-TE (SCSI Accessed Fault-Tolerant Enclosure)
|
||||
enclosure, the controller is able to rebuild failed drives automatically as
|
||||
soon as a drive replacement is made available.
|
||||
|
||||
The primary interfaces for controller configuration and status monitoring are
|
||||
special files created in the /proc/rd/... hierarchy along with the normal
|
||||
system console logging mechanism. Whenever the system is operating, the DAC960
|
||||
driver queries each controller for status information every 10 seconds, and
|
||||
checks for additional conditions every 60 seconds. The initial status of each
|
||||
controller is always available for controller N in /proc/rd/cN/initial_status,
|
||||
and the current status as of the last status monitoring query is available in
|
||||
/proc/rd/cN/current_status. In addition, status changes are also logged by the
|
||||
driver to the system console and will appear in the log files maintained by
|
||||
syslog. The progress of asynchronous rebuild or consistency check operations
|
||||
is also available in /proc/rd/cN/current_status, and progress messages are
|
||||
logged to the system console at most every 60 seconds.
|
||||
|
||||
Starting with the 2.2.3/2.0.3 versions of the driver, the status information
|
||||
available in /proc/rd/cN/initial_status and /proc/rd/cN/current_status has been
|
||||
augmented to include the vendor, model, revision, and serial number (if
|
||||
available) for each physical device found connected to the controller:
|
||||
|
||||
***** DAC960 RAID Driver Version 2.2.3 of 19 August 1999 *****
|
||||
Copyright 1998-1999 by Leonard N. Zubkoff <lnz@dandelion.com>
|
||||
Configuring Mylex DAC960PRL PCI RAID Controller
|
||||
Firmware Version: 4.07-0-07, Channels: 1, Memory Size: 16MB
|
||||
PCI Bus: 1, Device: 4, Function: 1, I/O Address: Unassigned
|
||||
PCI Address: 0xFE300000 mapped at 0xA0800000, IRQ Channel: 21
|
||||
Controller Queue Depth: 128, Maximum Blocks per Command: 128
|
||||
Driver Queue Depth: 127, Maximum Scatter/Gather Segments: 33
|
||||
Stripe Size: 64KB, Segment Size: 8KB, BIOS Geometry: 255/63
|
||||
SAF-TE Enclosure Management Enabled
|
||||
Physical Devices:
|
||||
0:0 Vendor: IBM Model: DRVS09D Revision: 0270
|
||||
Serial Number: 68016775HA
|
||||
Disk Status: Online, 17928192 blocks
|
||||
0:1 Vendor: IBM Model: DRVS09D Revision: 0270
|
||||
Serial Number: 68004E53HA
|
||||
Disk Status: Online, 17928192 blocks
|
||||
0:2 Vendor: IBM Model: DRVS09D Revision: 0270
|
||||
Serial Number: 13013935HA
|
||||
Disk Status: Online, 17928192 blocks
|
||||
0:3 Vendor: IBM Model: DRVS09D Revision: 0270
|
||||
Serial Number: 13016897HA
|
||||
Disk Status: Online, 17928192 blocks
|
||||
0:4 Vendor: IBM Model: DRVS09D Revision: 0270
|
||||
Serial Number: 68019905HA
|
||||
Disk Status: Online, 17928192 blocks
|
||||
0:5 Vendor: IBM Model: DRVS09D Revision: 0270
|
||||
Serial Number: 68012753HA
|
||||
Disk Status: Online, 17928192 blocks
|
||||
0:6 Vendor: ESG-SHV Model: SCA HSBP M6 Revision: 0.61
|
||||
Logical Drives:
|
||||
/dev/rd/c0d0: RAID-5, Online, 89640960 blocks, Write Thru
|
||||
No Rebuild or Consistency Check in Progress
|
||||
|
||||
To simplify the monitoring process for custom software, the special file
|
||||
/proc/rd/status returns "OK" when all DAC960 controllers in the system are
|
||||
operating normally and no failures have occurred, or "ALERT" if any logical
|
||||
drives are offline or critical or any non-standby physical drives are dead.
|
||||
|
||||
Configuration commands for controller N are available via the special file
|
||||
/proc/rd/cN/user_command. A human readable command can be written to this
|
||||
special file to initiate a configuration operation, and the results of the
|
||||
operation can then be read back from the special file in addition to being
|
||||
logged to the system console. The shell command sequence
|
||||
|
||||
echo "<configuration-command>" > /proc/rd/c0/user_command
|
||||
cat /proc/rd/c0/user_command
|
||||
|
||||
is typically used to execute configuration commands. The configuration
|
||||
commands are:
|
||||
|
||||
flush-cache
|
||||
|
||||
The "flush-cache" command flushes the controller's cache. The system
|
||||
automatically flushes the cache at shutdown or if the driver module is
|
||||
unloaded, so this command is only needed to be certain a write back cache
|
||||
is flushed to disk before the system is powered off by a command to a UPS.
|
||||
Note that the flush-cache command also stops an asynchronous rebuild or
|
||||
consistency check, so it should not be used except when the system is being
|
||||
halted.
|
||||
|
||||
kill <channel>:<target-id>
|
||||
|
||||
The "kill" command marks the physical drive <channel>:<target-id> as DEAD.
|
||||
This command is provided primarily for testing, and should not be used
|
||||
during normal system operation.
|
||||
|
||||
make-online <channel>:<target-id>
|
||||
|
||||
The "make-online" command changes the physical drive <channel>:<target-id>
|
||||
from status DEAD to status ONLINE. In cases where multiple physical drives
|
||||
have been killed simultaneously, this command may be used to bring all but
|
||||
one of them back online, after which a rebuild to the final drive is
|
||||
necessary.
|
||||
|
||||
Warning: make-online should only be used on a dead physical drive that is
|
||||
an active part of a drive group, never on a standby drive. The command
|
||||
should never be used on a dead drive that is part of a critical logical
|
||||
drive; rebuild should be used if only a single drive is dead.
|
||||
|
||||
make-standby <channel>:<target-id>
|
||||
|
||||
The "make-standby" command changes physical drive <channel>:<target-id>
|
||||
from status DEAD to status STANDBY. It should only be used in cases where
|
||||
a dead drive was replaced after an automatic rebuild was performed onto a
|
||||
standby drive. It cannot be used to add a standby drive to the controller
|
||||
configuration if one was not created initially; the BIOS Configuration
|
||||
Utility must be used for that currently.
|
||||
|
||||
rebuild <channel>:<target-id>
|
||||
|
||||
The "rebuild" command initiates an asynchronous rebuild onto physical drive
|
||||
<channel>:<target-id>. It should only be used when a dead drive has been
|
||||
replaced.
|
||||
|
||||
check-consistency <logical-drive-number>
|
||||
|
||||
The "check-consistency" command initiates an asynchronous consistency check
|
||||
of <logical-drive-number> with automatic restoration. It can be used
|
||||
whenever it is desired to verify the consistency of the redundancy
|
||||
information.
|
||||
|
||||
cancel-rebuild
|
||||
cancel-consistency-check
|
||||
|
||||
The "cancel-rebuild" and "cancel-consistency-check" commands cancel any
|
||||
rebuild or consistency check operations previously initiated.
|
||||
|
||||
|
||||
EXAMPLE I - DRIVE FAILURE WITHOUT A STANDBY DRIVE
|
||||
|
||||
The following annotated logs demonstrate the controller configuration and and
|
||||
online status monitoring capabilities of the Linux DAC960 Driver. The test
|
||||
configuration comprises 6 1GB Quantum Atlas I disk drives on two channels of a
|
||||
DAC960PJ controller. The physical drives are configured into a single drive
|
||||
group without a standby drive, and the drive group has been configured into two
|
||||
logical drives, one RAID-5 and one RAID-6. Note that these logs are from an
|
||||
earlier version of the driver and the messages have changed somewhat with newer
|
||||
releases, but the functionality remains similar. First, here is the current
|
||||
status of the RAID configuration:
|
||||
|
||||
gwynedd:/u/lnz# cat /proc/rd/c0/current_status
|
||||
***** DAC960 RAID Driver Version 2.0.0 of 23 March 1999 *****
|
||||
Copyright 1998-1999 by Leonard N. Zubkoff <lnz@dandelion.com>
|
||||
Configuring Mylex DAC960PJ PCI RAID Controller
|
||||
Firmware Version: 4.06-0-08, Channels: 3, Memory Size: 8MB
|
||||
PCI Bus: 0, Device: 19, Function: 1, I/O Address: Unassigned
|
||||
PCI Address: 0xFD4FC000 mapped at 0x8807000, IRQ Channel: 9
|
||||
Controller Queue Depth: 128, Maximum Blocks per Command: 128
|
||||
Driver Queue Depth: 127, Maximum Scatter/Gather Segments: 33
|
||||
Stripe Size: 64KB, Segment Size: 8KB, BIOS Geometry: 255/63
|
||||
Physical Devices:
|
||||
0:1 - Disk: Online, 2201600 blocks
|
||||
0:2 - Disk: Online, 2201600 blocks
|
||||
0:3 - Disk: Online, 2201600 blocks
|
||||
1:1 - Disk: Online, 2201600 blocks
|
||||
1:2 - Disk: Online, 2201600 blocks
|
||||
1:3 - Disk: Online, 2201600 blocks
|
||||
Logical Drives:
|
||||
/dev/rd/c0d0: RAID-5, Online, 5498880 blocks, Write Thru
|
||||
/dev/rd/c0d1: RAID-6, Online, 3305472 blocks, Write Thru
|
||||
No Rebuild or Consistency Check in Progress
|
||||
|
||||
gwynedd:/u/lnz# cat /proc/rd/status
|
||||
OK
|
||||
|
||||
The above messages indicate that everything is healthy, and /proc/rd/status
|
||||
returns "OK" indicating that there are no problems with any DAC960 controller
|
||||
in the system. For demonstration purposes, while I/O is active Physical Drive
|
||||
1:1 is now disconnected, simulating a drive failure. The failure is noted by
|
||||
the driver within 10 seconds of the controller's having detected it, and the
|
||||
driver logs the following console status messages indicating that Logical
|
||||
Drives 0 and 1 are now CRITICAL as a result of Physical Drive 1:1 being DEAD:
|
||||
|
||||
DAC960#0: Physical Drive 1:2 Error Log: Sense Key = 6, ASC = 29, ASCQ = 02
|
||||
DAC960#0: Physical Drive 1:3 Error Log: Sense Key = 6, ASC = 29, ASCQ = 02
|
||||
DAC960#0: Physical Drive 1:1 killed because of timeout on SCSI command
|
||||
DAC960#0: Physical Drive 1:1 is now DEAD
|
||||
DAC960#0: Logical Drive 0 (/dev/rd/c0d0) is now CRITICAL
|
||||
DAC960#0: Logical Drive 1 (/dev/rd/c0d1) is now CRITICAL
|
||||
|
||||
The Sense Keys logged here are just Check Condition / Unit Attention conditions
|
||||
arising from a SCSI bus reset that is forced by the controller during its error
|
||||
recovery procedures. Concurrently with the above, the driver status available
|
||||
from /proc/rd also reflects the drive failure. The status message in
|
||||
/proc/rd/status has changed from "OK" to "ALERT":
|
||||
|
||||
gwynedd:/u/lnz# cat /proc/rd/status
|
||||
ALERT
|
||||
|
||||
and /proc/rd/c0/current_status has been updated:
|
||||
|
||||
gwynedd:/u/lnz# cat /proc/rd/c0/current_status
|
||||
...
|
||||
Physical Devices:
|
||||
0:1 - Disk: Online, 2201600 blocks
|
||||
0:2 - Disk: Online, 2201600 blocks
|
||||
0:3 - Disk: Online, 2201600 blocks
|
||||
1:1 - Disk: Dead, 2201600 blocks
|
||||
1:2 - Disk: Online, 2201600 blocks
|
||||
1:3 - Disk: Online, 2201600 blocks
|
||||
Logical Drives:
|
||||
/dev/rd/c0d0: RAID-5, Critical, 5498880 blocks, Write Thru
|
||||
/dev/rd/c0d1: RAID-6, Critical, 3305472 blocks, Write Thru
|
||||
No Rebuild or Consistency Check in Progress
|
||||
|
||||
Since there are no standby drives configured, the system can continue to access
|
||||
the logical drives in a performance degraded mode until the failed drive is
|
||||
replaced and a rebuild operation completed to restore the redundancy of the
|
||||
logical drives. Once Physical Drive 1:1 is replaced with a properly
|
||||
functioning drive, or if the physical drive was killed without having failed
|
||||
(e.g., due to electrical problems on the SCSI bus), the user can instruct the
|
||||
controller to initiate a rebuild operation onto the newly replaced drive:
|
||||
|
||||
gwynedd:/u/lnz# echo "rebuild 1:1" > /proc/rd/c0/user_command
|
||||
gwynedd:/u/lnz# cat /proc/rd/c0/user_command
|
||||
Rebuild of Physical Drive 1:1 Initiated
|
||||
|
||||
The echo command instructs the controller to initiate an asynchronous rebuild
|
||||
operation onto Physical Drive 1:1, and the status message that results from the
|
||||
operation is then available for reading from /proc/rd/c0/user_command, as well
|
||||
as being logged to the console by the driver.
|
||||
|
||||
Within 10 seconds of this command the driver logs the initiation of the
|
||||
asynchronous rebuild operation:
|
||||
|
||||
DAC960#0: Rebuild of Physical Drive 1:1 Initiated
|
||||
DAC960#0: Physical Drive 1:1 Error Log: Sense Key = 6, ASC = 29, ASCQ = 01
|
||||
DAC960#0: Physical Drive 1:1 is now WRITE-ONLY
|
||||
DAC960#0: Rebuild in Progress: Logical Drive 0 (/dev/rd/c0d0) 1% completed
|
||||
|
||||
and /proc/rd/c0/current_status is updated:
|
||||
|
||||
gwynedd:/u/lnz# cat /proc/rd/c0/current_status
|
||||
...
|
||||
Physical Devices:
|
||||
0:1 - Disk: Online, 2201600 blocks
|
||||
0:2 - Disk: Online, 2201600 blocks
|
||||
0:3 - Disk: Online, 2201600 blocks
|
||||
1:1 - Disk: Write-Only, 2201600 blocks
|
||||
1:2 - Disk: Online, 2201600 blocks
|
||||
1:3 - Disk: Online, 2201600 blocks
|
||||
Logical Drives:
|
||||
/dev/rd/c0d0: RAID-5, Critical, 5498880 blocks, Write Thru
|
||||
/dev/rd/c0d1: RAID-6, Critical, 3305472 blocks, Write Thru
|
||||
Rebuild in Progress: Logical Drive 0 (/dev/rd/c0d0) 6% completed
|
||||
|
||||
As the rebuild progresses, the current status in /proc/rd/c0/current_status is
|
||||
updated every 10 seconds:
|
||||
|
||||
gwynedd:/u/lnz# cat /proc/rd/c0/current_status
|
||||
...
|
||||
Physical Devices:
|
||||
0:1 - Disk: Online, 2201600 blocks
|
||||
0:2 - Disk: Online, 2201600 blocks
|
||||
0:3 - Disk: Online, 2201600 blocks
|
||||
1:1 - Disk: Write-Only, 2201600 blocks
|
||||
1:2 - Disk: Online, 2201600 blocks
|
||||
1:3 - Disk: Online, 2201600 blocks
|
||||
Logical Drives:
|
||||
/dev/rd/c0d0: RAID-5, Critical, 5498880 blocks, Write Thru
|
||||
/dev/rd/c0d1: RAID-6, Critical, 3305472 blocks, Write Thru
|
||||
Rebuild in Progress: Logical Drive 0 (/dev/rd/c0d0) 15% completed
|
||||
|
||||
and every minute a progress message is logged to the console by the driver:
|
||||
|
||||
DAC960#0: Rebuild in Progress: Logical Drive 0 (/dev/rd/c0d0) 32% completed
|
||||
DAC960#0: Rebuild in Progress: Logical Drive 0 (/dev/rd/c0d0) 63% completed
|
||||
DAC960#0: Rebuild in Progress: Logical Drive 0 (/dev/rd/c0d0) 94% completed
|
||||
DAC960#0: Rebuild in Progress: Logical Drive 1 (/dev/rd/c0d1) 94% completed
|
||||
|
||||
Finally, the rebuild completes successfully. The driver logs the status of the
|
||||
logical and physical drives and the rebuild completion:
|
||||
|
||||
DAC960#0: Rebuild Completed Successfully
|
||||
DAC960#0: Physical Drive 1:1 is now ONLINE
|
||||
DAC960#0: Logical Drive 0 (/dev/rd/c0d0) is now ONLINE
|
||||
DAC960#0: Logical Drive 1 (/dev/rd/c0d1) is now ONLINE
|
||||
|
||||
/proc/rd/c0/current_status is updated:
|
||||
|
||||
gwynedd:/u/lnz# cat /proc/rd/c0/current_status
|
||||
...
|
||||
Physical Devices:
|
||||
0:1 - Disk: Online, 2201600 blocks
|
||||
0:2 - Disk: Online, 2201600 blocks
|
||||
0:3 - Disk: Online, 2201600 blocks
|
||||
1:1 - Disk: Online, 2201600 blocks
|
||||
1:2 - Disk: Online, 2201600 blocks
|
||||
1:3 - Disk: Online, 2201600 blocks
|
||||
Logical Drives:
|
||||
/dev/rd/c0d0: RAID-5, Online, 5498880 blocks, Write Thru
|
||||
/dev/rd/c0d1: RAID-6, Online, 3305472 blocks, Write Thru
|
||||
Rebuild Completed Successfully
|
||||
|
||||
and /proc/rd/status indicates that everything is healthy once again:
|
||||
|
||||
gwynedd:/u/lnz# cat /proc/rd/status
|
||||
OK
|
||||
|
||||
|
||||
EXAMPLE II - DRIVE FAILURE WITH A STANDBY DRIVE
|
||||
|
||||
The following annotated logs demonstrate the controller configuration and and
|
||||
online status monitoring capabilities of the Linux DAC960 Driver. The test
|
||||
configuration comprises 6 1GB Quantum Atlas I disk drives on two channels of a
|
||||
DAC960PJ controller. The physical drives are configured into a single drive
|
||||
group with a standby drive, and the drive group has been configured into two
|
||||
logical drives, one RAID-5 and one RAID-6. Note that these logs are from an
|
||||
earlier version of the driver and the messages have changed somewhat with newer
|
||||
releases, but the functionality remains similar. First, here is the current
|
||||
status of the RAID configuration:
|
||||
|
||||
gwynedd:/u/lnz# cat /proc/rd/c0/current_status
|
||||
***** DAC960 RAID Driver Version 2.0.0 of 23 March 1999 *****
|
||||
Copyright 1998-1999 by Leonard N. Zubkoff <lnz@dandelion.com>
|
||||
Configuring Mylex DAC960PJ PCI RAID Controller
|
||||
Firmware Version: 4.06-0-08, Channels: 3, Memory Size: 8MB
|
||||
PCI Bus: 0, Device: 19, Function: 1, I/O Address: Unassigned
|
||||
PCI Address: 0xFD4FC000 mapped at 0x8807000, IRQ Channel: 9
|
||||
Controller Queue Depth: 128, Maximum Blocks per Command: 128
|
||||
Driver Queue Depth: 127, Maximum Scatter/Gather Segments: 33
|
||||
Stripe Size: 64KB, Segment Size: 8KB, BIOS Geometry: 255/63
|
||||
Physical Devices:
|
||||
0:1 - Disk: Online, 2201600 blocks
|
||||
0:2 - Disk: Online, 2201600 blocks
|
||||
0:3 - Disk: Online, 2201600 blocks
|
||||
1:1 - Disk: Online, 2201600 blocks
|
||||
1:2 - Disk: Online, 2201600 blocks
|
||||
1:3 - Disk: Standby, 2201600 blocks
|
||||
Logical Drives:
|
||||
/dev/rd/c0d0: RAID-5, Online, 4399104 blocks, Write Thru
|
||||
/dev/rd/c0d1: RAID-6, Online, 2754560 blocks, Write Thru
|
||||
No Rebuild or Consistency Check in Progress
|
||||
|
||||
gwynedd:/u/lnz# cat /proc/rd/status
|
||||
OK
|
||||
|
||||
The above messages indicate that everything is healthy, and /proc/rd/status
|
||||
returns "OK" indicating that there are no problems with any DAC960 controller
|
||||
in the system. For demonstration purposes, while I/O is active Physical Drive
|
||||
1:2 is now disconnected, simulating a drive failure. The failure is noted by
|
||||
the driver within 10 seconds of the controller's having detected it, and the
|
||||
driver logs the following console status messages:
|
||||
|
||||
DAC960#0: Physical Drive 1:1 Error Log: Sense Key = 6, ASC = 29, ASCQ = 02
|
||||
DAC960#0: Physical Drive 1:3 Error Log: Sense Key = 6, ASC = 29, ASCQ = 02
|
||||
DAC960#0: Physical Drive 1:2 killed because of timeout on SCSI command
|
||||
DAC960#0: Physical Drive 1:2 is now DEAD
|
||||
DAC960#0: Physical Drive 1:2 killed because it was removed
|
||||
DAC960#0: Logical Drive 0 (/dev/rd/c0d0) is now CRITICAL
|
||||
DAC960#0: Logical Drive 1 (/dev/rd/c0d1) is now CRITICAL
|
||||
|
||||
Since a standby drive is configured, the controller automatically begins
|
||||
rebuilding onto the standby drive:
|
||||
|
||||
DAC960#0: Physical Drive 1:3 is now WRITE-ONLY
|
||||
DAC960#0: Rebuild in Progress: Logical Drive 0 (/dev/rd/c0d0) 4% completed
|
||||
|
||||
Concurrently with the above, the driver status available from /proc/rd also
|
||||
reflects the drive failure and automatic rebuild. The status message in
|
||||
/proc/rd/status has changed from "OK" to "ALERT":
|
||||
|
||||
gwynedd:/u/lnz# cat /proc/rd/status
|
||||
ALERT
|
||||
|
||||
and /proc/rd/c0/current_status has been updated:
|
||||
|
||||
gwynedd:/u/lnz# cat /proc/rd/c0/current_status
|
||||
...
|
||||
Physical Devices:
|
||||
0:1 - Disk: Online, 2201600 blocks
|
||||
0:2 - Disk: Online, 2201600 blocks
|
||||
0:3 - Disk: Online, 2201600 blocks
|
||||
1:1 - Disk: Online, 2201600 blocks
|
||||
1:2 - Disk: Dead, 2201600 blocks
|
||||
1:3 - Disk: Write-Only, 2201600 blocks
|
||||
Logical Drives:
|
||||
/dev/rd/c0d0: RAID-5, Critical, 4399104 blocks, Write Thru
|
||||
/dev/rd/c0d1: RAID-6, Critical, 2754560 blocks, Write Thru
|
||||
Rebuild in Progress: Logical Drive 0 (/dev/rd/c0d0) 4% completed
|
||||
|
||||
As the rebuild progresses, the current status in /proc/rd/c0/current_status is
|
||||
updated every 10 seconds:
|
||||
|
||||
gwynedd:/u/lnz# cat /proc/rd/c0/current_status
|
||||
...
|
||||
Physical Devices:
|
||||
0:1 - Disk: Online, 2201600 blocks
|
||||
0:2 - Disk: Online, 2201600 blocks
|
||||
0:3 - Disk: Online, 2201600 blocks
|
||||
1:1 - Disk: Online, 2201600 blocks
|
||||
1:2 - Disk: Dead, 2201600 blocks
|
||||
1:3 - Disk: Write-Only, 2201600 blocks
|
||||
Logical Drives:
|
||||
/dev/rd/c0d0: RAID-5, Critical, 4399104 blocks, Write Thru
|
||||
/dev/rd/c0d1: RAID-6, Critical, 2754560 blocks, Write Thru
|
||||
Rebuild in Progress: Logical Drive 0 (/dev/rd/c0d0) 40% completed
|
||||
|
||||
and every minute a progress message is logged on the console by the driver:
|
||||
|
||||
DAC960#0: Rebuild in Progress: Logical Drive 0 (/dev/rd/c0d0) 40% completed
|
||||
DAC960#0: Rebuild in Progress: Logical Drive 0 (/dev/rd/c0d0) 76% completed
|
||||
DAC960#0: Rebuild in Progress: Logical Drive 1 (/dev/rd/c0d1) 66% completed
|
||||
DAC960#0: Rebuild in Progress: Logical Drive 1 (/dev/rd/c0d1) 84% completed
|
||||
|
||||
Finally, the rebuild completes successfully. The driver logs the status of the
|
||||
logical and physical drives and the rebuild completion:
|
||||
|
||||
DAC960#0: Rebuild Completed Successfully
|
||||
DAC960#0: Physical Drive 1:3 is now ONLINE
|
||||
DAC960#0: Logical Drive 0 (/dev/rd/c0d0) is now ONLINE
|
||||
DAC960#0: Logical Drive 1 (/dev/rd/c0d1) is now ONLINE
|
||||
|
||||
/proc/rd/c0/current_status is updated:
|
||||
|
||||
***** DAC960 RAID Driver Version 2.0.0 of 23 March 1999 *****
|
||||
Copyright 1998-1999 by Leonard N. Zubkoff <lnz@dandelion.com>
|
||||
Configuring Mylex DAC960PJ PCI RAID Controller
|
||||
Firmware Version: 4.06-0-08, Channels: 3, Memory Size: 8MB
|
||||
PCI Bus: 0, Device: 19, Function: 1, I/O Address: Unassigned
|
||||
PCI Address: 0xFD4FC000 mapped at 0x8807000, IRQ Channel: 9
|
||||
Controller Queue Depth: 128, Maximum Blocks per Command: 128
|
||||
Driver Queue Depth: 127, Maximum Scatter/Gather Segments: 33
|
||||
Stripe Size: 64KB, Segment Size: 8KB, BIOS Geometry: 255/63
|
||||
Physical Devices:
|
||||
0:1 - Disk: Online, 2201600 blocks
|
||||
0:2 - Disk: Online, 2201600 blocks
|
||||
0:3 - Disk: Online, 2201600 blocks
|
||||
1:1 - Disk: Online, 2201600 blocks
|
||||
1:2 - Disk: Dead, 2201600 blocks
|
||||
1:3 - Disk: Online, 2201600 blocks
|
||||
Logical Drives:
|
||||
/dev/rd/c0d0: RAID-5, Online, 4399104 blocks, Write Thru
|
||||
/dev/rd/c0d1: RAID-6, Online, 2754560 blocks, Write Thru
|
||||
Rebuild Completed Successfully
|
||||
|
||||
and /proc/rd/status indicates that everything is healthy once again:
|
||||
|
||||
gwynedd:/u/lnz# cat /proc/rd/status
|
||||
OK
|
||||
|
||||
Note that the absence of a viable standby drive does not create an "ALERT"
|
||||
status. Once dead Physical Drive 1:2 has been replaced, the controller must be
|
||||
told that this has occurred and that the newly replaced drive should become the
|
||||
new standby drive:
|
||||
|
||||
gwynedd:/u/lnz# echo "make-standby 1:2" > /proc/rd/c0/user_command
|
||||
gwynedd:/u/lnz# cat /proc/rd/c0/user_command
|
||||
Make Standby of Physical Drive 1:2 Succeeded
|
||||
|
||||
The echo command instructs the controller to make Physical Drive 1:2 into a
|
||||
standby drive, and the status message that results from the operation is then
|
||||
available for reading from /proc/rd/c0/user_command, as well as being logged to
|
||||
the console by the driver. Within 60 seconds of this command the driver logs:
|
||||
|
||||
DAC960#0: Physical Drive 1:2 Error Log: Sense Key = 6, ASC = 29, ASCQ = 01
|
||||
DAC960#0: Physical Drive 1:2 is now STANDBY
|
||||
DAC960#0: Make Standby of Physical Drive 1:2 Succeeded
|
||||
|
||||
and /proc/rd/c0/current_status is updated:
|
||||
|
||||
gwynedd:/u/lnz# cat /proc/rd/c0/current_status
|
||||
...
|
||||
Physical Devices:
|
||||
0:1 - Disk: Online, 2201600 blocks
|
||||
0:2 - Disk: Online, 2201600 blocks
|
||||
0:3 - Disk: Online, 2201600 blocks
|
||||
1:1 - Disk: Online, 2201600 blocks
|
||||
1:2 - Disk: Standby, 2201600 blocks
|
||||
1:3 - Disk: Online, 2201600 blocks
|
||||
Logical Drives:
|
||||
/dev/rd/c0d0: RAID-5, Online, 4399104 blocks, Write Thru
|
||||
/dev/rd/c0d1: RAID-6, Online, 2754560 blocks, Write Thru
|
||||
Rebuild Completed Successfully
|
||||
8
Documentation/README.cycladesZ
Normal file
8
Documentation/README.cycladesZ
Normal file
@ -0,0 +1,8 @@
|
||||
|
||||
The Cyclades-Z must have firmware loaded onto the card before it will
|
||||
operate. This operation should be performed during system startup,
|
||||
|
||||
The firmware, loader program and the latest device driver code are
|
||||
available from Cyclades at
|
||||
ftp://ftp.cyclades.com/pub/cyclades/cyclades-z/linux/
|
||||
|
||||
88
Documentation/SAK.txt
Normal file
88
Documentation/SAK.txt
Normal file
@ -0,0 +1,88 @@
|
||||
Linux 2.4.2 Secure Attention Key (SAK) handling
|
||||
18 March 2001, Andrew Morton <akpm@osdl.org>
|
||||
|
||||
An operating system's Secure Attention Key is a security tool which is
|
||||
provided as protection against trojan password capturing programs. It
|
||||
is an undefeatable way of killing all programs which could be
|
||||
masquerading as login applications. Users need to be taught to enter
|
||||
this key sequence before they log in to the system.
|
||||
|
||||
From the PC keyboard, Linux has two similar but different ways of
|
||||
providing SAK. One is the ALT-SYSRQ-K sequence. You shouldn't use
|
||||
this sequence. It is only available if the kernel was compiled with
|
||||
sysrq support.
|
||||
|
||||
The proper way of generating a SAK is to define the key sequence using
|
||||
`loadkeys'. This will work whether or not sysrq support is compiled
|
||||
into the kernel.
|
||||
|
||||
SAK works correctly when the keyboard is in raw mode. This means that
|
||||
once defined, SAK will kill a running X server. If the system is in
|
||||
run level 5, the X server will restart. This is what you want to
|
||||
happen.
|
||||
|
||||
What key sequence should you use? Well, CTRL-ALT-DEL is used to reboot
|
||||
the machine. CTRL-ALT-BACKSPACE is magical to the X server. We'll
|
||||
choose CTRL-ALT-PAUSE.
|
||||
|
||||
In your rc.sysinit (or rc.local) file, add the command
|
||||
|
||||
echo "control alt keycode 101 = SAK" | /bin/loadkeys
|
||||
|
||||
And that's it! Only the superuser may reprogram the SAK key.
|
||||
|
||||
|
||||
NOTES
|
||||
=====
|
||||
|
||||
1: Linux SAK is said to be not a "true SAK" as is required by
|
||||
systems which implement C2 level security. This author does not
|
||||
know why.
|
||||
|
||||
|
||||
2: On the PC keyboard, SAK kills all applications which have
|
||||
/dev/console opened.
|
||||
|
||||
Unfortunately this includes a number of things which you don't
|
||||
actually want killed. This is because these applications are
|
||||
incorrectly holding /dev/console open. Be sure to complain to your
|
||||
Linux distributor about this!
|
||||
|
||||
You can identify processes which will be killed by SAK with the
|
||||
command
|
||||
|
||||
# ls -l /proc/[0-9]*/fd/* | grep console
|
||||
l-wx------ 1 root root 64 Mar 18 00:46 /proc/579/fd/0 -> /dev/console
|
||||
|
||||
Then:
|
||||
|
||||
# ps aux|grep 579
|
||||
root 579 0.0 0.1 1088 436 ? S 00:43 0:00 gpm -t ps/2
|
||||
|
||||
So `gpm' will be killed by SAK. This is a bug in gpm. It should
|
||||
be closing standard input. You can work around this by finding the
|
||||
initscript which launches gpm and changing it thusly:
|
||||
|
||||
Old:
|
||||
|
||||
daemon gpm
|
||||
|
||||
New:
|
||||
|
||||
daemon gpm < /dev/null
|
||||
|
||||
Vixie cron also seems to have this problem, and needs the same treatment.
|
||||
|
||||
Also, one prominent Linux distribution has the following three
|
||||
lines in its rc.sysinit and rc scripts:
|
||||
|
||||
exec 3<&0
|
||||
exec 4>&1
|
||||
exec 5>&2
|
||||
|
||||
These commands cause *all* daemons which are launched by the
|
||||
initscripts to have file descriptors 3, 4 and 5 attached to
|
||||
/dev/console. So SAK kills them all. A workaround is to simply
|
||||
delete these lines, but this may cause system management
|
||||
applications to malfunction - test everything well.
|
||||
|
||||
38
Documentation/SecurityBugs
Normal file
38
Documentation/SecurityBugs
Normal file
@ -0,0 +1,38 @@
|
||||
Linux kernel developers take security very seriously. As such, we'd
|
||||
like to know when a security bug is found so that it can be fixed and
|
||||
disclosed as quickly as possible. Please report security bugs to the
|
||||
Linux kernel security team.
|
||||
|
||||
1) Contact
|
||||
|
||||
The Linux kernel security team can be contacted by email at
|
||||
<security@kernel.org>. This is a private list of security officers
|
||||
who will help verify the bug report and develop and release a fix.
|
||||
It is possible that the security team will bring in extra help from
|
||||
area maintainers to understand and fix the security vulnerability.
|
||||
|
||||
As it is with any bug, the more information provided the easier it
|
||||
will be to diagnose and fix. Please review the procedure outlined in
|
||||
REPORTING-BUGS if you are unclear about what information is helpful.
|
||||
Any exploit code is very helpful and will not be released without
|
||||
consent from the reporter unless it has already been made public.
|
||||
|
||||
2) Disclosure
|
||||
|
||||
The goal of the Linux kernel security team is to work with the
|
||||
bug submitter to bug resolution as well as disclosure. We prefer
|
||||
to fully disclose the bug as soon as possible. It is reasonable to
|
||||
delay disclosure when the bug or the fix is not yet fully understood,
|
||||
the solution is not well-tested or for vendor coordination. However, we
|
||||
expect these delays to be short, measurable in days, not weeks or months.
|
||||
A disclosure date is negotiated by the security team working with the
|
||||
bug submitter as well as vendors. However, the kernel security team
|
||||
holds the final say when setting a disclosure date. The timeframe for
|
||||
disclosure is from immediate (esp. if it's already publically known)
|
||||
to a few weeks. As a basic default policy, we expect report date to
|
||||
disclosure date to be on the order of 7 days.
|
||||
|
||||
3) Non-disclosure agreements
|
||||
|
||||
The Linux kernel security team is not a formal body and therefore unable
|
||||
to enter any non-disclosure agreements.
|
||||
82
Documentation/SubmitChecklist
Normal file
82
Documentation/SubmitChecklist
Normal file
@ -0,0 +1,82 @@
|
||||
Linux Kernel patch sumbittal checklist
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Here are some basic things that developers should do if they want to see their
|
||||
kernel patch submissions accepted more quickly.
|
||||
|
||||
These are all above and beyond the documentation that is provided in
|
||||
Documentation/SubmittingPatches and elsewhere regarding submitting Linux
|
||||
kernel patches.
|
||||
|
||||
|
||||
|
||||
1: Builds cleanly with applicable or modified CONFIG options =y, =m, and
|
||||
=n. No gcc warnings/errors, no linker warnings/errors.
|
||||
|
||||
2: Passes allnoconfig, allmodconfig
|
||||
|
||||
3: Builds on multiple CPU architectures by using local cross-compile tools
|
||||
or something like PLM at OSDL.
|
||||
|
||||
4: ppc64 is a good architecture for cross-compilation checking because it
|
||||
tends to use `unsigned long' for 64-bit quantities.
|
||||
|
||||
5: Matches kernel coding style(!)
|
||||
|
||||
6: Any new or modified CONFIG options don't muck up the config menu.
|
||||
|
||||
7: All new Kconfig options have help text.
|
||||
|
||||
8: Has been carefully reviewed with respect to relevant Kconfig
|
||||
combinations. This is very hard to get right with testing -- brainpower
|
||||
pays off here.
|
||||
|
||||
9: Check cleanly with sparse.
|
||||
|
||||
10: Use 'make checkstack' and 'make namespacecheck' and fix any problems
|
||||
that they find. Note: checkstack does not point out problems explicitly,
|
||||
but any one function that uses more than 512 bytes on the stack is a
|
||||
candidate for change.
|
||||
|
||||
11: Include kernel-doc to document global kernel APIs. (Not required for
|
||||
static functions, but OK there also.) Use 'make htmldocs' or 'make
|
||||
mandocs' to check the kernel-doc and fix any issues.
|
||||
|
||||
12: Has been tested with CONFIG_PREEMPT, CONFIG_DEBUG_PREEMPT,
|
||||
CONFIG_DEBUG_SLAB, CONFIG_DEBUG_PAGEALLOC, CONFIG_DEBUG_MUTEXES,
|
||||
CONFIG_DEBUG_SPINLOCK, CONFIG_DEBUG_SPINLOCK_SLEEP all simultaneously
|
||||
enabled.
|
||||
|
||||
13: Has been build- and runtime tested with and without CONFIG_SMP and
|
||||
CONFIG_PREEMPT.
|
||||
|
||||
14: If the patch affects IO/Disk, etc: has been tested with and without
|
||||
CONFIG_LBD.
|
||||
|
||||
15: All codepaths have been exercised with all lockdep features enabled.
|
||||
|
||||
16: All new /proc entries are documented under Documentation/
|
||||
|
||||
17: All new kernel boot parameters are documented in
|
||||
Documentation/kernel-parameters.txt.
|
||||
|
||||
18: All new module parameters are documented with MODULE_PARM_DESC()
|
||||
|
||||
19: All new userspace interfaces are documented in Documentation/ABI/.
|
||||
See Documentation/ABI/README for more information.
|
||||
|
||||
20: Check that it all passes `make headers_check'.
|
||||
|
||||
21: Has been checked with injection of at least slab and page-allocation
|
||||
fauilures. See Documentation/fault-injection/.
|
||||
|
||||
If the new code is substantial, addition of subsystem-specific fault
|
||||
injection might be appropriate.
|
||||
|
||||
22: Newly-added code has been compiled with `gcc -W'. This will generate
|
||||
lots of noise, but is good for finding bugs like "warning: comparison
|
||||
between signed and unsigned".
|
||||
|
||||
23: Tested after it has been merged into the -mm patchset to make sure
|
||||
that it still works with all of the other queued patches and various
|
||||
changes in the VM, VFS, and other subsystems.
|
||||
148
Documentation/SubmittingDrivers
Normal file
148
Documentation/SubmittingDrivers
Normal file
@ -0,0 +1,148 @@
|
||||
Submitting Drivers For The Linux Kernel
|
||||
---------------------------------------
|
||||
|
||||
This document is intended to explain how to submit device drivers to the
|
||||
various kernel trees. Note that if you are interested in video card drivers
|
||||
you should probably talk to XFree86 (http://www.xfree86.org/) and/or X.Org
|
||||
(http://x.org/) instead.
|
||||
|
||||
Also read the Documentation/SubmittingPatches document.
|
||||
|
||||
|
||||
Allocating Device Numbers
|
||||
-------------------------
|
||||
|
||||
Major and minor numbers for block and character devices are allocated
|
||||
by the Linux assigned name and number authority (currently this is
|
||||
Torben Mathiasen). The site is http://www.lanana.org/. This
|
||||
also deals with allocating numbers for devices that are not going to
|
||||
be submitted to the mainstream kernel.
|
||||
See Documentation/devices.txt for more information on this.
|
||||
|
||||
If you don't use assigned numbers then when your device is submitted it will
|
||||
be given an assigned number even if that is different from values you may
|
||||
have shipped to customers before.
|
||||
|
||||
Who To Submit Drivers To
|
||||
------------------------
|
||||
|
||||
Linux 2.0:
|
||||
No new drivers are accepted for this kernel tree.
|
||||
|
||||
Linux 2.2:
|
||||
No new drivers are accepted for this kernel tree.
|
||||
|
||||
Linux 2.4:
|
||||
If the code area has a general maintainer then please submit it to
|
||||
the maintainer listed in MAINTAINERS in the kernel file. If the
|
||||
maintainer does not respond or you cannot find the appropriate
|
||||
maintainer then please contact Marcelo Tosatti
|
||||
<marcelo.tosatti@cyclades.com>.
|
||||
|
||||
Linux 2.6:
|
||||
The same rules apply as 2.4 except that you should follow linux-kernel
|
||||
to track changes in API's. The final contact point for Linux 2.6
|
||||
submissions is Andrew Morton <akpm@osdl.org>.
|
||||
|
||||
What Criteria Determine Acceptance
|
||||
----------------------------------
|
||||
|
||||
Licensing: The code must be released to us under the
|
||||
GNU General Public License. We don't insist on any kind
|
||||
of exclusive GPL licensing, and if you wish the driver
|
||||
to be useful to other communities such as BSD you may well
|
||||
wish to release under multiple licenses.
|
||||
See accepted licenses at include/linux/module.h
|
||||
|
||||
Copyright: The copyright owner must agree to use of GPL.
|
||||
It's best if the submitter and copyright owner
|
||||
are the same person/entity. If not, the name of
|
||||
the person/entity authorizing use of GPL should be
|
||||
listed in case it's necessary to verify the will of
|
||||
the copyright owner.
|
||||
|
||||
Interfaces: If your driver uses existing interfaces and behaves like
|
||||
other drivers in the same class it will be much more likely
|
||||
to be accepted than if it invents gratuitous new ones.
|
||||
If you need to implement a common API over Linux and NT
|
||||
drivers do it in userspace.
|
||||
|
||||
Code: Please use the Linux style of code formatting as documented
|
||||
in Documentation/CodingStyle. If you have sections of code
|
||||
that need to be in other formats, for example because they
|
||||
are shared with a windows driver kit and you want to
|
||||
maintain them just once separate them out nicely and note
|
||||
this fact.
|
||||
|
||||
Portability: Pointers are not always 32bits, not all computers are little
|
||||
endian, people do not all have floating point and you
|
||||
shouldn't use inline x86 assembler in your driver without
|
||||
careful thought. Pure x86 drivers generally are not popular.
|
||||
If you only have x86 hardware it is hard to test portability
|
||||
but it is easy to make sure the code can easily be made
|
||||
portable.
|
||||
|
||||
Clarity: It helps if anyone can see how to fix the driver. It helps
|
||||
you because you get patches not bug reports. If you submit a
|
||||
driver that intentionally obfuscates how the hardware works
|
||||
it will go in the bitbucket.
|
||||
|
||||
Control: In general if there is active maintainance of a driver by
|
||||
the author then patches will be redirected to them unless
|
||||
they are totally obvious and without need of checking.
|
||||
If you want to be the contact and update point for the
|
||||
driver it is a good idea to state this in the comments,
|
||||
and include an entry in MAINTAINERS for your driver.
|
||||
|
||||
What Criteria Do Not Determine Acceptance
|
||||
-----------------------------------------
|
||||
|
||||
Vendor: Being the hardware vendor and maintaining the driver is
|
||||
often a good thing. If there is a stable working driver from
|
||||
other people already in the tree don't expect 'we are the
|
||||
vendor' to get your driver chosen. Ideally work with the
|
||||
existing driver author to build a single perfect driver.
|
||||
|
||||
Author: It doesn't matter if a large Linux company wrote the driver,
|
||||
or you did. Nobody has any special access to the kernel
|
||||
tree. Anyone who tells you otherwise isn't telling the
|
||||
whole story.
|
||||
|
||||
|
||||
Resources
|
||||
---------
|
||||
|
||||
Linux kernel master tree:
|
||||
ftp.??.kernel.org:/pub/linux/kernel/...
|
||||
?? == your country code, such as "us", "uk", "fr", etc.
|
||||
|
||||
Linux kernel mailing list:
|
||||
linux-kernel@vger.kernel.org
|
||||
[mail majordomo@vger.kernel.org to subscribe]
|
||||
|
||||
Linux Device Drivers, Third Edition (covers 2.6.10):
|
||||
http://lwn.net/Kernel/LDD3/ (free version)
|
||||
|
||||
LWN.net:
|
||||
Weekly summary of kernel development activity - http://lwn.net/
|
||||
2.6 API changes:
|
||||
http://lwn.net/Articles/2.6-kernel-api/
|
||||
Porting drivers from prior kernels to 2.6:
|
||||
http://lwn.net/Articles/driver-porting/
|
||||
|
||||
KernelTrap:
|
||||
Occasional Linux kernel articles and developer interviews
|
||||
http://kerneltrap.org/
|
||||
|
||||
KernelNewbies:
|
||||
Documentation and assistance for new kernel programmers
|
||||
http://kernelnewbies.org/
|
||||
|
||||
Linux USB project:
|
||||
http://www.linux-usb.org/
|
||||
|
||||
How to NOT write kernel driver by Arjan van de Ven:
|
||||
http://www.fenrus.org/how-to-not-write-a-device-driver-paper.pdf
|
||||
|
||||
Kernel Janitor:
|
||||
http://janitor.kernelnewbies.org/
|
||||
509
Documentation/SubmittingPatches
Normal file
509
Documentation/SubmittingPatches
Normal file
@ -0,0 +1,509 @@
|
||||
|
||||
How to Get Your Change Into the Linux Kernel
|
||||
or
|
||||
Care And Operation Of Your Linus Torvalds
|
||||
|
||||
|
||||
|
||||
For a person or company who wishes to submit a change to the Linux
|
||||
kernel, the process can sometimes be daunting if you're not familiar
|
||||
with "the system." This text is a collection of suggestions which
|
||||
can greatly increase the chances of your change being accepted.
|
||||
|
||||
Read Documentation/SubmitChecklist for a list of items to check
|
||||
before submitting code. If you are submitting a driver, also read
|
||||
Documentation/SubmittingDrivers.
|
||||
|
||||
|
||||
|
||||
--------------------------------------------
|
||||
SECTION 1 - CREATING AND SENDING YOUR CHANGE
|
||||
--------------------------------------------
|
||||
|
||||
|
||||
|
||||
1) "diff -up"
|
||||
------------
|
||||
|
||||
Use "diff -up" or "diff -uprN" to create patches.
|
||||
|
||||
All changes to the Linux kernel occur in the form of patches, as
|
||||
generated by diff(1). When creating your patch, make sure to create it
|
||||
in "unified diff" format, as supplied by the '-u' argument to diff(1).
|
||||
Also, please use the '-p' argument which shows which C function each
|
||||
change is in - that makes the resultant diff a lot easier to read.
|
||||
Patches should be based in the root kernel source directory,
|
||||
not in any lower subdirectory.
|
||||
|
||||
To create a patch for a single file, it is often sufficient to do:
|
||||
|
||||
SRCTREE= linux-2.6
|
||||
MYFILE= drivers/net/mydriver.c
|
||||
|
||||
cd $SRCTREE
|
||||
cp $MYFILE $MYFILE.orig
|
||||
vi $MYFILE # make your change
|
||||
cd ..
|
||||
diff -up $SRCTREE/$MYFILE{.orig,} > /tmp/patch
|
||||
|
||||
To create a patch for multiple files, you should unpack a "vanilla",
|
||||
or unmodified kernel source tree, and generate a diff against your
|
||||
own source tree. For example:
|
||||
|
||||
MYSRC= /devel/linux-2.6
|
||||
|
||||
tar xvfz linux-2.6.12.tar.gz
|
||||
mv linux-2.6.12 linux-2.6.12-vanilla
|
||||
diff -uprN -X linux-2.6.12-vanilla/Documentation/dontdiff \
|
||||
linux-2.6.12-vanilla $MYSRC > /tmp/patch
|
||||
|
||||
"dontdiff" is a list of files which are generated by the kernel during
|
||||
the build process, and should be ignored in any diff(1)-generated
|
||||
patch. The "dontdiff" file is included in the kernel tree in
|
||||
2.6.12 and later. For earlier kernel versions, you can get it
|
||||
from <http://www.xenotime.net/linux/doc/dontdiff>.
|
||||
|
||||
Make sure your patch does not include any extra files which do not
|
||||
belong in a patch submission. Make sure to review your patch -after-
|
||||
generated it with diff(1), to ensure accuracy.
|
||||
|
||||
If your changes produce a lot of deltas, you may want to look into
|
||||
splitting them into individual patches which modify things in
|
||||
logical stages. This will facilitate easier reviewing by other
|
||||
kernel developers, very important if you want your patch accepted.
|
||||
There are a number of scripts which can aid in this:
|
||||
|
||||
Quilt:
|
||||
http://savannah.nongnu.org/projects/quilt
|
||||
|
||||
Andrew Morton's patch scripts:
|
||||
http://www.zip.com.au/~akpm/linux/patches/
|
||||
Instead of these scripts, quilt is the recommended patch management
|
||||
tool (see above).
|
||||
|
||||
|
||||
|
||||
2) Describe your changes.
|
||||
|
||||
Describe the technical detail of the change(s) your patch includes.
|
||||
|
||||
Be as specific as possible. The WORST descriptions possible include
|
||||
things like "update driver X", "bug fix for driver X", or "this patch
|
||||
includes updates for subsystem X. Please apply."
|
||||
|
||||
If your description starts to get long, that's a sign that you probably
|
||||
need to split up your patch. See #3, next.
|
||||
|
||||
|
||||
|
||||
3) Separate your changes.
|
||||
|
||||
Separate _logical changes_ into a single patch file.
|
||||
|
||||
For example, if your changes include both bug fixes and performance
|
||||
enhancements for a single driver, separate those changes into two
|
||||
or more patches. If your changes include an API update, and a new
|
||||
driver which uses that new API, separate those into two patches.
|
||||
|
||||
On the other hand, if you make a single change to numerous files,
|
||||
group those changes into a single patch. Thus a single logical change
|
||||
is contained within a single patch.
|
||||
|
||||
If one patch depends on another patch in order for a change to be
|
||||
complete, that is OK. Simply note "this patch depends on patch X"
|
||||
in your patch description.
|
||||
|
||||
If you cannot condense your patch set into a smaller set of patches,
|
||||
then only post say 15 or so at a time and wait for review and integration.
|
||||
|
||||
|
||||
|
||||
4) Select e-mail destination.
|
||||
|
||||
Look through the MAINTAINERS file and the source code, and determine
|
||||
if your change applies to a specific subsystem of the kernel, with
|
||||
an assigned maintainer. If so, e-mail that person.
|
||||
|
||||
If no maintainer is listed, or the maintainer does not respond, send
|
||||
your patch to the primary Linux kernel developer's mailing list,
|
||||
linux-kernel@vger.kernel.org. Most kernel developers monitor this
|
||||
e-mail list, and can comment on your changes.
|
||||
|
||||
|
||||
Do not send more than 15 patches at once to the vger mailing lists!!!
|
||||
|
||||
|
||||
Linus Torvalds is the final arbiter of all changes accepted into the
|
||||
Linux kernel. His e-mail address is <torvalds@linux-foundation.org>.
|
||||
He gets a lot of e-mail, so typically you should do your best to -avoid-
|
||||
sending him e-mail.
|
||||
|
||||
Patches which are bug fixes, are "obvious" changes, or similarly
|
||||
require little discussion should be sent or CC'd to Linus. Patches
|
||||
which require discussion or do not have a clear advantage should
|
||||
usually be sent first to linux-kernel. Only after the patch is
|
||||
discussed should the patch then be submitted to Linus.
|
||||
|
||||
|
||||
|
||||
5) Select your CC (e-mail carbon copy) list.
|
||||
|
||||
Unless you have a reason NOT to do so, CC linux-kernel@vger.kernel.org.
|
||||
|
||||
Other kernel developers besides Linus need to be aware of your change,
|
||||
so that they may comment on it and offer code review and suggestions.
|
||||
linux-kernel is the primary Linux kernel developer mailing list.
|
||||
Other mailing lists are available for specific subsystems, such as
|
||||
USB, framebuffer devices, the VFS, the SCSI subsystem, etc. See the
|
||||
MAINTAINERS file for a mailing list that relates specifically to
|
||||
your change.
|
||||
|
||||
Majordomo lists of VGER.KERNEL.ORG at:
|
||||
<http://vger.kernel.org/vger-lists.html>
|
||||
|
||||
If changes affect userland-kernel interfaces, please send
|
||||
the MAN-PAGES maintainer (as listed in the MAINTAINERS file)
|
||||
a man-pages patch, or at least a notification of the change,
|
||||
so that some information makes its way into the manual pages.
|
||||
|
||||
Even if the maintainer did not respond in step #4, make sure to ALWAYS
|
||||
copy the maintainer when you change their code.
|
||||
|
||||
For small patches you may want to CC the Trivial Patch Monkey
|
||||
trivial@kernel.org managed by Adrian Bunk; which collects "trivial"
|
||||
patches. Trivial patches must qualify for one of the following rules:
|
||||
Spelling fixes in documentation
|
||||
Spelling fixes which could break grep(1)
|
||||
Warning fixes (cluttering with useless warnings is bad)
|
||||
Compilation fixes (only if they are actually correct)
|
||||
Runtime fixes (only if they actually fix things)
|
||||
Removing use of deprecated functions/macros (eg. check_region)
|
||||
Contact detail and documentation fixes
|
||||
Non-portable code replaced by portable code (even in arch-specific,
|
||||
since people copy, as long as it's trivial)
|
||||
Any fix by the author/maintainer of the file (ie. patch monkey
|
||||
in re-transmission mode)
|
||||
URL: <http://www.kernel.org/pub/linux/kernel/people/bunk/trivial/>
|
||||
|
||||
|
||||
|
||||
|
||||
6) No MIME, no links, no compression, no attachments. Just plain text.
|
||||
|
||||
Linus and other kernel developers need to be able to read and comment
|
||||
on the changes you are submitting. It is important for a kernel
|
||||
developer to be able to "quote" your changes, using standard e-mail
|
||||
tools, so that they may comment on specific portions of your code.
|
||||
|
||||
For this reason, all patches should be submitting e-mail "inline".
|
||||
WARNING: Be wary of your editor's word-wrap corrupting your patch,
|
||||
if you choose to cut-n-paste your patch.
|
||||
|
||||
Do not attach the patch as a MIME attachment, compressed or not.
|
||||
Many popular e-mail applications will not always transmit a MIME
|
||||
attachment as plain text, making it impossible to comment on your
|
||||
code. A MIME attachment also takes Linus a bit more time to process,
|
||||
decreasing the likelihood of your MIME-attached change being accepted.
|
||||
|
||||
Exception: If your mailer is mangling patches then someone may ask
|
||||
you to re-send them using MIME.
|
||||
|
||||
|
||||
WARNING: Some mailers like Mozilla send your messages with
|
||||
---- message header ----
|
||||
Content-Type: text/plain; charset=us-ascii; format=flowed
|
||||
---- message header ----
|
||||
The problem is that "format=flowed" makes some of the mailers
|
||||
on receiving side to replace TABs with spaces and do similar
|
||||
changes. Thus the patches from you can look corrupted.
|
||||
|
||||
To fix this just make your mozilla defaults/pref/mailnews.js file to look like:
|
||||
pref("mailnews.send_plaintext_flowed", false); // RFC 2646=======
|
||||
pref("mailnews.display.disable_format_flowed_support", true);
|
||||
|
||||
|
||||
|
||||
7) E-mail size.
|
||||
|
||||
When sending patches to Linus, always follow step #6.
|
||||
|
||||
Large changes are not appropriate for mailing lists, and some
|
||||
maintainers. If your patch, uncompressed, exceeds 40 kB in size,
|
||||
it is preferred that you store your patch on an Internet-accessible
|
||||
server, and provide instead a URL (link) pointing to your patch.
|
||||
|
||||
|
||||
|
||||
8) Name your kernel version.
|
||||
|
||||
It is important to note, either in the subject line or in the patch
|
||||
description, the kernel version to which this patch applies.
|
||||
|
||||
If the patch does not apply cleanly to the latest kernel version,
|
||||
Linus will not apply it.
|
||||
|
||||
|
||||
|
||||
9) Don't get discouraged. Re-submit.
|
||||
|
||||
After you have submitted your change, be patient and wait. If Linus
|
||||
likes your change and applies it, it will appear in the next version
|
||||
of the kernel that he releases.
|
||||
|
||||
However, if your change doesn't appear in the next version of the
|
||||
kernel, there could be any number of reasons. It's YOUR job to
|
||||
narrow down those reasons, correct what was wrong, and submit your
|
||||
updated change.
|
||||
|
||||
It is quite common for Linus to "drop" your patch without comment.
|
||||
That's the nature of the system. If he drops your patch, it could be
|
||||
due to
|
||||
* Your patch did not apply cleanly to the latest kernel version.
|
||||
* Your patch was not sufficiently discussed on linux-kernel.
|
||||
* A style issue (see section 2).
|
||||
* An e-mail formatting issue (re-read this section).
|
||||
* A technical problem with your change.
|
||||
* He gets tons of e-mail, and yours got lost in the shuffle.
|
||||
* You are being annoying.
|
||||
|
||||
When in doubt, solicit comments on linux-kernel mailing list.
|
||||
|
||||
|
||||
|
||||
10) Include PATCH in the subject
|
||||
|
||||
Due to high e-mail traffic to Linus, and to linux-kernel, it is common
|
||||
convention to prefix your subject line with [PATCH]. This lets Linus
|
||||
and other kernel developers more easily distinguish patches from other
|
||||
e-mail discussions.
|
||||
|
||||
|
||||
|
||||
11) Sign your work
|
||||
|
||||
To improve tracking of who did what, especially with patches that can
|
||||
percolate to their final resting place in the kernel through several
|
||||
layers of maintainers, we've introduced a "sign-off" procedure on
|
||||
patches that are being emailed around.
|
||||
|
||||
The sign-off is a simple line at the end of the explanation for the
|
||||
patch, which certifies that you wrote it or otherwise have the right to
|
||||
pass it on as a open-source patch. The rules are pretty simple: if you
|
||||
can certify the below:
|
||||
|
||||
Developer's Certificate of Origin 1.1
|
||||
|
||||
By making a contribution to this project, I certify that:
|
||||
|
||||
(a) The contribution was created in whole or in part by me and I
|
||||
have the right to submit it under the open source license
|
||||
indicated in the file; or
|
||||
|
||||
(b) The contribution is based upon previous work that, to the best
|
||||
of my knowledge, is covered under an appropriate open source
|
||||
license and I have the right under that license to submit that
|
||||
work with modifications, whether created in whole or in part
|
||||
by me, under the same open source license (unless I am
|
||||
permitted to submit under a different license), as indicated
|
||||
in the file; or
|
||||
|
||||
(c) The contribution was provided directly to me by some other
|
||||
person who certified (a), (b) or (c) and I have not modified
|
||||
it.
|
||||
|
||||
(d) I understand and agree that this project and the contribution
|
||||
are public and that a record of the contribution (including all
|
||||
personal information I submit with it, including my sign-off) is
|
||||
maintained indefinitely and may be redistributed consistent with
|
||||
this project or the open source license(s) involved.
|
||||
|
||||
then you just add a line saying
|
||||
|
||||
Signed-off-by: Random J Developer <random@developer.example.org>
|
||||
|
||||
using your real name (sorry, no pseudonyms or anonymous contributions.)
|
||||
|
||||
Some people also put extra tags at the end. They'll just be ignored for
|
||||
now, but you can do this to mark internal company procedures or just
|
||||
point out some special detail about the sign-off.
|
||||
|
||||
|
||||
12) The canonical patch format
|
||||
|
||||
The canonical patch subject line is:
|
||||
|
||||
Subject: [PATCH 001/123] subsystem: summary phrase
|
||||
|
||||
The canonical patch message body contains the following:
|
||||
|
||||
- A "from" line specifying the patch author.
|
||||
|
||||
- An empty line.
|
||||
|
||||
- The body of the explanation, which will be copied to the
|
||||
permanent changelog to describe this patch.
|
||||
|
||||
- The "Signed-off-by:" lines, described above, which will
|
||||
also go in the changelog.
|
||||
|
||||
- A marker line containing simply "---".
|
||||
|
||||
- Any additional comments not suitable for the changelog.
|
||||
|
||||
- The actual patch (diff output).
|
||||
|
||||
The Subject line format makes it very easy to sort the emails
|
||||
alphabetically by subject line - pretty much any email reader will
|
||||
support that - since because the sequence number is zero-padded,
|
||||
the numerical and alphabetic sort is the same.
|
||||
|
||||
The "subsystem" in the email's Subject should identify which
|
||||
area or subsystem of the kernel is being patched.
|
||||
|
||||
The "summary phrase" in the email's Subject should concisely
|
||||
describe the patch which that email contains. The "summary
|
||||
phrase" should not be a filename. Do not use the same "summary
|
||||
phrase" for every patch in a whole patch series.
|
||||
|
||||
Bear in mind that the "summary phrase" of your email becomes
|
||||
a globally-unique identifier for that patch. It propagates
|
||||
all the way into the git changelog. The "summary phrase" may
|
||||
later be used in developer discussions which refer to the patch.
|
||||
People will want to google for the "summary phrase" to read
|
||||
discussion regarding that patch.
|
||||
|
||||
A couple of example Subjects:
|
||||
|
||||
Subject: [patch 2/5] ext2: improve scalability of bitmap searching
|
||||
Subject: [PATCHv2 001/207] x86: fix eflags tracking
|
||||
|
||||
The "from" line must be the very first line in the message body,
|
||||
and has the form:
|
||||
|
||||
From: Original Author <author@example.com>
|
||||
|
||||
The "from" line specifies who will be credited as the author of the
|
||||
patch in the permanent changelog. If the "from" line is missing,
|
||||
then the "From:" line from the email header will be used to determine
|
||||
the patch author in the changelog.
|
||||
|
||||
The explanation body will be committed to the permanent source
|
||||
changelog, so should make sense to a competent reader who has long
|
||||
since forgotten the immediate details of the discussion that might
|
||||
have led to this patch.
|
||||
|
||||
The "---" marker line serves the essential purpose of marking for patch
|
||||
handling tools where the changelog message ends.
|
||||
|
||||
One good use for the additional comments after the "---" marker is for
|
||||
a diffstat, to show what files have changed, and the number of inserted
|
||||
and deleted lines per file. A diffstat is especially useful on bigger
|
||||
patches. Other comments relevant only to the moment or the maintainer,
|
||||
not suitable for the permanent changelog, should also go here.
|
||||
Use diffstat options "-p 1 -w 70" so that filenames are listed from the
|
||||
top of the kernel source tree and don't use too much horizontal space
|
||||
(easily fit in 80 columns, maybe with some indentation).
|
||||
|
||||
See more details on the proper patch format in the following
|
||||
references.
|
||||
|
||||
|
||||
|
||||
|
||||
-----------------------------------
|
||||
SECTION 2 - HINTS, TIPS, AND TRICKS
|
||||
-----------------------------------
|
||||
|
||||
This section lists many of the common "rules" associated with code
|
||||
submitted to the kernel. There are always exceptions... but you must
|
||||
have a really good reason for doing so. You could probably call this
|
||||
section Linus Computer Science 101.
|
||||
|
||||
|
||||
|
||||
1) Read Documentation/CodingStyle
|
||||
|
||||
Nuff said. If your code deviates too much from this, it is likely
|
||||
to be rejected without further review, and without comment.
|
||||
|
||||
|
||||
|
||||
2) #ifdefs are ugly
|
||||
|
||||
Code cluttered with ifdefs is difficult to read and maintain. Don't do
|
||||
it. Instead, put your ifdefs in a header, and conditionally define
|
||||
'static inline' functions, or macros, which are used in the code.
|
||||
Let the compiler optimize away the "no-op" case.
|
||||
|
||||
Simple example, of poor code:
|
||||
|
||||
dev = alloc_etherdev (sizeof(struct funky_private));
|
||||
if (!dev)
|
||||
return -ENODEV;
|
||||
#ifdef CONFIG_NET_FUNKINESS
|
||||
init_funky_net(dev);
|
||||
#endif
|
||||
|
||||
Cleaned-up example:
|
||||
|
||||
(in header)
|
||||
#ifndef CONFIG_NET_FUNKINESS
|
||||
static inline void init_funky_net (struct net_device *d) {}
|
||||
#endif
|
||||
|
||||
(in the code itself)
|
||||
dev = alloc_etherdev (sizeof(struct funky_private));
|
||||
if (!dev)
|
||||
return -ENODEV;
|
||||
init_funky_net(dev);
|
||||
|
||||
|
||||
|
||||
3) 'static inline' is better than a macro
|
||||
|
||||
Static inline functions are greatly preferred over macros.
|
||||
They provide type safety, have no length limitations, no formatting
|
||||
limitations, and under gcc they are as cheap as macros.
|
||||
|
||||
Macros should only be used for cases where a static inline is clearly
|
||||
suboptimal [there a few, isolated cases of this in fast paths],
|
||||
or where it is impossible to use a static inline function [such as
|
||||
string-izing].
|
||||
|
||||
'static inline' is preferred over 'static __inline__', 'extern inline',
|
||||
and 'extern __inline__'.
|
||||
|
||||
|
||||
|
||||
4) Don't over-design.
|
||||
|
||||
Don't try to anticipate nebulous future cases which may or may not
|
||||
be useful: "Make it as simple as you can, and no simpler."
|
||||
|
||||
|
||||
|
||||
----------------------
|
||||
SECTION 3 - REFERENCES
|
||||
----------------------
|
||||
|
||||
Andrew Morton, "The perfect patch" (tpp).
|
||||
<http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt>
|
||||
|
||||
Jeff Garzik, "Linux kernel patch submission format".
|
||||
<http://linux.yyz.us/patch-format.html>
|
||||
|
||||
Greg Kroah-Hartman, "How to piss off a kernel subsystem maintainer".
|
||||
<http://www.kroah.com/log/2005/03/31/>
|
||||
<http://www.kroah.com/log/2005/07/08/>
|
||||
<http://www.kroah.com/log/2005/10/19/>
|
||||
<http://www.kroah.com/log/2006/01/11/>
|
||||
|
||||
NO!!!! No more huge patch bombs to linux-kernel@vger.kernel.org people!
|
||||
<http://marc.theaimsgroup.com/?l=linux-kernel&m=112112749912944&w=2>
|
||||
|
||||
Kernel Documentation/CodingStyle:
|
||||
<http://sosdg.org/~coywolf/lxr/source/Documentation/CodingStyle>
|
||||
|
||||
Linus Torvalds's mail on the canonical patch format:
|
||||
<http://lkml.org/lkml/2005/4/7/183>
|
||||
--
|
||||
39
Documentation/VGA-softcursor.txt
Normal file
39
Documentation/VGA-softcursor.txt
Normal file
@ -0,0 +1,39 @@
|
||||
Software cursor for VGA by Pavel Machek <pavel@atrey.karlin.mff.cuni.cz>
|
||||
======================= and Martin Mares <mj@atrey.karlin.mff.cuni.cz>
|
||||
|
||||
Linux now has some ability to manipulate cursor appearance. Normally, you
|
||||
can set the size of hardware cursor (and also work around some ugly bugs in
|
||||
those miserable Trident cards--see #define TRIDENT_GLITCH in drivers/video/
|
||||
vgacon.c). You can now play a few new tricks: you can make your cursor look
|
||||
like a non-blinking red block, make it inverse background of the character it's
|
||||
over or to highlight that character and still choose whether the original
|
||||
hardware cursor should remain visible or not. There may be other things I have
|
||||
never thought of.
|
||||
|
||||
The cursor appearance is controlled by a "<ESC>[?1;2;3c" escape sequence
|
||||
where 1, 2 and 3 are parameters described below. If you omit any of them,
|
||||
they will default to zeroes.
|
||||
|
||||
Parameter 1 specifies cursor size (0=default, 1=invisible, 2=underline, ...,
|
||||
8=full block) + 16 if you want the software cursor to be applied + 32 if you
|
||||
want to always change the background color + 64 if you dislike having the
|
||||
background the same as the foreground. Highlights are ignored for the last two
|
||||
flags.
|
||||
|
||||
The second parameter selects character attribute bits you want to change
|
||||
(by simply XORing them with the value of this parameter). On standard VGA,
|
||||
the high four bits specify background and the low four the foreground. In both
|
||||
groups, low three bits set color (as in normal color codes used by the console)
|
||||
and the most significant one turns on highlight (or sometimes blinking--it
|
||||
depends on the configuration of your VGA).
|
||||
|
||||
The third parameter consists of character attribute bits you want to set.
|
||||
Bit setting takes place before bit toggling, so you can simply clear a bit by
|
||||
including it in both the set mask and the toggle mask.
|
||||
|
||||
Examples:
|
||||
=========
|
||||
|
||||
To get normal blinking underline, use: echo -e '\033[?2c'
|
||||
To get blinking block, use: echo -e '\033[?6c'
|
||||
To get red non-blinking block, use: echo -e '\033[?17;0;64c'
|
||||
112
Documentation/accounting/delay-accounting.txt
Normal file
112
Documentation/accounting/delay-accounting.txt
Normal file
@ -0,0 +1,112 @@
|
||||
Delay accounting
|
||||
----------------
|
||||
|
||||
Tasks encounter delays in execution when they wait
|
||||
for some kernel resource to become available e.g. a
|
||||
runnable task may wait for a free CPU to run on.
|
||||
|
||||
The per-task delay accounting functionality measures
|
||||
the delays experienced by a task while
|
||||
|
||||
a) waiting for a CPU (while being runnable)
|
||||
b) completion of synchronous block I/O initiated by the task
|
||||
c) swapping in pages
|
||||
|
||||
and makes these statistics available to userspace through
|
||||
the taskstats interface.
|
||||
|
||||
Such delays provide feedback for setting a task's cpu priority,
|
||||
io priority and rss limit values appropriately. Long delays for
|
||||
important tasks could be a trigger for raising its corresponding priority.
|
||||
|
||||
The functionality, through its use of the taskstats interface, also provides
|
||||
delay statistics aggregated for all tasks (or threads) belonging to a
|
||||
thread group (corresponding to a traditional Unix process). This is a commonly
|
||||
needed aggregation that is more efficiently done by the kernel.
|
||||
|
||||
Userspace utilities, particularly resource management applications, can also
|
||||
aggregate delay statistics into arbitrary groups. To enable this, delay
|
||||
statistics of a task are available both during its lifetime as well as on its
|
||||
exit, ensuring continuous and complete monitoring can be done.
|
||||
|
||||
|
||||
Interface
|
||||
---------
|
||||
|
||||
Delay accounting uses the taskstats interface which is described
|
||||
in detail in a separate document in this directory. Taskstats returns a
|
||||
generic data structure to userspace corresponding to per-pid and per-tgid
|
||||
statistics. The delay accounting functionality populates specific fields of
|
||||
this structure. See
|
||||
include/linux/taskstats.h
|
||||
for a description of the fields pertaining to delay accounting.
|
||||
It will generally be in the form of counters returning the cumulative
|
||||
delay seen for cpu, sync block I/O, swapin etc.
|
||||
|
||||
Taking the difference of two successive readings of a given
|
||||
counter (say cpu_delay_total) for a task will give the delay
|
||||
experienced by the task waiting for the corresponding resource
|
||||
in that interval.
|
||||
|
||||
When a task exits, records containing the per-task statistics
|
||||
are sent to userspace without requiring a command. If it is the last exiting
|
||||
task of a thread group, the per-tgid statistics are also sent. More details
|
||||
are given in the taskstats interface description.
|
||||
|
||||
The getdelays.c userspace utility in this directory allows simple commands to
|
||||
be run and the corresponding delay statistics to be displayed. It also serves
|
||||
as an example of using the taskstats interface.
|
||||
|
||||
Usage
|
||||
-----
|
||||
|
||||
Compile the kernel with
|
||||
CONFIG_TASK_DELAY_ACCT=y
|
||||
CONFIG_TASKSTATS=y
|
||||
|
||||
Delay accounting is enabled by default at boot up.
|
||||
To disable, add
|
||||
nodelayacct
|
||||
to the kernel boot options. The rest of the instructions
|
||||
below assume this has not been done.
|
||||
|
||||
After the system has booted up, use a utility
|
||||
similar to getdelays.c to access the delays
|
||||
seen by a given task or a task group (tgid).
|
||||
The utility also allows a given command to be
|
||||
executed and the corresponding delays to be
|
||||
seen.
|
||||
|
||||
General format of the getdelays command
|
||||
|
||||
getdelays [-t tgid] [-p pid] [-c cmd...]
|
||||
|
||||
|
||||
Get delays, since system boot, for pid 10
|
||||
# ./getdelays -p 10
|
||||
(output similar to next case)
|
||||
|
||||
Get sum of delays, since system boot, for all pids with tgid 5
|
||||
# ./getdelays -t 5
|
||||
|
||||
|
||||
CPU count real total virtual total delay total
|
||||
7876 92005750 100000000 24001500
|
||||
IO count delay total
|
||||
0 0
|
||||
MEM count delay total
|
||||
0 0
|
||||
|
||||
Get delays seen in executing a given simple command
|
||||
# ./getdelays -c ls /
|
||||
|
||||
bin data1 data3 data5 dev home media opt root srv sys usr
|
||||
boot data2 data4 data6 etc lib mnt proc sbin subdomain tmp var
|
||||
|
||||
|
||||
CPU count real total virtual total delay total
|
||||
6 4000250 4000000 0
|
||||
IO count delay total
|
||||
0 0
|
||||
MEM count delay total
|
||||
0 0
|
||||
425
Documentation/accounting/getdelays.c
Normal file
425
Documentation/accounting/getdelays.c
Normal file
@ -0,0 +1,425 @@
|
||||
/* getdelays.c
|
||||
*
|
||||
* Utility to get per-pid and per-tgid delay accounting statistics
|
||||
* Also illustrates usage of the taskstats interface
|
||||
*
|
||||
* Copyright (C) Shailabh Nagar, IBM Corp. 2005
|
||||
* Copyright (C) Balbir Singh, IBM Corp. 2006
|
||||
* Copyright (c) Jay Lan, SGI. 2006
|
||||
*
|
||||
* Compile with
|
||||
* gcc -I/usr/src/linux/include getdelays.c -o getdelays
|
||||
*/
|
||||
|
||||
#include <stdio.h>
|
||||
#include <stdlib.h>
|
||||
#include <errno.h>
|
||||
#include <unistd.h>
|
||||
#include <poll.h>
|
||||
#include <string.h>
|
||||
#include <fcntl.h>
|
||||
#include <sys/types.h>
|
||||
#include <sys/stat.h>
|
||||
#include <sys/socket.h>
|
||||
#include <sys/types.h>
|
||||
#include <signal.h>
|
||||
|
||||
#include <linux/genetlink.h>
|
||||
#include <linux/taskstats.h>
|
||||
|
||||
/*
|
||||
* Generic macros for dealing with netlink sockets. Might be duplicated
|
||||
* elsewhere. It is recommended that commercial grade applications use
|
||||
* libnl or libnetlink and use the interfaces provided by the library
|
||||
*/
|
||||
#define GENLMSG_DATA(glh) ((void *)(NLMSG_DATA(glh) + GENL_HDRLEN))
|
||||
#define GENLMSG_PAYLOAD(glh) (NLMSG_PAYLOAD(glh, 0) - GENL_HDRLEN)
|
||||
#define NLA_DATA(na) ((void *)((char*)(na) + NLA_HDRLEN))
|
||||
#define NLA_PAYLOAD(len) (len - NLA_HDRLEN)
|
||||
|
||||
#define err(code, fmt, arg...) \
|
||||
do { \
|
||||
fprintf(stderr, fmt, ##arg); \
|
||||
exit(code); \
|
||||
} while (0)
|
||||
|
||||
int done;
|
||||
int rcvbufsz;
|
||||
char name[100];
|
||||
int dbg;
|
||||
int print_delays;
|
||||
int print_io_accounting;
|
||||
__u64 stime, utime;
|
||||
|
||||
#define PRINTF(fmt, arg...) { \
|
||||
if (dbg) { \
|
||||
printf(fmt, ##arg); \
|
||||
} \
|
||||
}
|
||||
|
||||
/* Maximum size of response requested or message sent */
|
||||
#define MAX_MSG_SIZE 1024
|
||||
/* Maximum number of cpus expected to be specified in a cpumask */
|
||||
#define MAX_CPUS 32
|
||||
/* Maximum length of pathname to log file */
|
||||
#define MAX_FILENAME 256
|
||||
|
||||
struct msgtemplate {
|
||||
struct nlmsghdr n;
|
||||
struct genlmsghdr g;
|
||||
char buf[MAX_MSG_SIZE];
|
||||
};
|
||||
|
||||
char cpumask[100+6*MAX_CPUS];
|
||||
|
||||
/*
|
||||
* Create a raw netlink socket and bind
|
||||
*/
|
||||
static int create_nl_socket(int protocol)
|
||||
{
|
||||
int fd;
|
||||
struct sockaddr_nl local;
|
||||
|
||||
fd = socket(AF_NETLINK, SOCK_RAW, protocol);
|
||||
if (fd < 0)
|
||||
return -1;
|
||||
|
||||
if (rcvbufsz)
|
||||
if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
|
||||
&rcvbufsz, sizeof(rcvbufsz)) < 0) {
|
||||
fprintf(stderr, "Unable to set socket rcv buf size "
|
||||
"to %d\n",
|
||||
rcvbufsz);
|
||||
return -1;
|
||||
}
|
||||
|
||||
memset(&local, 0, sizeof(local));
|
||||
local.nl_family = AF_NETLINK;
|
||||
|
||||
if (bind(fd, (struct sockaddr *) &local, sizeof(local)) < 0)
|
||||
goto error;
|
||||
|
||||
return fd;
|
||||
error:
|
||||
close(fd);
|
||||
return -1;
|
||||
}
|
||||
|
||||
|
||||
int send_cmd(int sd, __u16 nlmsg_type, __u32 nlmsg_pid,
|
||||
__u8 genl_cmd, __u16 nla_type,
|
||||
void *nla_data, int nla_len)
|
||||
{
|
||||
struct nlattr *na;
|
||||
struct sockaddr_nl nladdr;
|
||||
int r, buflen;
|
||||
char *buf;
|
||||
|
||||
struct msgtemplate msg;
|
||||
|
||||
msg.n.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN);
|
||||
msg.n.nlmsg_type = nlmsg_type;
|
||||
msg.n.nlmsg_flags = NLM_F_REQUEST;
|
||||
msg.n.nlmsg_seq = 0;
|
||||
msg.n.nlmsg_pid = nlmsg_pid;
|
||||
msg.g.cmd = genl_cmd;
|
||||
msg.g.version = 0x1;
|
||||
na = (struct nlattr *) GENLMSG_DATA(&msg);
|
||||
na->nla_type = nla_type;
|
||||
na->nla_len = nla_len + 1 + NLA_HDRLEN;
|
||||
memcpy(NLA_DATA(na), nla_data, nla_len);
|
||||
msg.n.nlmsg_len += NLMSG_ALIGN(na->nla_len);
|
||||
|
||||
buf = (char *) &msg;
|
||||
buflen = msg.n.nlmsg_len ;
|
||||
memset(&nladdr, 0, sizeof(nladdr));
|
||||
nladdr.nl_family = AF_NETLINK;
|
||||
while ((r = sendto(sd, buf, buflen, 0, (struct sockaddr *) &nladdr,
|
||||
sizeof(nladdr))) < buflen) {
|
||||
if (r > 0) {
|
||||
buf += r;
|
||||
buflen -= r;
|
||||
} else if (errno != EAGAIN)
|
||||
return -1;
|
||||
}
|
||||
return 0;
|
||||
}
|
||||
|
||||
|
||||
/*
|
||||
* Probe the controller in genetlink to find the family id
|
||||
* for the TASKSTATS family
|
||||
*/
|
||||
int get_family_id(int sd)
|
||||
{
|
||||
struct {
|
||||
struct nlmsghdr n;
|
||||
struct genlmsghdr g;
|
||||
char buf[256];
|
||||
} ans;
|
||||
|
||||
int id, rc;
|
||||
struct nlattr *na;
|
||||
int rep_len;
|
||||
|
||||
strcpy(name, TASKSTATS_GENL_NAME);
|
||||
rc = send_cmd(sd, GENL_ID_CTRL, getpid(), CTRL_CMD_GETFAMILY,
|
||||
CTRL_ATTR_FAMILY_NAME, (void *)name,
|
||||
strlen(TASKSTATS_GENL_NAME)+1);
|
||||
|
||||
rep_len = recv(sd, &ans, sizeof(ans), 0);
|
||||
if (ans.n.nlmsg_type == NLMSG_ERROR ||
|
||||
(rep_len < 0) || !NLMSG_OK((&ans.n), rep_len))
|
||||
return 0;
|
||||
|
||||
na = (struct nlattr *) GENLMSG_DATA(&ans);
|
||||
na = (struct nlattr *) ((char *) na + NLA_ALIGN(na->nla_len));
|
||||
if (na->nla_type == CTRL_ATTR_FAMILY_ID) {
|
||||
id = *(__u16 *) NLA_DATA(na);
|
||||
}
|
||||
return id;
|
||||
}
|
||||
|
||||
void print_delayacct(struct taskstats *t)
|
||||
{
|
||||
printf("\n\nCPU %15s%15s%15s%15s\n"
|
||||
" %15llu%15llu%15llu%15llu\n"
|
||||
"IO %15s%15s\n"
|
||||
" %15llu%15llu\n"
|
||||
"MEM %15s%15s\n"
|
||||
" %15llu%15llu\n\n",
|
||||
"count", "real total", "virtual total", "delay total",
|
||||
t->cpu_count, t->cpu_run_real_total, t->cpu_run_virtual_total,
|
||||
t->cpu_delay_total,
|
||||
"count", "delay total",
|
||||
t->blkio_count, t->blkio_delay_total,
|
||||
"count", "delay total", t->swapin_count, t->swapin_delay_total);
|
||||
}
|
||||
|
||||
void print_ioacct(struct taskstats *t)
|
||||
{
|
||||
printf("%s: read=%llu, write=%llu, cancelled_write=%llu\n",
|
||||
t->ac_comm,
|
||||
(unsigned long long)t->read_bytes,
|
||||
(unsigned long long)t->write_bytes,
|
||||
(unsigned long long)t->cancelled_write_bytes);
|
||||
}
|
||||
|
||||
int main(int argc, char *argv[])
|
||||
{
|
||||
int c, rc, rep_len, aggr_len, len2, cmd_type;
|
||||
__u16 id;
|
||||
__u32 mypid;
|
||||
|
||||
struct nlattr *na;
|
||||
int nl_sd = -1;
|
||||
int len = 0;
|
||||
pid_t tid = 0;
|
||||
pid_t rtid = 0;
|
||||
|
||||
int fd = 0;
|
||||
int count = 0;
|
||||
int write_file = 0;
|
||||
int maskset = 0;
|
||||
char logfile[128];
|
||||
int loop = 0;
|
||||
|
||||
struct msgtemplate msg;
|
||||
|
||||
while (1) {
|
||||
c = getopt(argc, argv, "diw:r:m:t:p:v:l");
|
||||
if (c < 0)
|
||||
break;
|
||||
|
||||
switch (c) {
|
||||
case 'd':
|
||||
printf("print delayacct stats ON\n");
|
||||
print_delays = 1;
|
||||
break;
|
||||
case 'i':
|
||||
printf("printing IO accounting\n");
|
||||
print_io_accounting = 1;
|
||||
break;
|
||||
case 'w':
|
||||
strncpy(logfile, optarg, MAX_FILENAME);
|
||||
printf("write to file %s\n", logfile);
|
||||
write_file = 1;
|
||||
break;
|
||||
case 'r':
|
||||
rcvbufsz = atoi(optarg);
|
||||
printf("receive buf size %d\n", rcvbufsz);
|
||||
if (rcvbufsz < 0)
|
||||
err(1, "Invalid rcv buf size\n");
|
||||
break;
|
||||
case 'm':
|
||||
strncpy(cpumask, optarg, sizeof(cpumask));
|
||||
maskset = 1;
|
||||
printf("cpumask %s maskset %d\n", cpumask, maskset);
|
||||
break;
|
||||
case 't':
|
||||
tid = atoi(optarg);
|
||||
if (!tid)
|
||||
err(1, "Invalid tgid\n");
|
||||
cmd_type = TASKSTATS_CMD_ATTR_TGID;
|
||||
break;
|
||||
case 'p':
|
||||
tid = atoi(optarg);
|
||||
if (!tid)
|
||||
err(1, "Invalid pid\n");
|
||||
cmd_type = TASKSTATS_CMD_ATTR_PID;
|
||||
break;
|
||||
case 'v':
|
||||
printf("debug on\n");
|
||||
dbg = 1;
|
||||
break;
|
||||
case 'l':
|
||||
printf("listen forever\n");
|
||||
loop = 1;
|
||||
break;
|
||||
default:
|
||||
printf("Unknown option %d\n", c);
|
||||
exit(-1);
|
||||
}
|
||||
}
|
||||
|
||||
if (write_file) {
|
||||
fd = open(logfile, O_WRONLY | O_CREAT | O_TRUNC,
|
||||
S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
|
||||
if (fd == -1) {
|
||||
perror("Cannot open output file\n");
|
||||
exit(1);
|
||||
}
|
||||
}
|
||||
|
||||
if ((nl_sd = create_nl_socket(NETLINK_GENERIC)) < 0)
|
||||
err(1, "error creating Netlink socket\n");
|
||||
|
||||
|
||||
mypid = getpid();
|
||||
id = get_family_id(nl_sd);
|
||||
if (!id) {
|
||||
fprintf(stderr, "Error getting family id, errno %d\n", errno);
|
||||
goto err;
|
||||
}
|
||||
PRINTF("family id %d\n", id);
|
||||
|
||||
if (maskset) {
|
||||
rc = send_cmd(nl_sd, id, mypid, TASKSTATS_CMD_GET,
|
||||
TASKSTATS_CMD_ATTR_REGISTER_CPUMASK,
|
||||
&cpumask, strlen(cpumask) + 1);
|
||||
PRINTF("Sent register cpumask, retval %d\n", rc);
|
||||
if (rc < 0) {
|
||||
fprintf(stderr, "error sending register cpumask\n");
|
||||
goto err;
|
||||
}
|
||||
}
|
||||
|
||||
if (tid) {
|
||||
rc = send_cmd(nl_sd, id, mypid, TASKSTATS_CMD_GET,
|
||||
cmd_type, &tid, sizeof(__u32));
|
||||
PRINTF("Sent pid/tgid, retval %d\n", rc);
|
||||
if (rc < 0) {
|
||||
fprintf(stderr, "error sending tid/tgid cmd\n");
|
||||
goto done;
|
||||
}
|
||||
}
|
||||
|
||||
do {
|
||||
int i;
|
||||
|
||||
rep_len = recv(nl_sd, &msg, sizeof(msg), 0);
|
||||
PRINTF("received %d bytes\n", rep_len);
|
||||
|
||||
if (rep_len < 0) {
|
||||
fprintf(stderr, "nonfatal reply error: errno %d\n",
|
||||
errno);
|
||||
continue;
|
||||
}
|
||||
if (msg.n.nlmsg_type == NLMSG_ERROR ||
|
||||
!NLMSG_OK((&msg.n), rep_len)) {
|
||||
struct nlmsgerr *err = NLMSG_DATA(&msg);
|
||||
fprintf(stderr, "fatal reply error, errno %d\n",
|
||||
err->error);
|
||||
goto done;
|
||||
}
|
||||
|
||||
PRINTF("nlmsghdr size=%d, nlmsg_len=%d, rep_len=%d\n",
|
||||
sizeof(struct nlmsghdr), msg.n.nlmsg_len, rep_len);
|
||||
|
||||
|
||||
rep_len = GENLMSG_PAYLOAD(&msg.n);
|
||||
|
||||
na = (struct nlattr *) GENLMSG_DATA(&msg);
|
||||
len = 0;
|
||||
i = 0;
|
||||
while (len < rep_len) {
|
||||
len += NLA_ALIGN(na->nla_len);
|
||||
switch (na->nla_type) {
|
||||
case TASKSTATS_TYPE_AGGR_TGID:
|
||||
/* Fall through */
|
||||
case TASKSTATS_TYPE_AGGR_PID:
|
||||
aggr_len = NLA_PAYLOAD(na->nla_len);
|
||||
len2 = 0;
|
||||
/* For nested attributes, na follows */
|
||||
na = (struct nlattr *) NLA_DATA(na);
|
||||
done = 0;
|
||||
while (len2 < aggr_len) {
|
||||
switch (na->nla_type) {
|
||||
case TASKSTATS_TYPE_PID:
|
||||
rtid = *(int *) NLA_DATA(na);
|
||||
if (print_delays)
|
||||
printf("PID\t%d\n", rtid);
|
||||
break;
|
||||
case TASKSTATS_TYPE_TGID:
|
||||
rtid = *(int *) NLA_DATA(na);
|
||||
if (print_delays)
|
||||
printf("TGID\t%d\n", rtid);
|
||||
break;
|
||||
case TASKSTATS_TYPE_STATS:
|
||||
count++;
|
||||
if (print_delays)
|
||||
print_delayacct((struct taskstats *) NLA_DATA(na));
|
||||
if (print_io_accounting)
|
||||
print_ioacct((struct taskstats *) NLA_DATA(na));
|
||||
if (fd) {
|
||||
if (write(fd, NLA_DATA(na), na->nla_len) < 0) {
|
||||
err(1,"write error\n");
|
||||
}
|
||||
}
|
||||
if (!loop)
|
||||
goto done;
|
||||
break;
|
||||
default:
|
||||
fprintf(stderr, "Unknown nested"
|
||||
" nla_type %d\n",
|
||||
na->nla_type);
|
||||
break;
|
||||
}
|
||||
len2 += NLA_ALIGN(na->nla_len);
|
||||
na = (struct nlattr *) ((char *) na + len2);
|
||||
}
|
||||
break;
|
||||
|
||||
default:
|
||||
fprintf(stderr, "Unknown nla_type %d\n",
|
||||
na->nla_type);
|
||||
break;
|
||||
}
|
||||
na = (struct nlattr *) (GENLMSG_DATA(&msg) + len);
|
||||
}
|
||||
} while (loop);
|
||||
done:
|
||||
if (maskset) {
|
||||
rc = send_cmd(nl_sd, id, mypid, TASKSTATS_CMD_GET,
|
||||
TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK,
|
||||
&cpumask, strlen(cpumask) + 1);
|
||||
printf("Sent deregister mask, retval %d\n", rc);
|
||||
if (rc < 0)
|
||||
err(rc, "error sending deregister cpumask\n");
|
||||
}
|
||||
err:
|
||||
close(nl_sd);
|
||||
if (fd)
|
||||
close(fd);
|
||||
return 0;
|
||||
}
|
||||
161
Documentation/accounting/taskstats-struct.txt
Normal file
161
Documentation/accounting/taskstats-struct.txt
Normal file
@ -0,0 +1,161 @@
|
||||
The struct taskstats
|
||||
--------------------
|
||||
|
||||
This document contains an explanation of the struct taskstats fields.
|
||||
|
||||
There are three different groups of fields in the struct taskstats:
|
||||
|
||||
1) Common and basic accounting fields
|
||||
If CONFIG_TASKSTATS is set, the taskstats inteface is enabled and
|
||||
the common fields and basic accounting fields are collected for
|
||||
delivery at do_exit() of a task.
|
||||
2) Delay accounting fields
|
||||
These fields are placed between
|
||||
/* Delay accounting fields start */
|
||||
and
|
||||
/* Delay accounting fields end */
|
||||
Their values are collected if CONFIG_TASK_DELAY_ACCT is set.
|
||||
3) Extended accounting fields
|
||||
These fields are placed between
|
||||
/* Extended accounting fields start */
|
||||
and
|
||||
/* Extended accounting fields end */
|
||||
Their values are collected if CONFIG_TASK_XACCT is set.
|
||||
|
||||
Future extension should add fields to the end of the taskstats struct, and
|
||||
should not change the relative position of each field within the struct.
|
||||
|
||||
|
||||
struct taskstats {
|
||||
|
||||
1) Common and basic accounting fields:
|
||||
/* The version number of this struct. This field is always set to
|
||||
* TAKSTATS_VERSION, which is defined in <linux/taskstats.h>.
|
||||
* Each time the struct is changed, the value should be incremented.
|
||||
*/
|
||||
__u16 version;
|
||||
|
||||
/* The exit code of a task. */
|
||||
__u32 ac_exitcode; /* Exit status */
|
||||
|
||||
/* The accounting flags of a task as defined in <linux/acct.h>
|
||||
* Defined values are AFORK, ASU, ACOMPAT, ACORE, and AXSIG.
|
||||
*/
|
||||
__u8 ac_flag; /* Record flags */
|
||||
|
||||
/* The value of task_nice() of a task. */
|
||||
__u8 ac_nice; /* task_nice */
|
||||
|
||||
/* The name of the command that started this task. */
|
||||
char ac_comm[TS_COMM_LEN]; /* Command name */
|
||||
|
||||
/* The scheduling discipline as set in task->policy field. */
|
||||
__u8 ac_sched; /* Scheduling discipline */
|
||||
|
||||
__u8 ac_pad[3];
|
||||
__u32 ac_uid; /* User ID */
|
||||
__u32 ac_gid; /* Group ID */
|
||||
__u32 ac_pid; /* Process ID */
|
||||
__u32 ac_ppid; /* Parent process ID */
|
||||
|
||||
/* The time when a task begins, in [secs] since 1970. */
|
||||
__u32 ac_btime; /* Begin time [sec since 1970] */
|
||||
|
||||
/* The elapsed time of a task, in [usec]. */
|
||||
__u64 ac_etime; /* Elapsed time [usec] */
|
||||
|
||||
/* The user CPU time of a task, in [usec]. */
|
||||
__u64 ac_utime; /* User CPU time [usec] */
|
||||
|
||||
/* The system CPU time of a task, in [usec]. */
|
||||
__u64 ac_stime; /* System CPU time [usec] */
|
||||
|
||||
/* The minor page fault count of a task, as set in task->min_flt. */
|
||||
__u64 ac_minflt; /* Minor Page Fault Count */
|
||||
|
||||
/* The major page fault count of a task, as set in task->maj_flt. */
|
||||
__u64 ac_majflt; /* Major Page Fault Count */
|
||||
|
||||
|
||||
2) Delay accounting fields:
|
||||
/* Delay accounting fields start
|
||||
*
|
||||
* All values, until the comment "Delay accounting fields end" are
|
||||
* available only if delay accounting is enabled, even though the last
|
||||
* few fields are not delays
|
||||
*
|
||||
* xxx_count is the number of delay values recorded
|
||||
* xxx_delay_total is the corresponding cumulative delay in nanoseconds
|
||||
*
|
||||
* xxx_delay_total wraps around to zero on overflow
|
||||
* xxx_count incremented regardless of overflow
|
||||
*/
|
||||
|
||||
/* Delay waiting for cpu, while runnable
|
||||
* count, delay_total NOT updated atomically
|
||||
*/
|
||||
__u64 cpu_count;
|
||||
__u64 cpu_delay_total;
|
||||
|
||||
/* Following four fields atomically updated using task->delays->lock */
|
||||
|
||||
/* Delay waiting for synchronous block I/O to complete
|
||||
* does not account for delays in I/O submission
|
||||
*/
|
||||
__u64 blkio_count;
|
||||
__u64 blkio_delay_total;
|
||||
|
||||
/* Delay waiting for page fault I/O (swap in only) */
|
||||
__u64 swapin_count;
|
||||
__u64 swapin_delay_total;
|
||||
|
||||
/* cpu "wall-clock" running time
|
||||
* On some architectures, value will adjust for cpu time stolen
|
||||
* from the kernel in involuntary waits due to virtualization.
|
||||
* Value is cumulative, in nanoseconds, without a corresponding count
|
||||
* and wraps around to zero silently on overflow
|
||||
*/
|
||||
__u64 cpu_run_real_total;
|
||||
|
||||
/* cpu "virtual" running time
|
||||
* Uses time intervals seen by the kernel i.e. no adjustment
|
||||
* for kernel's involuntary waits due to virtualization.
|
||||
* Value is cumulative, in nanoseconds, without a corresponding count
|
||||
* and wraps around to zero silently on overflow
|
||||
*/
|
||||
__u64 cpu_run_virtual_total;
|
||||
/* Delay accounting fields end */
|
||||
/* version 1 ends here */
|
||||
|
||||
|
||||
3) Extended accounting fields
|
||||
/* Extended accounting fields start */
|
||||
|
||||
/* Accumulated RSS usage in duration of a task, in MBytes-usecs.
|
||||
* The current rss usage is added to this counter every time
|
||||
* a tick is charged to a task's system time. So, at the end we
|
||||
* will have memory usage multiplied by system time. Thus an
|
||||
* average usage per system time unit can be calculated.
|
||||
*/
|
||||
__u64 coremem; /* accumulated RSS usage in MB-usec */
|
||||
|
||||
/* Accumulated virtual memory usage in duration of a task.
|
||||
* Same as acct_rss_mem1 above except that we keep track of VM usage.
|
||||
*/
|
||||
__u64 virtmem; /* accumulated VM usage in MB-usec */
|
||||
|
||||
/* High watermark of RSS usage in duration of a task, in KBytes. */
|
||||
__u64 hiwater_rss; /* High-watermark of RSS usage */
|
||||
|
||||
/* High watermark of VM usage in duration of a task, in KBytes. */
|
||||
__u64 hiwater_vm; /* High-water virtual memory usage */
|
||||
|
||||
/* The following four fields are I/O statistics of a task. */
|
||||
__u64 read_char; /* bytes read */
|
||||
__u64 write_char; /* bytes written */
|
||||
__u64 read_syscalls; /* read syscalls */
|
||||
__u64 write_syscalls; /* write syscalls */
|
||||
|
||||
/* Extended accounting fields end */
|
||||
|
||||
}
|
||||
181
Documentation/accounting/taskstats.txt
Normal file
181
Documentation/accounting/taskstats.txt
Normal file
@ -0,0 +1,181 @@
|
||||
Per-task statistics interface
|
||||
-----------------------------
|
||||
|
||||
|
||||
Taskstats is a netlink-based interface for sending per-task and
|
||||
per-process statistics from the kernel to userspace.
|
||||
|
||||
Taskstats was designed for the following benefits:
|
||||
|
||||
- efficiently provide statistics during lifetime of a task and on its exit
|
||||
- unified interface for multiple accounting subsystems
|
||||
- extensibility for use by future accounting patches
|
||||
|
||||
Terminology
|
||||
-----------
|
||||
|
||||
"pid", "tid" and "task" are used interchangeably and refer to the standard
|
||||
Linux task defined by struct task_struct. per-pid stats are the same as
|
||||
per-task stats.
|
||||
|
||||
"tgid", "process" and "thread group" are used interchangeably and refer to the
|
||||
tasks that share an mm_struct i.e. the traditional Unix process. Despite the
|
||||
use of tgid, there is no special treatment for the task that is thread group
|
||||
leader - a process is deemed alive as long as it has any task belonging to it.
|
||||
|
||||
Usage
|
||||
-----
|
||||
|
||||
To get statistics during a task's lifetime, userspace opens a unicast netlink
|
||||
socket (NETLINK_GENERIC family) and sends commands specifying a pid or a tgid.
|
||||
The response contains statistics for a task (if pid is specified) or the sum of
|
||||
statistics for all tasks of the process (if tgid is specified).
|
||||
|
||||
To obtain statistics for tasks which are exiting, the userspace listener
|
||||
sends a register command and specifies a cpumask. Whenever a task exits on
|
||||
one of the cpus in the cpumask, its per-pid statistics are sent to the
|
||||
registered listener. Using cpumasks allows the data received by one listener
|
||||
to be limited and assists in flow control over the netlink interface and is
|
||||
explained in more detail below.
|
||||
|
||||
If the exiting task is the last thread exiting its thread group,
|
||||
an additional record containing the per-tgid stats is also sent to userspace.
|
||||
The latter contains the sum of per-pid stats for all threads in the thread
|
||||
group, both past and present.
|
||||
|
||||
getdelays.c is a simple utility demonstrating usage of the taskstats interface
|
||||
for reporting delay accounting statistics. Users can register cpumasks,
|
||||
send commands and process responses, listen for per-tid/tgid exit data,
|
||||
write the data received to a file and do basic flow control by increasing
|
||||
receive buffer sizes.
|
||||
|
||||
Interface
|
||||
---------
|
||||
|
||||
The user-kernel interface is encapsulated in include/linux/taskstats.h
|
||||
|
||||
To avoid this documentation becoming obsolete as the interface evolves, only
|
||||
an outline of the current version is given. taskstats.h always overrides the
|
||||
description here.
|
||||
|
||||
struct taskstats is the common accounting structure for both per-pid and
|
||||
per-tgid data. It is versioned and can be extended by each accounting subsystem
|
||||
that is added to the kernel. The fields and their semantics are defined in the
|
||||
taskstats.h file.
|
||||
|
||||
The data exchanged between user and kernel space is a netlink message belonging
|
||||
to the NETLINK_GENERIC family and using the netlink attributes interface.
|
||||
The messages are in the format
|
||||
|
||||
+----------+- - -+-------------+-------------------+
|
||||
| nlmsghdr | Pad | genlmsghdr | taskstats payload |
|
||||
+----------+- - -+-------------+-------------------+
|
||||
|
||||
|
||||
The taskstats payload is one of the following three kinds:
|
||||
|
||||
1. Commands: Sent from user to kernel. Commands to get data on
|
||||
a pid/tgid consist of one attribute, of type TASKSTATS_CMD_ATTR_PID/TGID,
|
||||
containing a u32 pid or tgid in the attribute payload. The pid/tgid denotes
|
||||
the task/process for which userspace wants statistics.
|
||||
|
||||
Commands to register/deregister interest in exit data from a set of cpus
|
||||
consist of one attribute, of type
|
||||
TASKSTATS_CMD_ATTR_REGISTER/DEREGISTER_CPUMASK and contain a cpumask in the
|
||||
attribute payload. The cpumask is specified as an ascii string of
|
||||
comma-separated cpu ranges e.g. to listen to exit data from cpus 1,2,3,5,7,8
|
||||
the cpumask would be "1-3,5,7-8". If userspace forgets to deregister interest
|
||||
in cpus before closing the listening socket, the kernel cleans up its interest
|
||||
set over time. However, for the sake of efficiency, an explicit deregistration
|
||||
is advisable.
|
||||
|
||||
2. Response for a command: sent from the kernel in response to a userspace
|
||||
command. The payload is a series of three attributes of type:
|
||||
|
||||
a) TASKSTATS_TYPE_AGGR_PID/TGID : attribute containing no payload but indicates
|
||||
a pid/tgid will be followed by some stats.
|
||||
|
||||
b) TASKSTATS_TYPE_PID/TGID: attribute whose payload is the pid/tgid whose stats
|
||||
are being returned.
|
||||
|
||||
c) TASKSTATS_TYPE_STATS: attribute with a struct taskstats as payload. The
|
||||
same structure is used for both per-pid and per-tgid stats.
|
||||
|
||||
3. New message sent by kernel whenever a task exits. The payload consists of a
|
||||
series of attributes of the following type:
|
||||
|
||||
a) TASKSTATS_TYPE_AGGR_PID: indicates next two attributes will be pid+stats
|
||||
b) TASKSTATS_TYPE_PID: contains exiting task's pid
|
||||
c) TASKSTATS_TYPE_STATS: contains the exiting task's per-pid stats
|
||||
d) TASKSTATS_TYPE_AGGR_TGID: indicates next two attributes will be tgid+stats
|
||||
e) TASKSTATS_TYPE_TGID: contains tgid of process to which task belongs
|
||||
f) TASKSTATS_TYPE_STATS: contains the per-tgid stats for exiting task's process
|
||||
|
||||
|
||||
per-tgid stats
|
||||
--------------
|
||||
|
||||
Taskstats provides per-process stats, in addition to per-task stats, since
|
||||
resource management is often done at a process granularity and aggregating task
|
||||
stats in userspace alone is inefficient and potentially inaccurate (due to lack
|
||||
of atomicity).
|
||||
|
||||
However, maintaining per-process, in addition to per-task stats, within the
|
||||
kernel has space and time overheads. To address this, the taskstats code
|
||||
accumulates each exiting task's statistics into a process-wide data structure.
|
||||
When the last task of a process exits, the process level data accumulated also
|
||||
gets sent to userspace (along with the per-task data).
|
||||
|
||||
When a user queries to get per-tgid data, the sum of all other live threads in
|
||||
the group is added up and added to the accumulated total for previously exited
|
||||
threads of the same thread group.
|
||||
|
||||
Extending taskstats
|
||||
-------------------
|
||||
|
||||
There are two ways to extend the taskstats interface to export more
|
||||
per-task/process stats as patches to collect them get added to the kernel
|
||||
in future:
|
||||
|
||||
1. Adding more fields to the end of the existing struct taskstats. Backward
|
||||
compatibility is ensured by the version number within the
|
||||
structure. Userspace will use only the fields of the struct that correspond
|
||||
to the version its using.
|
||||
|
||||
2. Defining separate statistic structs and using the netlink attributes
|
||||
interface to return them. Since userspace processes each netlink attribute
|
||||
independently, it can always ignore attributes whose type it does not
|
||||
understand (because it is using an older version of the interface).
|
||||
|
||||
|
||||
Choosing between 1. and 2. is a matter of trading off flexibility and
|
||||
overhead. If only a few fields need to be added, then 1. is the preferable
|
||||
path since the kernel and userspace don't need to incur the overhead of
|
||||
processing new netlink attributes. But if the new fields expand the existing
|
||||
struct too much, requiring disparate userspace accounting utilities to
|
||||
unnecessarily receive large structures whose fields are of no interest, then
|
||||
extending the attributes structure would be worthwhile.
|
||||
|
||||
Flow control for taskstats
|
||||
--------------------------
|
||||
|
||||
When the rate of task exits becomes large, a listener may not be able to keep
|
||||
up with the kernel's rate of sending per-tid/tgid exit data leading to data
|
||||
loss. This possibility gets compounded when the taskstats structure gets
|
||||
extended and the number of cpus grows large.
|
||||
|
||||
To avoid losing statistics, userspace should do one or more of the following:
|
||||
|
||||
- increase the receive buffer sizes for the netlink sockets opened by
|
||||
listeners to receive exit data.
|
||||
|
||||
- create more listeners and reduce the number of cpus being listened to by
|
||||
each listener. In the extreme case, there could be one listener for each cpu.
|
||||
Users may also consider setting the cpu affinity of the listener to the subset
|
||||
of cpus to which it listens, especially if they are listening to just one cpu.
|
||||
|
||||
Despite these measures, if the userspace receives ENOBUFS error messages
|
||||
indicated overflow of receive buffers, it should take measures to handle the
|
||||
loss of data.
|
||||
|
||||
----
|
||||
123
Documentation/aoe/aoe.txt
Normal file
123
Documentation/aoe/aoe.txt
Normal file
@ -0,0 +1,123 @@
|
||||
The EtherDrive (R) HOWTO for users of 2.6 kernels is found at ...
|
||||
|
||||
http://www.coraid.com/support/linux/EtherDrive-2.6-HOWTO.html
|
||||
|
||||
It has many tips and hints!
|
||||
|
||||
The aoetools are userland programs that are designed to work with this
|
||||
driver. The aoetools are on sourceforge.
|
||||
|
||||
http://aoetools.sourceforge.net/
|
||||
|
||||
The scripts in this Documentation/aoe directory are intended to
|
||||
document the use of the driver and are not necessary if you install
|
||||
the aoetools.
|
||||
|
||||
|
||||
CREATING DEVICE NODES
|
||||
|
||||
Users of udev should find the block device nodes created
|
||||
automatically, but to create all the necessary device nodes, use the
|
||||
udev configuration rules provided in udev.txt (in this directory).
|
||||
|
||||
There is a udev-install.sh script that shows how to install these
|
||||
rules on your system.
|
||||
|
||||
If you are not using udev, two scripts are provided in
|
||||
Documentation/aoe as examples of static device node creation for
|
||||
using the aoe driver.
|
||||
|
||||
rm -rf /dev/etherd
|
||||
sh Documentation/aoe/mkdevs.sh /dev/etherd
|
||||
|
||||
... or to make just one shelf's worth of block device nodes ...
|
||||
|
||||
sh Documentation/aoe/mkshelf.sh /dev/etherd 0
|
||||
|
||||
There is also an autoload script that shows how to edit
|
||||
/etc/modprobe.conf to ensure that the aoe module is loaded when
|
||||
necessary.
|
||||
|
||||
USING DEVICE NODES
|
||||
|
||||
"cat /dev/etherd/err" blocks, waiting for error diagnostic output,
|
||||
like any retransmitted packets.
|
||||
|
||||
"echo eth2 eth4 > /dev/etherd/interfaces" tells the aoe driver to
|
||||
limit ATA over Ethernet traffic to eth2 and eth4. AoE traffic from
|
||||
untrusted networks should be ignored as a matter of security. See
|
||||
also the aoe_iflist driver option described below.
|
||||
|
||||
"echo > /dev/etherd/discover" tells the driver to find out what AoE
|
||||
devices are available.
|
||||
|
||||
These character devices may disappear and be replaced by sysfs
|
||||
counterparts. Using the commands in aoetools insulates users from
|
||||
these implementation details.
|
||||
|
||||
The block devices are named like this:
|
||||
|
||||
e{shelf}.{slot}
|
||||
e{shelf}.{slot}p{part}
|
||||
|
||||
... so that "e0.2" is the third blade from the left (slot 2) in the
|
||||
first shelf (shelf address zero). That's the whole disk. The first
|
||||
partition on that disk would be "e0.2p1".
|
||||
|
||||
USING SYSFS
|
||||
|
||||
Each aoe block device in /sys/block has the extra attributes of
|
||||
state, mac, and netif. The state attribute is "up" when the device
|
||||
is ready for I/O and "down" if detected but unusable. The
|
||||
"down,closewait" state shows that the device is still open and
|
||||
cannot come up again until it has been closed.
|
||||
|
||||
The mac attribute is the ethernet address of the remote AoE device.
|
||||
The netif attribute is the network interface on the localhost
|
||||
through which we are communicating with the remote AoE device.
|
||||
|
||||
There is a script in this directory that formats this information
|
||||
in a convenient way. Users with aoetools can use the aoe-stat
|
||||
command.
|
||||
|
||||
root@makki root# sh Documentation/aoe/status.sh
|
||||
e10.0 eth3 up
|
||||
e10.1 eth3 up
|
||||
e10.2 eth3 up
|
||||
e10.3 eth3 up
|
||||
e10.4 eth3 up
|
||||
e10.5 eth3 up
|
||||
e10.6 eth3 up
|
||||
e10.7 eth3 up
|
||||
e10.8 eth3 up
|
||||
e10.9 eth3 up
|
||||
e4.0 eth1 up
|
||||
e4.1 eth1 up
|
||||
e4.2 eth1 up
|
||||
e4.3 eth1 up
|
||||
e4.4 eth1 up
|
||||
e4.5 eth1 up
|
||||
e4.6 eth1 up
|
||||
e4.7 eth1 up
|
||||
e4.8 eth1 up
|
||||
e4.9 eth1 up
|
||||
|
||||
Use /sys/module/aoe/parameters/aoe_iflist (or better, the driver
|
||||
option discussed below) instead of /dev/etherd/interfaces to limit
|
||||
AoE traffic to the network interfaces in the given
|
||||
whitespace-separated list. Unlike the old character device, the
|
||||
sysfs entry can be read from as well as written to.
|
||||
|
||||
It's helpful to trigger discovery after setting the list of allowed
|
||||
interfaces. The aoetools package provides an aoe-discover script
|
||||
for this purpose. You can also directly use the
|
||||
/dev/etherd/discover special file described above.
|
||||
|
||||
DRIVER OPTIONS
|
||||
|
||||
There is a boot option for the built-in aoe driver and a
|
||||
corresponding module parameter, aoe_iflist. Without this option,
|
||||
all network interfaces may be used for ATA over Ethernet. Here is a
|
||||
usage example for the module parameter.
|
||||
|
||||
modprobe aoe_iflist="eth1 eth3"
|
||||
17
Documentation/aoe/autoload.sh
Normal file
17
Documentation/aoe/autoload.sh
Normal file
@ -0,0 +1,17 @@
|
||||
#!/bin/sh
|
||||
# set aoe to autoload by installing the
|
||||
# aliases in /etc/modprobe.conf
|
||||
|
||||
f=/etc/modprobe.conf
|
||||
|
||||
if test ! -r $f || test ! -w $f; then
|
||||
echo "cannot configure $f for module autoloading" 1>&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
grep major-152 $f >/dev/null
|
||||
if [ $? = 1 ]; then
|
||||
echo alias block-major-152 aoe >> $f
|
||||
echo alias char-major-152 aoe >> $f
|
||||
fi
|
||||
|
||||
39
Documentation/aoe/mkdevs.sh
Normal file
39
Documentation/aoe/mkdevs.sh
Normal file
@ -0,0 +1,39 @@
|
||||
#!/bin/sh
|
||||
|
||||
n_shelves=${n_shelves:-10}
|
||||
n_partitions=${n_partitions:-16}
|
||||
|
||||
if test "$#" != "1"; then
|
||||
echo "Usage: sh `basename $0` {dir}" 1>&2
|
||||
echo " n_partitions=16 sh `basename $0` {dir}" 1>&2
|
||||
exit 1
|
||||
fi
|
||||
dir=$1
|
||||
|
||||
MAJOR=152
|
||||
|
||||
echo "Creating AoE devnode files in $dir ..."
|
||||
|
||||
set -e
|
||||
|
||||
mkdir -p $dir
|
||||
|
||||
# (Status info is in sysfs. See status.sh.)
|
||||
# rm -f $dir/stat
|
||||
# mknod -m 0400 $dir/stat c $MAJOR 1
|
||||
rm -f $dir/err
|
||||
mknod -m 0400 $dir/err c $MAJOR 2
|
||||
rm -f $dir/discover
|
||||
mknod -m 0200 $dir/discover c $MAJOR 3
|
||||
rm -f $dir/interfaces
|
||||
mknod -m 0200 $dir/interfaces c $MAJOR 4
|
||||
rm -f $dir/revalidate
|
||||
mknod -m 0200 $dir/revalidate c $MAJOR 5
|
||||
|
||||
export n_partitions
|
||||
mkshelf=`echo $0 | sed 's!mkdevs!mkshelf!'`
|
||||
i=0
|
||||
while test $i -lt $n_shelves; do
|
||||
sh -xc "sh $mkshelf $dir $i"
|
||||
i=`expr $i + 1`
|
||||
done
|
||||
28
Documentation/aoe/mkshelf.sh
Normal file
28
Documentation/aoe/mkshelf.sh
Normal file
@ -0,0 +1,28 @@
|
||||
#! /bin/sh
|
||||
|
||||
if test "$#" != "2"; then
|
||||
echo "Usage: sh `basename $0` {dir} {shelfaddress}" 1>&2
|
||||
echo " n_partitions=16 sh `basename $0` {dir} {shelfaddress}" 1>&2
|
||||
exit 1
|
||||
fi
|
||||
n_partitions=${n_partitions:-16}
|
||||
dir=$1
|
||||
shelf=$2
|
||||
nslots=16
|
||||
maxslot=`echo $nslots 1 - p | dc`
|
||||
MAJOR=152
|
||||
|
||||
set -e
|
||||
|
||||
minor=`echo $nslots \* $shelf \* $n_partitions | bc`
|
||||
endp=`echo $n_partitions - 1 | bc`
|
||||
for slot in `seq 0 $maxslot`; do
|
||||
for part in `seq 0 $endp`; do
|
||||
name=e$shelf.$slot
|
||||
test "$part" != "0" && name=${name}p$part
|
||||
rm -f $dir/$name
|
||||
mknod -m 0660 $dir/$name b $MAJOR $minor
|
||||
|
||||
minor=`expr $minor + 1`
|
||||
done
|
||||
done
|
||||
27
Documentation/aoe/status.sh
Normal file
27
Documentation/aoe/status.sh
Normal file
@ -0,0 +1,27 @@
|
||||
#! /bin/sh
|
||||
# collate and present sysfs information about AoE storage
|
||||
|
||||
set -e
|
||||
format="%8s\t%8s\t%8s\n"
|
||||
me=`basename $0`
|
||||
sysd=${sysfs_dir:-/sys}
|
||||
|
||||
# printf "$format" device mac netif state
|
||||
|
||||
# Suse 9.1 Pro doesn't put /sys in /etc/mtab
|
||||
#test -z "`mount | grep sysfs`" && {
|
||||
test ! -d "$sysd/block" && {
|
||||
echo "$me Error: sysfs is not mounted" 1>&2
|
||||
exit 1
|
||||
}
|
||||
|
||||
for d in `ls -d $sysd/block/etherd* 2>/dev/null | grep -v p` end; do
|
||||
# maybe ls comes up empty, so we use "end"
|
||||
test $d = end && continue
|
||||
|
||||
dev=`echo "$d" | sed 's/.*!//'`
|
||||
printf "$format" \
|
||||
"$dev" \
|
||||
"`cat \"$d/netif\"`" \
|
||||
"`cat \"$d/state\"`"
|
||||
done | sort
|
||||
14
Documentation/aoe/todo.txt
Normal file
14
Documentation/aoe/todo.txt
Normal file
@ -0,0 +1,14 @@
|
||||
There is a potential for deadlock when allocating a struct sk_buff for
|
||||
data that needs to be written out to aoe storage. If the data is
|
||||
being written from a dirty page in order to free that page, and if
|
||||
there are no other pages available, then deadlock may occur when a
|
||||
free page is needed for the sk_buff allocation. This situation has
|
||||
not been observed, but it would be nice to eliminate any potential for
|
||||
deadlock under memory pressure.
|
||||
|
||||
Because ATA over Ethernet is not fragmented by the kernel's IP code,
|
||||
the destructor member of the struct sk_buff is available to the aoe
|
||||
driver. By using a mempool for allocating all but the first few
|
||||
sk_buffs, and by registering a destructor, we should be able to
|
||||
efficiently allocate sk_buffs without introducing any potential for
|
||||
deadlock.
|
||||
30
Documentation/aoe/udev-install.sh
Normal file
30
Documentation/aoe/udev-install.sh
Normal file
@ -0,0 +1,30 @@
|
||||
# install the aoe-specific udev rules from udev.txt into
|
||||
# the system's udev configuration
|
||||
#
|
||||
|
||||
me="`basename $0`"
|
||||
|
||||
# find udev.conf, often /etc/udev/udev.conf
|
||||
# (or environment can specify where to find udev.conf)
|
||||
#
|
||||
if test -z "$conf"; then
|
||||
if test -r /etc/udev/udev.conf; then
|
||||
conf=/etc/udev/udev.conf
|
||||
else
|
||||
conf="`find /etc -type f -name udev.conf 2> /dev/null`"
|
||||
if test -z "$conf" || test ! -r "$conf"; then
|
||||
echo "$me Error: no udev.conf found" 1>&2
|
||||
exit 1
|
||||
fi
|
||||
fi
|
||||
fi
|
||||
|
||||
# find the directory where udev rules are stored, often
|
||||
# /etc/udev/rules.d
|
||||
#
|
||||
rules_d="`sed -n '/^udev_rules=/{ s!udev_rules=!!; s!\"!!g; p; }' $conf`"
|
||||
if test -z "$rules_d" || test ! -d "$rules_d"; then
|
||||
echo "$me Error: cannot find udev rules directory" 1>&2
|
||||
exit 1
|
||||
fi
|
||||
sh -xc "cp `dirname $0`/udev.txt $rules_d/60-aoe.rules"
|
||||
24
Documentation/aoe/udev.txt
Normal file
24
Documentation/aoe/udev.txt
Normal file
@ -0,0 +1,24 @@
|
||||
# These rules tell udev what device nodes to create for aoe support.
|
||||
# They may be installed along the following lines (adjusted to what
|
||||
# you see on your system).
|
||||
#
|
||||
# ecashin@makki ~$ su
|
||||
# Password:
|
||||
# bash# find /etc -type f -name udev.conf
|
||||
# /etc/udev/udev.conf
|
||||
# bash# grep udev_rules= /etc/udev/udev.conf
|
||||
# udev_rules="/etc/udev/rules.d/"
|
||||
# bash# ls /etc/udev/rules.d/
|
||||
# 10-wacom.rules 50-udev.rules
|
||||
# bash# cp /path/to/linux-2.6.xx/Documentation/aoe/udev.txt \
|
||||
# /etc/udev/rules.d/60-aoe.rules
|
||||
#
|
||||
|
||||
# aoe char devices
|
||||
SUBSYSTEM="aoe", KERNEL="discover", NAME="etherd/%k", GROUP="disk", MODE="0220"
|
||||
SUBSYSTEM="aoe", KERNEL="err", NAME="etherd/%k", GROUP="disk", MODE="0440"
|
||||
SUBSYSTEM="aoe", KERNEL="interfaces", NAME="etherd/%k", GROUP="disk", MODE="0220"
|
||||
SUBSYSTEM="aoe", KERNEL="revalidate", NAME="etherd/%k", GROUP="disk", MODE="0220"
|
||||
|
||||
# aoe block devices
|
||||
KERNEL="etherd*", NAME="%k", GROUP="disk"
|
||||
454
Documentation/applying-patches.txt
Normal file
454
Documentation/applying-patches.txt
Normal file
@ -0,0 +1,454 @@
|
||||
|
||||
Applying Patches To The Linux Kernel
|
||||
------------------------------------
|
||||
|
||||
Original by: Jesper Juhl, August 2005
|
||||
Last update: 2006-01-05
|
||||
|
||||
|
||||
A frequently asked question on the Linux Kernel Mailing List is how to apply
|
||||
a patch to the kernel or, more specifically, what base kernel a patch for
|
||||
one of the many trees/branches should be applied to. Hopefully this document
|
||||
will explain this to you.
|
||||
|
||||
In addition to explaining how to apply and revert patches, a brief
|
||||
description of the different kernel trees (and examples of how to apply
|
||||
their specific patches) is also provided.
|
||||
|
||||
|
||||
What is a patch?
|
||||
---
|
||||
A patch is a small text document containing a delta of changes between two
|
||||
different versions of a source tree. Patches are created with the `diff'
|
||||
program.
|
||||
To correctly apply a patch you need to know what base it was generated from
|
||||
and what new version the patch will change the source tree into. These
|
||||
should both be present in the patch file metadata or be possible to deduce
|
||||
from the filename.
|
||||
|
||||
|
||||
How do I apply or revert a patch?
|
||||
---
|
||||
You apply a patch with the `patch' program. The patch program reads a diff
|
||||
(or patch) file and makes the changes to the source tree described in it.
|
||||
|
||||
Patches for the Linux kernel are generated relative to the parent directory
|
||||
holding the kernel source dir.
|
||||
|
||||
This means that paths to files inside the patch file contain the name of the
|
||||
kernel source directories it was generated against (or some other directory
|
||||
names like "a/" and "b/").
|
||||
Since this is unlikely to match the name of the kernel source dir on your
|
||||
local machine (but is often useful info to see what version an otherwise
|
||||
unlabeled patch was generated against) you should change into your kernel
|
||||
source directory and then strip the first element of the path from filenames
|
||||
in the patch file when applying it (the -p1 argument to `patch' does this).
|
||||
|
||||
To revert a previously applied patch, use the -R argument to patch.
|
||||
So, if you applied a patch like this:
|
||||
patch -p1 < ../patch-x.y.z
|
||||
|
||||
You can revert (undo) it like this:
|
||||
patch -R -p1 < ../patch-x.y.z
|
||||
|
||||
|
||||
How do I feed a patch/diff file to `patch'?
|
||||
---
|
||||
This (as usual with Linux and other UNIX like operating systems) can be
|
||||
done in several different ways.
|
||||
In all the examples below I feed the file (in uncompressed form) to patch
|
||||
via stdin using the following syntax:
|
||||
patch -p1 < path/to/patch-x.y.z
|
||||
|
||||
If you just want to be able to follow the examples below and don't want to
|
||||
know of more than one way to use patch, then you can stop reading this
|
||||
section here.
|
||||
|
||||
Patch can also get the name of the file to use via the -i argument, like
|
||||
this:
|
||||
patch -p1 -i path/to/patch-x.y.z
|
||||
|
||||
If your patch file is compressed with gzip or bzip2 and you don't want to
|
||||
uncompress it before applying it, then you can feed it to patch like this
|
||||
instead:
|
||||
zcat path/to/patch-x.y.z.gz | patch -p1
|
||||
bzcat path/to/patch-x.y.z.bz2 | patch -p1
|
||||
|
||||
If you wish to uncompress the patch file by hand first before applying it
|
||||
(what I assume you've done in the examples below), then you simply run
|
||||
gunzip or bunzip2 on the file -- like this:
|
||||
gunzip patch-x.y.z.gz
|
||||
bunzip2 patch-x.y.z.bz2
|
||||
|
||||
Which will leave you with a plain text patch-x.y.z file that you can feed to
|
||||
patch via stdin or the -i argument, as you prefer.
|
||||
|
||||
A few other nice arguments for patch are -s which causes patch to be silent
|
||||
except for errors which is nice to prevent errors from scrolling out of the
|
||||
screen too fast, and --dry-run which causes patch to just print a listing of
|
||||
what would happen, but doesn't actually make any changes. Finally --verbose
|
||||
tells patch to print more information about the work being done.
|
||||
|
||||
|
||||
Common errors when patching
|
||||
---
|
||||
When patch applies a patch file it attempts to verify the sanity of the
|
||||
file in different ways.
|
||||
Checking that the file looks like a valid patch file & checking the code
|
||||
around the bits being modified matches the context provided in the patch are
|
||||
just two of the basic sanity checks patch does.
|
||||
|
||||
If patch encounters something that doesn't look quite right it has two
|
||||
options. It can either refuse to apply the changes and abort or it can try
|
||||
to find a way to make the patch apply with a few minor changes.
|
||||
|
||||
One example of something that's not 'quite right' that patch will attempt to
|
||||
fix up is if all the context matches, the lines being changed match, but the
|
||||
line numbers are different. This can happen, for example, if the patch makes
|
||||
a change in the middle of the file but for some reasons a few lines have
|
||||
been added or removed near the beginning of the file. In that case
|
||||
everything looks good it has just moved up or down a bit, and patch will
|
||||
usually adjust the line numbers and apply the patch.
|
||||
|
||||
Whenever patch applies a patch that it had to modify a bit to make it fit
|
||||
it'll tell you about it by saying the patch applied with 'fuzz'.
|
||||
You should be wary of such changes since even though patch probably got it
|
||||
right it doesn't /always/ get it right, and the result will sometimes be
|
||||
wrong.
|
||||
|
||||
When patch encounters a change that it can't fix up with fuzz it rejects it
|
||||
outright and leaves a file with a .rej extension (a reject file). You can
|
||||
read this file to see exactly what change couldn't be applied, so you can
|
||||
go fix it up by hand if you wish.
|
||||
|
||||
If you don't have any third-party patches applied to your kernel source, but
|
||||
only patches from kernel.org and you apply the patches in the correct order,
|
||||
and have made no modifications yourself to the source files, then you should
|
||||
never see a fuzz or reject message from patch. If you do see such messages
|
||||
anyway, then there's a high risk that either your local source tree or the
|
||||
patch file is corrupted in some way. In that case you should probably try
|
||||
re-downloading the patch and if things are still not OK then you'd be advised
|
||||
to start with a fresh tree downloaded in full from kernel.org.
|
||||
|
||||
Let's look a bit more at some of the messages patch can produce.
|
||||
|
||||
If patch stops and presents a "File to patch:" prompt, then patch could not
|
||||
find a file to be patched. Most likely you forgot to specify -p1 or you are
|
||||
in the wrong directory. Less often, you'll find patches that need to be
|
||||
applied with -p0 instead of -p1 (reading the patch file should reveal if
|
||||
this is the case -- if so, then this is an error by the person who created
|
||||
the patch but is not fatal).
|
||||
|
||||
If you get "Hunk #2 succeeded at 1887 with fuzz 2 (offset 7 lines)." or a
|
||||
message similar to that, then it means that patch had to adjust the location
|
||||
of the change (in this example it needed to move 7 lines from where it
|
||||
expected to make the change to make it fit).
|
||||
The resulting file may or may not be OK, depending on the reason the file
|
||||
was different than expected.
|
||||
This often happens if you try to apply a patch that was generated against a
|
||||
different kernel version than the one you are trying to patch.
|
||||
|
||||
If you get a message like "Hunk #3 FAILED at 2387.", then it means that the
|
||||
patch could not be applied correctly and the patch program was unable to
|
||||
fuzz its way through. This will generate a .rej file with the change that
|
||||
caused the patch to fail and also a .orig file showing you the original
|
||||
content that couldn't be changed.
|
||||
|
||||
If you get "Reversed (or previously applied) patch detected! Assume -R? [n]"
|
||||
then patch detected that the change contained in the patch seems to have
|
||||
already been made.
|
||||
If you actually did apply this patch previously and you just re-applied it
|
||||
in error, then just say [n]o and abort this patch. If you applied this patch
|
||||
previously and actually intended to revert it, but forgot to specify -R,
|
||||
then you can say [y]es here to make patch revert it for you.
|
||||
This can also happen if the creator of the patch reversed the source and
|
||||
destination directories when creating the patch, and in that case reverting
|
||||
the patch will in fact apply it.
|
||||
|
||||
A message similar to "patch: **** unexpected end of file in patch" or "patch
|
||||
unexpectedly ends in middle of line" means that patch could make no sense of
|
||||
the file you fed to it. Either your download is broken, you tried to feed
|
||||
patch a compressed patch file without uncompressing it first, or the patch
|
||||
file that you are using has been mangled by a mail client or mail transfer
|
||||
agent along the way somewhere, e.g., by splitting a long line into two lines.
|
||||
Often these warnings can easily be fixed by joining (concatenating) the
|
||||
two lines that had been split.
|
||||
|
||||
As I already mentioned above, these errors should never happen if you apply
|
||||
a patch from kernel.org to the correct version of an unmodified source tree.
|
||||
So if you get these errors with kernel.org patches then you should probably
|
||||
assume that either your patch file or your tree is broken and I'd advise you
|
||||
to start over with a fresh download of a full kernel tree and the patch you
|
||||
wish to apply.
|
||||
|
||||
|
||||
Are there any alternatives to `patch'?
|
||||
---
|
||||
Yes there are alternatives.
|
||||
|
||||
You can use the `interdiff' program (http://cyberelk.net/tim/patchutils/) to
|
||||
generate a patch representing the differences between two patches and then
|
||||
apply the result.
|
||||
This will let you move from something like 2.6.12.2 to 2.6.12.3 in a single
|
||||
step. The -z flag to interdiff will even let you feed it patches in gzip or
|
||||
bzip2 compressed form directly without the use of zcat or bzcat or manual
|
||||
decompression.
|
||||
|
||||
Here's how you'd go from 2.6.12.2 to 2.6.12.3 in a single step:
|
||||
interdiff -z ../patch-2.6.12.2.bz2 ../patch-2.6.12.3.gz | patch -p1
|
||||
|
||||
Although interdiff may save you a step or two you are generally advised to
|
||||
do the additional steps since interdiff can get things wrong in some cases.
|
||||
|
||||
Another alternative is `ketchup', which is a python script for automatic
|
||||
downloading and applying of patches (http://www.selenic.com/ketchup/).
|
||||
|
||||
Other nice tools are diffstat, which shows a summary of changes made by a
|
||||
patch; lsdiff, which displays a short listing of affected files in a patch
|
||||
file, along with (optionally) the line numbers of the start of each patch;
|
||||
and grepdiff, which displays a list of the files modified by a patch where
|
||||
the patch contains a given regular expression.
|
||||
|
||||
|
||||
Where can I download the patches?
|
||||
---
|
||||
The patches are available at http://kernel.org/
|
||||
Most recent patches are linked from the front page, but they also have
|
||||
specific homes.
|
||||
|
||||
The 2.6.x.y (-stable) and 2.6.x patches live at
|
||||
ftp://ftp.kernel.org/pub/linux/kernel/v2.6/
|
||||
|
||||
The -rc patches live at
|
||||
ftp://ftp.kernel.org/pub/linux/kernel/v2.6/testing/
|
||||
|
||||
The -git patches live at
|
||||
ftp://ftp.kernel.org/pub/linux/kernel/v2.6/snapshots/
|
||||
|
||||
The -mm kernels live at
|
||||
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/
|
||||
|
||||
In place of ftp.kernel.org you can use ftp.cc.kernel.org, where cc is a
|
||||
country code. This way you'll be downloading from a mirror site that's most
|
||||
likely geographically closer to you, resulting in faster downloads for you,
|
||||
less bandwidth used globally and less load on the main kernel.org servers --
|
||||
these are good things, so do use mirrors when possible.
|
||||
|
||||
|
||||
The 2.6.x kernels
|
||||
---
|
||||
These are the base stable releases released by Linus. The highest numbered
|
||||
release is the most recent.
|
||||
|
||||
If regressions or other serious flaws are found, then a -stable fix patch
|
||||
will be released (see below) on top of this base. Once a new 2.6.x base
|
||||
kernel is released, a patch is made available that is a delta between the
|
||||
previous 2.6.x kernel and the new one.
|
||||
|
||||
To apply a patch moving from 2.6.11 to 2.6.12, you'd do the following (note
|
||||
that such patches do *NOT* apply on top of 2.6.x.y kernels but on top of the
|
||||
base 2.6.x kernel -- if you need to move from 2.6.x.y to 2.6.x+1 you need to
|
||||
first revert the 2.6.x.y patch).
|
||||
|
||||
Here are some examples:
|
||||
|
||||
# moving from 2.6.11 to 2.6.12
|
||||
$ cd ~/linux-2.6.11 # change to kernel source dir
|
||||
$ patch -p1 < ../patch-2.6.12 # apply the 2.6.12 patch
|
||||
$ cd ..
|
||||
$ mv linux-2.6.11 linux-2.6.12 # rename source dir
|
||||
|
||||
# moving from 2.6.11.1 to 2.6.12
|
||||
$ cd ~/linux-2.6.11.1 # change to kernel source dir
|
||||
$ patch -p1 -R < ../patch-2.6.11.1 # revert the 2.6.11.1 patch
|
||||
# source dir is now 2.6.11
|
||||
$ patch -p1 < ../patch-2.6.12 # apply new 2.6.12 patch
|
||||
$ cd ..
|
||||
$ mv linux-2.6.11.1 linux-2.6.12 # rename source dir
|
||||
|
||||
|
||||
The 2.6.x.y kernels
|
||||
---
|
||||
Kernels with 4-digit versions are -stable kernels. They contain small(ish)
|
||||
critical fixes for security problems or significant regressions discovered
|
||||
in a given 2.6.x kernel.
|
||||
|
||||
This is the recommended branch for users who want the most recent stable
|
||||
kernel and are not interested in helping test development/experimental
|
||||
versions.
|
||||
|
||||
If no 2.6.x.y kernel is available, then the highest numbered 2.6.x kernel is
|
||||
the current stable kernel.
|
||||
|
||||
note: the -stable team usually do make incremental patches available as well
|
||||
as patches against the latest mainline release, but I only cover the
|
||||
non-incremental ones below. The incremental ones can be found at
|
||||
ftp://ftp.kernel.org/pub/linux/kernel/v2.6/incr/
|
||||
|
||||
These patches are not incremental, meaning that for example the 2.6.12.3
|
||||
patch does not apply on top of the 2.6.12.2 kernel source, but rather on top
|
||||
of the base 2.6.12 kernel source .
|
||||
So, in order to apply the 2.6.12.3 patch to your existing 2.6.12.2 kernel
|
||||
source you have to first back out the 2.6.12.2 patch (so you are left with a
|
||||
base 2.6.12 kernel source) and then apply the new 2.6.12.3 patch.
|
||||
|
||||
Here's a small example:
|
||||
|
||||
$ cd ~/linux-2.6.12.2 # change into the kernel source dir
|
||||
$ patch -p1 -R < ../patch-2.6.12.2 # revert the 2.6.12.2 patch
|
||||
$ patch -p1 < ../patch-2.6.12.3 # apply the new 2.6.12.3 patch
|
||||
$ cd ..
|
||||
$ mv linux-2.6.12.2 linux-2.6.12.3 # rename the kernel source dir
|
||||
|
||||
|
||||
The -rc kernels
|
||||
---
|
||||
These are release-candidate kernels. These are development kernels released
|
||||
by Linus whenever he deems the current git (the kernel's source management
|
||||
tool) tree to be in a reasonably sane state adequate for testing.
|
||||
|
||||
These kernels are not stable and you should expect occasional breakage if
|
||||
you intend to run them. This is however the most stable of the main
|
||||
development branches and is also what will eventually turn into the next
|
||||
stable kernel, so it is important that it be tested by as many people as
|
||||
possible.
|
||||
|
||||
This is a good branch to run for people who want to help out testing
|
||||
development kernels but do not want to run some of the really experimental
|
||||
stuff (such people should see the sections about -git and -mm kernels below).
|
||||
|
||||
The -rc patches are not incremental, they apply to a base 2.6.x kernel, just
|
||||
like the 2.6.x.y patches described above. The kernel version before the -rcN
|
||||
suffix denotes the version of the kernel that this -rc kernel will eventually
|
||||
turn into.
|
||||
So, 2.6.13-rc5 means that this is the fifth release candidate for the 2.6.13
|
||||
kernel and the patch should be applied on top of the 2.6.12 kernel source.
|
||||
|
||||
Here are 3 examples of how to apply these patches:
|
||||
|
||||
# first an example of moving from 2.6.12 to 2.6.13-rc3
|
||||
$ cd ~/linux-2.6.12 # change into the 2.6.12 source dir
|
||||
$ patch -p1 < ../patch-2.6.13-rc3 # apply the 2.6.13-rc3 patch
|
||||
$ cd ..
|
||||
$ mv linux-2.6.12 linux-2.6.13-rc3 # rename the source dir
|
||||
|
||||
# now let's move from 2.6.13-rc3 to 2.6.13-rc5
|
||||
$ cd ~/linux-2.6.13-rc3 # change into the 2.6.13-rc3 dir
|
||||
$ patch -p1 -R < ../patch-2.6.13-rc3 # revert the 2.6.13-rc3 patch
|
||||
$ patch -p1 < ../patch-2.6.13-rc5 # apply the new 2.6.13-rc5 patch
|
||||
$ cd ..
|
||||
$ mv linux-2.6.13-rc3 linux-2.6.13-rc5 # rename the source dir
|
||||
|
||||
# finally let's try and move from 2.6.12.3 to 2.6.13-rc5
|
||||
$ cd ~/linux-2.6.12.3 # change to the kernel source dir
|
||||
$ patch -p1 -R < ../patch-2.6.12.3 # revert the 2.6.12.3 patch
|
||||
$ patch -p1 < ../patch-2.6.13-rc5 # apply new 2.6.13-rc5 patch
|
||||
$ cd ..
|
||||
$ mv linux-2.6.12.3 linux-2.6.13-rc5 # rename the kernel source dir
|
||||
|
||||
|
||||
The -git kernels
|
||||
---
|
||||
These are daily snapshots of Linus' kernel tree (managed in a git
|
||||
repository, hence the name).
|
||||
|
||||
These patches are usually released daily and represent the current state of
|
||||
Linus's tree. They are more experimental than -rc kernels since they are
|
||||
generated automatically without even a cursory glance to see if they are
|
||||
sane.
|
||||
|
||||
-git patches are not incremental and apply either to a base 2.6.x kernel or
|
||||
a base 2.6.x-rc kernel -- you can see which from their name.
|
||||
A patch named 2.6.12-git1 applies to the 2.6.12 kernel source and a patch
|
||||
named 2.6.13-rc3-git2 applies to the source of the 2.6.13-rc3 kernel.
|
||||
|
||||
Here are some examples of how to apply these patches:
|
||||
|
||||
# moving from 2.6.12 to 2.6.12-git1
|
||||
$ cd ~/linux-2.6.12 # change to the kernel source dir
|
||||
$ patch -p1 < ../patch-2.6.12-git1 # apply the 2.6.12-git1 patch
|
||||
$ cd ..
|
||||
$ mv linux-2.6.12 linux-2.6.12-git1 # rename the kernel source dir
|
||||
|
||||
# moving from 2.6.12-git1 to 2.6.13-rc2-git3
|
||||
$ cd ~/linux-2.6.12-git1 # change to the kernel source dir
|
||||
$ patch -p1 -R < ../patch-2.6.12-git1 # revert the 2.6.12-git1 patch
|
||||
# we now have a 2.6.12 kernel
|
||||
$ patch -p1 < ../patch-2.6.13-rc2 # apply the 2.6.13-rc2 patch
|
||||
# the kernel is now 2.6.13-rc2
|
||||
$ patch -p1 < ../patch-2.6.13-rc2-git3 # apply the 2.6.13-rc2-git3 patch
|
||||
# the kernel is now 2.6.13-rc2-git3
|
||||
$ cd ..
|
||||
$ mv linux-2.6.12-git1 linux-2.6.13-rc2-git3 # rename source dir
|
||||
|
||||
|
||||
The -mm kernels
|
||||
---
|
||||
These are experimental kernels released by Andrew Morton.
|
||||
|
||||
The -mm tree serves as a sort of proving ground for new features and other
|
||||
experimental patches.
|
||||
Once a patch has proved its worth in -mm for a while Andrew pushes it on to
|
||||
Linus for inclusion in mainline.
|
||||
|
||||
Although it's encouraged that patches flow to Linus via the -mm tree, this
|
||||
is not always enforced.
|
||||
Subsystem maintainers (or individuals) sometimes push their patches directly
|
||||
to Linus, even though (or after) they have been merged and tested in -mm (or
|
||||
sometimes even without prior testing in -mm).
|
||||
|
||||
You should generally strive to get your patches into mainline via -mm to
|
||||
ensure maximum testing.
|
||||
|
||||
This branch is in constant flux and contains many experimental features, a
|
||||
lot of debugging patches not appropriate for mainline etc., and is the most
|
||||
experimental of the branches described in this document.
|
||||
|
||||
These kernels are not appropriate for use on systems that are supposed to be
|
||||
stable and they are more risky to run than any of the other branches (make
|
||||
sure you have up-to-date backups -- that goes for any experimental kernel but
|
||||
even more so for -mm kernels).
|
||||
|
||||
These kernels in addition to all the other experimental patches they contain
|
||||
usually also contain any changes in the mainline -git kernels available at
|
||||
the time of release.
|
||||
|
||||
Testing of -mm kernels is greatly appreciated since the whole point of the
|
||||
tree is to weed out regressions, crashes, data corruption bugs, build
|
||||
breakage (and any other bug in general) before changes are merged into the
|
||||
more stable mainline Linus tree.
|
||||
But testers of -mm should be aware that breakage in this tree is more common
|
||||
than in any other tree.
|
||||
|
||||
The -mm kernels are not released on a fixed schedule, but usually a few -mm
|
||||
kernels are released in between each -rc kernel (1 to 3 is common).
|
||||
The -mm kernels apply to either a base 2.6.x kernel (when no -rc kernels
|
||||
have been released yet) or to a Linus -rc kernel.
|
||||
|
||||
Here are some examples of applying the -mm patches:
|
||||
|
||||
# moving from 2.6.12 to 2.6.12-mm1
|
||||
$ cd ~/linux-2.6.12 # change to the 2.6.12 source dir
|
||||
$ patch -p1 < ../2.6.12-mm1 # apply the 2.6.12-mm1 patch
|
||||
$ cd ..
|
||||
$ mv linux-2.6.12 linux-2.6.12-mm1 # rename the source appropriately
|
||||
|
||||
# moving from 2.6.12-mm1 to 2.6.13-rc3-mm3
|
||||
$ cd ~/linux-2.6.12-mm1
|
||||
$ patch -p1 -R < ../2.6.12-mm1 # revert the 2.6.12-mm1 patch
|
||||
# we now have a 2.6.12 source
|
||||
$ patch -p1 < ../patch-2.6.13-rc3 # apply the 2.6.13-rc3 patch
|
||||
# we now have a 2.6.13-rc3 source
|
||||
$ patch -p1 < ../2.6.13-rc3-mm3 # apply the 2.6.13-rc3-mm3 patch
|
||||
$ cd ..
|
||||
$ mv linux-2.6.12-mm1 linux-2.6.13-rc3-mm3 # rename the source dir
|
||||
|
||||
|
||||
This concludes this list of explanations of the various kernel trees.
|
||||
I hope you are now clear on how to apply the various patches and help testing
|
||||
the kernel.
|
||||
|
||||
Thank you's to Randy Dunlap, Rolf Eike Beer, Linus Torvalds, Bodo Eggert,
|
||||
Johannes Stezenbach, Grant Coady, Pavel Machek and others that I may have
|
||||
forgotten for their reviews and contributions to this document.
|
||||
|
||||
22
Documentation/arm/00-INDEX
Normal file
22
Documentation/arm/00-INDEX
Normal file
@ -0,0 +1,22 @@
|
||||
00-INDEX
|
||||
- this file
|
||||
Booting
|
||||
- requirements for booting
|
||||
Interrupts
|
||||
- ARM Interrupt subsystem documentation
|
||||
Netwinder
|
||||
- Netwinder specific documentation
|
||||
README
|
||||
- General ARM documentation
|
||||
SA1100
|
||||
- SA1100 documentation
|
||||
XScale
|
||||
- XScale documentation
|
||||
empeg
|
||||
- Empeg documentation
|
||||
mem_alignment
|
||||
- alignment abort handler documentation
|
||||
memory.txt
|
||||
- description of the virtual memory layout
|
||||
nwfpe
|
||||
- NWFPE floating point emulator documentation
|
||||
141
Documentation/arm/Booting
Normal file
141
Documentation/arm/Booting
Normal file
@ -0,0 +1,141 @@
|
||||
Booting ARM Linux
|
||||
=================
|
||||
|
||||
Author: Russell King
|
||||
Date : 18 May 2002
|
||||
|
||||
The following documentation is relevant to 2.4.18-rmk6 and beyond.
|
||||
|
||||
In order to boot ARM Linux, you require a boot loader, which is a small
|
||||
program that runs before the main kernel. The boot loader is expected
|
||||
to initialise various devices, and eventually call the Linux kernel,
|
||||
passing information to the kernel.
|
||||
|
||||
Essentially, the boot loader should provide (as a minimum) the
|
||||
following:
|
||||
|
||||
1. Setup and initialise the RAM.
|
||||
2. Initialise one serial port.
|
||||
3. Detect the machine type.
|
||||
4. Setup the kernel tagged list.
|
||||
5. Call the kernel image.
|
||||
|
||||
|
||||
1. Setup and initialise RAM
|
||||
---------------------------
|
||||
|
||||
Existing boot loaders: MANDATORY
|
||||
New boot loaders: MANDATORY
|
||||
|
||||
The boot loader is expected to find and initialise all RAM that the
|
||||
kernel will use for volatile data storage in the system. It performs
|
||||
this in a machine dependent manner. (It may use internal algorithms
|
||||
to automatically locate and size all RAM, or it may use knowledge of
|
||||
the RAM in the machine, or any other method the boot loader designer
|
||||
sees fit.)
|
||||
|
||||
|
||||
2. Initialise one serial port
|
||||
-----------------------------
|
||||
|
||||
Existing boot loaders: OPTIONAL, RECOMMENDED
|
||||
New boot loaders: OPTIONAL, RECOMMENDED
|
||||
|
||||
The boot loader should initialise and enable one serial port on the
|
||||
target. This allows the kernel serial driver to automatically detect
|
||||
which serial port it should use for the kernel console (generally
|
||||
used for debugging purposes, or communication with the target.)
|
||||
|
||||
As an alternative, the boot loader can pass the relevant 'console='
|
||||
option to the kernel via the tagged lists specifying the port, and
|
||||
serial format options as described in
|
||||
|
||||
Documentation/kernel-parameters.txt.
|
||||
|
||||
|
||||
3. Detect the machine type
|
||||
--------------------------
|
||||
|
||||
Existing boot loaders: OPTIONAL
|
||||
New boot loaders: MANDATORY
|
||||
|
||||
The boot loader should detect the machine type its running on by some
|
||||
method. Whether this is a hard coded value or some algorithm that
|
||||
looks at the connected hardware is beyond the scope of this document.
|
||||
The boot loader must ultimately be able to provide a MACH_TYPE_xxx
|
||||
value to the kernel. (see linux/arch/arm/tools/mach-types).
|
||||
|
||||
|
||||
4. Setup the kernel tagged list
|
||||
-------------------------------
|
||||
|
||||
Existing boot loaders: OPTIONAL, HIGHLY RECOMMENDED
|
||||
New boot loaders: MANDATORY
|
||||
|
||||
The boot loader must create and initialise the kernel tagged list.
|
||||
A valid tagged list starts with ATAG_CORE and ends with ATAG_NONE.
|
||||
The ATAG_CORE tag may or may not be empty. An empty ATAG_CORE tag
|
||||
has the size field set to '2' (0x00000002). The ATAG_NONE must set
|
||||
the size field to zero.
|
||||
|
||||
Any number of tags can be placed in the list. It is undefined
|
||||
whether a repeated tag appends to the information carried by the
|
||||
previous tag, or whether it replaces the information in its
|
||||
entirety; some tags behave as the former, others the latter.
|
||||
|
||||
The boot loader must pass at a minimum the size and location of
|
||||
the system memory, and root filesystem location. Therefore, the
|
||||
minimum tagged list should look:
|
||||
|
||||
+-----------+
|
||||
base -> | ATAG_CORE | |
|
||||
+-----------+ |
|
||||
| ATAG_MEM | | increasing address
|
||||
+-----------+ |
|
||||
| ATAG_NONE | |
|
||||
+-----------+ v
|
||||
|
||||
The tagged list should be stored in system RAM.
|
||||
|
||||
The tagged list must be placed in a region of memory where neither
|
||||
the kernel decompressor nor initrd 'bootp' program will overwrite
|
||||
it. The recommended placement is in the first 16KiB of RAM.
|
||||
|
||||
5. Calling the kernel image
|
||||
---------------------------
|
||||
|
||||
Existing boot loaders: MANDATORY
|
||||
New boot loaders: MANDATORY
|
||||
|
||||
There are two options for calling the kernel zImage. If the zImage
|
||||
is stored in flash, and is linked correctly to be run from flash,
|
||||
then it is legal for the boot loader to call the zImage in flash
|
||||
directly.
|
||||
|
||||
The zImage may also be placed in system RAM (at any location) and
|
||||
called there. Note that the kernel uses 16K of RAM below the image
|
||||
to store page tables. The recommended placement is 32KiB into RAM.
|
||||
|
||||
In either case, the following conditions must be met:
|
||||
|
||||
- Quiesce all DMA capable devices so that memory does not get
|
||||
corrupted by bogus network packets or disk data. This will save
|
||||
you many hours of debug.
|
||||
|
||||
- CPU register settings
|
||||
r0 = 0,
|
||||
r1 = machine type number discovered in (3) above.
|
||||
r2 = physical address of tagged list in system RAM.
|
||||
|
||||
- CPU mode
|
||||
All forms of interrupts must be disabled (IRQs and FIQs)
|
||||
The CPU must be in SVC mode. (A special exception exists for Angel)
|
||||
|
||||
- Caches, MMUs
|
||||
The MMU must be off.
|
||||
Instruction cache may be on or off.
|
||||
Data cache must be off.
|
||||
|
||||
- The boot loader is expected to call the kernel image by jumping
|
||||
directly to the first instruction of the kernel image.
|
||||
|
||||
69
Documentation/arm/IXP2000
Normal file
69
Documentation/arm/IXP2000
Normal file
@ -0,0 +1,69 @@
|
||||
|
||||
-------------------------------------------------------------------------
|
||||
Release Notes for Linux on Intel's IXP2000 Network Processor
|
||||
|
||||
Maintained by Deepak Saxena <dsaxena@plexity.net>
|
||||
-------------------------------------------------------------------------
|
||||
|
||||
1. Overview
|
||||
|
||||
Intel's IXP2000 family of NPUs (IXP2400, IXP2800, IXP2850) is designed
|
||||
for high-performance network applications such high-availability
|
||||
telecom systems. In addition to an XScale core, it contains up to 8
|
||||
"MicroEngines" that run special code, several high-end networking
|
||||
interfaces (UTOPIA, SPI, etc), a PCI host bridge, one serial port,
|
||||
flash interface, and some other odds and ends. For more information, see:
|
||||
|
||||
http://developer.intel.com/design/network/products/npfamily/ixp2xxx.htm
|
||||
|
||||
2. Linux Support
|
||||
|
||||
Linux currently supports the following features on the IXP2000 NPUs:
|
||||
|
||||
- On-chip serial
|
||||
- PCI
|
||||
- Flash (MTD/JFFS2)
|
||||
- I2C through GPIO
|
||||
- Timers (watchdog, OS)
|
||||
|
||||
That is about all we can support under Linux ATM b/c the core networking
|
||||
components of the chip are accessed via Intel's closed source SDK.
|
||||
Please contact Intel directly on issues with using those. There is
|
||||
also a mailing list run by some folks at Princeton University that might
|
||||
be of help: https://lists.cs.princeton.edu/mailman/listinfo/ixp2xxx
|
||||
|
||||
WHATEVER YOU DO, DO NOT POST EMAIL TO THE LINUX-ARM OR LINUX-ARM-KERNEL
|
||||
MAILING LISTS REGARDING THE INTEL SDK.
|
||||
|
||||
3. Supported Platforms
|
||||
|
||||
- Intel IXDP2400 Reference Platform
|
||||
- Intel IXDP2800 Reference Platform
|
||||
- Intel IXDP2401 Reference Platform
|
||||
- Intel IXDP2801 Reference Platform
|
||||
- RadiSys ENP-2611
|
||||
|
||||
4. Usage Notes
|
||||
|
||||
- The IXP2000 platforms usually have rather complex PCI bus topologies
|
||||
with large memory space requirements. In addition, b/c of the way the
|
||||
Intel SDK is designed, devices are enumerated in a very specific
|
||||
way. B/c of this this, we use "pci=firmware" option in the kernel
|
||||
command line so that we do not re-enumerate the bus.
|
||||
|
||||
- IXDP2x01 systems have variable clock tick rates that we cannot determine
|
||||
via HW registers. The "ixdp2x01_clk=XXX" cmd line options allow you
|
||||
to pass the clock rate to the board port.
|
||||
|
||||
5. Thanks
|
||||
|
||||
The IXP2000 work has been funded by Intel Corp. and MontaVista Software, Inc.
|
||||
|
||||
The following people have contributed patches/comments/etc:
|
||||
|
||||
Naeem F. Afzal
|
||||
Lennert Buytenhek
|
||||
Jeffrey Daly
|
||||
|
||||
-------------------------------------------------------------------------
|
||||
Last Update: 8/09/2004
|
||||
174
Documentation/arm/IXP4xx
Normal file
174
Documentation/arm/IXP4xx
Normal file
@ -0,0 +1,174 @@
|
||||
|
||||
-------------------------------------------------------------------------
|
||||
Release Notes for Linux on Intel's IXP4xx Network Processor
|
||||
|
||||
Maintained by Deepak Saxena <dsaxena@plexity.net>
|
||||
-------------------------------------------------------------------------
|
||||
|
||||
1. Overview
|
||||
|
||||
Intel's IXP4xx network processor is a highly integrated SOC that
|
||||
is targeted for network applications, though it has become popular
|
||||
in industrial control and other areas due to low cost and power
|
||||
consumption. The IXP4xx family currently consists of several processors
|
||||
that support different network offload functions such as encryption,
|
||||
routing, firewalling, etc. The IXP46x family is an updated version which
|
||||
supports faster speeds, new memory and flash configurations, and more
|
||||
integration such as an on-chip I2C controller.
|
||||
|
||||
For more information on the various versions of the CPU, see:
|
||||
|
||||
http://developer.intel.com/design/network/products/npfamily/ixp4xx.htm
|
||||
|
||||
Intel also made the IXCP1100 CPU for sometime which is an IXP4xx
|
||||
stripped of much of the network intelligence.
|
||||
|
||||
2. Linux Support
|
||||
|
||||
Linux currently supports the following features on the IXP4xx chips:
|
||||
|
||||
- Dual serial ports
|
||||
- PCI interface
|
||||
- Flash access (MTD/JFFS)
|
||||
- I2C through GPIO on IXP42x
|
||||
- GPIO for input/output/interrupts
|
||||
See include/asm-arm/arch-ixp4xx/platform.h for access functions.
|
||||
- Timers (watchdog, OS)
|
||||
|
||||
The following components of the chips are not supported by Linux and
|
||||
require the use of Intel's propietary CSR softare:
|
||||
|
||||
- USB device interface
|
||||
- Network interfaces (HSS, Utopia, NPEs, etc)
|
||||
- Network offload functionality
|
||||
|
||||
If you need to use any of the above, you need to download Intel's
|
||||
software from:
|
||||
|
||||
http://developer.intel.com/design/network/products/npfamily/ixp425swr1.htm
|
||||
|
||||
DO NOT POST QUESTIONS TO THE LINUX MAILING LISTS REGARDING THE PROPIETARY
|
||||
SOFTWARE.
|
||||
|
||||
There are several websites that provide directions/pointers on using
|
||||
Intel's software:
|
||||
|
||||
http://ixp4xx-osdg.sourceforge.net/
|
||||
Open Source Developer's Guide for using uClinux and the Intel libraries
|
||||
|
||||
http://gatewaymaker.sourceforge.net/
|
||||
Simple one page summary of building a gateway using an IXP425 and Linux
|
||||
|
||||
http://ixp425.sourceforge.net/
|
||||
ATM device driver for IXP425 that relies on Intel's libraries
|
||||
|
||||
3. Known Issues/Limitations
|
||||
|
||||
3a. Limited inbound PCI window
|
||||
|
||||
The IXP4xx family allows for up to 256MB of memory but the PCI interface
|
||||
can only expose 64MB of that memory to the PCI bus. This means that if
|
||||
you are running with > 64MB, all PCI buffers outside of the accessible
|
||||
range will be bounced using the routines in arch/arm/common/dmabounce.c.
|
||||
|
||||
3b. Limited outbound PCI window
|
||||
|
||||
IXP4xx provides two methods of accessing PCI memory space:
|
||||
|
||||
1) A direct mapped window from 0x48000000 to 0x4bffffff (64MB).
|
||||
To access PCI via this space, we simply ioremap() the BAR
|
||||
into the kernel and we can use the standard read[bwl]/write[bwl]
|
||||
macros. This is the preffered method due to speed but it
|
||||
limits the system to just 64MB of PCI memory. This can be
|
||||
problamatic if using video cards and other memory-heavy devices.
|
||||
|
||||
2) If > 64MB of memory space is required, the IXP4xx can be
|
||||
configured to use indirect registers to access PCI This allows
|
||||
for up to 128MB (0x48000000 to 0x4fffffff) of memory on the bus.
|
||||
The disadvantage of this is that every PCI access requires
|
||||
three local register accesses plus a spinlock, but in some
|
||||
cases the performance hit is acceptable. In addition, you cannot
|
||||
mmap() PCI devices in this case due to the indirect nature
|
||||
of the PCI window.
|
||||
|
||||
By default, the direct method is used for performance reasons. If
|
||||
you need more PCI memory, enable the IXP4XX_INDIRECT_PCI config option.
|
||||
|
||||
3c. GPIO as Interrupts
|
||||
|
||||
Currently the code only handles level-sensitive GPIO interrupts
|
||||
|
||||
4. Supported platforms
|
||||
|
||||
ADI Engineering Coyote Gateway Reference Platform
|
||||
http://www.adiengineering.com/productsCoyote.html
|
||||
|
||||
The ADI Coyote platform is reference design for those building
|
||||
small residential/office gateways. One NPE is connected to a 10/100
|
||||
interface, one to 4-port 10/100 switch, and the third to and ADSL
|
||||
interface. In addition, it also supports to POTs interfaces connected
|
||||
via SLICs. Note that those are not supported by Linux ATM. Finally,
|
||||
the platform has two mini-PCI slots used for 802.11[bga] cards.
|
||||
Finally, there is an IDE port hanging off the expansion bus.
|
||||
|
||||
Gateworks Avila Network Platform
|
||||
http://www.gateworks.com/avila_sbc.htm
|
||||
|
||||
The Avila platform is basically and IXDP425 with the 4 PCI slots
|
||||
replaced with mini-PCI slots and a CF IDE interface hanging off
|
||||
the expansion bus.
|
||||
|
||||
Intel IXDP425 Development Platform
|
||||
http://developer.intel.com/design/network/products/npfamily/ixdp425.htm
|
||||
|
||||
This is Intel's standard reference platform for the IXDP425 and is
|
||||
also known as the Richfield board. It contains 4 PCI slots, 16MB
|
||||
of flash, two 10/100 ports and one ADSL port.
|
||||
|
||||
Intel IXDP465 Development Platform
|
||||
http://developer.intel.com/design/network/products/npfamily/ixdp465.htm
|
||||
|
||||
This is basically an IXDP425 with an IXP465 and 32M of flash instead
|
||||
of just 16.
|
||||
|
||||
Intel IXDPG425 Development Platform
|
||||
|
||||
This is basically and ADI Coyote board with a NEC EHCI controller
|
||||
added. One issue with this board is that the mini-PCI slots only
|
||||
have the 3.3v line connected, so you can't use a PCI to mini-PCI
|
||||
adapter with an E100 card. So to NFS root you need to use either
|
||||
the CSR or a WiFi card and a ramdisk that BOOTPs and then does
|
||||
a pivot_root to NFS.
|
||||
|
||||
Motorola PrPMC1100 Processor Mezanine Card
|
||||
http://www.fountainsys.com/datasheet/PrPMC1100.pdf
|
||||
|
||||
The PrPMC1100 is based on the IXCP1100 and is meant to plug into
|
||||
and IXP2400/2800 system to act as the system controller. It simply
|
||||
contains a CPU and 16MB of flash on the board and needs to be
|
||||
plugged into a carrier board to function. Currently Linux only
|
||||
supports the Motorola PrPMC carrier board for this platform.
|
||||
See https://mcg.motorola.com/us/ds/pdf/ds0144.pdf for info
|
||||
on the carrier board.
|
||||
|
||||
5. TODO LIST
|
||||
|
||||
- Add support for Coyote IDE
|
||||
- Add support for edge-based GPIO interrupts
|
||||
- Add support for CF IDE on expansion bus
|
||||
|
||||
6. Thanks
|
||||
|
||||
The IXP4xx work has been funded by Intel Corp. and MontaVista Software, Inc.
|
||||
|
||||
The following people have contributed patches/comments/etc:
|
||||
|
||||
Lennerty Buytenhek
|
||||
Lutz Jaenicke
|
||||
Justin Mayfield
|
||||
Robert E. Ranslam
|
||||
[I know I've forgotten others, please email me to be added]
|
||||
|
||||
-------------------------------------------------------------------------
|
||||
|
||||
Last Update: 01/04/2005
|
||||
173
Documentation/arm/Interrupts
Normal file
173
Documentation/arm/Interrupts
Normal file
@ -0,0 +1,173 @@
|
||||
2.5.2-rmk5
|
||||
----------
|
||||
|
||||
This is the first kernel that contains a major shake up of some of the
|
||||
major architecture-specific subsystems.
|
||||
|
||||
Firstly, it contains some pretty major changes to the way we handle the
|
||||
MMU TLB. Each MMU TLB variant is now handled completely separately -
|
||||
we have TLB v3, TLB v4 (without write buffer), TLB v4 (with write buffer),
|
||||
and finally TLB v4 (with write buffer, with I TLB invalidate entry).
|
||||
There is more assembly code inside each of these functions, mainly to
|
||||
allow more flexible TLB handling for the future.
|
||||
|
||||
Secondly, the IRQ subsystem.
|
||||
|
||||
The 2.5 kernels will be having major changes to the way IRQs are handled.
|
||||
Unfortunately, this means that machine types that touch the irq_desc[]
|
||||
array (basically all machine types) will break, and this means every
|
||||
machine type that we currently have.
|
||||
|
||||
Lets take an example. On the Assabet with Neponset, we have:
|
||||
|
||||
GPIO25 IRR:2
|
||||
SA1100 ------------> Neponset -----------> SA1111
|
||||
IIR:1
|
||||
-----------> USAR
|
||||
IIR:0
|
||||
-----------> SMC9196
|
||||
|
||||
The way stuff currently works, all SA1111 interrupts are mutually
|
||||
exclusive of each other - if you're processing one interrupt from the
|
||||
SA1111 and another comes in, you have to wait for that interrupt to
|
||||
finish processing before you can service the new interrupt. Eg, an
|
||||
IDE PIO-based interrupt on the SA1111 excludes all other SA1111 and
|
||||
SMC9196 interrupts until it has finished transferring its multi-sector
|
||||
data, which can be a long time. Note also that since we loop in the
|
||||
SA1111 IRQ handler, SA1111 IRQs can hold off SMC9196 IRQs indefinitely.
|
||||
|
||||
|
||||
The new approach brings several new ideas...
|
||||
|
||||
We introduce the concept of a "parent" and a "child". For example,
|
||||
to the Neponset handler, the "parent" is GPIO25, and the "children"d
|
||||
are SA1111, SMC9196 and USAR.
|
||||
|
||||
We also bring the idea of an IRQ "chip" (mainly to reduce the size of
|
||||
the irqdesc array). This doesn't have to be a real "IC"; indeed the
|
||||
SA11x0 IRQs are handled by two separate "chip" structures, one for
|
||||
GPIO0-10, and another for all the rest. It is just a container for
|
||||
the various operations (maybe this'll change to a better name).
|
||||
This structure has the following operations:
|
||||
|
||||
struct irqchip {
|
||||
/*
|
||||
* Acknowledge the IRQ.
|
||||
* If this is a level-based IRQ, then it is expected to mask the IRQ
|
||||
* as well.
|
||||
*/
|
||||
void (*ack)(unsigned int irq);
|
||||
/*
|
||||
* Mask the IRQ in hardware.
|
||||
*/
|
||||
void (*mask)(unsigned int irq);
|
||||
/*
|
||||
* Unmask the IRQ in hardware.
|
||||
*/
|
||||
void (*unmask)(unsigned int irq);
|
||||
/*
|
||||
* Re-run the IRQ
|
||||
*/
|
||||
void (*rerun)(unsigned int irq);
|
||||
/*
|
||||
* Set the type of the IRQ.
|
||||
*/
|
||||
int (*type)(unsigned int irq, unsigned int, type);
|
||||
};
|
||||
|
||||
ack - required. May be the same function as mask for IRQs
|
||||
handled by do_level_IRQ.
|
||||
mask - required.
|
||||
unmask - required.
|
||||
rerun - optional. Not required if you're using do_level_IRQ for all
|
||||
IRQs that use this 'irqchip'. Generally expected to re-trigger
|
||||
the hardware IRQ if possible. If not, may call the handler
|
||||
directly.
|
||||
type - optional. If you don't support changing the type of an IRQ,
|
||||
it should be null so people can detect if they are unable to
|
||||
set the IRQ type.
|
||||
|
||||
For each IRQ, we keep the following information:
|
||||
|
||||
- "disable" depth (number of disable_irq()s without enable_irq()s)
|
||||
- flags indicating what we can do with this IRQ (valid, probe,
|
||||
noautounmask) as before
|
||||
- status of the IRQ (probing, enable, etc)
|
||||
- chip
|
||||
- per-IRQ handler
|
||||
- irqaction structure list
|
||||
|
||||
The handler can be one of the 3 standard handlers - "level", "edge" and
|
||||
"simple", or your own specific handler if you need to do something special.
|
||||
|
||||
The "level" handler is what we currently have - its pretty simple.
|
||||
"edge" knows about the brokenness of such IRQ implementations - that you
|
||||
need to leave the hardware IRQ enabled while processing it, and queueing
|
||||
further IRQ events should the IRQ happen again while processing. The
|
||||
"simple" handler is very basic, and does not perform any hardware
|
||||
manipulation, nor state tracking. This is useful for things like the
|
||||
SMC9196 and USAR above.
|
||||
|
||||
So, what's changed?
|
||||
|
||||
1. Machine implementations must not write to the irqdesc array.
|
||||
|
||||
2. New functions to manipulate the irqdesc array. The first 4 are expected
|
||||
to be useful only to machine specific code. The last is recommended to
|
||||
only be used by machine specific code, but may be used in drivers if
|
||||
absolutely necessary.
|
||||
|
||||
set_irq_chip(irq,chip)
|
||||
|
||||
Set the mask/unmask methods for handling this IRQ
|
||||
|
||||
set_irq_handler(irq,handler)
|
||||
|
||||
Set the handler for this IRQ (level, edge, simple)
|
||||
|
||||
set_irq_chained_handler(irq,handler)
|
||||
|
||||
Set a "chained" handler for this IRQ - automatically
|
||||
enables this IRQ (eg, Neponset and SA1111 handlers).
|
||||
|
||||
set_irq_flags(irq,flags)
|
||||
|
||||
Set the valid/probe/noautoenable flags.
|
||||
|
||||
set_irq_type(irq,type)
|
||||
|
||||
Set active the IRQ edge(s)/level. This replaces the
|
||||
SA1111 INTPOL manipulation, and the set_GPIO_IRQ_edge()
|
||||
function. Type should be one of the following:
|
||||
|
||||
#define IRQT_NOEDGE (0)
|
||||
#define IRQT_RISING (__IRQT_RISEDGE)
|
||||
#define IRQT_FALLING (__IRQT_FALEDGE)
|
||||
#define IRQT_BOTHEDGE (__IRQT_RISEDGE|__IRQT_FALEDGE)
|
||||
#define IRQT_LOW (__IRQT_LOWLVL)
|
||||
#define IRQT_HIGH (__IRQT_HIGHLVL)
|
||||
|
||||
3. set_GPIO_IRQ_edge() is obsolete, and should be replaced by set_irq_type.
|
||||
|
||||
4. Direct access to SA1111 INTPOL is depreciated. Use set_irq_type instead.
|
||||
|
||||
5. A handler is expected to perform any necessary acknowledgement of the
|
||||
parent IRQ via the correct chip specific function. For instance, if
|
||||
the SA1111 is directly connected to a SA1110 GPIO, then you should
|
||||
acknowledge the SA1110 IRQ each time you re-read the SA1111 IRQ status.
|
||||
|
||||
6. For any child which doesn't have its own IRQ enable/disable controls
|
||||
(eg, SMC9196), the handler must mask or acknowledge the parent IRQ
|
||||
while the child handler is called, and the child handler should be the
|
||||
"simple" handler (not "edge" nor "level"). After the handler completes,
|
||||
the parent IRQ should be unmasked, and the status of all children must
|
||||
be re-checked for pending events. (see the Neponset IRQ handler for
|
||||
details).
|
||||
|
||||
7. fixup_irq() is gone, as is include/asm-arm/arch-*/irq.h
|
||||
|
||||
Please note that this will not solve all problems - some of them are
|
||||
hardware based. Mixing level-based and edge-based IRQs on the same
|
||||
parent signal (eg neponset) is one such area where a software based
|
||||
solution can't provide the full answer to low IRQ latency.
|
||||
|
||||
78
Documentation/arm/Netwinder
Normal file
78
Documentation/arm/Netwinder
Normal file
@ -0,0 +1,78 @@
|
||||
NetWinder specific documentation
|
||||
================================
|
||||
|
||||
The NetWinder is a small low-power computer, primarily designed
|
||||
to run Linux. It is based around the StrongARM RISC processor,
|
||||
DC21285 PCI bridge, with PC-type hardware glued around it.
|
||||
|
||||
Port usage
|
||||
==========
|
||||
|
||||
Min - Max Description
|
||||
---------------------------
|
||||
0x0000 - 0x000f DMA1
|
||||
0x0020 - 0x0021 PIC1
|
||||
0x0060 - 0x006f Keyboard
|
||||
0x0070 - 0x007f RTC
|
||||
0x0080 - 0x0087 DMA1
|
||||
0x0088 - 0x008f DMA2
|
||||
0x00a0 - 0x00a3 PIC2
|
||||
0x00c0 - 0x00df DMA2
|
||||
0x0180 - 0x0187 IRDA
|
||||
0x01f0 - 0x01f6 ide0
|
||||
0x0201 Game port
|
||||
0x0203 RWA010 configuration read
|
||||
0x0220 - ? SoundBlaster
|
||||
0x0250 - ? WaveArtist
|
||||
0x0279 RWA010 configuration index
|
||||
0x02f8 - 0x02ff Serial ttyS1
|
||||
0x0300 - 0x031f Ether10
|
||||
0x0338 GPIO1
|
||||
0x033a GPIO2
|
||||
0x0370 - 0x0371 W83977F configuration registers
|
||||
0x0388 - ? AdLib
|
||||
0x03c0 - 0x03df VGA
|
||||
0x03f6 ide0
|
||||
0x03f8 - 0x03ff Serial ttyS0
|
||||
0x0400 - 0x0408 DC21143
|
||||
0x0480 - 0x0487 DMA1
|
||||
0x0488 - 0x048f DMA2
|
||||
0x0a79 RWA010 configuration write
|
||||
0xe800 - 0xe80f ide0/ide1 BM DMA
|
||||
|
||||
|
||||
Interrupt usage
|
||||
===============
|
||||
|
||||
IRQ type Description
|
||||
---------------------------
|
||||
0 ISA 100Hz timer
|
||||
1 ISA Keyboard
|
||||
2 ISA cascade
|
||||
3 ISA Serial ttyS1
|
||||
4 ISA Serial ttyS0
|
||||
5 ISA PS/2 mouse
|
||||
6 ISA IRDA
|
||||
7 ISA Printer
|
||||
8 ISA RTC alarm
|
||||
9 ISA
|
||||
10 ISA GP10 (Orange reset button)
|
||||
11 ISA
|
||||
12 ISA WaveArtist
|
||||
13 ISA
|
||||
14 ISA hda1
|
||||
15 ISA
|
||||
|
||||
DMA usage
|
||||
=========
|
||||
|
||||
DMA type Description
|
||||
---------------------------
|
||||
0 ISA IRDA
|
||||
1 ISA
|
||||
2 ISA cascade
|
||||
3 ISA WaveArtist
|
||||
4 ISA
|
||||
5 ISA
|
||||
6 ISA
|
||||
7 ISA WaveArtist
|
||||
135
Documentation/arm/Porting
Normal file
135
Documentation/arm/Porting
Normal file
@ -0,0 +1,135 @@
|
||||
Taken from list archive at http://lists.arm.linux.org.uk/pipermail/linux-arm-kernel/2001-July/004064.html
|
||||
|
||||
Initial definitions
|
||||
-------------------
|
||||
|
||||
The following symbol definitions rely on you knowing the translation that
|
||||
__virt_to_phys() does for your machine. This macro converts the passed
|
||||
virtual address to a physical address. Normally, it is simply:
|
||||
|
||||
phys = virt - PAGE_OFFSET + PHYS_OFFSET
|
||||
|
||||
|
||||
Decompressor Symbols
|
||||
--------------------
|
||||
|
||||
ZTEXTADDR
|
||||
Start address of decompressor. There's no point in talking about
|
||||
virtual or physical addresses here, since the MMU will be off at
|
||||
the time when you call the decompressor code. You normally call
|
||||
the kernel at this address to start it booting. This doesn't have
|
||||
to be located in RAM, it can be in flash or other read-only or
|
||||
read-write addressable medium.
|
||||
|
||||
ZBSSADDR
|
||||
Start address of zero-initialised work area for the decompressor.
|
||||
This must be pointing at RAM. The decompressor will zero initialise
|
||||
this for you. Again, the MMU will be off.
|
||||
|
||||
ZRELADDR
|
||||
This is the address where the decompressed kernel will be written,
|
||||
and eventually executed. The following constraint must be valid:
|
||||
|
||||
__virt_to_phys(TEXTADDR) == ZRELADDR
|
||||
|
||||
The initial part of the kernel is carefully coded to be position
|
||||
independent.
|
||||
|
||||
INITRD_PHYS
|
||||
Physical address to place the initial RAM disk. Only relevant if
|
||||
you are using the bootpImage stuff (which only works on the old
|
||||
struct param_struct).
|
||||
|
||||
INITRD_VIRT
|
||||
Virtual address of the initial RAM disk. The following constraint
|
||||
must be valid:
|
||||
|
||||
__virt_to_phys(INITRD_VIRT) == INITRD_PHYS
|
||||
|
||||
PARAMS_PHYS
|
||||
Physical address of the struct param_struct or tag list, giving the
|
||||
kernel various parameters about its execution environment.
|
||||
|
||||
|
||||
Kernel Symbols
|
||||
--------------
|
||||
|
||||
PHYS_OFFSET
|
||||
Physical start address of the first bank of RAM.
|
||||
|
||||
PAGE_OFFSET
|
||||
Virtual start address of the first bank of RAM. During the kernel
|
||||
boot phase, virtual address PAGE_OFFSET will be mapped to physical
|
||||
address PHYS_OFFSET, along with any other mappings you supply.
|
||||
This should be the same value as TASK_SIZE.
|
||||
|
||||
TASK_SIZE
|
||||
The maximum size of a user process in bytes. Since user space
|
||||
always starts at zero, this is the maximum address that a user
|
||||
process can access+1. The user space stack grows down from this
|
||||
address.
|
||||
|
||||
Any virtual address below TASK_SIZE is deemed to be user process
|
||||
area, and therefore managed dynamically on a process by process
|
||||
basis by the kernel. I'll call this the user segment.
|
||||
|
||||
Anything above TASK_SIZE is common to all processes. I'll call
|
||||
this the kernel segment.
|
||||
|
||||
(In other words, you can't put IO mappings below TASK_SIZE, and
|
||||
hence PAGE_OFFSET).
|
||||
|
||||
TEXTADDR
|
||||
Virtual start address of kernel, normally PAGE_OFFSET + 0x8000.
|
||||
This is where the kernel image ends up. With the latest kernels,
|
||||
it must be located at 32768 bytes into a 128MB region. Previous
|
||||
kernels placed a restriction of 256MB here.
|
||||
|
||||
DATAADDR
|
||||
Virtual address for the kernel data segment. Must not be defined
|
||||
when using the decompressor.
|
||||
|
||||
VMALLOC_START
|
||||
VMALLOC_END
|
||||
Virtual addresses bounding the vmalloc() area. There must not be
|
||||
any static mappings in this area; vmalloc will overwrite them.
|
||||
The addresses must also be in the kernel segment (see above).
|
||||
Normally, the vmalloc() area starts VMALLOC_OFFSET bytes above the
|
||||
last virtual RAM address (found using variable high_memory).
|
||||
|
||||
VMALLOC_OFFSET
|
||||
Offset normally incorporated into VMALLOC_START to provide a hole
|
||||
between virtual RAM and the vmalloc area. We do this to allow
|
||||
out of bounds memory accesses (eg, something writing off the end
|
||||
of the mapped memory map) to be caught. Normally set to 8MB.
|
||||
|
||||
Architecture Specific Macros
|
||||
----------------------------
|
||||
|
||||
BOOT_MEM(pram,pio,vio)
|
||||
`pram' specifies the physical start address of RAM. Must always
|
||||
be present, and should be the same as PHYS_OFFSET.
|
||||
|
||||
`pio' is the physical address of an 8MB region containing IO for
|
||||
use with the debugging macros in arch/arm/kernel/debug-armv.S.
|
||||
|
||||
`vio' is the virtual address of the 8MB debugging region.
|
||||
|
||||
It is expected that the debugging region will be re-initialised
|
||||
by the architecture specific code later in the code (via the
|
||||
MAPIO function).
|
||||
|
||||
BOOT_PARAMS
|
||||
Same as, and see PARAMS_PHYS.
|
||||
|
||||
FIXUP(func)
|
||||
Machine specific fixups, run before memory subsystems have been
|
||||
initialised.
|
||||
|
||||
MAPIO(func)
|
||||
Machine specific function to map IO areas (including the debug
|
||||
region above).
|
||||
|
||||
INITIRQ(func)
|
||||
Machine specific function to initialise interrupts.
|
||||
|
||||
197
Documentation/arm/README
Normal file
197
Documentation/arm/README
Normal file
@ -0,0 +1,197 @@
|
||||
ARM Linux 2.6
|
||||
=============
|
||||
|
||||
Please check <ftp://ftp.arm.linux.org.uk/pub/armlinux> for
|
||||
updates.
|
||||
|
||||
Compilation of kernel
|
||||
---------------------
|
||||
|
||||
In order to compile ARM Linux, you will need a compiler capable of
|
||||
generating ARM ELF code with GNU extensions. GCC 3.3 is known to be
|
||||
a good compiler. Fortunately, you needn't guess. The kernel will report
|
||||
an error if your compiler is a recognized offender.
|
||||
|
||||
To build ARM Linux natively, you shouldn't have to alter the ARCH = line
|
||||
in the top level Makefile. However, if you don't have the ARM Linux ELF
|
||||
tools installed as default, then you should change the CROSS_COMPILE
|
||||
line as detailed below.
|
||||
|
||||
If you wish to cross-compile, then alter the following lines in the top
|
||||
level make file:
|
||||
|
||||
ARCH = <whatever>
|
||||
with
|
||||
ARCH = arm
|
||||
|
||||
and
|
||||
|
||||
CROSS_COMPILE=
|
||||
to
|
||||
CROSS_COMPILE=<your-path-to-your-compiler-without-gcc>
|
||||
eg.
|
||||
CROSS_COMPILE=arm-linux-
|
||||
|
||||
Do a 'make config', followed by 'make Image' to build the kernel
|
||||
(arch/arm/boot/Image). A compressed image can be built by doing a
|
||||
'make zImage' instead of 'make Image'.
|
||||
|
||||
|
||||
Bug reports etc
|
||||
---------------
|
||||
|
||||
Please send patches to the patch system. For more information, see
|
||||
http://www.arm.linux.org.uk/patches/info.html Always include some
|
||||
explanation as to what the patch does and why it is needed.
|
||||
|
||||
Bug reports should be sent to linux-arm-kernel@lists.arm.linux.org.uk,
|
||||
or submitted through the web form at
|
||||
http://www.arm.linux.org.uk/forms/solution.shtml
|
||||
|
||||
When sending bug reports, please ensure that they contain all relevant
|
||||
information, eg. the kernel messages that were printed before/during
|
||||
the problem, what you were doing, etc.
|
||||
|
||||
|
||||
Include files
|
||||
-------------
|
||||
|
||||
Several new include directories have been created under include/asm-arm,
|
||||
which are there to reduce the clutter in the top-level directory. These
|
||||
directories, and their purpose is listed below:
|
||||
|
||||
arch-* machine/platform specific header files
|
||||
hardware driver-internal ARM specific data structures/definitions
|
||||
mach descriptions of generic ARM to specific machine interfaces
|
||||
proc-* processor dependent header files (currently only two
|
||||
categories)
|
||||
|
||||
|
||||
Machine/Platform support
|
||||
------------------------
|
||||
|
||||
The ARM tree contains support for a lot of different machine types. To
|
||||
continue supporting these differences, it has become necessary to split
|
||||
machine-specific parts by directory. For this, the machine category is
|
||||
used to select which directories and files get included (we will use
|
||||
$(MACHINE) to refer to the category)
|
||||
|
||||
To this end, we now have arch/arm/mach-$(MACHINE) directories which are
|
||||
designed to house the non-driver files for a particular machine (eg, PCI,
|
||||
memory management, architecture definitions etc). For all future
|
||||
machines, there should be a corresponding include/asm-arm/arch-$(MACHINE)
|
||||
directory.
|
||||
|
||||
|
||||
Modules
|
||||
-------
|
||||
|
||||
Although modularisation is supported (and required for the FP emulator),
|
||||
each module on an ARM2/ARM250/ARM3 machine when is loaded will take
|
||||
memory up to the next 32k boundary due to the size of the pages.
|
||||
Therefore, is modularisation on these machines really worth it?
|
||||
|
||||
However, ARM6 and up machines allow modules to take multiples of 4k, and
|
||||
as such Acorn RiscPCs and other architectures using these processors can
|
||||
make good use of modularisation.
|
||||
|
||||
|
||||
ADFS Image files
|
||||
----------------
|
||||
|
||||
You can access image files on your ADFS partitions by mounting the ADFS
|
||||
partition, and then using the loopback device driver. You must have
|
||||
losetup installed.
|
||||
|
||||
Please note that the PCEmulator DOS partitions have a partition table at
|
||||
the start, and as such, you will have to give '-o offset' to losetup.
|
||||
|
||||
|
||||
Request to developers
|
||||
---------------------
|
||||
|
||||
When writing device drivers which include a separate assembler file, please
|
||||
include it in with the C file, and not the arch/arm/lib directory. This
|
||||
allows the driver to be compiled as a loadable module without requiring
|
||||
half the code to be compiled into the kernel image.
|
||||
|
||||
In general, try to avoid using assembler unless it is really necessary. It
|
||||
makes drivers far less easy to port to other hardware.
|
||||
|
||||
|
||||
ST506 hard drives
|
||||
-----------------
|
||||
|
||||
The ST506 hard drive controllers seem to be working fine (if a little
|
||||
slowly). At the moment they will only work off the controllers on an
|
||||
A4x0's motherboard, but for it to work off a Podule just requires
|
||||
someone with a podule to add the addresses for the IRQ mask and the
|
||||
HDC base to the source.
|
||||
|
||||
As of 31/3/96 it works with two drives (you should get the ADFS
|
||||
*configure harddrive set to 2). I've got an internal 20MB and a great
|
||||
big external 5.25" FH 64MB drive (who could ever want more :-) ).
|
||||
|
||||
I've just got 240K/s off it (a dd with bs=128k); thats about half of what
|
||||
RiscOS gets; but it's a heck of a lot better than the 50K/s I was getting
|
||||
last week :-)
|
||||
|
||||
Known bug: Drive data errors can cause a hang; including cases where
|
||||
the controller has fixed the error using ECC. (Possibly ONLY
|
||||
in that case...hmm).
|
||||
|
||||
|
||||
1772 Floppy
|
||||
-----------
|
||||
This also seems to work OK, but hasn't been stressed much lately. It
|
||||
hasn't got any code for disc change detection in there at the moment which
|
||||
could be a bit of a problem! Suggestions on the correct way to do this
|
||||
are welcome.
|
||||
|
||||
|
||||
CONFIG_MACH_ and CONFIG_ARCH_
|
||||
-----------------------------
|
||||
A change was made in 2003 to the macro names for new machines.
|
||||
Historically, CONFIG_ARCH_ was used for the bonafide architecture,
|
||||
e.g. SA1100, as well as implementations of the architecture,
|
||||
e.g. Assabet. It was decided to change the implementation macros
|
||||
to read CONFIG_MACH_ for clarity. Moreover, a retroactive fixup has
|
||||
not been made because it would complicate patching.
|
||||
|
||||
Previous registrations may be found online.
|
||||
|
||||
<http://www.arm.linux.org.uk/developer/machines/>
|
||||
|
||||
Kernel entry (head.S)
|
||||
--------------------------
|
||||
The initial entry into the kernel is via head.S, which uses machine
|
||||
independent code. The machine is selected by the value of 'r1' on
|
||||
entry, which must be kept unique.
|
||||
|
||||
Due to the large number of machines which the ARM port of Linux provides
|
||||
for, we have a method to manage this which ensures that we don't end up
|
||||
duplicating large amounts of code.
|
||||
|
||||
We group machine (or platform) support code into machine classes. A
|
||||
class typically based around one or more system on a chip devices, and
|
||||
acts as a natural container around the actual implementations. These
|
||||
classes are given directories - arch/arm/mach-<class> and
|
||||
include/asm-arm/arch-<class> - which contain the source files to
|
||||
support the machine class. This directories also contain any machine
|
||||
specific supporting code.
|
||||
|
||||
For example, the SA1100 class is based upon the SA1100 and SA1110 SoC
|
||||
devices, and contains the code to support the way the on-board and off-
|
||||
board devices are used, or the device is setup, and provides that
|
||||
machine specific "personality."
|
||||
|
||||
This fine-grained machine specific selection is controlled by the machine
|
||||
type ID, which acts both as a run-time and a compile-time code selection
|
||||
method.
|
||||
|
||||
You can register a new machine via the web site at:
|
||||
|
||||
<http://www.arm.linux.org.uk/developer/machines/>
|
||||
|
||||
---
|
||||
Russell King (15/03/2004)
|
||||
43
Documentation/arm/SA1100/ADSBitsy
Normal file
43
Documentation/arm/SA1100/ADSBitsy
Normal file
@ -0,0 +1,43 @@
|
||||
ADS Bitsy Single Board Computer
|
||||
(It is different from Bitsy(iPAQ) of Compaq)
|
||||
|
||||
For more details, contact Applied Data Systems or see
|
||||
http://www.applieddata.net/products.html
|
||||
|
||||
The Linux support for this product has been provided by
|
||||
Woojung Huh <whuh@applieddata.net>
|
||||
|
||||
Use 'make adsbitsy_config' before any 'make config'.
|
||||
This will set up defaults for ADS Bitsy support.
|
||||
|
||||
The kernel zImage is linked to be loaded and executed at 0xc0400000.
|
||||
|
||||
Linux can be used with the ADS BootLoader that ships with the
|
||||
newer rev boards. See their documentation on how to load Linux.
|
||||
|
||||
Supported peripherals:
|
||||
- SA1100 LCD frame buffer (8/16bpp...sort of)
|
||||
- SA1111 USB Master
|
||||
- SA1100 serial port
|
||||
- pcmcia, compact flash
|
||||
- touchscreen(ucb1200)
|
||||
- console on LCD screen
|
||||
- serial ports (ttyS[0-2])
|
||||
- ttyS0 is default for serial console
|
||||
|
||||
To do:
|
||||
- everything else! :-)
|
||||
|
||||
Notes:
|
||||
|
||||
- The flash on board is divided into 3 partitions.
|
||||
You should be careful to use flash on board.
|
||||
It's partition is different from GraphicsClient Plus and GraphicsMaster
|
||||
|
||||
- 16bpp mode requires a different cable than what ships with the board.
|
||||
Contact ADS or look through the manual to wire your own. Currently,
|
||||
if you compile with 16bit mode support and switch into a lower bpp
|
||||
mode, the timing is off so the image is corrupted. This will be
|
||||
fixed soon.
|
||||
|
||||
Any contribution can be sent to nico@cam.org and will be greatly welcome!
|
||||
301
Documentation/arm/SA1100/Assabet
Normal file
301
Documentation/arm/SA1100/Assabet
Normal file
@ -0,0 +1,301 @@
|
||||
The Intel Assabet (SA-1110 evaluation) board
|
||||
============================================
|
||||
|
||||
Please see:
|
||||
http://developer.intel.com/design/strong/quicklist/eval-plat/sa-1110.htm
|
||||
http://developer.intel.com/design/strong/guides/278278.htm
|
||||
|
||||
Also some notes from John G Dorsey <jd5q@andrew.cmu.edu>:
|
||||
http://www.cs.cmu.edu/~wearable/software/assabet.html
|
||||
|
||||
|
||||
Building the kernel
|
||||
-------------------
|
||||
|
||||
To build the kernel with current defaults:
|
||||
|
||||
make assabet_config
|
||||
make oldconfig
|
||||
make zImage
|
||||
|
||||
The resulting kernel image should be available in linux/arch/arm/boot/zImage.
|
||||
|
||||
|
||||
Installing a bootloader
|
||||
-----------------------
|
||||
|
||||
A couple of bootloaders able to boot Linux on Assabet are available:
|
||||
|
||||
BLOB (http://www.lartmaker.nl/lartware/blob/)
|
||||
|
||||
BLOB is a bootloader used within the LART project. Some contributed
|
||||
patches were merged into BLOB to add support for Assabet.
|
||||
|
||||
Compaq's Bootldr + John Dorsey's patch for Assabet support
|
||||
(http://www.handhelds.org/Compaq/bootldr.html)
|
||||
(http://www.wearablegroup.org/software/bootldr/)
|
||||
|
||||
Bootldr is the bootloader developed by Compaq for the iPAQ Pocket PC.
|
||||
John Dorsey has produced add-on patches to add support for Assabet and
|
||||
the JFFS filesystem.
|
||||
|
||||
RedBoot (http://sources.redhat.com/redboot/)
|
||||
|
||||
RedBoot is a bootloader developed by Red Hat based on the eCos RTOS
|
||||
hardware abstraction layer. It supports Assabet amongst many other
|
||||
hardware platforms.
|
||||
|
||||
RedBoot is currently the recommended choice since it's the only one to have
|
||||
networking support, and is the most actively maintained.
|
||||
|
||||
Brief examples on how to boot Linux with RedBoot are shown below. But first
|
||||
you need to have RedBoot installed in your flash memory. A known to work
|
||||
precompiled RedBoot binary is available from the following location:
|
||||
|
||||
ftp://ftp.netwinder.org/users/n/nico/
|
||||
ftp://ftp.arm.linux.org.uk/pub/linux/arm/people/nico/
|
||||
ftp://ftp.handhelds.org/pub/linux/arm/sa-1100-patches/
|
||||
|
||||
Look for redboot-assabet*.tgz. Some installation infos are provided in
|
||||
redboot-assabet*.txt.
|
||||
|
||||
|
||||
Initial RedBoot configuration
|
||||
-----------------------------
|
||||
|
||||
The commands used here are explained in The RedBoot User's Guide available
|
||||
on-line at http://sources.redhat.com/ecos/docs-latest/redboot/redboot.html.
|
||||
Please refer to it for explanations.
|
||||
|
||||
If you have a CF network card (my Assabet kit contained a CF+ LP-E from
|
||||
Socket Communications Inc.), you should strongly consider using it for TFTP
|
||||
file transfers. You must insert it before RedBoot runs since it can't detect
|
||||
it dynamically.
|
||||
|
||||
To initialize the flash directory:
|
||||
|
||||
fis init -f
|
||||
|
||||
To initialize the non-volatile settings, like whether you want to use BOOTP or
|
||||
a static IP address, etc, use this command:
|
||||
|
||||
fconfig -i
|
||||
|
||||
|
||||
Writing a kernel image into flash
|
||||
---------------------------------
|
||||
|
||||
First, the kernel image must be loaded into RAM. If you have the zImage file
|
||||
available on a TFTP server:
|
||||
|
||||
load zImage -r -b 0x100000
|
||||
|
||||
If you rather want to use Y-Modem upload over the serial port:
|
||||
|
||||
load -m ymodem -r -b 0x100000
|
||||
|
||||
To write it to flash:
|
||||
|
||||
fis create "Linux kernel" -b 0x100000 -l 0xc0000
|
||||
|
||||
|
||||
Booting the kernel
|
||||
------------------
|
||||
|
||||
The kernel still requires a filesystem to boot. A ramdisk image can be loaded
|
||||
as follows:
|
||||
|
||||
load ramdisk_image.gz -r -b 0x800000
|
||||
|
||||
Again, Y-Modem upload can be used instead of TFTP by replacing the file name
|
||||
by '-y ymodem'.
|
||||
|
||||
Now the kernel can be retrieved from flash like this:
|
||||
|
||||
fis load "Linux kernel"
|
||||
|
||||
or loaded as described previously. To boot the kernel:
|
||||
|
||||
exec -b 0x100000 -l 0xc0000
|
||||
|
||||
The ramdisk image could be stored into flash as well, but there are better
|
||||
solutions for on-flash filesystems as mentioned below.
|
||||
|
||||
|
||||
Using JFFS2
|
||||
-----------
|
||||
|
||||
Using JFFS2 (the Second Journalling Flash File System) is probably the most
|
||||
convenient way to store a writable filesystem into flash. JFFS2 is used in
|
||||
conjunction with the MTD layer which is responsible for low-level flash
|
||||
management. More information on the Linux MTD can be found on-line at:
|
||||
http://www.linux-mtd.infradead.org/. A JFFS howto with some infos about
|
||||
creating JFFS/JFFS2 images is available from the same site.
|
||||
|
||||
For instance, a sample JFFS2 image can be retrieved from the same FTP sites
|
||||
mentioned below for the precompiled RedBoot image.
|
||||
|
||||
To load this file:
|
||||
|
||||
load sample_img.jffs2 -r -b 0x100000
|
||||
|
||||
The result should look like:
|
||||
|
||||
RedBoot> load sample_img.jffs2 -r -b 0x100000
|
||||
Raw file loaded 0x00100000-0x00377424
|
||||
|
||||
Now we must know the size of the unallocated flash:
|
||||
|
||||
fis free
|
||||
|
||||
Result:
|
||||
|
||||
RedBoot> fis free
|
||||
0x500E0000 .. 0x503C0000
|
||||
|
||||
The values above may be different depending on the size of the filesystem and
|
||||
the type of flash. See their usage below as an example and take care of
|
||||
substituting yours appropriately.
|
||||
|
||||
We must determine some values:
|
||||
|
||||
size of unallocated flash: 0x503c0000 - 0x500e0000 = 0x2e0000
|
||||
size of the filesystem image: 0x00377424 - 0x00100000 = 0x277424
|
||||
|
||||
We want to fit the filesystem image of course, but we also want to give it all
|
||||
the remaining flash space as well. To write it:
|
||||
|
||||
fis unlock -f 0x500E0000 -l 0x2e0000
|
||||
fis erase -f 0x500E0000 -l 0x2e0000
|
||||
fis write -b 0x100000 -l 0x277424 -f 0x500E0000
|
||||
fis create "JFFS2" -n -f 0x500E0000 -l 0x2e0000
|
||||
|
||||
Now the filesystem is associated to a MTD "partition" once Linux has discovered
|
||||
what they are in the boot process. From Redboot, the 'fis list' command
|
||||
displays them:
|
||||
|
||||
RedBoot> fis list
|
||||
Name FLASH addr Mem addr Length Entry point
|
||||
RedBoot 0x50000000 0x50000000 0x00020000 0x00000000
|
||||
RedBoot config 0x503C0000 0x503C0000 0x00020000 0x00000000
|
||||
FIS directory 0x503E0000 0x503E0000 0x00020000 0x00000000
|
||||
Linux kernel 0x50020000 0x00100000 0x000C0000 0x00000000
|
||||
JFFS2 0x500E0000 0x500E0000 0x002E0000 0x00000000
|
||||
|
||||
However Linux should display something like:
|
||||
|
||||
SA1100 flash: probing 32-bit flash bus
|
||||
SA1100 flash: Found 2 x16 devices at 0x0 in 32-bit mode
|
||||
Using RedBoot partition definition
|
||||
Creating 5 MTD partitions on "SA1100 flash":
|
||||
0x00000000-0x00020000 : "RedBoot"
|
||||
0x00020000-0x000e0000 : "Linux kernel"
|
||||
0x000e0000-0x003c0000 : "JFFS2"
|
||||
0x003c0000-0x003e0000 : "RedBoot config"
|
||||
0x003e0000-0x00400000 : "FIS directory"
|
||||
|
||||
What's important here is the position of the partition we are interested in,
|
||||
which is the third one. Within Linux, this correspond to /dev/mtdblock2.
|
||||
Therefore to boot Linux with the kernel and its root filesystem in flash, we
|
||||
need this RedBoot command:
|
||||
|
||||
fis load "Linux kernel"
|
||||
exec -b 0x100000 -l 0xc0000 -c "root=/dev/mtdblock2"
|
||||
|
||||
Of course other filesystems than JFFS might be used, like cramfs for example.
|
||||
You might want to boot with a root filesystem over NFS, etc. It is also
|
||||
possible, and sometimes more convenient, to flash a filesystem directly from
|
||||
within Linux while booted from a ramdisk or NFS. The Linux MTD repository has
|
||||
many tools to deal with flash memory as well, to erase it for example. JFFS2
|
||||
can then be mounted directly on a freshly erased partition and files can be
|
||||
copied over directly. Etc...
|
||||
|
||||
|
||||
RedBoot scripting
|
||||
-----------------
|
||||
|
||||
All the commands above aren't so useful if they have to be typed in every
|
||||
time the Assabet is rebooted. Therefore it's possible to automatize the boot
|
||||
process using RedBoot's scripting capability.
|
||||
|
||||
For example, I use this to boot Linux with both the kernel and the ramdisk
|
||||
images retrieved from a TFTP server on the network:
|
||||
|
||||
RedBoot> fconfig
|
||||
Run script at boot: false true
|
||||
Boot script:
|
||||
Enter script, terminate with empty line
|
||||
>> load zImage -r -b 0x100000
|
||||
>> load ramdisk_ks.gz -r -b 0x800000
|
||||
>> exec -b 0x100000 -l 0xc0000
|
||||
>>
|
||||
Boot script timeout (1000ms resolution): 3
|
||||
Use BOOTP for network configuration: true
|
||||
GDB connection port: 9000
|
||||
Network debug at boot time: false
|
||||
Update RedBoot non-volatile configuration - are you sure (y/n)? y
|
||||
|
||||
Then, rebooting the Assabet is just a matter of waiting for the login prompt.
|
||||
|
||||
|
||||
|
||||
Nicolas Pitre
|
||||
nico@cam.org
|
||||
June 12, 2001
|
||||
|
||||
|
||||
Status of peripherals in -rmk tree (updated 14/10/2001)
|
||||
-------------------------------------------------------
|
||||
|
||||
Assabet:
|
||||
Serial ports:
|
||||
Radio: TX, RX, CTS, DSR, DCD, RI
|
||||
PM: Not tested.
|
||||
COM: TX, RX, CTS, DSR, DCD, RTS, DTR, PM
|
||||
PM: Not tested.
|
||||
I2C: Implemented, not fully tested.
|
||||
L3: Fully tested, pass.
|
||||
PM: Not tested.
|
||||
|
||||
Video:
|
||||
LCD: Fully tested. PM
|
||||
(LCD doesn't like being blanked with
|
||||
neponset connected)
|
||||
Video out: Not fully
|
||||
|
||||
Audio:
|
||||
UDA1341:
|
||||
Playback: Fully tested, pass.
|
||||
Record: Implemented, not tested.
|
||||
PM: Not tested.
|
||||
|
||||
UCB1200:
|
||||
Audio play: Implemented, not heavily tested.
|
||||
Audio rec: Implemented, not heavily tested.
|
||||
Telco audio play: Implemented, not heavily tested.
|
||||
Telco audio rec: Implemented, not heavily tested.
|
||||
POTS control: No
|
||||
Touchscreen: Yes
|
||||
PM: Not tested.
|
||||
|
||||
Other:
|
||||
PCMCIA:
|
||||
LPE: Fully tested, pass.
|
||||
USB: No
|
||||
IRDA:
|
||||
SIR: Fully tested, pass.
|
||||
FIR: Fully tested, pass.
|
||||
PM: Not tested.
|
||||
|
||||
Neponset:
|
||||
Serial ports:
|
||||
COM1,2: TX, RX, CTS, DSR, DCD, RTS, DTR
|
||||
PM: Not tested.
|
||||
USB: Implemented, not heavily tested.
|
||||
PCMCIA: Implemented, not heavily tested.
|
||||
PM: Not tested.
|
||||
CF: Implemented, not heavily tested.
|
||||
PM: Not tested.
|
||||
|
||||
More stuff can be found in the -np (Nicolas Pitre's) tree.
|
||||
|
||||
66
Documentation/arm/SA1100/Brutus
Normal file
66
Documentation/arm/SA1100/Brutus
Normal file
@ -0,0 +1,66 @@
|
||||
Brutus is an evaluation platform for the SA1100 manufactured by Intel.
|
||||
For more details, see:
|
||||
|
||||
http://developer.intel.com/design/strong/applnots/sa1100lx/getstart.htm
|
||||
|
||||
To compile for Brutus, you must issue the following commands:
|
||||
|
||||
make brutus_config
|
||||
make config
|
||||
[accept all the defaults]
|
||||
make zImage
|
||||
|
||||
The resulting kernel will end up in linux/arch/arm/boot/zImage. This file
|
||||
must be loaded at 0xc0008000 in Brutus's memory and execution started at
|
||||
0xc0008000 as well with the value of registers r0 = 0 and r1 = 16 upon
|
||||
entry.
|
||||
|
||||
But prior to execute the kernel, a ramdisk image must also be loaded in
|
||||
memory. Use memory address 0xd8000000 for this. Note that the file
|
||||
containing the (compressed) ramdisk image must not exceed 4 MB.
|
||||
|
||||
Typically, you'll need angelboot to load the kernel.
|
||||
The following angelboot.opt file should be used:
|
||||
|
||||
----- begin angelboot.opt -----
|
||||
base 0xc0008000
|
||||
entry 0xc0008000
|
||||
r0 0x00000000
|
||||
r1 0x00000010
|
||||
device /dev/ttyS0
|
||||
options "9600 8N1"
|
||||
baud 115200
|
||||
otherfile ramdisk_img.gz
|
||||
otherbase 0xd8000000
|
||||
----- end angelboot.opt -----
|
||||
|
||||
Then load the kernel and ramdisk with:
|
||||
|
||||
angelboot -f angelboot.opt zImage
|
||||
|
||||
The first Brutus serial port (assumed to be linked to /dev/ttyS0 on your
|
||||
host PC) is used by angel to load the kernel and ramdisk image. The serial
|
||||
console is provided through the second Brutus serial port. To access it,
|
||||
you may use minicom configured with /dev/ttyS1, 9600 baud, 8N1, no flow
|
||||
control.
|
||||
|
||||
Currently supported:
|
||||
- RS232 serial ports
|
||||
- audio output
|
||||
- LCD screen
|
||||
- keyboard
|
||||
|
||||
The actual Brutus support may not be complete without extra patches.
|
||||
If such patches exist, they should be found from
|
||||
ftp.netwinder.org/users/n/nico.
|
||||
|
||||
A full PCMCIA support is still missing, although it's possible to hack
|
||||
some drivers in order to drive already inserted cards at boot time with
|
||||
little modifications.
|
||||
|
||||
Any contribution is welcome.
|
||||
|
||||
Please send patches to nico@cam.org
|
||||
|
||||
Have Fun !
|
||||
|
||||
29
Documentation/arm/SA1100/CERF
Normal file
29
Documentation/arm/SA1100/CERF
Normal file
@ -0,0 +1,29 @@
|
||||
*** The StrongARM version of the CerfBoard/Cube has been discontinued ***
|
||||
|
||||
The Intrinsyc CerfBoard is a StrongARM 1110-based computer on a board
|
||||
that measures approximately 2" square. It includes an Ethernet
|
||||
controller, an RS232-compatible serial port, a USB function port, and
|
||||
one CompactFlash+ slot on the back. Pictures can be found at the
|
||||
Intrinsyc website, http://www.intrinsyc.com.
|
||||
|
||||
This document describes the support in the Linux kernel for the
|
||||
Intrinsyc CerfBoard.
|
||||
|
||||
Supported in this version:
|
||||
- CompactFlash+ slot (select PCMCIA in General Setup and any options
|
||||
that may be required)
|
||||
- Onboard Crystal CS8900 Ethernet controller (Cerf CS8900A support in
|
||||
Network Devices)
|
||||
- Serial ports with a serial console (hardcoded to 38400 8N1)
|
||||
|
||||
In order to get this kernel onto your Cerf, you need a server that runs
|
||||
both BOOTP and TFTP. Detailed instructions should have come with your
|
||||
evaluation kit on how to use the bootloader. This series of commands
|
||||
will suffice:
|
||||
|
||||
make ARCH=arm CROSS_COMPILE=arm-linux- cerfcube_defconfig
|
||||
make ARCH=arm CROSS_COMPILE=arm-linux- zImage
|
||||
make ARCH=arm CROSS_COMPILE=arm-linux- modules
|
||||
cp arch/arm/boot/zImage <TFTP directory>
|
||||
|
||||
support@intrinsyc.com
|
||||
21
Documentation/arm/SA1100/FreeBird
Normal file
21
Documentation/arm/SA1100/FreeBird
Normal file
@ -0,0 +1,21 @@
|
||||
Freebird-1.1 is produced by Legned(C) ,Inc.
|
||||
(http://www.legend.com.cn)
|
||||
and software/linux mainatined by Coventive(C),Inc.
|
||||
(http://www.coventive.com)
|
||||
|
||||
Based on the Nicolas's strongarm kernel tree.
|
||||
|
||||
===============================================================
|
||||
Maintainer:
|
||||
|
||||
Chester Kuo <chester@coventive.com>
|
||||
<chester@linux.org.tw>
|
||||
|
||||
Author :
|
||||
Tim wu <timwu@coventive.com>
|
||||
CIH <cih@coventive.com>
|
||||
Eric Peng <ericpeng@coventive.com>
|
||||
Jeff Lee <jeff_lee@coventive.com>
|
||||
Allen Cheng
|
||||
Tony Liu <tonyliu@coventive.com>
|
||||
|
||||
98
Documentation/arm/SA1100/GraphicsClient
Normal file
98
Documentation/arm/SA1100/GraphicsClient
Normal file
@ -0,0 +1,98 @@
|
||||
ADS GraphicsClient Plus Single Board Computer
|
||||
|
||||
For more details, contact Applied Data Systems or see
|
||||
http://www.applieddata.net/products.html
|
||||
|
||||
The original Linux support for this product has been provided by
|
||||
Nicolas Pitre <nico@cam.org>. Continued development work by
|
||||
Woojung Huh <whuh@applieddata.net>
|
||||
|
||||
It's currently possible to mount a root filesystem via NFS providing a
|
||||
complete Linux environment. Otherwise a ramdisk image may be used. The
|
||||
board supports MTD/JFFS, so you could also mount something on there.
|
||||
|
||||
Use 'make graphicsclient_config' before any 'make config'. This will set up
|
||||
defaults for GraphicsClient Plus support.
|
||||
|
||||
The kernel zImage is linked to be loaded and executed at 0xc0200000.
|
||||
Also the following registers should have the specified values upon entry:
|
||||
|
||||
r0 = 0
|
||||
r1 = 29 (this is the GraphicsClient architecture number)
|
||||
|
||||
Linux can be used with the ADS BootLoader that ships with the
|
||||
newer rev boards. See their documentation on how to load Linux.
|
||||
Angel is not available for the GraphicsClient Plus AFAIK.
|
||||
|
||||
There is a board known as just the GraphicsClient that ADS used to
|
||||
produce but has end of lifed. This code will not work on the older
|
||||
board with the ADS bootloader, but should still work with Angel,
|
||||
as outlined below. In any case, if you're planning on deploying
|
||||
something en masse, you should probably get the newer board.
|
||||
|
||||
If using Angel on the older boards, here is a typical angel.opt option file
|
||||
if the kernel is loaded through the Angel Debug Monitor:
|
||||
|
||||
----- begin angelboot.opt -----
|
||||
base 0xc0200000
|
||||
entry 0xc0200000
|
||||
r0 0x00000000
|
||||
r1 0x0000001d
|
||||
device /dev/ttyS1
|
||||
options "38400 8N1"
|
||||
baud 115200
|
||||
#otherfile ramdisk.gz
|
||||
#otherbase 0xc0800000
|
||||
exec minicom
|
||||
----- end angelboot.opt -----
|
||||
|
||||
Then the kernel (and ramdisk if otherfile/otherbase lines above are
|
||||
uncommented) would be loaded with:
|
||||
|
||||
angelboot -f angelboot.opt zImage
|
||||
|
||||
Here it is assumed that the board is connected to ttyS1 on your PC
|
||||
and that minicom is preconfigured with /dev/ttyS1, 38400 baud, 8N1, no flow
|
||||
control by default.
|
||||
|
||||
If any other bootloader is used, ensure it accomplish the same, especially
|
||||
for r0/r1 register values before jumping into the kernel.
|
||||
|
||||
|
||||
Supported peripherals:
|
||||
- SA1100 LCD frame buffer (8/16bpp...sort of)
|
||||
- on-board SMC 92C96 ethernet NIC
|
||||
- SA1100 serial port
|
||||
- flash memory access (MTD/JFFS)
|
||||
- pcmcia
|
||||
- touchscreen(ucb1200)
|
||||
- ps/2 keyboard
|
||||
- console on LCD screen
|
||||
- serial ports (ttyS[0-2])
|
||||
- ttyS0 is default for serial console
|
||||
- Smart I/O (ADC, keypad, digital inputs, etc)
|
||||
See http://www.applieddata.com/developers/linux for IOCTL documentation
|
||||
and example user space code. ps/2 keybd is multiplexed through this driver
|
||||
|
||||
To do:
|
||||
- UCB1200 audio with new ucb_generic layer
|
||||
- everything else! :-)
|
||||
|
||||
Notes:
|
||||
|
||||
- The flash on board is divided into 3 partitions. mtd0 is where
|
||||
the ADS boot ROM and zImage is stored. It's been marked as
|
||||
read-only to keep you from blasting over the bootloader. :) mtd1 is
|
||||
for the ramdisk.gz image. mtd2 is user flash space and can be
|
||||
utilized for either JFFS or if you're feeling crazy, running ext2
|
||||
on top of it. If you're not using the ADS bootloader, you're
|
||||
welcome to blast over the mtd1 partition also.
|
||||
|
||||
- 16bpp mode requires a different cable than what ships with the board.
|
||||
Contact ADS or look through the manual to wire your own. Currently,
|
||||
if you compile with 16bit mode support and switch into a lower bpp
|
||||
mode, the timing is off so the image is corrupted. This will be
|
||||
fixed soon.
|
||||
|
||||
Any contribution can be sent to nico@cam.org and will be greatly welcome!
|
||||
|
||||
53
Documentation/arm/SA1100/GraphicsMaster
Normal file
53
Documentation/arm/SA1100/GraphicsMaster
Normal file
@ -0,0 +1,53 @@
|
||||
ADS GraphicsMaster Single Board Computer
|
||||
|
||||
For more details, contact Applied Data Systems or see
|
||||
http://www.applieddata.net/products.html
|
||||
|
||||
The original Linux support for this product has been provided by
|
||||
Nicolas Pitre <nico@cam.org>. Continued development work by
|
||||
Woojung Huh <whuh@applieddata.net>
|
||||
|
||||
Use 'make graphicsmaster_config' before any 'make config'.
|
||||
This will set up defaults for GraphicsMaster support.
|
||||
|
||||
The kernel zImage is linked to be loaded and executed at 0xc0400000.
|
||||
|
||||
Linux can be used with the ADS BootLoader that ships with the
|
||||
newer rev boards. See their documentation on how to load Linux.
|
||||
|
||||
Supported peripherals:
|
||||
- SA1100 LCD frame buffer (8/16bpp...sort of)
|
||||
- SA1111 USB Master
|
||||
- on-board SMC 92C96 ethernet NIC
|
||||
- SA1100 serial port
|
||||
- flash memory access (MTD/JFFS)
|
||||
- pcmcia, compact flash
|
||||
- touchscreen(ucb1200)
|
||||
- ps/2 keyboard
|
||||
- console on LCD screen
|
||||
- serial ports (ttyS[0-2])
|
||||
- ttyS0 is default for serial console
|
||||
- Smart I/O (ADC, keypad, digital inputs, etc)
|
||||
See http://www.applieddata.com/developers/linux for IOCTL documentation
|
||||
and example user space code. ps/2 keybd is multiplexed through this driver
|
||||
|
||||
To do:
|
||||
- everything else! :-)
|
||||
|
||||
Notes:
|
||||
|
||||
- The flash on board is divided into 3 partitions. mtd0 is where
|
||||
the zImage is stored. It's been marked as read-only to keep you
|
||||
from blasting over the bootloader. :) mtd1 is
|
||||
for the ramdisk.gz image. mtd2 is user flash space and can be
|
||||
utilized for either JFFS or if you're feeling crazy, running ext2
|
||||
on top of it. If you're not using the ADS bootloader, you're
|
||||
welcome to blast over the mtd1 partition also.
|
||||
|
||||
- 16bpp mode requires a different cable than what ships with the board.
|
||||
Contact ADS or look through the manual to wire your own. Currently,
|
||||
if you compile with 16bit mode support and switch into a lower bpp
|
||||
mode, the timing is off so the image is corrupted. This will be
|
||||
fixed soon.
|
||||
|
||||
Any contribution can be sent to nico@cam.org and will be greatly welcome!
|
||||
17
Documentation/arm/SA1100/HUW_WEBPANEL
Normal file
17
Documentation/arm/SA1100/HUW_WEBPANEL
Normal file
@ -0,0 +1,17 @@
|
||||
The HUW_WEBPANEL is a product of the german company Hoeft & Wessel AG
|
||||
|
||||
If you want more information, please visit
|
||||
http://www.hoeft-wessel.de
|
||||
|
||||
To build the kernel:
|
||||
make huw_webpanel_config
|
||||
make oldconfig
|
||||
[accept all defaults]
|
||||
make zImage
|
||||
|
||||
Mostly of the work is done by:
|
||||
Roman Jordan jor@hoeft-wessel.de
|
||||
Christoph Schulz schu@hoeft-wessel.de
|
||||
|
||||
2000/12/18/
|
||||
|
||||
Some files were not shown because too many files have changed in this diff Show More
Loading…
x
Reference in New Issue
Block a user