Creation of Cybook 2416 (actually Gen4) repository
94
Documentation/filesystems/00-INDEX
Normal file
@@ -0,0 +1,94 @@
00-INDEX
	- this file (info on some of the filesystems supported by linux).
Exporting
	- explanation of how to make filesystems exportable.
Locking
	- info on locking rules as they pertain to Linux VFS.
9p.txt
	- 9p (v9fs) is an implementation of the Plan 9 remote fs protocol.
adfs.txt
	- info and mount options for the Acorn Advanced Disc Filing System.
afs.txt
	- info and examples for the distributed AFS (Andrew File System) fs.
affs.txt
	- info and mount options for the Amiga Fast File System.
automount-support.txt
	- information about filesystem automount support.
befs.txt
	- information about the BeOS filesystem for Linux.
bfs.txt
	- info for the SCO UnixWare Boot Filesystem (BFS).
cifs.txt
	- description of the CIFS filesystem.
coda.txt
	- description of the CODA filesystem.
configfs/
	- directory containing configfs documentation and example code.
cramfs.txt
	- info on the cram filesystem for small storage (ROMs etc).
dentry-locking.txt
	- info on the RCU-based dcache locking model.
directory-locking
	- info about the locking scheme used for directory operations.
dlmfs.txt
	- info on the userspace interface to the OCFS2 DLM.
ext2.txt
	- info, mount options and specifications for the Ext2 filesystem.
ext3.txt
	- info, mount options and specifications for the Ext3 filesystem.
ext4.txt
	- info, mount options and specifications for the Ext4 filesystem.
files.txt
	- info on file management in the Linux kernel.
fuse.txt
	- info on the Filesystem in User SpacE including mount options.
hfs.txt
	- info on the Macintosh HFS Filesystem for Linux.
hpfs.txt
	- info and mount options for the OS/2 HPFS.
isofs.txt
	- info and mount options for the ISO 9660 (CDROM) filesystem.
jfs.txt
	- info and mount options for the JFS filesystem.
ncpfs.txt
	- info on Novell Netware(tm) filesystem using NCP protocol.
ntfs.txt
	- info and mount options for the NTFS filesystem (Windows NT).
ocfs2.txt
	- info and mount options for the OCFS2 clustered filesystem.
porting
	- various information on filesystem porting.
proc.txt
	- info on Linux's /proc filesystem.
ramfs-rootfs-initramfs.txt
	- info on the 'in memory' filesystems ramfs, rootfs and initramfs.
reiser4.txt
	- info on the Reiser4 filesystem based on dancing tree algorithms.
relay.txt
	- info on relay, for efficient streaming from kernel to user space.
romfs.txt
	- description of the ROMFS filesystem.
smbfs.txt
	- info on using filesystems with the SMB protocol (Win 3.11 and NT).
spufs.txt
	- info and mount options for the SPU filesystem used on Cell.
sysfs-pci.txt
	- info on accessing PCI device resources through sysfs.
sysfs.txt
	- info on sysfs, a ram-based filesystem for exporting kernel objects.
sysv-fs.txt
	- info on the SystemV/V7/Xenix/Coherent filesystem.
tmpfs.txt
	- info on tmpfs, a filesystem that holds all files in virtual memory.
udf.txt
	- info and mount options for the UDF filesystem.
ufs.txt
	- info on the ufs filesystem.
vfat.txt
	- info on using the VFAT filesystem used in Windows NT and Windows 95
vfs.txt
	- overview of the Virtual File System
xfs.txt
	- info and mount options for the XFS filesystem.
xip.txt
	- info on execute-in-place for file mappings.
118
Documentation/filesystems/9p.txt
Normal file
@@ -0,0 +1,118 @@
|
||||
v9fs: Plan 9 Resource Sharing for Linux
|
||||
=======================================
|
||||
|
||||
ABOUT
|
||||
=====
|
||||
|
||||
v9fs is a Unix implementation of the Plan 9 9p remote filesystem protocol.
|
||||
|
||||
This software was originally developed by Ron Minnich <rminnich@lanl.gov>
|
||||
and Maya Gokhale <maya@lanl.gov>. Additional development by Greg Watson
|
||||
<gwatson@lanl.gov> and most recently Eric Van Hensbergen
|
||||
<ericvh@gmail.com>, Latchesar Ionkov <lucho@ionkov.net> and Russ Cox
|
||||
<rsc@swtch.com>.
|
||||
|
||||
USAGE
|
||||
=====
|
||||
|
||||
For remote file server:
|
||||
|
||||
mount -t 9p 10.10.1.2 /mnt/9
|
||||
|
||||
For Plan 9 From User Space applications (http://swtch.com/plan9)
|
||||
|
||||
mount -t 9p `namespace`/acme /mnt/9 -o proto=unix,uname=$USER
|
||||
|
||||
OPTIONS
=======

  proto=name    select an alternative transport.  Valid options are
                currently:
                        unix - specifying a named pipe mount point
                        tcp  - specifying a normal TCP/IP connection
                        fd   - use passed file descriptors for connection
                               (see rfdno and wfdno)

  uname=name    user name to attempt mount as on the remote server.  The
                server may override or ignore this value.  Certain user
                names may require authentication.

  aname=name    aname specifies the file tree to access when the server is
                offering several exported file systems.

  cache=mode    specifies a caching policy.  By default, no caches are used.
                        loose = no attempts are made at consistency,
                                intended for exclusive, read-only mounts

  debug=n       specifies debug level.  The debug level is a bitmask.
                        0x01 = display verbose error messages
                        0x02 = developer debug (DEBUG_CURRENT)
                        0x04 = display 9p trace
                        0x08 = display VFS trace
                        0x10 = display Marshalling debug
                        0x20 = display RPC debug
                        0x40 = display transport debug
                        0x80 = display allocation debug

  rfdno=n       the file descriptor for reading with proto=fd

  wfdno=n       the file descriptor for writing with proto=fd

  maxdata=n     the number of bytes to use for 9p packet payload (msize)

  port=n        port to connect to on the remote server

  noextend      force legacy mode (no 9p2000.u semantics)

  uid           attempt to mount as a particular uid

  gid           attempt to mount with a particular gid

  afid          security channel - used by Plan 9 authentication protocols

  nodevmap      do not map special files - represent them as normal files.
                This can be used to share devices/named pipes/sockets between
                hosts.  This functionality will be expanded in later versions.
|
||||
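Because debug=n is a bitmask, several trace classes can be enabled at once by OR-ing their bits. The following is a minimal userspace sketch, not code from v9fs: the V9FS_DBG_* names and the helper v9fs_format_opts() are made up for illustration, and only the option keywords and bit values come from the list above. The value is printed in decimal so the sketch does not assume anything about how the option parser treats hexadecimal.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Bit values as listed for debug=n; the enumerator names are hypothetical. */
enum {
	V9FS_DBG_ERROR   = 0x01,	/* verbose error messages */
	V9FS_DBG_CURRENT = 0x02,	/* developer debug */
	V9FS_DBG_9P      = 0x04,	/* 9p trace */
	V9FS_DBG_VFS     = 0x08,	/* VFS trace */
	V9FS_DBG_MARSHAL = 0x10,	/* marshalling debug */
	V9FS_DBG_RPC     = 0x20,	/* RPC debug */
	V9FS_DBG_TRANS   = 0x40,	/* transport debug */
	V9FS_DBG_ALLOC   = 0x80,	/* allocation debug */
};

/*
 * Format an option string for a TCP mount into buf.
 * Returns the string length, or -1 if buf is too small.
 */
static int v9fs_format_opts(char *buf, size_t len, const char *uname,
			    unsigned int port, unsigned int debug)
{
	int n;

	assert(buf != NULL && uname != NULL && strlen(uname) > 0);
	n = snprintf(buf, len, "proto=tcp,port=%u,uname=%s,debug=%u",
		     port, uname, debug);
	if (n < 0 || (size_t)n >= len)
		return -1;	/* output error or truncated */
	return n;
}
```

The resulting string is what would follow -o on the mount command line; for example, V9FS_DBG_ERROR | V9FS_DBG_9P yields debug=5.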
|
||||
RESOURCES
|
||||
=========
|
||||
|
||||
Our current recommendation is to use Inferno (http://www.vitanuova.com/inferno)
|
||||
as the 9p server. You can start a 9p server under Inferno by issuing the
|
||||
following command:
|
||||
; styxlisten -A tcp!*!564 export '#U*'
|
||||
|
||||
The -A specifies an unauthenticated export. The 564 is the port # (you may
|
||||
have to choose a higher port number if running as a normal user). The '#U*'
|
||||
specifies exporting the root of the Linux name space. You may specify a
|
||||
subset of the namespace by extending the path: '#U*'/tmp would just export
|
||||
/tmp. For more information, see the Inferno manual pages covering styxlisten
|
||||
and export.
|
||||
|
||||
A Linux version of the 9p server is now maintained under the npfs project
|
||||
on sourceforge (http://sourceforge.net/projects/npfs). There is also a
|
||||
more stable single-threaded version of the server (named spfs) available from
|
||||
the same CVS repository.
|
||||
|
||||
There are user and developer mailing lists available through the v9fs project
|
||||
on sourceforge (http://sourceforge.net/projects/v9fs).
|
||||
|
||||
News and other information is maintained on SWiK (http://swik.net/v9fs).
|
||||
|
||||
Bug reports may be issued through the kernel.org bugzilla
|
||||
(http://bugzilla.kernel.org)
|
||||
|
||||
For more information on the Plan 9 Operating System check out
|
||||
http://plan9.bell-labs.com/plan9
|
||||
|
||||
For information on Plan 9 from User Space (Plan 9 applications and libraries
|
||||
ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9
|
||||
|
||||
|
||||
STATUS
|
||||
======
|
||||
|
||||
The 2.6 kernel support is working on PPC and x86.
|
||||
|
||||
PLEASE USE THE KERNEL BUGZILLA TO REPORT PROBLEMS. (http://bugzilla.kernel.org)
|
||||
|
||||
176
Documentation/filesystems/Exporting
Normal file
@@ -0,0 +1,176 @@
|
||||
|
||||
Making Filesystems Exportable
|
||||
=============================
|
||||
|
||||
Most filesystem operations require a dentry (or two) as a starting
|
||||
point. Local applications have a reference-counted hold on suitable
|
||||
dentries via open file descriptors or cwd/root. However, remote
|
||||
applications that access a filesystem via a remote filesystem protocol
|
||||
such as NFS may not be able to hold such a reference, and so need a
|
||||
different way to refer to a particular dentry. As the alternative
|
||||
form of reference needs to be stable across renames, truncates, and
|
||||
server-reboot (among other things, though these tend to be the most
|
||||
problematic), there is no simple answer like 'filename'.
|
||||
|
||||
The mechanism discussed here allows each filesystem implementation to
|
||||
specify how to generate an opaque (outside of the filesystem) byte
|
||||
string for any dentry, and how to find an appropriate dentry for any
|
||||
given opaque byte string.
|
||||
This byte string will be called a "filehandle fragment" as it
|
||||
corresponds to part of an NFS filehandle.
|
||||
|
||||
A filesystem which supports the mapping between filehandle fragments
|
||||
and dentries will be termed "exportable".
|
||||
|
||||
|
||||
|
||||
Dcache Issues
|
||||
-------------
|
||||
|
||||
The dcache normally contains a proper prefix of any given filesystem
|
||||
tree. This means that if any filesystem object is in the dcache, then
|
||||
all of the ancestors of that filesystem object are also in the dcache.
|
||||
As normal access is by filename this prefix is created naturally and
|
||||
maintained easily (by each object maintaining a reference count on
|
||||
its parent).
|
||||
|
||||
However when objects are included into the dcache by interpreting a
|
||||
filehandle fragment, there is no automatic creation of a path prefix
|
||||
for the object. This leads to two related but distinct features of
|
||||
the dcache that are not needed for normal filesystem access.
|
||||
|
||||
1/ The dcache must sometimes contain objects that are not part of the
|
||||
proper prefix, i.e. that are not connected to the root.
|
||||
2/ The dcache must be prepared for a newly found (via ->lookup) directory
|
||||
to already have a (non-connected) dentry, and must be able to move
|
||||
that dentry into place (based on the parent and name in the
|
||||
->lookup). This is particularly needed for directories as
|
||||
it is a dcache invariant that directories only have one dentry.
|
||||
|
||||
To implement these features, the dcache has:
|
||||
|
||||
a/ A dentry flag DCACHE_DISCONNECTED which is set on
|
||||
any dentry that might not be part of the proper prefix.
|
||||
This is set when anonymous dentries are created, and cleared when a
|
||||
dentry is noticed to be a child of a dentry which is in the proper
|
||||
prefix.
|
||||
|
||||
b/ A per-superblock list "s_anon" of dentries which are the roots of
|
||||
subtrees that are not in the proper prefix. These dentries, as
|
||||
well as the proper prefix, need to be released at unmount time. As
|
||||
these dentries will not be hashed, they are linked together on the
|
||||
d_hash list_head.
|
||||
|
||||
c/ Helper routines to allocate anonymous dentries, and to help attach
|
||||
loose directory dentries at lookup time. They are:
|
||||
d_alloc_anon(inode) will return a dentry for the given inode.
|
||||
If the inode already has a dentry, one of those is returned.
|
||||
If it doesn't, a new anonymous (IS_ROOT and
|
||||
DCACHE_DISCONNECTED) dentry is allocated and attached.
|
||||
In the case of a directory, care is taken that only one dentry
|
||||
can ever be attached.
|
||||
d_splice_alias(inode, dentry) will make sure that there is a
|
||||
dentry with the same name and parent as the given dentry, and
|
||||
which refers to the given inode.
|
||||
If the inode is a directory and already has a dentry, then that
|
||||
dentry is d_moved over the given dentry.
|
||||
If the passed dentry gets attached, care is taken that this is
|
||||
mutually exclusive to a d_alloc_anon operation.
|
||||
If the passed dentry is used, NULL is returned, else the used
|
||||
dentry is returned. This corresponds to the calling pattern of
|
||||
->lookup.
|
||||
|
||||
|
||||
Filesystem Issues
|
||||
-----------------
|
||||
|
||||
For a filesystem to be exportable it must:
|
||||
|
||||
1/ provide the filehandle fragment routines described below.
|
||||
2/ make sure that d_splice_alias is used rather than d_add
|
||||
when ->lookup finds an inode for a given parent and name.
|
||||
     Typically the ->lookup routine will end:
		if (inode)
			return d_splice_alias(inode, dentry);
		d_add(dentry, inode);
		return NULL;
	}
|
||||
|
||||
|
||||
|
||||
A file system implementation declares that instances of the filesystem
|
||||
are exportable by setting the s_export_op field in the struct
|
||||
super_block. This field must point to a "struct export_operations"
|
||||
struct which could potentially be full of NULLs, though normally at
|
||||
least get_parent will be set.
|
||||
|
||||
The primary operations are decode_fh and encode_fh.
|
||||
decode_fh takes a filehandle fragment and tries to find or create a
|
||||
dentry for the object referred to by the filehandle.
|
||||
encode_fh takes a dentry and creates a filehandle fragment which can
|
||||
later be used to find/create a dentry for the same object.
|
||||
|
||||
decode_fh will probably make use of "find_exported_dentry".
|
||||
This function lives in the "exportfs" module which a filesystem does
|
||||
not need unless it is being exported. So rather than calling
find_exported_dentry directly, each filesystem should call it through
the find_exported_dentry pointer in its export_operations table.
|
||||
This field is set correctly by the exporting agent (e.g. nfsd) when a
|
||||
filesystem is exported, and before any export operations are called.
|
||||
|
||||
find_exported_dentry needs three support functions from the
|
||||
filesystem:
|
||||
get_name. When given a parent dentry and a child dentry, this
|
||||
should find a name in the directory identified by the parent
|
||||
dentry, which leads to the object identified by the child dentry.
|
||||
If no get_name function is supplied, a default implementation is
|
||||
provided which uses vfs_readdir to find potential names, and
|
||||
matches inode numbers to find the correct match.
|
||||
|
||||
get_parent. When given a dentry for a directory, this should return
|
||||
a dentry for the parent. Quite possibly the parent dentry will
|
||||
have been allocated by d_alloc_anon.
|
||||
The default get_parent function just returns an error so any
|
||||
filehandle lookup that requires finding a parent will fail.
|
||||
->lookup("..") is *not* used as a default as it can leave ".."
|
||||
entries in the dcache which are too messy to work with.
|
||||
|
||||
get_dentry. When given an opaque datum, this should find the
|
||||
implied object and create a dentry for it (possibly with
|
||||
d_alloc_anon).
|
||||
The opaque datum is whatever is passed down by the decode_fh
|
||||
function, and is often simply a fragment of the filehandle
|
||||
fragment.
|
||||
decode_fh passes two datums through find_exported_dentry. One that
|
||||
should be used to identify the target object, and one that can be
|
||||
used to identify the object's parent, should that be necessary.
|
||||
The default get_dentry function assumes that the datum contains an
|
||||
inode number and a generation number, and it attempts to get the
|
||||
inode using "iget" and check its validity by matching the
|
||||
generation number. A filesystem should only depend on the default
|
||||
if iget can safely be used this way.
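The strategy of the default get_name described above (read the directory and match inode numbers) can be modelled in user space. This is only a sketch under stated assumptions: find_name_by_ino() is a hypothetical helper, it uses POSIX readdir() and fstatat() rather than the in-kernel vfs_readdir, and it ignores the case of entries living on a different device.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Scan dir_path and copy into name the first entry whose inode number
 * equals ino.  Returns 0 on success, -1 if nothing matched, the name
 * did not fit in len bytes, or the directory could not be opened.
 */
static int find_name_by_ino(const char *dir_path, ino_t ino,
			    char *name, size_t len)
{
	struct dirent *de;
	struct stat st;
	DIR *dir;
	int ret = -1;

	assert(dir_path != NULL && name != NULL && len > 0);
	dir = opendir(dir_path);
	if (!dir)
		return -1;
	while ((de = readdir(dir)) != NULL) {
		/* "." and ".." never name a child object */
		if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
			continue;
		/* stat the entry itself, without following symlinks */
		if (fstatat(dirfd(dir), de->d_name, &st, AT_SYMLINK_NOFOLLOW))
			continue;
		if (st.st_ino == ino && strlen(de->d_name) < len) {
			snprintf(name, len, "%s", de->d_name);
			ret = 0;
			break;
		}
	}
	closedir(dir);
	return ret;
}
```

A linear scan like this is slow on large directories, which is one reason a filesystem that can map a child to its name more directly may prefer to supply its own get_name.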
|
||||
|
||||
If decode_fh and/or encode_fh are left as NULL, then default
|
||||
implementations are used. These defaults are suitable for ext2 and
|
||||
extremely similar filesystems (like ext3).
|
||||
|
||||
The default encode_fh creates a filehandle fragment from the inode
|
||||
number and generation number of the target together with the inode
|
||||
number and generation number of the parent (if the parent is
|
||||
required).
|
||||
|
||||
The default decode_fh extracts the target and parent datums from the
filehandle assuming the format used by the default encode_fh and
passes them to find_exported_dentry.
|
||||
|
||||
|
||||
A filehandle fragment consists of an array of 1 or more 4-byte words,
|
||||
together with a one byte "type".
|
||||
The decode_fh routine should not depend on the stated size that is
|
||||
passed to it. This size may be larger than the original filehandle
|
||||
generated by encode_fh, in which case it will have been padded with
|
||||
nuls. Rather, the encode_fh routine should choose a "type" which
|
||||
indicates to decode_fh how much of the filehandle is valid, and how
|
||||
it should be interpreted.
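The role of the "type" byte can be illustrated with a small userspace model of an inode/generation based fragment. This is a sketch, not the kernel's implementation: struct fh_ident, the FH_TYPE_* values and both function names are invented for the example. The decoder derives the number of valid words from the type and treats the stated size only as an upper bound, so nul padding is harmless.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type values; they are not the kernel's. */
enum { FH_TYPE_INO_GEN = 1, FH_TYPE_INO_GEN_PARENT = 2 };

struct fh_ident {
	uint32_t ino;	/* inode number */
	uint32_t gen;	/* generation number */
};

/*
 * Pack target (and parent, if given) into 4-byte words.  *nwords is the
 * capacity on entry and the words used on return.  Returns the fragment
 * type, or -1 if the buffer is too small.
 */
static int sketch_encode_fh(uint32_t *fh, int *nwords,
			    const struct fh_ident *target,
			    const struct fh_ident *parent)
{
	int need = parent ? 4 : 2;

	assert(fh != NULL && nwords != NULL && target != NULL);
	if (*nwords < need)
		return -1;
	fh[0] = target->ino;
	fh[1] = target->gen;
	if (parent) {
		fh[2] = parent->ino;
		fh[3] = parent->gen;
	}
	*nwords = need;
	return parent ? FH_TYPE_INO_GEN_PARENT : FH_TYPE_INO_GEN;
}

/*
 * Unpack a fragment.  The type decides how many words are valid; nwords
 * may be larger because of padding.  Returns 0 on success, -1 on an
 * unknown type or a buffer shorter than the type requires.
 */
static int sketch_decode_fh(const uint32_t *fh, int nwords, int type,
			    struct fh_ident *target, struct fh_ident *parent,
			    int *has_parent)
{
	int need;

	assert(fh != NULL && target != NULL && parent != NULL &&
	       has_parent != NULL);
	switch (type) {
	case FH_TYPE_INO_GEN:
		need = 2;
		break;
	case FH_TYPE_INO_GEN_PARENT:
		need = 4;
		break;
	default:
		return -1;
	}
	if (nwords < need)
		return -1;
	target->ino = fh[0];
	target->gen = fh[1];
	*has_parent = (type == FH_TYPE_INO_GEN_PARENT);
	if (*has_parent) {
		parent->ino = fh[2];
		parent->gen = fh[3];
	}
	return 0;
}
```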
|
||||
|
||||
|
||||
527
Documentation/filesystems/Locking
Normal file
@@ -0,0 +1,527 @@
|
||||
The text below describes the locking rules for VFS-related methods.
|
||||
It is (believed to be) up-to-date. *Please*, if you change anything in
|
||||
prototypes or locking protocols - update this file. And update the relevant
|
||||
instances in the tree, don't leave that to maintainers of filesystems/devices/
|
||||
etc. At the very least, put the list of dubious cases in the end of this file.
|
||||
Don't turn it into a log - maintainers of out-of-the-tree code are supposed to
|
||||
be able to use diff(1).
|
||||
Things currently missing here: socket operations. Alexey?
|
||||
|
||||
--------------------------- dentry_operations --------------------------
|
||||
prototypes:
|
||||
int (*d_revalidate)(struct dentry *, int);
|
||||
int (*d_hash) (struct dentry *, struct qstr *);
|
||||
int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
|
||||
int (*d_delete)(struct dentry *);
|
||||
void (*d_release)(struct dentry *);
|
||||
void (*d_iput)(struct dentry *, struct inode *);
|
||||
|
||||
locking rules:
	none have BKL
                dcache_lock  rename_lock  ->d_lock  may block
d_revalidate:   no           no           no        yes
d_hash:         no           no           no        yes
d_compare:      no           yes          no        no
d_delete:       yes          no           yes       no
d_release:      no           no           no        yes
d_iput:         no           no           no        yes
|
||||
|
||||
--------------------------- inode_operations ---------------------------
|
||||
prototypes:
|
||||
int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
|
||||
struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
|
||||
int (*link) (struct dentry *,struct inode *,struct dentry *);
|
||||
int (*unlink) (struct inode *,struct dentry *);
|
||||
int (*symlink) (struct inode *,struct dentry *,const char *);
|
||||
int (*mkdir) (struct inode *,struct dentry *,int);
|
||||
int (*rmdir) (struct inode *,struct dentry *);
|
||||
int (*mknod) (struct inode *,struct dentry *,int,dev_t);
|
||||
int (*rename) (struct inode *, struct dentry *,
|
||||
struct inode *, struct dentry *);
|
||||
int (*readlink) (struct dentry *, char __user *,int);
|
||||
int (*follow_link) (struct dentry *, struct nameidata *);
|
||||
void (*truncate) (struct inode *);
|
||||
int (*permission) (struct inode *, int, struct nameidata *);
|
||||
int (*setattr) (struct dentry *, struct iattr *);
|
||||
int (*getattr) (struct vfsmount *, struct dentry *, struct kstat *);
|
||||
int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
|
||||
ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
|
||||
ssize_t (*listxattr) (struct dentry *, char *, size_t);
|
||||
int (*removexattr) (struct dentry *, const char *);
|
||||
|
||||
locking rules:
	all may block, none have BKL
                i_sem(inode)
lookup:         yes
create:         yes
link:           yes (both)
mknod:          yes
symlink:        yes
mkdir:          yes
unlink:         yes (both)
rmdir:          yes (both)      (see below)
rename:         yes (all)       (see below)
readlink:       no
follow_link:    no
truncate:       yes             (see below)
setattr:        yes
permission:     no
getattr:        no
setxattr:       yes
getxattr:       no
listxattr:      no
removexattr:    yes
|
||||
Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_sem on
|
||||
victim.
|
||||
cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
|
||||
->truncate() is never called directly - it's a callback, not a
|
||||
method. It's called by vmtruncate() - library function normally used by
|
||||
->setattr(). Locking information above applies to that call (i.e. is
|
||||
inherited from ->setattr() - vmtruncate() is used when ATTR_SIZE had been
|
||||
passed).
|
||||
|
||||
See Documentation/filesystems/directory-locking for more detailed discussion
|
||||
of the locking scheme for directory operations.
|
||||
|
||||
--------------------------- super_operations ---------------------------
|
||||
prototypes:
|
||||
struct inode *(*alloc_inode)(struct super_block *sb);
|
||||
void (*destroy_inode)(struct inode *);
|
||||
void (*read_inode) (struct inode *);
|
||||
void (*dirty_inode) (struct inode *);
|
||||
int (*write_inode) (struct inode *, int);
|
||||
void (*put_inode) (struct inode *);
|
||||
void (*drop_inode) (struct inode *);
|
||||
void (*delete_inode) (struct inode *);
|
||||
void (*put_super) (struct super_block *);
|
||||
void (*write_super) (struct super_block *);
|
||||
int (*sync_fs)(struct super_block *sb, int wait);
|
||||
void (*write_super_lockfs) (struct super_block *);
|
||||
void (*unlockfs) (struct super_block *);
|
||||
int (*statfs) (struct dentry *, struct kstatfs *);
|
||||
int (*remount_fs) (struct super_block *, int *, char *);
|
||||
void (*clear_inode) (struct inode *);
|
||||
void (*umount_begin) (struct super_block *);
|
||||
int (*show_options)(struct seq_file *, struct vfsmount *);
|
||||
ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
|
||||
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
|
||||
|
||||
locking rules:
	All may block.
                        BKL     s_lock  s_umount
alloc_inode:            no      no      no
destroy_inode:          no
read_inode:             no                              (see below)
dirty_inode:            no                              (must not sleep)
write_inode:            no
put_inode:              no
drop_inode:             no                              !!!inode_lock!!!
delete_inode:           no
put_super:              yes     yes     no
write_super:            no      yes     read
sync_fs:                no      no      read
write_super_lockfs:     ?
unlockfs:               ?
statfs:                 no      no      no
remount_fs:             yes     yes     maybe           (see below)
clear_inode:            no
umount_begin:           yes     no      no
show_options:           no                              (vfsmount->sem)
quota_read:             no      no      no              (see below)
quota_write:            no      no      no              (see below)
|
||||
|
||||
->read_inode() is not a method - it's a callback used in iget().
|
||||
->remount_fs() will have the s_umount lock if it's already mounted.
|
||||
When called from get_sb_single, it does NOT have the s_umount lock.
|
||||
->quota_read() and ->quota_write() functions are both guaranteed to
|
||||
be the only ones operating on the quota file by the quota code (via
|
||||
dqio_sem) (unless an admin really wants to screw up something and
|
||||
writes to quota files with quotas on). For other details about locking
|
||||
see also dquot_operations section.
|
||||
|
||||
--------------------------- file_system_type ---------------------------
|
||||
prototypes:
|
||||
int (*get_sb) (struct file_system_type *, int,
|
||||
const char *, void *, struct vfsmount *);
|
||||
void (*kill_sb) (struct super_block *);
|
||||
locking rules:
                may block       BKL
get_sb          yes             yes
kill_sb         yes             yes
|
||||
|
||||
->get_sb() returns error or 0 with locked superblock attached to the vfsmount
|
||||
(exclusive on ->s_umount).
|
||||
->kill_sb() takes a write-locked superblock, does all shutdown work on it,
|
||||
unlocks and drops the reference.
|
||||
|
||||
--------------------------- address_space_operations --------------------------
|
||||
prototypes:
|
||||
int (*writepage)(struct page *page, struct writeback_control *wbc);
|
||||
int (*readpage)(struct file *, struct page *);
|
||||
int (*sync_page)(struct page *);
|
||||
int (*writepages)(struct address_space *, struct writeback_control *);
|
||||
int (*set_page_dirty)(struct page *page);
|
||||
int (*readpages)(struct file *filp, struct address_space *mapping,
|
||||
struct list_head *pages, unsigned nr_pages);
|
||||
int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
|
||||
int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
|
||||
sector_t (*bmap)(struct address_space *, sector_t);
|
||||
int (*invalidatepage) (struct page *, unsigned long);
|
||||
int (*releasepage) (struct page *, int);
|
||||
int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
|
||||
loff_t offset, unsigned long nr_segs);
|
||||
int (*launder_page) (struct page *);
|
||||
|
||||
locking rules:
	All except set_page_dirty may block

                        BKL     PageLocked(page)
writepage:              no      yes, unlocks (see below)
readpage:               no      yes, unlocks
sync_page:              no      maybe
writepages:             no
set_page_dirty          no      no
readpages:              no
prepare_write:          no      yes
commit_write:           no      yes
bmap:                   yes
invalidatepage:         no      yes
releasepage:            no      yes
direct_IO:              no
launder_page:           no      yes
|
||||
|
||||
->prepare_write(), ->commit_write(), ->sync_page() and ->readpage()
|
||||
may be called from the request handler (/dev/loop).
|
||||
|
||||
->readpage() unlocks the page, either synchronously or via I/O
|
||||
completion.
|
||||
|
||||
->readpages() populates the pagecache with the passed pages and starts
|
||||
I/O against them. They come unlocked upon I/O completion.
|
||||
|
||||
->writepage() is used for two purposes: for "memory cleansing" and for
|
||||
"sync". These are quite different operations and the behaviour may differ
|
||||
depending upon the mode.
|
||||
|
||||
If writepage is called for sync (wbc->sync_mode != WBC_SYNC_NONE) then
|
||||
it *must* start I/O against the page, even if that would involve
|
||||
blocking on in-progress I/O.
|
||||
|
||||
If writepage is called for memory cleansing (sync_mode ==
|
||||
WBC_SYNC_NONE) then its role is to get as much writeout underway as
|
||||
possible. So writepage should try to avoid blocking against
|
||||
currently-in-progress I/O.
|
||||
|
||||
If the filesystem is not called for "sync" and it determines that it
|
||||
would need to block against in-progress I/O to be able to start new I/O
|
||||
against the page the filesystem should redirty the page with
|
||||
redirty_page_for_writepage(), then unlock the page and return zero.
|
||||
This may also be done to avoid internal deadlocks, but rarely.
|
||||
|
||||
If the filesystem is called for sync then it must wait on any
|
||||
in-progress I/O and then start new I/O.
|
||||
|
||||
The filesystem should unlock the page synchronously, before returning to the
|
||||
caller, unless ->writepage() returns special WRITEPAGE_ACTIVATE
|
||||
value. WRITEPAGE_ACTIVATE means that page cannot really be written out
|
||||
currently, and VM should stop calling ->writepage() on this page for some
|
||||
time. VM does this by moving page to the head of the active list, hence the
|
||||
name.
|
||||
|
||||
Unless the filesystem is going to redirty_page_for_writepage(), unlock the page
|
||||
and return zero, writepage *must* run set_page_writeback() against the page,
|
||||
followed by unlocking it. Once set_page_writeback() has been run against the
|
||||
page, write I/O can be submitted and the write I/O completion handler must run
|
||||
end_page_writeback() once the I/O is complete. If no I/O is submitted, the
|
||||
filesystem must run end_page_writeback() against the page before returning from
|
||||
writepage.
|
||||
|
||||
That is: after 2.5.12, pages which are under writeout are *not* locked. Note,
|
||||
if the filesystem needs the page to be locked during writeout, that is ok, too,
|
||||
the page is allowed to be unlocked at any point in time between the calls to
|
||||
set_page_writeback() and end_page_writeback().
|
||||
|
||||
Note, failure to run either redirty_page_for_writepage() or the combination of
|
||||
set_page_writeback()/end_page_writeback() on a page submitted to writepage
|
||||
will leave the page itself marked clean but it will be tagged as dirty in the
|
||||
radix tree. This incoherency can lead to all sorts of hard-to-debug problems
|
||||
in the filesystem like having dirty inodes at umount and losing written data.
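The ->writepage() rules above can be summarised as a small state model. The sketch below is plain userspace C, not kernel code: struct model_page, the MODEL_SYNC_* modes and both functions are stand-ins invented here, with comments naming the kernel call each step is meant to represent. It assumes the page arrives locked with its dirty flag already cleared by the caller.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for the page state discussed above. */
struct model_page {
	bool locked;		/* PageLocked(page) */
	bool dirty;		/* page needs another writeout pass */
	bool writeback;		/* between set/end_page_writeback() */
	bool io_in_progress;	/* earlier I/O on this page still in flight */
};

enum model_sync_mode { MODEL_SYNC_NONE, MODEL_SYNC_ALL };

/* Model of ->writepage(): entered locked, returns 0 with page unlocked. */
static int model_writepage(struct model_page *page, enum model_sync_mode mode)
{
	assert(page->locked);

	if (page->io_in_progress) {
		if (mode == MODEL_SYNC_NONE) {
			/* Memory cleansing must not block: give the page back. */
			page->dirty = true;	/* redirty_page_for_writepage() */
			page->locked = false;	/* unlock_page() */
			return 0;
		}
		/* Sync: wait for the in-progress I/O, then carry on. */
		page->io_in_progress = false;
	}

	page->writeback = true;		/* set_page_writeback() */
	page->locked = false;		/* unlock_page() */
	/* ... write I/O would be submitted here ... */
	return 0;
}

/* Model of the write I/O completion handler. */
static void model_end_io(struct model_page *page)
{
	assert(page->writeback);
	page->writeback = false;	/* end_page_writeback() */
}
```

Every path through model_writepage() either redirties the page or sets writeback before unlocking, which is the invariant the preceding paragraphs require.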
|
||||
|
||||
->sync_page() locking rules are not well-defined - usually it is called
|
||||
with lock on page, but that is not guaranteed. Considering the currently
|
||||
existing instances of this method ->sync_page() itself doesn't look
|
||||
well-defined...
|
||||
|
||||
->writepages() is used for periodic writeback and for syscall-initiated
|
||||
sync operations. The address_space should start I/O against at least
|
||||
*nr_to_write pages. *nr_to_write must be decremented for each page which is
|
||||
written. The address_space implementation may write more (or less) pages
|
||||
than *nr_to_write asks for, but it should try to be reasonably close. If
|
||||
nr_to_write is NULL, all dirty pages must be written.
|
||||
|
||||
writepages should _only_ write pages which are present on
|
||||
mapping->io_pages.
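The *nr_to_write accounting can be sketched as follows. This is a userspace model with made-up names (model_writepages() and a plain array of dirty flags in place of an address_space); it only illustrates the counting rule stated above, including the case where nr_to_write is NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * dirty[i] is true if page i needs writeout.  Write pages until the
 * budget in *nr_to_write is used up, decrementing it once per page
 * written.  A NULL nr_to_write means "write every dirty page".
 * Returns the number of pages written.
 */
static size_t model_writepages(bool *dirty, size_t npages, long *nr_to_write)
{
	size_t written = 0;
	size_t i;

	assert(dirty != NULL);
	for (i = 0; i < npages; i++) {
		if (nr_to_write && *nr_to_write <= 0)
			break;		/* budget exhausted */
		if (!dirty[i])
			continue;	/* clean page, nothing to do */
		dirty[i] = false;	/* stands in for starting I/O */
		written++;
		if (nr_to_write)
			(*nr_to_write)--;
	}
	return written;
}
```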
|
||||
|
||||
->set_page_dirty() is called from various places in the kernel
|
||||
when the target page is marked as needing writeback. It may be called
|
||||
under spinlock (it cannot block) and is sometimes called with the page
|
||||
not locked.
|
||||
|
||||
->bmap() is currently used by legacy ioctl() (FIBMAP) provided by some
|
||||
filesystems and by the swapper. The latter will eventually go away. All
|
||||
instances do not actually need the BKL. Please, keep it that way and don't
|
||||
breed new callers.
|
||||
|
||||
->invalidatepage() is called when the filesystem must attempt to drop
|
||||
some or all of the buffers from the page when it is being truncated. It
|
||||
returns zero on success. If ->invalidatepage is zero, the kernel uses
|
||||
block_invalidatepage() instead.
|
||||
|
||||
->releasepage() is called when the kernel is about to try to drop the
|
||||
buffers from the page in preparation for freeing it. It returns zero to
|
||||
indicate that the buffers are (or may be) freeable. If ->releasepage is zero,
|
||||
the kernel assumes that the fs has no private interest in the buffers.
|
||||
|
||||
->launder_page() may be called prior to releasing a page if
|
||||
it is still found to be dirty. It returns zero if the page was successfully
|
||||
cleaned, or an error value if not. Note that in order to prevent the page
|
||||
getting mapped back in and redirtied, it needs to be kept locked
|
||||
across the entire operation.
|
||||
|
||||
Note: currently almost all instances of address_space methods are
|
||||
using BKL for internal serialization and that's one of the worst sources
|
||||
of contention. Normally they are calling library functions (in fs/buffer.c)
|
||||
and pass foo_get_block() as a callback (on local block-based filesystems,
|
||||
indeed). BKL is not needed for library stuff and is usually taken by
|
||||
foo_get_block(). It's an overkill, since block bitmaps can be protected by
|
||||
internal fs locking and real critical areas are much smaller than the areas
|
||||
filesystems protect now.
|
||||
|
||||
----------------------- file_lock_operations ------------------------------
|
||||
prototypes:
|
||||
void (*fl_insert)(struct file_lock *); /* lock insertion callback */
|
||||
void (*fl_remove)(struct file_lock *); /* lock removal callback */
|
||||
void (*fl_copy_lock)(struct file_lock *, struct file_lock *);
|
||||
void (*fl_release_private)(struct file_lock *);
|
||||
|
||||
|
||||
locking rules:
|
||||
                        BKL     may block
fl_insert:              yes     no
fl_remove:              yes     no
fl_copy_lock:           yes     no
fl_release_private:     yes     yes
|
||||
|
||||
----------------------- lock_manager_operations ---------------------------
|
||||
prototypes:
|
||||
int (*fl_compare_owner)(struct file_lock *, struct file_lock *);
|
||||
void (*fl_notify)(struct file_lock *); /* unblock callback */
|
||||
void (*fl_copy_lock)(struct file_lock *, struct file_lock *);
|
||||
void (*fl_release_private)(struct file_lock *);
|
||||
void (*fl_break)(struct file_lock *); /* break_lease callback */
|
||||
|
||||
locking rules:
|
||||
                        BKL     may block
fl_compare_owner:       yes     no
fl_notify:              yes     no
fl_copy_lock:           yes     no
fl_release_private:     yes     yes
fl_break:               yes     no
|
||||
|
||||
Currently only NFSD and NLM provide instances of this class. None of
them block. If you have out-of-tree instances, please show up. Locking
in that area will change.
|
||||
--------------------------- buffer_head -----------------------------------
|
||||
prototypes:
|
||||
void (*b_end_io)(struct buffer_head *bh, int uptodate);
|
||||
|
||||
locking rules:
|
||||
called from interrupts. In other words, extreme care is needed here.
bh is locked, but that is the only guarantee we have here. Currently only RAID1,
highmem, fs/buffer.c, and fs/ntfs/aops.c provide these. Block devices
call this method upon I/O completion.
|
||||
|
||||
--------------------------- block_device_operations -----------------------
|
||||
prototypes:
|
||||
int (*open) (struct inode *, struct file *);
|
||||
int (*release) (struct inode *, struct file *);
|
||||
int (*ioctl) (struct inode *, struct file *, unsigned, unsigned long);
|
||||
int (*media_changed) (struct gendisk *);
|
||||
int (*revalidate_disk) (struct gendisk *);
|
||||
|
||||
locking rules:
|
||||
                        BKL     bd_sem
open:                   yes     yes
release:                yes     yes
ioctl:                  yes     no
media_changed:          no      no
revalidate_disk:        no      no
|
||||
|
||||
The last two are called only from check_disk_change().
|
||||
|
||||
--------------------------- file_operations -------------------------------
|
||||
prototypes:
|
||||
loff_t (*llseek) (struct file *, loff_t, int);
|
||||
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
|
||||
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
|
||||
ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
|
||||
ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
|
||||
int (*readdir) (struct file *, void *, filldir_t);
|
||||
unsigned int (*poll) (struct file *, struct poll_table_struct *);
|
||||
int (*ioctl) (struct inode *, struct file *, unsigned int,
|
||||
unsigned long);
|
||||
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
|
||||
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
|
||||
int (*mmap) (struct file *, struct vm_area_struct *);
|
||||
int (*open) (struct inode *, struct file *);
|
||||
int (*flush) (struct file *);
|
||||
int (*release) (struct inode *, struct file *);
|
||||
int (*fsync) (struct file *, struct dentry *, int datasync);
|
||||
int (*aio_fsync) (struct kiocb *, int datasync);
|
||||
int (*fasync) (int, struct file *, int);
|
||||
int (*lock) (struct file *, int, struct file_lock *);
|
||||
ssize_t (*readv) (struct file *, const struct iovec *, unsigned long,
|
||||
loff_t *);
|
||||
ssize_t (*writev) (struct file *, const struct iovec *, unsigned long,
|
||||
loff_t *);
|
||||
ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t,
|
||||
void __user *);
|
||||
ssize_t (*sendpage) (struct file *, struct page *, int, size_t,
|
||||
loff_t *, int);
|
||||
unsigned long (*get_unmapped_area)(struct file *, unsigned long,
|
||||
unsigned long, unsigned long, unsigned long);
|
||||
int (*check_flags)(int);
|
||||
int (*dir_notify)(struct file *, unsigned long);
|
||||
};
|
||||
|
||||
locking rules:
|
||||
All except ->poll() may block.
|
||||
BKL
|
||||
llseek: no (see below)
|
||||
read: no
|
||||
aio_read: no
|
||||
write: no
|
||||
aio_write: no
|
||||
readdir: no
|
||||
poll: no
|
||||
ioctl: yes (see below)
|
||||
unlocked_ioctl: no (see below)
|
||||
compat_ioctl: no
|
||||
mmap: no
|
||||
open: maybe (see below)
|
||||
flush: no
|
||||
release: no
|
||||
fsync: no (see below)
|
||||
aio_fsync: no
|
||||
fasync: yes (see below)
|
||||
lock: yes
|
||||
readv: no
|
||||
writev: no
|
||||
sendfile: no
|
||||
sendpage: no
|
||||
get_unmapped_area: no
|
||||
check_flags: no
|
||||
dir_notify: no
|
||||
|
||||
->llseek() locking has moved from llseek to the individual llseek
|
||||
implementations. If your fs is not using generic_file_llseek, you
|
||||
need to acquire and release the appropriate locks in your ->llseek().
|
||||
For many filesystems, it is probably safe to acquire the inode
|
||||
semaphore. Note some filesystems (i.e. remote ones) provide no
|
||||
protection for i_size so you will need to use the BKL.
|
||||
|
||||
->open() locking is in-transit: big lock partially moved into the methods.
|
||||
The only exception is ->open() in the instances of file_operations that never
|
||||
end up in ->i_fop/->proc_fops, i.e. ones that belong to character devices
|
||||
(chrdev_open() takes lock before replacing ->f_op and calling the secondary
method). As soon as we fix the handling of module reference counters all
instances of ->open() will be called without the BKL.
|
||||
|
||||
Note: ext2_release() was *the* source of contention on fs-intensive
|
||||
loads and dropping BKL on ->release() helps to get rid of that (we still
|
||||
grab BKL for cases when we close a file that had been opened r/w, but that
|
||||
can and should be done using the internal locking with smaller critical areas).
|
||||
Current worst offender is ext2_get_block()...
|
||||
|
||||
->fasync() is a mess. This area needs a big cleanup and that will probably
|
||||
affect locking.
|
||||
|
||||
->readdir() and ->ioctl() on directories must be changed. Ideally we would
|
||||
move ->readdir() to inode_operations and use a separate method for directory
|
||||
->ioctl() or kill the latter completely. One of the problems is that for
|
||||
anything that resembles union-mount we won't have a struct file for all
|
||||
components. And there are other reasons why the current interface is a mess...
|
||||
|
||||
->ioctl() on regular files is superseded by the ->unlocked_ioctl() that
|
||||
doesn't take the BKL.
|
||||
|
||||
->read on directories probably must go away - we should just enforce -EISDIR
|
||||
in sys_read() and friends.
|
||||
|
||||
->fsync() has i_sem on inode.
|
||||
|
||||
--------------------------- dquot_operations -------------------------------
|
||||
prototypes:
|
||||
int (*initialize) (struct inode *, int);
|
||||
int (*drop) (struct inode *);
|
||||
int (*alloc_space) (struct inode *, qsize_t, int);
|
||||
int (*alloc_inode) (const struct inode *, unsigned long);
|
||||
int (*free_space) (struct inode *, qsize_t);
|
||||
int (*free_inode) (const struct inode *, unsigned long);
|
||||
int (*transfer) (struct inode *, struct iattr *);
|
||||
int (*write_dquot) (struct dquot *);
|
||||
int (*acquire_dquot) (struct dquot *);
|
||||
int (*release_dquot) (struct dquot *);
|
||||
int (*mark_dirty) (struct dquot *);
|
||||
int (*write_info) (struct super_block *, int);
|
||||
|
||||
These operations are intended to be more or less wrapping functions that ensure
|
||||
a proper locking wrt the filesystem and call the generic quota operations.
|
||||
|
||||
What a filesystem should expect from the generic quota functions:
|
||||
|
||||
                FS recursion    Held locks when called
initialize:     yes             maybe dqonoff_sem
drop:           yes             -
alloc_space:    ->mark_dirty()  -
alloc_inode:    ->mark_dirty()  -
free_space:     ->mark_dirty()  -
free_inode:     ->mark_dirty()  -
transfer:       yes             -
write_dquot:    yes             dqonoff_sem or dqptr_sem
acquire_dquot:  yes             dqonoff_sem or dqptr_sem
release_dquot:  yes             dqonoff_sem or dqptr_sem
mark_dirty:     no              -
write_info:     yes             dqonoff_sem
|
||||
|
||||
FS recursion means calling ->quota_read() and ->quota_write() from superblock
|
||||
operations.
|
||||
|
||||
->alloc_space(), ->alloc_inode(), ->free_space(), ->free_inode() are called
only directly by the filesystem and do not call any fs functions other than
the ->mark_dirty() operation.
|
||||
|
||||
More details about quota locking can be found in fs/dquot.c.
|
||||
|
||||
--------------------------- vm_operations_struct -----------------------------
|
||||
prototypes:
|
||||
void (*open)(struct vm_area_struct*);
|
||||
void (*close)(struct vm_area_struct*);
|
||||
struct page *(*nopage)(struct vm_area_struct*, unsigned long, int *);
|
||||
|
||||
locking rules:
|
||||
                BKL     mmap_sem
open:           no      yes
close:          no      yes
nopage:         no      yes
|
||||
|
||||
================================================================================
|
||||
Dubious stuff
|
||||
|
||||
(if you break something or notice that it is broken and do not fix it yourself
|
||||
- at least put it here)
|
||||
|
||||
ipc/shm.c::shm_delete() - may need BKL.
|
||||
->read() and ->write() in many drivers are (probably) missing BKL.
|
||||
drivers/sgi/char/graphics.c::sgi_graphics_nopage() - may need BKL.
|
||||
57
Documentation/filesystems/adfs.txt
Normal file
@@ -0,0 +1,57 @@
|
||||
Mount options for ADFS
|
||||
----------------------
|
||||
|
||||
  uid=nnn       All files in the partition will be owned by
                user id nnn. Default 0 (root).
  gid=nnn       All files in the partition will be in group
                nnn. Default 0 (root).
  ownmask=nnn   The permission mask for ADFS 'owner' permissions
                will be nnn. Default 0700.
  othmask=nnn   The permission mask for ADFS 'other' permissions
                will be nnn. Default 0077.
|
||||
|
||||
Mapping of ADFS permissions to Linux permissions
|
||||
------------------------------------------------
|
||||
|
||||
ADFS permissions consist of the following:
|
||||
|
||||
Owner read
|
||||
Owner write
|
||||
Other read
|
||||
Other write
|
||||
|
||||
(In older versions, an 'execute' permission did exist, but this
|
||||
does not hold the same meaning as the Linux 'execute' permission
|
||||
and is now obsolete).
|
||||
|
||||
The mapping is performed as follows:
|
||||
|
||||
        Owner read                              -> -r--r--r--
        Owner write                             -> --w--w--w-
        Owner read and filetype UnixExec        -> ---x--x--x
    These are then masked by ownmask, eg 700    -> -rwx------
        Possible owner mode permissions         -> -rwx------

        Other read                              -> -r--r--r--
        Other write                             -> --w--w--w-
        Other read and filetype UnixExec        -> ---x--x--x
    These are then masked by othmask, eg 077    -> ----rwxrwx
        Possible other mode permissions         -> ----rwxrwx
|
||||
|
||||
Hence, with the default masks, if a file is owner read/write, and
|
||||
not a UnixExec filetype, then the permissions will be:
|
||||
|
||||
-rw-------
|
||||
|
||||
However, if the masks were ownmask=0770,othmask=0007, then this would
|
||||
be modified to:
|
||||
-rw-rw----
|
||||
|
||||
There is no restriction on what you can do with these masks. You may
|
||||
wish that either read bits give read access to the file for all, but
|
||||
keep the default write protection (ownmask=0755,othmask=0577):
|
||||
|
||||
-rw-r--r--
|
||||
|
||||
You can therefore tailor the permission translation to whatever you
|
||||
desire the permissions should be under Linux.
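The translation described above can be sketched in a few lines of Python.
This is only a model of the rules as written in this file, not the driver
code; the function and parameter names are made up for the example.

```python
import stat

READ_BITS = 0o444    # -r--r--r--
WRITE_BITS = 0o222   # --w--w--w-
EXEC_BITS = 0o111    # ---x--x--x


def adfs_mode(owner_read, owner_write, other_read, other_write,
              unix_exec=False, ownmask=0o700, othmask=0o077):
    """Map ADFS permission flags to Linux mode bits (sketch of the rules above)."""
    owner = 0
    if owner_read:
        owner |= READ_BITS
        if unix_exec:
            owner |= EXEC_BITS       # owner read and filetype UnixExec
    if owner_write:
        owner |= WRITE_BITS

    other = 0
    if other_read:
        other |= READ_BITS
        if unix_exec:
            other |= EXEC_BITS       # other read and filetype UnixExec
    if other_write:
        other |= WRITE_BITS

    # Each set of bits is masked separately, then the results are combined.
    return (owner & ownmask) | (other & othmask)


def show(mode):
    """Render the mode bits the way ls does for a regular file."""
    return stat.filemode(stat.S_IFREG | mode)
```

For an owner read/write file that is not UnixExec, show(adfs_mode(True, True,
False, False)) gives "-rw-------" with the default masks, and "-rw-rw----"
with ownmask=0o770, othmask=0o007, matching the examples in the text.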
|
||||
219
Documentation/filesystems/affs.txt
Normal file
@@ -0,0 +1,219 @@
|
||||
Overview of Amiga Filesystems
|
||||
=============================
|
||||
|
||||
Not all varieties of the Amiga filesystems are supported for reading and
|
||||
writing. The Amiga currently knows six different filesystems:
|
||||
|
||||
DOS\0   The old or original filesystem, not really suited for
        hard disks and normally not used on them, either.
        Supported read/write.

DOS\1   The original Fast File System. Supported read/write.

DOS\2   The old "international" filesystem. International means that
        a bug has been fixed so that accented ("international") letters
        in file names are case-insensitive, as they ought to be.
        Supported read/write.

DOS\3   The "international" Fast File System. Supported read/write.

DOS\4   The original filesystem with directory cache. The directory
        cache speeds up directory accesses on floppies considerably,
        but slows down file creation/deletion. Doesn't make much
        sense on hard disks. Supported read only.

DOS\5   The Fast File System with directory cache. Supported read only.
|
||||
|
||||
All of the above filesystems allow block sizes from 512 to 32K bytes.
|
||||
Supported block sizes are: 512, 1024, 2048 and 4096 bytes. Larger blocks
|
||||
speed up almost everything at the expense of wasted disk space. The speed
|
||||
gain above 4K seems not really worth the price, so you don't lose too
|
||||
much here, either.
|
||||
|
||||
The muFS (multi user File System) equivalents of the above file systems
|
||||
are supported, too.
|
||||
|
||||
Mount options for the AFFS
|
||||
==========================
|
||||
|
||||
protect         If this option is set, the protection bits cannot be altered.

setuid[=uid]    This sets the owner of all files and directories in the file
                system to uid or the uid of the current user, respectively.

setgid[=gid]    Same as above, but for gid.

mode=mode       Sets the mode flags to the given (octal) value, regardless
                of the original permissions. Directories will get an x
                permission if the corresponding r bit is set.
                This is useful since most of the plain AmigaOS files
                will map to 600.

reserved=num    Sets the number of reserved blocks at the start of the
                partition to num. You should never need this option.
                Default is 2.

root=block      Sets the block number of the root block. This should never
                be necessary.

bs=blksize      Sets the blocksize to blksize. Valid block sizes are 512,
                1024, 2048 and 4096. Like the root option, this should
                never be necessary, as the affs can figure it out itself.

quiet           The file system will not return an error for disallowed
                mode changes.

verbose         The volume name, file system type and block size will
                be written to the syslog when the filesystem is mounted.

mufs            The filesystem is really a muFS, although it doesn't
                identify itself as one. This option is necessary if
                the filesystem wasn't formatted as muFS, but is used
                as one.

prefix=path     Path will be prefixed to every absolute path name of
                symbolic links on an AFFS partition. Default = "/".
                (See below.)

volume=name     When symbolic links with an absolute path are created
                on an AFFS partition, name will be prepended as the
                volume name. Default = "" (empty string).
                (See below.)
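The directory rule of the mode= option above ("an x permission if the
corresponding r bit is set") is plain bit arithmetic. A minimal Python sketch,
with a hypothetical function name, could look like this:

```python
def affs_dir_mode(mode):
    """Add x for user, group and other wherever the matching r bit is set."""
    read_bits = mode & 0o444          # keep only the three r bits
    return mode | (read_bits >> 2)    # r (value 4) shifted down twice is x (value 1)
```

So mode=644 would give directories 755, and mode=600 would give them 700.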
|
||||
|
||||
Handling of the Users/Groups and protection flags
|
||||
=================================================
|
||||
|
||||
Amiga -> Linux:
|
||||
|
||||
The Amiga protection flags RWEDRWEDHSPARWED are handled as follows:
|
||||
|
||||
- R maps to r for user, group and others. On directories, R implies x.
|
||||
|
||||
- If both W and D are allowed, w will be set.
|
||||
|
||||
- E maps to x.
|
||||
|
||||
- H and P are always retained and ignored under Linux.
|
||||
|
||||
- A is always reset when a file is written to.
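The R, W, D and E rules above can be modelled as follows. This is a sketch
under simplifying assumptions: the allowed flags for each class are passed as
strings such as "RWED" (a hypothetical representation, not the on-disk bit
layout), and the H, S, P and A flags are not modelled at all.

```python
def amiga_to_linux_mode(user, group, other, is_dir=False):
    """Map allowed Amiga flags per class to Linux rwx bits (sketch).

    user, group, other: strings of allowed flags, e.g. "RWED" or "R".
    """
    mode = 0
    for shift, flags in ((6, user), (3, group), (0, other)):
        bits = 0
        if "R" in flags:
            bits |= 4
            if is_dir:
                bits |= 1             # on directories, R implies x
        if "W" in flags and "D" in flags:
            bits |= 2                 # w is set only if both W and D are allowed
        if "E" in flags:
            bits |= 1                 # E maps to x
        mode |= bits << shift
    return mode
```

For example, a plain file with "RWD" for the owner and nothing for group and
others maps to 600, and W without D does not grant w.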
|
||||
|
||||
User id and group id will be used unless set[gu]id are given as mount
|
||||
options. Since most of the Amiga file systems are single user systems
|
||||
they will be owned by root. The root directory (the mount point) of the
|
||||
Amiga filesystem will be owned by the user who actually mounts the
|
||||
filesystem (the root directory doesn't have uid/gid fields).
|
||||
|
||||
Linux -> Amiga:
|
||||
|
||||
The Linux rwxrwxrwx file mode is handled as follows:
|
||||
|
||||
- r permission will set R for user, group and others.
|
||||
|
||||
- w permission will set W and D for user, group and others.
|
||||
|
||||
- x permission of the user will set E for plain files.
|
||||
|
||||
- All other flags (suid, sgid, ...) are ignored and will
|
||||
not be retained.
|
||||
|
||||
Newly created files and directories will get the user and group ID
|
||||
of the current user and a mode according to the umask.
|
||||
|
||||
Symbolic links
|
||||
==============
|
||||
|
||||
Although the Amiga and Linux file systems resemble each other, there
|
||||
are some, not always subtle, differences. One of them becomes apparent
|
||||
with symbolic links. While Linux has a file system with exactly one
|
||||
root directory, the Amiga has a separate root directory for each
|
||||
file system (for example, partition, floppy disk, ...). With the Amiga,
|
||||
these entities are called "volumes". They have symbolic names which
|
||||
can be used to access them. Thus, symbolic links can point to a
|
||||
different volume. AFFS turns the volume name into a directory name
|
||||
and prepends the prefix path (see prefix option) to it.
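The translation just described can be sketched in Python. The function name
is made up, the prefix is assumed to end in a slash, and corner cases (such
as an empty volume name) are ignored; the sample values are the ones used in
the example in this section.

```python
def affs_link_target(target, prefix="/"):
    """Turn an Amiga link target such as "User:sc/x" into a Linux path (sketch)."""
    volume, sep, rest = target.partition(":")
    if not sep:
        return target                  # no volume part: leave the link unchanged
    # The volume name becomes a directory name below the prefix path.
    return prefix + volume + "/" + rest
```

With prefix "/amiga/", the target "User:sc/include/dos/dos.h" becomes
"/amiga/User/sc/include/dos/dos.h".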
|
||||
|
||||
Example:
|
||||
You mount all your Amiga partitions under /amiga/<volume> (where
|
||||
<volume> is the name of the volume), and you give the option
|
||||
"prefix=/amiga/" when mounting all your AFFS partitions. (They
|
||||
might be "User", "WB" and "Graphics", the mount points /amiga/User,
|
||||
/amiga/WB and /amiga/Graphics). A symbolic link referring to
|
||||
"User:sc/include/dos/dos.h" will be followed to
|
||||
"/amiga/User/sc/include/dos/dos.h".
|
||||
|
||||
Examples
|
||||
========
|
||||
|
||||
Command line:
|
||||
mount Archive/Amiga/Workbench3.1.adf /mnt -t affs -o loop,verbose
|
||||
mount /dev/sda3 /Amiga -t affs
|
||||
|
||||
/etc/fstab entry:
|
||||
/dev/sdb5 /amiga/Workbench affs noauto,user,exec,verbose 0 0
|
||||
|
||||
IMPORTANT NOTE
|
||||
==============
|
||||
|
||||
If you boot Windows 95 (don't know about 3.x, 98 and NT) while you
|
||||
have an Amiga harddisk connected to your PC, it will overwrite
|
||||
the bytes 0x00dc..0x00df of block 0 with garbage, thus invalidating
|
||||
the Rigid Disk Block. Sheer luck has it that this is an unused
|
||||
area of the RDB, so only the checksum doesn't match anymore.
|
||||
Linux will ignore this garbage and recognize the RDB anyway, but
|
||||
before you connect that drive to your Amiga again, you must
|
||||
restore or repair your RDB. So please do make a backup copy of it
|
||||
before booting Windows!
|
||||
|
||||
If the damage is already done, the following should fix the RDB
|
||||
(where <disk> is the device name).
|
||||
DO AT YOUR OWN RISK:
|
||||
|
||||
dd if=/dev/<disk> of=rdb.tmp count=1
|
||||
cp rdb.tmp rdb.fixed
|
||||
dd if=/dev/zero of=rdb.fixed bs=1 seek=220 count=4
|
||||
dd if=rdb.fixed of=/dev/<disk>
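The offsets in the dd commands above follow from the damaged range:
0x00dc is 220 and the range 0x00dc..0x00df is 4 bytes long. The Python sketch
below applies the same edit to an in-memory copy of block 0. It does not touch
any device, the function name is hypothetical, and it assumes, as the text
does, that the affected area is unused and can be zeroed.

```python
GARBAGE_START = 0x00DC   # 220, first overwritten byte
GARBAGE_END = 0x00DF     # 223, last overwritten byte


def zero_rdb_garbage(block):
    """Return a copy of block 0 with bytes 0xdc..0xdf cleared."""
    fixed = bytearray(block)
    length = GARBAGE_END - GARBAGE_START + 1          # 4 bytes
    fixed[GARBAGE_START:GARBAGE_START + length] = bytes(length)
    return bytes(fixed)
```

Only the four bytes are changed; the rest of the block is copied as is.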
|
||||
|
||||
Bugs, Restrictions, Caveats
|
||||
===========================
|
||||
|
||||
Quite a few things may not work as advertised. Not everything is
|
||||
tested, though several hundred MB have been read and written using
|
||||
this fs. For the most up-to-date list of bugs please consult
|
||||
fs/affs/Changes.
|
||||
|
||||
Filenames are truncated to 30 characters without warning (this
|
||||
can be changed by setting the compile-time option AFFS_NO_TRUNCATE
|
||||
in include/linux/amigaffs.h).
|
||||
|
||||
Case is ignored by the affs in filename matching, but Linux shells
|
||||
do care about the case. Example (with /wb being an affs mounted fs):
|
||||
rm /wb/WRONGCASE
|
||||
will remove /wb/wrongcase, but
|
||||
rm /wb/WR*
|
||||
will not since the names are matched by the shell.
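The difference between the two commands can be shown with a small Python
sketch. The lower() comparison is only an approximation of the affs name
matching (the real filesystem uses its own case folding), and both function
names are made up for the example.

```python
import fnmatch


def affs_name_matches(requested, stored):
    """Case-insensitive comparison, a simplified stand-in for the affs lookup."""
    return requested.lower() == stored.lower()


def shell_glob_matches(pattern, stored):
    """Case-sensitive glob match, as done by the shell before rm is run."""
    return fnmatch.fnmatchcase(stored, pattern)
```

A lookup of "WRONGCASE" finds the stored name "wrongcase", but the pattern
"WR*" does not match it, so the shell never passes that name to rm.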
|
||||
|
||||
The block allocation is designed for hard disk partitions. If more
|
||||
than 1 process writes to a (small) diskette, the blocks are allocated
|
||||
in an ugly way (but the real AFFS doesn't do much better). This
|
||||
is also true when space gets tight.
|
||||
|
||||
You cannot execute programs on an OFS (Old File System), since the
|
||||
program files cannot be memory mapped due to the 488 byte blocks.
|
||||
For the same reason you cannot mount an image on such a filesystem
|
||||
via the loopback device.
|
||||
|
||||
The bitmap valid flag in the root block may not be accurate when the
|
||||
system crashes while an affs partition is mounted. There's currently
|
||||
no way to fix a garbled filesystem without an Amiga (disk validator)
|
||||
or manually (who would do this?). Maybe later.
|
||||
|
||||
If you mount affs partitions on system startup, you may want to tell
|
||||
fsck that the fs should not be checked (place a '0' in the sixth field
|
||||
of /etc/fstab).
|
||||
|
||||
It's not possible to read floppy disks with a normal PC or workstation
|
||||
due to an incompatibility with the Amiga floppy controller.
|
||||
|
||||
If you are interested in an Amiga Emulator for Linux, look at
|
||||
|
||||
http://www.freiburg.linux.de/~uae/
|
||||
155
Documentation/filesystems/afs.txt
Normal file
@@ -0,0 +1,155 @@
|
||||
kAFS: AFS FILESYSTEM
|
||||
====================
|
||||
|
||||
ABOUT
|
||||
=====
|
||||
|
||||
This filesystem provides a fairly simple AFS filesystem driver. It is under
|
||||
development and only provides very basic facilities. It does not yet support
|
||||
the following AFS features:
|
||||
|
||||
(*) Write support.
|
||||
(*) Communications security.
|
||||
(*) Local caching.
|
||||
(*) pioctl() system call.
|
||||
(*) Automatic mounting of embedded mountpoints.
|
||||
|
||||
|
||||
USAGE
|
||||
=====
|
||||
|
||||
When inserting the driver modules the root cell must be specified along with a
|
||||
list of volume location server IP addresses:
|
||||
|
||||
insmod rxrpc.o
|
||||
insmod kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91
|
||||
|
||||
The first module is a driver for the RxRPC remote operation protocol, and the
|
||||
second is the actual filesystem driver for the AFS filesystem.
|
||||
|
||||
Once the module has been loaded, more cells can be added by the following
|
||||
procedure:
|
||||
|
||||
echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells
|
||||
|
||||
Where the parameters to the "add" command are the name of a cell and a list of
|
||||
volume location servers within that cell.
|
||||
|
||||
Filesystems can be mounted anywhere by commands similar to the following:
|
||||
|
||||
mount -t afs "%cambridge.redhat.com:root.afs." /afs
|
||||
mount -t afs "#cambridge.redhat.com:root.cell." /afs/cambridge
|
||||
mount -t afs "#root.afs." /afs
|
||||
mount -t afs "#root.cell." /afs/cambridge
|
||||
|
||||
NB: When using this on Linux 2.4, the mount command has to be different,
|
||||
since the filesystem doesn't have access to the device name argument:
|
||||
|
||||
mount -t afs none /afs -ovol="#root.afs."
|
||||
|
||||
Where the initial character is either a hash or a percent symbol depending on
whether you definitely want a R/W volume (percent) or whether you'd prefer a R/O
volume, but are willing to use a R/W volume instead (hash).
|
||||
|
||||
The name of the volume can be suffixed with ".backup" or ".readonly" to
|
||||
specify connection to only volumes of those types.
|
||||
|
||||
The name of the cell is optional, and if not given during a mount, then the
|
||||
named volume will be looked up in the cell specified during insmod.
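To make the shape of the "device name" concrete, here is a small Python
sketch that splits one into its parts. It is an illustration of the syntax
described above, not the parser used by the driver; the function name and the
returned tuple layout are assumptions, and the trailing dot seen in the
examples is treated as optional.

```python
def parse_afs_device(name):
    """Split an AFS device name into (prefix, cell, volume, vol_type)."""
    if not name or name[0] not in "#%":
        raise ValueError("device name must start with '#' or '%'")
    prefix, rest = name[0], name[1:]

    cell, sep, volume = rest.partition(":")
    if not sep:
        cell, volume = None, rest      # cell omitted: the root cell is used

    vol_type = None
    for suffix in (".backup", ".readonly"):
        if volume.endswith(suffix):
            volume = volume[:-len(suffix)]
            vol_type = suffix[1:]
            break
    if volume.endswith("."):
        volume = volume[:-1]           # drop the trailing dot used in the examples
    return prefix, cell, volume, vol_type
```

For instance, "%cambridge.redhat.com:root.afs." yields the cell
"cambridge.redhat.com" and the volume "root.afs", while "#root.cell." has no
cell part and would be looked up in the root cell.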
|
||||
|
||||
Additional cells can be added through /proc (see later section).
|
||||
|
||||
|
||||
MOUNTPOINTS
|
||||
===========
|
||||
|
||||
AFS has a concept of mountpoints. These are specially formatted symbolic links
|
||||
(of the same form as the "device name" passed to mount). kAFS presents these
|
||||
to the user as directories that have special properties:
|
||||
|
||||
(*) They cannot be listed. Running a program like "ls" on them will incur an
|
||||
EREMOTE error (Object is remote).
|
||||
|
||||
(*) Other objects can't be looked up inside of them. This also incurs an
|
||||
EREMOTE error.
|
||||
|
||||
(*) They can be queried with the readlink() system call, which will return
|
||||
the name of the mountpoint to which they point. The "readlink" program
|
||||
will also work.
|
||||
|
||||
(*) They can be mounted on (which symbolic links can't).
|
||||
|
||||
|
||||
PROC FILESYSTEM
|
||||
===============
|
||||
|
||||
The rxrpc module creates a number of files in various places in the /proc
|
||||
filesystem:
|
||||
|
||||
(*) Firstly, some information files are made available in a directory called
|
||||
"/proc/net/rxrpc/". These list the extant transport endpoint, peer,
|
||||
connection and call records.
|
||||
|
||||
(*) Secondly, some control files are made available in a directory called
|
||||
"/proc/sys/rxrpc/". Currently, all these files can be used for is to
|
||||
turn on various levels of tracing.
|
||||
|
||||
The AFS module creates a "/proc/fs/afs/" directory and populates it:
|
||||
|
||||
(*) A "cells" file that lists cells currently known to the afs module.
|
||||
|
||||
(*) A directory per cell that contains files that list volume location
|
||||
servers, volumes, and active servers known within that cell.
|
||||
|
||||
|
||||
THE CELL DATABASE
|
||||
=================
|
||||
|
||||
The filesystem maintains an internal database of all the cells it knows and
|
||||
the IP addresses of the volume location servers for those cells. The cell to
|
||||
which the computer belongs is added to the database when insmod is performed
|
||||
by the "rootcell=" argument.
|
||||
|
||||
Further cells can be added by commands similar to the following:
|
||||
|
||||
echo add CELLNAME VLADDR[:VLADDR][:VLADDR]... >/proc/fs/afs/cells
|
||||
echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells
|
||||
|
||||
No other cell database operations are available at this time.
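A script that adds cells may want to build the "add" line from a cell name
and a list of addresses. The Python sketch below only formats the string
shown above; the function name is hypothetical, and writing the result to
/proc/fs/afs/cells is left to the caller.

```python
def format_add_cell(cell, vl_addrs):
    """Build an "add CELLNAME VLADDR[:VLADDR]..." line for /proc/fs/afs/cells."""
    if not vl_addrs:
        raise ValueError("at least one volume location server is required")
    return "add {} {}".format(cell, ":".join(vl_addrs))
```

Called with "grand.central.org" and the two addresses from the example, it
produces the same line as the echo command above.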
|
||||
|
||||
|
||||
EXAMPLES
|
||||
========
|
||||
|
||||
Here's what I use to test this. Some of the names and IP addresses are local
|
||||
to my internal DNS. My "root.afs" partition has a mount point within it for
|
||||
some public volumes.
|
||||
|
||||
insmod -S /tmp/rxrpc.o
|
||||
insmod -S /tmp/kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91
|
||||
|
||||
mount -t afs \%root.afs. /afs
|
||||
mount -t afs \%cambridge.redhat.com:root.cell. /afs/cambridge.redhat.com/
|
||||
|
||||
echo add grand.central.org 18.7.14.88:128.2.191.224 > /proc/fs/afs/cells
|
||||
mount -t afs "#grand.central.org:root.cell." /afs/grand.central.org/
|
||||
mount -t afs "#grand.central.org:root.archive." /afs/grand.central.org/archive
|
||||
mount -t afs "#grand.central.org:root.contrib." /afs/grand.central.org/contrib
|
||||
mount -t afs "#grand.central.org:root.doc." /afs/grand.central.org/doc
|
||||
mount -t afs "#grand.central.org:root.project." /afs/grand.central.org/project
|
||||
mount -t afs "#grand.central.org:root.service." /afs/grand.central.org/service
|
||||
mount -t afs "#grand.central.org:root.software." /afs/grand.central.org/software
|
||||
mount -t afs "#grand.central.org:root.user." /afs/grand.central.org/user
|
||||
|
||||
umount /afs/grand.central.org/user
|
||||
umount /afs/grand.central.org/software
|
||||
umount /afs/grand.central.org/service
|
||||
umount /afs/grand.central.org/project
|
||||
umount /afs/grand.central.org/doc
|
||||
umount /afs/grand.central.org/contrib
|
||||
umount /afs/grand.central.org/archive
|
||||
umount /afs/grand.central.org
|
||||
umount /afs/cambridge.redhat.com
|
||||
umount /afs
|
||||
rmmod kafs
|
||||
rmmod rxrpc
|
||||
118
Documentation/filesystems/automount-support.txt
Normal file
@@ -0,0 +1,118 @@
|
||||
Support is available for filesystems that wish to support automounting (such
|
||||
as kAFS which can be found in fs/afs/). This facility includes allowing
|
||||
in-kernel mounts to be performed and mountpoint degradation to be
|
||||
requested. The latter can also be requested by userspace.
|
||||
|
||||
|
||||
======================
|
||||
IN-KERNEL AUTOMOUNTING
|
||||
======================
|
||||
|
||||
A filesystem can now mount another filesystem on one of its directories by the
|
||||
following procedure:
|
||||
|
||||
(1) Give the directory a follow_link() operation.
|
||||
|
||||
When the directory is accessed, the follow_link op will be called, and
|
||||
it will be provided with the location of the mountpoint in the nameidata
|
||||
structure (vfsmount and dentry).
|
||||
|
||||
(2) Have the follow_link() op do the following steps:
|
||||
|
||||
(a) Call vfs_kern_mount() to call the appropriate filesystem to set up a
|
||||
superblock and gain a vfsmount structure representing it.
|
||||
|
||||
(b) Copy the nameidata provided as an argument and substitute the dentry
|
||||
argument into the copy.
|
||||
|
||||
(c) Call do_add_mount() to install the new vfsmount into the namespace's
|
||||
mountpoint tree, thus making it accessible to userspace. Use the
|
||||
nameidata set up in (b) as the destination.
|
||||
|
||||
If the mountpoint will be automatically expired, then do_add_mount()
|
||||
should also be given the location of an expiration list (see further
|
||||
down).
|
||||
|
||||
(d) Release the path in the nameidata argument and substitute in the new
|
||||
vfsmount and its root dentry. The ref counts on these will need
|
||||
incrementing.
|
||||
|
||||
Then from userspace, you can just do something like:
|
||||
|
||||
[root@andromeda root]# mount -t afs \#root.afs. /afs
|
||||
[root@andromeda root]# ls /afs
|
||||
asd cambridge cambridge.redhat.com grand.central.org
|
||||
[root@andromeda root]# ls /afs/cambridge
|
||||
afsdoc
|
||||
[root@andromeda root]# ls /afs/cambridge/afsdoc/
|
||||
ChangeLog html LICENSE pdf RELNOTES-1.2.2
|
||||
|
||||
And then if you look in the mountpoint catalogue, you'll see something like:
|
||||
|
||||
[root@andromeda root]# cat /proc/mounts
|
||||
...
|
||||
#root.afs. /afs afs rw 0 0
|
||||
#root.cell. /afs/cambridge.redhat.com afs rw 0 0
|
||||
#afsdoc. /afs/cambridge.redhat.com/afsdoc afs rw 0 0
|
||||
|
||||
|
||||
===========================
|
||||
AUTOMATIC MOUNTPOINT EXPIRY
|
||||
===========================
|
||||
|
||||
Automatic expiration of mountpoints is easy, provided you've mounted the
|
||||
mountpoint to be expired in the automounting procedure outlined above.
|
||||
|
||||
To do expiration, you need to follow these steps:
|
||||
|
||||
(3) Create at least one list off which the vfsmounts to be expired can be
|
||||
hung. Access to this list will be governed by the vfsmount_lock.
|
||||
|
||||
(4) In step (2c) above, the call to do_add_mount() should be provided with a
|
||||
pointer to this list. It will hang the vfsmount off of it if it succeeds.
|
||||
|
||||
(5) When you want mountpoints to be expired, call mark_mounts_for_expiry()
|
||||
with a pointer to this list. This will process the list, marking every
|
||||
vfsmount thereon for potential expiry on the next call.
|
||||
|
||||
If a vfsmount was already flagged for expiry, and if its usage count is 1
|
||||
(it's only referenced by its parent vfsmount), then it will be deleted
|
||||
from the namespace and thrown away (effectively unmounted).
|
||||
|
||||
It may prove simplest to call this at regular intervals, using
|
||||
some sort of timed event to drive it.
|
||||
|
||||
The expiration flag is cleared by calls to mntput. This means that expiration
|
||||
will only happen on the second expiration request after the last time the
|
||||
mountpoint was accessed.
|
||||
|
||||
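The two-pass behaviour described above can be illustrated with a small
userspace model. This is only a sketch and not kernel code: struct mnt_model,
model_touch() and model_expiry_pass() are hypothetical names, and the model
ignores locking and namespaces entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a vfsmount on an expiration list. */
struct mnt_model {
	int  usage;        /* 1 means "only referenced by the parent"  */
	bool expiry_mark;  /* set by one pass, checked by the next     */
	bool mounted;      /* false once the model has "unmounted" it  */
};

/* Any access clears the mark, as the text says calls to mntput do. */
static void model_touch(struct mnt_model *m)
{
	m->expiry_mark = false;
}

/*
 * One expiry pass over an array of mounts.  A mount that was already
 * marked and is idle is reaped; any other mounted entry is marked so
 * that a later pass may reap it.  Returns the number of mounts reaped.
 */
static size_t model_expiry_pass(struct mnt_model *list, size_t n)
{
	size_t reaped = 0;

	for (size_t i = 0; i < n; i++) {
		struct mnt_model *m = &list[i];

		if (!m->mounted)
			continue;
		if (m->expiry_mark && m->usage == 1) {
			m->mounted = false;
			reaped++;
		} else {
			m->expiry_mark = true;
		}
	}
	return reaped;
}
```

In this model a mount that is touched between two passes survives the second
pass, because the touch clears the mark set by the first one.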
If a mountpoint is moved, it gets removed from the expiration list. If a bind
|
||||
mount is made on an expirable mount, the new vfsmount will not be on the
|
||||
expiration list and will not expire.
|
||||
|
||||
If a namespace is copied, all mountpoints contained therein will be copied,
|
||||
and the copies of those that are on an expiration list will be added to the
|
||||
same expiration list.
|
||||
|
||||
|
||||
=======================
|
||||
USERSPACE DRIVEN EXPIRY
|
||||
=======================
|
||||
|
||||
As an alternative, it is possible for userspace to request expiry of any
|
||||
mountpoint (though some will be rejected - the current process's idea of the
|
||||
rootfs for example). It does this by passing the MNT_EXPIRE flag to
|
||||
umount(). This flag is considered incompatible with MNT_FORCE and MNT_DETACH.
|
||||
|
||||
If the mountpoint in question is referenced by something other than
|
||||
umount() or its parent mountpoint, an EBUSY error will be returned and the
|
||||
mountpoint will not be marked for expiration or unmounted.
|
||||
|
||||
If the mountpoint was not already marked for expiry at that time, an EAGAIN
|
||||
error will be given and it won't be unmounted.
|
||||
|
||||
Otherwise if it was already marked and it wasn't referenced, unmounting will
|
||||
take place as usual.
|
||||
|
||||
Again, the expiration flag is cleared every time anything other than umount()
|
||||
looks at a mountpoint.
|
||||
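The error cases above can be handled from a userspace program roughly as
follows. This is a minimal sketch, assuming the glibc umount2() wrapper from
<sys/mount.h>; the enum and the two helper functions are illustrative names
and not part of any existing interface.

```c
#include <errno.h>
#include <sys/mount.h>

/* Possible outcomes of one MNT_EXPIRE request (names are illustrative). */
enum expire_result {
	EXPIRE_UNMOUNTED,  /* was already marked and idle: now unmounted  */
	EXPIRE_MARKED,     /* EAGAIN: marked now, ask again later         */
	EXPIRE_BUSY,       /* EBUSY: in use, neither marked nor unmounted */
	EXPIRE_ERROR       /* anything else, e.g. EPERM or EINVAL         */
};

/* Map the return value and errno of umount2() onto the cases above. */
static enum expire_result classify_expire(int ret, int err)
{
	if (ret == 0)
		return EXPIRE_UNMOUNTED;
	if (err == EAGAIN)
		return EXPIRE_MARKED;
	if (err == EBUSY)
		return EXPIRE_BUSY;
	return EXPIRE_ERROR;
}

/* Issue one expiry request for the mount at 'path'.  Needs privilege. */
static enum expire_result request_expiry(const char *path)
{
	int ret = umount2(path, MNT_EXPIRE);

	return classify_expire(ret, ret == 0 ? 0 : errno);
}
```

A caller that wants an idle mountpoint gone would call request_expiry()
twice with some delay in between: the first call normally yields
EXPIRE_MARKED and the second, if nothing touched the mountpoint in the
meantime, EXPIRE_UNMOUNTED.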
117
Documentation/filesystems/befs.txt
Normal file
@@ -0,0 +1,117 @@
|
||||
BeOS filesystem for Linux
|
||||
|
||||
Document last updated: Dec 6, 2001
|
||||
|
||||
WARNING
|
||||
=======
|
||||
Make sure you understand that this is alpha software. This means that the
|
||||
implementation is neither complete nor well-tested.
|
||||
|
||||
I DISCLAIM ALL RESPONSIBILITY FOR ANY POSSIBLE BAD EFFECTS OF THIS CODE!
|
||||
|
||||
LICENSE
|
||||
=======
|
||||
This software is covered by the GNU General Public License.
|
||||
See the file COPYING for the complete text of the license.
|
||||
Or the GNU website: <http://www.gnu.org/licenses/licenses.html>
|
||||
|
||||
AUTHOR
|
||||
======
|
||||
The largest part of the code was written by Will Dyson <will_dyson@pobox.com>.
|
||||
He has been working on the code since Aug 13, 2001. See the changelog for
|
||||
details.
|
||||
|
||||
Original Author: Makoto Kato <m_kato@ga2.so-net.ne.jp>
|
||||
His original code can still be found at:
|
||||
<http://hp.vector.co.jp/authors/VA008030/bfs/>
|
||||
Does anyone know of a more current email address for Makoto? He doesn't
|
||||
respond to the address given above...
|
||||
|
||||
Current maintainer: Sergey S. Kostyliov <rathamahata@php4.ru>
|
||||
|
||||
WHAT IS THIS DRIVER?
|
||||
====================
|
||||
This module implements the native filesystem of BeOS <http://www.be.com/>
|
||||
for the linux 2.4.1 and later kernels. Currently it is a read-only
|
||||
implementation.
|
||||
|
||||
Which is it, BFS or BEFS?
|
||||
=========================
|
||||
Be, Inc said, "BeOS Filesystem is officially called BFS, not BeFS".
|
||||
But the UnixWare Boot Filesystem is called bfs, too, and it is already in
the kernel. Because of this naming conflict, on Linux the BeOS
filesystem is called befs.
|
||||
|
||||
HOW TO INSTALL
|
||||
==============
|
||||
step 1. Install the BeFS patch into the source code tree of linux.
|
||||
|
||||
Apply the patchfile to your kernel source tree.
|
||||
Assuming that your kernel source is in /foo/bar/linux and the patchfile
|
||||
is called patch-befs-xxx, you would do the following:
|
||||
|
||||
cd /foo/bar/linux
|
||||
patch -p1 < /path/to/patch-befs-xxx
|
||||
|
||||
If the patching step fails (i.e. there are rejected hunks), you can try to
|
||||
figure it out yourself (it shouldn't be hard), or mail the maintainer
|
||||
(Will Dyson <will_dyson@pobox.com>) for help.
|
||||
|
||||
step 2. Configuration & make kernel
|
||||
|
||||
The linux kernel has many compile-time options. Most of them are beyond the
|
||||
scope of this document. I suggest the Kernel-HOWTO document as a good general
|
||||
reference on this topic. <http://www.linux.com/howto/Kernel-HOWTO.html>
|
||||
|
||||
However, to use the BeFS module, you must enable it at configure time.
|
||||
|
||||
cd /foo/bar/linux
|
||||
make menuconfig (or xconfig)
|
||||
|
||||
The BeFS module is not a standard part of the linux kernel, so you must first
|
||||
enable support for experimental code under the "Code maturity level" menu.
|
||||
|
||||
Then, under the "Filesystems" menu will be an option called "BeFS
|
||||
filesystem (experimental)", or something like that. Enable that option
|
||||
(it is fine to make it a module).
|
||||
|
||||
Save your kernel configuration and then build your kernel.
|
||||
|
||||
step 3. Install
|
||||
|
||||
See the kernel howto <http://www.linux.com/howto/Kernel-HOWTO.html> for
|
||||
instructions on this critical step.
|
||||
|
||||
USING BFS
|
||||
=========
|
||||
To use the BeOS filesystem, use filesystem type 'befs'.
|
||||
|
||||
Example:
|
||||
mount -t befs /dev/fd0 /beos
|
||||
|
||||
MOUNT OPTIONS
|
||||
=============
|
||||
uid=nnn All files in the partition will be owned by user id nnn.
|
||||
gid=nnn All files in the partition will be in group nnn.
|
||||
iocharset=xxx Use xxx as the name of the NLS translation table.
|
||||
debug The driver will output debugging information to the syslog.
|
||||
|
||||
HOW TO GET LATEST VERSION
|
||||
==========================
|
||||
|
||||
The latest version is currently available at:
|
||||
<http://befs-driver.sourceforge.net/>
|
||||
|
||||
ANY KNOWN BUGS?
|
||||
===============
|
||||
As of Jan 20, 2002:
|
||||
|
||||
None
|
||||
|
||||
SPECIAL THANKS
|
||||
==============
|
||||
Dominic Giampaolo ... Writing "Practical File System Design with the Be File System"
|
||||
Hiroyuki Yamada ... Testing LinuxPPC.
|
||||
|
||||
|
||||
|
||||
57
Documentation/filesystems/bfs.txt
Normal file
@@ -0,0 +1,57 @@
|
||||
BFS FILESYSTEM FOR LINUX
|
||||
========================
|
||||
|
||||
The BFS filesystem is used by SCO UnixWare OS for the /stand slice, which
|
||||
usually contains the kernel image and a few other files required for the
|
||||
boot process.
|
||||
|
||||
In order to access the /stand partition under Linux you obviously need to
|
||||
know the partition number and the kernel must support UnixWare disk slices
|
||||
(CONFIG_UNIXWARE_DISKLABEL config option). However BFS support does not
|
||||
depend on having UnixWare disklabel support because one can also mount
|
||||
BFS filesystem via loopback:
|
||||
|
||||
# losetup /dev/loop0 stand.img
|
||||
# mount -t bfs /dev/loop0 /mnt/stand
|
||||
|
||||
where stand.img is a file containing the image of BFS filesystem.
|
||||
When you have finished using it and have unmounted it, you also need to
deallocate the /dev/loop0 device with:
|
||||
|
||||
# losetup -d /dev/loop0
|
||||
|
||||
You can simplify mounting by just typing:
|
||||
|
||||
# mount -t bfs -o loop stand.img /mnt/stand
|
||||
|
||||
This will allocate the first available loopback device (and load the loop.o
|
||||
kernel module if necessary) automatically. If the loopback driver is not
|
||||
loaded automatically, make sure that your kernel is compiled with kmod
|
||||
support (CONFIG_KMOD) enabled. Beware that umount will not
|
||||
deallocate /dev/loopN device if /etc/mtab file on your system is a
|
||||
symbolic link to /proc/mounts. You will need to do it manually using
|
||||
"-d" switch of losetup(8). Read losetup(8) manpage for more info.
|
||||
|
||||
To create the BFS image under UnixWare you need to find out first which
|
||||
slice contains it. The command prtvtoc(1M) is your friend:
|
||||
|
||||
# prtvtoc /dev/rdsk/c0b0t0d0s0
|
||||
|
||||
(assuming your root disk is on target=0, lun=0, bus=0, controller=0). Then you
|
||||
look for the slice with tag "STAND", which is usually slice 10. With this
|
||||
information you can use dd(1) to create the BFS image:
|
||||
|
||||
# umount /stand
|
||||
# dd if=/dev/rdsk/c0b0t0d0sa of=stand.img bs=512
|
||||
|
||||
Just in case, you can verify that you have done the right thing by checking
|
||||
the magic number:
|
||||
|
||||
# od -Ad -tx4 stand.img | more
|
||||
|
||||
The first 4 bytes should be 0x1badface.
|
||||
|
||||
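The same check can be done programmatically. The sketch below assumes that
the magic number is stored little-endian in the image, which is how it reads
back with od -tx4 on an x86 host; le32_decode() and image_has_bfs_magic() are
hypothetical helper names.

```c
#include <stdint.h>
#include <stdio.h>

#define BFS_MAGIC 0x1BADFACEu

/* Decode four bytes as a little-endian 32-bit value on any host. */
static uint32_t le32_decode(const unsigned char b[4])
{
	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Return 1 if the file starts with the BFS magic, 0 if not, -1 on error. */
static int image_has_bfs_magic(const char *path)
{
	unsigned char head[4];
	FILE *f = fopen(path, "rb");
	size_t got;

	if (!f)
		return -1;
	got = fread(head, 1, sizeof(head), f);
	fclose(f);
	if (got != sizeof(head))
		return -1;
	return le32_decode(head) == BFS_MAGIC;
}
```

Calling image_has_bfs_magic("stand.img") before mounting gives an early
warning if dd(1) was pointed at the wrong slice.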
If you have any patches, questions or suggestions regarding this BFS
|
||||
implementation please contact the author:
|
||||
|
||||
Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
|
||||
51
Documentation/filesystems/cifs.txt
Normal file
@@ -0,0 +1,51 @@
|
||||
This is the client VFS module for the Common Internet File System
|
||||
(CIFS) protocol which is the successor to the Server Message Block
|
||||
(SMB) protocol, the native file sharing mechanism for most early
|
||||
PC operating systems. CIFS is fully supported by current network
|
||||
file servers such as Windows 2000, Windows 2003 (including
|
||||
Windows XP) as well as by Samba (which provides excellent CIFS
|
||||
server support for Linux and many other operating systems), so
|
||||
this network filesystem client can mount to a wide variety of
|
||||
servers. The smbfs module should be used instead of this cifs module
|
||||
for mounting to older SMB servers such as OS/2. The smbfs and cifs
|
||||
modules can coexist and do not conflict. The CIFS VFS filesystem
|
||||
module is designed to work well with servers that implement the
|
||||
newer versions (dialects) of the SMB/CIFS protocol such as Samba,
|
||||
the program written by Andrew Tridgell that turns any Unix host
|
||||
into a SMB/CIFS file server.
|
||||
|
||||
The intent of this module is to provide the most advanced network
|
||||
file system function for CIFS compliant servers, including better
|
||||
POSIX compliance, secure per-user session establishment, high
|
||||
performance safe distributed caching (oplock), optional packet
|
||||
signing, large files, Unicode support and other internationalization
|
||||
improvements. Since both Samba server and this filesystem client support
|
||||
the CIFS Unix extensions, the combination can provide a reasonable
|
||||
alternative to NFSv4 for fileserving in some Linux to Linux environments,
|
||||
not just in Linux to Windows environments.
|
||||
|
||||
This filesystem has an optional mount utility (mount.cifs) that can
|
||||
be obtained from the project page and installed in the path in the same
|
||||
directory with the other mount helpers (such as mount.smbfs).
|
||||
Mounting using the cifs filesystem without installing the mount helper
|
||||
requires specifying the server's ip address.
|
||||
|
||||
For Linux 2.4:
|
||||
mount //anything/here /mnt_target -o
|
||||
user=username,pass=password,unc=//ip_address_of_server/sharename
|
||||
|
||||
For Linux 2.5:
|
||||
mount //ip_address_of_server/sharename /mnt_target -o user=username,pass=password
|
||||
|
||||
|
||||
For more information on the module see the project page at
|
||||
|
||||
http://us1.samba.org/samba/Linux_CIFS_client.html
|
||||
|
||||
For more information on CIFS see:
|
||||
|
||||
http://www.snia.org/tech_activities/CIFS
|
||||
|
||||
or the Samba site:
|
||||
|
||||
http://www.samba.org
|
||||
1673
Documentation/filesystems/coda.txt
Normal file
File diff suppressed because it is too large
434
Documentation/filesystems/configfs/configfs.txt
Normal file
@@ -0,0 +1,434 @@
|
||||
|
||||
configfs - Userspace-driven kernel object configuration.
|
||||
|
||||
Joel Becker <joel.becker@oracle.com>
|
||||
|
||||
Updated: 31 March 2005
|
||||
|
||||
Copyright (c) 2005 Oracle Corporation,
|
||||
Joel Becker <joel.becker@oracle.com>
|
||||
|
||||
|
||||
[What is configfs?]
|
||||
|
||||
configfs is a ram-based filesystem that provides the converse of
|
||||
sysfs's functionality. Where sysfs is a filesystem-based view of
|
||||
kernel objects, configfs is a filesystem-based manager of kernel
|
||||
objects, or config_items.
|
||||
|
||||
With sysfs, an object is created in kernel (for example, when a device
|
||||
is discovered) and it is registered with sysfs. Its attributes then
|
||||
appear in sysfs, allowing userspace to read the attributes via
|
||||
readdir(3)/read(2). It may allow some attributes to be modified via
|
||||
write(2). The important point is that the object is created and
|
||||
destroyed in kernel, the kernel controls the lifecycle of the sysfs
|
||||
representation, and sysfs is merely a window on all this.
|
||||
|
||||
A configfs config_item is created via an explicit userspace operation:
|
||||
mkdir(2). It is destroyed via rmdir(2). The attributes appear at
|
||||
mkdir(2) time, and can be read or modified via read(2) and write(2).
|
||||
As with sysfs, readdir(3) queries the list of items and/or attributes.
|
||||
symlink(2) can be used to group items together. Unlike sysfs, the
|
||||
lifetime of the representation is completely driven by userspace. The
|
||||
kernel modules backing the items must respond to this.
|
||||
|
||||
Both sysfs and configfs can and should exist together on the same
|
||||
system. One is not a replacement for the other.
|
||||
|
||||
[Using configfs]
|
||||
|
||||
configfs can be compiled as a module or into the kernel. You can access
|
||||
it by doing
|
||||
|
||||
mount -t configfs none /config
|
||||
|
||||
The configfs tree will be empty unless client modules are also loaded.
|
||||
These are modules that register their item types with configfs as
|
||||
subsystems. Once a client subsystem is loaded, it will appear as a
|
||||
subdirectory (or more than one) under /config. Like sysfs, the
|
||||
configfs tree is always there, whether mounted on /config or not.
|
||||
|
||||
An item is created via mkdir(2). The item's attributes will also
|
||||
appear at this time. readdir(3) can determine what the attributes are,
|
||||
read(2) can query their default values, and write(2) can store new
|
||||
values. Like sysfs, attributes should be ASCII text files, preferably
|
||||
with only one value per file. The same efficiency caveats from sysfs
|
||||
apply. Don't mix more than one attribute in one attribute file.
|
||||
|
||||
Like sysfs, configfs expects write(2) to store the entire buffer at
|
||||
once. When writing to configfs attributes, userspace processes should
|
||||
first read the entire file, modify the portions they wish to change, and
|
||||
then write the entire buffer back. Attribute files have a maximum size
|
||||
of one page (PAGE_SIZE, 4096 on i386).
|
||||
|
||||
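The whole-buffer rule can be followed from a C program roughly like this.
It is a minimal userspace sketch: attr_store_whole() and attr_show_whole()
are hypothetical helpers, and the page size is taken from sysconf(3) rather
than assumed to be 4096.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Store 'value' into the attribute file at 'path' with a single write(2).
 * Returns 0 on success, -1 if the value is larger than one page, the file
 * cannot be opened, or only part of the buffer was accepted.
 */
static int attr_store_whole(const char *path, const char *value)
{
	size_t len = strlen(value);
	long page = sysconf(_SC_PAGESIZE);
	ssize_t done;
	int fd;

	if (page <= 0 || len > (size_t)page)
		return -1;
	fd = open(path, O_WRONLY | O_TRUNC);
	if (fd < 0)
		return -1;
	done = write(fd, value, len);
	if (close(fd) != 0)
		return -1;
	return (done == (ssize_t)len) ? 0 : -1;
}

/*
 * Read the whole attribute into 'buf' (at most size - 1 bytes) and
 * NUL-terminate it.  Returns the number of bytes read, or -1 on error.
 */
static ssize_t attr_show_whole(const char *path, char *buf, size_t size)
{
	ssize_t got;
	int fd;

	if (size == 0)
		return -1;
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	got = read(fd, buf, size - 1);
	close(fd);
	if (got < 0)
		return -1;
	buf[got] = '\0';
	return got;
}
```

A tool that changes part of an attribute would call attr_show_whole(), edit
the buffer in memory, and hand the complete result to attr_store_whole().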
When an item needs to be destroyed, remove it with rmdir(2). An
|
||||
item cannot be destroyed if any other item has a link to it (via
|
||||
symlink(2)). Links can be removed via unlink(2).
|
||||
|
||||
[Configuring FakeNBD: an Example]
|
||||
|
||||
Imagine there's a Network Block Device (NBD) driver that allows you to
|
||||
access remote block devices. Call it FakeNBD. FakeNBD uses configfs
|
||||
for its configuration. Obviously, there will be a nice program that
|
||||
sysadmins use to configure FakeNBD, but somehow that program has to tell
|
||||
the driver about it. Here's where configfs comes in.
|
||||
|
||||
When the FakeNBD driver is loaded, it registers itself with configfs.
|
||||
readdir(3) sees this just fine:
|
||||
|
||||
# ls /config
|
||||
fakenbd
|
||||
|
||||
A fakenbd connection can be created with mkdir(2). The name is
|
||||
arbitrary, but likely the tool will make some use of the name. Perhaps
|
||||
it is a uuid or a disk name:
|
||||
|
||||
# mkdir /config/fakenbd/disk1
|
||||
# ls /config/fakenbd/disk1
|
||||
target device rw
|
||||
|
||||
The target attribute contains the IP address of the server FakeNBD will
|
||||
connect to. The device attribute is the device on the server.
|
||||
Predictably, the rw attribute determines whether the connection is
|
||||
read-only or read-write.
|
||||
|
||||
# echo 10.0.0.1 > /config/fakenbd/disk1/target
|
||||
# echo /dev/sda1 > /config/fakenbd/disk1/device
|
||||
# echo 1 > /config/fakenbd/disk1/rw
|
||||
|
||||
That's it. That's all there is. Now the device is configured, via the
|
||||
shell no less.
|
||||
|
||||
[Coding With configfs]
|
||||
|
||||
Every object in configfs is a config_item. A config_item reflects an
|
||||
object in the subsystem. It has attributes that match values on that
|
||||
object. configfs handles the filesystem representation of that object
|
||||
and its attributes, allowing the subsystem to ignore all but the
|
||||
basic show/store interaction.
|
||||
|
||||
Items are created and destroyed inside a config_group. A group is a
|
||||
collection of items that share the same attributes and operations.
|
||||
Items are created by mkdir(2) and removed by rmdir(2), but configfs
|
||||
handles that. The group has a set of operations to perform these tasks.
|
||||
|
||||
A subsystem is the top level of a client module. During initialization,
|
||||
the client module registers the subsystem with configfs; the subsystem then
|
||||
appears as a directory at the top of the configfs filesystem. A
|
||||
subsystem is also a config_group, and can do everything a config_group
|
||||
can.
|
||||
|
||||
[struct config_item]
|
||||
|
||||
	struct config_item {
		char                    *ci_name;
		char                    ci_namebuf[UOBJ_NAME_LEN];
		struct kref             ci_kref;
		struct list_head        ci_entry;
		struct config_item      *ci_parent;
		struct config_group     *ci_group;
		struct config_item_type *ci_type;
		struct dentry           *ci_dentry;
	};

	void config_item_init(struct config_item *);
	void config_item_init_type_name(struct config_item *,
					const char *name,
					struct config_item_type *type);
	struct config_item *config_item_get(struct config_item *);
	void config_item_put(struct config_item *);
|
||||
|
||||
Generally, struct config_item is embedded in a container structure, a
|
||||
structure that actually represents what the subsystem is doing. The
|
||||
config_item portion of that structure is how the object interacts with
|
||||
configfs.
|
||||
|
||||
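The embedding pattern described above can be sketched in plain C. The types
below are userspace stand-ins, not the kernel definitions: struct
config_item_model, struct simple_child and to_simple_child() are names chosen
for illustration only.

```c
#include <stddef.h>

/* Stand-in for the kernel's struct config_item; only the name is modelled. */
struct config_item_model {
	const char *ci_name;
};

/* Hypothetical subsystem object that embeds the item. */
struct simple_child {
	int storeme;                    /* subsystem-private state */
	struct config_item_model item;  /* how configfs sees it    */
};

/* Recover the container from a pointer to its embedded item. */
static struct simple_child *to_simple_child(struct config_item_model *item)
{
	if (!item)
		return NULL;
	return (struct simple_child *)((char *)item -
				       offsetof(struct simple_child, item));
}
```

Callbacks only ever receive a pointer to the embedded item, so a helper of
this shape is what lets the subsystem get back to its own state.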
Whether statically defined in a source file or created by a parent
|
||||
config_group, a config_item must have one of the _init() functions
|
||||
called on it. This initializes the reference count and sets up the
|
||||
appropriate fields.
|
||||
|
||||
All users of a config_item should have a reference on it via
|
||||
config_item_get(), and drop the reference when they are done via
|
||||
config_item_put().
|
||||
|
||||
By itself, a config_item cannot do much more than appear in configfs.
|
||||
Usually a subsystem wants the item to display and/or store attributes,
|
||||
among other things. For that, it needs a type.
|
||||
|
||||
[struct config_item_type]
|
||||
|
||||
	struct configfs_item_operations {
		void (*release)(struct config_item *);
		ssize_t (*show_attribute)(struct config_item *,
					  struct configfs_attribute *,
					  char *);
		ssize_t (*store_attribute)(struct config_item *,
					   struct configfs_attribute *,
					   const char *, size_t);
		int (*allow_link)(struct config_item *src,
				  struct config_item *target);
		int (*drop_link)(struct config_item *src,
				 struct config_item *target);
	};

	struct config_item_type {
		struct module                           *ct_owner;
		struct configfs_item_operations         *ct_item_ops;
		struct configfs_group_operations        *ct_group_ops;
		struct configfs_attribute               **ct_attrs;
	};
|
||||
|
||||
The most basic function of a config_item_type is to define what
|
||||
operations can be performed on a config_item. All items that have been
|
||||
allocated dynamically will need to provide the ct_item_ops->release()
|
||||
method. This method is called when the config_item's reference count
|
||||
reaches zero. Items that wish to display an attribute need to provide
|
||||
the ct_item_ops->show_attribute() method. Similarly, storing a new
|
||||
attribute value uses the store_attribute() method.
|
||||
|
||||
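The interplay between the reference count and the release() method can be
modelled in userspace as below. This is a sketch under simplifying
assumptions: the counter is a plain int rather than a kref, nothing here is
thread-safe, and all item_model_* names are hypothetical.

```c
#include <stdlib.h>

/* Userspace model of a reference-counted item with a release callback. */
struct item_model {
	int refcount;
	void (*release)(struct item_model *item);
};

/* Plays the role of the _init() functions: start with one reference. */
static void item_model_init(struct item_model *item,
			    void (*release)(struct item_model *))
{
	item->refcount = 1;
	item->release = release;
}

/* Take a reference; returns the item so that calls can be chained. */
static struct item_model *item_model_get(struct item_model *item)
{
	if (item)
		item->refcount++;
	return item;
}

/* Drop a reference; the release callback runs when the count hits zero. */
static void item_model_put(struct item_model *item)
{
	if (item && --item->refcount == 0 && item->release)
		item->release(item);
}

/* Example release method for a dynamically allocated item. */
static void item_model_free(struct item_model *item)
{
	free(item);
}
```

A statically defined item can pass NULL as its release method in this model,
which matches the text: only dynamically allocated items need one.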
[struct configfs_attribute]
|
||||
|
||||
	struct configfs_attribute {
		char                    *ca_name;
		struct module           *ca_owner;
		mode_t                  ca_mode;
	};
|
||||
|
||||
When a config_item wants an attribute to appear as a file in the item's
|
||||
configfs directory, it must define a configfs_attribute describing it.
|
||||
It then adds the attribute to the NULL-terminated array
|
||||
config_item_type->ct_attrs. When the item appears in configfs, the
|
||||
attribute file will appear with the configfs_attribute->ca_name
|
||||
filename. configfs_attribute->ca_mode specifies the file permissions.
|
||||
|
||||
If an attribute is readable and the config_item provides a
|
||||
ct_item_ops->show_attribute() method, that method will be called
|
||||
whenever userspace asks for a read(2) on the attribute. The converse
|
||||
will happen for write(2).
|
||||
|
||||
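How a NULL-terminated attribute array and a pair of show/store methods fit
together can be sketched in userspace. The code below only models the idea:
the struct names, the extra size parameter of child_show() and the
assumption that the store buffer is NUL-terminated are all illustrative and
do not match the kernel signatures shown above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Userspace models of an attribute and of an item that owns one value. */
struct attr_model {
	const char *ca_name;
	unsigned int ca_mode;
};

struct child_model {
	int storeme;
};

static struct attr_model attr_storeme = { "storeme", 0644 };
static struct attr_model attr_description = { "description", 0444 };

/* NULL-terminated, like config_item_type->ct_attrs. */
static struct attr_model *child_attrs[] = {
	&attr_storeme,
	&attr_description,
	NULL,
};

/* Called for read(2): format the attribute into 'page'. */
static ssize_t child_show(struct child_model *c, struct attr_model *a,
			  char *page, size_t size)
{
	if (a == &attr_storeme)
		return snprintf(page, size, "%d\n", c->storeme);
	if (a == &attr_description)
		return snprintf(page, size, "A model child item\n");
	return -1;
}

/* Called for write(2): parse the whole buffer and store the new value. */
static ssize_t child_store(struct child_model *c, struct attr_model *a,
			   const char *page, size_t count)
{
	char *end;
	long v;

	if (a != &attr_storeme || !(a->ca_mode & 0200))
		return -1;
	v = strtol(page, &end, 10);
	if (end == page || (*end != '\0' && *end != '\n'))
		return -1;
	c->storeme = (int)v;
	return (ssize_t)count;
}

/* Look an attribute up by file name, walking the NULL-terminated array. */
static struct attr_model *child_find_attr(const char *name)
{
	for (size_t i = 0; child_attrs[i]; i++)
		if (strcmp(child_attrs[i]->ca_name, name) == 0)
			return child_attrs[i];
	return NULL;
}
```

Keeping one value per attribute, as recommended earlier, is what makes the
store side this simple: it parses one number and either accepts or rejects
the whole buffer.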
[struct config_group]
|
||||
|
||||
A config_item cannot live in a vacuum. The only way one can be created
|
||||
is via mkdir(2) on a config_group. This will trigger creation of a
|
||||
child item.
|
||||
|
||||
	struct config_group {
		struct config_item              cg_item;
		struct list_head                cg_children;
		struct configfs_subsystem       *cg_subsys;
		struct config_group             **default_groups;
	};

	void config_group_init(struct config_group *group);
	void config_group_init_type_name(struct config_group *group,
					 const char *name,
					 struct config_item_type *type);
|
||||
|
||||
|
||||
The config_group structure contains a config_item. Properly configuring
|
||||
that item means that a group can behave as an item in its own right.
|
||||
However, it can do more: it can create child items or groups. This is
|
||||
accomplished via the group operations specified on the group's
|
||||
config_item_type.
|
||||
|
||||
	struct configfs_group_operations {
		struct config_item *(*make_item)(struct config_group *group,
						 const char *name);
		struct config_group *(*make_group)(struct config_group *group,
						   const char *name);
		int (*commit_item)(struct config_item *item);
		void (*drop_item)(struct config_group *group,
				  struct config_item *item);
	};
|
||||
|
||||
A group creates child items by providing the
|
||||
ct_group_ops->make_item() method. If provided, this method is called from
mkdir(2) in the group's directory. The subsystem allocates a new
|
||||
config_item (or more likely, its container structure), initializes it,
|
||||
and returns it to configfs. Configfs will then populate the filesystem
|
||||
tree to reflect the new item.
|
||||
|
||||
If the subsystem wants the child to be a group itself, the subsystem
|
||||
provides ct_group_ops->make_group(). Everything else behaves the same,
|
||||
using the group _init() functions on the group.
|
||||
|
||||
Finally, when userspace calls rmdir(2) on the item or group,
|
||||
ct_group_ops->drop_item() is called. As a config_group is also a
|
||||
config_item, a separate drop_group() method is not necessary.
|
||||
The subsystem must config_item_put() the reference that was initialized
|
||||
upon item allocation. If a subsystem has no work to do, it may omit
|
||||
the ct_group_ops->drop_item() method, and configfs will call
|
||||
config_item_put() on the item on behalf of the subsystem.
|
||||
|
||||
IMPORTANT: drop_item() is void, and as such cannot fail. When rmdir(2)
|
||||
is called, configfs WILL remove the item from the filesystem tree
|
||||
(assuming that it has no children to keep it busy). The subsystem is
|
||||
responsible for responding to this. If the subsystem has references to
|
||||
the item in other threads, the memory is safe. It may take some time
|
||||
for the item to actually disappear from the subsystem's usage. But it
|
||||
is gone from configfs.
|
||||
|
||||
A config_group cannot be removed while it still has child items. This
|
||||
is implemented in the configfs rmdir(2) code. ->drop_item() will not be
|
||||
called, as the item has not been dropped. rmdir(2) will fail, as the
|
||||
directory is not empty.
|
||||
|
||||
[struct configfs_subsystem]
|
||||
|
||||
A subsystem must register itself, usually at module_init time. This
|
||||
tells configfs to make the subsystem appear in the file tree.
|
||||
|
||||
	struct configfs_subsystem {
		struct config_group     su_group;
		struct semaphore        su_sem;
	};

	int configfs_register_subsystem(struct configfs_subsystem *subsys);
	void configfs_unregister_subsystem(struct configfs_subsystem *subsys);
|
||||
|
||||
A subsystem consists of a toplevel config_group and a semaphore.
|
||||
The group is where child config_items are created. For a subsystem,
|
||||
this group is usually defined statically. Before calling
|
||||
configfs_register_subsystem(), the subsystem must have initialized the
|
||||
group via the usual group _init() functions, and it must also have
|
||||
initialized the semaphore.
|
||||
When the register call returns, the subsystem is live, and it
|
||||
will be visible via configfs. At that point, mkdir(2) can be called and
|
||||
the subsystem must be ready for it.
|
||||
|
||||
[An Example]
|
||||
|
||||
The best example of these basic concepts is the simple_children
|
||||
subsystem/group and the simple_child item in configfs_example.c. It
|
||||
shows a trivial object displaying and storing an attribute, and a simple
|
||||
group creating and destroying these children.
|
||||
|
||||
[Hierarchy Navigation and the Subsystem Semaphore]
|
||||
|
||||
There is an extra bonus that configfs provides. The config_groups and
|
||||
config_items are arranged in a hierarchy due to the fact that they
|
||||
appear in a filesystem. A subsystem is NEVER to touch the filesystem
|
||||
parts, but the subsystem might be interested in this hierarchy. For
|
||||
this reason, the hierarchy is mirrored via the config_group->cg_children
|
||||
and config_item->ci_parent structure members.
|
||||
|
||||
A subsystem can navigate the cg_children list and the ci_parent pointer
|
||||
to see the tree created by the subsystem. This can race with configfs'
|
||||
management of the hierarchy, so configfs uses the subsystem semaphore to
|
||||
protect modifications. Whenever a subsystem wants to navigate the
|
||||
hierarchy, it must do so under the protection of the subsystem
|
||||
semaphore.
|
||||
|
||||
A subsystem will be prevented from acquiring the semaphore while a newly
|
||||
allocated item has not been linked into this hierarchy. Similarly, it
|
||||
will not be able to acquire the semaphore while a dropping item has not
|
||||
yet been unlinked. This means that an item's ci_parent pointer will
|
||||
never be NULL while the item is in configfs, and that an item will only
|
||||
be in its parent's cg_children list for the same duration. This allows
|
||||
a subsystem to trust ci_parent and cg_children while it holds the
|
||||
semaphore.
|
||||
|
||||
[Item Aggregation Via symlink(2)]
|
||||
|
||||
configfs provides a simple group via the group->item parent/child
|
||||
relationship. Often, however, a larger environment requires aggregation
|
||||
outside of the parent/child connection. This is implemented via
|
||||
symlink(2).
|
||||
|
||||
A config_item may provide the ct_item_ops->allow_link() and
|
||||
ct_item_ops->drop_link() methods. If the ->allow_link() method exists,
|
||||
symlink(2) may be called with the config_item as the source of the link.
|
||||
These links are only allowed between configfs config_items. Any
|
||||
symlink(2) attempt outside the configfs filesystem will be denied.
|
||||
|
||||
When symlink(2) is called, the source config_item's ->allow_link()
|
||||
method is called with itself and a target item. If the source item
|
||||
allows linking to target item, it returns 0. A source item may wish to
|
||||
reject a link if it only wants links to a certain type of object (say,
|
||||
in its own subsystem).
|
||||
|
||||
When unlink(2) is called on the symbolic link, the source item is
|
||||
notified via the ->drop_link() method. Like the ->drop_item() method,
|
||||
this is a void function and cannot return failure. The subsystem is
|
||||
responsible for responding to the change.
|
||||
|
||||
A config_item cannot be removed while it links to any other item, nor
|
||||
can it be removed while an item links to it. Dangling symlinks are not
|
||||
allowed in configfs.
|
||||
|
||||
[Automatically Created Subgroups]
|
||||
|
||||
A new config_group may want to have two types of child config_items.
|
||||
While this could be codified by magic names in ->make_item(), it is much
|
||||
more explicit to have a method whereby userspace sees this divergence.
|
||||
|
||||
Rather than have a group where some items behave differently than
|
||||
others, configfs provides a method whereby one or many subgroups are
|
||||
automatically created inside the parent at its creation. Thus,
|
||||
mkdir("parent") results in "parent", "parent/subgroup1", up through
|
||||
"parent/subgroupN". Items of type 1 can now be created in
|
||||
"parent/subgroup1", and items of type N can be created in
|
||||
"parent/subgroupN".
|
||||
|
||||
These automatic subgroups, or default groups, do not preclude other
|
||||
children of the parent group. If ct_group_ops->make_group() exists,
|
||||
other child groups can be created on the parent group directly.
|
||||
|
||||
A configfs subsystem specifies default groups by filling in the
|
||||
NULL-terminated array default_groups on the config_group structure.
|
||||
Each group in that array is populated in the configfs tree at the same
|
||||
time as the parent group. Similarly, they are removed at the same time
|
||||
as the parent. No extra notification is provided. When a ->drop_item()
|
||||
method call notifies the subsystem the parent group is going away, it
|
||||
also means every default group child associated with that parent group.
|
||||
|
||||
As a consequence of this, default_groups cannot be removed directly via
|
||||
rmdir(2). They also are not considered when rmdir(2) on the parent
|
||||
group is checking for children.
|
||||
|
||||
[Committable Items]
|
||||
|
||||
NOTE: Committable items are currently unimplemented.
|
||||
|
||||
Some config_items cannot have a valid initial state. That is, no
|
||||
default values can be specified for the item's attributes such that the
|
||||
item can do its work. Userspace must configure one or more attributes,
|
||||
after which the subsystem can start whatever entity this item
|
||||
represents.
|
||||
|
||||
Consider the FakeNBD device from above. Without a target address *and*
|
||||
a target device, the subsystem has no idea what block device to import.
|
||||
The simple example assumes that the subsystem merely waits until all the
|
||||
appropriate attributes are configured, and then connects. This will,
|
||||
indeed, work, but now every attribute store must check if the attributes
|
||||
are initialized. Every attribute store must fire off the connection if
|
||||
that condition is met.
|
||||
|
||||
Far better would be an explicit action notifying the subsystem that the
|
||||
config_item is ready to go. More importantly, an explicit action allows
|
||||
the subsystem to provide feedback as to whether the attributes are
|
||||
initialized in a way that makes sense. configfs provides this as
|
||||
committable items.
|
||||
|
||||
configfs still uses only normal filesystem operations. An item is
|
||||
committed via rename(2). The item is moved from a directory where it
|
||||
can be modified to a directory where it cannot.
|
||||
|
||||
Any group that provides the ct_group_ops->commit_item() method has
|
||||
committable items. When this group appears in configfs, mkdir(2) will
|
||||
not work directly in the group. Instead, the group will have two
|
||||
subdirectories: "live" and "pending". The "live" directory does not
|
||||
support mkdir(2) or rmdir(2) either. It only allows rename(2). The
|
||||
"pending" directory does allow mkdir(2) and rmdir(2). An item is
|
||||
created in the "pending" directory. Its attributes can be modified at
|
||||
will. Userspace commits the item by renaming it into the "live"
|
||||
directory. At this point, the subsystem receives the ->commit_item()
|
||||
callback. If all required attributes are filled to satisfaction, the
|
||||
method returns zero and the item is moved to the "live" directory.
|
||||
|
||||
As rmdir(2) does not work in the "live" directory, an item must be
|
||||
shut down, or "uncommitted". Again, this is done via rename(2), this
|
||||
time from the "live" directory back to the "pending" one. The subsystem
|
||||
is notified by the ct_group_ops->uncommit_object() method.
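Since committable items are unimplemented, the sequence can only be
illustrated on an ordinary scratch directory. The sketch below is such
an illustration, assuming a writable base directory that stands in for
the group; the model_*() helpers are hypothetical names, and no
->commit_item() validation happens because no configfs subsystem is
involved. It only shows the userspace side: create in "pending", commit
by renaming into "live", uncommit by renaming back.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Create "<base>/<dir>" or, if name is non-NULL, "<base>/<dir>/<name>". */
static int model_mkdir(const char *base, const char *dir, const char *name)
{
	char path[512];
	int n;

	if (name)
		n = snprintf(path, sizeof(path), "%s/%s/%s", base, dir, name);
	else
		n = snprintf(path, sizeof(path), "%s/%s", base, dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	return mkdir(path, 0755);
}

/* Move "<base>/<from>/<name>" to "<base>/<to>/<name>" with rename(2). */
static int model_move(const char *base, const char *from, const char *to,
		      const char *name)
{
	char src[512], dst[512];
	int n;

	n = snprintf(src, sizeof(src), "%s/%s/%s", base, from, name);
	if (n < 0 || (size_t)n >= sizeof(src))
		return -1;
	n = snprintf(dst, sizeof(dst), "%s/%s/%s", base, to, name);
	if (n < 0 || (size_t)n >= sizeof(dst))
		return -1;

	return rename(src, dst);
}

/* Commit: pending -> live. */
static int model_commit(const char *base, const char *name)
{
	return model_move(base, "pending", "live", name);
}

/* Uncommit: live -> pending. */
static int model_uncommit(const char *base, const char *name)
{
	return model_move(base, "live", "pending", name);
}
```

In a real implementation the rename(2) into "live" would presumably
fail when ->commit_item() rejects the item, which is how userspace
would learn that the attributes are not yet acceptable.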
|
||||
|
||||
|
||||
487
Documentation/filesystems/configfs/configfs_example.c
Normal file
@@ -0,0 +1,487 @@
|
||||
/*
|
||||
* vim: noexpandtab ts=8 sts=0 sw=8:
|
||||
*
|
||||
* configfs_example.c - This file is a demonstration module containing
|
||||
* a number of configfs subsystems.
|
||||
*
|
||||
* This program is free software; you can redistribute it and/or
|
||||
* modify it under the terms of the GNU General Public
|
||||
* License as published by the Free Software Foundation; either
|
||||
* version 2 of the License, or (at your option) any later version.
|
||||
*
|
||||
* This program is distributed in the hope that it will be useful,
|
||||
* but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
|
||||
* General Public License for more details.
|
||||
*
|
||||
* You should have received a copy of the GNU General Public
|
||||
* License along with this program; if not, write to the
|
||||
* Free Software Foundation, Inc., 59 Temple Place - Suite 330,
|
||||
* Boston, MA 02111-1307, USA.
|
||||
*
|
||||
* Based on sysfs:
|
||||
* sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel
|
||||
*
|
||||
* configfs Copyright (C) 2005 Oracle. All rights reserved.
|
||||
*/
|
||||
|
||||
#include <linux/init.h>
|
||||
#include <linux/module.h>
|
||||
#include <linux/slab.h>
|
||||
|
||||
#include <linux/configfs.h>
|
||||
|
||||
|
||||
|
||||
/*
|
||||
* 01-childless
|
||||
*
|
||||
* This first example is a childless subsystem. It cannot create
|
||||
* any config_items. It just has attributes.
|
||||
*
|
||||
* Note that we are enclosing the configfs_subsystem inside a container.
|
||||
* This is not necessary if a subsystem has no attributes directly
|
||||
* on the subsystem. See the next example, 02-simple-children, for
|
||||
* such a subsystem.
|
||||
*/
|
||||
|
||||
struct childless {
|
||||
struct configfs_subsystem subsys;
|
||||
int showme;
|
||||
int storeme;
|
||||
};
|
||||
|
||||
struct childless_attribute {
|
||||
struct configfs_attribute attr;
|
||||
ssize_t (*show)(struct childless *, char *);
|
||||
ssize_t (*store)(struct childless *, const char *, size_t);
|
||||
};
|
||||
|
||||
static inline struct childless *to_childless(struct config_item *item)
|
||||
{
|
||||
return item ? container_of(to_configfs_subsystem(to_config_group(item)), struct childless, subsys) : NULL;
|
||||
}
|
||||
|
||||
static ssize_t childless_showme_read(struct childless *childless,
|
||||
char *page)
|
||||
{
|
||||
ssize_t pos;
|
||||
|
||||
pos = sprintf(page, "%d\n", childless->showme);
|
||||
childless->showme++;
|
||||
|
||||
return pos;
|
||||
}
|
||||
|
||||
static ssize_t childless_storeme_read(struct childless *childless,
|
||||
char *page)
|
||||
{
|
||||
return sprintf(page, "%d\n", childless->storeme);
|
||||
}
|
||||
|
||||
static ssize_t childless_storeme_write(struct childless *childless,
|
||||
const char *page,
|
||||
size_t count)
|
||||
{
|
||||
unsigned long tmp;
|
||||
char *p = (char *) page;
|
||||
|
||||
tmp = simple_strtoul(p, &p, 10);
|
||||
if (!p || (*p && (*p != '\n')))
|
||||
return -EINVAL;
|
||||
|
||||
if (tmp > INT_MAX)
|
||||
return -ERANGE;
|
||||
|
||||
childless->storeme = tmp;
|
||||
|
||||
return count;
|
||||
}
|
||||
|
||||
static ssize_t childless_description_read(struct childless *childless,
|
||||
char *page)
|
||||
{
|
||||
return sprintf(page,
|
||||
"[01-childless]\n"
|
||||
"\n"
|
||||
"The childless subsystem is the simplest possible subsystem in\n"
|
||||
"configfs. It does not support the creation of child config_items.\n"
|
||||
"It only has a few attributes. In fact, it isn't much different\n"
|
||||
"than a directory in /proc.\n");
|
||||
}
|
||||
|
||||
static struct childless_attribute childless_attr_showme = {
|
||||
.attr = { .ca_owner = THIS_MODULE, .ca_name = "showme", .ca_mode = S_IRUGO },
|
||||
.show = childless_showme_read,
|
||||
};
|
||||
static struct childless_attribute childless_attr_storeme = {
|
||||
.attr = { .ca_owner = THIS_MODULE, .ca_name = "storeme", .ca_mode = S_IRUGO | S_IWUSR },
|
||||
.show = childless_storeme_read,
|
||||
.store = childless_storeme_write,
|
||||
};
|
||||
static struct childless_attribute childless_attr_description = {
|
||||
.attr = { .ca_owner = THIS_MODULE, .ca_name = "description", .ca_mode = S_IRUGO },
|
||||
.show = childless_description_read,
|
||||
};
|
||||
|
||||
static struct configfs_attribute *childless_attrs[] = {
|
||||
&childless_attr_showme.attr,
|
||||
&childless_attr_storeme.attr,
|
||||
&childless_attr_description.attr,
|
||||
NULL,
|
||||
};
|
||||
|
||||
static ssize_t childless_attr_show(struct config_item *item,
|
||||
struct configfs_attribute *attr,
|
||||
char *page)
|
||||
{
|
||||
struct childless *childless = to_childless(item);
|
||||
struct childless_attribute *childless_attr =
|
||||
container_of(attr, struct childless_attribute, attr);
|
||||
ssize_t ret = 0;
|
||||
|
||||
if (childless_attr->show)
|
||||
ret = childless_attr->show(childless, page);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static ssize_t childless_attr_store(struct config_item *item,
|
||||
struct configfs_attribute *attr,
|
||||
const char *page, size_t count)
|
||||
{
|
||||
struct childless *childless = to_childless(item);
|
||||
struct childless_attribute *childless_attr =
|
||||
container_of(attr, struct childless_attribute, attr);
|
||||
ssize_t ret = -EINVAL;
|
||||
|
||||
if (childless_attr->store)
|
||||
ret = childless_attr->store(childless, page, count);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static struct configfs_item_operations childless_item_ops = {
|
||||
.show_attribute = childless_attr_show,
|
||||
.store_attribute = childless_attr_store,
|
||||
};
|
||||
|
||||
static struct config_item_type childless_type = {
|
||||
.ct_item_ops = &childless_item_ops,
|
||||
.ct_attrs = childless_attrs,
|
||||
.ct_owner = THIS_MODULE,
|
||||
};
|
||||
|
||||
static struct childless childless_subsys = {
|
||||
.subsys = {
|
||||
.su_group = {
|
||||
.cg_item = {
|
||||
.ci_namebuf = "01-childless",
|
||||
.ci_type = &childless_type,
|
||||
},
|
||||
},
|
||||
},
|
||||
};
|
||||
|
||||
|
||||
/* ----------------------------------------------------------------- */
|
||||
|
||||
/*
|
||||
* 02-simple-children
|
||||
*
|
||||
* This example merely has a simple one-attribute child. Note that
|
||||
* there is no extra attribute structure, as the child's attribute is
|
||||
* known from the get-go. Also, there is no container for the
|
||||
* subsystem, as it has no attributes of its own.
|
||||
*/
|
||||
|
||||
struct simple_child {
|
||||
struct config_item item;
|
||||
int storeme;
|
||||
};
|
||||
|
||||
static inline struct simple_child *to_simple_child(struct config_item *item)
|
||||
{
|
||||
return item ? container_of(item, struct simple_child, item) : NULL;
|
||||
}
|
||||
|
||||
static struct configfs_attribute simple_child_attr_storeme = {
|
||||
.ca_owner = THIS_MODULE,
|
||||
.ca_name = "storeme",
|
||||
.ca_mode = S_IRUGO | S_IWUSR,
|
||||
};
|
||||
|
||||
static struct configfs_attribute *simple_child_attrs[] = {
|
||||
&simple_child_attr_storeme,
|
||||
NULL,
|
||||
};
|
||||
|
||||
static ssize_t simple_child_attr_show(struct config_item *item,
|
||||
struct configfs_attribute *attr,
|
||||
char *page)
|
||||
{
|
||||
ssize_t count;
|
||||
struct simple_child *simple_child = to_simple_child(item);
|
||||
|
||||
count = sprintf(page, "%d\n", simple_child->storeme);
|
||||
|
||||
return count;
|
||||
}
|
||||
|
||||
static ssize_t simple_child_attr_store(struct config_item *item,
|
||||
struct configfs_attribute *attr,
|
||||
const char *page, size_t count)
|
||||
{
|
||||
struct simple_child *simple_child = to_simple_child(item);
|
||||
unsigned long tmp;
|
||||
char *p = (char *) page;
|
||||
|
||||
tmp = simple_strtoul(p, &p, 10);
|
||||
if (!p || (*p && (*p != '\n')))
|
||||
return -EINVAL;
|
||||
|
||||
if (tmp > INT_MAX)
|
||||
return -ERANGE;
|
||||
|
||||
simple_child->storeme = tmp;
|
||||
|
||||
return count;
|
||||
}
|
||||
|
||||
static void simple_child_release(struct config_item *item)
|
||||
{
|
||||
kfree(to_simple_child(item));
|
||||
}
|
||||
|
||||
static struct configfs_item_operations simple_child_item_ops = {
|
||||
.release = simple_child_release,
|
||||
.show_attribute = simple_child_attr_show,
|
||||
.store_attribute = simple_child_attr_store,
|
||||
};
|
||||
|
||||
static struct config_item_type simple_child_type = {
|
||||
.ct_item_ops = &simple_child_item_ops,
|
||||
.ct_attrs = simple_child_attrs,
|
||||
.ct_owner = THIS_MODULE,
|
||||
};
|
||||
|
||||
|
||||
struct simple_children {
|
||||
struct config_group group;
|
||||
};
|
||||
|
||||
static inline struct simple_children *to_simple_children(struct config_item *item)
|
||||
{
|
||||
return item ? container_of(to_config_group(item), struct simple_children, group) : NULL;
|
||||
}
|
||||
|
||||
static struct config_item *simple_children_make_item(struct config_group *group, const char *name)
|
||||
{
|
||||
struct simple_child *simple_child;
|
||||
|
||||
simple_child = kmalloc(sizeof(struct simple_child), GFP_KERNEL);
|
||||
if (!simple_child)
|
||||
return NULL;
|
||||
|
||||
memset(simple_child, 0, sizeof(struct simple_child));
|
||||
|
||||
config_item_init_type_name(&simple_child->item, name,
|
||||
&simple_child_type);
|
||||
|
||||
simple_child->storeme = 0;
|
||||
|
||||
return &simple_child->item;
|
||||
}
|
||||
|
||||
static struct configfs_attribute simple_children_attr_description = {
|
||||
.ca_owner = THIS_MODULE,
|
||||
.ca_name = "description",
|
||||
.ca_mode = S_IRUGO,
|
||||
};
|
||||
|
||||
static struct configfs_attribute *simple_children_attrs[] = {
|
||||
&simple_children_attr_description,
|
||||
NULL,
|
||||
};
|
||||
|
||||
static ssize_t simple_children_attr_show(struct config_item *item,
|
||||
struct configfs_attribute *attr,
|
||||
char *page)
|
||||
{
|
||||
return sprintf(page,
|
||||
"[02-simple-children]\n"
|
||||
"\n"
|
||||
"This subsystem allows the creation of child config_items. These\n"
|
||||
"items have only one attribute that is readable and writeable.\n");
|
||||
}
|
||||
|
||||
static void simple_children_release(struct config_item *item)
|
||||
{
|
||||
kfree(to_simple_children(item));
|
||||
}
|
||||
|
||||
static struct configfs_item_operations simple_children_item_ops = {
|
||||
.release = simple_children_release,
|
||||
.show_attribute = simple_children_attr_show,
|
||||
};
|
||||
|
||||
/*
|
||||
* Note that, since no extra work is required on ->drop_item(),
|
||||
* no ->drop_item() is provided.
|
||||
*/
|
||||
static struct configfs_group_operations simple_children_group_ops = {
|
||||
.make_item = simple_children_make_item,
|
||||
};
|
||||
|
||||
static struct config_item_type simple_children_type = {
|
||||
.ct_item_ops = &simple_children_item_ops,
|
||||
.ct_group_ops = &simple_children_group_ops,
|
||||
.ct_attrs = simple_children_attrs,
|
||||
.ct_owner = THIS_MODULE,
|
||||
};
|
||||
|
||||
static struct configfs_subsystem simple_children_subsys = {
|
||||
.su_group = {
|
||||
.cg_item = {
|
||||
.ci_namebuf = "02-simple-children",
|
||||
.ci_type = &simple_children_type,
|
||||
},
|
||||
},
|
||||
};
|
||||
|
||||
|
||||
/* ----------------------------------------------------------------- */
|
||||
|
||||
/*
|
||||
* 03-group-children
|
||||
*
|
||||
* This example reuses the simple_children group from above. However,
|
||||
* the simple_children group is not the subsystem itself, it is a
|
||||
* child of the subsystem. Creation of a group in the subsystem creates
|
||||
* a new simple_children group. That group can then have simple_child
|
||||
* children of its own.
|
||||
*/
|
||||
|
||||
static struct config_group *group_children_make_group(struct config_group *group, const char *name)
|
||||
{
|
||||
struct simple_children *simple_children;
|
||||
|
||||
simple_children = kmalloc(sizeof(struct simple_children),
|
||||
GFP_KERNEL);
|
||||
if (!simple_children)
|
||||
return NULL;
|
||||
|
||||
memset(simple_children, 0, sizeof(struct simple_children));
|
||||
|
||||
config_group_init_type_name(&simple_children->group, name,
|
||||
&simple_children_type);
|
||||
|
||||
return &simple_children->group;
|
||||
}
|
||||
|
||||
static struct configfs_attribute group_children_attr_description = {
|
||||
.ca_owner = THIS_MODULE,
|
||||
.ca_name = "description",
|
||||
.ca_mode = S_IRUGO,
|
||||
};
|
||||
|
||||
static struct configfs_attribute *group_children_attrs[] = {
|
||||
&group_children_attr_description,
|
||||
NULL,
|
||||
};
|
||||
|
||||
static ssize_t group_children_attr_show(struct config_item *item,
|
||||
struct configfs_attribute *attr,
|
||||
char *page)
|
||||
{
|
||||
return sprintf(page,
|
||||
"[03-group-children]\n"
|
||||
"\n"
|
||||
"This subsystem allows the creation of child config_groups. These\n"
|
||||
"groups are like the subsystem simple-children.\n");
|
||||
}
|
||||
|
||||
static struct configfs_item_operations group_children_item_ops = {
|
||||
.show_attribute = group_children_attr_show,
|
||||
};
|
||||
|
||||
/*
|
||||
* Note that, since no extra work is required on ->drop_item(),
|
||||
* no ->drop_item() is provided.
|
||||
*/
|
||||
static struct configfs_group_operations group_children_group_ops = {
|
||||
.make_group = group_children_make_group,
|
||||
};
|
||||
|
||||
static struct config_item_type group_children_type = {
|
||||
.ct_item_ops = &group_children_item_ops,
|
||||
.ct_group_ops = &group_children_group_ops,
|
||||
.ct_attrs = group_children_attrs,
|
||||
.ct_owner = THIS_MODULE,
|
||||
};
|
||||
|
||||
static struct configfs_subsystem group_children_subsys = {
|
||||
.su_group = {
|
||||
.cg_item = {
|
||||
.ci_namebuf = "03-group-children",
|
||||
.ci_type = &group_children_type,
|
||||
},
|
||||
},
|
||||
};
|
||||
|
||||
/* ----------------------------------------------------------------- */
|
||||
|
||||
/*
|
||||
* We're now done with our subsystem definitions.
|
||||
* For convenience in this module, here's a list of them all. It
|
||||
* allows the init function to easily register them. Most modules
|
||||
* will only have one subsystem, and will only call register_subsystem
|
||||
* on it directly.
|
||||
*/
|
||||
static struct configfs_subsystem *example_subsys[] = {
|
||||
&childless_subsys.subsys,
|
||||
&simple_children_subsys,
|
||||
&group_children_subsys,
|
||||
NULL,
|
||||
};
|
||||
|
||||
static int __init configfs_example_init(void)
|
||||
{
|
||||
int ret;
|
||||
int i;
|
||||
struct configfs_subsystem *subsys;
|
||||
|
||||
for (i = 0; example_subsys[i]; i++) {
|
||||
subsys = example_subsys[i];
|
||||
|
||||
config_group_init(&subsys->su_group);
|
||||
init_MUTEX(&subsys->su_sem);
|
||||
ret = configfs_register_subsystem(subsys);
|
||||
if (ret) {
|
||||
printk(KERN_ERR "Error %d while registering subsystem %s\n",
|
||||
ret,
|
||||
subsys->su_group.cg_item.ci_namebuf);
|
||||
goto out_unregister;
|
||||
}
|
||||
}
|
||||
|
||||
return 0;
|
||||
|
||||
out_unregister:
|
||||
for (i--; i >= 0; i--) {	/* skip the subsystem that failed to register */
|
||||
configfs_unregister_subsystem(example_subsys[i]);
|
||||
}
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
static void __exit configfs_example_exit(void)
|
||||
{
|
||||
int i;
|
||||
|
||||
for (i = 0; example_subsys[i]; i++) {
|
||||
configfs_unregister_subsystem(example_subsys[i]);
|
||||
}
|
||||
}
|
||||
|
||||
module_init(configfs_example_init);
|
||||
module_exit(configfs_example_exit);
|
||||
MODULE_LICENSE("GPL");
|
||||
76
Documentation/filesystems/cramfs.txt
Normal file
@@ -0,0 +1,76 @@
|
||||
|
||||
Cramfs - cram a filesystem onto a small ROM
|
||||
|
||||
cramfs is designed to be simple and small, and to compress things well.
|
||||
|
||||
It uses the zlib routines to compress a file one page at a time, and
|
||||
allows random page access. The meta-data is not compressed, but is
|
||||
expressed in a very terse representation to make it use much less
|
||||
disk space than traditional filesystems.
|
||||
|
||||
You can't write to a cramfs filesystem (making it compressible and
|
||||
compact also makes it _very_ hard to update on-the-fly), so you have to
|
||||
create the disk image with the "mkcramfs" utility.
|
||||
|
||||
|
||||
Usage Notes
|
||||
-----------
|
||||
|
||||
File sizes are limited to less than 16MB.
|
||||
|
||||
Maximum filesystem size is a little over 256MB. (The last file on the
|
||||
filesystem is allowed to extend past 256MB.)
|
||||
|
||||
Only the low 8 bits of gid are stored. The current version of
|
||||
mkcramfs simply truncates to 8 bits, which is a potential security
|
||||
issue.
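The effect of the truncation can be shown with a one-line helper. This
is a sketch of the arithmetic only; model_cramfs_gid() is a
hypothetical name and is not taken from mkcramfs.

```c
#include <stdint.h>

/* Keep only the low 8 bits of a gid, as described for mkcramfs above. */
static uint8_t model_cramfs_gid(uint32_t gid)
{
	return (uint8_t)(gid & 0xff);
}
```

With this mapping gid 1000 is stored as 232, and gid 256 is stored as
0, which is the root group. Collisions of that kind are why the
truncation is a potential security issue.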
|
||||
|
||||
Hard links are supported, but hard linked files
|
||||
will still have a link count of 1 in the cramfs image.
|
||||
|
||||
Cramfs directories have no `.' or `..' entries. Directories (like
|
||||
every other file on cramfs) always have a link count of 1. (There's
|
||||
no need to use -noleaf in `find', btw.)
|
||||
|
||||
No timestamps are stored in a cramfs, so these default to the epoch
|
||||
(1970 GMT). Recently-accessed files may have updated timestamps, but
|
||||
the update lasts only as long as the inode is cached in memory, after
|
||||
which the timestamp reverts to 1970, i.e. moves backwards in time.
|
||||
|
||||
Currently, cramfs must be written and read with architectures of the
|
||||
same endianness, and can be read only by kernels with PAGE_CACHE_SIZE
|
||||
== 4096. At least the latter of these is a bug, but it hasn't been
|
||||
decided what the best fix is. For the moment if you have larger pages
|
||||
you can just change the #define in mkcramfs.c, so long as you don't
|
||||
mind the filesystem becoming unreadable to future kernels.
|
||||
|
||||
|
||||
For /usr/share/magic
|
||||
--------------------
|
||||
|
||||
0 ulelong 0x28cd3d45 Linux cramfs offset 0
|
||||
>4 ulelong x size %d
|
||||
>8 ulelong x flags 0x%x
|
||||
>12 ulelong x future 0x%x
|
||||
>16 string >\0 signature "%.16s"
|
||||
>32 ulelong x fsid.crc 0x%x
|
||||
>36 ulelong x fsid.edition %d
|
||||
>40 ulelong x fsid.blocks %d
|
||||
>44 ulelong x fsid.files %d
|
||||
>48 string >\0 name "%.16s"
|
||||
512 ulelong 0x28cd3d45 Linux cramfs offset 512
|
||||
>516 ulelong x size %d
|
||||
>520 ulelong x flags 0x%x
|
||||
>524 ulelong x future 0x%x
|
||||
>528 string >\0 signature "%.16s"
|
||||
>544 ulelong x fsid.crc 0x%x
|
||||
>548 ulelong x fsid.edition %d
|
||||
>552 ulelong x fsid.blocks %d
|
||||
>556 ulelong x fsid.files %d
|
||||
>560 string >\0 name "%.16s"
|
||||
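The magic entries above translate into a small probe routine. The
sketch below assumes a little-endian image, matching the "ulelong"
type used in the entries (an image written on a big-endian machine
would not match, given the endianness restriction noted earlier). The
names model_le32() and model_cramfs_probe() are invented for the
illustration and do not exist in the kernel or in mkcramfs.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_CRAMFS_MAGIC 0x28cd3d45u

/* Read an unsigned 32-bit little-endian value ("ulelong"). */
static uint32_t model_le32(const unsigned char *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Return the offset (0 or 512) at which the cramfs magic is found, or
 * -1 if neither location matches. On success *size receives the
 * "size" field stored 4 bytes after the magic.
 */
static long model_cramfs_probe(const unsigned char *buf, size_t len,
			       uint32_t *size)
{
	static const size_t offsets[] = { 0, 512 };
	size_t i;

	for (i = 0; i < sizeof(offsets) / sizeof(offsets[0]); i++) {
		size_t off = offsets[i];

		if (len < off + 8)
			continue;	/* buffer too short for this slot */
		if (model_le32(buf + off) != MODEL_CRAMFS_MAGIC)
			continue;
		*size = model_le32(buf + off + 4);
		return (long)off;
	}

	return -1;
}
```

The remaining fields (flags, future, signature, fsid and name) follow
at the offsets listed in the entries and could be read the same way.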
|
||||
|
||||
Hacker Notes
|
||||
------------
|
||||
|
||||
See fs/cramfs/README for filesystem layout and implementation notes.
|
||||
173
Documentation/filesystems/dentry-locking.txt
Normal file
@@ -0,0 +1,173 @@
|
||||
RCU-based dcache locking model
|
||||
==============================
|
||||
|
||||
On many workloads, the most common operation on dcache is to look up a
|
||||
dentry, given a parent dentry and the name of the child. Typically,
|
||||
for every open(), stat() etc., the dentry corresponding to the
|
||||
pathname will be looked up by walking the tree starting with the first
|
||||
component of the pathname and using that dentry along with the next
|
||||
component to look up the next level and so on. Since it is a frequent
|
||||
operation for workloads like multiuser environments and web servers,
|
||||
it is important to optimize this path.
|
||||
|
||||
Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus in
|
||||
every component during path look-up. From 2.5.10 onwards, the fast-walk
|
||||
algorithm changed this by holding the dcache_lock at the beginning and
|
||||
walking as many cached path component dentries as possible. This
|
||||
significantly decreases the number of acquisitions of
|
||||
dcache_lock. However it also increases the lock hold time
|
||||
significantly and affects performance in large SMP machines. Since
|
||||
2.5.62 kernel, dcache has been using a new locking model that uses RCU
|
||||
to make dcache look-up lock-free.
|
||||
|
||||
The current dcache locking model is not very different from the
|
||||
previous dcache locking model. Prior to the 2.5.62 kernel, dcache_lock
|
||||
protected the hash chain, d_child, d_alias, d_lru lists as well as
|
||||
d_inode and several other things like mount look-up. RCU-based changes
|
||||
affect only the way the hash chain is protected. For everything else
|
||||
the dcache_lock must be taken for both traversing as well as
|
||||
updating. The hash chain updates too take the dcache_lock. The
|
||||
significant change is the way d_lookup traverses the hash chain, it
|
||||
doesn't acquire the dcache_lock for this and relies on RCU to ensure
|
||||
that the dentry has not been *freed*.
|
||||
|
||||
|
||||
Dcache locking details
|
||||
======================
|
||||
|
||||
For many multi-user workloads, open() and stat() on files are very
|
||||
frequently occurring operations. Both involve walking of path names to
|
||||
find the dentry corresponding to the concerned file. In 2.4 kernel,
|
||||
dcache_lock was held during look-up of each path component. Contention
|
||||
and cache-line bouncing of this global lock caused significant
|
||||
scalability problems. With the introduction of RCU in Linux kernel,
|
||||
this was worked around by making the look-up of path components during
|
||||
path walking lock-free.
|
||||
|
||||
|
||||
Safe lock-free look-up of dcache hash table
|
||||
===========================================
|
||||
|
||||
Dcache is a complex data structure with the hash table entries also
|
||||
linked together in other lists. In 2.4 kernel, dcache_lock protected
|
||||
all the lists. We applied RCU only on hash chain walking. The rest of
|
||||
the lists are still protected by dcache_lock. Some of the important
|
||||
changes are :
|
||||
|
||||
1. The deletion from hash chain is done using hlist_del_rcu() macro
|
||||
which doesn't initialize next pointer of the deleted dentry and
|
||||
this allows us to walk safely lock-free while a deletion is
|
||||
happening.
|
||||
|
||||
2. Insertion of a dentry into the hash table is done using
|
||||
hlist_add_head_rcu() which takes care of ordering the writes - the
|
||||
writes to the dentry must be visible before the dentry is
|
||||
inserted. This works in conjunction with hlist_for_each_rcu() while
|
||||
walking the hash chain. The only requirement is that all
|
||||
initialization to the dentry must be done before
|
||||
hlist_add_head_rcu() since we don't have dcache_lock protection
|
||||
while traversing the hash chain. This isn't different from the
|
||||
existing code.
|
||||
|
||||
3. The dentry looked up without holding dcache_lock cannot be
|
||||
returned for walking if it is unhashed. It then may have a NULL
|
||||
d_inode or other bogosity since RCU doesn't protect the other
|
||||
fields in the dentry. We therefore use a flag DCACHE_UNHASHED to
|
||||
indicate unhashed dentries and use this in conjunction with a
|
||||
per-dentry lock (d_lock). Once looked up without the dcache_lock,
|
||||
we acquire the per-dentry lock (d_lock) and check if the dentry is
|
||||
unhashed. If so, the look-up is failed. If not, the reference count
|
||||
of the dentry is increased and the dentry is returned.
|
||||
|
||||
4. Once a dentry is looked up, it must be ensured during the path walk
|
||||
for that component it doesn't go away. In pre-2.5.10 code, this was
|
||||
done holding a reference to the dentry. dcache_rcu does the same.
|
||||
In some sense, dcache_rcu path walking looks like the pre-2.5.10
|
||||
version.
|
||||
|
||||
5. All dentry hash chain updates must take the dcache_lock as well as
|
||||
the per-dentry lock in that order. dput() does this to ensure that
|
||||
a dentry that has just been looked up in another CPU doesn't get
|
||||
deleted before dget() can be done on it.
|
||||
|
||||
6. There are several ways to do reference counting of RCU protected
|
||||
objects. One such example is in ipv4 route cache where deferred
|
||||
freeing (using call_rcu()) is done as soon as the reference count
|
||||
goes to zero. This cannot be done in the case of dentries because
|
||||
tearing down of dentries requires blocking (dentry_iput()) which
|
||||
isn't supported from RCU callbacks. Instead, tearing down of
|
||||
dentries happens synchronously in dput(), but actual freeing happens
|
||||
later when RCU grace period is over. This allows safe lock-free
|
||||
walking of the hash chains, but a matched dentry may have been
|
||||
partially torn down. The checking of DCACHE_UNHASHED flag with
|
||||
d_lock held detects such dentries and prevents them from being
|
||||
returned from look-up.
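Points 3 and 6 above can be sketched as a userspace model. This is not
the kernel's __d_lookup(): struct model_dentry, MODEL_UNHASHED and the
model_*() helpers are hypothetical stand-ins, a C11 atomic_flag plays
the role of d_lock, and the RCU read-side protection of the chain walk
is left out entirely. The sketch only shows the order of operations:
filter on the hash without any lock, then take the per-dentry lock,
compare the name, and refuse the dentry if it has been unhashed.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define MODEL_UNHASHED 0x1	/* stand-in for DCACHE_UNHASHED */

struct model_dentry {
	const char *name;
	unsigned int hash;
	unsigned int flags;
	int count;			/* reference count */
	atomic_flag lock;		/* stand-in for d_lock */
	struct model_dentry *next;	/* hash chain link */
};

static void model_lock(struct model_dentry *d)
{
	while (atomic_flag_test_and_set_explicit(&d->lock,
						 memory_order_acquire))
		;	/* spin */
}

static void model_unlock(struct model_dentry *d)
{
	atomic_flag_clear_explicit(&d->lock, memory_order_release);
}

/* Walk the chain without a global lock; validate under the dentry lock. */
static struct model_dentry *model_lookup(struct model_dentry *chain,
					 const char *name, unsigned int hash)
{
	struct model_dentry *d;

	for (d = chain; d; d = d->next) {
		struct model_dentry *found = NULL;

		if (d->hash != hash)
			continue;	/* cheap filter, no lock taken */

		model_lock(d);
		if (strcmp(d->name, name) == 0) {
			/* An unhashed match makes the look-up fail. */
			if (!(d->flags & MODEL_UNHASHED)) {
				d->count++;
				found = d;
			}
			model_unlock(d);
			return found;
		}
		model_unlock(d);
	}

	return NULL;
}
```

A caller that gets a non-NULL result holds a reference and would drop
it later; a NULL result covers both "not present" and "present but
already being torn down".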
|
||||
|
||||
|
||||
Maintaining POSIX rename semantics
|
||||
==================================
|
||||
|
||||
Since look-up of dentries is lock-free, it can race against a
|
||||
concurrent rename operation. For example, during rename of file A to
|
||||
B, look-up of either A or B must succeed. So, if look-up of B happens
|
||||
after A has been removed from the hash chain but not added to the new
|
||||
hash chain, it may fail. Also, a comparison while the name is being
|
||||
written concurrently by a rename may result in false positive matches
|
||||
violating rename semantics. Issues related to race with rename are
|
||||
handled as described below :
|
||||
|
||||
1. Look-up can be done in two ways - d_lookup() which is safe from
|
||||
simultaneous renames and __d_lookup() which is not. If
|
||||
__d_lookup() fails, it must be followed up by a d_lookup() to
|
||||
correctly determine whether a dentry is in the hash table or
|
||||
not. d_lookup() protects look-ups using a sequence lock
|
||||
(rename_lock).
|
||||
|
||||
2. The name associated with a dentry (d_name) may be changed if a
|
||||
rename is allowed to happen simultaneously. To avoid memcmp() in
|
||||
__d_lookup() going out of bounds due to a rename and a false positive
|
||||
comparison, the name comparison is done while holding the
|
||||
per-dentry lock. This prevents concurrent renames during this
|
||||
operation.
|
||||
|
||||
3. Hash table walking during look-up may move to a different bucket as
|
||||
the current dentry is moved to a different bucket due to rename.
|
||||
But we use hlists in dcache hash table and they are
|
||||
null-terminated. So, even if a dentry moves to a different bucket,
|
||||
hash chain walk will terminate. [with a list_head list, it may not
|
||||
since termination is when the list_head in the original bucket is
|
||||
reached]. Since we redo the d_parent check and compare name while
|
||||
holding d_lock, lock-free look-up will not race against d_move().
|
||||
|
||||
4. There can be a theoretical race when a dentry keeps coming back to
|
||||
original bucket due to double moves. Due to this look-up may
|
||||
consider that it has never moved and can end up in an infinite loop.
|
||||
But this is not any worse than the theoretical livelocks we already
|
||||
have in the kernel.
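The relationship between the two look-up variants in point 1 can be
modelled with a sequence counter. The sketch below is a userspace
approximation using C11 atomics; model_rename_seq stands in for
rename_lock, and every model_*() name is hypothetical. The demo
callback simulates a rename slipping in during the first fast look-up
so that the retry path is exercised.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the rename_lock sequence counter: odd while a rename runs. */
static atomic_uint model_rename_seq;

static unsigned int model_read_begin(void)
{
	unsigned int seq;

	/* Wait until no rename is in progress (even value). */
	while ((seq = atomic_load_explicit(&model_rename_seq,
					   memory_order_acquire)) & 1)
		;
	return seq;
}

static bool model_read_retry(unsigned int seq)
{
	return atomic_load_explicit(&model_rename_seq,
				    memory_order_acquire) != seq;
}

static void model_rename_begin(void) { atomic_fetch_add(&model_rename_seq, 1); }
static void model_rename_end(void)   { atomic_fetch_add(&model_rename_seq, 1); }

/* Fast look-up callback; may miss an entry while a rename is running. */
typedef void *(*model_fast_lookup_fn)(void *ctx);

/*
 * Safe look-up: repeat the fast look-up until it either succeeds or
 * completes without a rename having run in the meantime.
 */
static void *model_safe_lookup(model_fast_lookup_fn fast, void *ctx)
{
	void *result;
	unsigned int seq;

	do {
		seq = model_read_begin();
		result = fast(ctx);
		if (result)
			break;
	} while (model_read_retry(seq));

	return result;
}

/* Demo callback: the first call loses to a rename, the second succeeds. */
struct model_demo {
	int calls;
	int entry;
};

static void *model_demo_lookup(void *ctx)
{
	struct model_demo *demo = ctx;

	if (demo->calls++ == 0) {
		model_rename_begin();	/* a rename slips in ... */
		model_rename_end();	/* ... and completes */
		return NULL;		/* entry was between hash chains */
	}
	return &demo->entry;
}
```

A failed fast look-up is only trusted when the counter did not change
around it; otherwise the look-up is repeated, which is why a miss from
the unsafe variant has to be followed up by the safe one.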
|
||||
|
||||
|
||||
Important guidelines for filesystem developers related to dcache_rcu
|
||||
====================================================================
|
||||
|
||||
1. Existing dcache interfaces (pre-2.5.62) exported to filesystems
|
||||
don't change. Only dcache internal implementation changes. However
|
||||
filesystems *must not* delete from the dentry hash chains directly
|
||||
using the list macros as was allowed earlier. They must use dcache
|
||||
APIs like d_drop() or __d_drop() depending on the situation.
|
||||
|
||||
2. d_flags is now protected by a per-dentry lock (d_lock). All access
|
||||
to d_flags must be protected by it.
|
||||
|
||||
3. For a hashed dentry, checking of d_count needs to be protected by
|
||||
d_lock.
|
||||
|
||||
|
||||
Papers and other documentation on dcache locking
|
||||
================================================
|
||||
|
||||
1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124).
|
||||
|
||||
2. http://lse.sourceforge.net/locking/dcache/dcache.html
|
||||
|
||||
|
||||
|
||||
113
Documentation/filesystems/directory-locking
Normal file
@@ -0,0 +1,113 @@
|
||||
Locking scheme used for directory operations is based on two
|
||||
kinds of locks - per-inode (->i_sem) and per-filesystem (->s_vfs_rename_sem).
|
||||
|
||||
For our purposes all operations fall in 6 classes:
|
||||
|
||||
1) read access. Locking rules: caller locks directory we are accessing.
|
||||
|
||||
2) object creation. Locking rules: same as above.
|
||||
|
||||
3) object removal. Locking rules: caller locks parent, finds victim,
|
||||
locks victim and calls the method.
|
||||
|
||||
4) rename() that is _not_ cross-directory. Locking rules: caller locks
|
||||
the parent, finds source and target, if target already exists - locks it
|
||||
and then calls the method.
|
||||
|
||||
5) link creation. Locking rules:
|
||||
* lock parent
|
||||
* check that source is not a directory
|
||||
* lock source
|
||||
* call the method.
|
||||
|
||||
6) cross-directory rename. The trickiest in the whole bunch. Locking
|
||||
rules:
|
||||
* lock the filesystem
|
||||
* lock parents in "ancestors first" order.
|
||||
* find source and target.
|
||||
* if old parent is equal to or is a descendent of target
|
||||
fail with -ENOTEMPTY
|
||||
* if new parent is equal to or is a descendent of source
|
||||
fail with -ELOOP
|
||||
* if target exists - lock it.
|
||||
* call the method.
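The ordering and the two failure checks for cross-directory rename can
be sketched on a toy directory tree. This is a userspace model, not the
VFS code: struct model_dir and the model_*() functions are hypothetical
names, no locks are actually taken, and the error codes simply follow
the rules as listed above. For two unrelated parents the sketch keeps
the caller's order, which is an arbitrary choice of the illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a directory; parent is NULL at the root. */
struct model_dir {
	const char *name;
	struct model_dir *parent;
};

/* True if "a" is a proper ancestor of "b". */
static bool model_is_ancestor(const struct model_dir *a,
			      const struct model_dir *b)
{
	for (b = b->parent; b; b = b->parent)
		if (b == a)
			return true;
	return false;
}

/* Pick the locking order for the two parents: ancestors first. */
static void model_lock_order(struct model_dir *p1, struct model_dir *p2,
			     struct model_dir **first,
			     struct model_dir **second)
{
	if (model_is_ancestor(p2, p1)) {
		*first = p2;
		*second = p1;
	} else {
		*first = p1;
		*second = p2;
	}
}

/* Apply the two checks from the rules above; target may be NULL. */
static int model_check_rename(const struct model_dir *old_parent,
			      const struct model_dir *new_parent,
			      const struct model_dir *source,
			      const struct model_dir *target)
{
	if (target && (old_parent == target ||
		       model_is_ancestor(target, old_parent)))
		return -ENOTEMPTY;
	if (new_parent == source || model_is_ancestor(source, new_parent))
		return -ELOOP;
	return 0;
}
```

For example, moving a directory into one of its own descendants trips
the second check, and renaming something over a directory that still
contains it trips the first.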
|
||||
|
||||
|
||||
The rules above obviously guarantee that all directories that are going to be
|
||||
read, modified or removed by method will be locked by caller.
|
||||
|
||||
|
||||
If no directory is its own ancestor, the scheme above is deadlock-free.
|
||||
Proof:
|
||||
|
||||
First of all, at any moment we have a partial ordering of the
|
||||
objects - A < B iff A is an ancestor of B.
|
||||
|
||||
That ordering can change. However, the following is true:
|
||||
|
||||
(1) if object removal or non-cross-directory rename holds lock on A and
|
||||
attempts to acquire lock on B, A will remain the parent of B until we
|
||||
acquire the lock on B. (Proof: only cross-directory rename can change
|
||||
the parent of object and it would have to lock the parent).
|
||||
|
||||
(2) if cross-directory rename holds the lock on filesystem, order will not
|
||||
change until rename acquires all locks. (Proof: other cross-directory
|
||||
renames will be blocked on filesystem lock and we don't start changing
|
||||
the order until we had acquired all locks).
|
||||
|
||||
(3) any operation holds at most one lock on non-directory object and
|
||||
that lock is acquired after all other locks. (Proof: see descriptions
|
||||
of operations).
|
||||
|
||||
Now consider the minimal deadlock. Each process is blocked on
|
||||
attempt to acquire some lock and already holds at least one lock. Let's
|
||||
consider the set of contended locks. First of all, filesystem lock is
|
||||
not contended, since any process blocked on it is not holding any locks.
|
||||
Thus all processes are blocked on ->i_sem.
|
||||
|
||||
Non-directory objects are not contended due to (3). Thus link
|
||||
creation can't be a part of deadlock - it can't be blocked on source
|
||||
and it means that it doesn't hold any locks.
|
||||
|
||||
Any contended object is either held by cross-directory rename or
|
||||
has a child that is also contended. Indeed, suppose that it is held by
|
||||
operation other than cross-directory rename. Then the lock this operation
|
||||
is blocked on belongs to child of that object due to (1).
|
||||
|
||||
It means that one of the operations is cross-directory rename.
|
||||
Otherwise the set of contended objects would be infinite - each of them
|
||||
would have a contended child and we had assumed that no object is its
|
||||
own descendent. Moreover, there is exactly one cross-directory rename
|
||||
(see above).
|
||||
|
||||
Consider the object blocking the cross-directory rename. One
|
||||
of its descendents is locked by cross-directory rename (otherwise we
|
||||
would again have an infinite set of contended objects). But that
|
||||
means that cross-directory rename is taking locks out of order. Due
|
||||
to (2) the order hadn't changed since we had acquired filesystem lock.
|
||||
But locking rules for cross-directory rename guarantee that we do not
|
||||
try to acquire lock on descendent before the lock on ancestor.
|
||||
Contradiction. I.e. deadlock is impossible. Q.E.D.
|
||||
|
||||
|
||||
These operations are guaranteed to avoid loop creation. Indeed,
|
||||
the only operation that could introduce loops is cross-directory rename.
|
||||
Since the only new (parent, child) pair added by rename() is (new parent,
|
||||
source), such loop would have to contain these objects and the rest of it
|
||||
would have to exist before rename(). I.e. at the moment of loop creation
|
||||
rename() responsible for that would be holding filesystem lock and new parent
|
||||
would have to be equal to or a descendent of source. But that means that
|
||||
new parent had been equal to or a descendent of source since the moment when
|
||||
we had acquired filesystem lock and rename() would fail with -ELOOP in that
|
||||
case.
|
||||
|
||||
While this locking scheme works for arbitrary DAGs, it relies on
|
||||
ability to check that directory is a descendent of another object. Current
|
||||
implementation assumes that directory graph is a tree. This assumption is
|
||||
also preserved by all operations (cross-directory rename on a tree that would
|
||||
not introduce a cycle will leave it a tree and link() fails for directories).
|
||||
|
||||
Notice that "directory" in the above == "anything that might have
|
||||
children", so if we are going to introduce hybrid objects we will need
|
||||
either to make sure that link(2) doesn't work for them or to make changes
|
||||
in is_subdir() that would make it work even in presence of such beasts.
|
||||
130
Documentation/filesystems/dlmfs.txt
Normal file
@@ -0,0 +1,130 @@
|
||||
dlmfs
|
||||
==================
|
||||
A minimal DLM userspace interface implemented via a virtual file
|
||||
system.
|
||||
|
||||
dlmfs is built with OCFS2 as it requires most of its infrastructure.
|
||||
|
||||
Project web page: http://oss.oracle.com/projects/ocfs2
|
||||
Tools web page: http://oss.oracle.com/projects/ocfs2-tools
|
||||
OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/
|
||||
|
||||
All code copyright 2005 Oracle except when otherwise noted.
|
||||
|
||||
CREDITS
|
||||
=======
|
||||
|
||||
Some code taken from ramfs which is Copyright (C) 2000 Linus Torvalds
|
||||
and Transmeta Corp.
|
||||
|
||||
Mark Fasheh <mark.fasheh@oracle.com>
|
||||
|
||||
Caveats
|
||||
=======
|
||||
- Right now it only works with the OCFS2 DLM, though support for other
|
||||
DLM implementations should not be a major issue.
|
||||
|
||||
Mount options
|
||||
=============
|
||||
None
|
||||
|
||||
Usage
|
||||
=====
|
||||
|
||||
If you're just interested in OCFS2, then please see ocfs2.txt. The
|
||||
rest of this document will be geared towards those who want to use
|
||||
dlmfs for easy-to-set-up and easy-to-use clustered locking in
|
||||
userspace.
|
||||
|
||||
Setup
|
||||
=====
|
||||
|
||||
dlmfs requires that the OCFS2 cluster infrastructure be in
|
||||
place. Please download ocfs2-tools from the above url and configure a
|
||||
cluster.
|
||||
|
||||
You'll want to start heartbeating on a volume which all the nodes in
|
||||
your lockspace can access. The easiest way to do this is via
|
||||
ocfs2_hb_ctl (distributed with ocfs2-tools). Right now it requires
|
||||
that an OCFS2 file system be in place so that it can automatically
|
||||
find its heartbeat area, though it will eventually support heartbeat
|
||||
against raw disks.
|
||||
|
||||
Please see the ocfs2_hb_ctl and mkfs.ocfs2 manual pages distributed
|
||||
with ocfs2-tools.
|
||||
|
||||
Once you're heartbeating, DLM lock 'domains' can be easily created /
|
||||
destroyed and locks within them accessed.
|
||||
|
||||
Locking
|
||||
=======
|
||||
|
||||
Users may access dlmfs via standard file system calls, or they can use
|
||||
'libo2dlm' (distributed with ocfs2-tools) which abstracts the file
|
||||
system calls and presents a more traditional locking api.
|
||||
|
||||
dlmfs handles lock caching automatically for the user, so a lock
|
||||
request for an already acquired lock will not generate another DLM
|
||||
call. Userspace programs are assumed to handle their own local
|
||||
locking.
|
||||
|
||||
Two levels of locks are supported - Shared Read, and Exclusive.
|
||||
Also supported is a Trylock operation.
|
||||
|
||||
For information on the libo2dlm interface, please see o2dlm.h,
|
||||
distributed with ocfs2-tools.
|
||||
|
||||
Lock value blocks can be read and written to a resource via read(2)
|
||||
and write(2) against the fd obtained via your open(2) call. The
|
||||
maximum currently supported LVB length is 64 bytes (though that is an
|
||||
OCFS2 DLM limitation). Through this mechanism, users of dlmfs can share
|
||||
small amounts of data amongst their nodes.
|
||||
|
||||
mkdir(2) signals dlmfs to join a domain (which will have the same name
|
||||
as the resulting directory)
|
||||
|
||||
rmdir(2) signals dlmfs to leave the domain
|
||||
|
||||
Locks for a given domain are represented by regular inodes inside the
|
||||
domain directory. Locking against them is done via the open(2) system
|
||||
call.
|
||||
|
||||
The open(2) call will not return until your lock has been granted or
|
||||
an error has occurred, unless it has been instructed to do a trylock
|
||||
operation. If the lock succeeds, you'll get an fd.
|
||||
|
||||
open(2) with O_CREAT to ensure the resource inode is created - dlmfs does
|
||||
not automatically create inodes for existing lock resources.
|
||||
|
||||
Open Flag     Lock Request Type
---------     -----------------
O_RDONLY      Shared Read
O_RDWR        Exclusive
|
||||
|
||||
Open Flag     Resulting Locking Behavior
---------     --------------------------
O_NONBLOCK    Trylock operation
|
||||
|
||||
You must provide exactly one of O_RDONLY or O_RDWR.
|
||||
|
||||
If O_NONBLOCK is also provided and the trylock operation was valid but
|
||||
could not lock the resource then open(2) will fail with ETXTBSY.
|
||||
|
||||
close(2) drops the lock associated with your fd.
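The calls described above can be sketched from userspace as follows. This
is a rough Python illustration, not part of ocfs2-tools: the mount point
/dlm and the helper names lock_flags(), acquire() and release() are
assumptions of this sketch, and the "busy" error of a failed trylock is
matched against the constant Python exposes as errno.ETXTBSY.

```python
import errno
import os

DLMFS_ROOT = "/dlm"  # assumed mount point; this document does not name one


def lock_flags(exclusive, trylock=False):
    """Build the open(2) flags for a dlmfs lock request."""
    flags = os.O_CREAT | (os.O_RDWR if exclusive else os.O_RDONLY)
    if trylock:
        flags |= os.O_NONBLOCK
    return flags


def acquire(domain, resource, exclusive, trylock=False, root=DLMFS_ROOT):
    """Join the domain if needed, then take the lock and return an fd.

    Returns None when a trylock could not be granted.
    """
    domain_dir = os.path.join(root, domain)
    try:
        os.mkdir(domain_dir)          # mkdir(2) joins the domain
    except FileExistsError:
        pass                          # already joined on this node
    try:
        return os.open(os.path.join(domain_dir, resource),
                       lock_flags(exclusive, trylock), 0o600)
    except OSError as err:
        if trylock and err.errno == errno.ETXTBSY:
            return None
        raise


def release(fd):
    """close(2) drops the lock associated with the fd."""
    os.close(fd)
```

A caller would typically pair acquire() and release() in a try/finally
block so that the lock is dropped even if the protected work fails.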
|
||||
|
||||
Modes passed to mkdir(2) or open(2) are adhered to locally. Chown is
|
||||
supported locally as well. This means you can use them to restrict
|
||||
access to the resources via dlmfs on your local node only.
|
||||
|
||||
The resource LVB may be read from the fd in either Shared Read or
|
||||
Exclusive modes via the read(2) system call. It can be written via
|
||||
write(2) only when open in Exclusive mode.
|
||||
|
||||
Once written, an LVB will be visible to other nodes who obtain Shared
Read or higher level locks on the resource.
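Reading and writing a lock value block then reduces to ordinary file I/O on
the lock file. The sketch below is a hypothetical Python wrapper (the names
write_lvb() and read_lvb() are not from libo2dlm); it assumes the path
points at a resource inside a joined domain directory and enforces the
64-byte limit quoted above on the client side. How a short write is padded
by the DLM is not covered by this document and is not modelled here.

```python
import os

LVB_MAX = 64  # maximum LVB length quoted above (an OCFS2 DLM limit)


def write_lvb(path, data):
    """Take an Exclusive lock (O_RDWR), write the LVB, drop the lock."""
    if len(data) > LVB_MAX:
        raise ValueError("LVB is limited to %d bytes" % LVB_MAX)
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)              # close(2) drops the lock


def read_lvb(path):
    """Take a Shared Read lock (O_RDONLY), read the LVB, drop the lock."""
    fd = os.open(path, os.O_CREAT | os.O_RDONLY, 0o600)
    try:
        return os.read(fd, LVB_MAX)
    finally:
        os.close(fd)
```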
|
||||
|
||||
See Also
|
||||
========
|
||||
http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf
|
||||
|
||||
For more information on the VMS distributed locking API.
|
||||
382
Documentation/filesystems/ext2.txt
Normal file
@@ -0,0 +1,382 @@
|
||||
|
||||
The Second Extended Filesystem
|
||||
==============================
|
||||
|
||||
ext2 was originally released in January 1993. Written by Rémy Card,
|
||||
Theodore Ts'o and Stephen Tweedie, it was a major rewrite of the
|
||||
Extended Filesystem. It is currently still (April 2001) the predominant
|
||||
filesystem in use by Linux. There are also implementations available
|
||||
for NetBSD, FreeBSD, the GNU HURD, Windows 95/98/NT, OS/2 and RISC OS.
|
||||
|
||||
Options
|
||||
=======
|
||||
|
||||
Most defaults are determined by the filesystem superblock, and can be
|
||||
set using tune2fs(8). Kernel-determined defaults are indicated by (*).
|
||||
|
||||
bsddf               (*) Makes `df' act like BSD.
minixdf                 Makes `df' act like Minix.

check=none, nocheck (*) Don't do extra checking of bitmaps on mount
                        (check=normal and check=strict options removed)

debug                   Extra debugging information is sent to the
                        kernel syslog. Useful for developers.

errors=continue         Keep going on a filesystem error.
errors=remount-ro       Remount the filesystem read-only on an error.
errors=panic            Panic and halt the machine if an error occurs.

grpid, bsdgroups        Give objects the same group ID as their parent.
nogrpid, sysvgroups     New objects have the group ID of their creator.

nouid32                 Use 16-bit UIDs and GIDs.

oldalloc                Enable the old block allocator. Orlov should
                        have better performance, we'd like to get some
                        feedback if it's the contrary for you.
orlov               (*) Use the Orlov block allocator.
                        (See http://lwn.net/Articles/14633/ and
                        http://lwn.net/Articles/14446/.)

resuid=n                The user ID which may use the reserved blocks.
resgid=n                The group ID which may use the reserved blocks.

sb=n                    Use alternate superblock at this location.

user_xattr              Enable "user." POSIX Extended Attributes
                        (requires CONFIG_EXT2_FS_XATTR).
                        See also http://acl.bestbits.at
nouser_xattr            Don't support "user." extended attributes.

acl                     Enable POSIX Access Control Lists support
                        (requires CONFIG_EXT2_FS_POSIX_ACL).
                        See also http://acl.bestbits.at
noacl                   Don't support POSIX ACLs.

nobh                    Do not attach buffer_heads to file pagecache.

xip                     Use execute in place (no caching) if possible

grpquota,noquota,quota,usrquota  Quota options are silently ignored by ext2.
|
||||
|
||||
|
||||
Specification
|
||||
=============
|
||||
|
||||
ext2 shares many properties with traditional Unix filesystems. It has
|
||||
the concepts of blocks, inodes and directories. It has space in the
|
||||
specification for Access Control Lists (ACLs), fragments, undeletion and
|
||||
compression though these are not yet implemented (some are available as
|
||||
separate patches). There is also a versioning mechanism to allow new
|
||||
features (such as journalling) to be added in a maximally compatible
|
||||
manner.
|
||||
|
||||
Blocks
|
||||
------
|
||||
|
||||
The space in the device or file is split up into blocks. These are
|
||||
a fixed size, of 1024, 2048 or 4096 bytes (8192 bytes on Alpha systems),
|
||||
which is decided when the filesystem is created. Smaller blocks mean
|
||||
less wasted space per file, but require slightly more accounting overhead,
|
||||
and also impose other limits on the size of files and the filesystem.
|
||||
|
||||
Block Groups
|
||||
------------
|
||||
|
||||
Blocks are clustered into block groups in order to reduce fragmentation
|
||||
and minimise the amount of head seeking when reading a large amount
|
||||
of consecutive data. Information about each block group is kept in a
|
||||
descriptor table stored in the block(s) immediately after the superblock.
|
||||
Two blocks near the start of each group are reserved for the block usage
|
||||
bitmap and the inode usage bitmap which show which blocks and inodes
|
||||
are in use. Since each bitmap is limited to a single block, this means
|
||||
that the maximum number of blocks in a block group is 8 times the block size in bytes.
|
||||
|
||||
The block(s) following the bitmaps in each block group are designated
|
||||
as the inode table for that block group and the remainder are the data
|
||||
blocks. The block allocation algorithm attempts to allocate data blocks
|
||||
in the same block group as the inode which contains them.
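The bitmap limit can be worked out directly: a one-block bitmap has eight
bits per byte and one bit per block. The short Python sketch below (the
function names are illustrative only) prints the resulting maximum group
size for the common block sizes.

```python
def blocks_per_group_max(block_size):
    """One bitmap block has 8 * block_size bits, one bit per block."""
    return 8 * block_size


def group_bytes_max(block_size):
    """Largest possible block group, in bytes, for a given block size."""
    return blocks_per_group_max(block_size) * block_size


for size in (1024, 2048, 4096):
    print(size, blocks_per_group_max(size),
          group_bytes_max(size) // 2**20, "MiB")
```

For 1kB blocks this gives 8192 blocks (8 MiB) per group, and for 4kB blocks
32768 blocks (128 MiB) per group.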
|
||||
|
||||
The Superblock
|
||||
--------------
|
||||
|
||||
The superblock contains all the information about the configuration of
|
||||
the filing system. The primary copy of the superblock is stored at an
|
||||
offset of 1024 bytes from the start of the device, and it is essential
|
||||
to mounting the filesystem. Since it is so important, backup copies of
|
||||
the superblock are stored in block groups throughout the filesystem.
|
||||
The first version of ext2 (revision 0) stores a copy at the start of
|
||||
every block group, along with backups of the group descriptor block(s).
|
||||
Because this can consume a considerable amount of space for large
|
||||
filesystems, later revisions can optionally reduce the number of backup
|
||||
copies by only putting backups in specific groups (this is the sparse
|
||||
superblock feature). The groups chosen are 0, 1 and powers of 3, 5 and 7.
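As an illustration of the sparse superblock rule, the following Python
sketch decides whether a given block group carries a backup copy. The
function names are made up for this example; only the rule itself (groups
0, 1 and powers of 3, 5 and 7) comes from the text above.

```python
def is_power_of(n, base):
    """True if n == base**k for some integer k >= 0."""
    while n > 1 and n % base == 0:
        n //= base
    return n == 1


def group_has_backup(group):
    """Groups 0, 1 and powers of 3, 5 and 7 hold a superblock copy."""
    return group == 0 or any(is_power_of(group, b) for b in (3, 5, 7))
```

Among the first fifty groups this selects 0, 1, 3, 5, 7, 9, 25, 27 and 49.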
|
||||
|
||||
The information in the superblock contains fields such as the total
|
||||
number of inodes and blocks in the filesystem and how many are free,
|
||||
how many inodes and blocks are in each block group, when the filesystem
|
||||
was mounted (and if it was cleanly unmounted), when it was modified,
|
||||
what version of the filesystem it is (see the Revisions section below)
|
||||
and which OS created it.
|
||||
|
||||
If the filesystem is revision 1 or higher, then there are extra fields,
|
||||
such as a volume name, a unique identification number, the inode size,
|
||||
and space for optional filesystem features to store configuration info.
|
||||
|
||||
All fields in the superblock (as in all other ext2 structures) are stored
|
||||
on the disc in little endian format, so a filesystem is portable between
|
||||
machines without having to know what machine it was created on.
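A minimal sketch of reading a few of these fields from a filesystem image
is shown below. It is written in Python for brevity and is not the e2fsprogs
implementation. The field offsets (0, 4, 24, 56 and 76) and the magic number
0xEF53 are taken from the kernel's struct ext2_super_block as best recalled
and are not stated in this document, so they should be checked against the
ext2 header before being relied upon.

```python
import struct

SUPERBLOCK_OFFSET = 1024   # primary superblock offset, from the text above
EXT2_SUPER_MAGIC = 0xEF53  # assumption: value taken from the kernel headers


def parse_superblock(image):
    """Decode a few little-endian fields from a filesystem image (bytes)."""
    sb = image[SUPERBLOCK_OFFSET:SUPERBLOCK_OFFSET + 1024]
    inodes, blocks = struct.unpack_from("<II", sb, 0)   # total counts
    (log_block_size,) = struct.unpack_from("<I", sb, 24)
    (magic,) = struct.unpack_from("<H", sb, 56)
    (rev_level,) = struct.unpack_from("<I", sb, 76)
    if magic != EXT2_SUPER_MAGIC:
        raise ValueError("not an ext2 superblock")
    return {
        "inodes_count": inodes,
        "blocks_count": blocks,
        "block_size": 1024 << log_block_size,
        "rev_level": rev_level,
    }
```

Because every field is decoded with an explicit "<" (little endian) format,
the same code gives the same answer on big endian and little endian hosts.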
|
||||
|
||||
Inodes
|
||||
------
|
||||
|
||||
The inode (index node) is a fundamental concept in the ext2 filesystem.
|
||||
Each object in the filesystem is represented by an inode. The inode
|
||||
structure contains pointers to the filesystem blocks which contain the
|
||||
data held in the object and all of the metadata about an object except
|
||||
its name. The metadata about an object includes the permissions, owner,
|
||||
group, flags, size, number of blocks used, access time, change time,
|
||||
modification time, deletion time, number of links, fragments, version
|
||||
(for NFS) and extended attributes (EAs) and/or Access Control Lists (ACLs).
|
||||
|
||||
There are some reserved fields which are currently unused in the inode
|
||||
structure and several which are overloaded. One field is reserved for the
|
||||
directory ACL if the inode is a directory and alternately for the top 32
|
||||
bits of the file size if the inode is a regular file (allowing file sizes
|
||||
larger than 2GB). The translator field is unused under Linux, but is used
|
||||
by the HURD to reference the inode of a program which will be used to
|
||||
interpret this object. Most of the remaining reserved fields have been
|
||||
used up for both Linux and the HURD for larger owner and group fields.
|
||||
The HURD also has a larger mode field so it uses another of the remaining
|
||||
fields to store the extra mode bits.
|
||||
|
||||
There are pointers to the first 12 blocks which contain the file's data
|
||||
in the inode. There is a pointer to an indirect block (which contains
|
||||
pointers to the next set of blocks), a pointer to a doubly-indirect
|
||||
block (which contains pointers to indirect blocks) and a pointer to a
|
||||
trebly-indirect block (which contains pointers to doubly-indirect blocks).
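The reach of this pointer scheme is easy to compute. The Python sketch
below assumes 4-byte (32-bit) block pointers, which this document does not
state explicitly, and uses illustrative function names.

```python
POINTER_SIZE = 4    # assumption: block numbers are 32-bit values
DIRECT_BLOCKS = 12  # direct pointers held in the inode


def max_file_blocks(block_size):
    """Blocks reachable via direct, single, double and triple indirection."""
    n = block_size // POINTER_SIZE      # pointers per indirect block
    return DIRECT_BLOCKS + n + n**2 + n**3


def indirection_level(index, block_size):
    """Return 0 for a direct block, 1..3 for the indirect levels."""
    n = block_size // POINTER_SIZE
    if index < DIRECT_BLOCKS:
        return 0
    index -= DIRECT_BLOCKS
    for level in (1, 2, 3):
        if index < n**level:
            return level
        index -= n**level
    raise ValueError("block index beyond triple indirection")
```

With 1kB blocks this yields 16843020 addressable blocks, or roughly 16GB,
which agrees with the file size limit listed for 1kB blocks under
Limitations.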
|
||||
|
||||
The flags field contains some ext2-specific flags which aren't catered
|
||||
for by the standard chmod flags. These flags can be listed with lsattr
|
||||
and changed with the chattr command, and allow specific filesystem
|
||||
behaviour on a per-file basis. There are flags for secure deletion,
|
||||
undeletable, compression, synchronous updates, immutability, append-only,
|
||||
dumpable, no-atime, indexed directories, and data-journaling. Not all
|
||||
of these are supported yet.
|
||||
|
||||
Directories
|
||||
-----------
|
||||
|
||||
A directory is a filesystem object and has an inode just like a file.
|
||||
It is a specially formatted file containing records which associate
|
||||
each name with an inode number. Later revisions of the filesystem also
|
||||
encode the type of the object (file, directory, symlink, device, fifo,
|
||||
socket) to avoid the need to check the inode itself for this information
|
||||
(support for taking advantage of this feature does not yet exist in
|
||||
Glibc 2.2).
|
||||
|
||||
The inode allocation code tries to assign inodes which are in the same
|
||||
block group as the directory in which they are first created.
|
||||
|
||||
The current implementation of ext2 uses a singly-linked list to store
|
||||
the filenames in the directory; a pending enhancement uses hashing of the
|
||||
filenames to allow lookup without the need to scan the entire directory.
|
||||
|
||||
The current implementation never removes empty directory blocks once they
|
||||
have been allocated to hold more files.
|
||||
|
||||
Special files
|
||||
-------------
|
||||
|
||||
Symbolic links are also filesystem objects with inodes. They deserve
|
||||
special mention because the data for them is stored within the inode
|
||||
itself if the symlink is less than 60 bytes long. It uses the fields
|
||||
which would normally be used to store the pointers to data blocks.
|
||||
This is a worthwhile optimisation as it avoids allocating a full
|
||||
block for the symlink, and most symlinks are less than 60 characters long.
|
||||
|
||||
Character and block special devices never have data blocks assigned to
|
||||
them. Instead, their device number is stored in the inode, again reusing
|
||||
the fields which would be used to point to the data blocks.
|
||||
|
||||
Reserved Space
|
||||
--------------
|
||||
|
||||
In ext2, there is a mechanism for reserving a certain number of blocks
|
||||
for a particular user (normally the super-user). This is intended to
|
||||
allow for the system to continue functioning even if non-privileged users
|
||||
fill up all the space available to them (this is independent of filesystem
|
||||
quotas). It also keeps the filesystem from filling up entirely which
|
||||
helps combat fragmentation.
|
||||
|
||||
Filesystem check
|
||||
----------------
|
||||
|
||||
At boot time, most systems run a consistency check (e2fsck) on their
|
||||
filesystems. The superblock of the ext2 filesystem contains several
|
||||
fields which indicate whether fsck should actually run (since checking
|
||||
the filesystem at boot can take a long time if it is large). fsck will
|
||||
run if the filesystem was not cleanly unmounted, if the maximum mount
|
||||
count has been exceeded or if the maximum time between checks has been
|
||||
exceeded.
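The three conditions can be summarised in a few lines. The Python function
below is a hedged sketch, not the e2fsck logic itself: the parameter names
are hypothetical stand-ins for the superblock fields, and treating a
non-positive maximum mount count or a zero interval as "no limit" is an
assumption of this example.

```python
def fsck_should_run(clean, mount_count, max_mount_count,
                    last_check, check_interval, now):
    """Decide whether a boot-time check is due.

    Times are in seconds since the epoch.
    """
    if not clean:
        return True                                   # unclean unmount
    if max_mount_count > 0 and mount_count > max_mount_count:
        return True                                   # mounted too often
    if check_interval > 0 and now - last_check > check_interval:
        return True                                   # checked too long ago
    return False
```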
|
||||
|
||||
Feature Compatibility
|
||||
---------------------
|
||||
|
||||
The compatibility feature mechanism used in ext2 is sophisticated.
|
||||
It safely allows features to be added to the filesystem, without
|
||||
unnecessarily sacrificing compatibility with older versions of the
|
||||
filesystem code. The feature compatibility mechanism is not supported by
|
||||
the original revision 0 (EXT2_GOOD_OLD_REV) of ext2, but was introduced in
|
||||
revision 1. There are three 32-bit fields, one for compatible features
|
||||
(COMPAT), one for read-only compatible (RO_COMPAT) features and one for
|
||||
incompatible (INCOMPAT) features.
|
||||
|
||||
These feature flags have specific meanings for the kernel as follows:
|
||||
|
||||
A COMPAT flag indicates that a feature is present in the filesystem,
|
||||
but the on-disk format is 100% compatible with older on-disk formats, so
|
||||
a kernel which didn't know anything about this feature could read/write
|
||||
the filesystem without any chance of corrupting the filesystem (or even
|
||||
making it inconsistent). This is essentially just a flag which says
|
||||
"this filesystem has a (hidden) feature" that the kernel or e2fsck may
|
||||
want to be aware of (more on e2fsck and feature flags later). The ext3
|
||||
HAS_JOURNAL feature is a COMPAT flag because the ext3 journal is simply
|
||||
a regular file with data blocks in it so the kernel does not need to
|
||||
take any special notice of it if it doesn't understand ext3 journaling.
|
||||
|
||||
An RO_COMPAT flag indicates that the on-disk format is 100% compatible
|
||||
with older on-disk formats for reading (i.e. the feature does not change
|
||||
the visible on-disk format). However, an old kernel writing to such a
|
||||
filesystem would/could corrupt the filesystem, so this is prevented. The
|
||||
most common such feature, SPARSE_SUPER, is an RO_COMPAT feature because
|
||||
sparse groups allow file data blocks where superblock/group descriptor
|
||||
backups used to live, and ext2_free_blocks() refuses to free these blocks,
|
||||
which would lead to inconsistent bitmaps. An old kernel would also
|
||||
get an error if it tried to free a series of blocks which crossed a group
|
||||
boundary, but this is a legitimate layout in a SPARSE_SUPER filesystem.
|
||||
|
||||
An INCOMPAT flag indicates the on-disk format has changed in some
|
||||
way that makes it unreadable by older kernels, or would otherwise
|
||||
cause a problem if an old kernel tried to mount it. FILETYPE is an
|
||||
INCOMPAT flag because older kernels would think a filename was longer
|
||||
than 256 characters, which would lead to corrupt directory listings.
|
||||
The COMPRESSION flag is an obvious INCOMPAT flag - if the kernel
|
||||
doesn't understand compression, you would just get garbage back from
|
||||
read() instead of it automatically decompressing your data. The ext3
|
||||
RECOVER flag is needed to prevent a kernel which does not understand the
|
||||
ext3 journal from mounting the filesystem without replaying the journal.
|
||||
|
||||
e2fsck needs to be stricter than the kernel in its handling of these
flags. If it doesn't understand ANY of the COMPAT,
|
||||
RO_COMPAT, or INCOMPAT flags it will refuse to check the filesystem,
|
||||
because it has no way of verifying whether a given feature is valid
|
||||
or not. Allowing e2fsck to succeed on a filesystem with an unknown
|
||||
feature would give the user a false sense of security. Refusing to check
|
||||
a filesystem with unknown features is a good incentive for the user to
|
||||
update to the latest e2fsck. This also means that anyone adding feature
|
||||
flags to ext2 also needs to update e2fsck to verify these features.
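The difference between the kernel's and e2fsck's treatment of the three
fields can be sketched as two small decision functions. This is an
illustrative Python model, not the kernel or e2fsprogs code: the function
names are invented, and the bit values are quoted from the ext2/ext3
headers as best recalled, so check them against your tree.

```python
# Assumed bit values from the ext2/ext3 headers; verify before use.
COMPAT_HAS_JOURNAL = 0x0004
RO_COMPAT_SPARSE_SUPER = 0x0001
INCOMPAT_COMPRESSION = 0x0001
INCOMPAT_FILETYPE = 0x0002
INCOMPAT_RECOVER = 0x0004


def kernel_mount_mode(ro_compat, incompat, known_ro_compat, known_incompat):
    """Return "rw", "ro" or "refuse" for a kernel with the given support.

    Unknown COMPAT bits never matter to the kernel, so the COMPAT
    field is not even passed in.
    """
    if incompat & ~known_incompat:
        return "refuse"          # unknown INCOMPAT: do not mount at all
    if ro_compat & ~known_ro_compat:
        return "ro"              # unknown RO_COMPAT: reading is safe
    return "rw"


def e2fsck_can_check(compat, ro_compat, incompat,
                     known_compat, known_ro_compat, known_incompat):
    """e2fsck refuses if ANY bit in any of the three fields is unknown."""
    return not (compat & ~known_compat
                or ro_compat & ~known_ro_compat
                or incompat & ~known_incompat)
```

For example, a kernel that knows only FILETYPE would refuse a filesystem
with the ext3 RECOVER flag set, while an e2fsck that does not know
HAS_JOURNAL would refuse to check a filesystem the kernel mounts happily.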
|
||||
|
||||
Metadata
|
||||
--------
|
||||
|
||||
It is frequently claimed that the ext2 implementation of writing
|
||||
asynchronous metadata is faster than the ffs synchronous metadata
|
||||
scheme but less reliable. Both methods are equally resolvable by their
|
||||
respective fsck programs.
|
||||
|
||||
If you're exceptionally paranoid, there are 3 ways of making metadata
|
||||
writes synchronous on ext2:
|
||||
|
||||
per-file if you have the program source: use the O_SYNC flag to open()
|
||||
per-file if you don't have the source: use "chattr +S" on the file
|
||||
per-filesystem: add the "sync" option to mount (or in /etc/fstab)
|
||||
|
||||
The first and last are not ext2 specific but do force the metadata to
|
||||
be written synchronously. See also Journaling below.
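For the first of these methods, a program only needs to add O_SYNC to its
open flags. A small Python sketch (the helper name write_sync() is made up
for this example) might look like this; the other two methods need no code
change, only "chattr +S" on the file or the "sync" mount option.

```python
import os


def write_sync(path, data):
    """Write data through a descriptor opened with O_SYNC."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        written = 0
        while written < len(data):
            # each write(2) returns only once the data is on stable storage
            written += os.write(fd, data[written:])
    finally:
        os.close(fd)
    return written
```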
|
||||
|
||||
Limitations
|
||||
-----------
|
||||
|
||||
There are various limits imposed by the on-disk layout of ext2. Other
|
||||
limits are imposed by the current implementation of the kernel code.
|
||||
Many of the limits are determined at the time the filesystem is first
|
||||
created, and depend upon the block size chosen. The ratio of inodes to
|
||||
data blocks is fixed at filesystem creation time, so the only way to
|
||||
increase the number of inodes is to increase the size of the filesystem.
|
||||
No tools currently exist which can change the ratio of inodes to blocks.
|
||||
|
||||
Most of these limits could be overcome with slight changes in the on-disk
|
||||
format and using a compatibility flag to signal the format change (at
|
||||
the expense of some compatibility).
|
||||
|
||||
Filesystem block size:     1kB     2kB     4kB     8kB

File size limit:          16GB   256GB  2048GB  2048GB
Filesystem size limit:  2047GB  8192GB 16384GB 32768GB
|
||||
|
||||
There is a 2.4 kernel limit of 2048GB for a single block device, so no
|
||||
filesystem larger than that can be created at this time. There is also
|
||||
an upper limit on the block size imposed by the page size of the kernel,
|
||||
so 8kB blocks are only allowed on Alpha systems (and other architectures
|
||||
which support larger pages).
|
||||
|
||||
There is an upper limit of 32768 subdirectories in a single directory.
|
||||
|
||||
There is a "soft" upper limit of about 10-15k files in a single directory
|
||||
with the current linear linked-list directory implementation. This limit
|
||||
stems from performance problems when creating and deleting (and also
|
||||
finding) files in such large directories. Using a hashed directory index
|
||||
(under development) allows 100k-1M+ files in a single directory without
|
||||
performance problems (although RAM size becomes an issue at this point).
|
||||
|
||||
The (meaningless) absolute upper limit of files in a single directory
|
||||
(imposed by the file size, the realistic limit is obviously much less)
|
||||
is over 130 trillion files. It would be higher except there are not
|
||||
enough 4-character names to make up unique directory entries, so they
|
||||
have to be 8-character filenames; even then we are fairly close to
|
||||
running out of unique filenames.
|
||||
|
||||
Journaling
|
||||
----------
|
||||
|
||||
A journaling extension to the ext2 code has been developed by Stephen
|
||||
Tweedie. It avoids the risks of metadata corruption and the need to
|
||||
wait for e2fsck to complete after a crash, without requiring a change
|
||||
to the on-disk ext2 layout. In a nutshell, the journal is a regular
|
||||
file which stores whole metadata (and optionally data) blocks that have
|
||||
been modified, prior to writing them into the filesystem. This means
|
||||
it is possible to add a journal to an existing ext2 filesystem without
|
||||
the need for data conversion.
|
||||
|
||||
When changes are made to the filesystem (e.g. a file is renamed), they are stored in
|
||||
a transaction in the journal and can either be complete or incomplete at
|
||||
the time of a crash. If a transaction is complete at the time of a crash
|
||||
(or in the normal case where the system does not crash), then any blocks
|
||||
in that transaction are guaranteed to represent a valid filesystem state,
|
||||
and are copied into the filesystem. If a transaction is incomplete at
|
||||
the time of the crash, then there is no guarantee of consistency for
|
||||
the blocks in that transaction so they are discarded (which means any
|
||||
filesystem changes they represent are also lost).
|
||||
Check Documentation/filesystems/ext3.txt if you want to read more about
|
||||
ext3 and journaling.
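The recovery rule described above can be modelled in a few lines. The
following Python sketch is a toy model only: it represents the filesystem
and each transaction as dictionaries of block number to block contents,
which has nothing to do with the real on-disk journal format, and the
function name replay() is invented for the example.

```python
def replay(filesystem, journal):
    """Apply complete transactions in order; drop incomplete ones.

    filesystem: dict mapping block number -> block contents
    journal:    list of (blocks, complete) pairs, oldest first, where
                blocks is a dict of the same shape as filesystem
    Returns the number of transactions that were discarded.
    """
    discarded = 0
    for blocks, complete in journal:
        if complete:
            filesystem.update(blocks)   # copy whole blocks into place
        else:
            discarded += 1              # the changes it held are lost
    return discarded
```

Because whole blocks are copied, replaying the same complete transaction
twice leaves the filesystem in the same state, which is what makes an
interrupted recovery safe to restart in this model.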
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
The kernel source       file:/usr/src/linux/fs/ext2/
e2fsprogs (e2fsck)      http://e2fsprogs.sourceforge.net/
Design & Implementation http://e2fsprogs.sourceforge.net/ext2intro.html
Journaling (ext3)       ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/
Filesystem Resizing     http://ext2resize.sourceforge.net/
Compression (*)         http://e2compr.sourceforge.net/

Implementations for:
Windows 95/98/NT/2000   http://uranus.it.swin.edu.au/~jn/linux/Explore2fs.htm
Windows 95 (*)          http://www.yipton.demon.co.uk/content.html#FSDEXT2
DOS client (*)          ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/
OS/2                    http://perso.wanadoo.fr/matthieu.willm/ext2-os2/
RISC OS client          ftp://ftp.barnet.ac.uk/pub/acorn/armlinux/iscafs/
|
||||
|
||||
(*) no longer actively developed/supported (as of Apr 2001)
|
||||
198
Documentation/filesystems/ext3.txt
Normal file
@@ -0,0 +1,198 @@
|
||||
|
||||
Ext3 Filesystem
|
||||
===============
|
||||
|
||||
Ext3 was originally released in September 1999. It was written by Stephen Tweedie
for the 2.2 branch, and ported to 2.4 kernels by Peter Braam, Andreas Dilger,
|
||||
Andrew Morton, Alexander Viro, Ted Ts'o and Stephen Tweedie.
|
||||
|
||||
Ext3 is the ext2 filesystem enhanced with journalling capabilities.
|
||||
|
||||
Options
|
||||
=======
|
||||
|
||||
When mounting an ext3 filesystem, the following options are accepted:
|
||||
(*) == default
|
||||
|
||||
journal=update          Update the ext3 file system's journal to the current
                        format.

journal=inum            When a journal already exists, this option is ignored.
                        Otherwise, it specifies the number of the inode which
                        will represent the ext3 file system's journal file.

journal_dev=devnum      When the external journal device's major/minor numbers
                        have changed, this option allows the user to specify
                        the new journal location. The journal device is
                        identified through its new major/minor numbers encoded
                        in devnum.

noload                  Don't load the journal on mounting.

data=journal            All data are committed into the journal prior to being
                        written into the main file system.

data=ordered        (*) All data are forced directly out to the main file
                        system prior to its metadata being committed to the
                        journal.

data=writeback          Data ordering is not preserved, data may be written
                        into the main file system after its metadata has been
                        committed to the journal.

commit=nrsec        (*) Ext3 can be told to sync all its data and metadata
                        every 'nrsec' seconds. The default value is 5 seconds.
                        This means that if you lose your power, you will lose
                        as much as the latest 5 seconds of work (your
                        filesystem will not be damaged though, thanks to the
                        journaling). This default value (or any low value)
                        will hurt performance, but it's good for data-safety.
                        Setting it to 0 will have the same effect as leaving
                        it at the default (5 seconds).
                        Setting it to very large values will improve
                        performance.

barrier=1               This enables/disables barriers. barrier=0 disables
                        it, barrier=1 enables it.

orlov               (*) This enables the new Orlov block allocator. It is
                        enabled by default.

oldalloc                This disables the Orlov block allocator and enables
                        the old block allocator. Orlov should have better
                        performance - we'd like to get some feedback if it's
                        the contrary for you.

user_xattr              Enables Extended User Attributes. Additionally, you
                        need to have extended attribute support enabled in the
                        kernel configuration (CONFIG_EXT3_FS_XATTR). See the
                        attr(5) manual page and http://acl.bestbits.at/ to
                        learn more about extended attributes.

nouser_xattr            Disables Extended User Attributes.

acl                     Enables POSIX Access Control Lists support.
                        Additionally, you need to have ACL support enabled in
                        the kernel configuration (CONFIG_EXT3_FS_POSIX_ACL).
                        See the acl(5) manual page and http://acl.bestbits.at/
                        for more information.

noacl                   This option disables POSIX Access Control List
                        support.

reservation

noreservation

bsddf               (*) Make 'df' act like BSD.
minixdf                 Make 'df' act like Minix.

check=none              Don't do extra checking of bitmaps on mount.
nocheck

debug                   Extra debugging information is sent to syslog.

errors=remount-ro   (*) Remount the filesystem read-only on an error.
errors=continue         Keep going on a filesystem error.
|
||||
errors=panic Panic and halt the machine if an error occurs.
|
||||
|
||||
grpid Give objects the same group ID as their parent directory.
|
||||
bsdgroups
|
||||
|
||||
nogrpid (*) New objects have the group ID of their creator.
|
||||
sysvgroups
|
||||
|
||||
resgid=n The group ID which may use the reserved blocks.
|
||||
|
||||
resuid=n The user ID which may use the reserved blocks.
|
||||
|
||||
sb=n Use alternate superblock at this location.
|
||||
|
||||
quota
|
||||
noquota
|
||||
grpquota
|
||||
usrquota
|
||||
|
||||
bh (*) ext3 associates buffer heads to data pages to
|
||||
nobh (a) cache disk block mapping information
|
||||
(b) link pages into transaction to provide
|
||||
ordering guarantees.
|
||||
"bh" option forces use of buffer heads.
|
||||
"nobh" option tries to avoid associating buffer
|
||||
heads (supported only for "writeback" mode).
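
The journal_dev=devnum option above takes a single number that encodes both the major and the minor number of the journal device. The following is a minimal sketch of that encoding, assuming devnum uses the same layout as the kernel's new_encode_dev()/new_decode_dev() helpers (low 8 bits of the minor, 12 bits of major, then the remaining minor bits). The function names encode_devnum(), devnum_major() and devnum_minor() are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pack a major/minor pair into one number.
 * Assumption: journal_dev= expects the layout used by the kernel's
 * new_encode_dev() helper; check your kernel sources before relying on it.
 */
uint32_t encode_devnum(uint32_t major, uint32_t minor)
{
	assert(major < (1u << 12));	/* the layout has room for 12 major bits */
	assert(minor < (1u << 20));	/* and for 20 minor bits */
	return (minor & 0xffu) | (major << 8) | ((minor & ~0xffu) << 12);
}

/* Inverse operations, mirroring the layout used above. */
uint32_t devnum_major(uint32_t devnum)
{
	return (devnum & 0xfff00u) >> 8;
}

uint32_t devnum_minor(uint32_t devnum)
{
	return (devnum & 0xffu) | ((devnum >> 12) & 0xfff00u);
}
```

Under this assumption, a journal device with major 8 and minor 17 would be passed as journal_dev=2065 (0x811).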
|
||||
|
||||
|
||||
Specification
|
||||
=============
|
||||
Ext3 shares all disk implementation with the ext2 filesystem, and adds
|
||||
transaction capabilities to ext2. Journaling is done by the Journaling Block
|
||||
Device layer.
|
||||
|
||||
Journaling Block Device layer
|
||||
-----------------------------
|
||||
The Journaling Block Device layer (JBD) isn't ext3 specific. It was designed to
|
||||
add journaling capabilities on a block device. The ext3 filesystem code will
|
||||
inform the JBD of modifications it is performing (called a transaction). The
|
||||
journal supports the transactions start and stop, and in case of crash, the
|
||||
journal can replay the transactions to put the partition back in a
|
||||
consistent state fast.
|
||||
|
||||
Handles represent a single atomic update to a filesystem. JBD can handle an
|
||||
external journal on a block device.
|
||||
|
||||
Data Mode
|
||||
---------
|
||||
There are 3 different data modes:
|
||||
|
||||
* writeback mode
|
||||
In data=writeback mode, ext3 does not journal data at all. This mode provides
|
||||
a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
|
||||
mode - metadata journaling. A crash+recovery can cause incorrect data to
|
||||
appear in files which were written shortly before the crash. This mode will
|
||||
typically provide the best ext3 performance.
|
||||
|
||||
* ordered mode
|
||||
In data=ordered mode, ext3 only officially journals metadata, but it logically
|
||||
groups metadata and data blocks into a single unit called a transaction. When
|
||||
it's time to write the new metadata out to disk, the associated data blocks
|
||||
are written first. In general, this mode performs slightly slower than
|
||||
writeback but significantly faster than journal mode.
|
||||
|
||||
* journal mode
|
||||
data=journal mode provides full data and metadata journaling. All new data is
|
||||
written to the journal first, and then to its final location.
|
||||
In the event of a crash, the journal can be replayed, bringing both data and
|
||||
metadata into a consistent state. This mode is the slowest except when data
|
||||
needs to be read from and written to disk at the same time where it
|
||||
outperforms all other modes.
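
The three modes differ mainly in whether file data goes through the journal and whether it is safely written before the metadata that refers to it. The sketch below restates the descriptions above as two predicates. The enum and function names are hypothetical; they are not taken from the ext3 sources.

```c
#include <string.h>

enum data_mode { DATA_WRITEBACK, DATA_ORDERED, DATA_JOURNAL, DATA_INVALID };

/* Map the value of the data= mount option to a mode. */
enum data_mode parse_data_mode(const char *value)
{
	if (strcmp(value, "writeback") == 0)
		return DATA_WRITEBACK;
	if (strcmp(value, "ordered") == 0)
		return DATA_ORDERED;
	if (strcmp(value, "journal") == 0)
		return DATA_JOURNAL;
	return DATA_INVALID;
}

/* File data itself is written to the journal only in journal mode. */
int journals_data(enum data_mode mode)
{
	return mode == DATA_JOURNAL;
}

/*
 * Is the data written (to the main file system in ordered mode, to the
 * journal in journal mode) before its metadata is committed?
 * Writeback mode gives no such guarantee, which is why stale data can
 * show up in recently written files after a crash.
 */
int data_written_before_commit(enum data_mode mode)
{
	return mode == DATA_ORDERED || mode == DATA_JOURNAL;
}
```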
|
||||
|
||||
Compatibility
|
||||
-------------
|
||||
|
||||
Ext2 partitions can be easily converted to ext3, with `tune2fs -j <dev>`.
|
||||
Ext3 is fully compatible with Ext2. Ext3 partitions can easily be mounted as
|
||||
Ext2.
|
||||
|
||||
|
||||
External Tools
|
||||
==============
|
||||
See manual pages to learn more.
|
||||
|
||||
tune2fs: create an ext3 journal on an ext2 partition with the -j flag.
|
||||
mke2fs: create an ext3 partition with the -j flag.
|
||||
debugfs: ext2 and ext3 file system debugger.
|
||||
ext2online: online (mounted) ext2 and ext3 filesystem resizer
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
kernel source: <file:fs/ext3/>
|
||||
<file:fs/jbd/>
|
||||
|
||||
programs: http://e2fsprogs.sourceforge.net/
|
||||
http://ext2resize.sourceforge.net
|
||||
|
||||
useful links: http://www.zip.com.au/~akpm/linux/ext3/ext3-usage.html
|
||||
http://www-106.ibm.com/developerworks/linux/library/l-fs7/
|
||||
http://www-106.ibm.com/developerworks/linux/library/l-fs8/
|
||||
236
Documentation/filesystems/ext4.txt
Normal file
@@ -0,0 +1,236 @@
|
||||
|
||||
Ext4 Filesystem
|
||||
===============
|
||||
|
||||
This is a development version of the ext4 filesystem, an advanced level
|
||||
of the ext3 filesystem which incorporates scalability and reliability
|
||||
enhancements for supporting large filesystems (64 bit) in keeping with
|
||||
increasing disk capacities and state-of-the-art feature requirements.
|
||||
|
||||
Mailing list: linux-ext4@vger.kernel.org
|
||||
|
||||
|
||||
1. Quick usage instructions:
|
||||
===========================
|
||||
|
||||
- Grab updated e2fsprogs from
|
||||
ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim/
|
||||
This is a patchset on top of e2fsprogs-1.39, which can be found at
|
||||
ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/
|
||||
|
||||
- It's still mke2fs -j /dev/hda1
|
||||
|
||||
- mount /dev/hda1 /wherever -t ext4dev
|
||||
|
||||
- To enable extents,
|
||||
|
||||
mount /dev/hda1 /wherever -t ext4dev -o extents
|
||||
|
||||
- The filesystem is compatible with the ext3 driver until you add a file
|
||||
which has extents (ie: `mount -o extents', then create a file).
|
||||
|
||||
NOTE: The "extents" mount flag is temporary. It will soon go away and
|
||||
extents will be enabled by the "-o extents" flag to mke2fs or tune2fs
|
||||
|
||||
- When comparing performance with other filesystems, remember that
|
||||
ext3/4 by default offers higher data integrity guarantees than most. So
|
||||
when comparing with a metadata-only journalling filesystem, use `mount -o
|
||||
data=writeback'. And you might as well use `mount -o nobh' too along
|
||||
with it. Making the journal larger than the mke2fs default often helps
|
||||
performance with metadata-intensive workloads.
|
||||
|
||||
2. Features
|
||||
===========
|
||||
|
||||
2.1 Currently available
|
||||
|
||||
* ability to use filesystems > 16TB
|
||||
* extent format reduces metadata overhead (RAM, IO for access, transactions)
|
||||
* extent format more robust in face of on-disk corruption due to magics,
|
||||
* internal redundancy in tree
|
||||
|
||||
2.1 Previously available, soon to be enabled by default by "mkfs.ext4":
|
||||
|
||||
* dir_index and resize inode will be on by default
|
||||
* large inodes will be used by default for fast EAs, nsec timestamps, etc
|
||||
|
||||
2.2 Candidate features for future inclusion
|
||||
|
||||
There are several under discussion, whether they all make it in is
|
||||
partly a function of how much time everyone has to work on them:
|
||||
|
||||
* improved file allocation (multi-block alloc, delayed alloc; basically done)
|
||||
* fix 32000 subdirectory limit (patch exists, needs some e2fsck work)
|
||||
* nsec timestamps for mtime, atime, ctime, create time (patch exists,
|
||||
needs some e2fsck work)
|
||||
* inode version field on disk (NFSv4, Lustre; prototype exists)
|
||||
* reduced mke2fs/e2fsck time via uninitialized groups (prototype exists)
|
||||
* journal checksumming for robustness, performance (prototype exists)
|
||||
* persistent file preallocation (e.g for streaming media, databases)
|
||||
|
||||
Features like metadata checksumming have been discussed and planned for
|
||||
a bit but no patches exist yet so I'm not sure they're in the near-term
|
||||
roadmap.
|
||||
|
||||
The big performance win will come with mballoc and delalloc. CFS has
|
||||
been using mballoc for a few years already with Lustre, and IBM + Bull
|
||||
did a lot of benchmarking on it. The reason it isn't in the first set of
|
||||
patches is partly a manageability issue, and partly because it doesn't
|
||||
directly affect the on-disk format (outside of much better allocation)
|
||||
so it isn't critical to get into the first round of changes. I believe
|
||||
Alex is working on a new set of patches right now.
|
||||
|
||||
3. Options
|
||||
==========
|
||||
|
||||
When mounting an ext4 filesystem, the following options are accepted:
|
||||
(*) == default
|
||||
|
||||
extents ext4 will use extents to address file data. The
|
||||
file system will no longer be mountable by ext3.
|
||||
|
||||
journal=update Update the ext4 file system's journal to the current
|
||||
format.
|
||||
|
||||
journal=inum When a journal already exists, this option is ignored.
|
||||
Otherwise, it specifies the number of the inode which
|
||||
will represent the ext4 file system's journal file.
|
||||
|
||||
journal_dev=devnum When the external journal device's major/minor numbers
|
||||
have changed, this option allows the user to specify
|
||||
the new journal location. The journal device is
|
||||
identified through its new major/minor numbers encoded
|
||||
in devnum.
|
||||
|
||||
noload Don't load the journal on mounting.
|
||||
|
||||
data=journal All data are committed into the journal prior to being
|
||||
written into the main file system.
|
||||
|
||||
data=ordered (*) All data are forced directly out to the main file
|
||||
system prior to its metadata being committed to the
|
||||
journal.
|
||||
|
||||
data=writeback Data ordering is not preserved, data may be written
|
||||
into the main file system after its metadata has been
|
||||
committed to the journal.
|
||||
|
||||
commit=nrsec (*) Ext4 can be told to sync all its data and metadata
|
||||
every 'nrsec' seconds. The default value is 5 seconds.
|
||||
This means that if you lose your power, you will lose
|
||||
as much as the latest 5 seconds of work (your
|
||||
filesystem will not be damaged though, thanks to the
|
||||
journaling). This default value (or any low value)
|
||||
will hurt performance, but it's good for data-safety.
|
||||
Setting it to 0 will have the same effect as leaving
|
||||
it at the default (5 seconds).
|
||||
Setting it to very large values will improve
|
||||
performance.
|
||||
|
||||
barrier=1 This enables/disables barriers. barrier=0 disables
|
||||
it, barrier=1 enables it.
|
||||
|
||||
orlov (*) This enables the new Orlov block allocator. It is
|
||||
enabled by default.
|
||||
|
||||
oldalloc This disables the Orlov block allocator and enables
|
||||
the old block allocator. Orlov should have better
|
||||
performance - we'd like to get some feedback if it's
|
||||
the contrary for you.
|
||||
|
||||
user_xattr Enables Extended User Attributes. Additionally, you
|
||||
need to have extended attribute support enabled in the
|
||||
kernel configuration (CONFIG_EXT4_FS_XATTR). See the
|
||||
attr(5) manual page and http://acl.bestbits.at/ to
|
||||
learn more about extended attributes.
|
||||
|
||||
nouser_xattr Disables Extended User Attributes.
|
||||
|
||||
acl Enables POSIX Access Control Lists support.
|
||||
Additionally, you need to have ACL support enabled in
|
||||
the kernel configuration (CONFIG_EXT4_FS_POSIX_ACL).
|
||||
See the acl(5) manual page and http://acl.bestbits.at/
|
||||
for more information.
|
||||
|
||||
noacl This option disables POSIX Access Control List
|
||||
support.
|
||||
|
||||
reservation
|
||||
|
||||
noreservation
|
||||
|
||||
bsddf (*) Make 'df' act like BSD.
|
||||
minixdf Make 'df' act like Minix.
|
||||
|
||||
check=none Don't do extra checking of bitmaps on mount.
|
||||
nocheck
|
||||
|
||||
debug Extra debugging information is sent to syslog.
|
||||
|
||||
errors=remount-ro(*) Remount the filesystem read-only on an error.
|
||||
errors=continue Keep going on a filesystem error.
|
||||
errors=panic Panic and halt the machine if an error occurs.
|
||||
|
||||
grpid Give objects the same group ID as their parent directory.
|
||||
bsdgroups
|
||||
|
||||
nogrpid (*) New objects have the group ID of their creator.
|
||||
sysvgroups
|
||||
|
||||
resgid=n The group ID which may use the reserved blocks.
|
||||
|
||||
resuid=n The user ID which may use the reserved blocks.
|
||||
|
||||
sb=n Use alternate superblock at this location.
|
||||
|
||||
quota
|
||||
noquota
|
||||
grpquota
|
||||
usrquota
|
||||
|
||||
bh (*) ext4 associates buffer heads to data pages to
|
||||
nobh (a) cache disk block mapping information
|
||||
(b) link pages into transaction to provide
|
||||
ordering guarantees.
|
||||
"bh" option forces use of buffer heads.
|
||||
"nobh" option tries to avoid associating buffer
|
||||
heads (supported only for "writeback" mode).
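
Two of the options above carry rules that are easy to overlook: commit=0 behaves like the default interval, and "nobh" is only supported together with data=writeback. The following sketch encodes just those two rules as stated in this document. The function and macro names are hypothetical and do not correspond to kernel code.

```c
#include <string.h>

#define DEFAULT_COMMIT_SECS 5u	/* default commit interval stated above */

/* commit=0 is documented to have the same effect as the default. */
unsigned int effective_commit_secs(unsigned int nrsec)
{
	return nrsec == 0 ? DEFAULT_COMMIT_SECS : nrsec;
}

/* "nobh" is documented as supported only for data=writeback. */
int nobh_is_supported(const char *data_mode)
{
	return strcmp(data_mode, "writeback") == 0;
}
```

The value returned by effective_commit_secs() is also the rough upper bound on how many seconds of work can be lost on power failure, according to the description of commit=nrsec.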
|
||||
|
||||
|
||||
Data Mode
|
||||
---------
|
||||
There are 3 different data modes:
|
||||
|
||||
* writeback mode
|
||||
In data=writeback mode, ext4 does not journal data at all. This mode provides
|
||||
a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
|
||||
mode - metadata journaling. A crash+recovery can cause incorrect data to
|
||||
appear in files which were written shortly before the crash. This mode will
|
||||
typically provide the best ext4 performance.
|
||||
|
||||
* ordered mode
|
||||
In data=ordered mode, ext4 only officially journals metadata, but it logically
|
||||
groups metadata and data blocks into a single unit called a transaction. When
|
||||
it's time to write the new metadata out to disk, the associated data blocks
|
||||
are written first. In general, this mode performs slightly slower than
|
||||
writeback but significantly faster than journal mode.
|
||||
|
||||
* journal mode
|
||||
data=journal mode provides full data and metadata journaling. All new data is
|
||||
written to the journal first, and then to its final location.
|
||||
In the event of a crash, the journal can be replayed, bringing both data and
|
||||
metadata into a consistent state. This mode is the slowest except when data
|
||||
needs to be read from and written to disk at the same time where it
|
||||
outperforms all other modes.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
kernel source: <file:fs/ext4/>
|
||||
<file:fs/jbd2/>
|
||||
|
||||
programs: http://e2fsprogs.sourceforge.net/
|
||||
http://ext2resize.sourceforge.net
|
||||
|
||||
useful links: http://fedoraproject.org/wiki/ext3-devel
|
||||
http://www.bullopensource.org/ext4/
|
||||
123
Documentation/filesystems/files.txt
Normal file
@@ -0,0 +1,123 @@
|
||||
File management in the Linux kernel
|
||||
-----------------------------------
|
||||
|
||||
This document describes how locking for files (struct file)
|
||||
and file descriptor table (struct files) works.
|
||||
|
||||
Up until 2.6.12, the file descriptor table has been protected
|
||||
with a lock (files->file_lock) and reference count (files->count).
|
||||
->file_lock protected accesses to all the file related fields
|
||||
of the table. ->count was used for sharing the file descriptor
|
||||
table between tasks cloned with CLONE_FILES flag. Typically
|
||||
this would be the case for posix threads. As with the common
|
||||
refcounting model in the kernel, the last task doing
|
||||
a put_files_struct() frees the file descriptor (fd) table.
|
||||
The files (struct file) themselves are protected using
|
||||
reference count (->f_count).
|
||||
|
||||
In the new lock-free model of file descriptor management,
|
||||
the reference counting is similar, but the locking is
|
||||
based on RCU. The file descriptor table contains multiple
|
||||
elements - the fd sets (open_fds and close_on_exec), the
array of file pointers, the sizes of the sets and the array
etc. In order for the updates to appear atomic to
|
||||
a lock-free reader, all the elements of the file descriptor
|
||||
table are in a separate structure - struct fdtable.
|
||||
files_struct contains a pointer to struct fdtable through
|
||||
which the actual fd table is accessed. Initially the
|
||||
fdtable is embedded in files_struct itself. On a subsequent
|
||||
expansion of fdtable, a new fdtable structure is allocated
|
||||
and files->fdt points to the new structure. The fdtable
|
||||
structure is freed with RCU and lock-free readers either
|
||||
see the old fdtable or the new fdtable making the update
|
||||
appear atomic. Here are the locking rules for
|
||||
the fdtable structure -
|
||||
|
||||
1. All references to the fdtable must be done through
|
||||
the files_fdtable() macro :
|
||||
|
||||
	struct fdtable *fdt;

	rcu_read_lock();

	fdt = files_fdtable(files);
	....
	if (n <= fdt->max_fds)
		....
	...
	rcu_read_unlock();
|
||||
|
||||
files_fdtable() uses rcu_dereference() macro which takes care of
|
||||
the memory barrier requirements for lock-free dereference.
|
||||
The fdtable pointer must be read within the read-side
|
||||
critical section.
|
||||
|
||||
2. Reading of the fdtable as described above must be protected
|
||||
by rcu_read_lock()/rcu_read_unlock().
|
||||
|
||||
3. For any update to the fd table, files->file_lock must
|
||||
be held.
|
||||
|
||||
4. To look up the file structure given an fd, a reader
|
||||
must use either fcheck() or fcheck_files() APIs. These
|
||||
take care of barrier requirements due to lock-free lookup.
|
||||
An example :
|
||||
|
||||
	struct file *file;

	rcu_read_lock();
	file = fcheck(fd);
	if (file) {
		...
	}
	....
	rcu_read_unlock();
|
||||
|
||||
5. Handling of the file structures is special. Since the look-up
|
||||
of the fd (fget()/fget_light()) is lock-free, it is possible
|
||||
that look-up may race with the last put() operation on the
|
||||
file structure. This is avoided using the rcuref APIs
|
||||
on ->f_count :
|
||||
|
||||
	rcu_read_lock();
	file = fcheck_files(files, fd);
	if (file) {
		if (rcuref_inc_lf(&file->f_count))
			*fput_needed = 1;
		else
			/* Didn't get the reference, someone's freed */
			file = NULL;
	}
	rcu_read_unlock();
	....
	return file;
|
||||
|
||||
rcuref_inc_lf() detects if the refcount is already zero or
|
||||
goes to zero during increment. If it does, we fail
|
||||
fget()/fget_light().
|
||||
|
||||
6. Since both fdtable and file structures can be looked up
|
||||
lock-free, they must be installed using rcu_assign_pointer()
|
||||
API. If they are looked up lock-free, rcu_dereference()
|
||||
must be used. However it is advisable to use files_fdtable()
|
||||
and fcheck()/fcheck_files() which take care of these issues.
|
||||
|
||||
7. While updating, the fdtable pointer must be looked up while
|
||||
holding files->file_lock. If ->file_lock is dropped, then
|
||||
another thread may expand the files thereby creating a new
|
||||
fdtable and making the earlier fdtable pointer stale.
|
||||
For example :
|
||||
|
||||
	spin_lock(&files->file_lock);
	fd = locate_fd(files, file, start);
	if (fd >= 0) {
		/* locate_fd() may have expanded fdtable, load the ptr */
		fdt = files_fdtable(files);
		FD_SET(fd, fdt->open_fds);
		FD_CLR(fd, fdt->close_on_exec);
		spin_unlock(&files->file_lock);
	.....
|
||||
|
||||
Since locate_fd() can drop ->file_lock (and reacquire ->file_lock),
|
||||
the fdtable pointer (fdt) must be loaded after locate_fd().
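
Rules 5 and 6 rest on two small ideas: a reference may only be taken while the count is still non-zero, and a new table must be fully initialised before its pointer becomes visible to lock-free readers. The following is a userspace model of both ideas using C11 atomics. It is not kernel code: the struct and function names are hypothetical stand-ins, the acquire load is stronger than what rcu_dereference() needs, and the sketch does not model RCU grace periods or the freeing of the old table.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct fdtable and struct files_struct. */
struct fdtable_model {
	unsigned int max_fds;
};

struct files_model {
	_Atomic(struct fdtable_model *) fdt;
};

/*
 * Writer side, similar in spirit to rcu_assign_pointer(): the release
 * store makes the initialised fields of new_fdt visible before the
 * pointer itself can be observed.
 */
void publish_fdtable(struct files_model *files, struct fdtable_model *new_fdt)
{
	atomic_store_explicit(&files->fdt, new_fdt, memory_order_release);
}

/* Reader side, similar in spirit to rcu_dereference(). */
struct fdtable_model *read_fdtable(struct files_model *files)
{
	return atomic_load_explicit(&files->fdt, memory_order_acquire);
}

/*
 * Take a reference only if the count is still non-zero.
 * Returns 1 on success, 0 if the last reference is already gone.
 */
int ref_inc_not_zero(atomic_int *count)
{
	int old = atomic_load_explicit(count, memory_order_relaxed);

	while (old != 0) {
		if (atomic_compare_exchange_weak_explicit(count, &old, old + 1,
							   memory_order_acquire,
							   memory_order_relaxed))
			return 1;
		/* 'old' now holds the freshly observed value; retry. */
	}
	return 0;
}
```

A reader that gets 0 back from ref_inc_not_zero() must treat the lookup as failed, which corresponds to the "file = NULL" branch in rule 5.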
|
||||
|
||||
423
Documentation/filesystems/fuse.txt
Normal file
@@ -0,0 +1,423 @@
|
||||
Definitions
|
||||
~~~~~~~~~~~
|
||||
|
||||
Userspace filesystem:
|
||||
|
||||
A filesystem in which data and metadata are provided by an ordinary
|
||||
userspace process. The filesystem can be accessed normally through
|
||||
the kernel interface.
|
||||
|
||||
Filesystem daemon:
|
||||
|
||||
The process(es) providing the data and metadata of the filesystem.
|
||||
|
||||
Non-privileged mount (or user mount):
|
||||
|
||||
A userspace filesystem mounted by a non-privileged (non-root) user.
|
||||
The filesystem daemon is running with the privileges of the mounting
|
||||
user. NOTE: this is not the same as mounts allowed with the "user"
|
||||
option in /etc/fstab, which is not discussed here.
|
||||
|
||||
Filesystem connection:
|
||||
|
||||
A connection between the filesystem daemon and the kernel. The
|
||||
connection exists until either the daemon dies, or the filesystem is
|
||||
umounted. Note that detaching (or lazy umounting) the filesystem
|
||||
does _not_ break the connection, in this case it will exist until
|
||||
the last reference to the filesystem is released.
|
||||
|
||||
Mount owner:
|
||||
|
||||
The user who does the mounting.
|
||||
|
||||
User:
|
||||
|
||||
The user who is performing filesystem operations.
|
||||
|
||||
What is FUSE?
|
||||
~~~~~~~~~~~~~
|
||||
|
||||
FUSE is a userspace filesystem framework. It consists of a kernel
|
||||
module (fuse.ko), a userspace library (libfuse.*) and a mount utility
|
||||
(fusermount).
|
||||
|
||||
One of the most important features of FUSE is allowing secure,
|
||||
non-privileged mounts. This opens up new possibilities for the use of
|
||||
filesystems. A good example is sshfs: a secure network filesystem
|
||||
using the sftp protocol.
|
||||
|
||||
The userspace library and utilities are available from the FUSE
|
||||
homepage:
|
||||
|
||||
http://fuse.sourceforge.net/
|
||||
|
||||
Filesystem type
|
||||
~~~~~~~~~~~~~~~
|
||||
|
||||
The filesystem type given to mount(2) can be one of the following:
|
||||
|
||||
'fuse'
|
||||
|
||||
This is the usual way to mount a FUSE filesystem. The first
|
||||
argument of the mount system call may contain an arbitrary string,
|
||||
which is not interpreted by the kernel.
|
||||
|
||||
'fuseblk'
|
||||
|
||||
The filesystem is block device based. The first argument of the
|
||||
mount system call is interpreted as the name of the device.
|
||||
|
||||
Mount options
|
||||
~~~~~~~~~~~~~
|
||||
|
||||
'fd=N'
|
||||
|
||||
The file descriptor to use for communication between the userspace
|
||||
filesystem and the kernel. The file descriptor must have been
|
||||
obtained by opening the FUSE device ('/dev/fuse').
|
||||
|
||||
'rootmode=M'
|
||||
|
||||
The file mode of the filesystem's root in octal representation.
|
||||
|
||||
'user_id=N'
|
||||
|
||||
The numeric user id of the mount owner.
|
||||
|
||||
'group_id=N'
|
||||
|
||||
The numeric group id of the mount owner.
|
||||
|
||||
'default_permissions'
|
||||
|
||||
By default FUSE doesn't check file access permissions, the
|
||||
filesystem is free to implement its access policy or leave it to
|
||||
the underlying file access mechanism (e.g. in case of network
|
||||
filesystems). This option enables permission checking, restricting
|
||||
access based on file mode. It is usually useful together with the
|
||||
'allow_other' mount option.
|
||||
|
||||
'allow_other'
|
||||
|
||||
This option overrides the security measure restricting file access
|
||||
to the user mounting the filesystem. This option is by default only
|
||||
allowed to root, but this restriction can be removed with a
|
||||
(userspace) configuration option.
|
||||
|
||||
'max_read=N'
|
||||
|
||||
With this option the maximum size of read operations can be set.
|
||||
The default is infinite. Note that the size of read requests is
|
||||
limited anyway to 32 pages (which is 128kbyte on i386).
|
||||
|
||||
'blksize=N'
|
||||
|
||||
Set the block size for the filesystem. The default is 512. This
|
||||
option is only valid for 'fuseblk' type mounts.
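
The mandatory options above are passed to mount(2) as one comma-separated string. The following sketch shows how a mount helper might assemble that string. The function name fuse_build_opts() is hypothetical, and this is not the code used by fusermount; only the option names and the octal form of rootmode come from the descriptions above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format "fd=N,rootmode=M,user_id=N,group_id=N" into buf.
 * rootmode is printed in octal, as the option requires.
 * Returns the string length, or -1 if buf is too small.
 */
int fuse_build_opts(char *buf, size_t len, int fd, unsigned int rootmode,
		    unsigned int uid, unsigned int gid)
{
	int n;

	assert(buf != NULL);
	n = snprintf(buf, len, "fd=%d,rootmode=%o,user_id=%u,group_id=%u",
		     fd, rootmode, uid, gid);
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

For a directory root (mode 040000), descriptor 3 and user and group ID 1000, the result is "fd=3,rootmode=40000,user_id=1000,group_id=1000".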
|
||||
|
||||
Control filesystem
|
||||
~~~~~~~~~~~~~~~~~~
|
||||
|
||||
There's a control filesystem for FUSE, which can be mounted by:
|
||||
|
||||
mount -t fusectl none /sys/fs/fuse/connections
|
||||
|
||||
Mounting it under the '/sys/fs/fuse/connections' directory makes it
|
||||
backwards compatible with earlier versions.
|
||||
|
||||
Under the fuse control filesystem each connection has a directory
|
||||
named by a unique number.
|
||||
|
||||
For each connection the following files exist within this directory:
|
||||
|
||||
'waiting'
|
||||
|
||||
The number of requests which are waiting to be transferred to
|
||||
userspace or being processed by the filesystem daemon. If there is
|
||||
no filesystem activity and 'waiting' is non-zero, then the
|
||||
filesystem is hung or deadlocked.
|
||||
|
||||
'abort'
|
||||
|
||||
Writing anything into this file will abort the filesystem
|
||||
connection. This means that all waiting requests will be aborted and an
|
||||
error returned for all aborted and new requests.
|
||||
|
||||
Only the owner of the mount may read or write these files.
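
Aborting a connection therefore comes down to writing to one file. The following is a minimal sketch, assuming the control filesystem is mounted at '/sys/fs/fuse/connections' as shown above. The function names are hypothetical, and the byte written is arbitrary since any write triggers the abort.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Build the path of a per-connection control file such as "abort" or
 * "waiting". Returns the path length, or -1 if buf is too small.
 */
int fuse_ctl_path(char *buf, size_t len, unsigned int conn, const char *file)
{
	int n = snprintf(buf, len, "/sys/fs/fuse/connections/%u/%s", conn, file);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Write one byte to the given abort file. Returns 0 on success, -1 on error. */
int fuse_abort_connection(const char *abort_path)
{
	ssize_t written;
	int rc;
	int fd = open(abort_path, O_WRONLY);

	if (fd < 0)
		return -1;
	written = write(fd, "1", 1);
	rc = close(fd);
	return (written == 1 && rc == 0) ? 0 : -1;
}
```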
|
||||
|
||||
Interrupting filesystem operations
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
If a process issuing a FUSE filesystem request is interrupted, the
|
||||
following will happen:
|
||||
|
||||
1) If the request is not yet sent to userspace AND the signal is
|
||||
fatal (SIGKILL or unhandled fatal signal), then the request is
|
||||
dequeued and returns immediately.
|
||||
|
||||
2) If the request is not yet sent to userspace AND the signal is not
|
||||
fatal, then an 'interrupted' flag is set for the request. When
|
||||
the request has been successfully transferred to userspace and
|
||||
this flag is set, an INTERRUPT request is queued.
|
||||
|
||||
3) If the request is already sent to userspace, then an INTERRUPT
|
||||
request is queued.
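
The three cases above can be summarised as a decision on two inputs: whether the request has already been sent to userspace, and whether the signal is fatal. The sketch below is only a restatement of that list. The enum and function names are hypothetical and do not appear in the FUSE kernel module.

```c
enum interrupt_action {
	DEQUEUE_AND_RETURN,	/* case 1: not yet sent, fatal signal */
	SET_INTERRUPTED_FLAG,	/* case 2: not yet sent, non-fatal signal */
	QUEUE_INTERRUPT		/* case 3: already sent to userspace */
};

/* Decide what happens to a request whose issuing process was interrupted. */
enum interrupt_action on_interrupt(int sent_to_userspace, int signal_is_fatal)
{
	if (sent_to_userspace)
		return QUEUE_INTERRUPT;
	return signal_is_fatal ? DEQUEUE_AND_RETURN : SET_INTERRUPTED_FLAG;
}
```

In case 2 the INTERRUPT request is queued later, once the original request has been transferred to userspace with the flag still set.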
|
||||
|
||||
INTERRUPT requests take precedence over other requests, so the
|
||||
userspace filesystem will receive queued INTERRUPTs before any others.
|
||||
|
||||
The userspace filesystem may ignore the INTERRUPT requests entirely,
|
||||
or may honor them by sending a reply to the _original_ request, with
|
||||
the error set to EINTR.
|
||||
|
||||
It is also possible that there's a race between processing the
|
||||
original request and its INTERRUPT request. There are two possibilities:
|
||||
|
||||
1) The INTERRUPT request is processed before the original request is
|
||||
processed
|
||||
|
||||
2) The INTERRUPT request is processed after the original request has
|
||||
been answered
|
||||
|
||||
If the filesystem cannot find the original request, it should wait for
|
||||
some timeout and/or a number of new requests to arrive, after which it
|
||||
should reply to the INTERRUPT request with an EAGAIN error. In case
|
||||
1) the INTERRUPT request will be requeued. In case 2) the INTERRUPT
|
||||
reply will be ignored.
|
||||
|
||||
Aborting a filesystem connection
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
It is possible to get into certain situations where the filesystem is
|
||||
not responding. Reasons for this may be:
|
||||
|
||||
a) Broken userspace filesystem implementation
|
||||
|
||||
b) Network connection down
|
||||
|
||||
c) Accidental deadlock
|
||||
|
||||
d) Malicious deadlock
|
||||
|
||||
(For more on c) and d) see later sections)
|
||||
|
||||
In any of these cases it may be useful to abort the connection to
|
||||
the filesystem. There are several ways to do this:
|
||||
|
||||
- Kill the filesystem daemon. Works in case of a) and b)
|
||||
|
||||
- Kill the filesystem daemon and all users of the filesystem. Works
|
||||
in all cases except some malicious deadlocks
|
||||
|
||||
- Use forced umount (umount -f). Works in all cases but only if
|
||||
filesystem is still attached (it hasn't been lazy unmounted)
|
||||
|
||||
- Abort filesystem through the FUSE control filesystem. Most
|
||||
powerful method, always works.
|
||||
|
||||
How do non-privileged mounts work?
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Since the mount() system call is a privileged operation, a helper
|
||||
program (fusermount) is needed, which is installed setuid root.
|
||||
|
||||
The implication of providing non-privileged mounts is that the mount
|
||||
owner must not be able to use this capability to compromise the
|
||||
system. Obvious requirements arising from this are:
|
||||
|
||||
A) mount owner should not be able to get elevated privileges with the
|
||||
help of the mounted filesystem
|
||||
|
||||
B) mount owner should not get illegitimate access to information from
|
||||
other users' and the super user's processes
|
||||
|
||||
C) mount owner should not be able to induce undesired behavior in
|
||||
other users' or the super user's processes
|
||||
|
||||
How are requirements fulfilled?
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
A) The mount owner could gain elevated privileges by either:
|
||||
|
||||
1) creating a filesystem containing a device file, then opening
|
||||
this device
|
||||
|
||||
2) creating a filesystem containing a suid or sgid application,
|
||||
then executing this application
|
||||
|
||||
The solution is not to allow opening device files and ignore
|
||||
setuid and setgid bits when executing programs. To ensure this
|
||||
fusermount always adds "nosuid" and "nodev" to the mount options
|
||||
for non-privileged mounts.
|
||||
|
||||
B) If another user is accessing files or directories in the
|
||||
filesystem, the filesystem daemon serving requests can record the
|
||||
exact sequence and timing of operations performed. This
|
||||
information is otherwise inaccessible to the mount owner, so this
|
||||
counts as an information leak.
|
||||
|
||||
The solution to this problem will be presented in point 2) of C).
|
||||
|
||||
C) There are several ways in which the mount owner can induce
|
||||
undesired behavior in other users' processes, such as:
|
||||
|
||||
1) mounting a filesystem over a file or directory which the mount
|
||||
owner would otherwise not be able to modify (or could only
|
||||
make limited modifications).
|
||||
|
||||
This is solved in fusermount, by checking the access
|
||||
permissions on the mountpoint and only allowing the mount if
|
||||
the mount owner can do unlimited modification (has write
|
||||
access to the mountpoint, and mountpoint is not a "sticky"
|
||||
directory)
|
||||
|
||||
2) Even if 1) is solved the mount owner can change the behavior
|
||||
of other users' processes.
|
||||
|
||||
i) It can slow down or indefinitely delay the execution of a
filesystem operation, creating a DoS against the user or the
whole system. For example, a suid application that locks a
system file and then accesses a file on the mount owner's
filesystem could be stopped, causing the system file to be
locked forever.
|
||||
|
||||
ii) It can present files or directories of unlimited length, or
|
||||
directory structures of unlimited depth, possibly causing a
|
||||
system process to eat up diskspace, memory or other
|
||||
resources, again causing DoS.
|
||||
|
||||
The solution to this, as well as to B), is not to allow processes
to access the filesystem if they could not otherwise be
monitored or manipulated by the mount owner. If the mount
owner can ptrace a process, it can do all of the above
without using a FUSE mount, so the same criteria as used in
ptrace can be used to check if a process is allowed to access
the filesystem or not.
|
||||
|
||||
Note that the ptrace check is not strictly necessary to
prevent C/2/i; it is enough to check if the mount owner has enough
privilege to send a signal to the process accessing the
filesystem, since SIGSTOP can be used to get a similar effect.
|
||||
|
||||
I think these limitations are unacceptable?
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
If a sysadmin trusts the users enough, or can ensure through other
|
||||
measures, that system processes will never enter non-privileged
|
||||
mounts, it can relax the last limitation with a "user_allow_other"
|
||||
config option. If this config option is set, the mounting user can
|
||||
add the "allow_other" mount option which disables the check for other
|
||||
users' processes.
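
The access rule, including the "allow_other" escape hatch, can be modelled
in a few lines. The sketch below is a simplification and not the kernel's
code: both structures and the function name are hypothetical, and comparing
all three user and group IDs is an assumption standing in for "the same
criteria as used in ptrace".

```c
#include <sys/types.h>

/* Hypothetical, simplified view of a process's credentials. */
struct task_ids {
	uid_t uid, euid, suid;
	gid_t gid, egid, sgid;
};

/* Hypothetical, simplified view of a FUSE mount. */
struct mount_info {
	uid_t owner_uid;
	gid_t owner_gid;
	int allow_other;	/* set by the "allow_other" mount option */
};

/* Returns 1 if the task may access the mount, 0 otherwise. */
int task_may_access(const struct mount_info *m, const struct task_ids *t)
{
	if (m->allow_other)
		return 1;

	/*
	 * Assumption: the owner can monitor or manipulate a process only
	 * if every one of its user and group IDs matches the owner's.
	 */
	return t->uid == m->owner_uid && t->euid == m->owner_uid &&
	       t->suid == m->owner_uid && t->gid == m->owner_gid &&
	       t->egid == m->owner_gid && t->sgid == m->owner_gid;
}
```

Under this model a suid root program started by the mount owner is refused
access unless the mount was made with "allow_other".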
|
||||
|
||||
Kernel - userspace interface
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The following diagram shows how a filesystem operation (in this
|
||||
example unlink) is performed in FUSE.
|
||||
|
||||
NOTE: everything in this description is greatly simplified
|
||||
|
||||
 |  "rm /mnt/fuse/file"               |  FUSE filesystem daemon
 |                                    |
 |                                    |  >sys_read()
 |                                    |    >fuse_dev_read()
 |                                    |      >request_wait()
 |                                    |        [sleep on fc->waitq]
 |                                    |
 |  >sys_unlink()                     |
 |    >fuse_unlink()                  |
 |      [get request from             |
 |       fc->unused_list]             |
 |      >request_send()               |
 |        [queue req on fc->pending]  |
 |        [wake up fc->waitq]         |        [woken up]
 |        >request_wait_answer()      |
 |          [sleep on req->waitq]     |
 |                                    |      <request_wait()
 |                                    |      [remove req from fc->pending]
 |                                    |      [copy req to read buffer]
 |                                    |      [add req to fc->processing]
 |                                    |    <fuse_dev_read()
 |                                    |  <sys_read()
 |                                    |
 |                                    |  [perform unlink]
 |                                    |
 |                                    |  >sys_write()
 |                                    |    >fuse_dev_write()
 |                                    |      [look up req in fc->processing]
 |                                    |      [remove from fc->processing]
 |                                    |      [copy write buffer to req]
 |          [woken up]                |      [wake up req->waitq]
 |                                    |    <fuse_dev_write()
 |                                    |  <sys_write()
 |        <request_wait_answer()      |
 |      <request_send()               |
 |      [add request to               |
 |       fc->unused_list]             |
 |    <fuse_unlink()                  |
 |  <sys_unlink()                     |
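
The life cycle of a single request in the diagram can be summarised as a
small state machine. The following is a toy userspace model written for
illustration; it is not kernel code, the toy_* names are invented, and the
real implementation moves requests between lists under locks rather than
flipping a state field.

```c
/* Toy model of the request states in the diagram; not kernel code. */
enum req_state {
	REQ_UNUSED,	/* on fc->unused_list */
	REQ_PENDING,	/* on fc->pending, waiting for the daemon's read() */
	REQ_PROCESSING,	/* on fc->processing, the daemon is working on it */
	REQ_ANSWERED	/* reply copied in, the caller has been woken up */
};

struct toy_req {
	enum req_state state;
	int opcode;	/* what the caller asked for (non-negative) */
	int error;	/* the daemon's reply */
};

/* Caller side, like request_send(): queue the request. */
int toy_request_send(struct toy_req *req, int opcode)
{
	if (req->state != REQ_UNUSED)
		return -1;
	req->opcode = opcode;
	req->state = REQ_PENDING;
	return 0;
}

/* Daemon side, like fuse_dev_read(): hand the request to userspace. */
int toy_dev_read(struct toy_req *req)
{
	if (req->state != REQ_PENDING)
		return -1;
	req->state = REQ_PROCESSING;
	return req->opcode;
}

/* Daemon side, like fuse_dev_write(): store the reply. */
int toy_dev_write(struct toy_req *req, int error)
{
	if (req->state != REQ_PROCESSING)
		return -1;
	req->error = error;
	req->state = REQ_ANSWERED;
	return 0;
}

/* Caller side: collect the answer and put the request back as unused. */
int toy_request_end(struct toy_req *req, int *error)
{
	if (req->state != REQ_ANSWERED)
		return -1;
	*error = req->error;
	req->state = REQ_UNUSED;
	return 0;
}
```

Each function refuses to run out of order, which mirrors the fact that a
reply can only be written for a request that the daemon has already read.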
|
||||
|
||||
There are a couple of ways in which to deadlock a FUSE filesystem.
|
||||
Since we are talking about unprivileged userspace programs,
|
||||
something must be done about these.
|
||||
|
||||
Scenario 1 - Simple deadlock
|
||||
-----------------------------
|
||||
|
||||
 |  "rm /mnt/fuse/file"               |  FUSE filesystem daemon
 |                                    |
 |  >sys_unlink("/mnt/fuse/file")     |
 |    [acquire inode semaphore        |
 |     for "file"]                    |
 |    >fuse_unlink()                  |
 |      [sleep on req->waitq]         |
 |                                    |  <sys_read()
 |                                    |  >sys_unlink("/mnt/fuse/file")
 |                                    |    [acquire inode semaphore
 |                                    |     for "file"]
 |                                    |    *DEADLOCK*
|
||||
|
||||
The solution for this is to allow the filesystem to be aborted.
|
||||
|
||||
Scenario 2 - Tricky deadlock
|
||||
----------------------------
|
||||
|
||||
This one needs a carefully crafted filesystem. It's a variation on
|
||||
the above, only the call back to the filesystem is not explicit,
|
||||
but is caused by a pagefault.
|
||||
|
||||
 |  Kamikaze filesystem thread 1      |  Kamikaze filesystem thread 2
 |                                    |
 |  [fd = open("/mnt/fuse/file")]     |  [request served normally]
 |  [mmap fd to 'addr']               |
 |  [close fd]                        |  [FLUSH triggers 'magic' flag]
 |  [read a byte from addr]           |
 |    >do_page_fault()                |
 |      [find or create page]         |
 |      [lock page]                   |
 |      >fuse_readpage()              |
 |         [queue READ request]       |
 |         [sleep on req->waitq]      |
 |                                    |  [read request to buffer]
 |                                    |  [create reply header before addr]
 |                                    |  >sys_write(addr - headerlength)
 |                                    |    >fuse_dev_write()
 |                                    |      [look up req in fc->processing]
 |                                    |      [remove from fc->processing]
 |                                    |      [copy write buffer to req]
 |                                    |        >do_page_fault()
 |                                    |           [find or create page]
 |                                    |           [lock page]
 |                                    |           * DEADLOCK *
|
||||
|
||||
Solution is basically the same as above.
|
||||
|
||||
An additional problem is that while the write buffer is being copied
|
||||
to the request, the request must not be interrupted/aborted. This is
|
||||
because the destination address of the copy may not be valid after the
|
||||
request has returned.
|
||||
|
||||
This is solved by doing the copy atomically, and allowing abort
while the page(s) belonging to the write buffer are being faulted in
with get_user_pages(). The 'req->locked' flag indicates when the copy is
taking place, and abort is delayed until this flag is unset.
|
||||
43
Documentation/filesystems/gfs2.txt
Normal file
@@ -0,0 +1,43 @@
|
||||
Global File System
|
||||
------------------
|
||||
|
||||
http://sources.redhat.com/cluster/
|
||||
|
||||
GFS is a cluster file system. It allows a cluster of computers to
|
||||
simultaneously use a block device that is shared between them (with FC,
|
||||
iSCSI, NBD, etc). GFS reads and writes to the block device like a local
|
||||
file system, but also uses a lock module to allow the computers to coordinate
|
||||
their I/O so file system consistency is maintained. One of the nifty
|
||||
features of GFS is perfect consistency -- changes made to the file system
|
||||
on one machine show up immediately on all other machines in the cluster.
|
||||
|
||||
GFS uses interchangeable inter-node locking mechanisms. Different lock
|
||||
modules can plug into GFS and each file system selects the appropriate
|
||||
lock module at mount time. Lock modules include:
|
||||
|
||||
lock_nolock -- allows gfs to be used as a local file system
|
||||
|
||||
lock_dlm -- uses a distributed lock manager (dlm) for inter-node locking
|
||||
The dlm is found at linux/fs/dlm/
|
||||
|
||||
In addition to interfacing with an external locking manager, a gfs lock
|
||||
module is responsible for interacting with external cluster management
|
||||
systems. Lock_dlm depends on user space cluster management systems found
|
||||
at the URL above.
|
||||
|
||||
To use gfs as a local file system, no external clustering systems are
|
||||
needed, simply:
|
||||
|
||||
$ mkfs -t gfs2 -p lock_nolock -j 1 /dev/block_device
|
||||
$ mount -t gfs2 /dev/block_device /dir
|
||||
|
||||
GFS2 is not on-disk compatible with previous versions of GFS.
|
||||
|
||||
The following man pages can be found at the URL above:
|
||||
gfs2_fsck to repair a filesystem
|
||||
gfs2_grow to expand a filesystem online
|
||||
gfs2_jadd to add journals to a filesystem online
|
||||
gfs2_tool to manipulate, examine and tune a filesystem
|
||||
gfs2_quota to examine and change quota values in a filesystem
|
||||
mount.gfs2 to help mount(8) mount a filesystem
|
||||
mkfs.gfs2 to make a filesystem
|
||||
83
Documentation/filesystems/hfs.txt
Normal file
@@ -0,0 +1,83 @@
|
||||
|
||||
Macintosh HFS Filesystem for Linux
|
||||
==================================
|
||||
|
||||
HFS stands for ``Hierarchical File System'' and is the filesystem used
|
||||
by the Mac Plus and all later Macintosh models. Earlier Macintosh
|
||||
models used MFS (``Macintosh File System''), which is not supported.
|
||||
MacOS 8.1 and newer support a filesystem called HFS+ that's similar to
|
||||
HFS but is extended in various areas. Use the hfsplus filesystem driver
|
||||
to access such filesystems from Linux.
|
||||
|
||||
|
||||
Mount options
|
||||
=============
|
||||
|
||||
When mounting an HFS filesystem, the following options are accepted:
|
||||
|
||||
creator=cccc, type=cccc
|
||||
Specifies the creator/type values as shown by the MacOS finder
|
||||
used for creating new files. Default values: '????'.
|
||||
|
||||
uid=n, gid=n
|
||||
Specifies the user/group that owns all files on the filesystem.
|
||||
Default: user/group id of the mounting process.
|
||||
|
||||
dir_umask=n, file_umask=n, umask=n
|
||||
Specifies the umask used for all files, all directories or all
|
||||
files and directories. Defaults to the umask of the mounting process.
|
||||
|
||||
session=n
|
||||
Select the CDROM session to mount as an HFS filesystem. Defaults to
leaving that decision to the CDROM driver. This option will fail
with anything but a CDROM as the underlying device.
|
||||
|
||||
part=n
|
||||
Select partition number n from the device. This only makes
sense for CDROMs because they can't be partitioned under Linux.
For disk devices the generic partition parsing code does this
for us. Defaults to not parsing the partition table at all.
|
||||
|
||||
quiet
|
||||
Ignore invalid mount options instead of complaining.
|
||||
|
||||
|
||||
Writing to HFS Filesystems
|
||||
==========================
|
||||
|
||||
HFS is not a UNIX filesystem, thus it does not have the usual features you'd
|
||||
expect:
|
||||
|
||||
o You can't modify the set-uid, set-gid, sticky or executable bits or the uid
|
||||
and gid of files.
|
||||
o You can't create hard- or symlinks, device files, sockets or FIFOs.
|
||||
|
||||
HFS does, on the other hand, have the concept of multiple forks per file. These
non-standard forks are represented as hidden additional files in the normal
filesystem namespace, which is kind of a kludge and makes the semantics for
them a little strange:
|
||||
|
||||
o You can't create, delete or rename resource forks of files or the
|
||||
Finder's metadata.
|
||||
o They are however created (with default values), deleted and renamed
|
||||
along with the corresponding data fork or directory.
|
||||
o Copying files to a different filesystem will lose those attributes
|
||||
that are essential for MacOS to work.
|
||||
|
||||
|
||||
Creating HFS filesystems
|
||||
========================
|
||||
|
||||
The hfsutils package from Robert Leslie contains a program called
|
||||
hformat that can be used to create an HFS filesystem. See
|
||||
<http://www.mars.org/home/rob/proj/hfs/> for details.
|
||||
|
||||
|
||||
Credits
|
||||
=======
|
||||
|
||||
The HFS driver was written by Paul H. Hargrove (hargrove@sccm.Stanford.EDU)
|
||||
and is now maintained by Roman Zippel (roman@ardistech.com) at Ardis
|
||||
Technologies.
|
||||
Roman rewrote large parts of the code and brought in btree routines derived
|
||||
from Brad Boyer's hfsplus driver (also maintained by Roman now).
|
||||
296
Documentation/filesystems/hpfs.txt
Normal file
@@ -0,0 +1,296 @@
|
||||
Read/Write HPFS 2.09
|
||||
1998-2004, Mikulas Patocka
|
||||
|
||||
email: mikulas@artax.karlin.mff.cuni.cz
|
||||
homepage: http://artax.karlin.mff.cuni.cz/~mikulas/vyplody/hpfs/index-e.cgi
|
||||
|
||||
CREDITS:
|
||||
Chris Smith, 1993, original read-only HPFS, some code and hpfs structures file
|
||||
is taken from it
|
||||
Jacques Gelinas, MSDos mmap, Inspired by fs/nfs/mmap.c (Jon Tombs 15 Aug 1993)
|
||||
Werner Almesberger, 1992, 1993, MSDos option parser & CR/LF conversion
|
||||
|
||||
Mount options
|
||||
|
||||
uid=xxx,gid=xxx,umask=xxx (default uid=gid=0 umask=default_system_umask)
|
||||
Set owner/group/mode for files that do not have it specified in extended
|
||||
attributes. Mode is inverted umask - for example umask 027 gives owner
|
||||
all permission, group read permission and anybody else no access. Note
|
||||
that for files mode is anded with 0666. If you want files to have 'x'
|
||||
rights, you must use extended attributes.
|
||||
case=lower,asis (default asis)
|
||||
File name lowercasing in readdir.
|
||||
conv=binary,text,auto (default binary)
|
||||
CR/LF -> LF conversion. If auto, the decision is made according to the
extension - there is a list of text extensions (I think it's better to not
convert a text file than to damage a binary file). If you want to change that
list, change it in the source. The original read-only HPFS contained some
strange heuristic algorithm that I removed. I think it's dangerous to let the
computer decide whether a file is text or binary. For example, DJGPP
binaries contain a small text message at the beginning and they could be
misidentified and damaged under some circumstances.
|
||||
check=none,normal,strict (default normal)
|
||||
Check level. Selecting none will cause only little speedup and big
|
||||
danger. I tried to write it so that it won't crash if check=normal on
|
||||
corrupted filesystems. check=strict means many superfluous checks -
|
||||
used for debugging (for example it checks if file is allocated in
|
||||
bitmaps when accessing it).
|
||||
errors=continue,remount-ro,panic (default remount-ro)
|
||||
Behaviour when filesystem errors are found.
|
||||
chkdsk=no,errors,always (default errors)
|
||||
When to mark filesystem dirty so that OS/2 checks it.
|
||||
eas=no,ro,rw (default rw)
|
||||
What to do with extended attributes. 'no' - ignore them and use always
|
||||
values specified in uid/gid/mode options. 'ro' - read extended
|
||||
attributes but do not create them. 'rw' - create extended attributes
|
||||
when you use chmod/chown/chgrp/mknod/ln -s on the filesystem.
|
||||
timeshift=(-)nnn (default 0)
|
||||
Shifts the time by nnn seconds. For example, if the time shown under Linux
is one hour ahead of the time shown under OS/2, use timeshift=-3600.
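
The arithmetic behind the umask option above can be written out as a short
function. This is a sketch of the rule as described in this document, not
the driver's code; the function name hpfs_default_mode() is made up for the
example.

```c
/*
 * Default permission bits for an object that has no "MODE" extended
 * attribute, following the description of the umask mount option.
 */
unsigned int hpfs_default_mode(unsigned int umask_opt, int is_dir)
{
	unsigned int mode = ~umask_opt & 0777;	/* mode is the inverted umask */

	if (!is_dir)
		mode &= 0666;	/* files never get 'x' rights this way */

	return mode;
}
```

With umask=027 this gives 0750 for directories and 0640 for files, which
matches "owner all permission, group read permission and anybody else no
access" once the 'x' bits are masked off for files.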
|
||||
|
||||
|
||||
File names
|
||||
|
||||
As in OS/2, filenames are case insensitive. However, shell thinks that names
|
||||
are case sensitive, so for example when you create a file FOO, you can use
|
||||
'cat FOO', 'cat Foo', 'cat foo' or 'cat F*' but not 'cat f*'. Note, that you
|
||||
also won't be able to compile linux kernel (and maybe other things) on HPFS
|
||||
because kernel creates different files with names like bootsect.S and
|
||||
bootsect.s. When searching for a file whose name has characters >= 128, codepages
|
||||
are used - see below.
|
||||
OS/2 ignores dots and spaces at the end of file name, so this driver does as
|
||||
well. If you create 'a. ...', the file 'a' will be created, but you can still
|
||||
access it under names 'a.', 'a..', 'a . . . ' etc.
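
The rule about trailing dots and spaces can be illustrated with a small
helper. It is a sketch of the behaviour described above and not the
driver's implementation; the name hpfs_canonical_name() is hypothetical,
and special entries such as "." and ".." are deliberately not handled.

```c
#include <string.h>

/*
 * Hypothetical helper: copy `name` into `out` without the trailing dots
 * and spaces that OS/2 (and therefore this driver) ignores. Returns the
 * length of the result, or -1 if `out` is too small.
 */
int hpfs_canonical_name(const char *name, char *out, size_t out_size)
{
	size_t len = strlen(name);

	while (len > 0 && (name[len - 1] == '.' || name[len - 1] == ' '))
		len--;

	if (len + 1 > out_size)
		return -1;

	memcpy(out, name, len);
	out[len] = '\0';
	return (int)len;
}
```

Both 'a. ...' and 'a . . . ' reduce to 'a', which is why all of these names
refer to the same file.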
|
||||
|
||||
|
||||
Extended attributes
|
||||
|
||||
On HPFS partitions, OS/2 can associate to each file a special information called
|
||||
extended attributes. Extended attributes are pairs of (key,value) where key is
|
||||
an ascii string identifying that attribute and value is any string of bytes of
|
||||
variable length. OS/2 stores window and icon positions and file types there. So
|
||||
why not use it for unix-specific info like file owner or access rights? This
|
||||
driver can do it. If you chown/chgrp/chmod on a hpfs partition, extended
|
||||
attributes with keys "UID", "GID" or "MODE" and 2-byte values are created. Only
those extended attributes whose values differ from the defaults specified in the
mount options are created. Once created, the extended attributes are never deleted,
|
||||
they're just changed. It means that when your default uid=0 and you type
|
||||
something like 'chown luser file; chown root file' the file will contain
|
||||
extended attribute UID=0. And when you umount the fs and mount it again with
|
||||
uid=luser_uid, the file will be still owned by root! If you chmod file to 444,
|
||||
extended attribute "MODE" will not be set, this special case is done by setting
|
||||
read-only flag. When you mknod a block or char device, besides "MODE", the
|
||||
special 4-byte extended attribute "DEV" will be created containing the device
|
||||
number. Currently this driver cannot resize extended attributes - it means
|
||||
that if somebody (I don't know who?) has set "UID", "GID", "MODE" or "DEV"
|
||||
attributes with different sizes, they won't be rewritten and changing these
|
||||
values doesn't work.
|
||||
|
||||
|
||||
Symlinks
|
||||
|
||||
You can create symlinks on an HPFS partition. Symlinks are implemented by
setting an extended attribute named "SYMLINK" with the symlink value. Like on
ext2, you can chown and chgrp symlinks, but I don't know what it is good for.
chmoding a symlink results in chmoding the file the symlink points to. These
symlinks are just for Linux use and
|
||||
incompatible with OS/2. OS/2 PmShell symlinks are not supported because they are
|
||||
stored in very crazy way. They tried to do it so that link changes when file is
|
||||
moved ... sometimes it works. But the link is partly stored in directory
|
||||
extended attributes and partly in OS2SYS.INI. I don't want (and don't know how)
|
||||
to analyze or change OS2SYS.INI.
|
||||
|
||||
|
||||
Codepages
|
||||
|
||||
HPFS can contain several uppercasing tables for several codepages and each
|
||||
file has a pointer to the codepage its name is in. However OS/2 was created in
|
||||
America where people don't care much about codepages and so multiple codepages
|
||||
support is quite buggy. I have Czech OS/2 working in codepage 852 on my disk.
|
||||
Once I booted English OS/2 working in cp 850 and I created a file on my 852
|
||||
partition. It marked file name codepage as 850 - good. But when I again booted
|
||||
Czech OS/2, the file was completely inaccessible under any name. It seems that
|
||||
OS/2 uppercases the search pattern with its system code page (852) and file
|
||||
name it's comparing to with its code page (850). These could never match. Is it
|
||||
really what IBM developers wanted? But problems continued. When I created in
|
||||
Czech OS/2 another file in that directory, that file was inaccessible too. OS/2
|
||||
probably uses different uppercasing method when searching where to place a file
|
||||
(note, that files in HPFS directory must be sorted) and when searching for
|
||||
a file. Finally when I opened this directory in PmShell, PmShell crashed (the
|
||||
funny thing was that, when rebooted, PmShell tried to reopen this directory
|
||||
again :-). chkdsk happily ignores these errors and only low-level disk
|
||||
modification saved me. Never mix different language versions of OS/2 on one
|
||||
system although HPFS was designed to allow that.
|
||||
OK, I could implement complex codepage support to this driver but I think it
|
||||
would cause more problems than benefit with such buggy implementation in OS/2.
|
||||
So this driver simply uses first codepage it finds for uppercasing and
|
||||
lowercasing no matter what's file codepage index. Usually all file names are in
|
||||
this codepage - if you don't try to do what I described above :-)
|
||||
|
||||
|
||||
Known bugs
|
||||
|
||||
HPFS386 on OS/2 server is not supported. HPFS386 installed on normal OS/2 client
|
||||
should work. If you have OS/2 server, use only read-only mode. I don't know how
|
||||
to handle some HPFS386 structures like access control list or extended perm
|
||||
list, I don't know how to delete them when file is deleted and how to not
|
||||
overwrite them with extended attributes. Send me some info on these structures
|
||||
and I'll make it. However, this driver should detect presence of HPFS386
|
||||
structures, remount read-only and not destroy them (I hope).
|
||||
|
||||
When there's not enough space for extended attributes, they will be truncated
|
||||
and no error is returned.
|
||||
|
||||
OS/2 can't access files if the path is longer than about 256 chars but this
|
||||
driver allows you to do it. chkdsk ignores such errors.
|
||||
|
||||
Sometimes you won't be able to delete some files on a very full filesystem
|
||||
(returning error ENOSPC). That's because file in non-leaf node in directory tree
|
||||
(one directory, if it's large, has dirents in tree on HPFS) must be replaced
|
||||
with another node when deleted. And that new file might have larger name than
|
||||
the old one so the new name doesn't fit in directory node (dnode). And that
|
||||
would result in directory tree splitting, that takes disk space. Workaround is
|
||||
to delete other files that are leaf (probability that the file is non-leaf is
|
||||
about 1/50) or to truncate file first to make some space.
|
||||
You encounter this problem only if you have many directories so that
|
||||
preallocated directory band is full i.e.
|
||||
number_of_directories / size_of_filesystem_in_mb > 4.
|
||||
|
||||
You can't delete open directories.
|
||||
|
||||
You can't rename over directories (what is it good for?).
|
||||
|
||||
Renaming files so that only case changes doesn't work. This driver supports it
|
||||
but vfs doesn't. Something like 'mv file FILE' won't work.
|
||||
|
||||
All atimes and directory mtimes are not updated. That's because of performance
|
||||
reasons. If you extremely wish to update them, let me know, I'll write it (but
|
||||
it will be slow).
|
||||
|
||||
When the system is out of memory and swap, it may slightly corrupt filesystem
|
||||
(lost files, unbalanced directories). (I guess all filesystem may do it).
|
||||
|
||||
When compiled, you get warning: function declaration isn't a prototype. Does
|
||||
anybody know what does it mean?
|
||||
|
||||
|
||||
What does "unbalanced tree" message mean?
|
||||
|
||||
Old versions of this driver created sometimes unbalanced dnode trees. OS/2
|
||||
chkdsk doesn't scream if the tree is unbalanced (and sometimes creates
|
||||
unbalanced trees too :-) but both HPFS and HPFS386 contain bug that it rarely
|
||||
crashes when the tree is not balanced. This driver handles unbalanced trees
|
||||
correctly and writes warning if it finds them. If you see this message, this is
|
||||
probably because of directories created with old version of this driver.
|
||||
Workaround is to move all files from that directory to another and then back
|
||||
again. Do it in Linux, not OS/2! If you see this message in a directory that
was created entirely by this driver, it is a BUG - let me know about it.
|
||||
|
||||
|
||||
Bugs in OS/2
|
||||
|
||||
When you have two (or more) lost directories pointing each to other, chkdsk
|
||||
locks up when repairing filesystem.
|
||||
|
||||
Sometimes (I think it's random) when you create a file with one-char name under
|
||||
OS/2, OS/2 marks it as 'long'. chkdsk then removes this flag saying "Minor fs
|
||||
error corrected".
|
||||
|
||||
File names like "a .b" are marked as 'long' by OS/2 but chkdsk "corrects" it and
|
||||
marks them as short (and writes "minor fs error corrected"). This bug is not in
|
||||
HPFS386.
|
||||
|
||||
Codepage bugs described above.
|
||||
|
||||
If you don't install fixpacks, there are many, many more...
|
||||
|
||||
|
||||
History
|
||||
|
||||
0.90 First public release
|
||||
0.91 Fixed bug that caused shooting to memory when write_inode was called on
|
||||
open inode (rarely happened)
|
||||
0.92 Fixed a little memory leak in freeing directory inodes
|
||||
0.93 Fixed bug that locked up the machine when there were too many filenames
|
||||
with first 15 characters same
|
||||
Fixed write_file to zero file when writing behind file end
|
||||
0.94 Fixed a little memory leak when trying to delete busy file or directory
|
||||
0.95 Fixed a bug that i_hpfs_parent_dir was not updated when moving files
|
||||
1.90 First version for 2.1.1xx kernels
|
||||
1.91 Fixed a bug that chk_sectors failed when sectors were at the end of disk
|
||||
Fixed a race-condition when write_inode is called while deleting file
|
||||
Fixed a bug that could possibly happen (with very low probability) when
|
||||
using 0xff in filenames
|
||||
Rewritten locking to avoid race-conditions
|
||||
Mount option 'eas' now works
|
||||
Fsync no longer returns error
|
||||
Files beginning with '.' are marked hidden
|
||||
Remount support added
|
||||
Alloc is not so slow when filesystem becomes full
|
||||
Atimes are no more updated because it slows down operation
|
||||
Code cleanup (removed all commented debug prints)
|
||||
1.92 Corrected a bug when sync was called just before closing file
|
||||
1.93 Modified, so that it works with kernels >= 2.1.131, I don't know if it
|
||||
works with previous versions
|
||||
Fixed a possible problem with disks > 64G (but I don't have one, so I can't
|
||||
test it)
|
||||
Fixed a file overflow at 2G
|
||||
Added new option 'timeshift'
|
||||
Changed behaviour on HPFS386: It is now possible to operate on HPFS386 in
|
||||
read-only mode
|
||||
Fixed a bug that slowed down alloc and prevented allocating 100% space
|
||||
(this bug was not destructive)
|
||||
1.94 Added workaround for one bug in Linux
|
||||
Fixed one buffer leak
|
||||
Fixed some incompatibilities with large extended attributes (but it's still
|
||||
not 100% ok, I have no info on it and OS/2 doesn't want to create them)
|
||||
Rewritten allocation
|
||||
Fixed a bug with i_blocks (du sometimes didn't display correct values)
|
||||
Directories have no longer archive attribute set (some programs don't like
|
||||
it)
|
||||
Fixed a bug that it set badly one flag in large anode tree (it was not
|
||||
destructive)
|
||||
1.95 Fixed one buffer leak, that could happen on corrupted filesystem
|
||||
Fixed one bug in allocation in 1.94
|
||||
1.96 Added workaround for one bug in OS/2 (HPFS locked up, HPFS386 reported
|
||||
error sometimes when opening directories in PMSHELL)
|
||||
Fixed a possible bitmap race
|
||||
Fixed possible problem on large disks
|
||||
You can now delete open files
|
||||
Fixed a nondestructive race in rename
|
||||
1.97 Support for HPFS v3 (on large partitions)
|
||||
Fixed a bug that it didn't allow creation of files > 128M (it should be 2G)
|
||||
1.97.1 Changed names of global symbols
|
||||
Fixed a bug when chmoding or chowning root directory
|
||||
1.98 Fixed a deadlock when using old_readdir
|
||||
Better directory handling; workaround for "unbalanced tree" bug in OS/2
|
||||
1.99 Corrected a possible problem when there's not enough space while deleting
|
||||
file
|
||||
Now it tries to truncate the file if there's not enough space when deleting
|
||||
Removed a lot of redundant code
|
||||
2.00 Fixed a bug in rename (it was there since 1.96)
|
||||
Better anti-fragmentation strategy
|
||||
2.01 Fixed problem with directory listing over NFS
|
||||
Directory lseek now checks for proper parameters
|
||||
Fixed race-condition in buffer code - it is in all filesystems in Linux;
|
||||
when reading device (cat /dev/hda) while creating files on it, files
|
||||
could be damaged
|
||||
2.02 Workaround for bug in breada in Linux. breada could cause accesses beyond
|
||||
end of partition
|
||||
2.03 Char, block devices and pipes are correctly created
|
||||
Fixed non-crashing race in unlink (Alexander Viro)
|
||||
Now it works with Japanese version of OS/2
|
||||
2.04 Fixed error when ftruncate used to extend file
|
||||
2.05 Fixed crash when got mount parameters without =
|
||||
Fixed crash when allocation of anode failed due to full disk
|
||||
Fixed some crashes when block io or inode allocation failed
|
||||
2.06 Fixed some crash on corrupted disk structures
|
||||
Better allocation strategy
|
||||
Reschedule points added so that it doesn't lock CPU long time
|
||||
It should work in read-only mode on Warp Server
|
||||
2.07 More fixes for Warp Server. Now it really works
|
||||
2.08 Creating new files is not so slow on large disks
|
||||
An attempt to sync deleted file does not generate filesystem error
|
||||
2.09 Fixed error on extremely fragmented files
|
||||
|
||||
|
||||
vim: set textwidth=80:
|
||||
269
Documentation/filesystems/inotify.txt
Normal file
@@ -0,0 +1,269 @@
|
||||
inotify
|
||||
a powerful yet simple file change notification system
|
||||
|
||||
|
||||
|
||||
Document started 15 Mar 2005 by Robert Love <rml@novell.com>
|
||||
|
||||
|
||||
(i) User Interface
|
||||
|
||||
Inotify is controlled by a set of three system calls and normal file I/O on a
|
||||
returned file descriptor.
|
||||
|
||||
First step in using inotify is to initialise an inotify instance:
|
||||
|
||||
int fd = inotify_init ();
|
||||
|
||||
Each instance is associated with a unique, ordered queue.
|
||||
|
||||
Change events are managed by "watches". A watch is an (object,mask) pair where
|
||||
the object is a file or directory and the mask is a bit mask of one or more
|
||||
inotify events that the application wishes to receive. See <linux/inotify.h>
|
||||
for valid events. A watch is referenced by a watch descriptor, or wd.
|
||||
|
||||
Watches are added via a path to the file.
|
||||
|
||||
Watches on a directory will return events on any files inside of the directory.
|
||||
|
||||
Adding a watch is simple:
|
||||
|
||||
int wd = inotify_add_watch (fd, path, mask);
|
||||
|
||||
Where "fd" is the return value from inotify_init(), path is the path to the
|
||||
object to watch, and mask is the watch mask (see <linux/inotify.h>).
|
||||
|
||||
You can update an existing watch in the same manner, by passing in a new mask.
|
||||
|
||||
An existing watch is removed via
|
||||
|
||||
int ret = inotify_rm_watch (fd, wd);
|
||||
|
||||
Events are provided in the form of an inotify_event structure that is read(2)
|
||||
from a given inotify instance. The filename is of dynamic length and follows
|
||||
the struct. It is of size len. The filename is padded with null bytes to
|
||||
ensure proper alignment. This padding is reflected in len.
|
||||
|
||||
You can slurp multiple events by passing a large buffer, for example
|
||||
|
||||
ssize_t len = read (fd, buf, BUF_LEN);
|
||||
|
||||
Where "buf" is a pointer to an array of "inotify_event" structures at least
|
||||
BUF_LEN bytes in size. The above example will return as many events as are
|
||||
available and fit in BUF_LEN.
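
Because each event is followed by a name of variable length, the buffer has
to be walked record by record. The following userspace sketch shows one way
to do that with the calls introduced above. The helper names
watch_creates() and read_events() and the BUF_LEN value are choices made
for this example only.

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/inotify.h>

#define BUF_LEN 4096

/* Watch `dir` for newly created entries; returns the inotify fd or -1. */
int watch_creates(const char *dir)
{
	int fd = inotify_init();

	if (fd < 0)
		return -1;
	if (inotify_add_watch(fd, dir, IN_CREATE) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Block until events arrive, copy the name carried by the first event
 * into `name`, and return the number of events read (or -1 on error).
 */
int read_events(int fd, char *name, size_t name_size)
{
	/* malloc() returns memory aligned for any type, including events. */
	char *buf = malloc(BUF_LEN);
	ssize_t len;
	int count = 0;

	if (!buf)
		return -1;

	len = read(fd, buf, BUF_LEN);
	if (len <= 0) {
		free(buf);
		return -1;
	}

	for (char *p = buf; p < buf + len; ) {
		struct inotify_event *ev = (struct inotify_event *)p;

		if (count == 0 && name_size > 0) {
			/* ev->len is 0 when the event carries no name. */
			strncpy(name, ev->len ? ev->name : "", name_size - 1);
			name[name_size - 1] = '\0';
		}
		count++;
		/* The next event starts after the header and padded name. */
		p += sizeof(struct inotify_event) + ev->len;
	}

	free(buf);
	return count;
}
```

The step size uses ev->len rather than strlen(ev->name) because len
includes the null padding mentioned above.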
|
||||
|
||||
Each inotify instance fd is also select()- and poll()-able.
|
||||
|
||||
You can find the size of the current event queue via the standard FIONREAD
|
||||
ioctl on the fd returned by inotify_init().
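
A minimal sketch of the FIONREAD query follows. The helper names are
invented for this example; the only calls assumed are ioctl(2) with
FIONREAD and the inotify fd described above.

```c
#include <sys/inotify.h>
#include <sys/ioctl.h>

/* Returns the number of bytes queued on the inotify fd, or -1 on error. */
int inotify_queued_bytes(int fd)
{
	int bytes = 0;

	if (ioctl(fd, FIONREAD, &bytes) != 0)
		return -1;
	return bytes;
}

/* Returns 1 if at least one event header is queued, 0 if not, -1 on error. */
int inotify_has_event(int fd)
{
	int bytes = inotify_queued_bytes(fd);

	if (bytes < 0)
		return -1;
	return bytes >= (int)sizeof(struct inotify_event);
}
```

This is handy for sizing the read buffer before calling read(2), or for
checking for pending events without blocking.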
|
||||
|
||||
All watches are destroyed and cleaned up on close.
|
||||
|
||||
|
||||
(ii)
|
||||
|
||||
Prototypes:
|
||||
|
||||
int inotify_init (void);
|
||||
int inotify_add_watch (int fd, const char *path, __u32 mask);
|
||||
int inotify_rm_watch (int fd, __u32 wd);
|
||||
|
||||
|
||||
(iii) Kernel Interface
|
||||
|
||||
Inotify's kernel API consists of a set of functions for managing watches and an
|
||||
event callback.
|
||||
|
||||
To use the kernel API, you must first initialize an inotify instance with a set
|
||||
of inotify_operations. You are given an opaque inotify_handle, which you use
|
||||
for any further calls to inotify.
|
||||
|
||||
struct inotify_handle *ih = inotify_init(my_event_handler);
|
||||
|
||||
You must provide a function for processing events and a function for destroying
|
||||
the inotify watch.
|
||||
|
||||
void handle_event(struct inotify_watch *watch, u32 wd, u32 mask,
|
||||
u32 cookie, const char *name, struct inode *inode)
|
||||
|
||||
watch - the pointer to the inotify_watch that triggered this call
|
||||
wd - the watch descriptor
|
||||
mask - describes the event that occurred
|
||||
cookie - an identifier for synchronizing events
|
||||
name - the dentry name for affected files in a directory-based event
|
||||
inode - the affected inode in a directory-based event
|
||||
|
||||
void destroy_watch(struct inotify_watch *watch)
|
||||
|
||||
You may add watches by providing a pre-allocated and initialized inotify_watch
|
||||
structure and specifying the inode to watch along with an inotify event mask.
|
||||
You must pin the inode during the call. You will likely wish to embed the
|
||||
inotify_watch structure in a structure of your own which contains other
|
||||
information about the watch. Once you add an inotify watch, it is immediately
|
||||
subject to removal depending on filesystem events. You must grab a reference if
|
||||
you depend on the watch hanging around after the call.
|
||||
|
||||
inotify_init_watch(&my_watch->iwatch);
|
||||
get_inotify_watch(&my_watch->iwatch); // optional
|
||||
s32 wd = inotify_add_watch(ih, &my_watch->iwatch, inode, mask);
|
||||
put_inotify_watch(&my_watch->iwatch); // optional
|
||||
|
||||
You may use the watch descriptor (wd) or the address of the inotify_watch for
|
||||
other inotify operations. You must not directly read or manipulate data in the
|
||||
inotify_watch. Additionally, you must not call inotify_add_watch() more than
|
||||
once for a given inotify_watch structure, unless you have first called either
|
||||
inotify_rm_watch() or inotify_rm_wd().
|
||||
|
||||
To determine if you have already registered a watch for a given inode, you may
|
||||
call inotify_find_watch(), which gives you both the wd and the watch pointer for
|
||||
the inotify_watch, or an error if the watch does not exist.
|
||||
|
||||
wd = inotify_find_watch(ih, inode, &watchp);
|
||||
|
||||
You may use container_of() on the watch pointer to access your own data
|
||||
associated with a given watch. When an existing watch is found,
|
||||
inotify_find_watch() bumps the refcount before releasing its locks. You must
|
||||
put that reference with:
|
||||
|
||||
put_inotify_watch(watchp);
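
The container_of() step can be illustrated outside the kernel with a stand-in
structure. Everything below is a user-space sketch under stated assumptions:
struct inotify_watch_stub merely takes the place of the kernel's opaque
inotify_watch, and struct my_watch and to_my_watch() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct inotify_watch (assumption: the real
 * structure is opaque and must not be read or modified directly). */
struct inotify_watch_stub {
    int placeholder;
};

/* Same idea as the kernel's container_of(): step back from a pointer to a
 * member to the structure that embeds it. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical per-watch data with the watch embedded in it. */
struct my_watch {
    const char *label;
    struct inotify_watch_stub iwatch;
};

/* Recover our own structure from the watch pointer handed back by inotify. */
static struct my_watch *to_my_watch(struct inotify_watch_stub *watch)
{
    assert(watch != NULL);
    return container_of(watch, struct my_watch, iwatch);
}
```

Embedding the watch rather than pointing to it is what makes this lookup a
simple pointer subtraction with no extra table.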
|
||||
|
||||
Call inotify_find_update_watch() to update the event mask for an existing watch.
|
||||
inotify_find_update_watch() returns the wd of the updated watch, or an error if
|
||||
the watch does not exist.
|
||||
|
||||
wd = inotify_find_update_watch(ih, inode, mask);
|
||||
|
||||
An existing watch may be removed by calling either inotify_rm_watch() or
|
||||
inotify_rm_wd().
|
||||
|
||||
int ret = inotify_rm_watch(ih, &my_watch->iwatch);
|
||||
int ret = inotify_rm_wd(ih, wd);
|
||||
|
||||
A watch may be removed while executing your event handler with the following:
|
||||
|
||||
inotify_remove_watch_locked(ih, iwatch);
|
||||
|
||||
Call inotify_destroy() to remove all watches from your inotify instance and
|
||||
release it. If there are no outstanding references, inotify_destroy() will call
|
||||
your destroy_watch op for each watch.
|
||||
|
||||
inotify_destroy(ih);
|
||||
|
||||
When inotify removes a watch, it sends an IN_IGNORED event to your callback.
|
||||
You may use this event as an indication to free the watch memory. Note that
|
||||
inotify may remove a watch due to filesystem events, as well as by your request.
|
||||
If you use IN_ONESHOT, inotify will remove the watch after the first event, at
|
||||
which point you may call the final put_inotify_watch().
|
||||
|
||||
(iv) Kernel Interface Prototypes
|
||||
|
||||
struct inotify_handle *inotify_init(struct inotify_operations *ops);
|
||||
|
||||
void inotify_init_watch(struct inotify_watch *watch);
|
||||
|
||||
s32 inotify_add_watch(struct inotify_handle *ih,
|
||||
struct inotify_watch *watch,
|
||||
struct inode *inode, u32 mask);
|
||||
|
||||
s32 inotify_find_watch(struct inotify_handle *ih, struct inode *inode,
|
||||
struct inotify_watch **watchp);
|
||||
|
||||
s32 inotify_find_update_watch(struct inotify_handle *ih,
|
||||
struct inode *inode, u32 mask);
|
||||
|
||||
int inotify_rm_wd(struct inotify_handle *ih, u32 wd);
|
||||
|
||||
int inotify_rm_watch(struct inotify_handle *ih,
|
||||
struct inotify_watch *watch);
|
||||
|
||||
void inotify_remove_watch_locked(struct inotify_handle *ih,
|
||||
struct inotify_watch *watch);
|
||||
|
||||
void inotify_destroy(struct inotify_handle *ih);
|
||||
|
||||
void get_inotify_watch(struct inotify_watch *watch);
|
||||
void put_inotify_watch(struct inotify_watch *watch);
|
||||
|
||||
|
||||
(v) Internal Kernel Implementation
|
||||
|
||||
Each inotify instance is represented by an inotify_handle structure.
|
||||
Inotify's userspace consumers also have an inotify_device which is
|
||||
associated with the inotify_handle, and on which events are queued.
|
||||
|
||||
Each watch is associated with an inotify_watch structure. Watches are chained
|
||||
off of each associated inotify_handle and each associated inode.
|
||||
|
||||
See fs/inotify.c and fs/inotify_user.c for the locking and lifetime rules.
|
||||
|
||||
|
||||
(vi) Rationale
|
||||
|
||||
Q: What is the design decision behind not tying the watch to the open fd of
|
||||
the watched object?
|
||||
|
||||
A: Watches are associated with an open inotify device, not an open file.
|
||||
This solves the primary problem with dnotify: keeping the file open pins
|
||||
the file and thus, worse, pins the mount. Dnotify is therefore infeasible
|
||||
for use on a desktop system with removable media as the media cannot be
|
||||
unmounted. Watching a file should not require that it be open.
|
||||
|
||||
Q: What is the design decision behind using an fd-per-instance as opposed to
|
||||
an fd-per-watch?
|
||||
|
||||
A: An fd-per-watch quickly consumes more file descriptors than are allowed,
|
||||
more fd's than are feasible to manage, and more fd's than are optimally
|
||||
select()-able. Yes, root can bump the per-process fd limit and yes, users
|
||||
can use epoll, but requiring both is a silly and extraneous requirement.
|
||||
A watch consumes less memory than an open file; separating the number
|
||||
spaces is thus sensible. The current design is what user-space developers
|
||||
want: Users initialize inotify, once, and add n watches, requiring but one
|
||||
fd and no twiddling with fd limits. Initializing an inotify instance two
|
||||
thousand times is silly. If we can implement user-space's preferences
|
||||
cleanly--and we can, the idr layer makes stuff like this trivial--then we
|
||||
should.
|
||||
|
||||
There are other good arguments. With a single fd, there is a single
|
||||
item to block on, which is mapped to a single queue of events. The single
|
||||
fd returns all watch events and also any potential out-of-band data. If
|
||||
every fd was a separate watch,
|
||||
|
||||
- There would be no way to get event ordering. Events on file foo and
|
||||
file bar would pop poll() on both fd's, but there would be no way to tell
|
||||
which happened first. A single queue trivially gives you ordering. Such
|
||||
ordering is crucial to existing applications such as Beagle. Imagine
|
||||
"mv a b ; mv b a" events without ordering.
|
||||
|
||||
- We'd have to maintain n fd's and n internal queues with state,
|
||||
versus just one. It is a lot messier in the kernel. A single, linear
|
||||
queue is the data structure that makes sense.
|
||||
|
||||
- User-space developers prefer the current API. The Beagle guys, for
|
||||
example, love it. Trust me, I asked. It is not a surprise: Who'd want
|
||||
to manage and block on 1000 fd's via select?
|
||||
|
||||
- No way to get out of band data.
|
||||
|
||||
- 1024 is still too low. ;-)
|
||||
|
||||
When you talk about designing a file change notification system that
|
||||
scales to 1000s of directories, juggling 1000s of fd's just does not seem
|
||||
the right interface. It is too heavy.
|
||||
|
||||
Additionally, it _is_ possible to have more than one instance and
|
||||
juggle more than one queue and thus more than one associated fd. There
|
||||
need not be a one-fd-per-process mapping; it is one-fd-per-queue and a
|
||||
process can easily want more than one queue.
|
||||
|
||||
Q: Why the system call approach?
|
||||
|
||||
A: The poor user-space interface is the second biggest problem with dnotify.
|
||||
Signals are a terrible, terrible interface for file notification. Or for
|
||||
anything, for that matter. The ideal solution, from all perspectives, is a
|
||||
file descriptor-based one that allows basic file I/O and poll/select.
|
||||
Obtaining the fd and managing the watches could have been done either via a
|
||||
device file or a family of new system calls. We decided to implement a
|
||||
family of system calls because that is the preferred approach for new kernel
|
||||
interfaces. The only real difference was whether we wanted to use open(2)
|
||||
and ioctl(2) or a couple of new system calls. System calls beat ioctls.
|
||||
|
||||
42
Documentation/filesystems/isofs.txt
Normal file
@@ -0,0 +1,42 @@
|
||||
Mount options that are the same as for msdos and vfat partitions.
|
||||
|
||||
gid=nnn All files in the partition will be in group nnn.
|
||||
uid=nnn All files in the partition will be owned by user id nnn.
|
||||
umask=nnn The permission mask (see umask(1)) for the partition.
|
||||
|
||||
Mount options that are the same as for vfat partitions. These are only useful
|
||||
when using discs encoded using Microsoft's Joliet extensions.
|
||||
iocharset=name Character set to use for converting from Unicode to
|
||||
ASCII. Joliet filenames are stored in Unicode format, but
|
||||
Unix for the most part doesn't know how to deal with Unicode.
|
||||
There is also an option of doing UTF-8 translations with the
|
||||
utf8 option.
|
||||
utf8 Encode Unicode names in UTF-8 format. Default is no.
|
||||
|
||||
Mount options unique to the isofs filesystem.
|
||||
block=512 Set the block size for the disk to 512 bytes
|
||||
block=1024 Set the block size for the disk to 1024 bytes
|
||||
block=2048 Set the block size for the disk to 2048 bytes
|
||||
check=relaxed Matches filenames with different cases
|
||||
check=strict Matches only filenames with the exact same case
|
||||
cruft Try to handle badly formatted CDs.
|
||||
map=off Do not map non-Rock Ridge filenames to lower case
|
||||
map=normal Map non-Rock Ridge filenames to lower case
|
||||
map=acorn As map=normal but also apply Acorn extensions if present
|
||||
mode=xxx Sets the permissions on files to xxx
|
||||
nojoliet Ignore Joliet extensions if they are present.
|
||||
norock Ignore Rock Ridge extensions if they are present.
|
||||
hide Completely strip hidden files from the file system.
|
||||
showassoc Show files marked with the 'associated' bit
|
||||
unhide Deprecated; showing hidden files is now default;
|
||||
If given, it is a synonym for 'showassoc' which will
|
||||
recreate previous unhide behavior
|
||||
session=x Select the session number on a multisession CD
|
||||
sbsector=xxx Session begins from sector xxx
|
||||
|
||||
Recommended documents about ISO 9660 standard are located at:
|
||||
http://www.y-adagio.com/public/standards/iso_cdromr/tocont.htm
|
||||
ftp://ftp.ecma.ch/ecma-st/Ecma-119.pdf
|
||||
Quoting from the PDF "This 2nd Edition of Standard ECMA-119 is technically
|
||||
identical with ISO 9660.", so it is a valid and gratis substitute of the
|
||||
official ISO specification.
|
||||
35
Documentation/filesystems/jfs.txt
Normal file
@@ -0,0 +1,35 @@
|
||||
IBM's Journaled File System (JFS) for Linux
|
||||
|
||||
JFS Homepage: http://jfs.sourceforge.net/
|
||||
|
||||
The following mount options are supported:
|
||||
|
||||
iocharset=name Character set to use for converting from Unicode to
|
||||
ASCII. The default is to do no conversion. Use
|
||||
iocharset=utf8 for UTF-8 translations. This requires
|
||||
CONFIG_NLS_UTF8 to be set in the kernel .config file.
|
||||
iocharset=none specifies the default behavior explicitly.
|
||||
|
||||
resize=value Resize the volume to <value> blocks. JFS only supports
|
||||
growing a volume, not shrinking it. This option is only
|
||||
valid during a remount, when the volume is mounted
|
||||
read-write. The resize keyword with no value will grow
|
||||
the volume to the full size of the partition.
|
||||
|
||||
nointegrity Do not write to the journal. The primary use of this option
|
||||
is to allow for higher performance when restoring a volume
|
||||
from backup media. The integrity of the volume is not
|
||||
guaranteed if the system abnormally abends.
|
||||
|
||||
integrity Default. Commit metadata changes to the journal. Use this
|
||||
option to remount a volume where the nointegrity option was
|
||||
previously specified in order to restore normal behavior.
|
||||
|
||||
errors=continue Keep going on a filesystem error.
|
||||
errors=remount-ro Default. Remount the filesystem read-only on an error.
|
||||
errors=panic Panic and halt the machine if an error occurs.
|
||||
|
||||
Please send bugs, comments, cards and letters to shaggy@austin.ibm.com.
|
||||
|
||||
The JFS mailing list can be subscribed to by using the link labeled
|
||||
"Mail list Subscribe" at our web page http://jfs.sourceforge.net/
|
||||
12
Documentation/filesystems/ncpfs.txt
Normal file
@@ -0,0 +1,12 @@
|
||||
The ncpfs filesystem understands the NCP protocol, designed by the
|
||||
Novell Corporation for their NetWare(tm) product. NCP is functionally
|
||||
similar to the NFS used in the TCP/IP community.
|
||||
To mount a NetWare filesystem, you need a special mount program, which
|
||||
can be found in the ncpfs package. The home site for ncpfs is
|
||||
ftp.gwdg.de/pub/linux/misc/ncpfs, but sunsite and its many mirrors
|
||||
will have it as well.
|
||||
|
||||
Related products are linware and mars_nwe, which will give Linux partial
|
||||
NetWare server functionality. Linware's home site is
|
||||
klokan.sh.cvut.cz/pub/linux/linware; mars_nwe can be found on
|
||||
ftp.gwdg.de/pub/linux/misc/ncpfs.
|
||||
714
Documentation/filesystems/ntfs.txt
Normal file
@@ -0,0 +1,714 @@
|
||||
The Linux NTFS filesystem driver
|
||||
================================
|
||||
|
||||
|
||||
Table of contents
|
||||
=================
|
||||
|
||||
- Overview
|
||||
- Web site
|
||||
- Features
|
||||
- Supported mount options
|
||||
- Known bugs and (mis-)features
|
||||
- Using NTFS volume and stripe sets
|
||||
- The Device-Mapper driver
|
||||
- The Software RAID / MD driver
|
||||
- Limitations when using the MD driver
|
||||
- ChangeLog
|
||||
|
||||
|
||||
Overview
|
||||
========
|
||||
|
||||
Linux-NTFS comes with a number of user-space programs known as ntfsprogs.
|
||||
These include mkntfs, a full-featured ntfs filesystem format utility,
|
||||
ntfsundelete used for recovering files that were unintentionally deleted
|
||||
from an NTFS volume and ntfsresize which is used to resize an NTFS partition.
|
||||
See the web site for more information.
|
||||
|
||||
To mount an NTFS 1.2/3.x (Windows NT4/2000/XP/2003) volume, use the file
|
||||
system type 'ntfs'. The driver currently supports read-only mode (with no
|
||||
fault-tolerance, encryption or journalling) and very limited, but safe, write
|
||||
support.
|
||||
|
||||
For fault tolerance and raid support (i.e. volume and stripe sets), you can
|
||||
use the kernel's Software RAID / MD driver. See section "Using NTFS volume and
stripe sets" for details.
|
||||
|
||||
|
||||
Web site
|
||||
========
|
||||
|
||||
There is plenty of additional information on the linux-ntfs web site
|
||||
at http://linux-ntfs.sourceforge.net/
|
||||
|
||||
The web site has a lot of additional information, such as a comprehensive
|
||||
FAQ, documentation on the NTFS on-disk format, information on the Linux-NTFS
|
||||
userspace utilities, etc.
|
||||
|
||||
|
||||
Features
|
||||
========
|
||||
|
||||
- This is a complete rewrite of the NTFS driver that used to be in the 2.4 and
|
||||
earlier kernels. This new driver implements NTFS read support and is
|
||||
functionally equivalent to the old ntfs driver and it also implements limited
|
||||
write support. The biggest limitation at present is that files/directories
|
||||
cannot be created or deleted. See below for the list of write features that
|
||||
are so far supported. Another limitation is that writing to compressed files
|
||||
is not implemented at all. Also, neither read nor write access to encrypted
|
||||
files is so far implemented.
|
||||
- The new driver has full support for sparse files on NTFS 3.x volumes which
|
||||
the old driver isn't happy with.
|
||||
- The new driver supports execution of binaries due to mmap() now being
|
||||
supported.
|
||||
- The new driver supports loopback mounting of files on NTFS which is used by
|
||||
some Linux distributions to enable the user to run Linux from an NTFS
|
||||
partition by creating a large file while in Windows and then loopback
|
||||
mounting the file while in Linux and creating a Linux filesystem on it that
|
||||
is used to install Linux on it.
|
||||
- A comparison of the two drivers using:
|
||||
time find . -type f -exec md5sum "{}" \;
|
||||
run three times in sequence with each driver (after a reboot) on a 1.4GiB
|
||||
NTFS partition, showed the new driver to be 20% faster in total time elapsed
|
||||
(from 9:43 minutes on average down to 7:53). The time spent in user space
|
||||
was unchanged but the time spent in the kernel was decreased by a factor of
|
||||
2.5 (from 85 CPU seconds down to 33).
|
||||
- The driver does not support short file names in general. For backwards
|
||||
compatibility, we implement access to files using their short file names if
|
||||
they exist. The driver will not create short file names however, and a
|
||||
rename will discard any existing short file name.
|
||||
- The new driver supports exporting of mounted NTFS volumes via NFS.
|
||||
- The new driver supports async io (aio).
|
||||
- The new driver supports fsync(2), fdatasync(2), and msync(2).
|
||||
- The new driver supports readv(2) and writev(2).
|
||||
- The new driver supports access time updates (including mtime and ctime).
|
||||
- The new driver supports truncate(2) and open(2) with O_TRUNC. But at present
|
||||
only very limited support for highly fragmented files, i.e. ones which have
|
||||
their data attribute split across multiple extents, is included. Another
|
||||
limitation is that at present truncate(2) will never create sparse files,
|
||||
since to mark a file sparse we need to modify the directory entry for the
|
||||
file and we do not implement directory modifications yet.
|
||||
- The new driver supports write(2) which can both overwrite existing data and
|
||||
extend the file size so that you can write beyond the existing data. Also,
|
||||
writing into sparse regions is supported and the holes are filled in with
|
||||
clusters. But at present only limited support for highly fragmented files,
|
||||
i.e. ones which have their data attribute split across multiple extents, is
|
||||
included. Another limitation is that write(2) will never create sparse
|
||||
files, since to mark a file sparse we need to modify the directory entry for
|
||||
the file and we do not implement directory modifications yet.
|
||||
|
||||
Supported mount options
|
||||
=======================
|
||||
|
||||
In addition to the generic mount options described by the manual page for the
|
||||
mount command (man 8 mount, also see man 5 fstab), the NTFS driver supports the
|
||||
following mount options:
|
||||
|
||||
iocharset=name Deprecated option. Still supported but please use
|
||||
nls=name in the future. See description for nls=name.
|
||||
|
||||
nls=name Character set to use when returning file names.
|
||||
Unlike VFAT, NTFS suppresses names that contain
|
||||
unconvertible characters. Note that most character
|
||||
sets contain insufficient characters to represent all
|
||||
possible Unicode characters that can exist on NTFS.
|
||||
To be sure you are not missing any files, you are
|
||||
advised to use nls=utf8 which is capable of
|
||||
representing all Unicode characters.
|
||||
|
||||
utf8=<bool> Option no longer supported. Currently mapped to
|
||||
nls=utf8 but please use nls=utf8 in the future and
|
||||
make sure utf8 is compiled either as module or into
|
||||
the kernel. See description for nls=name.
|
||||
|
||||
uid=
|
||||
gid=
|
||||
umask= Provide default owner, group, and access mode mask.
|
||||
These options work as documented in mount(8). By
|
||||
default, the files/directories are owned by root and
|
||||
he/she has read and write permissions, as well as
|
||||
browse permission for directories. No one else has any
|
||||
access permissions. I.e. the mode on all files is by
|
||||
default rw------- and for directories rwx------, a
|
||||
consequence of the default fmask=0177 and dmask=0077.
|
||||
Using a umask of zero will grant all permissions to
|
||||
everyone, i.e. all files and directories will have mode
|
||||
rwxrwxrwx.
|
||||
|
||||
fmask=
|
||||
dmask= Instead of specifying umask which applies both to
|
||||
files and directories, fmask applies only to files and
|
||||
dmask only to directories.
|
||||
|
||||
sloppy=<BOOL> If sloppy is specified, ignore unknown mount options.
|
||||
Otherwise the default behaviour is to abort mount if
|
||||
any unknown options are found.
|
||||
|
||||
show_sys_files=<BOOL> If show_sys_files is specified, show the system files
|
||||
in directory listings. Otherwise the default behaviour
|
||||
is to hide the system files.
|
||||
Note that even when show_sys_files is specified, "$MFT"
|
||||
will not be visible due to bugs/mis-features in glibc.
|
||||
Further, note that irrespective of show_sys_files, all
|
||||
files are accessible by name, i.e. you can always do
|
||||
"ls -l \$UpCase" for example to specifically show the
|
||||
system file containing the Unicode upcase table.
|
||||
|
||||
case_sensitive=<BOOL> If case_sensitive is specified, treat all file names as
|
||||
case sensitive and create file names in the POSIX
|
||||
namespace. Otherwise the default behaviour is to treat
|
||||
file names as case insensitive and to create file names
|
||||
in the WIN32/LONG name space. Note, the Linux NTFS
|
||||
driver will never create short file names and will
|
||||
remove them on rename/delete of the corresponding long
|
||||
file name.
|
||||
Note that files remain accessible via their short file
|
||||
name, if it exists. If case_sensitive, you will need
|
||||
to provide the correct case of the short file name.
|
||||
|
||||
disable_sparse=<BOOL> If disable_sparse is specified, creation of sparse
|
||||
regions, i.e. holes, inside files is disabled for the
|
||||
volume (for the duration of this mount only). By
|
||||
default, creation of sparse regions is enabled, which
|
||||
is consistent with the behaviour of traditional Unix
|
||||
filesystems.
|
||||
|
||||
errors=opt What to do when critical filesystem errors are found.
|
||||
The following values can be used for "opt":
|
||||
continue: DEFAULT, try to clean-up as much as
|
||||
possible, e.g. marking a corrupt inode as
|
||||
bad so it is no longer accessed, and then
|
||||
continue.
|
||||
recover: At present only supported is recovery of
|
||||
the boot sector from the backup copy.
|
||||
If read-only mount, the recovery is done
|
||||
in memory only and not written to disk.
|
||||
Note that the options are additive, i.e. specifying:
|
||||
errors=continue,errors=recover
|
||||
means the driver will attempt to recover and if that
|
||||
fails it will clean-up as much as possible and
|
||||
continue.
|
||||
|
||||
mft_zone_multiplier= Set the MFT zone multiplier for the volume (this
|
||||
setting is not persistent across mounts and can be
|
||||
changed from mount to mount but cannot be changed on
|
||||
remount). Values of 1 to 4 are allowed, 1 being the
|
||||
default. The MFT zone multiplier determines how much
|
||||
space is reserved for the MFT on the volume. If all
|
||||
other space is used up, then the MFT zone will be
|
||||
shrunk dynamically, so this has no impact on the
|
||||
amount of free space. However, it can have an impact
|
||||
on performance by affecting fragmentation of the MFT.
|
||||
In general use the default. If you have a lot of small
|
||||
files then use a higher value. The values have the
|
||||
following meaning:
|
||||
Value MFT zone size (% of volume size)
|
||||
1 12.5%
|
||||
2 25%
|
||||
3 37.5%
|
||||
4 50%
|
||||
Note this option is irrelevant for read-only mounts.
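
Two of the calculations in the option list above, the effect of
umask/fmask/dmask and the MFT zone size table, can be written out as a short
sketch. The function names are assumptions made for this example, and the
arithmetic is derived only from the descriptions and the table above, not from
the driver source.

```c
#include <assert.h>
#include <stdint.h>

/* Permission bits left after applying a mount-time mask: every bit set in
 * the mask is cleared from rwxrwxrwx (0777). */
static unsigned int masked_mode(unsigned int mask)
{
    return 0777u & ~mask;
}

/* Size reserved for the MFT zone, in the same unit as volume_size.
 * Each multiplier step is 12.5% of the volume, i.e. one eighth. */
static uint64_t mft_zone_size(uint64_t volume_size, unsigned int multiplier)
{
    assert(multiplier >= 1 && multiplier <= 4);
    return volume_size * multiplier / 8;
}
```

With the defaults, masked_mode(0177) gives 0600 (rw-------) for files and
masked_mode(0077) gives 0700 (rwx------) for directories.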
|
||||
|
||||
|
||||
Known bugs and (mis-)features
|
||||
=============================
|
||||
|
||||
- The link count on each directory inode entry is set to 1, due to Linux not
|
||||
supporting directory hard links. This may well confuse some user space
|
||||
applications, since the directory names will have the same inode numbers.
|
||||
This also speeds up ntfs_read_inode() immensely. And we haven't found any
|
||||
problems with this approach so far. If you find a problem with this, please
|
||||
let us know.
|
||||
|
||||
|
||||
Please send bug reports/comments/feedback/abuse to the Linux-NTFS development
|
||||
list at sourceforge: linux-ntfs-dev@lists.sourceforge.net
|
||||
|
||||
|
||||
Using NTFS volume and stripe sets
|
||||
=================================
|
||||
|
||||
For support of volume and stripe sets, you can either use the kernel's
|
||||
Device-Mapper driver or the kernel's Software RAID / MD driver. The former is
|
||||
the recommended one to use for linear raid. But the latter is required for
|
||||
raid level 5. For striping and mirroring, either driver should work fine.
|
||||
|
||||
|
||||
The Device-Mapper driver
|
||||
------------------------
|
||||
|
||||
You will need to create a table of the components of the volume/stripe set and
|
||||
how they fit together and load this into the kernel using the dmsetup utility
|
||||
(see man 8 dmsetup).
|
||||
|
||||
Linear volume sets, i.e. linear raid, have been tested and work fine. Even
|
||||
though untested, there is no reason why stripe sets, i.e. raid level 0, and
|
||||
mirrors, i.e. raid level 1 should not work, too. Stripes with parity, i.e.
|
||||
raid level 5, unfortunately cannot work yet because the current version of the
|
||||
Device-Mapper driver does not support raid level 5. You may be able to use the
|
||||
Software RAID / MD driver for raid level 5, see the next section for details.
|
||||
|
||||
To create the table describing your volume you will need to know each of its
|
||||
components and their sizes in sectors, i.e. multiples of 512-byte blocks.
|
||||
|
||||
For NT4 fault tolerant volumes you can obtain the sizes using fdisk. So for
|
||||
example if one of your partitions is /dev/hda2 you would do:
|
||||
|
||||
$ fdisk -ul /dev/hda
|
||||
|
||||
Disk /dev/hda: 81.9 GB, 81964302336 bytes
|
||||
255 heads, 63 sectors/track, 9964 cylinders, total 160086528 sectors
|
||||
Units = sectors of 1 * 512 = 512 bytes
|
||||
|
||||
Device Boot Start End Blocks Id System
|
||||
/dev/hda1 * 63 4209029 2104483+ 83 Linux
|
||||
/dev/hda2 4209030 37768814 16779892+ 86 NTFS
|
||||
/dev/hda3 37768815 46170809 4200997+ 83 Linux
|
||||
|
||||
And you would know that /dev/hda2 has a size of 37768814 - 4209030 + 1 =
|
||||
33559785 sectors.
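
The size calculation above is simply the inclusive difference of the Start and
End columns. As a small sketch (the function name partition_sectors() is an
assumption made for this example):

```c
#include <assert.h>
#include <stdint.h>

/* Size of a partition in sectors, given the Start and End sector columns
 * printed by "fdisk -ul". Both boundaries are inclusive, hence the + 1. */
static uint64_t partition_sectors(uint64_t start, uint64_t end)
{
    assert(end >= start);
    return end - start + 1;
}
```

For /dev/hda2 in the listing above, partition_sectors(4209030, 37768814)
gives 33559785.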
|
||||
|
||||
For Win2k and later dynamic disks, you can for example use the ldminfo utility
|
||||
which is part of the Linux LDM tools (the latest version at the time of
|
||||
writing is linux-ldm-0.0.8.tar.bz2). You can download it from:
|
||||
http://linux-ntfs.sourceforge.net/downloads.html
|
||||
Simply extract the downloaded archive (tar xvjf linux-ldm-0.0.8.tar.bz2), go
|
||||
into it (cd linux-ldm-0.0.8) and change to the test directory (cd test). You
|
||||
will find the precompiled (i386) ldminfo utility there. NOTE: You will not be
|
||||
able to compile this yourself easily so use the binary version!
|
||||
|
||||
Then you would use ldminfo in dump mode to obtain the necessary information:
|
||||
|
||||
$ ./ldminfo --dump /dev/hda
|
||||
|
||||
This would dump the LDM database found on /dev/hda which describes all of your
|
||||
dynamic disks and all the volumes on them. At the bottom you will see the
|
||||
VOLUME DEFINITIONS section which is all you really need. You may need to look
|
||||
further above to determine which of the disks in the volume definitions is
|
||||
which device in Linux. Hint: Run ldminfo on each of your dynamic disks and
|
||||
look at the Disk Id close to the top of the output for each (the PRIVATE HEADER
|
||||
section). You can then find these Disk Ids in the VBLK DATABASE section in the
|
||||
<Disk> components where you will get the LDM Name for the disk that is found in
|
||||
the VOLUME DEFINITIONS section.
|
||||
|
||||
Note you will also need to enable the LDM driver in the Linux kernel. If your
|
||||
distribution did not enable it, you will need to recompile the kernel with it
|
||||
enabled. This will create the LDM partitions on each device at boot time. You
|
||||
would then use those devices (for /dev/hda they would be /dev/hda1, 2, 3, etc)
|
||||
in the Device-Mapper table.
|
||||
|
||||
You can also bypass using the LDM driver by using the main device (e.g.
|
||||
/dev/hda) and then using the offsets of the LDM partitions into this device as
|
||||
the "Start sector of device" when creating the table. Once again ldminfo would
|
||||
give you the correct information to do this.
|
||||
|
||||
Assuming you know all your devices and their sizes things are easy.
|
||||
|
||||
For a linear raid the table would look like this (note all values are in
|
||||
512-byte sectors):
|
||||
|
||||
--- cut here ---
|
||||
# Offset into Size of this Raid type Device Start sector
|
||||
# volume device of device
|
||||
0 1028161 linear /dev/hda1 0
|
||||
1028161 3903762 linear /dev/hdb2 0
|
||||
4931923 2103211 linear /dev/hdc1 0
|
||||
--- cut here ---
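
In the linear table above, each "Offset into volume" is the sum of the sizes
of all preceding components. The following sketch generates such a table from
a list of devices and sizes; struct dm_component and format_linear_table() are
hypothetical names, and every component is assumed to start at sector 0 of its
device, as in the example.

```c
#include <assert.h>
#include <stdio.h>

/* One component of a linear volume set (hypothetical helper type). */
struct dm_component {
    const char *device;          /* e.g. "/dev/hda1" */
    unsigned long long sectors;  /* size in 512-byte sectors */
};

/* Write a linear Device-Mapper table into out, one line per component:
 *   <offset into volume> <size> linear <device> <start sector of device>
 * Returns the number of bytes written, or -1 if out is too small. */
static int format_linear_table(const struct dm_component *parts, size_t count,
                               char *out, size_t outlen)
{
    unsigned long long offset = 0;
    size_t used = 0;
    size_t i;

    assert(parts != NULL && out != NULL && outlen > 0);
    out[0] = '\0';

    for (i = 0; i < count; i++) {
        int n = snprintf(out + used, outlen - used,
                         "%llu %llu linear %s 0\n",
                         offset, parts[i].sectors, parts[i].device);
        if (n < 0 || (size_t)n >= outlen - used)
            return -1;
        used += (size_t)n;
        /* The next component begins where this one ends. */
        offset += parts[i].sectors;
    }
    return (int)used;
}
```

Fed the three components from the table above, the third line starts at offset
1028161 + 3903762 = 4931923.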
|
||||
|
||||
For a striped volume, i.e. raid level 0, you will need to know the chunk size
|
||||
you used when creating the volume. Windows uses 64kiB as the default, so it
|
||||
will probably be this unless you changed the defaults when creating the array.
|
||||
|
||||
For a raid level 0 the table would look like this (note all values are in
|
||||
512-byte sectors):
|
||||
|
||||
--- cut here ---
|
||||
# Offset Size Raid Number Chunk 1st Start 2nd Start
|
||||
# into of the type of size Device in Device in
|
||||
# volume volume stripes device device
|
||||
0 2056320 striped 2 128 /dev/hda1 0 /dev/hdb1 0
|
||||
--- cut here ---
|
||||
|
||||
If there are more than two devices, just add each of them to the end of the
|
||||
line.
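
The chunk size column is in 512-byte sectors, so the Windows default of 64kiB
becomes 128. A sketch of that conversion, together with a helper that builds
the striped line for any number of devices, follows. Both function names are
assumptions made for this example, and each device is assumed to start at
sector 0, as in the table above.

```c
#include <assert.h>
#include <stdio.h>

/* Convert a chunk size in KiB to 512-byte sectors (64 KiB -> 128). */
static unsigned long chunk_kib_to_sectors(unsigned long chunk_kib)
{
    return chunk_kib * 1024UL / 512UL;
}

/* Build one striped table line:
 *   0 <volume size> striped <number of stripes> <chunk size> <dev> 0 ...
 * Returns the number of bytes written, or -1 if out is too small. */
static int format_striped_line(unsigned long long volume_sectors,
                               unsigned long chunk_sectors,
                               const char *const *devices, size_t count,
                               char *out, size_t outlen)
{
    size_t used;
    size_t i;
    int n;

    assert(devices != NULL && out != NULL && outlen > 0);

    n = snprintf(out, outlen, "0 %llu striped %zu %lu",
                 volume_sectors, count, chunk_sectors);
    if (n < 0 || (size_t)n >= outlen)
        return -1;
    used = (size_t)n;

    /* Each further device is simply appended to the end of the line. */
    for (i = 0; i < count; i++) {
        n = snprintf(out + used, outlen - used, " %s 0", devices[i]);
        if (n < 0 || (size_t)n >= outlen - used)
            return -1;
        used += (size_t)n;
    }

    n = snprintf(out + used, outlen - used, "\n");
    if (n < 0 || (size_t)n >= outlen - used)
        return -1;
    return (int)(used + (size_t)n);
}
```

For the two-device example above this yields
"0 2056320 striped 2 128 /dev/hda1 0 /dev/hdb1 0".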
|
||||
|
||||
Finally, for a mirrored volume, i.e. raid level 1, the table would look like
|
||||
this (note all values are in 512-byte sectors):
|
||||
|
||||
--- cut here ---
|
||||
# Ofs Size Raid Log Number Region Should Number Source Start Target Start
|
||||
# in of the type type of log size sync? of Device in Device in
|
||||
# vol volume params mirrors Device Device
|
||||
0 2056320 mirror core 2 16 nosync 2 /dev/hda1 0 /dev/hdb1 0
|
||||
--- cut here ---
|
||||
|
||||
If you are mirroring to multiple devices you can specify further targets at the
|
||||
end of the line.
|
||||
|
||||
Note the "Should sync?" parameter "nosync" means that the two mirrors are
|
||||
already in sync which will be the case on a clean shutdown of Windows. If the
|
||||
mirrors are not clean, you can specify the "sync" option instead of "nosync"
|
||||
and the Device-Mapper driver will then copy the entirety of the "Source Device"
|
||||
to the "Target Device" or, if you specified multiple target devices, to all of
|
||||
them.
|
||||
|
||||
Once you have your table, save it in a file somewhere (e.g. /etc/ntfsvolume1),
|
||||
and hand it over to dmsetup to work with, like so:
|
||||
|
||||
$ dmsetup create myvolume1 /etc/ntfsvolume1
|
||||
|
||||
You can obviously replace "myvolume1" with whatever name you like.
|
||||
|
||||
If it all worked, you will now have the device /dev/device-mapper/myvolume1
|
||||
which you can then just use as an argument to the mount command as usual to
|
||||
mount the ntfs volume. For example:
|
||||
|
||||
$ mount -t ntfs -o ro /dev/device-mapper/myvolume1 /mnt/myvol1
|
||||
|
||||
(You need to create the directory /mnt/myvol1 first and of course you can use
|
||||
anything you like instead of /mnt/myvol1 as long as it is an existing
|
||||
directory.)
|
||||
|
||||
It is advisable to do the mount read-only to see if the volume has been set up
|
||||
correctly to avoid the possibility of causing damage to the data on the ntfs
|
||||
volume.
|
||||
|
||||
|
||||
The Software RAID / MD driver
|
||||
-----------------------------
|
||||
|
||||
An alternative to using the Device-Mapper driver is to use the kernel's
|
||||
Software RAID / MD driver, for which you need to set up your /etc/raidtab
|
||||
appropriately (see man 5 raidtab).
|
||||
|
||||
Linear volume sets, i.e. linear raid, as well as stripe sets, i.e. raid level
|
||||
0, have been tested and work fine (though see section "Limitations when using
|
||||
the Software RAID / MD driver" especially if you want to use linear raid).
|
||||
Even though untested, there is no reason why mirrors, i.e. raid level 1, and
|
||||
stripes with parity, i.e. raid level 5, should not work, too.
|
||||
|
||||
You have to use the "persistent-superblock 0" option for each raid-disk in the
|
||||
NTFS volume/stripe you are configuring in /etc/raidtab as the persistent
|
||||
superblock used by the MD driver would damage the NTFS volume.
|
||||
|
||||
Windows by default uses a stripe chunk size of 64k, so you probably want the
|
||||
"chunk-size 64k" option for each raid-disk, too.
|
||||
|
||||
For example, if you have a stripe set consisting of two partitions /dev/hda5
|
||||
and /dev/hdb1 your /etc/raidtab would look like this:
|
||||
|
||||
raiddev /dev/md0
|
||||
raid-level 0
|
||||
nr-raid-disks 2
|
||||
nr-spare-disks 0
|
||||
persistent-superblock 0
|
||||
chunk-size 64k
|
||||
device /dev/hda5
|
||||
raid-disk 0
|
||||
device /dev/hdb1
|
||||
raid-disk 1
|
||||
|
||||
For linear raid, just change the raid-level above to "raid-level linear", for
|
||||
mirrors, change it to "raid-level 1", and for stripe sets with parity, change
|
||||
it to "raid-level 5".
|
||||
|
||||
Note for stripe sets with parity you will also need to tell the MD driver
|
||||
which parity algorithm to use by specifying the option "parity-algorithm
|
||||
which", where you need to replace "which" with the name of the algorithm to
|
||||
use (see man 5 raidtab for available algorithms) and you will have to try the
|
||||
different available algorithms until you find one that works. Make sure you
|
||||
are working read-only when playing with this as you may damage your data
|
||||
otherwise. If you find which algorithm works please let us know (email the
|
||||
linux-ntfs developers list linux-ntfs-dev@lists.sourceforge.net or drop in on
|
||||
IRC in channel #ntfs on the irc.freenode.net network) so we can update this
|
||||
documentation.
|
||||
|
||||
Once the raidtab is set up, run for example raid0run -a to start all devices or
|
||||
raid0run /dev/md0 to start a particular md device, in this case /dev/md0.
|
||||
|
||||
Then just use the mount command as usual to mount the ntfs volume using for
|
||||
example: mount -t ntfs -o ro /dev/md0 /mnt/myntfsvolume
|
||||
|
||||
It is advisable to do the mount read-only to see if the md volume has been
|
||||
set up correctly to avoid the possibility of causing damage to the data on the
|
||||
ntfs volume.
|
||||
|
||||
|
||||
Limitations when using the Software RAID / MD driver
|
||||
-----------------------------------------------------
|
||||
|
||||
Using the md driver will not work properly if any of your NTFS partitions have
|
||||
an odd number of sectors. This is especially important for linear raid as all
|
||||
data after the first partition with an odd number of sectors will be offset by
|
||||
one or more sectors so if you mount such a partition with write support you
|
||||
will cause massive damage to the data on the volume which will only become
|
||||
apparent when you try to use the volume again under Windows.
|
||||
|
||||
So when using linear raid, make sure that all your partitions have an even
|
||||
number of sectors BEFORE attempting to use it. You have been warned!
|
||||
|
||||
Even better is to simply use the Device-Mapper for linear raid and then you do
|
||||
not have this problem with odd numbers of sectors.
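Before trying linear raid with the MD driver you can check the sector counts of
your partitions for odd values. The sketch below does this for an array of
sizes; first_odd_partition() is a hypothetical helper, and the sizes you pass
in are assumed to be in 512-byte sectors.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of the first partition with an odd number of 512-byte
 * sectors, or -1 if every partition has an even sector count.
 * Hypothetical helper for illustration only.
 */
static long first_odd_partition(const uint64_t *sectors, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (sectors[i] % 2 != 0)
			return (long)i;
	return -1;
}
```

For instance, the sizes used in the Device-Mapper linear example (1028161,
3903762 and 2103211 sectors) would be flagged at index 0, so that volume set
should be handled with the Device-Mapper rather than with the MD driver.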
|
||||
|
||||
|
||||
ChangeLog
|
||||
=========
|
||||
|
||||
Note, a technical ChangeLog aimed at kernel hackers is in fs/ntfs/ChangeLog.
|
||||
|
||||
2.1.28:
|
||||
- Fix a deadlock.
|
||||
2.1.27:
|
||||
- Implement page migration support so the kernel can move memory used
|
||||
by NTFS files and directories around for management purposes.
|
||||
- Add support for writing to sparse files created with Windows XP SP2.
|
||||
- Many minor improvements and bug fixes.
|
||||
2.1.26:
|
||||
- Implement support for sector sizes above 512 bytes (up to the maximum
|
||||
supported by NTFS which is 4096 bytes).
|
||||
- Enhance support for NTFS volumes which were supported by Windows but
|
||||
not by Linux due to invalid attribute list attribute flags.
|
||||
- A few minor updates and bug fixes.
|
||||
2.1.25:
|
||||
- Write support is now extended with write(2) being able to both
|
||||
overwrite existing file data and to extend files. Also, if a write
|
||||
to a sparse region occurs, write(2) will fill in the hole. Note,
|
||||
mmap(2) based writes still do not support writing into holes or
|
||||
writing beyond the initialized size.
|
||||
- Write support has a new feature and that is that truncate(2) and
|
||||
open(2) with O_TRUNC are now implemented thus files can be both made
|
||||
smaller and larger.
|
||||
- Note: Both write(2) and truncate(2)/open(2) with O_TRUNC still have
|
||||
limitations in that they
|
||||
- only provide limited support for highly fragmented files.
|
||||
- only work on regular, i.e. uncompressed and unencrypted files.
|
||||
- never create sparse files although this will change once directory
|
||||
operations are implemented.
|
||||
- Lots of bug fixes and enhancements across the board.
|
||||
2.1.24:
|
||||
- Support journals ($LogFile) which have been modified by chkdsk. This
|
||||
means users can boot into Windows after we marked the volume dirty.
|
||||
The Windows boot will run chkdsk and then reboot. The user can then
|
||||
immediately boot into Linux rather than having to do a full Windows
|
||||
boot first before rebooting into Linux and we will recognize such a
|
||||
journal and empty it as it is clean by definition.
|
||||
- Support journals ($LogFile) with only one restart page as well as
|
||||
journals with two different restart pages. We sanity check both and
|
||||
either use the only sane one or the more recent one of the two in the
|
||||
case that both are valid.
|
||||
- Lots of bug fixes and enhancements across the board.
|
||||
2.1.23:
|
||||
- Stamp the user space journal, aka transaction log, aka $UsnJrnl, if
|
||||
it is present and active thus telling Windows and applications using
|
||||
the transaction log that changes can have happened on the volume
|
||||
which are not recorded in $UsnJrnl.
|
||||
- Detect the case when Windows has been hibernated (suspended to disk)
|
||||
and if this is the case do not allow (re)mounting read-write to
|
||||
prevent data corruption when you boot back into the suspended
|
||||
Windows session.
|
||||
- Implement extension of resident files using the normal file write
|
||||
code paths, i.e. most very small files can be extended to be a little
|
||||
bit bigger but not by much.
|
||||
- Add new mount option "disable_sparse". (See list of mount options
|
||||
above for details.)
|
||||
- Improve handling of ntfs volumes with errors and strange boot sectors
|
||||
in particular.
|
||||
- Fix various bugs including a nasty deadlock that appeared in recent
|
||||
kernels (around 2.6.11-2.6.12 timeframe).
|
||||
2.1.22:
|
||||
- Improve handling of ntfs volumes with errors.
|
||||
- Fix various bugs and race conditions.
|
||||
2.1.21:
|
||||
- Fix several race conditions and various other bugs.
|
||||
- Many internal cleanups, code reorganization, optimizations, and mft
|
||||
and index record writing code rewritten to fit in with the changes.
|
||||
- Update Documentation/filesystems/ntfs.txt with instructions on how to
|
||||
use the Device-Mapper driver with NTFS ftdisk/LDM raid.
|
||||
2.1.20:
|
||||
- Fix two stupid bugs introduced in 2.1.18 release.
|
||||
2.1.19:
|
||||
- Minor bugfix in handling of the default upcase table.
|
||||
- Many internal cleanups and improvements. Many thanks to Linus
|
||||
Torvalds and Al Viro for the help and advice with the sparse
|
||||
annotations and cleanups.
|
||||
2.1.18:
|
||||
- Fix scheduling latencies at mount time. (Ingo Molnar)
|
||||
- Fix endianness bug in a little traversed portion of the attribute
|
||||
lookup code.
|
||||
2.1.17:
|
||||
- Fix bugs in mount time error code paths.
|
||||
2.1.16:
|
||||
- Implement access time updates (including mtime and ctime).
|
||||
- Implement fsync(2), fdatasync(2), and msync(2) system calls.
|
||||
- Enable the readv(2) and writev(2) system calls.
|
||||
- Enable access via the asynchronous io (aio) API by adding support for
|
||||
the aio_read(3) and aio_write(3) functions.
|
||||
2.1.15:
|
||||
- Invalidate quotas when (re)mounting read-write.
|
||||
NOTE: This now only leaves user space journalling on the side. (See
|
||||
note for version 2.1.13, below.)
|
||||
2.1.14:
|
||||
- Fix an NFSd caused deadlock reported by several users.
|
||||
2.1.13:
|
||||
- Implement writing of inodes (access time updates are not implemented
|
||||
yet so mounting with -o noatime,nodiratime is enforced).
|
||||
- Enable writing out of resident files so you can now overwrite any
|
||||
uncompressed, unencrypted, nonsparse file as long as you do not
|
||||
change the file size.
|
||||
- Add housekeeping of ntfs system files so that ntfsfix no longer needs
|
||||
to be run after writing to an NTFS volume.
|
||||
NOTE: This still leaves quota tracking and user space journalling on
|
||||
the side but they should not cause data corruption. In the worst
|
||||
case the charged quotas will be out of date ($Quota) and some
|
||||
userspace applications might get confused due to the out of date
|
||||
userspace journal ($UsnJrnl).
|
||||
2.1.12:
|
||||
- Fix the second fix to the decompression engine from the 2.1.9 release
|
||||
and some further internals cleanups.
|
||||
2.1.11:
|
||||
- Driver internal cleanups.
|
||||
2.1.10:
|
||||
- Force read-only (re)mounting of volumes with unsupported volume
|
||||
flags and various cleanups.
|
||||
2.1.9:
|
||||
- Fix two bugs in handling of corner cases in the decompression engine.
|
||||
2.1.8:
|
||||
- Read the $MFT mirror and compare it to the $MFT and if the two do not
|
||||
match, force a read-only mount and do not allow read-write remounts.
|
||||
- Read and parse the $LogFile journal and if it indicates that the
|
||||
volume was not shut down cleanly, force a read-only mount and do not
|
||||
allow read-write remounts. If the $LogFile indicates a clean
|
||||
shutdown and a read-write (re)mount is requested, empty $LogFile to
|
||||
ensure that Windows cannot cause data corruption by replaying a stale
|
||||
journal after Linux has written to the volume.
|
||||
- Improve time handling so that the NTFS time is fully preserved when
|
||||
converted to kernel time and only up to 99 nano-seconds are lost when
|
||||
kernel time is converted to NTFS time.
|
||||
2.1.7:
|
||||
- Enable NFS exporting of mounted NTFS volumes.
|
||||
2.1.6:
|
||||
- Fix minor bug in handling of compressed directories that fixes the
|
||||
erroneous "du" and "stat" output people reported.
|
||||
2.1.5:
|
||||
- Minor bug fix in attribute list attribute handling that fixes the
|
||||
I/O errors on "ls" of certain fragmented files found by at least two
|
||||
people running Windows XP.
|
||||
2.1.4:
|
||||
- Minor update allowing compilation with all gcc versions (well, the
|
||||
ones the kernel can be compiled with anyway).
|
||||
2.1.3:
|
||||
- Major bug fixes for reading files and volumes in corner cases which
|
||||
were being hit by Windows 2k/XP users.
|
||||
2.1.2:
|
||||
- Major bug fixes alleviating the hangs in statfs experienced by some
|
||||
users.
|
||||
2.1.1:
|
||||
- Update handling of compressed files so people no longer get the
|
||||
frequently reported warning messages about initialized_size !=
|
||||
data_size.
|
||||
2.1.0:
|
||||
- Add configuration option for developmental write support.
|
||||
- Initial implementation of file overwriting. (Writes to resident files
|
||||
are not written out to disk yet, so avoid writing to files smaller
|
||||
than about 1kiB.)
|
||||
- Intercept/abort changes in file size as they are not implemented yet.
|
||||
2.0.25:
|
||||
- Minor bugfixes in error code paths and small cleanups.
|
||||
2.0.24:
|
||||
- Small internal cleanups.
|
||||
- Support for sendfile system call. (Christoph Hellwig)
|
||||
2.0.23:
|
||||
- Massive internal locking changes to mft record locking. Fixes
|
||||
various race conditions and deadlocks.
|
||||
- Fix ntfs over loopback for compressed files by adding an
|
||||
optimization barrier. (gcc was screwing up otherwise ?)
|
||||
Thanks go to Christoph Hellwig for pointing these two out:
|
||||
- Remove now unused function fs/ntfs/malloc.h::vmalloc_nofs().
|
||||
- Fix ntfs_free() for ia64 and parisc.
|
||||
2.0.22:
|
||||
- Small internal cleanups.
|
||||
2.0.21:
|
||||
These only affect 32-bit architectures:
|
||||
- Check for, and refuse to mount too large volumes (maximum is 2TiB).
|
||||
- Check for, and refuse to open too large files and directories
|
||||
(maximum is 16TiB).
|
||||
2.0.20:
|
||||
- Support non-resident directory index bitmaps. This means we now cope
|
||||
with huge directories without problems.
|
||||
- Fix a page leak that manifested itself in some cases when reading
|
||||
directory contents.
|
||||
- Internal cleanups.
|
||||
2.0.19:
|
||||
- Fix race condition and improvements in block i/o interface.
|
||||
- Optimization when reading compressed files.
|
||||
2.0.18:
|
||||
- Fix race condition in reading of compressed files.
|
||||
2.0.17:
|
||||
- Cleanups and optimizations.
|
||||
2.0.16:
|
||||
- Fix stupid bug introduced in 2.0.15 in new attribute inode API.
|
||||
- Big internal cleanup replacing the mftbmp access hacks by using the
|
||||
new attribute inode API instead.
|
||||
2.0.15:
|
||||
- Bug fix in parsing of remount options.
|
||||
- Internal changes implementing attribute (fake) inodes allowing all
|
||||
attribute i/o to go via the page cache and to use all the normal
|
||||
vfs/mm functionality.
|
||||
2.0.14:
|
||||
- Internal changes improving run list merging code and minor locking
|
||||
change to not rely on BKL in ntfs_statfs().
|
||||
2.0.13:
|
||||
- Internal changes towards using iget5_locked() in preparation for
|
||||
fake inodes and small cleanups to ntfs_volume structure.
|
||||
2.0.12:
|
||||
- Internal cleanups in address space operations made possible by the
|
||||
changes introduced in the previous release.
|
||||
2.0.11:
|
||||
- Internal updates and cleanups introducing the first step towards
|
||||
fake inode based attribute i/o.
|
||||
2.0.10:
|
||||
- Microsoft says that the maximum number of inodes is 2^32 - 1. Update
|
||||
the driver accordingly to only use 32-bits to store inode numbers on
|
||||
32-bit architectures. This improves the speed of the driver a little.
|
||||
2.0.9:
|
||||
- Change decompression engine to use a single buffer. This should not
|
||||
affect performance except perhaps on the most heavy i/o on SMP
|
||||
systems when accessing multiple compressed files from multiple
|
||||
devices simultaneously.
|
||||
- Minor updates and cleanups.
|
||||
2.0.8:
|
||||
- Remove now obsolete show_inodes and posix mount option(s).
|
||||
- Restore show_sys_files mount option.
|
||||
- Add new mount option case_sensitive, to determine if the driver
|
||||
treats file names as case sensitive or not.
|
||||
- Mostly drop support for short file names (for backwards compatibility
|
||||
we only support accessing files via their short file name if one
|
||||
exists).
|
||||
- Fix dcache aliasing issues wrt short/long file names.
|
||||
- Cleanups and minor fixes.
|
||||
2.0.7:
|
||||
- Just cleanups.
|
||||
2.0.6:
|
||||
- Major bugfix to make compatible with other kernel changes. This fixes
|
||||
the hangs/oopses on umount.
|
||||
- Locking cleanup in directory operations (remove BKL usage).
|
||||
2.0.5:
|
||||
- Major buffer overflow bug fix.
|
||||
- Minor cleanups and updates for kernel 2.5.12.
|
||||
2.0.4:
|
||||
- Cleanups and updates for kernel 2.5.11.
|
||||
2.0.3:
|
||||
- Small bug fixes, cleanups, and performance improvements.
|
||||
2.0.2:
|
||||
- Use default fmask of 0177 so that files are not executable by default.
|
||||
If you want owner executable files, just use fmask=0077.
|
||||
- Update for kernel 2.5.9 but preserve backwards compatibility with
|
||||
kernel 2.5.7.
|
||||
- Minor bug fixes, cleanups, and updates.
|
||||
2.0.1:
|
||||
- Minor updates, primarily set the executable bit by default on files
|
||||
so they can be executed.
|
||||
2.0.0:
|
||||
- Started ChangeLog.
|
||||
|
||||
59
Documentation/filesystems/ocfs2.txt
Normal file
@@ -0,0 +1,59 @@
|
||||
OCFS2 filesystem
|
||||
==================
|
||||
OCFS2 is a general purpose extent based shared disk cluster file
|
||||
system with many similarities to ext3. It supports 64 bit inode
|
||||
numbers, and has automatically extending metadata groups which may
|
||||
also make it attractive for non-clustered use.
|
||||
|
||||
You'll want to install the ocfs2-tools package in order to at least
|
||||
get "mount.ocfs2" and "ocfs2_hb_ctl".
|
||||
|
||||
Project web page: http://oss.oracle.com/projects/ocfs2
|
||||
Tools web page: http://oss.oracle.com/projects/ocfs2-tools
|
||||
OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/
|
||||
|
||||
All code copyright 2005 Oracle except when otherwise noted.
|
||||
|
||||
CREDITS:
|
||||
Lots of code taken from ext3 and other projects.
|
||||
|
||||
Authors in alphabetical order:
|
||||
Joel Becker <joel.becker@oracle.com>
|
||||
Zach Brown <zach.brown@oracle.com>
|
||||
Mark Fasheh <mark.fasheh@oracle.com>
|
||||
Kurt Hackel <kurt.hackel@oracle.com>
|
||||
Sunil Mushran <sunil.mushran@oracle.com>
|
||||
Manish Singh <manish.singh@oracle.com>
|
||||
|
||||
Caveats
|
||||
=======
|
||||
Features which OCFS2 does not support yet:
|
||||
- sparse files
|
||||
- extended attributes
|
||||
- shared writable mmap
|
||||
- loopback is supported, but data written will not
|
||||
be cluster coherent.
|
||||
- quotas
|
||||
- cluster aware flock
|
||||
- cluster aware lockf
|
||||
- Directory change notification (F_NOTIFY)
|
||||
- Distributed Caching (F_SETLEASE/F_GETLEASE/break_lease)
|
||||
- POSIX ACLs
|
||||
- readpages / writepages (not user visible)
|
||||
|
||||
Mount options
|
||||
=============
|
||||
|
||||
OCFS2 supports the following mount options:
|
||||
(*) == default
|
||||
|
||||
barrier=1 This enables/disables barriers. barrier=0 disables it,
|
||||
barrier=1 enables it.
|
||||
errors=remount-ro(*) Remount the filesystem read-only on an error.
|
||||
errors=panic Panic and halt the machine if an error occurs.
|
||||
intr (*) Allow signals to interrupt cluster operations.
|
||||
nointr Do not allow signals to interrupt cluster
|
||||
operations.
|
||||
atime_quantum=60(*) OCFS2 will not update atime unless this number
|
||||
of seconds has passed since the last update.
|
||||
Set to zero to always update atime.
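The atime_quantum rule can be pictured as a simple time comparison. The sketch
below is a userspace illustration only, not the OCFS2 implementation:
atime_update_due() is a hypothetical name, and treating "exactly quantum
seconds elapsed" as due is an assumption, since the text above does not pin
down the boundary case.

```c
#include <stdbool.h>
#include <time.h>

/*
 * Decide whether atime should be updated. A quantum of zero means
 * "always update". Hypothetical helper; the handling of the exact
 * boundary (elapsed == quantum) is an assumption.
 */
static bool atime_update_due(time_t last_atime, time_t now, time_t quantum)
{
	if (quantum == 0)
		return true;
	return now - last_atime >= quantum;
}
```

With the default quantum of 60, a file read again ten seconds after its last
atime update would not trigger another update, while one read two minutes
later would.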
|
||||
267
Documentation/filesystems/porting
Normal file
@@ -0,0 +1,267 @@
|
||||
Changes since 2.5.0:
|
||||
|
||||
---
|
||||
[recommended]
|
||||
|
||||
New helpers: sb_bread(), sb_getblk(), sb_find_get_block(), set_bh(),
|
||||
sb_set_blocksize() and sb_min_blocksize().
|
||||
|
||||
Use them.
|
||||
|
||||
(sb_find_get_block() replaces 2.4's get_hash_table())
|
||||
|
||||
---
|
||||
[recommended]
|
||||
|
||||
New methods: ->alloc_inode() and ->destroy_inode().
|
||||
|
||||
Remove inode->u.foo_inode_i
|
||||
Declare
|
||||
struct foo_inode_info {
|
||||
/* fs-private stuff */
|
||||
struct inode vfs_inode;
|
||||
};
|
||||
static inline struct foo_inode_info *FOO_I(struct inode *inode)
|
||||
{
|
||||
return list_entry(inode, struct foo_inode_info, vfs_inode);
|
||||
}
|
||||
|
||||
Use FOO_I(inode) instead of &inode->u.foo_inode_i;
|
||||
|
||||
Add foo_alloc_inode() and foo_destroy_inode() - the former should allocate
|
||||
foo_inode_info and return the address of ->vfs_inode, the latter should free
|
||||
FOO_I(inode) (see in-tree filesystems for examples).
|
||||
|
||||
Make them ->alloc_inode and ->destroy_inode in your super_operations.
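The embedding pattern above can be tried out in userspace. The following is a
sketch under loud assumptions: struct inode is a one-field stand-in for the
kernel structure, calloc()/free() replace whatever allocator your filesystem
uses, foo_flags is a made-up private field, and the real ->alloc_inode() takes
a struct super_block * argument that is dropped here.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the kernel's struct inode; the real one has many fields. */
struct inode {
	unsigned long i_ino;
};

struct foo_inode_info {
	int foo_flags;			/* fs-private stuff (hypothetical field) */
	struct inode vfs_inode;		/* embedded generic inode */
};

/* Recover the containing foo_inode_info from the embedded inode. */
static inline struct foo_inode_info *FOO_I(struct inode *inode)
{
	return (struct foo_inode_info *)
		((char *)inode - offsetof(struct foo_inode_info, vfs_inode));
}

/* Allocate the container and return the address of ->vfs_inode. */
static struct inode *foo_alloc_inode(void)
{
	struct foo_inode_info *fi = calloc(1, sizeof(*fi));

	return fi ? &fi->vfs_inode : NULL;
}

/* Free the whole container, not just the embedded inode. */
static void foo_destroy_inode(struct inode *inode)
{
	free(FOO_I(inode));
}
```

The point of the sketch is that the generic code only ever sees the embedded
struct inode, while FOO_I() gets back to the private data without any pointer
stored in the inode itself.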
|
||||
|
||||
Keep in mind that now you need explicit initialization of private data -
|
||||
typically in ->read_inode() and after getting an inode from new_inode().
|
||||
|
||||
At some point that will become mandatory.
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
Change of file_system_type method (->read_super to ->get_sb)
|
||||
|
||||
->read_super() is no more. Ditto for DECLARE_FSTYPE and DECLARE_FSTYPE_DEV.
|
||||
|
||||
Turn your foo_read_super() into a function that would return 0 in case of
|
||||
success and negative number in case of error (-EINVAL unless you have more
|
||||
informative error value to report). Call it foo_fill_super(). Now declare
|
||||
|
||||
int foo_get_sb(struct file_system_type *fs_type,
|
||||
int flags, const char *dev_name, void *data, struct vfsmount *mnt)
|
||||
{
|
||||
return get_sb_bdev(fs_type, flags, dev_name, data, foo_fill_super,
|
||||
mnt);
|
||||
}
|
||||
|
||||
(or similar with s/bdev/nodev/ or s/bdev/single/, depending on the kind of
|
||||
filesystem).
|
||||
|
||||
Replace DECLARE_FSTYPE... with explicit initializer and have ->get_sb set as
|
||||
foo_get_sb.
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
Locking change: ->s_vfs_rename_sem is taken only by cross-directory renames.
|
||||
Most likely there is no need to change anything, but if you relied on
|
||||
global exclusion between renames for some internal purpose - you need to
|
||||
change your internal locking. Otherwise exclusion warranties remain the
|
||||
same (i.e. parents and victim are locked, etc.).
|
||||
|
||||
---
|
||||
[informational]
|
||||
|
||||
Now we have the exclusion between ->lookup() and directory removal (by
|
||||
->rmdir() and ->rename()). If you used to need that exclusion and do
|
||||
it by internal locking (most of filesystems couldn't care less) - you
|
||||
can relax your locking.
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
->lookup(), ->truncate(), ->create(), ->unlink(), ->mknod(), ->mkdir(),
|
||||
->rmdir(), ->link(), ->lseek(), ->symlink(), ->rename()
|
||||
and ->readdir() are called without BKL now. Grab it on entry, drop upon return
|
||||
- that will guarantee the same locking you used to have. If your method or its
|
||||
parts do not need BKL - better yet, now you can shift lock_kernel() and
|
||||
unlock_kernel() so that they would protect exactly what needs to be
|
||||
protected.
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
BKL is also moved from around sb operations. ->write_super() is now called
|
||||
without BKL held. BKL should have been shifted into individual fs sb_op
|
||||
functions. If you don't need it, remove it.
|
||||
|
||||
---
|
||||
[informational]
|
||||
|
||||
check for ->link() target not being a directory is done by callers. Feel
|
||||
free to drop it...
|
||||
|
||||
---
|
||||
[informational]
|
||||
|
||||
->link() callers hold ->i_sem on the object we are linking to. Some of your
|
||||
problems might be over...
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
new file_system_type method - kill_sb(superblock). If you are converting
|
||||
an existing filesystem, set it according to ->fs_flags:
|
||||
FS_REQUIRES_DEV - kill_block_super
|
||||
FS_LITTER - kill_litter_super
|
||||
neither - kill_anon_super
|
||||
FS_LITTER is gone - just remove it from fs_flags.
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
FS_SINGLE is gone (actually, that had happened back when ->get_sb()
|
||||
went in - and hadn't been documented ;-/). Just remove it from fs_flags
|
||||
(and see ->get_sb() entry for other actions).
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
->setattr() is called without BKL now. Caller _always_ holds ->i_sem, so
|
||||
watch for ->i_sem-grabbing code that might be used by your ->setattr().
|
||||
Callers of notify_change() need ->i_sem now.
|
||||
|
||||
---
|
||||
[recommended]
|
||||
|
||||
New super_block field "struct export_operations *s_export_op" for
|
||||
explicit support for exporting, e.g. via NFS. The structure is fully
|
||||
documented at its declaration in include/linux/fs.h, and in
|
||||
Documentation/filesystems/Exporting.
|
||||
|
||||
Briefly it allows for the definition of decode_fh and encode_fh operations
|
||||
to encode and decode filehandles, and allows the filesystem to use
|
||||
a standard helper function for decode_fh, and provide file-system specific
|
||||
support for this helper, particularly get_parent.
|
||||
|
||||
It is planned that this will be required for exporting once the code
|
||||
settles down a bit.
|
||||
|
||||
[mandatory]
|
||||
|
||||
s_export_op is now required for exporting a filesystem.
|
||||
isofs, ext2, ext3, reiserfs, fat
|
||||
can be used as examples of very different filesystems.
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
iget4() and the read_inode2 callback have been superseded by iget5_locked()
|
||||
which has the following prototype,
|
||||
|
||||
struct inode *iget5_locked(struct super_block *sb, unsigned long ino,
|
||||
int (*test)(struct inode *, void *),
|
||||
int (*set)(struct inode *, void *),
|
||||
void *data);
|
||||
|
||||
'test' is an additional function that can be used when the inode
|
||||
number is not sufficient to identify the actual file object. 'set'
|
||||
should be a non-blocking function that initializes those parts of a
|
||||
newly created inode to allow the test function to succeed. 'data' is
|
||||
passed as an opaque value to both test and set functions.
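As an illustration of the 'test' and 'set' callbacks, here is a userspace
sketch for a filesystem that needs more than the inode number to identify an
object. Everything except the two callback signatures is an assumption: struct
inode is a minimal stand-in, and foo_generation and struct foo_iget_key are
hypothetical names (in a real filesystem the extra field would normally live in
the fs-private inode info).

```c
#include <stddef.h>

/* Minimal stand-in for the kernel's struct inode. */
struct inode {
	unsigned long i_ino;
	unsigned int foo_generation;	/* hypothetical extra identity field */
};

/* Hypothetical lookup key passed as the opaque 'data' argument. */
struct foo_iget_key {
	unsigned long ino;
	unsigned int generation;
};

/* 'test': return non-zero if this inode is the object described by data. */
static int foo_test(struct inode *inode, void *data)
{
	struct foo_iget_key *key = data;

	return inode->i_ino == key->ino &&
	       inode->foo_generation == key->generation;
}

/*
 * 'set': initialize just enough of a new inode for foo_test() to
 * succeed on it. Must not block. Returns 0 on success.
 */
static int foo_set(struct inode *inode, void *data)
{
	struct foo_iget_key *key = data;

	inode->i_ino = key->ino;
	inode->foo_generation = key->generation;
	return 0;
}
```

A caller would then pass foo_test, foo_set and a pointer to the key as the last
three arguments of iget5_locked().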
|
||||
|
||||
When the inode has been created by iget5_locked(), it will be returned with
|
||||
the I_NEW flag set and will still be locked. read_inode has not been
|
||||
called so the file system still has to finalize the initialization. Once
|
||||
the inode is initialized it must be unlocked by calling unlock_new_inode().
|
||||
|
||||
The filesystem is responsible for setting (and possibly testing) i_ino
|
||||
when appropriate. There is also a simpler iget_locked function that
|
||||
just takes the superblock and inode number as arguments and does the
|
||||
test and set for you.
|
||||
|
||||
e.g.
|
||||
inode = iget_locked(sb, ino);
|
||||
if (inode->i_state & I_NEW) {
|
||||
read_inode_from_disk(inode);
|
||||
unlock_new_inode(inode);
|
||||
}
|
||||
|
||||
---
|
||||
[recommended]
|
||||
|
||||
->getattr() finally getting used. See instances in nfs, minix, etc.
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
->revalidate() is gone. If your filesystem had it - provide ->getattr()
|
||||
and let it call whatever you had as ->revalidate() + (for symlinks that
|
||||
had ->revalidate()) add calls in ->follow_link()/->readlink().
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
->d_parent changes are not protected by BKL anymore. Read access is safe
|
||||
if at least one of the following is true:
|
||||
* filesystem has no cross-directory rename()
|
||||
* dcache_lock is held
|
||||
* we know that parent had been locked (e.g. we are looking at
|
||||
->d_parent of ->lookup() argument).
|
||||
* we are called from ->rename().
|
||||
* the child's ->d_lock is held
|
||||
Audit your code and add locking if needed. Notice that any place that is
|
||||
not protected by the conditions above is risky even in the old tree - you
|
||||
had been relying on BKL and that's prone to screwups. Old tree had quite
|
||||
a few holes of that kind - unprotected access to ->d_parent leading to
|
||||
anything from oops to silent memory corruption.
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
FS_NOMOUNT is gone. If you use it - just set MS_NOUSER in flags
|
||||
(see rootfs for one kind of solution and bdev/socket/pipe for another).
|
||||
|
||||
---
|
||||
[recommended]
|
||||
|
||||
Use bdev_read_only(bdev) instead of is_read_only(kdev). The latter
|
||||
is still alive, but only because of the mess in drivers/s390/block/dasd.c.
|
||||
As soon as it gets fixed is_read_only() will die.
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
->permission() is called without BKL now. Grab it on entry, drop upon
|
||||
return - that will guarantee the same locking you used to have. If
|
||||
your method or its parts do not need BKL - better yet, now you can
|
||||
shift lock_kernel() and unlock_kernel() so that they would protect
|
||||
exactly what needs to be protected.
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
->statfs() is now called without BKL held. BKL should have been
|
||||
shifted into individual fs sb_op functions where it's not clear that
|
||||
it's safe to remove it. If you don't need it, remove it.
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
is_read_only() is gone; use bdev_read_only() instead.
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
destroy_buffers() is gone; use invalidate_bdev().
|
||||
|
||||
---
|
||||
[mandatory]
|
||||
|
||||
fsync_dev() is gone; use fsync_bdev(). NOTE: lvm breakage is
|
||||
deliberate; as soon as struct block_device * is propagated in a reasonable
|
||||
way by that code fixing will become trivial; until then nothing can be
|
||||
done.
|
||||
2097
Documentation/filesystems/proc.txt
Normal file
File diff suppressed because it is too large
355
Documentation/filesystems/ramfs-rootfs-initramfs.txt
Normal file
@@ -0,0 +1,355 @@
|
||||
ramfs, rootfs and initramfs
|
||||
October 17, 2005
|
||||
Rob Landley <rob@landley.net>
|
||||
=============================
|
||||
|
||||
What is ramfs?
|
||||
--------------
|
||||
|
||||
Ramfs is a very simple filesystem that exports Linux's disk caching
|
||||
mechanisms (the page cache and dentry cache) as a dynamically resizable
|
||||
ram-based filesystem.
|
||||
|
||||
Normally all files are cached in memory by Linux. Pages of data read from
|
||||
backing store (usually the block device the filesystem is mounted on) are kept
|
||||
around in case it's needed again, but marked as clean (freeable) in case the
|
||||
Virtual Memory system needs the memory for something else. Similarly, data
|
||||
written to files is marked clean as soon as it has been written to backing
|
||||
store, but kept around for caching purposes until the VM reallocates the
|
||||
memory. A similar mechanism (the dentry cache) greatly speeds up access to
|
||||
directories.
|
||||
|
||||
With ramfs, there is no backing store. Files written into ramfs allocate
|
||||
dentries and page cache as usual, but there's nowhere to write them to.
|
||||
This means the pages are never marked clean, so they can't be freed by the
|
||||
VM when it's looking to recycle memory.
|
||||
|
||||
The amount of code required to implement ramfs is tiny, because all the
|
||||
work is done by the existing Linux caching infrastructure. Basically,
|
||||
you're mounting the disk cache as a filesystem. Because of this, ramfs is not
|
||||
an optional component removable via menuconfig, since there would be negligible
|
||||
space savings.
|
||||
|
||||
ramfs and ramdisk:
|
||||
------------------
|
||||
|
||||
The older "ram disk" mechanism created a synthetic block device out of
|
||||
an area of ram and used it as backing store for a filesystem. This block
|
||||
device was of fixed size, so the filesystem mounted on it was of fixed
|
||||
size. Using a ram disk also required unnecessarily copying memory from the
|
||||
fake block device into the page cache (and copying changes back out), as well
|
||||
as creating and destroying dentries. Plus it needed a filesystem driver
|
||||
(such as ext2) to format and interpret this data.
|
||||
|
||||
Compared to ramfs, this wastes memory (and memory bus bandwidth), creates
|
||||
unnecessary work for the CPU, and pollutes the CPU caches. (There are tricks
|
||||
to avoid this copying by playing with the page tables, but they're unpleasantly
|
||||
complicated and turn out to be about as expensive as the copying anyway.)
|
||||
More to the point, all the work ramfs is doing has to happen _anyway_,
|
||||
since all file access goes through the page and dentry caches. The ram
|
||||
disk is simply unnecessary; ramfs is internally much simpler.
|
||||
|
||||
Another reason ramdisks are semi-obsolete is that the introduction of
|
||||
loopback devices offered a more flexible and convenient way to create
|
||||
synthetic block devices, now from files instead of from chunks of memory.
|
||||
See losetup (8) for details.
|
||||
|
||||
ramfs and tmpfs:
|
||||
----------------
|
||||
|
||||
One downside of ramfs is you can keep writing data into it until you fill
|
||||
up all memory, and the VM can't free it because the VM thinks that files
|
||||
should get written to backing store (rather than swap space), but ramfs hasn't
|
||||
got any backing store. Because of this, only root (or a trusted user) should
|
||||
be allowed write access to a ramfs mount.
|
||||
|
||||
A ramfs derivative called tmpfs was created to add size limits, and the ability
|
||||
to write the data to swap space. Normal users can be allowed write access to
|
||||
tmpfs mounts. See Documentation/filesystems/tmpfs.txt for more information.
|
||||
|
||||
What is rootfs?
|
||||
---------------
|
||||
|
||||
Rootfs is a special instance of ramfs (or tmpfs, if that's enabled), which is
|
||||
always present in 2.6 systems. You can't unmount rootfs for approximately the
|
||||
same reason you can't kill the init process; rather than having special code
|
||||
to check for and handle an empty list, it's smaller and simpler for the kernel
|
||||
to just make sure certain lists can't become empty.
|
||||
|
||||
Most systems just mount another filesystem over rootfs and ignore it. The
|
||||
amount of space an empty instance of ramfs takes up is tiny.
|
||||
|
||||
What is initramfs?
|
||||
------------------
|
||||
|
||||
All 2.6 Linux kernels contain a gzipped "cpio" format archive, which is
|
||||
extracted into rootfs when the kernel boots up. After extracting, the kernel
|
||||
checks to see if rootfs contains a file "init", and if so it executes it as PID
|
||||
1. If found, this init process is responsible for bringing the system the
|
||||
rest of the way up, including locating and mounting the real root device (if
|
||||
any). If rootfs does not contain an init program after the embedded cpio
|
||||
archive is extracted into it, the kernel will fall through to the older code
|
||||
to locate and mount a root partition, then exec some variant of /sbin/init
|
||||
out of that.
|
||||
|
||||
All this differs from the old initrd in several ways:
|
||||
|
||||
- The old initrd was always a separate file, while the initramfs archive is
|
||||
linked into the linux kernel image. (The directory linux-*/usr is devoted
|
||||
to generating this archive during the build.)
|
||||
|
||||
- The old initrd file was a gzipped filesystem image (in some file format,
|
||||
such as ext2, that needed a driver built into the kernel), while the new
|
||||
initramfs archive is a gzipped cpio archive (like tar only simpler,
|
||||
see cpio(1) and Documentation/early-userspace/buffer-format.txt). The
|
||||
kernel's cpio extraction code is not only extremely small, it's also
|
||||
__init data that can be discarded during the boot process.
|
||||
|
||||
- The program run by the old initrd (which was called /linuxrc, not /init) did
|
||||
some setup and then returned to the kernel, while the init program from
|
||||
initramfs is not expected to return to the kernel. (If /init needs to hand
|
||||
off control it can overmount / with a new root device and exec another init
|
||||
program. See the switch_root utility, below.)
|
||||
|
||||
- When switching to another root device, initrd would pivot_root and then
|
||||
umount the ramdisk. But initramfs is rootfs: you can neither pivot_root
|
||||
rootfs, nor unmount it. Instead delete everything out of rootfs to
|
||||
free up the space (find / -xdev -exec rm '{}' ';'), overmount rootfs
|
||||
with the new root (cd /newmount; mount --move . /; chroot .), attach
|
||||
stdin/stdout/stderr to the new /dev/console, and exec the new init.
|
||||
|
||||
Since this is a remarkably persnickety process (and involves deleting
|
||||
commands before you can run them), the klibc package introduced a helper
|
||||
program (utils/run_init.c) to do all this for you. Most other packages
|
||||
(such as busybox) have named this command "switch_root".
|
||||
|
||||
Populating initramfs:
|
||||
---------------------
|
||||
|
||||
The 2.6 kernel build process always creates a gzipped cpio format initramfs
|
||||
archive and links it into the resulting kernel binary. By default, this
|
||||
archive is empty (consuming 134 bytes on x86).
|
||||
|
||||
The config option CONFIG_INITRAMFS_SOURCE (for some reason buried under
|
||||
devices->block devices in menuconfig, and living in usr/Kconfig) can be used
|
||||
to specify a source for the initramfs archive, which will automatically be
|
||||
incorporated into the resulting binary. This option can point to an existing
|
||||
gzipped cpio archive, a directory containing files to be archived, or a text
|
||||
file specification such as the following example:
|
||||
|
||||
dir /dev 755 0 0
|
||||
nod /dev/console 644 0 0 c 5 1
|
||||
nod /dev/loop0 644 0 0 b 7 0
|
||||
dir /bin 755 1000 1000
|
||||
slink /bin/sh busybox 777 0 0
|
||||
file /bin/busybox initramfs/busybox 755 0 0
|
||||
dir /proc 755 0 0
|
||||
dir /sys 755 0 0
|
||||
dir /mnt 755 0 0
|
||||
file /init initramfs/init.sh 755 0 0
|
||||
|
||||
Run "usr/gen_init_cpio" (after the kernel build) to get a usage message
|
||||
documenting the above file format.
|
||||
|
||||
One advantage of the configuration file is that root access is not required to
|
||||
set permissions or create device nodes in the new archive. (Note that those
|
||||
two example "file" entries expect to find files named "init.sh" and "busybox" in
|
||||
a directory called "initramfs", under the linux-2.6.* directory. See
|
||||
Documentation/early-userspace/README for more details.)
|
||||
|
||||
The kernel does not depend on external cpio tools. If you specify a
|
||||
directory instead of a configuration file, the kernel's build infrastructure
|
||||
creates a configuration file from that directory (usr/Makefile calls
|
||||
scripts/gen_initramfs_list.sh), and proceeds to package up that directory
|
||||
using the config file (by feeding it to usr/gen_init_cpio, which is created
|
||||
from usr/gen_init_cpio.c). The kernel's build-time cpio creation code is
|
||||
entirely self-contained, and the kernel's boot-time extractor is also
|
||||
(obviously) self-contained.
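As a rough illustration of the directory-to-list step described above, the
sketch below emits "dir" and "file" lines in the format of the earlier example.
It is not the kernel's scripts/gen_initramfs_list.sh. The function name
spec_from_dir is made up for this example, and it assumes that every entry
should get mode 755 and uid/gid 0, that only directories and regular files
matter, and that no name contains whitespace.

```sh
#!/bin/sh
# Emit gen_init_cpio-style "dir"/"file" lines for the tree rooted at $1.
# Assumptions (not taken from the kernel build): mode 755 and uid/gid 0
# for everything, directories and regular files only, no whitespace in names.
spec_from_dir() {
    src=$1
    # A sorted listing keeps each directory ahead of its contents,
    # because a parent path is a prefix of its children.
    (cd "$src" && find . | LC_ALL=C sort) | while read -r path; do
        name=${path#.}              # "./bin/sh" -> "/bin/sh", "." -> ""
        [ -n "$name" ] || continue  # skip the top-level directory itself
        if [ -d "$src$name" ]; then
            echo "dir $name 755 0 0"
        elif [ -f "$src$name" ]; then
            echo "file $name $src$name 755 0 0"
        fi
    done
}
```

Something like "spec_from_dir initramfs > initramfs_list" would then give a
text file of the kind CONFIG_INITRAMFS_SOURCE can point to.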
|
||||
|
||||
The one thing you might need external cpio utilities installed for is creating
|
||||
or extracting your own preprepared cpio files to feed to the kernel build
|
||||
(instead of a config file or directory).
|
||||
|
||||
The following command line can extract a cpio image (created either by the
|
||||
script below or by the kernel build) back into its component files:
|
||||
|
||||
cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames
|
||||
|
||||
The following shell script can create a prebuilt cpio archive you can
|
||||
use in place of the above config file:
|
||||
|
||||
#!/bin/sh
|
||||
|
||||
# Copyright 2006 Rob Landley <rob@landley.net> and TimeSys Corporation.
|
||||
# Licensed under GPL version 2
|
||||
|
||||
if [ $# -ne 2 ]
|
||||
then
|
||||
echo "usage: mkinitramfs directory imagename.cpio.gz"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if [ -d "$1" ]
|
||||
then
|
||||
echo "creating $2 from $1"
|
||||
(cd "$1"; find . | cpio -o -H newc | gzip) > "$2"
|
||||
else
|
||||
echo "First argument must be a directory"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
Note: The cpio man page contains some bad advice that will break your initramfs
|
||||
archive if you follow it. It says "A typical way to generate the list
|
||||
of filenames is with the find command; you should give find the -depth option
|
||||
to minimize problems with permissions on directories that are unwritable or not
|
||||
searchable." Don't do this when creating initramfs.cpio.gz images; it won't
|
||||
work. The Linux kernel cpio extractor won't create files in a directory that
|
||||
doesn't exist, so the directory entries must go before the files that go in
|
||||
those directories. The above script gets them in the right order.
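The ordering requirement can be checked before an archive is built. The
following is a minimal sketch, assuming the file list arrives one path per
line on stdin in the "./dir/file" form that find prints. The function name
parents_first is hypothetical and is not part of cpio or the kernel tree.

```sh
#!/bin/sh
# Succeed only if every path on stdin is listed after its parent
# directory, which is the order the kernel's extractor needs.
parents_first() {
    awk '
        {
            seen[$0] = 1
            parent = $0
            sub("/[^/]*$", "", parent)      # strip the last path component
            # A path with no "/" (such as ".") has no parent to check.
            if (parent != $0 && !(parent in seen))
                bad = 1
        }
        END { exit bad }'
}

# Usage, from inside the directory to be archived:
#   find . | parents_first          # plain find: parents come first
#   find . -depth | parents_first   # -depth: children first, check fails
```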
|
||||
|
||||
External initramfs images:
|
||||
--------------------------
|
||||
|
||||
If the kernel has initrd support enabled, an external cpio.gz archive can also
|
||||
be passed into a 2.6 kernel in place of an initrd. In this case, the kernel
|
||||
will autodetect the type (initramfs, not initrd) and extract the external cpio
|
||||
archive into rootfs before trying to run /init.
|
||||
|
||||
This has the memory efficiency advantages of initramfs (no ramdisk block
|
||||
device) but the separate packaging of initrd (which is nice if you have
|
||||
non-GPL code you'd like to run from initramfs, without conflating it with
|
||||
the GPL licensed Linux kernel binary).
|
||||
|
||||
It can also be used to supplement the kernel's built-in initramfs image. The
|
||||
files in the external archive will overwrite any conflicting files in
|
||||
the built-in initramfs archive. Some distributors also prefer to customize
|
||||
a single kernel image with task-specific initramfs images, without recompiling.
|
||||
|
||||
Contents of initramfs:
|
||||
----------------------
|
||||
|
||||
An initramfs archive is a complete self-contained root filesystem for Linux.
|
||||
If you don't already understand what shared libraries, devices, and paths
|
||||
you need to get a minimal root filesystem up and running, here are some
|
||||
references:
|
||||
http://www.tldp.org/HOWTO/Bootdisk-HOWTO/
|
||||
http://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html
|
||||
http://www.linuxfromscratch.org/lfs/view/stable/
|
||||
|
||||
The "klibc" package (http://www.kernel.org/pub/linux/libs/klibc) is
|
||||
designed to be a tiny C library to statically link early userspace
|
||||
code against, along with some related utilities. It is BSD licensed.
|
||||
|
||||
I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net)
|
||||
myself. These are LGPL and GPL, respectively. (A self-contained initramfs
|
||||
package is planned for the busybox 1.3 release.)
|
||||
|
||||
In theory you could use glibc, but that's not well suited for small embedded
|
||||
uses like this. (A "hello world" program statically linked against glibc is
|
||||
over 400k. With uClibc it's 7k. Also note that glibc dlopens libnss to do
|
||||
name lookups, even when otherwise statically linked.)
|
||||
|
||||
A good first step is to get initramfs to run a statically linked "hello world"
|
||||
program as init, and test it under an emulator like qemu (www.qemu.org) or
|
||||
User Mode Linux, like so:
|
||||
|
||||
cat > hello.c << EOF
|
||||
#include <stdio.h>
|
||||
#include <unistd.h>
|
||||
|
||||
int main(int argc, char *argv[])
|
||||
{
|
||||
printf("Hello world!\n");
|
||||
sleep(999999999);
|
||||
}
|
||||
EOF
|
||||
gcc -static hello.c -o init
|
||||
echo init | cpio -o -H newc | gzip > test.cpio.gz
|
||||
# Testing external initramfs using the initrd loading mechanism.
|
||||
qemu -kernel /boot/vmlinuz -initrd test.cpio.gz /dev/zero
|
||||
|
||||
When debugging a normal root filesystem, it's nice to be able to boot with
|
||||
"init=/bin/sh". The initramfs equivalent is "rdinit=/bin/sh", and it's
|
||||
just as useful.
|
||||
|
||||
Why cpio rather than tar?
|
||||
-------------------------
|
||||
|
||||
This decision was made back in December, 2001. The discussion started here:
|
||||
|
||||
http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1538.html
|
||||
|
||||
And spawned a second thread (specifically on tar vs cpio), starting here:
|
||||
|
||||
http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1587.html
|
||||
|
||||
The quick and dirty summary version (which is no substitute for reading
|
||||
the above threads) is:
|
||||
|
||||
1) cpio is a standard. It's decades old (from the AT&T days), and already
|
||||
widely used on Linux (inside RPM, Red Hat's device driver disks). Here's
|
||||
a Linux Journal article about it from 1996:
|
||||
|
||||
http://www.linuxjournal.com/article/1213
|
||||
|
||||
It's not as popular as tar because the traditional cpio command line tools
|
||||
require _truly_hideous_ command line arguments. But that says nothing
|
||||
either way about the archive format, and there are alternative tools,
|
||||
such as:
|
||||
|
||||
http://freshmeat.net/projects/afio/
|
||||
|
||||
2) The cpio archive format chosen by the kernel is simpler and cleaner (and
|
||||
thus easier to create and parse) than any of the (literally dozens of)
|
||||
various tar archive formats. The complete initramfs archive format is
|
||||
explained in buffer-format.txt, created in usr/gen_init_cpio.c, and
|
||||
extracted in init/initramfs.c. All three together come to less than 26k
|
||||
total of human-readable text.
|
||||
|
||||
3) The GNU project standardizing on tar is approximately as relevant as
|
||||
Windows standardizing on zip. Linux is not part of either, and is free
|
||||
to make its own technical decisions.
|
||||
|
||||
4) Since this is a kernel internal format, it could easily have been
|
||||
something brand new. The kernel provides its own tools to create and
|
||||
extract this format anyway. Using an existing standard was preferable,
|
||||
but not essential.
|
||||
|
||||
5) Al Viro made the decision (quote: "tar is ugly as hell and not going to be
|
||||
supported on the kernel side"):
|
||||
|
||||
http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1540.html
|
||||
|
||||
explained his reasoning:
|
||||
|
||||
http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1550.html
|
||||
http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1638.html
|
||||
|
||||
and, most importantly, designed and implemented the initramfs code.
|
||||
|
||||
Future directions:
|
||||
------------------
|
||||
|
||||
Today (2.6.16), initramfs is always compiled in, but not always used. The
|
||||
kernel falls back to legacy boot code that is reached only if initramfs does
|
||||
not contain an /init program. The fallback is legacy code, there to ensure a
|
||||
smooth transition and to allow early boot functionality to gradually move to
|
||||
"early userspace" (I.E. initramfs).
|
||||
|
||||
The move to early userspace is necessary because finding and mounting the real
|
||||
root device is complex. Root partitions can span multiple devices (raid or
|
||||
separate journal). They can be out on the network (requiring dhcp, setting a
|
||||
specific mac address, logging into a server, etc). They can live on removable
|
||||
media, with dynamically allocated major/minor numbers and persistent naming
|
||||
issues requiring a full udev implementation to sort out. They can be
|
||||
compressed, encrypted, copy-on-write, loopback mounted, strangely partitioned,
|
||||
and so on.
|
||||
|
||||
This kind of complexity (which inevitably includes policy) is rightly handled
|
||||
in userspace. Both klibc and busybox/uClibc are working on simple initramfs
|
||||
packages to drop into a kernel build.
|
||||
|
||||
The klibc package has now been accepted into Andrew Morton's 2.6.17-mm tree.
|
||||
The kernel's current early boot code (partition detection, etc) will probably
|
||||
be migrated into a default initramfs, automatically created and used by the
|
||||
kernel build.
|
||||
484
Documentation/filesystems/relay.txt
Normal file
@@ -0,0 +1,484 @@
|
||||
relay interface (formerly relayfs)
|
||||
==================================
|
||||
|
||||
The relay interface provides a means for kernel applications to
|
||||
efficiently log and transfer large quantities of data from the kernel
|
||||
to userspace via user-defined 'relay channels'.
|
||||
|
||||
A 'relay channel' is a kernel->user data relay mechanism implemented
|
||||
as a set of per-cpu kernel buffers ('channel buffers'), each
|
||||
represented as a regular file ('relay file') in user space. Kernel
|
||||
clients write into the channel buffers using efficient write
|
||||
functions; these automatically log into the current cpu's channel
|
||||
buffer. User space applications mmap() or read() from the relay files
|
||||
and retrieve the data as it becomes available. The relay files
|
||||
themselves are files created in a host filesystem, e.g. debugfs, and
|
||||
are associated with the channel buffers using the API described below.
|
||||
|
||||
The format of the data logged into the channel buffers is completely
|
||||
up to the kernel client; the relay interface does however provide
|
||||
hooks which allow kernel clients to impose some structure on the
|
||||
buffer data. The relay interface doesn't implement any form of data
|
||||
filtering - this also is left to the kernel client. The purpose is to
|
||||
keep things as simple as possible.
|
||||
|
||||
This document provides an overview of the relay interface API. The
|
||||
details of the function parameters are documented along with the
|
||||
functions in the relay interface code - please see that for details.
|
||||
|
||||
Semantics
|
||||
=========
|
||||
|
||||
Each relay channel has one buffer per CPU, each buffer has one or more
|
||||
sub-buffers. Messages are written to the first sub-buffer until it is
|
||||
too full to contain a new message, in which case it is written to
|
||||
the next (if available). Messages are never split across sub-buffers.
|
||||
At this point, userspace can be notified so it empties the first
|
||||
sub-buffer, while the kernel continues writing to the next.
|
||||
|
||||
When userspace is notified that a sub-buffer is full, the kernel knows how many
|
||||
bytes of it are padding i.e. unused space occurring because a complete
|
||||
message couldn't fit into a sub-buffer. Userspace can use this
|
||||
knowledge to copy only valid data.
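The sub-buffer and padding rules above can be modelled with a few lines of
arithmetic. This is only a userspace model, not relay code: the function name
simulate_subbufs is invented for illustration, and it assumes that no single
message is larger than a sub-buffer.

```sh
#!/bin/sh
# Model only: place messages of the given lengths into sub-buffers of
# size $1 and report the padding left in each sub-buffer that fills up.
simulate_subbufs() {
    size=$1
    shift
    index=0
    used=0
    for len in "$@"; do
        # A message is never split: if it does not fit, the rest of the
        # current sub-buffer becomes padding and the next one is used.
        if [ $((used + len)) -gt "$size" ]; then
            echo "subbuf $index full: $used bytes data, $((size - used)) bytes padding"
            index=$((index + 1))
            used=0
        fi
        used=$((used + len))
    done
    echo "subbuf $index current: $used bytes data"
}
```

For example, "simulate_subbufs 16 6 6 6 4" reports sub-buffer 0 as full with
12 bytes of data and 4 bytes of padding, and sub-buffer 1 as holding 10 bytes.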
|
||||
|
||||
After copying it, userspace can notify the kernel that a sub-buffer
|
||||
has been consumed.
|
||||
|
||||
A relay channel can operate in a mode where it will overwrite data not
|
||||
yet collected by userspace, and not wait for it to be consumed.
|
||||
|
||||
The relay channel itself does not provide for communication of such
|
||||
data between userspace and kernel, allowing the kernel side to remain
|
||||
simple and not impose a single interface on userspace. It does
|
||||
provide a set of examples and a separate helper though, described
|
||||
below.
|
||||
|
||||
The read() interface both removes padding and internally consumes the
|
||||
read sub-buffers; thus in cases where read(2) is being used to drain
|
||||
the channel buffers, special-purpose communication between kernel and
|
||||
user isn't necessary for basic operation.
|
||||
|
||||
One of the major goals of the relay interface is to provide a low
|
||||
overhead mechanism for conveying kernel data to userspace. While the
|
||||
read() interface is easy to use, it's not as efficient as the mmap()
|
||||
approach; the example code attempts to make the tradeoff between the
|
||||
two approaches as small as possible.
|
||||
|
||||
klog and relay-apps example code
|
||||
================================
|
||||
|
||||
The relay interface itself is ready to use, but to make things easier,
|
||||
a couple of simple utility functions and a set of examples are provided.
|
||||
|
||||
The relay-apps example tarball, available on the relay sourceforge
|
||||
site, contains a set of self-contained examples, each consisting of a
|
||||
pair of .c files containing boilerplate code for each of the user and
|
||||
kernel sides of a relay application. When combined these two sets of
|
||||
boilerplate code provide glue to easily stream data to disk, without
|
||||
having to bother with mundane housekeeping chores.
|
||||
|
||||
The 'klog debugging functions' patch (klog.patch in the relay-apps
|
||||
tarball) provides a couple of high-level logging functions to the
|
||||
kernel which allow writing formatted text or raw data to a channel,
|
||||
regardless of whether a channel to write into exists or not, or even
|
||||
whether the relay interface is compiled into the kernel or not. These
|
||||
functions allow you to put unconditional 'trace' statements anywhere
|
||||
in the kernel or kernel modules; only when there is a 'klog handler'
|
||||
registered will data actually be logged (see the klog and kleak
|
||||
examples for details).
|
||||
|
||||
It is of course possible to use the relay interface from scratch,
|
||||
i.e. without using any of the relay-apps example code or klog, but
|
||||
you'll have to implement communication between userspace and kernel,
|
||||
allowing both to convey the state of buffers (full, empty, amount of
|
||||
padding). The read() interface both removes padding and internally
|
||||
consumes the read sub-buffers; thus in cases where read(2) is being
|
||||
used to drain the channel buffers, special-purpose communication
|
||||
between kernel and user isn't necessary for basic operation. Things
|
||||
such as buffer-full conditions would still need to be communicated via
|
||||
some channel though.
|
||||
|
||||
klog and the relay-apps examples can be found in the relay-apps
|
||||
tarball on http://relayfs.sourceforge.net
|
||||
|
||||
The relay interface user space API
|
||||
==================================
|
||||
|
||||
The relay interface implements basic file operations for user space
|
||||
access to relay channel buffer data. Here are the file operations
|
||||
that are available and some comments regarding their behavior:
|
||||
|
||||
open() enables user to open an _existing_ channel buffer.
|
||||
|
||||
mmap() results in channel buffer being mapped into the caller's
|
||||
memory space. Note that you can't do a partial mmap - you
|
||||
must map the entire file, which is NRBUF * SUBBUFSIZE.
|
||||
|
||||
read() read the contents of a channel buffer. The bytes read are
|
||||
'consumed' by the reader, i.e. they won't be available
|
||||
again to subsequent reads. If the channel is being used
|
||||
in no-overwrite mode (the default), it can be read at any
|
||||
time even if there's an active kernel writer. If the
|
||||
channel is being used in overwrite mode and there are
|
||||
active channel writers, results may be unpredictable -
|
||||
users should make sure that all logging to the channel has
|
||||
ended before using read() with overwrite mode. Sub-buffer
|
||||
padding is automatically removed and will not be seen by
|
||||
the reader.
|
||||
|
||||
sendfile() transfer data from a channel buffer to an output file
|
||||
descriptor. Sub-buffer padding is automatically removed
|
||||
and will not be seen by the reader.
|
||||
|
||||
poll() POLLIN/POLLRDNORM/POLLERR supported. User applications are
|
||||
notified when sub-buffer boundaries are crossed.
|
||||
|
||||
close() decrements the channel buffer's refcount. When the refcount
|
||||
reaches 0, i.e. when no process or kernel client has the
|
||||
buffer open, the channel buffer is freed.
|
||||
|
||||
In order for a user application to make use of relay files, the
|
||||
host filesystem must be mounted. For example,
|
||||
|
||||
mount -t debugfs debugfs /debug
|
||||
|
||||
NOTE: the host filesystem doesn't need to be mounted for kernel
|
||||
clients to create or use channels - it only needs to be
|
||||
mounted when user space applications need access to the buffer
|
||||
data.
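Because read() consumes data and strips padding, ordinary tools can drain the
relay files once the host filesystem is mounted. The sketch below copies each
per-cpu file to a log file. The function name drain_relay_files and the
".log" suffix are made up for illustration, and it assumes that a read on a
relay file returns end-of-file once the available data has been consumed.

```sh
#!/bin/sh
# Copy the currently available contents of every per-cpu relay file
# named <base>0, <base>1, ... in directory $1 into $3/<name>.log.
drain_relay_files() {
    dir=$1
    base=$2
    out=$3
    for f in "$dir/$base"[0-9]*; do
        [ -e "$f" ] || continue         # the glob matched nothing
        cat "$f" > "$out/${f##*/}.log"  # cat uses plain read() calls
    done
}

# Usage, assuming debugfs is mounted on /debug and the channel's base
# filename is "cpu":
#   drain_relay_files /debug cpu /tmp
```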
|
||||
|
||||
|
||||
The relay interface kernel API
|
||||
==============================
|
||||
|
||||
Here's a summary of the API the relay interface provides to in-kernel clients:
|
||||
|
||||
|
||||
channel management functions:
|
||||
|
||||
relay_open(base_filename, parent, subbuf_size, n_subbufs,
|
||||
callbacks, private_data)
|
||||
relay_close(chan)
|
||||
relay_flush(chan)
|
||||
relay_reset(chan)
|
||||
|
||||
channel management typically called on instigation of userspace:
|
||||
|
||||
relay_subbufs_consumed(chan, cpu, subbufs_consumed)
|
||||
|
||||
write functions:
|
||||
|
||||
relay_write(chan, data, length)
|
||||
__relay_write(chan, data, length)
|
||||
relay_reserve(chan, length)
|
||||
|
||||
callbacks:
|
||||
|
||||
subbuf_start(buf, subbuf, prev_subbuf, prev_padding)
|
||||
buf_mapped(buf, filp)
|
||||
buf_unmapped(buf, filp)
|
||||
create_buf_file(filename, parent, mode, buf, is_global)
|
||||
remove_buf_file(dentry)
|
||||
|
||||
helper functions:
|
||||
|
||||
relay_buf_full(buf)
|
||||
subbuf_start_reserve(buf, length)
|
||||
|
||||
|
||||
Creating a channel
|
||||
------------------
|
||||
|
||||
relay_open() is used to create a channel, along with its per-cpu
|
||||
channel buffers. Each channel buffer will have an associated file
|
||||
created for it in the host filesystem, which can be mmapped or
|
||||
read from in user space. The files are named basename0...basenameN-1
|
||||
where N is the number of online cpus, and by default will be created
|
||||
in the root of the filesystem (if the parent param is NULL). If you
|
||||
want a directory structure to contain your relay files, you should
|
||||
create it using the host filesystem's directory creation function,
|
||||
e.g. debugfs_create_dir(), and pass the parent directory to
|
||||
relay_open(). Users are responsible for cleaning up any directory
|
||||
structure they create, when the channel is closed - again the host
|
||||
filesystem's directory removal functions should be used for that,
|
||||
e.g. debugfs_remove().
|
||||
|
||||
In order for a channel to be created and the host filesystem's files
|
||||
associated with its channel buffers, the user must provide definitions
|
||||
for two callback functions, create_buf_file() and remove_buf_file().
|
||||
create_buf_file() is called once for each per-cpu buffer from
|
||||
relay_open() and allows the user to create the file which will be used
|
||||
to represent the corresponding channel buffer. The callback should
|
||||
return the dentry of the file created to represent the channel buffer.
|
||||
remove_buf_file() must also be defined; it's responsible for deleting
|
||||
the file(s) created in create_buf_file() and is called during
|
||||
relay_close().
|
||||
|
||||
Here are some typical definitions for these callbacks, in this case
|
||||
using debugfs:
|
||||
|
||||
/*
|
||||
* create_buf_file() callback. Creates relay file in debugfs.
|
||||
*/
|
||||
static struct dentry *create_buf_file_handler(const char *filename,
|
||||
struct dentry *parent,
|
||||
int mode,
|
||||
struct rchan_buf *buf,
|
||||
int *is_global)
|
||||
{
|
||||
return debugfs_create_file(filename, mode, parent, buf,
|
||||
&relay_file_operations);
|
||||
}
|
||||
|
||||
/*
|
||||
* remove_buf_file() callback. Removes relay file from debugfs.
|
||||
*/
|
||||
static int remove_buf_file_handler(struct dentry *dentry)
|
||||
{
|
||||
debugfs_remove(dentry);
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
/*
|
||||
* relay interface callbacks
|
||||
*/
|
||||
static struct rchan_callbacks relay_callbacks =
|
||||
{
|
||||
.create_buf_file = create_buf_file_handler,
|
||||
.remove_buf_file = remove_buf_file_handler,
|
||||
};
|
||||
|
||||
And an example relay_open() invocation using them:
|
||||
|
||||
chan = relay_open("cpu", NULL, SUBBUF_SIZE, N_SUBBUFS, &relay_callbacks, NULL);
|
||||
|
||||
If the create_buf_file() callback fails, or isn't defined, channel
|
||||
creation and thus relay_open() will fail.
|
||||
|
||||
The total size of each per-cpu buffer is calculated by multiplying the
|
||||
number of sub-buffers by the sub-buffer size passed into relay_open().
|
||||
The idea behind sub-buffers is that they're basically an extension of
|
||||
double-buffering to N buffers, and they also allow applications to
|
||||
easily implement random-access-on-buffer-boundary schemes, which can
|
||||
be important for some high-volume applications. The number and size
|
||||
of sub-buffers is completely dependent on the application and even for
|
||||
the same application, different conditions will warrant different
|
||||
values for these parameters at different times. Typically, the right
|
||||
values to use are best decided after some experimentation; in general,
|
||||
though, it's safe to assume that having only 1 sub-buffer is a bad
|
||||
idea - you're guaranteed to either overwrite data or lose events
|
||||
depending on the channel mode being used.
|
||||
|
||||
The create_buf_file() implementation can also be defined in such a way
|
||||
as to allow the creation of a single 'global' buffer instead of the
|
||||
default per-cpu set. This can be useful for applications interested
|
||||
mainly in seeing the relative ordering of system-wide events without
|
||||
the need to bother with saving explicit timestamps for the purpose of
|
||||
merging/sorting per-cpu files in a postprocessing step.
|
||||
|
||||
To have relay_open() create a global buffer, the create_buf_file()
|
||||
implementation should set the value of the is_global outparam to a
|
||||
non-zero value in addition to creating the file that will be used to
|
||||
represent the single buffer. In the case of a global buffer,
|
||||
create_buf_file() and remove_buf_file() will be called only once. The
|
||||
normal channel-writing functions, e.g. relay_write(), can still be
|
||||
used - writes from any cpu will transparently end up in the global
|
||||
buffer - but since it is a global buffer, callers should make sure
|
||||
they use the proper locking for such a buffer, either by wrapping
|
||||
writes in a spinlock, or by copying a write function from relay.h and
|
||||
creating a local version that internally does the proper locking.
|
||||
|
||||
The private_data passed into relay_open() allows clients to associate
|
||||
user-defined data with a channel, and is immediately available
|
||||
(including in create_buf_file()) via chan->private_data or
|
||||
buf->chan->private_data.
|
||||
|
||||
Channel 'modes'
|
||||
---------------
|
||||
|
||||
relay channels can be used in either of two modes - 'overwrite' or
|
||||
'no-overwrite'. The mode is entirely determined by the implementation
|
||||
of the subbuf_start() callback, as described below. The default if no
|
||||
subbuf_start() callback is defined is 'no-overwrite' mode. If the
|
||||
default mode suits your needs, and you plan to use the read()
|
||||
interface to retrieve channel data, you can ignore the details of this
|
||||
section, as it pertains mainly to mmap() implementations.
|
||||
|
||||
In 'overwrite' mode, also known as 'flight recorder' mode, writes
|
||||
continuously cycle around the buffer and will never fail, but will
|
||||
unconditionally overwrite old data regardless of whether it's actually
|
||||
been consumed. In no-overwrite mode, writes will fail, i.e. data will
|
||||
be lost, if the number of unconsumed sub-buffers equals the total
|
||||
number of sub-buffers in the channel. It should be clear that if
|
||||
there is no consumer or if the consumer can't consume sub-buffers fast
|
||||
enough, data will be lost in either case; the only difference is
|
||||
whether data is lost from the beginning or the end of a buffer.
|
||||
|
||||
As explained above, a relay channel is made up of one or more
|
||||
per-cpu channel buffers, each implemented as a circular buffer
|
||||
subdivided into one or more sub-buffers. Messages are written into
|
||||
the current sub-buffer of the channel's current per-cpu buffer via the
|
||||
write functions described below. Whenever a message can't fit into
|
||||
the current sub-buffer, because there's no room left for it, the
|
||||
client is notified via the subbuf_start() callback that a switch to a
|
||||
new sub-buffer is about to occur. The client uses this callback to 1)
|
||||
initialize the next sub-buffer if appropriate 2) finalize the previous
|
||||
sub-buffer if appropriate and 3) return a boolean value indicating
|
||||
whether or not to actually move on to the next sub-buffer.
|
||||
|
||||
To implement 'no-overwrite' mode, the kernel client would provide
|
||||
an implementation of the subbuf_start() callback something like the
|
||||
following:
|
||||
|
||||
static int subbuf_start(struct rchan_buf *buf,
                        void *subbuf,
                        void *prev_subbuf,
                        unsigned int prev_padding)
{
        /* finalize the previous sub-buffer: store its padding count */
        if (prev_subbuf)
                *((unsigned *)prev_subbuf) = prev_padding;

        /* all sub-buffers are still unconsumed: refuse the switch */
        if (relay_buf_full(buf))
                return 0;

        /* reserve room for the padding count in the new sub-buffer */
        subbuf_start_reserve(buf, sizeof(unsigned int));

        return 1;
}
|
||||
|
||||
If the current buffer is full, i.e. all sub-buffers remain unconsumed,
|
||||
the callback returns 0 to indicate that the buffer switch should not
|
||||
occur yet, i.e. until the consumer has had a chance to read the
|
||||
current set of ready sub-buffers. For the relay_buf_full() function
|
||||
to make sense, the consumer is responsible for notifying the relay
|
||||
interface when sub-buffers have been consumed via
|
||||
relay_subbufs_consumed(). Any subsequent attempts to write into the
|
||||
buffer will again invoke the subbuf_start() callback with the same
|
||||
parameters; only when the consumer has consumed one or more of the
|
||||
ready sub-buffers will relay_buf_full() return 0, in which case the
|
||||
buffer switch can continue.
|
||||
|
||||
The implementation of the subbuf_start() callback for 'overwrite' mode
|
||||
would be very similar:
|
||||
|
||||
static int subbuf_start(struct rchan_buf *buf,
                        void *subbuf,
                        void *prev_subbuf,
                        unsigned int prev_padding)
{
        /* finalize the previous sub-buffer: store its padding count */
        if (prev_subbuf)
                *((unsigned *)prev_subbuf) = prev_padding;

        /* reserve room for the padding count in the new sub-buffer */
        subbuf_start_reserve(buf, sizeof(unsigned int));

        /* always switch, even if old data has not been consumed */
        return 1;
}
|
||||
|
||||
In this case, the relay_buf_full() check is meaningless and the
|
||||
callback always returns 1, causing the buffer switch to occur
|
||||
unconditionally. It's also meaningless for the client to use the
|
||||
relay_subbufs_consumed() function in this mode, as it's never
|
||||
consulted.
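
The difference between the two callbacks above comes down to one
check. The following is a user-space model of that check, not kernel
code: struct chan_model and the model_*() functions are hypothetical
names, and the sketch assumes that a buffer counts as full when the
number of produced but not yet consumed sub-buffers reaches the total
number of sub-buffers, as described for no-overwrite mode.

```c
#include <assert.h>

/* Hypothetical user-space model of one channel buffer; this is not the
 * kernel's struct rchan_buf. */
struct chan_model {
	unsigned int produced;   /* sub-buffers filled by the writer */
	unsigned int consumed;   /* sub-buffers reported as consumed */
	unsigned int n_subbufs;  /* total sub-buffers in the buffer */
};

/* Assumed meaning of "full": every sub-buffer is still unconsumed. */
static int model_buf_full(const struct chan_model *m)
{
	assert(m->n_subbufs > 0);
	return m->produced - m->consumed >= m->n_subbufs;
}

/* Mirrors the return value of the two subbuf_start() variants:
 * 0 = do not switch, 1 = switch to the next sub-buffer. */
static int model_subbuf_start(const struct chan_model *m, int overwrite)
{
	if (!overwrite && model_buf_full(m))
		return 0;
	return 1;
}

/* Models what a consumer reports through relay_subbufs_consumed(). */
static void model_subbufs_consumed(struct chan_model *m, unsigned int n)
{
	m->consumed += n;
}
```

In this model a no-overwrite writer stalls once all sub-buffers are
unconsumed and resumes as soon as the consumer reports one of them,
while an overwrite writer never looks at the counters.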
|
||||
|
||||
The default subbuf_start() implementation, used if the client doesn't
|
||||
define any callbacks, or doesn't define the subbuf_start() callback,
|
||||
implements the simplest possible 'no-overwrite' mode, i.e. it does
|
||||
nothing but return 0.
|
||||
|
||||
Header information can be reserved at the beginning of each sub-buffer
|
||||
by calling the subbuf_start_reserve() helper function from within the
|
||||
subbuf_start() callback. This reserved area can be used to store
|
||||
whatever information the client wants. In the example above, room is
|
||||
reserved in each sub-buffer to store the padding count for that
|
||||
sub-buffer. This is filled in for the previous sub-buffer in the
|
||||
subbuf_start() implementation; the padding value for the previous
|
||||
sub-buffer is passed into the subbuf_start() callback along with a
|
||||
pointer to the previous sub-buffer, since the padding value isn't
|
||||
known until a sub-buffer is filled. The subbuf_start() callback is
|
||||
also called for the first sub-buffer when the channel is opened, to
|
||||
give the client a chance to reserve space in it. In this case the
|
||||
previous sub-buffer pointer passed into the callback will be NULL, so
|
||||
the client should check the value of the prev_subbuf pointer before
|
||||
writing into the previous sub-buffer.
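
On the consumer side, a header like the one in the example can be used
to work out how much of a sub-buffer holds real data. The sketch below
is a user-space illustration and not part of the relay API:
subbuf_payload_len() is a hypothetical helper, and it assumes the
layout produced by the example callback, i.e. an unsigned int padding
count at the start of each finalized sub-buffer, where the padding
counts the unused bytes at the end of that sub-buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the number of data bytes that follow the
 * padding header in one finalized sub-buffer of subbuf_size bytes.
 * Returns 0 if the header is inconsistent with subbuf_size. */
static size_t subbuf_payload_len(const void *subbuf, size_t subbuf_size)
{
	unsigned int padding;

	assert(subbuf != NULL);
	if (subbuf_size < sizeof(padding))
		return 0;

	/* memcpy avoids alignment assumptions about the mapped buffer. */
	memcpy(&padding, subbuf, sizeof(padding));
	if (padding > subbuf_size - sizeof(padding))
		return 0;

	return subbuf_size - sizeof(padding) - padding;
}
```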
|
||||
|
||||
Writing to a channel
|
||||
--------------------
|
||||
|
||||
Kernel clients write data into the current cpu's channel buffer using
|
||||
relay_write() or __relay_write(). relay_write() is the main logging
|
||||
function - it uses local_irq_save() to protect the buffer and should be
|
||||
used if you might be logging from interrupt context. If you know
|
||||
you'll never be logging from interrupt context, you can use
|
||||
__relay_write(), which only disables preemption. These functions
|
||||
don't return a value, so you can't determine whether or not they
|
||||
failed - the assumption is that you wouldn't want to check a return
|
||||
value in the fast logging path anyway, and that they'll always succeed
|
||||
unless the buffer is full and no-overwrite mode is being used, in
|
||||
which case you can detect a failed write in the subbuf_start()
|
||||
callback by calling the relay_buf_full() helper function.
|
||||
|
||||
relay_reserve() is used to reserve a slot in a channel buffer which
|
||||
can be written to later. This would typically be used in applications
|
||||
that need to write directly into a channel buffer without having to
|
||||
stage data in a temporary buffer beforehand. Because the actual write
|
||||
may not happen immediately after the slot is reserved, applications
|
||||
using relay_reserve() can keep a count of the number of bytes actually
|
||||
written, either in space reserved in the sub-buffers themselves or as
|
||||
a separate array. See the 'reserve' example in the relay-apps tarball
|
||||
at http://relayfs.sourceforge.net for an example of how this can be
|
||||
done. Because the write is under control of the client and is
|
||||
separated from the reserve, relay_reserve() doesn't protect the buffer
|
||||
at all - it's up to the client to provide the appropriate
|
||||
synchronization when using relay_reserve().
|
||||
|
||||
Closing a channel
|
||||
-----------------
|
||||
|
||||
The client calls relay_close() when it's finished using the channel.
|
||||
The channel and its associated buffers are destroyed when there are no
|
||||
longer any references to any of the channel buffers. relay_flush()
|
||||
forces a sub-buffer switch on all the channel buffers, and can be used
|
||||
to finalize and process the last sub-buffers before the channel is
|
||||
closed.
|
||||
|
||||
Misc
|
||||
----
|
||||
|
||||
Some applications may want to keep a channel around and re-use it
|
||||
rather than open and close a new channel for each use. relay_reset()
|
||||
can be used for this purpose - it resets a channel to its initial
|
||||
state without reallocating channel buffer memory or destroying
|
||||
existing mappings. It should however only be called when it's safe to
|
||||
do so, i.e. when the channel isn't currently being written to.
|
||||
|
||||
Finally, there are a couple of utility callbacks that can be used for
|
||||
different purposes. buf_mapped() is called whenever a channel buffer
|
||||
is mmapped from user space and buf_unmapped() is called when it's
|
||||
unmapped. The client can use this notification to trigger actions
|
||||
within the kernel application, such as enabling/disabling logging to
|
||||
the channel.
|
||||
|
||||
|
||||
Resources
|
||||
=========
|
||||
|
||||
For news, example code, mailing list, etc. see the relay interface homepage:
|
||||
|
||||
http://relayfs.sourceforge.net
|
||||
|
||||
|
||||
Credits
|
||||
=======
|
||||
|
||||
The ideas and specs for the relay interface came about as a result of
|
||||
discussions on tracing involving the following:
|
||||
|
||||
Michel Dagenais <michel.dagenais@polymtl.ca>
|
||||
Richard Moore <richardj_moore@uk.ibm.com>
|
||||
Bob Wisniewski <bob@watson.ibm.com>
|
||||
Karim Yaghmour <karim@opersys.com>
|
||||
Tom Zanussi <zanussi@us.ibm.com>
|
||||
|
||||
Also thanks to Hubertus Franke for a lot of useful suggestions and bug
|
||||
reports.
|
||||
187
Documentation/filesystems/romfs.txt
Normal file
@@ -0,0 +1,187 @@
|
||||
ROMFS - ROM FILE SYSTEM
|
||||
|
||||
This is a quite dumb, read only filesystem, mainly for initial RAM
|
||||
disks of installation disks. It has grown up by the need of having
|
||||
modules linked at boot time. Using this filesystem, you get a very
|
||||
similar feature, and even the possibility of a small kernel, with a
|
||||
file system which doesn't take up useful memory from the router
|
||||
functions in the basement of your office.
|
||||
|
||||
For comparison, both the older minix and xiafs (the latter is now
|
||||
defunct) filesystems, compiled as modules, need more than 20000 bytes,
|
||||
while romfs is less than a page, about 4000 bytes (assuming i586
|
||||
code). Under the same conditions, the msdos filesystem would need
|
||||
about 30K (and does not support device nodes or symlinks), while the
|
||||
nfs module with nfsroot is about 57K. Furthermore, as a bit unfair
|
||||
comparison, an actual rescue disk used up 3202 blocks with ext2, while
|
||||
with romfs, it needed 3079 blocks.
|
||||
|
||||
To create such a file system, you'll need a user program named
|
||||
genromfs. It is available via anonymous ftp on sunsite.unc.edu and
|
||||
its mirrors, in the /pub/Linux/system/recovery/ directory.
|
||||
|
||||
As the name suggests, romfs could be also used (space-efficiently) on
|
||||
various read-only media, like (E)EPROM disks if someone will have the
|
||||
motivation.. :)
|
||||
|
||||
However, the main purpose of romfs is to have a very small kernel,
|
||||
which has only this filesystem linked in, and then can load any module
|
||||
later, with the current module utilities. It can also be used to run
|
||||
some program to decide if you need SCSI devices, and even IDE or
|
||||
floppy drives can be loaded later if you use the "initrd"--initial
|
||||
RAM disk--feature of the kernel. This would not be really news
|
||||
flash, but with romfs, you can even spare off your ext2 or minix or
|
||||
maybe even affs filesystem until you really know that you need it.
|
||||
|
||||
For example, a distribution boot disk can contain only the cd disk
|
||||
drivers (and possibly the SCSI drivers), and the ISO 9660 filesystem
|
||||
module. The kernel can be small enough, since it doesn't have other
|
||||
filesystems, like the quite large ext2fs module, which can then be
|
||||
loaded off the CD at a later stage of the installation. Another use
|
||||
would be for a recovery disk, when you are reinstalling a workstation
|
||||
from the network, and you will have all the tools/modules available
|
||||
from a nearby server, so you don't want to carry two disks for this
|
||||
purpose, just because it won't fit into ext2.
|
||||
|
||||
romfs operates on block devices as you can expect, and the underlying
|
||||
structure is very simple. Every accessible structure begins on 16
|
||||
byte boundaries for fast access. The minimum space a file will take
|
||||
is 32 bytes (this is an empty file, with a less than 16 character
|
||||
name). The maximum overhead for any non-empty file is the header, and
|
||||
the 16 byte padding for the name and the contents, i.e. 16+14+15 = 45
|
||||
bytes. This is quite rare however, since most file names are longer
|
||||
than 3 bytes, and shorter than 15 bytes.
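
The space arithmetic can be sketched as follows. romfs_align16() and
romfs_entry_size() are hypothetical helpers written for this example,
not functions from the romfs source. The sketch assumes a 16 byte
header, a zero terminated name padded to 16 bytes, and data padded to
16 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round n up to the next multiple of 16. */
static size_t romfs_align16(size_t n)
{
	return (n + 15) & ~(size_t)15;
}

/* Hypothetical helper: bytes occupied by one file entry, given the length
 * of its name (without the terminating zero) and of its data. */
static size_t romfs_entry_size(size_t name_len, size_t data_len)
{
	size_t header = 16;  /* four longwords */
	size_t name, data;

	/* keep the rounding below from wrapping around */
	assert(name_len <= SIZE_MAX - 16 && data_len <= SIZE_MAX - 15);
	name = romfs_align16(name_len + 1);  /* name plus zero byte */
	data = romfs_align16(data_len);

	return header + name + data;
}
```

With these assumptions an empty file with a name shorter than 16
characters takes 32 bytes. A one character name with one byte of data
takes 48 bytes, which is 45 bytes more than the name, its terminator
and the data themselves; that appears to be how the 16+14+15 figure is
counted.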
|
||||
|
||||
The layout of the filesystem is the following:
|
||||
|
||||
offset      content

        +---+---+---+---+
  0     | - | r | o | m |  \
        +---+---+---+---+   The ASCII representation of those bytes
  4     | 1 | f | s | - |  /    (i.e. "-rom1fs-")
        +---+---+---+---+
  8     |   full size   |   The number of accessible bytes in this fs.
        +---+---+---+---+
 12     |    checksum   |   The checksum of the FIRST 512 BYTES.
        +---+---+---+---+
 16     | volume name   |   The zero terminated name of the volume,
        :               :   padded to 16 byte boundary.
        +---+---+---+---+
 xx     |     file      |
        :    headers    :
|
||||
|
||||
Every multi byte value (32 bit words, I'll use the longwords term from
|
||||
now on) must be in big endian order.
|
||||
|
||||
The first eight bytes identify the filesystem, even for the casual
|
||||
inspector. After that, in the 3rd longword, it contains the number of
|
||||
bytes accessible from the start of this filesystem. The 4th longword
|
||||
is the checksum of the first 512 bytes (or the number of bytes
|
||||
accessible, whichever is smaller). The applied algorithm is the same
|
||||
as in the AFFS filesystem, namely a simple sum of the longwords
|
||||
(assuming bigendian quantities again). For details, please consult
|
||||
the source. This algorithm was chosen because although it's not quite
|
||||
reliable, it does not require any tables, and it is very simple.
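
A minimal sketch of checking these fields on an image held in memory
follows. The function names are made up for this example. The sketch
also assumes that a valid image is one where the longwords of the
checksummed area, including the checksum field itself, add up to zero
modulo 2^32; the text above only says that the algorithm is a simple
sum, so check that convention against the source before relying on it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read one big-endian longword. */
static uint32_t romfs_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Sum the big-endian longwords of the first 512 bytes, or of the whole
 * image if it is smaller. Overflow wraps modulo 2^32. */
static uint32_t romfs_sum(const unsigned char *img, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(img != NULL);
	if (len > 512)
		len = 512;
	for (i = 0; i + 4 <= len; i += 4)
		sum += romfs_be32(img + i);
	return sum;
}

/* Returns 1 if the image starts with "-rom1fs-", its size field fits in
 * len, and (assumption) the checksummed area sums to zero. */
static int romfs_super_ok(const unsigned char *img, size_t len)
{
	uint32_t full_size;

	if (len < 16 || memcmp(img, "-rom1fs-", 8) != 0)
		return 0;
	full_size = romfs_be32(img + 8);
	if (full_size > len)
		return 0;
	return romfs_sum(img, full_size) == 0;
}
```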
|
||||
|
||||
The following bytes are now part of the file system; each file header
|
||||
must begin on a 16 byte boundary.
|
||||
|
||||
offset      content

        +---+---+---+---+
  0     | next filehdr|X|   The offset of the next file header
        +---+---+---+---+     (zero if no more files)
  4     |   spec.info   |   Info for directories/hard links/devices
        +---+---+---+---+
  8     |     size      |   The size of this file in bytes
        +---+---+---+---+
 12     |   checksum    |   Covering the meta data, including the file
        +---+---+---+---+     name, and padding
 16     | file name     |   The zero terminated name of the file,
        :               :   padded to 16 byte boundary
        +---+---+---+---+
 xx     | file data     |
        :               :
|
||||
|
||||
Since the file headers begin always at a 16 byte boundary, the lowest
|
||||
4 bits would be always zero in the next filehdr pointer. These four
|
||||
bits are used for the mode information. Bits 0..2 specify the type of
|
||||
the file, while bit 3 shows if the file is executable or not. The
|
||||
permissions are assumed to be world readable, if this bit is not set,
|
||||
and world executable if it is; except the character and block devices,
|
||||
they are never accessible for other than owner. The owner of every
|
||||
file is user and group 0, this should never be a problem for the
|
||||
intended use. The mapping of the 8 possible values to file types is
|
||||
the following:
|
||||
|
||||
      mapping          spec.info means
 0    hard link        link destination [file header]
 1    directory        first file's header
 2    regular file     unused, must be zero [MBZ]
 3    symbolic link    unused, MBZ (file data is the link content)
 4    block device     16/16 bits major/minor number
 5    char device      16/16 bits major/minor number
 6    socket           unused, MBZ
 7    fifo             unused, MBZ
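
Splitting the first longword of a file header into its parts could
look like the sketch below. The enum, the struct and both functions
are names invented for this example. The executable flag is taken to
be the one remaining bit of the low four (mask 0x8), and putting the
major number in the upper 16 bits of spec.info is an assumption based
on the "16/16 bits major/minor" entry in the table.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the file types listed in the table. */
enum romfs_type {
	ROMFS_HARD_LINK = 0, ROMFS_DIRECTORY, ROMFS_REGULAR, ROMFS_SYMLINK,
	ROMFS_BLOCK_DEV, ROMFS_CHAR_DEV, ROMFS_SOCKET, ROMFS_FIFO
};

struct romfs_next {
	uint32_t offset;       /* offset of the next file header, 0 if none */
	enum romfs_type type;  /* low three bits */
	int executable;        /* remaining bit of the low four (mask 0x8) */
};

/* Split the first longword of a file header (already converted from
 * big endian) into its parts. */
static struct romfs_next romfs_decode_next(uint32_t word)
{
	struct romfs_next n;

	n.offset = word & ~(uint32_t)0xf;
	n.type = (enum romfs_type)(word & 0x7);
	n.executable = (word & 0x8) != 0;
	return n;
}

/* For device nodes: 16 bits each of major and minor number. Which half
 * holds the major number is an assumption of this sketch. */
static void romfs_decode_dev(uint32_t spec, unsigned *maj_out,
			     unsigned *min_out)
{
	assert(maj_out != 0 && min_out != 0);
	*maj_out = spec >> 16;
	*min_out = spec & 0xffff;
}
```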
|
||||
|
||||
Note that hard links are specifically marked in this filesystem, but
|
||||
they will behave as you can expect (i.e. share the inode number).
|
||||
Note also that it is your responsibility to not create hard link
|
||||
loops, and to create all the . and .. links for directories. This is
|
||||
normally done correctly by the genromfs program. Please refrain from
|
||||
using the executable bits for special purposes on the socket and fifo
|
||||
special files, they may have other uses in the future. Additionally,
|
||||
please remember that only regular files, and symlinks are supposed to
|
||||
have a nonzero size field; they contain the number of bytes available
|
||||
directly after the (padded) file name.
|
||||
|
||||
Another thing to note is that romfs works on file headers and data
|
||||
aligned to 16 byte boundaries, but most hardware devices and the block
|
||||
device drivers are unable to cope with smaller than block-sized data.
|
||||
To overcome this limitation, the whole size of the file system must be
|
||||
padded to a 1024 byte boundary.
|
||||
|
||||
If you have any problems or suggestions concerning this file system,
|
||||
please contact me. However, think twice before wanting me to add
|
||||
features and code, because the primary and most important advantage of
|
||||
this file system is the small code. On the other hand, don't be
|
||||
alarmed, I'm not getting that much romfs related mail. Now I can
|
||||
understand why Avery wrote poems in the ARCnet docs to get some more
|
||||
feedback. :)
|
||||
|
||||
romfs has also a mailing list, and to date, it hasn't received any
|
||||
traffic, so you are welcome to join it to discuss your ideas. :)
|
||||
|
||||
It's run by ezmlm, so you can subscribe to it by sending a message
|
||||
to romfs-subscribe@shadow.banki.hu, the content is irrelevant.
|
||||
|
||||
Pending issues:
|
||||
|
||||
- Permissions and owner information are pretty essential features of a
|
||||
Un*x like system, but romfs does not provide the full possibilities.
|
||||
I have never found this limiting, but others might.
|
||||
|
||||
- The file system is read only, so it can be very small, but in case
|
||||
one would want to write _anything_ to a file system, he still needs
|
||||
a writable file system, thus negating the size advantages. Possible
|
||||
solutions: implement write access as a compile-time option, or a new,
|
||||
similarly small writable filesystem for RAM disks.
|
||||
|
||||
- Since the files are only required to have alignment on a 16 byte
|
||||
boundary, it is currently possibly suboptimal to read or execute files
|
||||
from the filesystem. It might be resolved by reordering file data to
|
||||
have most of it (i.e. except the start and the end) laying at "natural"
|
||||
boundaries, thus it would be possible to directly map a big portion of
|
||||
the file contents to the mm subsystem.
|
||||
|
||||
- Compression might be a useful feature, but memory is quite a
|
||||
limiting factor in my eyes.
|
||||
|
||||
- Where is it used?
|
||||
|
||||
- Does it work on other architectures than intel and motorola?
|
||||
|
||||
|
||||
Have fun,
|
||||
Janos Farkas <chexum@shadow.banki.hu>
|
||||
8
Documentation/filesystems/smbfs.txt
Normal file
@@ -0,0 +1,8 @@
|
||||
Smbfs is a filesystem that implements the SMB protocol, which is the
|
||||
protocol used by Windows for Workgroups, Windows 95 and Windows NT.
|
||||
Smbfs was inspired by Samba, the program written by Andrew Tridgell
|
||||
that turns any Unix host into a file server for DOS or Windows clients.
|
||||
|
||||
Smbfs is an SMB client, but uses parts of samba for its operation. For
|
||||
more info on samba, including documentation, please go to
|
||||
http://www.samba.org/ and then on to your nearest mirror.
|
||||
521
Documentation/filesystems/spufs.txt
Normal file
@@ -0,0 +1,521 @@
|
||||
SPUFS(2) Linux Programmer's Manual SPUFS(2)
|
||||
|
||||
|
||||
|
||||
NAME
|
||||
spufs - the SPU file system
|
||||
|
||||
|
||||
DESCRIPTION
|
||||
The SPU file system is used on PowerPC machines that implement the Cell
|
||||
Broadband Engine Architecture in order to access Synergistic Processor
|
||||
Units (SPUs).
|
||||
|
||||
The file system provides a name space similar to POSIX shared memory or
|
||||
message queues. Users that have write permissions on the file system
|
||||
can use spu_create(2) to establish SPU contexts in the spufs root.
|
||||
|
||||
Every SPU context is represented by a directory containing a predefined
|
||||
set of files. These files can be used for manipulating the state of the
|
||||
logical SPU. Users can change permissions on those files, but not
actually add or remove files.
|
||||
|
||||
|
||||
MOUNT OPTIONS
|
||||
uid=<uid>
|
||||
set the user owning the mount point, the default is 0 (root).
|
||||
|
||||
gid=<gid>
|
||||
set the group owning the mount point, the default is 0 (root).
|
||||
|
||||
|
||||
FILES
|
||||
The files in spufs mostly follow the standard behavior for regular
system calls like read(2) or write(2), but often support only a subset of
|
||||
the operations supported on regular file systems. This list details the
|
||||
supported operations and the deviations from the behaviour in the
|
||||
respective man pages.
|
||||
|
||||
All files that support the read(2) operation also support readv(2) and
|
||||
all files that support the write(2) operation also support writev(2).
|
||||
All files support the access(2) and stat(2) family of operations, but
|
||||
only the st_mode, st_nlink, st_uid and st_gid fields of struct stat
|
||||
contain reliable information.
|
||||
|
||||
All files support the chmod(2)/fchmod(2) and chown(2)/fchown(2)
operations, but will not be able to grant permissions that contradict the
|
||||
possible operations, e.g. read access on the wbox file.
|
||||
|
||||
The current set of files is:
|
||||
|
||||
|
||||
/mem
|
||||
the contents of the local storage memory of the SPU. This can be
|
||||
accessed like a regular shared memory file and contains both code and
|
||||
data in the address space of the SPU. The possible operations on an
|
||||
open mem file are:
|
||||
|
||||
read(2), pread(2), write(2), pwrite(2), lseek(2)
|
||||
These operate as documented, with the exception that lseek(2),
|
||||
write(2) and pwrite(2) are not supported beyond the end of the
|
||||
file. The file size is the size of the local storage of the SPU,
|
||||
which normally is 256 kilobytes.
|
||||
|
||||
mmap(2)
|
||||
Mapping mem into the process address space gives access to the
|
||||
SPU local storage within the process address space. Only
|
||||
MAP_SHARED mappings are allowed.
|
||||
|
||||
|
||||
/mbox
|
||||
The first SPU to CPU communication mailbox. This file is read-only and
|
||||
can be read in units of 32 bits. The file can only be used in
non-blocking mode, and even poll() will not block on it. The possible
|
||||
operations on an open mbox file are:
|
||||
|
||||
read(2)
|
||||
If a count smaller than four is requested, read returns -1 and
|
||||
sets errno to EINVAL. If there is no data available in the mail
|
||||
box, the return value is set to -1 and errno becomes EAGAIN.
|
||||
When data has been read successfully, four bytes are placed in
|
||||
the data buffer and the value four is returned.
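
A minimal user-space sketch of the read rules above follows.
mbox_read_word() is a hypothetical helper, not part of any library; it
only relies on read(2) and on the EAGAIN behaviour described for the
mbox file, and it leaves the word in whatever byte order the file
delivers.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper for an mbox-style file descriptor.
 * Returns 1 and stores one word on success, 0 if no data is available
 * (EAGAIN), and -1 on any other error or on a short read. */
static int mbox_read_word(int fd, uint32_t *word)
{
	ssize_t n;

	assert(word != NULL);
	do {
		/* always ask for exactly four bytes */
		n = read(fd, word, sizeof(*word));
	} while (n < 0 && errno == EINTR);

	if (n == (ssize_t)sizeof(*word))
		return 1;
	if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
		return 0;
	return -1;
}
```

Since a count smaller than four is rejected with EINVAL, the helper
never retries with a partial buffer.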
|
||||
|
||||
|
||||
/ibox
|
||||
The second SPU to CPU communication mailbox. This file is similar to
|
||||
the first mailbox file, but can be read in blocking I/O mode, and the
|
||||
poll family of system calls can be used to wait for it. The possible
|
||||
operations on an open ibox file are:
|
||||
|
||||
read(2)
|
||||
If a count smaller than four is requested, read returns -1 and
|
||||
sets errno to EINVAL. If there is no data available in the mail
|
||||
box and the file descriptor has been opened with O_NONBLOCK, the
|
||||
return value is set to -1 and errno becomes EAGAIN.
|
||||
|
||||
If there is no data available in the mail box and the file
|
||||
descriptor has been opened without O_NONBLOCK, the call will
|
||||
block until the SPU writes to its interrupt mailbox channel.
|
||||
When data has been read successfully, four bytes are placed in
|
||||
the data buffer and the value four is returned.
|
||||
|
||||
poll(2)
|
||||
Poll on the ibox file returns (POLLIN | POLLRDNORM) whenever
|
||||
data is available for reading.
|
||||
|
||||
|
||||
/wbox
|
||||
The CPU to SPU communication mailbox. It is write-only and can be written
in units of 32 bits. If the mailbox is full, write() will block and
poll can be used to wait for it becoming empty again. The possible
operations on an open wbox file are:

write(2)
If a count smaller than four is requested, write returns -1 and
sets errno to EINVAL. If there is no space available in the mail
box and the file descriptor has been opened with O_NONBLOCK, the
return value is set to -1 and errno becomes EAGAIN.

If there is no space available in the mail box and the file
descriptor has been opened without O_NONBLOCK, the call will
block until the SPU reads from its PPE mailbox channel. When
data has been written successfully, the value four is returned.

poll(2)
Poll on the wbox file returns (POLLOUT | POLLWRNORM) whenever
space is available for writing.
|
||||
|
||||
|
||||
/mbox_stat
|
||||
/ibox_stat
|
||||
/wbox_stat
|
||||
Read-only files that contain the length of the current queue, i.e. how
|
||||
many words can be read from mbox or ibox or how many words can be
|
||||
written to wbox without blocking. The files can be read only in 4-byte
|
||||
units and return a big-endian binary integer number. The possible
|
||||
operations on an open *box_stat file are:
|
||||
|
||||
read(2)
|
||||
If a count smaller than four is requested, read returns -1 and
|
||||
sets errno to EINVAL. Otherwise, a four byte value is placed in
|
||||
the data buffer, containing the number of elements that can be
|
||||
read from (for mbox_stat and ibox_stat) or written to (for
|
||||
wbox_stat) the respective mail box without blocking or resulting
|
||||
in EAGAIN.
|
||||
|
||||
|
||||
/npc
|
||||
/decr
|
||||
/decr_status
|
||||
/spu_tag_mask
|
||||
/event_mask
|
||||
/srr0
|
||||
Internal registers of the SPU. The representation is an ASCII string
|
||||
with the numeric value of the next instruction to be executed. These
|
||||
can be used in read/write mode for debugging, but normal operation of
|
||||
programs should not rely on them because access to any of them except
|
||||
npc requires an SPU context save and is therefore very inefficient.
|
||||
|
||||
The contents of these files are:
|
||||
|
||||
npc Next Program Counter
|
||||
|
||||
decr SPU Decrementer
|
||||
|
||||
decr_status Decrementer Status
|
||||
|
||||
spu_tag_mask MFC tag mask for SPU DMA
|
||||
|
||||
event_mask Event mask for SPU interrupts
|
||||
|
||||
srr0 Interrupt Return address register
|
||||
|
||||
|
||||
The possible operations on an open npc, decr, decr_status,
|
||||
spu_tag_mask, event_mask or srr0 file are:
|
||||
|
||||
read(2)
|
||||
When the count supplied to the read call is shorter than the
|
||||
required length for the pointer value plus a newline character,
|
||||
subsequent reads from the same file descriptor will result in
|
||||
completing the string, regardless of changes to the register by
|
||||
a running SPU task. When a complete string has been read, all
|
||||
subsequent read operations will return zero bytes and a new file
|
||||
descriptor needs to be opened to read the value again.
|
||||
|
||||
write(2)
|
||||
A write operation on the file results in setting the register to
|
||||
the value given in the string. The string is parsed from the
|
||||
beginning to the first non-numeric character or the end of the
|
||||
buffer. Subsequent writes to the same file descriptor overwrite
|
||||
the previous setting.
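
Turning the text read from one of these files back into a number could
be done as in the sketch below. spu_parse_reg() is a hypothetical
helper. The exact number format is not spelled out above, so the
sketch assumes a single number followed by an optional newline and
lets strtoul() accept either decimal or 0x-prefixed hexadecimal.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse the text read from a register file such as
 * npc. Returns 0 on success and -1 if the text is not a single number. */
static int spu_parse_reg(const char *text, unsigned long *value)
{
	char *end;

	assert(text != NULL && value != NULL);
	errno = 0;
	*value = strtoul(text, &end, 0); /* base 0: decimal or 0x hex */
	if (end == text || errno == ERANGE)
		return -1;
	/* only a trailing newline is expected after the number */
	if (*end != '\0' && *end != '\n')
		return -1;
	return 0;
}
```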
|
||||
|
||||
|
||||
/fpcr
|
||||
This file gives access to the Floating Point Status and Control
Register as a four byte long file. The operations on the fpcr file are:
|
||||
|
||||
read(2)
|
||||
If a count smaller than four is requested, read returns -1 and
|
||||
sets errno to EINVAL. Otherwise, a four byte value is placed in
|
||||
the data buffer, containing the current value of the fpcr
register.
|
||||
|
||||
write(2)
|
||||
If a count smaller than four is requested, write returns -1 and
|
||||
sets errno to EINVAL. Otherwise, a four byte value is copied
|
||||
from the data buffer, updating the value of the fpcr register.
|
||||
|
||||
|
||||
/signal1
|
||||
/signal2
|
||||
The two signal notification channels of an SPU. These are read-write
|
||||
files that operate on a 32 bit word. Writing to one of these files
|
||||
triggers an interrupt on the SPU. The value written to the signal
|
||||
files can be read from the SPU through a channel read or from host user
|
||||
space through the file. After the value has been read by the SPU, it
|
||||
is reset to zero. The possible operations on an open signal1 or
signal2 file are:
|
||||
|
||||
read(2)
|
||||
If a count smaller than four is requested, read returns -1 and
|
||||
sets errno to EINVAL. Otherwise, a four byte value is placed in
|
||||
the data buffer, containing the current value of the specified
|
||||
signal notification register.
|
||||
|
||||
write(2)
|
||||
If a count smaller than four is requested, write returns -1 and
|
||||
sets errno to EINVAL. Otherwise, a four byte value is copied
|
||||
from the data buffer, updating the value of the specified signal
|
||||
notification register. The signal notification register will
|
||||
either be replaced with the input data or will be updated to the
|
||||
bitwise OR of the old value and the input data, depending on the
|
||||
contents of the signal1_type, or signal2_type respectively,
|
||||
file.
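
The two update rules can be written down as a tiny model.
signal_after_write() is a hypothetical function that only mirrors the
description here; mode stands for the value held in the matching
signal1_type or signal2_type file, with 0 meaning overwrite and 1
meaning logical OR.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a signal notification register: returns the
 * register value after writing data, given the old value and the mode. */
static uint32_t signal_after_write(uint32_t old, uint32_t data, int mode)
{
	assert(mode == 0 || mode == 1);
	return mode ? (old | data) : data;
}
```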
|
||||
|
||||
|
||||
/signal1_type
|
||||
/signal2_type
|
||||
These two files change the behavior of the signal1 and signal2
notification files. They contain a numerical ASCII string which is read as
either "1" or "0". In mode 0 (overwrite), the hardware replaces the
contents of the signal channel with the data that is written to it. In
mode 1 (logical OR), the hardware accumulates the bits that are
subsequently written to it. The possible operations on an open signal1_type
|
||||
or signal2_type file are:
|
||||
|
||||
read(2)
|
||||
When the count supplied to the read call is shorter than the
|
||||
required length for the digit plus a newline character,
subsequent reads from the same file descriptor will result in
completing the string. When a complete string has been read, all
|
||||
subsequent read operations will return zero bytes and a new file
|
||||
descriptor needs to be opened to read the value again.
|
||||
|
||||
write(2)
|
||||
A write operation on the file results in setting the register to
|
||||
the value given in the string. The string is parsed from the
|
||||
beginning to the first non-numeric character or the end of the
|
||||
buffer. Subsequent writes to the same file descriptor overwrite
|
||||
the previous setting.
|
||||
|
||||
|
||||
EXAMPLES
|
||||
/etc/fstab entry
|
||||
none /spu spufs gid=spu 0 0
|
||||
|
||||
|
||||
AUTHORS
|
||||
Arnd Bergmann <arndb@de.ibm.com>, Mark Nutter <mnutter@us.ibm.com>,
|
||||
Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
|
||||
|
||||
SEE ALSO
|
||||
capabilities(7), close(2), spu_create(2), spu_run(2), spufs(7)
|
||||
|
||||
|
||||
|
||||
Linux 2005-09-28 SPUFS(2)
|
||||
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
SPU_RUN(2) Linux Programmer's Manual SPU_RUN(2)
|
||||
|
||||
|
||||
|
||||
NAME
|
||||
spu_run - execute an spu context
|
||||
|
||||
|
||||
SYNOPSIS
|
||||
#include <sys/spu.h>
|
||||
|
||||
int spu_run(int fd, unsigned int *npc, unsigned int *event);
|
||||
|
||||
DESCRIPTION
|
||||
The spu_run system call is used on PowerPC machines that implement the
|
||||
Cell Broadband Engine Architecture in order to access Synergistic Pro-
|
||||
cessor Units (SPUs). It uses the fd that was returned from spu_cre-
|
||||
ate(2) to address a specific SPU context. When the context gets sched-
|
||||
uled to a physical SPU, it starts execution at the instruction pointer
|
||||
passed in npc.
|
||||
|
||||
Execution of SPU code happens synchronously, meaning that spu_run does
|
||||
not return while the SPU is still running. If there is a need to exe-
|
||||
cute SPU code in parallel with other code on either the main CPU or
|
||||
other SPUs, you need to create a new thread of execution first, e.g.
|
||||
using the pthread_create(3) call.
|
||||
|
||||
When spu_run returns, the current value of the SPU instruction pointer
|
||||
is written back to npc, so you can call spu_run again without updating
|
||||
the pointers.
|
||||
|
||||
event can be a NULL pointer or point to an extended status code that
|
||||
gets filled when spu_run returns. It can be one of the following con-
|
||||
stants:
|
||||
|
||||
SPE_EVENT_DMA_ALIGNMENT
|
||||
A DMA alignment error
|
||||
|
||||
SPE_EVENT_SPE_DATA_SEGMENT
|
||||
A DMA segmentation error
|
||||
|
||||
SPE_EVENT_SPE_DATA_STORAGE
|
||||
A DMA storage error
|
||||
|
||||
If NULL is passed as the event argument, these errors will result in a
|
||||
signal delivered to the calling process.
|
||||
|
||||
RETURN VALUE
|
||||
spu_run returns the value of the spu_status register or -1 to indicate
|
||||
an error and set errno to one of the error codes listed below. The
|
||||
spu_status register value contains a bit mask of status codes and
|
||||
optionally a 14 bit code returned from the stop-and-signal instruction
|
||||
on the SPU. The bit masks for the status codes are:
|
||||
|
||||
0x02 SPU was stopped by stop-and-signal.
|
||||
|
||||
0x04 SPU was stopped by halt.
|
||||
|
||||
0x08 SPU is waiting for a channel.
|
||||
|
||||
0x10 SPU is in single-step mode.
|
||||
|
||||
0x20 SPU has tried to execute an invalid instruction.
|
||||
|
||||
0x40 SPU has tried to access an invalid channel.
|
||||
|
||||
0x3fff0000
|
||||
The bits masked with this value contain the code returned from
|
||||
stop-and-signal.
|
||||
|
||||
There are always one or more of the lower eight bits set or an error
|
||||
code is returned from spu_run.
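
As a sketch of how a caller might decode the return value described above,
the following helper extracts the stop-and-signal code. The macro and
function names are made up for this example; only the mask values 0x02 and
0x3fff0000 come from the text.

```c
#include <stdint.h>

/* Hypothetical names; the values are the masks listed above. */
#define SPU_STATUS_STOPPED_BY_STOP	0x02u
#define SPU_STOP_CODE_MASK		0x3fff0000u

/* Return the 14-bit stop-and-signal code carried in a spu_run status
 * value, or -1 if the "stopped by stop-and-signal" bit is not set. */
int spu_stop_code(uint32_t status)
{
	if (!(status & SPU_STATUS_STOPPED_BY_STOP))
		return -1;
	return (int)((status & SPU_STOP_CODE_MASK) >> 16);
}
```

The shift by 16 follows from the position of the mask: its lowest set bit is
bit 16.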
|
||||
|
||||
ERRORS
|
||||
EAGAIN or EWOULDBLOCK
|
||||
fd is in non-blocking mode and spu_run would block.
|
||||
|
||||
EBADF fd is not a valid file descriptor.
|
||||
|
||||
EFAULT npc is not a valid pointer or event is neither NULL nor a valid
|
||||
pointer.
|
||||
|
||||
EINTR A signal occurred while spu_run was in progress. The npc value
|
||||
has been updated to the new program counter value if necessary.
|
||||
|
||||
EINVAL fd is not a file descriptor returned from spu_create(2).
|
||||
|
||||
ENOMEM Insufficient memory was available to handle a page fault result-
|
||||
ing from an MFC direct memory access.
|
||||
|
||||
ENOSYS The functionality is not provided by the current system, because
|
||||
either the hardware does not provide SPUs or the spufs module is
|
||||
not loaded.
|
||||
|
||||
|
||||
NOTES
|
||||
spu_run is meant to be used from libraries that implement a more
|
||||
abstract interface to SPUs, not to be used from regular applications.
|
||||
See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
|
||||
ommended libraries.
|
||||
|
||||
|
||||
CONFORMING TO
|
||||
This call is Linux specific and only implemented by the ppc64 architec-
|
||||
ture. Programs using this system call are not portable.
|
||||
|
||||
|
||||
BUGS
|
||||
The code does not yet fully implement all features outlined here.
|
||||
|
||||
|
||||
AUTHOR
|
||||
Arnd Bergmann <arndb@de.ibm.com>
|
||||
|
||||
SEE ALSO
|
||||
capabilities(7), close(2), spu_create(2), spufs(7)
|
||||
|
||||
|
||||
|
||||
Linux 2005-09-28 SPU_RUN(2)
|
||||
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
SPU_CREATE(2) Linux Programmer's Manual SPU_CREATE(2)
|
||||
|
||||
|
||||
|
||||
NAME
|
||||
spu_create - create a new spu context
|
||||
|
||||
|
||||
SYNOPSIS
|
||||
#include <sys/types.h>
|
||||
#include <sys/spu.h>
|
||||
|
||||
int spu_create(const char *pathname, int flags, mode_t mode);
|
||||
|
||||
DESCRIPTION
|
||||
The spu_create system call is used on PowerPC machines that implement
|
||||
the Cell Broadband Engine Architecture in order to access Synergistic
|
||||
Processor Units (SPUs). It creates a new logical context for an SPU in
|
||||
pathname and returns a handle associated with it. pathname must
|
||||
point to a non-existing directory in the mount point of the SPU file
|
||||
system (spufs). When spu_create is successful, a directory gets cre-
|
||||
ated on pathname and it is populated with files.
|
||||
|
||||
The returned file handle can only be passed to spu_run(2) or closed;
|
||||
other operations are not defined on it. When it is closed, all associ-
|
||||
ated directory entries in spufs are removed. When the last file handle
|
||||
pointing either inside of the context directory or to this file
|
||||
descriptor is closed, the logical SPU context is destroyed.
|
||||
|
||||
The parameter flags can be zero or any bitwise or'd combination of the
|
||||
following constants:
|
||||
|
||||
SPU_RAWIO
|
||||
Allow mapping of some of the hardware registers of the SPU into
|
||||
user space. This flag requires the CAP_SYS_RAWIO capability, see
|
||||
capabilities(7).
|
||||
|
||||
The mode parameter specifies the permissions used for creating the new
|
||||
directory in spufs. mode is modified with the user's umask(2) value
|
||||
and then used for both the directory and the files contained in it. The
|
||||
file permissions mask out some more bits of mode because they typically
|
||||
support only read or write access. See stat(2) for a full list of the
|
||||
possible mode values.
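
The usual umask(2) rule can be written as a one-line helper. This sketch
shows only the generic "clear the bits set in the umask" step; the function
name apply_umask is hypothetical, and the additional per-file masking that
spufs performs is not modelled here.

```c
#include <sys/types.h>

/* Hypothetical helper: permission bits left after the process umask
 * has been applied to the requested mode. */
mode_t apply_umask(mode_t mode, mode_t mask)
{
	return mode & ~mask;
}
```

With a umask of 022, a requested mode of 0777 would give a context directory
with mode 0755.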
|
||||
|
||||
|
||||
RETURN VALUE
|
||||
spu_create returns a new file descriptor. It may return -1 to indicate
|
||||
an error condition and set errno to one of the error codes listed
|
||||
below.
|
||||
|
||||
|
||||
ERRORS
|
||||
EACCES
|
||||
The current user does not have write access on the spufs mount
|
||||
point.
|
||||
|
||||
EEXIST An SPU context already exists at the given path name.
|
||||
|
||||
EFAULT pathname is not a valid string pointer in the current address
|
||||
space.
|
||||
|
||||
EINVAL pathname is not a directory in the spufs mount point.
|
||||
|
||||
ELOOP Too many symlinks were found while resolving pathname.
|
||||
|
||||
EMFILE The process has reached its maximum open file limit.
|
||||
|
||||
ENAMETOOLONG
|
||||
pathname was too long.
|
||||
|
||||
ENFILE The system has reached the global open file limit.
|
||||
|
||||
ENOENT Part of pathname could not be resolved.
|
||||
|
||||
ENOMEM The kernel could not allocate all resources required.
|
||||
|
||||
ENOSPC There are not enough SPU resources available to create a new
|
||||
context or the user specific limit for the number of SPU con-
|
||||
texts has been reached.
|
||||
|
||||
ENOSYS The functionality is not provided by the current system, because
|
||||
either the hardware does not provide SPUs or the spufs module is
|
||||
not loaded.
|
||||
|
||||
ENOTDIR
|
||||
A part of pathname is not a directory.
|
||||
|
||||
|
||||
|
||||
NOTES
|
||||
spu_create is meant to be used from libraries that implement a more
|
||||
abstract interface to SPUs, not to be used from regular applications.
|
||||
See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
|
||||
ommended libraries.
|
||||
|
||||
|
||||
FILES
|
||||
pathname must point to a location beneath the mount point of spufs. By
|
||||
convention, it gets mounted in /spu.
|
||||
|
||||
|
||||
CONFORMING TO
|
||||
This call is Linux specific and only implemented by the ppc64 architec-
|
||||
ture. Programs using this system call are not portable.
|
||||
|
||||
|
||||
BUGS
|
||||
The code does not yet fully implement all features outlined here.
|
||||
|
||||
|
||||
AUTHOR
|
||||
Arnd Bergmann <arndb@de.ibm.com>
|
||||
|
||||
SEE ALSO
|
||||
capabilities(7), close(2), spu_run(2), spufs(7)
|
||||
|
||||
|
||||
|
||||
Linux 2005-09-28 SPU_CREATE(2)
|
||||
95
Documentation/filesystems/sysfs-pci.txt
Normal file
@@ -0,0 +1,95 @@
|
||||
Accessing PCI device resources through sysfs
|
||||
--------------------------------------------
|
||||
|
||||
sysfs, usually mounted at /sys, provides access to PCI resources on platforms
|
||||
that support it. For example, a given bus might look like this:
|
||||
|
||||
/sys/devices/pci0000:17
|
||||
|-- 0000:17:00.0
|
||||
| |-- class
|
||||
| |-- config
|
||||
| |-- device
|
||||
| |-- irq
|
||||
| |-- local_cpus
|
||||
| |-- resource
|
||||
| |-- resource0
|
||||
| |-- resource1
|
||||
| |-- resource2
|
||||
| |-- rom
|
||||
| |-- subsystem_device
|
||||
| |-- subsystem_vendor
|
||||
| `-- vendor
|
||||
`-- ...
|
||||
|
||||
The topmost element describes the PCI domain and bus number. In this case,
|
||||
the domain number is 0000 and the bus number is 17 (both values are in hex).
|
||||
This bus contains a single function device in slot 0. The domain and bus
|
||||
numbers are reproduced for convenience. Under the device directory are several
|
||||
files, each with their own function.
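
A userspace tool that walks this tree usually needs to split a device
directory name such as 0000:17:00.0 back into its parts. The sketch below
does that with sscanf(3). The struct and function names are hypothetical,
and it assumes the domain:bus:slot.function layout with all fields in hex.

```c
#include <stdio.h>

/* Hypothetical container for the fields of a name like "0000:17:00.0". */
struct pci_addr {
	unsigned int domain;
	unsigned int bus;
	unsigned int slot;
	unsigned int func;
};

/* Parse "DDDD:BB:SS.F" (hex fields). Returns 0 on success, -1 if the
 * string does not match or has trailing characters. */
int parse_pci_addr(const char *name, struct pci_addr *out)
{
	char trailing;
	int n = sscanf(name, "%4x:%2x:%2x.%1x%c", &out->domain, &out->bus,
		       &out->slot, &out->func, &trailing);

	/* Exactly four conversions means the fields matched and nothing
	 * followed the function digit. */
	return n == 4 ? 0 : -1;
}
```

For the directory shown above this yields domain 0, bus 0x17, slot 0 and
function 0.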
|
||||
|
||||
file function
|
||||
---- --------
|
||||
class PCI class (ascii, ro)
|
||||
config PCI config space (binary, rw)
|
||||
device PCI device (ascii, ro)
|
||||
irq IRQ number (ascii, ro)
|
||||
local_cpus nearby CPU mask (cpumask, ro)
|
||||
resource PCI resource host addresses (ascii, ro)
|
||||
resource0..N PCI resource N, if present (binary, mmap)
|
||||
rom PCI ROM resource, if present (binary, ro)
|
||||
subsystem_device PCI subsystem device (ascii, ro)
|
||||
subsystem_vendor PCI subsystem vendor (ascii, ro)
|
||||
vendor PCI vendor (ascii, ro)
|
||||
|
||||
ro - read only file
|
||||
rw - file is readable and writable
|
||||
mmap - file is mmapable
|
||||
ascii - file contains ascii text
|
||||
binary - file contains binary data
|
||||
cpumask - file contains a cpumask type
|
||||
|
||||
The read-only files are informational; writes to them will be ignored, with
|
||||
the exception of the 'rom' file. Writable files can be used to perform
|
||||
actions on the device (e.g. changing config space, detaching a device).
|
||||
mmapable files are available via an mmap of the file at offset 0 and can be
|
||||
used to do actual device programming from userspace. Note that some platforms
|
||||
don't support mmapping of certain resources, so be sure to check the return
|
||||
value from any attempted mmap.
|
||||
|
||||
The 'rom' file is special in that it provides read-only access to the device's
|
||||
ROM file, if available. It's disabled by default, however, so applications
|
||||
should write the string "1" to the file to enable it before attempting a read
|
||||
call, and disable it following the access by writing "0" to the file.
|
||||
|
||||
Accessing legacy resources through sysfs
|
||||
----------------------------------------
|
||||
|
||||
Legacy I/O port and ISA memory resources are also provided in sysfs if the
|
||||
underlying platform supports them. They're located in the PCI class hierarchy,
|
||||
e.g.
|
||||
|
||||
/sys/class/pci_bus/0000:17/
|
||||
|-- bridge -> ../../../devices/pci0000:17
|
||||
|-- cpuaffinity
|
||||
|-- legacy_io
|
||||
`-- legacy_mem
|
||||
|
||||
The legacy_io file is a read/write file that can be used by applications to
|
||||
do legacy port I/O. The application should open the file, seek to the desired
|
||||
port (e.g. 0x3e8) and do a read or a write of 1, 2 or 4 bytes. The legacy_mem
|
||||
file should be mmapped with an offset corresponding to the memory offset
|
||||
desired, e.g. 0xa0000 for the VGA frame buffer. The application can then
|
||||
simply dereference the returned pointer (after checking for errors of course)
|
||||
to access legacy memory space.
|
||||
|
||||
Supporting PCI access on new platforms
|
||||
--------------------------------------
|
||||
|
||||
In order to support PCI resource mapping as described above, Linux platform
|
||||
code must define HAVE_PCI_MMAP and provide a pci_mmap_page_range function.
|
||||
Platforms are free to only support subsets of the mmap functionality, but
|
||||
useful return codes should be provided.
|
||||
|
||||
Legacy resources are protected by the HAVE_PCI_LEGACY define. Platforms
|
||||
wishing to support legacy functionality should define it and provide
|
||||
pci_legacy_read, pci_legacy_write and pci_mmap_legacy_page_range functions.
|
||||
346
Documentation/filesystems/sysfs.txt
Normal file
@@ -0,0 +1,346 @@
|
||||
|
||||
sysfs - _The_ filesystem for exporting kernel objects.
|
||||
|
||||
Patrick Mochel <mochel@osdl.org>
|
||||
|
||||
10 January 2003
|
||||
|
||||
|
||||
What it is:
|
||||
~~~~~~~~~~~
|
||||
|
||||
sysfs is a ram-based filesystem initially based on ramfs. It provides
|
||||
a means to export kernel data structures, their attributes, and the
|
||||
linkages between them to userspace.
|
||||
|
||||
sysfs is tied inherently to the kobject infrastructure. Please read
|
||||
Documentation/kobject.txt for more information concerning the kobject
|
||||
interface.
|
||||
|
||||
|
||||
Using sysfs
|
||||
~~~~~~~~~~~
|
||||
|
||||
sysfs is always compiled in. You can access it by doing:
|
||||
|
||||
mount -t sysfs sysfs /sys
|
||||
|
||||
|
||||
Directory Creation
|
||||
~~~~~~~~~~~~~~~~~~
|
||||
|
||||
For every kobject that is registered with the system, a directory is
|
||||
created for it in sysfs. That directory is created as a subdirectory
|
||||
of the kobject's parent, expressing internal object hierarchies to
|
||||
userspace. Top-level directories in sysfs represent the common
|
||||
ancestors of object hierarchies; i.e. the subsystems the objects
|
||||
belong to.
|
||||
|
||||
Sysfs internally stores the kobject that owns the directory in the
|
||||
->d_fsdata pointer of the directory's dentry. This allows sysfs to do
|
||||
reference counting directly on the kobject when the file is opened and
|
||||
closed.
|
||||
|
||||
|
||||
Attributes
|
||||
~~~~~~~~~~
|
||||
|
||||
Attributes can be exported for kobjects in the form of regular files in
|
||||
the filesystem. Sysfs forwards file I/O operations to methods defined
|
||||
for the attributes, providing a means to read and write kernel
|
||||
attributes.
|
||||
|
||||
Attributes should be ASCII text files, preferably with only one value
|
||||
per file. It is noted that it may not be efficient to contain only one
|
||||
value per file, so it is socially acceptable to express an array of
|
||||
values of the same type.
|
||||
|
||||
Mixing types, expressing multiple lines of data, and doing fancy
|
||||
formatting of data is heavily frowned upon. Doing these things may get
|
||||
you publicly humiliated and your code rewritten without notice.
|
||||
|
||||
|
||||
An attribute definition is simply:
|
||||
|
||||
struct attribute {
|
||||
char * name;
|
||||
mode_t mode;
|
||||
};
|
||||
|
||||
|
||||
int sysfs_create_file(struct kobject * kobj, struct attribute * attr);
|
||||
void sysfs_remove_file(struct kobject * kobj, struct attribute * attr);
|
||||
|
||||
|
||||
A bare attribute contains no means to read or write the value of the
|
||||
attribute. Subsystems are encouraged to define their own attribute
|
||||
structure and wrapper functions for adding and removing attributes for
|
||||
a specific object type.
|
||||
|
||||
For example, the driver model defines struct device_attribute like:
|
||||
|
||||
struct device_attribute {
|
||||
struct attribute attr;
|
||||
ssize_t (*show)(struct device * dev, char * buf);
|
||||
ssize_t (*store)(struct device * dev, const char * buf);
|
||||
};
|
||||
|
||||
int device_create_file(struct device *, struct device_attribute *);
|
||||
void device_remove_file(struct device *, struct device_attribute *);
|
||||
|
||||
It also defines this helper for defining device attributes:
|
||||
|
||||
#define DEVICE_ATTR(_name, _mode, _show, _store) \
|
||||
struct device_attribute dev_attr_##_name = { \
|
||||
.attr = {.name = __stringify(_name) , .mode = _mode }, \
|
||||
.show = _show, \
|
||||
.store = _store, \
|
||||
};
|
||||
|
||||
For example, declaring
|
||||
|
||||
static DEVICE_ATTR(foo, S_IWUSR | S_IRUGO, show_foo, store_foo);
|
||||
|
||||
is equivalent to doing:
|
||||
|
||||
static struct device_attribute dev_attr_foo = {
|
||||
.attr = {
|
||||
.name = "foo",
|
||||
.mode = S_IWUSR | S_IRUGO,
|
||||
},
|
||||
.show = show_foo,
|
||||
.store = store_foo,
|
||||
};
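
The expansion can be checked outside the kernel with stand-in types. The
following is a userspace sketch only: struct device is left opaque,
STRINGIFY stands in for the kernel's __stringify, S_IRUGO is defined locally
if the headers do not provide it, and show_foo/store_foo are dummy methods.
The macro body here also omits the trailing semicolon so that the
declaration supplies it.

```c
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Stand-ins for kernel helpers that userspace headers do not provide. */
#define STRINGIFY_1(x)	#x
#define STRINGIFY(x)	STRINGIFY_1(x)
#ifndef S_IRUGO
#define S_IRUGO		(S_IRUSR | S_IRGRP | S_IROTH)
#endif

struct device;			/* opaque in this sketch */

struct attribute {
	const char *name;
	mode_t mode;
};

struct device_attribute {
	struct attribute attr;
	ssize_t (*show)(struct device *dev, char *buf);
	ssize_t (*store)(struct device *dev, const char *buf);
};

#define DEVICE_ATTR(_name, _mode, _show, _store)		\
struct device_attribute dev_attr_##_name = {			\
	.attr = { .name = STRINGIFY(_name), .mode = _mode },	\
	.show = _show,						\
	.store = _store,					\
}

/* Dummy methods, present only so the initializer has something to point at. */
ssize_t show_foo(struct device *dev, char *buf)
{
	(void)dev;
	buf[0] = '\0';
	return 0;
}

ssize_t store_foo(struct device *dev, const char *buf)
{
	(void)dev;
	return (ssize_t)strlen(buf);
}

static DEVICE_ATTR(foo, S_IWUSR | S_IRUGO, show_foo, store_foo);
```

The token pasting produces the variable name dev_attr_foo, and the
stringification produces the file name "foo" that appears in sysfs.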
|
||||
|
||||
|
||||
Subsystem-Specific Callbacks
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
When a subsystem defines a new attribute type, it must implement a
|
||||
set of sysfs operations for forwarding read and write calls to the
|
||||
show and store methods of the attribute owners.
|
||||
|
||||
struct sysfs_ops {
|
||||
ssize_t (*show)(struct kobject *, struct attribute *, char *);
|
||||
ssize_t (*store)(struct kobject *, struct attribute *, const char *);
|
||||
};
|
||||
|
||||
[ Subsystems should have already defined a struct kobj_type as a
|
||||
descriptor for this type, which is where the sysfs_ops pointer is
|
||||
stored. See the kobject documentation for more information. ]
|
||||
|
||||
When a file is read or written, sysfs calls the appropriate method
|
||||
for the type. The method then translates the generic struct kobject
|
||||
and struct attribute pointers to the appropriate pointer types, and
|
||||
calls the associated methods.
|
||||
|
||||
|
||||
To illustrate:
|
||||
|
||||
#define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr)
|
||||
#define to_dev(d) container_of(d, struct device, kobj)
|
||||
|
||||
static ssize_t
|
||||
dev_attr_show(struct kobject * kobj, struct attribute * attr, char * buf)
|
||||
{
|
||||
struct device_attribute * dev_attr = to_dev_attr(attr);
|
||||
struct device * dev = to_dev(kobj);
|
||||
ssize_t ret = 0;
|
||||
|
||||
if (dev_attr->show)
|
||||
ret = dev_attr->show(dev, buf);
|
||||
return ret;
|
||||
}
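
The pointer translation done by to_dev_attr() and to_dev() can be tried in
userspace. The sketch below uses simplified stand-in structures (the real
struct kobject, struct attribute and struct device carry many more fields),
a portable container_of built on offsetof, and a hypothetical show_id
method.

```c
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>

#define SKETCH_BUF_SIZE 32	/* stands in for PAGE_SIZE */

/* Recover a pointer to the enclosing structure from a pointer to one
 * of its members. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified stand-ins for the kernel structures. */
struct kobject { const char *name; };
struct attribute { const char *name; };
struct device { int id; struct kobject kobj; };

struct device_attribute {
	struct attribute attr;
	ssize_t (*show)(struct device *dev, char *buf);
};

#define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr)
#define to_dev(d) container_of(d, struct device, kobj)

/* Hypothetical show method: print the device id followed by a newline. */
ssize_t show_id(struct device *dev, char *buf)
{
	return snprintf(buf, SKETCH_BUF_SIZE, "%d\n", dev->id);
}

/* Generic entry point: translate the embedded pointers back to their
 * containers, then call the type-specific method if one is set. */
ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr, char *buf)
{
	struct device_attribute *dev_attr = to_dev_attr(attr);
	struct device *dev = to_dev(kobj);

	return dev_attr->show ? dev_attr->show(dev, buf) : 0;
}
```

The caller only ever hands over the embedded kobject and attribute; the two
macros are what let the generic layer reach the device and its methods.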
|
||||
|
||||
|
||||
|
||||
Reading/Writing Attribute Data
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
To read or write attributes, show() or store() methods must be
|
||||
specified when declaring the attribute. The method types should be as
|
||||
simple as those defined for device attributes:
|
||||
|
||||
ssize_t (*show)(struct device * dev, char * buf);
|
||||
ssize_t (*store)(struct device * dev, const char * buf);
|
||||
|
||||
IOW, they should take only an object and a buffer as parameters.
|
||||
|
||||
|
||||
sysfs allocates a buffer of size (PAGE_SIZE) and passes it to the
|
||||
method. Sysfs will call the method exactly once for each read or
|
||||
write. This forces the following behavior on the method
|
||||
implementations:
|
||||
|
||||
- On read(2), the show() method should fill the entire buffer.
|
||||
Recall that an attribute should only be exporting one value, or an
|
||||
array of similar values, so this shouldn't be that expensive.
|
||||
|
||||
This allows userspace to do partial reads and seeks arbitrarily over
|
||||
the entire file at will.
|
||||
|
||||
- On write(2), sysfs expects the entire buffer to be passed during the
|
||||
first write. Sysfs then passes the entire buffer to the store()
|
||||
method.
|
||||
|
||||
When writing sysfs files, userspace processes should first read the
|
||||
entire file, modify the values it wishes to change, then write the
|
||||
entire buffer back.
|
||||
|
||||
Attribute method implementations should operate on an identical
|
||||
buffer when reading and writing values.
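
From the userspace side, the "entire buffer in the first write" rule
suggests issuing exactly one write(2) per update. A minimal sketch, with a
hypothetical helper name and no retry logic:

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: open an attribute file and hand the complete
 * value to a single write(2). Returns 0 if the whole buffer was
 * accepted, -1 otherwise. */
int write_attr_once(const char *path, const char *value)
{
	size_t len = strlen(value);
	ssize_t n;
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;
	n = write(fd, value, len);
	close(fd);
	return n == (ssize_t)len ? 0 : -1;
}
```

A short write is treated as a failure here rather than being continued with
a second write, since a second call would reach store() as a separate
buffer.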
|
||||
|
||||
Other notes:
|
||||
|
||||
- The buffer will always be PAGE_SIZE bytes in length. On i386, this
|
||||
is 4096.
|
||||
|
||||
- show() methods should return the number of bytes printed into the
|
||||
buffer. This is the return value of snprintf().
|
||||
|
||||
- show() should always use snprintf().
|
||||
|
||||
- store() should return the number of bytes used from the buffer. This
|
||||
can be done using strlen().
|
||||
|
||||
- show() or store() can always return errors. If a bad value comes
|
||||
through, be sure to return an error.
|
||||
|
||||
- The object passed to the methods will be pinned in memory via sysfs
|
||||
reference counting its embedded object. However, the physical
|
||||
entity (e.g. device) the object represents may not be present. Be
|
||||
sure to have a way to check this, if necessary.
|
||||
|
||||
|
||||
A very simple (and naive) implementation of a device attribute is:
|
||||
|
||||
static ssize_t show_name(struct device * dev, char * buf)
|
||||
{
|
||||
return snprintf(buf, PAGE_SIZE, "%s\n", dev->name);
|
||||
}
|
||||
|
||||
static ssize_t store_name(struct device * dev, const char * buf)
|
||||
{
|
||||
sscanf(buf, "%20s", dev->name);
|
||||
return strnlen(buf, PAGE_SIZE);
|
||||
}
|
||||
|
||||
static DEVICE_ATTR(name, S_IWUSR | S_IRUGO, show_name, store_name);
|
||||
|
||||
|
||||
(Note that the real implementation doesn't allow userspace to set the
|
||||
name for a device.)
|
||||
|
||||
|
||||
Top Level Directory Layout
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The sysfs directory arrangement exposes the relationship of kernel
|
||||
data structures.
|
||||
|
||||
The top level sysfs directory looks like:
|
||||
|
||||
block/
|
||||
bus/
|
||||
class/
|
||||
devices/
|
||||
firmware/
|
||||
net/
|
||||
fs/
|
||||
|
||||
devices/ contains a filesystem representation of the device tree. It maps
|
||||
directly to the internal kernel device tree, which is a hierarchy of
|
||||
struct device.
|
||||
|
||||
bus/ contains a flat directory layout of the various bus types in the
|
||||
kernel. Each bus's directory contains two subdirectories:
|
||||
|
||||
devices/
|
||||
drivers/
|
||||
|
||||
devices/ contains symlinks for each device discovered in the system
|
||||
that point to the device's directory under root/.
|
||||
|
||||
drivers/ contains a directory for each device driver that is loaded
|
||||
for devices on that particular bus (this assumes that drivers do not
|
||||
span multiple bus types).
|
||||
|
||||
fs/ contains a directory for some filesystems. Currently each
|
||||
filesystem wanting to export attributes must create its own hierarchy
|
||||
below fs/ (see ./fuse.txt for an example).
|
||||
|
||||
|
||||
More information on driver-model specific features can be found in
|
||||
Documentation/driver-model/.
|
||||
|
||||
|
||||
TODO: Finish this section.
|
||||
|
||||
|
||||
Current Interfaces
|
||||
~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The following interface layers currently exist in sysfs:
|
||||
|
||||
|
||||
- devices (include/linux/device.h)
|
||||
----------------------------------
|
||||
Structure:
|
||||
|
||||
struct device_attribute {
|
||||
struct attribute attr;
|
||||
ssize_t (*show)(struct device * dev, char * buf);
|
||||
ssize_t (*store)(struct device * dev, const char * buf);
|
||||
};
|
||||
|
||||
Declaring:
|
||||
|
||||
DEVICE_ATTR(_name, _mode, _show, _store);
|
||||
|
||||
Creation/Removal:
|
||||
|
||||
int device_create_file(struct device *device, struct device_attribute * attr);
|
||||
void device_remove_file(struct device * dev, struct device_attribute * attr);
|
||||
|
||||
|
||||
- bus drivers (include/linux/device.h)
|
||||
--------------------------------------
|
||||
Structure:
|
||||
|
||||
struct bus_attribute {
|
||||
struct attribute attr;
|
||||
ssize_t (*show)(struct bus_type *, char * buf);
|
||||
ssize_t (*store)(struct bus_type *, const char * buf);
|
||||
};
|
||||
|
||||
Declaring:
|
||||
|
||||
BUS_ATTR(_name, _mode, _show, _store)
|
||||
|
||||
Creation/Removal:
|
||||
|
||||
int bus_create_file(struct bus_type *, struct bus_attribute *);
|
||||
void bus_remove_file(struct bus_type *, struct bus_attribute *);
|
||||
|
||||
|
||||
- device drivers (include/linux/device.h)
|
||||
-----------------------------------------
|
||||
|
||||
Structure:
|
||||
|
||||
struct driver_attribute {
|
||||
struct attribute attr;
|
||||
ssize_t (*show)(struct device_driver *, char * buf);
|
||||
ssize_t (*store)(struct device_driver *, const char * buf);
|
||||
};
|
||||
|
||||
Declaring:
|
||||
|
||||
DRIVER_ATTR(_name, _mode, _show, _store)
|
||||
|
||||
Creation/Removal:
|
||||
|
||||
int driver_create_file(struct device_driver *, struct driver_attribute *);
|
||||
void driver_remove_file(struct device_driver *, struct driver_attribute *);
|
||||
|
||||
|
||||
197
Documentation/filesystems/sysv-fs.txt
Normal file
@@ -0,0 +1,197 @@
|
||||
It implements all of
|
||||
- Xenix FS,
|
||||
- SystemV/386 FS,
|
||||
- Coherent FS.
|
||||
|
||||
To install:
|
||||
* Answer the 'System V and Coherent filesystem support' question with 'y'
|
||||
when configuring the kernel.
|
||||
* To mount a disk or a partition, use
|
||||
mount [-r] -t sysv device mountpoint
|
||||
The file system type names
|
||||
-t sysv
|
||||
-t xenix
|
||||
-t coherent
|
||||
may be used interchangeably, but the last two will eventually disappear.
|
||||
|
||||
Bugs in the present implementation:
|
||||
- Coherent FS:
|
||||
- The "free list interleave" n:m is currently ignored.
|
||||
- Only file systems with no filesystem name and no pack name are recognized.
|
||||
(See Coherent "man mkfs" for a description of these features.)
|
||||
- SystemV Release 2 FS:
|
||||
The superblock is only searched in the blocks 9, 15, 18, which
|
||||
corresponds to the beginning of track 1 on floppy disks. No support
|
||||
for this FS on hard disk yet.
|
||||
|
||||
|
||||
These filesystems are rather similar. Here is a comparison with Minix FS:
|
||||
|
||||
* Linux fdisk reports on partitions
|
||||
- Minix FS 0x81 Linux/Minix
|
||||
- Xenix FS ??
|
||||
- SystemV FS ??
|
||||
- Coherent FS 0x08 AIX bootable
|
||||
|
||||
* Size of a block or zone (data allocation unit on disk)
|
||||
- Minix FS 1024
|
||||
- Xenix FS 1024 (also 512 ??)
|
||||
- SystemV FS 1024 (also 512 and 2048)
|
||||
- Coherent FS 512
|
||||
|
||||
* General layout: all have one boot block, one super block and
|
||||
separate areas for inodes and for directories/data.
|
||||
On SystemV Release 2 FS (e.g. Microport) the first track is reserved and
|
||||
all the block numbers (including the super block) are offset by one track.
|
||||
|
||||
* Byte ordering of "short" (16 bit entities) on disk:
|
||||
- Minix FS little endian 0 1
|
||||
- Xenix FS little endian 0 1
|
||||
- SystemV FS little endian 0 1
|
||||
- Coherent FS little endian 0 1
|
||||
Of course, this affects only the file system, not the data of files on it!
|
||||
|
||||
* Byte ordering of "long" (32 bit entities) on disk:
|
||||
- Minix FS little endian 0 1 2 3
|
||||
- Xenix FS little endian 0 1 2 3
|
||||
- SystemV FS little endian 0 1 2 3
|
||||
- Coherent FS PDP-11 2 3 0 1
|
||||
Of course, this affects only the file system, not the data of files on it!
|
||||
|
||||
* Inode on disk: "short", 0 means non-existent, the root dir ino is:
|
||||
- Minix FS 1
|
||||
- Xenix FS, SystemV FS, Coherent FS 2
|
||||
|
||||
* Maximum number of hard links to a file:
|
||||
- Minix FS 250
|
||||
- Xenix FS ??
|
||||
- SystemV FS ??
|
||||
- Coherent FS >=10000
|
||||
|
||||
* Free inode management:
|
||||
- Minix FS a bitmap
|
||||
- Xenix FS, SystemV FS, Coherent FS
|
||||
There is a cache of a certain number of free inodes in the super-block.
|
||||
When it is exhausted, new free inodes are found using a linear search.
|
||||
|
||||
* Free block management:
|
||||
- Minix FS a bitmap
|
||||
- Xenix FS, SystemV FS, Coherent FS
|
||||
Free blocks are organized in a "free list". Maybe a misleading term,
|
||||
since it is not true that every free block contains a pointer to
|
||||
the next free block. Rather, the free blocks are organized in chunks
|
||||
of limited size, and every now and then a free block contains pointers
|
||||
to the free blocks pertaining to the next chunk; the first of these
|
||||
contains pointers and so on. The list terminates with a "block number"
|
||||
0 on Xenix FS and SystemV FS, with a block zeroed out on Coherent FS.
|
||||
|
||||
* Super-block location:
|
||||
- Minix FS block 1 = bytes 1024..2047
|
||||
- Xenix FS block 1 = bytes 1024..2047
|
||||
- SystemV FS bytes 512..1023
|
||||
- Coherent FS block 1 = bytes 512..1023
|
||||
|
||||
* Super-block layout:
|
||||
- Minix FS
|
||||
unsigned short s_ninodes;
|
||||
unsigned short s_nzones;
|
||||
unsigned short s_imap_blocks;
|
||||
unsigned short s_zmap_blocks;
|
||||
unsigned short s_firstdatazone;
|
||||
unsigned short s_log_zone_size;
|
||||
unsigned long s_max_size;
|
||||
unsigned short s_magic;
|
||||
- Xenix FS, SystemV FS, Coherent FS
|
||||
unsigned short s_firstdatazone;
|
||||
unsigned long s_nzones;
|
||||
unsigned short s_fzone_count;
|
||||
unsigned long s_fzones[NICFREE];
|
||||
unsigned short s_finode_count;
|
||||
unsigned short s_finodes[NICINOD];
|
||||
char s_flock;
|
||||
char s_ilock;
|
||||
char s_modified;
|
||||
char s_rdonly;
|
||||
unsigned long s_time;
|
||||
short s_dinfo[4]; -- SystemV FS only
|
||||
unsigned long s_free_zones;
|
||||
unsigned short s_free_inodes;
|
||||
short s_dinfo[4]; -- Xenix FS only
|
||||
unsigned short s_interleave_m,s_interleave_n; -- Coherent FS only
|
||||
char s_fname[6];
|
||||
char s_fpack[6];
|
||||
then they differ considerably:
|
||||
Xenix FS
|
||||
char s_clean;
|
||||
char s_fill[371];
|
||||
long s_magic;
|
||||
long s_type;
|
||||
SystemV FS
|
||||
long s_fill[12 or 14];
|
||||
long s_state;
|
||||
long s_magic;
|
||||
long s_type;
|
||||
Coherent FS
|
||||
unsigned long s_unique;
|
||||
Note that Coherent FS has no magic.
|
||||
|
||||
* Inode layout:
|
||||
- Minix FS
|
||||
unsigned short i_mode;
|
||||
unsigned short i_uid;
|
||||
unsigned long i_size;
|
||||
unsigned long i_time;
|
||||
unsigned char i_gid;
|
||||
unsigned char i_nlinks;
|
||||
unsigned short i_zone[7+1+1];
|
||||
- Xenix FS, SystemV FS, Coherent FS
|
||||
unsigned short i_mode;
|
||||
unsigned short i_nlink;
|
||||
unsigned short i_uid;
|
||||
unsigned short i_gid;
|
||||
unsigned long i_size;
|
||||
unsigned char i_zone[3*(10+1+1+1)];
|
||||
unsigned long i_atime;
|
||||
unsigned long i_mtime;
|
||||
unsigned long i_ctime;
|
||||
|
||||
* Regular file data blocks are organized as
|
||||
- Minix FS
|
||||
7 direct blocks
|
||||
1 indirect block (pointers to blocks)
|
||||
1 double-indirect block (pointer to pointers to blocks)
|
||||
- Xenix FS, SystemV FS, Coherent FS
|
||||
10 direct blocks
|
||||
1 indirect block (pointers to blocks)
|
||||
1 double-indirect block (pointer to pointers to blocks)
|
||||
1 triple-indirect block (pointer to pointers to pointers to blocks)
|
||||
|
||||
* Inode size, inodes per block
|
||||
- Minix FS 32 32
|
||||
- Xenix FS 64 16
|
||||
- SystemV FS 64 16
|
||||
- Coherent FS 64 8
|
||||
|
||||
* Directory entry on disk
|
||||
- Minix FS
|
||||
unsigned short inode;
|
||||
char name[14/30];
|
||||
- Xenix FS, SystemV FS, Coherent FS
|
||||
unsigned short inode;
|
||||
char name[14];
|
||||
|
||||
* Dir entry size, dir entries per block
|
||||
- Minix FS 16/32 64/32
|
||||
- Xenix FS 16 64
|
||||
- SystemV FS 16 64
|
||||
- Coherent FS 16 32
|
||||
|
||||
* How to implement symbolic links such that the host fsck doesn't scream:
|
||||
- Minix FS normal
|
||||
- Xenix FS kludge: as regular files with chmod 1000
|
||||
- SystemV FS ??
|
||||
- Coherent FS kludge: as regular files with chmod 1000
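
The byte ordering of "long" values listed in the comparison above can be
made concrete with two small decoders. This sketch assumes the digits in
the table give, for each byte in on-disk order, its significance index
(0 = least significant byte), so that "2 3 0 1" means the high 16-bit word
comes first with each word stored little endian. The function names are
hypothetical.

```c
#include <stdint.h>

/* Little endian "0 1 2 3": the least significant byte comes first. */
uint32_t le_disk_to_cpu32(const unsigned char b[4])
{
	return ((uint32_t)b[3] << 24) | ((uint32_t)b[2] << 16) |
	       ((uint32_t)b[1] << 8)  |  (uint32_t)b[0];
}

/* PDP-11 "2 3 0 1": disk bytes 0 and 1 hold bits 16..31, disk bytes
 * 2 and 3 hold bits 0..15, each pair least significant byte first. */
uint32_t pdp_disk_to_cpu32(const unsigned char b[4])
{
	return ((uint32_t)b[1] << 24) | ((uint32_t)b[0] << 16) |
	       ((uint32_t)b[3] << 8)  |  (uint32_t)b[2];
}
```

Because both decoders assemble the value from individual bytes, they behave
the same regardless of the byte order of the host CPU.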
|
||||
|
||||
|
||||
Notation: We often speak of a "block" but mean a zone (the allocation unit)
|
||||
and not the disk driver's notion of "block".
|
||||
124
Documentation/filesystems/tmpfs.txt
Normal file
@@ -0,0 +1,124 @@
|
||||
Tmpfs is a file system which keeps all files in virtual memory.
|
||||
|
||||
|
||||
Everything in tmpfs is temporary in the sense that no files will be
|
||||
created on your hard drive. If you unmount a tmpfs instance,
|
||||
everything stored therein is lost.
|
||||
|
||||
tmpfs puts everything into the kernel internal caches and grows and
|
||||
shrinks to accommodate the files it contains and is able to swap
|
||||
unneeded pages out to swap space. It has maximum size limits which can
|
||||
be adjusted on the fly via 'mount -o remount ...'
|
||||
|
||||
If you compare it to ramfs (which was the template to create tmpfs)
|
||||
you gain swapping and limit checking. Another similar thing is the RAM
|
||||
disk (/dev/ram*), which simulates a fixed size hard disk in physical
|
||||
RAM, where you have to create an ordinary filesystem on top. Ramdisks
|
||||
cannot swap and you do not have the possibility to resize them.
|
||||
|
||||
Since tmpfs lives completely in the page cache and on swap, all tmpfs
|
||||
pages currently in memory will show up as cached. It will not show up
|
||||
as shared or something like that. Further on you can check the actual
|
||||
RAM+swap use of a tmpfs instance with df(1) and du(1).
|
||||
|
||||
|
||||
tmpfs has the following uses:
|
||||
|
||||
1) There is always a kernel internal mount which you will not see at
|
||||
all. This is used for shared anonymous mappings and SYSV shared
|
||||
memory.
|
||||
|
||||
This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not
|
||||
set, the user visible part of tmpfs is not built. But the internal
|
||||
mechanisms are always present.
|
||||
|
||||
2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for
|
||||
POSIX shared memory (shm_open, shm_unlink). Adding the following
|
||||
line to /etc/fstab should take care of this:
|
||||
|
||||
tmpfs /dev/shm tmpfs defaults 0 0
|
||||
|
||||
Remember to create the directory that you intend to mount tmpfs on
|
||||
if necessary.
|
||||
|
||||
This mount is _not_ needed for SYSV shared memory. The internal
|
||||
mount is used for that. (In the 2.3 kernel versions it was
|
||||
necessary to mount the predecessor of tmpfs (shm fs) to use SYSV
|
||||
shared memory)
|
||||
|
||||
3) Some people (including me) find it very convenient to mount it
|
||||
e.g. on /tmp and /var/tmp and have a big swap partition. And now
|
||||
loop mounts of tmpfs files do work, so mkinitrd shipped by most
|
||||
distributions should succeed with a tmpfs /tmp.
|
||||
|
||||
4) And probably a lot more I do not know about :-)
|
||||
|
||||
|
||||
tmpfs has three mount options for sizing:
|
||||
|
||||
size: The limit of allocated bytes for this tmpfs instance. The
|
||||
default is half of your physical RAM without swap. If you
|
||||
oversize your tmpfs instances the machine will deadlock
|
||||
since the OOM handler will not be able to free that memory.
|
||||
nr_blocks: The same as size, but in blocks of PAGE_CACHE_SIZE.
|
||||
nr_inodes: The maximum number of inodes for this instance. The default
|
||||
is half of the number of your physical RAM pages, or (on a
|
||||
machine with highmem) the number of lowmem RAM pages,
|
||||
whichever is the lower.
|
||||
|
||||
These parameters accept a suffix k, m or g for kilo, mega and giga and
|
||||
can be changed on remount. The size parameter also accepts a suffix %
|
||||
to limit this tmpfs instance to that percentage of your physical RAM:
|
||||
the default, when neither size nor nr_blocks is specified, is size=50%
|
||||
|
||||
If nr_blocks=0 (or size=0), blocks will not be limited in that instance;
|
||||
if nr_inodes=0, inodes will not be limited. It is generally unwise to
|
||||
mount with such options, since it allows any user with write access to
|
||||
use up all the memory on the machine; but enhances the scalability of
|
||||
that instance in a system with many cpus making intensive use of it.
|
||||
|
||||
|
||||
tmpfs has a mount option to set the NUMA memory allocation policy for
|
||||
all files in that instance (if CONFIG_NUMA is enabled) - which can be
|
||||
adjusted on the fly via 'mount -o remount ...'
|
||||
|
||||
mpol=default prefers to allocate memory from the local node
|
||||
mpol=prefer:Node prefers to allocate memory from the given Node
|
||||
mpol=bind:NodeList allocates memory only from nodes in NodeList
|
||||
mpol=interleave prefers to allocate from each node in turn
|
||||
mpol=interleave:NodeList allocates from each node of NodeList in turn
|
||||
|
||||
NodeList format is a comma-separated list of decimal numbers and ranges,
|
||||
a range being two hyphen-separated decimal numbers, the smallest and
|
||||
largest node numbers in the range. For example, mpol=bind:0-3,5,7,9-15
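The NodeList syntax described above can be sketched in userspace C as follows. This is not the kernel's parser; parse_nodelist is a hypothetical name, and the limit of 64 nodes is only a stand-in for MAX_NUMNODES so that the result fits in one 64-bit mask.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a list such as "0-3,5,7,9-15" into a bitmask with one bit per
 * node. Returns 0 on success, -1 on malformed input or a node >= 64.
 */
int parse_nodelist(const char *s, uint64_t *mask)
{
    assert(s != NULL && mask != NULL);
    *mask = 0;
    while (*s) {
        char *end;
        unsigned long lo, hi;

        if (*s < '0' || *s > '9')
            return -1;
        lo = strtoul(s, &end, 10);
        hi = lo;
        if (*end == '-') {              /* a range: smallest-largest */
            s = end + 1;
            if (*s < '0' || *s > '9')
                return -1;
            hi = strtoul(s, &end, 10);
        }
        if (hi < lo || hi >= 64)
            return -1;
        for (unsigned long n = lo; n <= hi; n++)
            *mask |= (uint64_t)1 << n;
        if (*end == ',') {
            end++;
            if (*end == '\0')           /* trailing comma */
                return -1;
        } else if (*end != '\0') {
            return -1;
        }
        s = end;
    }
    return 0;
}
```

For the example list 0-3,5,7,9-15 this sets bits 0 to 3, 5, 7 and 9 to 15, i.e. the mask 0xFEAF.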
|
||||
|
||||
Note that trying to mount a tmpfs with an mpol option will fail if the
|
||||
running kernel does not support NUMA; and will fail if its nodelist
|
||||
specifies a node >= MAX_NUMNODES. If your system relies on that tmpfs
|
||||
being mounted, but from time to time runs a kernel built without NUMA
|
||||
capability (perhaps a safe recovery kernel), or configured to support
|
||||
fewer nodes, then it is advisable to omit the mpol option from automatic
|
||||
mount options. It can be added later, when the tmpfs is already mounted
|
||||
on MountPoint, by 'mount -o remount,mpol=Policy:NodeList MountPoint'.
|
||||
|
||||
|
||||
To specify the initial root directory you can use the following mount
|
||||
options:
|
||||
|
||||
mode: The permissions as an octal number
|
||||
uid: The user id
|
||||
gid: The group id
|
||||
|
||||
These options do not have any effect on remount. You can change these
|
||||
parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem.
|
||||
|
||||
|
||||
So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs'
|
||||
will give you a tmpfs instance on /mytmpfs which can allocate 10GB
|
||||
RAM/SWAP in 10240 inodes and it is only accessible by root.
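To make the k, m and g suffixes concrete, here is a small userspace sketch of how such a value could be scaled. It is an illustration only, not the kernel's option parser; the name parse_scaled is hypothetical, accepting upper-case suffixes is an assumption based on the "10G" in the example, and the % suffix is not handled.

```c
#include <assert.h>
#include <stdlib.h>

/* Scale a decimal number by an optional k, m or g suffix (powers of 1024). */
unsigned long long parse_scaled(const char *s)
{
    char *end;
    unsigned long long v;

    assert(s != NULL);
    v = strtoull(s, &end, 10);
    switch (*end) {
    case 'g': case 'G':
        v <<= 10;   /* fall through */
    case 'm': case 'M':
        v <<= 10;   /* fall through */
    case 'k': case 'K':
        v <<= 10;
        break;
    default:
        break;
    }
    return v;
}
```

With this sketch, "10k" scales to 10240, which matches the 10240 inodes in the example above.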
|
||||
|
||||
|
||||
Author:
|
||||
Christoph Rohland <cr@sap.com>, 1.12.01
|
||||
Updated:
|
||||
Hugh Dickins <hugh@veritas.com>, 19 February 2006
|
||||
80
Documentation/filesystems/udf.txt
Normal file
@@ -0,0 +1,80 @@
|
||||
*
|
||||
* Documentation/filesystems/udf.txt
|
||||
*
|
||||
UDF Filesystem version 0.9.8.1
|
||||
|
||||
If you encounter problems with reading UDF discs using this driver,
|
||||
please report them to linux_udf@hpesjro.fc.hp.com, which is the
|
||||
developer's list.
|
||||
|
||||
Write support requires a block driver which supports writing. Currently
|
||||
dvd+rw drives and media support true random sector writes, and so a udf
|
||||
filesystem on such devices can be directly mounted read/write. CD-RW
|
||||
media, however, does not support this. Instead, the media can be formatted
|
||||
for packet mode using the utility cdrwtool, then the pktcdvd driver can
|
||||
be bound to the underlying cd device to provide the required buffering
|
||||
and read-modify-write cycles to allow the filesystem random sector writes
|
||||
while providing the hardware with only full packet writes. While not
|
||||
required for dvd+rw media, use of the pktcdvd driver often enhances
|
||||
performance due to very poor read-modify-write support supplied internally
|
||||
by drive firmware.
|
||||
|
||||
-------------------------------------------------------------------------------
|
||||
The following mount options are supported:
|
||||
|
||||
gid= Set the default group.
|
||||
umask= Set the default umask.
|
||||
uid= Set the default user.
|
||||
bs= Set the block size.
|
||||
unhide Show otherwise hidden files.
|
||||
undelete Show deleted files in lists.
|
||||
adinicb Embed data in the inode (default)
|
||||
noadinicb Don't embed data in the inode
|
||||
shortad Use short ad's
|
||||
longad Use long ad's (default)
|
||||
nostrict Unset strict conformance
|
||||
iocharset= Set the NLS character set
|
||||
|
||||
The uid= and gid= options need a bit more explaining. They will accept a
|
||||
decimal numeric value which will be used as the default ID for that mount.
|
||||
They will also accept the strings "ignore" and "forget". For files on the disk
|
||||
that are owned by nobody ( -1 ), they will instead look as if they are owned
|
||||
by the default ID. The ignore option causes the default ID to override all
|
||||
IDs on the disk, not just -1. The forget option causes all IDs to be written
|
||||
to disk as -1, so when the media is later remounted, they will appear to be
|
||||
owned by whatever default ID it is mounted with at that time.
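The uid=/gid= rules above can be summarized in a few lines of C. This is a sketch of the described behaviour, not the driver's code: the function names are hypothetical and the 32-bit ID width is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define ID_NOBODY ((uint32_t)-1)   /* the "-1" owner; 32-bit width assumed */

/* ID shown to the user for a file whose on-disk owner is disk_id. */
uint32_t udf_shown_id(uint32_t disk_id, uint32_t default_id, int ignore)
{
    assert(ignore == 0 || ignore == 1);
    if (ignore || disk_id == ID_NOBODY)
        return default_id;      /* default ID overrides */
    return disk_id;
}

/* ID written to disk when a file owned by `id` is saved. */
uint32_t udf_stored_id(uint32_t id, int forget)
{
    assert(forget == 0 || forget == 1);
    return forget ? ID_NOBODY : id;
}
```

With both flags set, every file is stored as -1 and shown as the default ID, which is the combination the text recommends for desktop use.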
|
||||
|
||||
For typical desktop use of removable media, you should set the ID to that
|
||||
of the interactively logged on user, and also specify both the forget and
|
||||
ignore options. This way the interactive user will always see the files
|
||||
on the disk as belonging to him.
|
||||
|
||||
The remaining are for debugging and disaster recovery:
|
||||
|
||||
novrs Skip volume sequence recognition
|
||||
|
||||
The following expect an offset from 0.
|
||||
|
||||
session= Set the CDROM session (default= last session)
|
||||
anchor= Override standard anchor location. (default= 256)
|
||||
volume= Override the VolumeDesc location. (unused)
|
||||
partition= Override the PartitionDesc location. (unused)
|
||||
lastblock= Set the last block of the filesystem.
|
||||
|
||||
The following expect an offset from the partition root.
|
||||
|
||||
fileset= Override the fileset block location. (unused)
|
||||
rootdir= Override the root directory location. (unused)
|
||||
WARNING: overriding the rootdir to a non-directory may
|
||||
yield highly unpredictable results.
|
||||
-------------------------------------------------------------------------------
|
||||
|
||||
|
||||
For the latest version and toolset see:
|
||||
http://linux-udf.sourceforge.net/
|
||||
|
||||
Documentation on UDF and ECMA 167 is available FREE from:
|
||||
http://www.osta.org/
|
||||
http://www.ecma-international.org/
|
||||
|
||||
Ben Fennema <bfennema@falcon.csc.calpoly.edu>
|
||||
60
Documentation/filesystems/ufs.txt
Normal file
@@ -0,0 +1,60 @@
|
||||
USING UFS
|
||||
=========
|
||||
|
||||
mount -t ufs -o ufstype=type_of_ufs device dir
|
||||
|
||||
|
||||
UFS OPTIONS
|
||||
===========
|
||||
|
||||
ufstype=type_of_ufs
|
||||
UFS is a file system widely used in different operating systems.
|
||||
The problem is the differences among implementations. Features of
|
||||
some implementations are undocumented, so it's hard to recognize
|
||||
the type of ufs automatically. That's why the user must specify the type of
|
||||
ufs manually with the mount option ufstype. Possible values are:
|
||||
|
||||
old old format of ufs
|
||||
default value, supported as read-only
|
||||
|
||||
44bsd used in FreeBSD, NetBSD, OpenBSD
|
||||
supported as read-write
|
||||
|
||||
ufs2 used in FreeBSD 5.x
|
||||
supported as read-write
|
||||
|
||||
5xbsd synonym for ufs2
|
||||
|
||||
sun used in SunOS (Solaris)
|
||||
supported as read-write
|
||||
|
||||
sunx86 used in SunOS for Intel (Solarisx86)
|
||||
supported as read-write
|
||||
|
||||
hp used in HP-UX
|
||||
supported as read-only
|
||||
|
||||
nextstep
|
||||
used in NextStep
|
||||
supported as read-only
|
||||
|
||||
nextstep-cd
|
||||
used for NextStep CDROMs (block_size == 2048)
|
||||
supported as read-only
|
||||
|
||||
openstep
|
||||
used in OpenStep
|
||||
supported as read-only
|
||||
|
||||
|
||||
POSSIBLE PROBLEMS
|
||||
=================
|
||||
|
||||
See the next section if you have any.
|
||||
|
||||
|
||||
BUG REPORTS
|
||||
===========
|
||||
|
||||
Send any ufs bug reports to daniel.pirkl@email.cz or
|
||||
to dushistov@mail.ru (do not send partition table bug reports).
|
||||
231
Documentation/filesystems/vfat.txt
Normal file
@@ -0,0 +1,231 @@
|
||||
USING VFAT
|
||||
----------------------------------------------------------------------
|
||||
To use the vfat filesystem, use the filesystem type 'vfat', e.g.
|
||||
mount -t vfat /dev/fd0 /mnt
|
||||
|
||||
No special partition formatter is required. mkdosfs will work fine
|
||||
if you want to format from within Linux.
|
||||
|
||||
VFAT MOUNT OPTIONS
|
||||
----------------------------------------------------------------------
|
||||
umask=### -- The permission mask (for files and directories, see umask(1)).
|
||||
The default is the umask of the current process.
|
||||
|
||||
dmask=### -- The permission mask for the directory.
|
||||
The default is the umask of the current process.
|
||||
|
||||
fmask=### -- The permission mask for files.
|
||||
The default is the umask of the current process.
|
||||
|
||||
codepage=### -- Sets the codepage number for converting to shortname
|
||||
characters on FAT filesystem.
|
||||
By default, FAT_DEFAULT_CODEPAGE setting is used.
|
||||
|
||||
iocharset=name -- Character set to use for converting between the
|
||||
encoding used for user visible filenames and 16 bit
|
||||
Unicode characters. Long filenames are stored on disk
|
||||
in Unicode format, but Unix for the most part doesn't
|
||||
know how to deal with Unicode.
|
||||
By default, FAT_DEFAULT_IOCHARSET setting is used.
|
||||
|
||||
There is also an option of doing UTF-8 translations
|
||||
with the utf8 option.
|
||||
|
||||
NOTE: "iocharset=utf8" is not recommended. If unsure,
|
||||
you should consider the following option instead.
|
||||
|
||||
utf8=<bool> -- UTF-8 is the filesystem safe version of Unicode that
|
||||
is used by the console. It can be enabled for the
|
||||
filesystem with this option. If 'uni_xlate' gets set,
|
||||
UTF-8 gets disabled.
|
||||
|
||||
uni_xlate=<bool> -- Translate unhandled Unicode characters to special
|
||||
escaped sequences. This would let you backup and
|
||||
restore filenames that are created with any Unicode
|
||||
characters. Until Linux supports Unicode for real,
|
||||
this gives you an alternative. Without this option,
|
||||
a '?' is used when no translation is possible. The
|
||||
escape character is ':' because it is otherwise
|
||||
illegal on the vfat filesystem. The escape sequence
|
||||
that gets used is ':' and the four digits of hexadecimal
|
||||
unicode.
|
||||
|
||||
nonumtail=<bool> -- When creating 8.3 aliases, normally the alias will
|
||||
end in '~1' or tilde followed by some number. If this
|
||||
option is set, then if the filename is
|
||||
"longfilename.txt" and "longfile.txt" does not
|
||||
currently exist in the directory, 'longfile.txt' will
|
||||
be the short alias instead of 'longfi~1.txt'.
|
||||
|
||||
quiet -- Stops printing certain warning messages.
|
||||
|
||||
check=s|r|n -- Case sensitivity checking setting.
|
||||
s: strict, case sensitive
|
||||
r: relaxed, case insensitive
|
||||
n: normal, default setting, currently case insensitive
|
||||
|
||||
shortname=lower|win95|winnt|mixed
|
||||
-- Shortname display/create setting.
|
||||
lower: convert to lowercase for display,
|
||||
emulate the Windows 95 rule for create.
|
||||
win95: emulate the Windows 95 rule for display/create.
|
||||
winnt: emulate the Windows NT rule for display/create.
|
||||
mixed: emulate the Windows NT rule for display,
|
||||
emulate the Windows 95 rule for create.
|
||||
Default setting is `lower'.
|
||||
|
||||
<bool>: 0,1,yes,no,true,false
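The uni_xlate escape format described above (a ':' followed by the four hexadecimal digits of the character) can be illustrated with a short C sketch. The name uni_escape is hypothetical, and the use of lower-case hex digits is an assumption; the text does not say which case the driver emits.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Write ':' plus four hex digits for the 16-bit character `cu` into
 * `out`, which must hold 6 bytes including the terminating NUL.
 */
void uni_escape(uint16_t cu, char out[6])
{
    assert(out != NULL);
    snprintf(out, 6, ":%04x", (unsigned)cu);
}
```

Because ':' cannot appear in a vfat filename, a reader of such a name can tell an escape sequence from ordinary text.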
|
||||
|
||||
TODO
|
||||
----------------------------------------------------------------------
|
||||
* Need to get rid of the raw scanning stuff. Instead, always use
|
||||
a get next directory entry approach. The only thing left that uses
|
||||
raw scanning is the directory renaming code.
|
||||
|
||||
|
||||
POSSIBLE PROBLEMS
|
||||
----------------------------------------------------------------------
|
||||
* vfat_valid_longname does not properly check reserved names.
|
||||
* When a volume name is the same as a directory name in the root
|
||||
directory of the filesystem, the directory name sometimes shows
|
||||
up as an empty file.
|
||||
* autoconv option does not work correctly.
|
||||
|
||||
BUG REPORTS
|
||||
----------------------------------------------------------------------
|
||||
If you have trouble with the VFAT filesystem, mail bug reports to
|
||||
chaffee@bmrc.cs.berkeley.edu. Please specify the filename
|
||||
and the operation that gave you trouble.
|
||||
|
||||
TEST SUITE
|
||||
----------------------------------------------------------------------
|
||||
If you plan to make any modifications to the vfat filesystem, please
|
||||
get the test suite that comes with the vfat distribution at
|
||||
|
||||
http://bmrc.berkeley.edu/people/chaffee/vfat.html
|
||||
|
||||
This tests quite a few parts of the vfat filesystem and additional
|
||||
tests for new features or untested features would be appreciated.
|
||||
|
||||
NOTES ON THE STRUCTURE OF THE VFAT FILESYSTEM
|
||||
----------------------------------------------------------------------
|
||||
(This documentation was provided by Galen C. Hunt <gchunt@cs.rochester.edu>
|
||||
and lightly annotated by Gordon Chaffee).
|
||||
|
||||
This document presents a very rough, technical overview of my
|
||||
knowledge of the extended FAT file system used in Windows NT 3.5 and
|
||||
Windows 95. I don't guarantee that any of the following is correct,
|
||||
but it appears to be so.
|
||||
|
||||
The extended FAT file system is almost identical to the FAT
|
||||
file system used in DOS versions up to and including 6.223410239847
|
||||
:-). The significant change has been the addition of long file names.
|
||||
These names support up to 255 characters including spaces and lower
|
||||
case characters as opposed to the traditional 8.3 short names.
|
||||
|
||||
Here is the description of the traditional FAT entry in the current
|
||||
Windows 95 filesystem:
|
||||
|
||||
struct directory { // Short 8.3 names
|
||||
unsigned char name[8]; // file name
|
||||
unsigned char ext[3]; // file extension
|
||||
unsigned char attr; // attribute byte
|
||||
unsigned char lcase; // Case for base and extension
|
||||
unsigned char ctime_ms; // Creation time, milliseconds
|
||||
unsigned char ctime[2]; // Creation time
|
||||
unsigned char cdate[2]; // Creation date
|
||||
unsigned char adate[2]; // Last access date
|
||||
unsigned char reserved[2]; // reserved values (ignored)
|
||||
unsigned char time[2]; // time stamp
|
||||
unsigned char date[2]; // date stamp
|
||||
unsigned char start[2]; // starting cluster number
|
||||
unsigned char size[4]; // size of the file
|
||||
};
|
||||
|
||||
The lcase field specifies if the base and/or the extension of an 8.3
|
||||
name should be capitalized. This field does not seem to be used by
|
||||
Windows 95 but it is used by Windows NT. The case of filenames is not
|
||||
completely compatible from Windows NT to Windows 95. It is not completely
|
||||
compatible in the reverse direction, however. Filenames that fit in
|
||||
the 8.3 namespace and are written on Windows NT to be lowercase will
|
||||
show up as uppercase on Windows 95.
|
||||
|
||||
Note that the "start" and "size" values are actually little
|
||||
endian integer values. The descriptions of the fields in this
|
||||
structure are public knowledge and can be found elsewhere.
|
||||
|
||||
With the extended FAT system, Microsoft has inserted extra
|
||||
directory entries for any files with extended names. (Any name which
|
||||
legally fits within the old 8.3 encoding scheme does not have extra
|
||||
entries.) I call these extra entries slots. Basically, a slot is a
|
||||
specially formatted directory entry which holds up to 13 characters of
|
||||
a file's extended name. Think of slots as additional labeling for the
|
||||
directory entry of the file to which they correspond. Microsoft
|
||||
prefers to refer to the 8.3 entry for a file as its alias and the
|
||||
extended slot directory entries as the file name.
|
||||
|
||||
The C structure for a slot directory entry follows:
|
||||
|
||||
struct slot { // Up to 13 characters of a long name
|
||||
unsigned char id; // sequence number for slot
|
||||
unsigned char name0_4[10]; // first 5 characters in name
|
||||
unsigned char attr; // attribute byte
|
||||
unsigned char reserved; // always 0
|
||||
unsigned char alias_checksum; // checksum for 8.3 alias
|
||||
unsigned char name5_10[12]; // 6 more characters in name
|
||||
unsigned char start[2]; // starting cluster number
|
||||
unsigned char name11_12[4]; // last 2 characters in name
|
||||
};
|
||||
|
||||
If the layout of the slots looks a little odd, it's only
|
||||
because of Microsoft's efforts to maintain compatibility with old
|
||||
software. The slots must be disguised to prevent old software from
|
||||
panicking. To this end, a number of measures are taken:
|
||||
|
||||
1) The attribute byte for a slot directory entry is always set
|
||||
to 0x0f. This corresponds to an old directory entry with
|
||||
attributes of "hidden", "system", "read-only", and "volume
|
||||
label". Most old software will ignore any directory
|
||||
entries with the "volume label" bit set. Real volume label
|
||||
entries don't have the other three bits set.
|
||||
|
||||
2) The starting cluster is always set to 0, an impossible
|
||||
value for a DOS file.
|
||||
|
||||
Because the extended FAT system is backward compatible, it is
|
||||
possible for old software to modify directory entries. Measures must
|
||||
be taken to ensure the validity of slots. An extended FAT system can
|
||||
verify that a slot does in fact belong to an 8.3 directory entry by
|
||||
the following:
|
||||
|
||||
1) Positioning. Slots for a file always immediately precede
|
||||
their corresponding 8.3 directory entry. In addition, each
|
||||
slot has an id which marks its order in the extended file
|
||||
name. Here is a very abbreviated view of an 8.3 directory
|
||||
entry and its corresponding long name slots for the file
|
||||
"My Big File.Extension which is long":
|
||||
|
||||
<preceding files...>
|
||||
<slot #3, id = 0x43, characters = "h is long">
|
||||
<slot #2, id = 0x02, characters = "xtension whic">
|
||||
<slot #1, id = 0x01, characters = "My Big File.E">
|
||||
<directory entry, name = "MYBIGFIL.EXT">
|
||||
|
||||
Note that the slots are stored from last to first. Slots
|
||||
are numbered from 1 to N. The Nth slot is or'ed with 0x40
|
||||
to mark it as the last one.
|
||||
|
||||
2) Checksum. Each slot has an "alias_checksum" value. The
|
||||
checksum is calculated from the 8.3 name using the
|
||||
following algorithm:
|
||||
|
||||
for (sum = i = 0; i < 11; i++) {
|
||||
sum = (((sum&1)<<7)|((sum&0xfe)>>1)) + name[i];
|
||||
}
|
||||
|
||||
3) If there is free space in the final slot, a Unicode NULL (0x0000)
|
||||
is stored after the final character. After that, all unused
|
||||
characters in the final slot are set to Unicode 0xFFFF.
|
||||
|
||||
Finally, note that the extended name is stored in Unicode. Each Unicode
|
||||
character takes two bytes.
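Putting the slot rules together, the sketch below decodes one slot: it splits the id into sequence number and "last slot" marker and collects the name characters up to the 0x0000 terminator. The helper names are hypothetical, and the little endian byte order of each two-byte character is an assumption that the text does not state explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct slot {                       /* same layout as the structure above */
    unsigned char id;
    unsigned char name0_4[10];
    unsigned char attr;
    unsigned char reserved;
    unsigned char alias_checksum;
    unsigned char name5_10[12];
    unsigned char start[2];
    unsigned char name11_12[4];
};

/* Sequence number with the 0x40 "last slot" marker masked off. */
unsigned slot_seq(const struct slot *s) { return s->id & ~0x40u; }
int slot_is_last(const struct slot *s) { return (s->id & 0x40) != 0; }

/* Append two-byte units from src; return 1 once 0x0000 is seen. */
static int take_units(const unsigned char *src, size_t nunits,
                      uint16_t *out, size_t *n)
{
    for (size_t i = 0; i < nunits; i++) {
        uint16_t cu = (uint16_t)(src[2 * i] | (src[2 * i + 1] << 8));
        if (cu == 0x0000)
            return 1;
        out[(*n)++] = cu;
    }
    return 0;
}

/* Copy the slot's characters (at most 13) into out; return the count. */
size_t slot_chars(const struct slot *s, uint16_t out[13])
{
    size_t n = 0;

    assert(s != NULL && out != NULL);
    if (!take_units(s->name0_4, 5, out, &n) &&
        !take_units(s->name5_10, 6, out, &n))
        take_units(s->name11_12, 2, out, &n);
    return n;
}
```

For slot #3 in the example above (id = 0x43), slot_seq() gives 3, slot_is_last() is true, and slot_chars() returns the 9 characters of "h is long".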
|
||||
931
Documentation/filesystems/vfs.txt
Normal file
@@ -0,0 +1,931 @@
|
||||
|
||||
Overview of the Linux Virtual File System
|
||||
|
||||
Original author: Richard Gooch <rgooch@atnf.csiro.au>
|
||||
|
||||
Last updated on October 28, 2005
|
||||
|
||||
Copyright (C) 1999 Richard Gooch
|
||||
Copyright (C) 2005 Pekka Enberg
|
||||
|
||||
This file is released under the GPLv2.
|
||||
|
||||
|
||||
Introduction
|
||||
============
|
||||
|
||||
The Virtual File System (also known as the Virtual Filesystem Switch)
|
||||
is the software layer in the kernel that provides the filesystem
|
||||
interface to userspace programs. It also provides an abstraction
|
||||
within the kernel which allows different filesystem implementations to
|
||||
coexist.
|
||||
|
||||
VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so
|
||||
on are called from a process context. Filesystem locking is described
|
||||
in the document Documentation/filesystems/Locking.
|
||||
|
||||
|
||||
Directory Entry Cache (dcache)
|
||||
------------------------------
|
||||
|
||||
The VFS implements the open(2), stat(2), chmod(2), and similar system
|
||||
calls. The pathname argument that is passed to them is used by the VFS
|
||||
to search through the directory entry cache (also known as the dentry
|
||||
cache or dcache). This provides a very fast look-up mechanism to
|
||||
translate a pathname (filename) into a specific dentry. Dentries live
|
||||
in RAM and are never saved to disc: they exist only for performance.
|
||||
|
||||
The dentry cache is meant to be a view into your entire filespace. As
|
||||
most computers cannot fit all dentries in the RAM at the same time,
|
||||
some bits of the cache are missing. In order to resolve your pathname
|
||||
into a dentry, the VFS may have to resort to creating dentries along
|
||||
the way, and then loading the inode. This is done by looking up the
|
||||
inode.
|
||||
|
||||
|
||||
The Inode Object
|
||||
----------------
|
||||
|
||||
An individual dentry usually has a pointer to an inode. Inodes are
|
||||
filesystem objects such as regular files, directories, FIFOs and other
|
||||
beasts. They live either on the disc (for block device filesystems)
|
||||
or in the memory (for pseudo filesystems). Inodes that live on the
|
||||
disc are copied into the memory when required and changes to the inode
|
||||
are written back to disc. A single inode can be pointed to by multiple
|
||||
dentries (hard links, for example, do this).
|
||||
|
||||
To look up an inode requires that the VFS calls the lookup() method of
|
||||
the parent directory inode. This method is installed by the specific
|
||||
filesystem implementation that the inode lives in. Once the VFS has
|
||||
the required dentry (and hence the inode), we can do all those boring
|
||||
things like open(2) the file, or stat(2) it to peek at the inode
|
||||
data. The stat(2) operation is fairly simple: once the VFS has the
|
||||
dentry, it peeks at the inode data and passes some of it back to
|
||||
userspace.
|
||||
|
||||
|
||||
The File Object
|
||||
---------------
|
||||
|
||||
Opening a file requires another operation: allocation of a file
|
||||
structure (this is the kernel-side implementation of file
|
||||
descriptors). The freshly allocated file structure is initialized with
|
||||
a pointer to the dentry and a set of file operation member functions.
|
||||
These are taken from the inode data. The open() file method is then
|
||||
called so the specific filesystem implementation can do its work. You
|
||||
can see that this is another switch performed by the VFS. The file
|
||||
structure is placed into the file descriptor table for the process.
|
||||
|
||||
Reading, writing and closing files (and other assorted VFS operations)
|
||||
are done by using the userspace file descriptor to grab the appropriate
|
||||
file structure, and then calling the required file structure method to
|
||||
do whatever is required. For as long as the file is open, it keeps the
|
||||
dentry in use, which in turn means that the VFS inode is still in use.
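The "switch" described above is ordinary dispatch through a table of function pointers. The following is a userspace toy model of that idea, not kernel code: every toy_ name is invented for this sketch, and the real file structure and file operations have many more members.

```c
#include <assert.h>
#include <stddef.h>

struct toy_file;

struct toy_file_ops {
    long (*read)(struct toy_file *f, char *buf, size_t len);
};

struct toy_file {
    const struct toy_file_ops *f_op;   /* chosen when the file is opened */
    size_t pos;
};

/* One "filesystem": every read fills the buffer with 'z'. */
static long zero_read(struct toy_file *f, char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = 'z';
    f->pos += len;
    return (long)len;
}

const struct toy_file_ops zero_ops = { .read = zero_read };

/* The generic layer: find the method for this file and call it. */
long toy_vfs_read(struct toy_file *f, char *buf, size_t len)
{
    assert(f != NULL && f->f_op != NULL);
    if (f->f_op->read == NULL)
        return -1;                     /* method not provided */
    return f->f_op->read(f, buf, len);
}
```

The generic function never needs to know which filesystem it is talking to; it only follows the pointer stored in the file structure at open time.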
|
||||
|
||||
|
||||
Registering and Mounting a Filesystem
|
||||
=====================================
|
||||
|
||||
To register and unregister a filesystem, use the following API
|
||||
functions:
|
||||
|
||||
#include <linux/fs.h>
|
||||
|
||||
extern int register_filesystem(struct file_system_type *);
|
||||
extern int unregister_filesystem(struct file_system_type *);
|
||||
|
||||
The passed struct file_system_type describes your filesystem. When a
|
||||
request is made to mount a device onto a directory in your filespace,
|
||||
the VFS will call the appropriate get_sb() method for the specific
|
||||
filesystem. The dentry for the mount point will then be updated to
|
||||
point to the root inode for the new filesystem.
|
||||
|
||||
You can see all filesystems that are registered to the kernel in the
|
||||
file /proc/filesystems.
|
||||
|
||||
|
||||
struct file_system_type
|
||||
-----------------------
|
||||
|
||||
This describes the filesystem. As of kernel 2.6.13, the following
|
||||
members are defined:
|
||||
|
||||
struct file_system_type {
|
||||
const char *name;
|
||||
int fs_flags;
|
||||
int (*get_sb) (struct file_system_type *, int,
|
||||
const char *, void *, struct vfsmount *);
|
||||
void (*kill_sb) (struct super_block *);
|
||||
struct module *owner;
|
||||
struct file_system_type * next;
|
||||
struct list_head fs_supers;
|
||||
};
|
||||
|
||||
name: the name of the filesystem type, such as "ext2", "iso9660",
|
||||
"msdos" and so on
|
||||
|
||||
fs_flags: various flags (e.g. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.)
|
||||
|
||||
get_sb: the method to call when a new instance of this
|
||||
filesystem should be mounted
|
||||
|
||||
kill_sb: the method to call when an instance of this filesystem
|
||||
should be unmounted
|
||||
|
||||
owner: for internal VFS use: you should initialize this to THIS_MODULE in
|
||||
most cases.
|
||||
|
||||
next: for internal VFS use: you should initialize this to NULL
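To show why "next" is reserved for internal use, here is a userspace toy model of a registry that chains filesystem types through that member. It is only a sketch under stated assumptions: the toy_ names are invented, the return value -1 is a placeholder for whatever error code the kernel uses, and the real register_filesystem() does more than this.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_fs_type {
    const char *name;
    struct toy_fs_type *next;   /* owned by the registry; start as NULL */
};

static struct toy_fs_type *toy_fs_list;

/* Append fs to the list unless the name is taken; 0 on success. */
int toy_register(struct toy_fs_type *fs)
{
    struct toy_fs_type **p;

    assert(fs != NULL && fs->name != NULL);
    if (fs->next != NULL)
        return -1;              /* caller did not initialize next */
    for (p = &toy_fs_list; *p != NULL; p = &(*p)->next)
        if (strcmp((*p)->name, fs->name) == 0)
            return -1;          /* name already registered */
    *p = fs;
    return 0;
}

/* Unlink fs from the list; 0 on success, -1 if it was not registered. */
int toy_unregister(struct toy_fs_type *fs)
{
    struct toy_fs_type **p;

    for (p = &toy_fs_list; *p != NULL; p = &(*p)->next)
        if (*p == fs) {
            *p = fs->next;
            fs->next = NULL;
            return 0;
        }
    return -1;
}
```

Walking such a list and printing each name is, in spirit, what produces a listing like the one in /proc/filesystems.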
|
||||
|
||||
The get_sb() method has the following arguments:
|
||||
|
||||
struct super_block *sb: the superblock structure. This is partially
|
||||
initialized by the VFS and the rest must be initialized by the
|
||||
get_sb() method
|
||||
|
||||
int flags: mount flags
|
||||
|
||||
const char *dev_name: the device name we are mounting.
|
||||
|
||||
void *data: arbitrary mount options, usually comes as an ASCII
|
||||
string
|
||||
|
||||
int silent: whether or not to be silent on error
|
||||
|
||||
The get_sb() method must determine if the block device specified
|
||||
in the superblock contains a filesystem of the type the method
|
||||
supports. On success the method returns the superblock pointer, on
|
||||
failure it returns NULL.
|
||||
|
||||
The most interesting member of the superblock structure that the
|
||||
get_sb() method fills in is the "s_op" field. This is a pointer to
|
||||
a "struct super_operations" which describes the next level of the
|
||||
filesystem implementation.
|
||||
|
||||
Usually, a filesystem uses one of the generic get_sb() implementations
|
||||
and provides a fill_super() method instead. The generic methods are:
|
||||
|
||||
get_sb_bdev: mount a filesystem residing on a block device
|
||||
|
||||
get_sb_nodev: mount a filesystem that is not backed by a device
|
||||
|
||||
get_sb_single: mount a filesystem which shares the instance between
|
||||
all mounts
|
||||
|
||||
A fill_super() method implementation has the following arguments:
|
||||
|
||||
struct super_block *sb: the superblock structure. The method fill_super()
|
||||
must initialize this properly.
|
||||
|
||||
void *data: arbitrary mount options, usually comes as an ASCII
|
||||
string
|
||||
|
||||
int silent: whether or not to be silent on error
|
||||
|
||||
|
||||
The Superblock Object
|
||||
=====================
|
||||
|
||||
A superblock object represents a mounted filesystem.
|
||||
|
||||
|
||||
struct super_operations
|
||||
-----------------------
|
||||
|
||||
This describes how the VFS can manipulate the superblock of your
|
||||
filesystem. As of kernel 2.6.13, the following members are defined:
|
||||
|
||||
struct super_operations {
|
||||
struct inode *(*alloc_inode)(struct super_block *sb);
|
||||
void (*destroy_inode)(struct inode *);
|
||||
|
||||
void (*read_inode) (struct inode *);
|
||||
|
||||
void (*dirty_inode) (struct inode *);
|
||||
int (*write_inode) (struct inode *, int);
|
||||
void (*put_inode) (struct inode *);
|
||||
void (*drop_inode) (struct inode *);
|
||||
void (*delete_inode) (struct inode *);
|
||||
void (*put_super) (struct super_block *);
|
||||
void (*write_super) (struct super_block *);
|
||||
int (*sync_fs)(struct super_block *sb, int wait);
|
||||
void (*write_super_lockfs) (struct super_block *);
|
||||
void (*unlockfs) (struct super_block *);
|
||||
int (*statfs) (struct dentry *, struct kstatfs *);
|
||||
int (*remount_fs) (struct super_block *, int *, char *);
|
||||
void (*clear_inode) (struct inode *);
|
||||
void (*umount_begin) (struct super_block *);
|
||||
|
||||
void (*sync_inodes) (struct super_block *sb,
|
||||
struct writeback_control *wbc);
|
||||
int (*show_options)(struct seq_file *, struct vfsmount *);
|
||||
|
||||
ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
|
||||
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
|
||||
};
|
||||
|
||||
All methods are called without any locks being held, unless otherwise
|
||||
noted. This means that most methods can block safely. All methods are
|
||||
only called from a process context (i.e. not from an interrupt handler
|
||||
or bottom half).
|
||||
|
||||
alloc_inode: this method is called by alloc_inode() to allocate memory
|
||||
for struct inode and initialize it. If this function is not
|
||||
defined, a simple 'struct inode' is allocated. Normally
|
||||
alloc_inode will be used to allocate a larger structure which
|
||||
contains a 'struct inode' embedded within it.
|
||||
|
||||
destroy_inode: this method is called by destroy_inode() to release
|
||||
resources allocated for struct inode. It is only required if
|
||||
->alloc_inode was defined and simply undoes anything done by
|
||||
->alloc_inode.
|
||||
|
||||
read_inode: this method is called to read a specific inode from the
|
||||
mounted filesystem. The i_ino member in the struct inode is
|
||||
initialized by the VFS to indicate which inode to read. Other
|
||||
members are filled in by this method.
|
||||
|
||||
You can set this to NULL and use iget5_locked() instead of iget()
|
||||
to read inodes. This is necessary for filesystems for which the
|
||||
inode number is not sufficient to identify an inode.
|
||||
|
||||
dirty_inode: this method is called by the VFS to mark an inode dirty.
|
||||
|
||||
write_inode: this method is called when the VFS needs to write an
|
||||
inode to disc. The second parameter indicates whether the write
|
||||
should be synchronous or not; not all filesystems check this flag.
|
||||
|
||||
put_inode: called when the VFS inode is removed from the inode
|
||||
cache.
|
||||
|
||||
drop_inode: called when the last access to the inode is dropped,
|
||||
with the inode_lock spinlock held.
|
||||
|
||||
This method should be either NULL (normal UNIX filesystem
|
||||
semantics) or "generic_delete_inode" (for filesystems that do not
|
||||
want to cache inodes - causing "delete_inode" to always be
|
||||
called regardless of the value of i_nlink)
|
||||
|
||||
The "generic_delete_inode()" behavior is equivalent to the
|
||||
old practice of using "force_delete" in the put_inode() case,
|
||||
but does not have the races that the "force_delete()" approach
|
||||
had.
|
||||
|
||||
delete_inode: called when the VFS wants to delete an inode
|
||||
|
||||
put_super: called when the VFS wishes to free the superblock
|
||||
(i.e. unmount). This is called with the superblock lock held
|
||||
|
||||
write_super: called when the VFS superblock needs to be written to
|
||||
disc. This method is optional
|
||||
|
||||
sync_fs: called when VFS is writing out all dirty data associated with
|
||||
a superblock. The second parameter indicates whether the method
|
||||
should wait until the write out has been completed. Optional.
|
||||
|
||||
write_super_lockfs: called when VFS is locking a filesystem and
|
||||
forcing it into a consistent state. This method is currently
|
||||
used by the Logical Volume Manager (LVM).
|
||||
|
||||
unlockfs: called when VFS is unlocking a filesystem and making it writable
|
||||
again.
|
||||
|
||||
statfs: called when the VFS needs to get filesystem statistics. This
|
||||
is called with the kernel lock held
|
||||
|
||||
remount_fs: called when the filesystem is remounted. This is called
|
||||
with the kernel lock held
|
||||
|
||||
clear_inode: called when the VFS clears the inode. Optional
|
||||
|
||||
umount_begin: called when the VFS is unmounting a filesystem.
|
||||
|
||||
sync_inodes: called when the VFS is writing out dirty data associated with
|
||||
a superblock.
|
||||
|
||||
show_options: called by the VFS to show mount options for /proc/<pid>/mounts.
|
||||
|
||||
quota_read: called by the VFS to read from filesystem quota file.
|
||||
|
||||
quota_write: called by the VFS to write to filesystem quota file.
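
The alloc_inode and destroy_inode entries above describe allocating a larger
structure with a 'struct inode' embedded in it. A minimal userspace sketch of
that embedding pattern follows. struct mock_inode, struct myfs_inode_info,
MYFS_I(), myfs_alloc_inode() and myfs_destroy_inode() are hypothetical names,
and calloc()/free() stand in for whatever allocator a real filesystem uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for the kernel's struct inode. */
struct mock_inode {
	unsigned long i_ino;
};

/* Filesystem-private inode: the generic inode is embedded inside it. */
struct myfs_inode_info {
	unsigned long first_block;	/* example of per-filesystem state */
	struct mock_inode vfs_inode;
};

/* Recover the containing structure from a pointer to the embedded inode. */
struct myfs_inode_info *MYFS_I(struct mock_inode *inode)
{
	return (struct myfs_inode_info *)
		((char *)inode - offsetof(struct myfs_inode_info, vfs_inode));
}

/* alloc_inode-style helper: allocate the larger structure, return the inner one. */
struct mock_inode *myfs_alloc_inode(void)
{
	struct myfs_inode_info *info = calloc(1, sizeof(*info));

	return info ? &info->vfs_inode : NULL;
}

/* destroy_inode-style helper: undo exactly what myfs_alloc_inode() did. */
void myfs_destroy_inode(struct mock_inode *inode)
{
	free(MYFS_I(inode));
}
```

Generic code only ever sees the struct mock_inode pointer; the filesystem
gets back to its private fields with MYFS_I(inode).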
|
||||
|
||||
The read_inode() method is responsible for filling in the "i_op"
|
||||
field. This is a pointer to a "struct inode_operations" which
|
||||
describes the methods that can be performed on individual inodes.
|
||||
|
||||
|
||||
The Inode Object
|
||||
================
|
||||
|
||||
An inode object represents an object within the filesystem.
|
||||
|
||||
|
||||
struct inode_operations
|
||||
-----------------------
|
||||
|
||||
This describes how the VFS can manipulate an inode in your
|
||||
filesystem. As of kernel 2.6.13, the following members are defined:
|
||||
|
||||
struct inode_operations {
|
||||
int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
|
||||
struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
|
||||
int (*link) (struct dentry *,struct inode *,struct dentry *);
|
||||
int (*unlink) (struct inode *,struct dentry *);
|
||||
int (*symlink) (struct inode *,struct dentry *,const char *);
|
||||
int (*mkdir) (struct inode *,struct dentry *,int);
|
||||
int (*rmdir) (struct inode *,struct dentry *);
|
||||
int (*mknod) (struct inode *,struct dentry *,int,dev_t);
|
||||
int (*rename) (struct inode *, struct dentry *,
|
||||
struct inode *, struct dentry *);
|
||||
int (*readlink) (struct dentry *, char __user *,int);
|
||||
void * (*follow_link) (struct dentry *, struct nameidata *);
|
||||
void (*put_link) (struct dentry *, struct nameidata *, void *);
|
||||
void (*truncate) (struct inode *);
|
||||
int (*permission) (struct inode *, int, struct nameidata *);
|
||||
int (*setattr) (struct dentry *, struct iattr *);
|
||||
int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
|
||||
int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
|
||||
ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
|
||||
ssize_t (*listxattr) (struct dentry *, char *, size_t);
|
||||
int (*removexattr) (struct dentry *, const char *);
|
||||
};
|
||||
|
||||
Again, all methods are called without any locks being held, unless
|
||||
otherwise noted.
|
||||
|
||||
create: called by the open(2) and creat(2) system calls. Only
|
||||
required if you want to support regular files. The dentry you
|
||||
get should not have an inode (i.e. it should be a negative
|
||||
dentry). Here you will probably call d_instantiate() with the
|
||||
dentry and the newly created inode
|
||||
|
||||
lookup: called when the VFS needs to look up an inode in a parent
|
||||
directory. The name to look for is found in the dentry. This
|
||||
method must call d_add() to insert the found inode into the
|
||||
dentry. The "i_count" field in the inode structure should be
|
||||
incremented. If the named inode does not exist a NULL inode
|
||||
should be inserted into the dentry (this is called a negative
|
||||
dentry). Returning an error code from this routine must only
|
||||
be done on a real error, otherwise creating inodes with system
|
||||
calls like creat(2), mknod(2), mkdir(2) and so on will fail.
|
||||
If you wish to overload the dentry methods then you should
|
||||
initialise the "d_op" field in the dentry; this is a pointer
|
||||
to a struct "dentry_operations".
|
||||
This method is called with the directory inode semaphore held
|
||||
|
||||
link: called by the link(2) system call. Only required if you want
|
||||
to support hard links. You will probably need to call
|
||||
d_instantiate() just as you would in the create() method
|
||||
|
||||
unlink: called by the unlink(2) system call. Only required if you
|
||||
want to support deleting inodes
|
||||
|
||||
symlink: called by the symlink(2) system call. Only required if you
|
||||
want to support symlinks. You will probably need to call
|
||||
d_instantiate() just as you would in the create() method
|
||||
|
||||
mkdir: called by the mkdir(2) system call. Only required if you want
|
||||
to support creating subdirectories. You will probably need to
|
||||
call d_instantiate() just as you would in the create() method
|
||||
|
||||
rmdir: called by the rmdir(2) system call. Only required if you want
|
||||
to support deleting subdirectories
|
||||
|
||||
mknod: called by the mknod(2) system call to create a device (char,
|
||||
block) inode or a named pipe (FIFO) or socket. Only required
|
||||
if you want to support creating these types of inodes. You
|
||||
will probably need to call d_instantiate() just as you would
|
||||
in the create() method
|
||||
|
||||
rename: called by the rename(2) system call to rename the object to
|
||||
have the parent and name given by the second inode and dentry.
|
||||
|
||||
readlink: called by the readlink(2) system call. Only required if
|
||||
you want to support reading symbolic links
|
||||
|
||||
follow_link: called by the VFS to follow a symbolic link to the
|
||||
inode it points to. Only required if you want to support
|
||||
symbolic links. This method returns a void pointer cookie
|
||||
that is passed to put_link().
|
||||
|
||||
put_link: called by the VFS to release resources allocated by
|
||||
follow_link(). The cookie returned by follow_link() is passed
|
||||
to this method as the last parameter. It is used by
|
||||
filesystems such as NFS where the page cache is not stable
|
||||
(i.e. a page that was installed when the symbolic link walk
|
||||
started might not be in the page cache at the end of the
|
||||
walk).
|
||||
|
||||
truncate: called by the VFS to change the size of a file. The
|
||||
i_size field of the inode is set to the desired size by the
|
||||
VFS before this method is called. This method is called by
|
||||
the truncate(2) system call and related functionality.
|
||||
|
||||
permission: called by the VFS to check for access rights on a POSIX-like
|
||||
filesystem.
|
||||
|
||||
setattr: called by the VFS to set attributes for a file. This method
|
||||
is called by chmod(2) and related system calls.
|
||||
|
||||
getattr: called by the VFS to get attributes of a file. This method
|
||||
is called by stat(2) and related system calls.
|
||||
|
||||
setxattr: called by the VFS to set an extended attribute for a file.
|
||||
An extended attribute is a name:value pair associated with an
|
||||
inode. This method is called by the setxattr(2) system call.
|
||||
|
||||
getxattr: called by the VFS to retrieve the value of an extended
|
||||
attribute name. This method is called by the getxattr(2) system
|
||||
call.
|
||||
|
||||
listxattr: called by the VFS to list all extended attributes for a
|
||||
given file. This method is called by the listxattr(2) system call.
|
||||
|
||||
removexattr: called by the VFS to remove an extended attribute from
|
||||
a file. This method is called by the removexattr(2) system call.
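
The lookup rule described above (a missing name yields a negative dentry,
not an error) can be sketched in userspace as follows. All names here
(mock_inode, mock_dentry, mock_d_add(), mock_lookup(), the single "hello"
entry) are hypothetical. As a simplification, this sketch returns an int,
whereas the real method returns a struct dentry pointer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MOCK_NAME_MAX 255

/* Hypothetical stand-ins for struct inode and struct dentry. */
struct mock_inode {
	unsigned long i_ino;
	unsigned int i_count;
};

struct mock_dentry {
	const char *d_name;
	struct mock_inode *d_inode;	/* NULL means a negative dentry */
};

/* The only object that exists in this toy directory. */
static struct mock_inode hello_inode = { 2, 0 };

/* Stand-in for d_add(): attach the inode (or NULL) to the dentry. */
void mock_d_add(struct mock_dentry *dentry, struct mock_inode *inode)
{
	dentry->d_inode = inode;
}

/* Returns 0 for both "found" and "not found"; negative errno on real errors. */
int mock_lookup(struct mock_dentry *dentry)
{
	struct mock_inode *inode = NULL;

	/* Real errors are the only cases that return an error code. */
	if (dentry->d_name == NULL)
		return -EINVAL;
	if (strlen(dentry->d_name) > MOCK_NAME_MAX)
		return -ENAMETOOLONG;

	if (strcmp(dentry->d_name, "hello") == 0) {
		inode = &hello_inode;
		inode->i_count++;	/* take a reference for the dentry */
	}

	/* A missing name is not an error: insert a NULL inode instead. */
	mock_d_add(dentry, inode);
	return 0;
}
```

Because a lookup of a missing name succeeds and leaves d_inode NULL, a
later create-style operation can still be attempted on that dentry.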
|
||||
|
||||
|
||||
The Address Space Object
|
||||
========================
|
||||
|
||||
The address space object is used to group and manage pages in the page
|
||||
cache. It can be used to keep track of the pages in a file (or
|
||||
anything else) and also track the mapping of sections of the file into
|
||||
process address spaces.
|
||||
|
||||
There are a number of distinct yet related services that an
|
||||
address-space can provide. These include communicating memory
|
||||
pressure, page lookup by address, and keeping track of pages tagged as
|
||||
Dirty or Writeback.
|
||||
|
||||
The first can be used independently of the others. The VM can try to
|
||||
either write dirty pages in order to clean them, or release clean
|
||||
pages in order to reuse them. To do this it can call the ->writepage
|
||||
method on dirty pages, and ->releasepage on clean pages with
|
||||
PagePrivate set. Clean pages without PagePrivate and with no external
|
||||
references will be released without notice being given to the
|
||||
address_space.
|
||||
|
||||
To achieve this functionality, pages need to be placed on an LRU with
|
||||
lru_cache_add and mark_page_accessed needs to be called whenever the
|
||||
page is used.
|
||||
|
||||
Pages are normally kept in a radix tree indexed by ->index. This tree
|
||||
maintains information about the PG_Dirty and PG_Writeback status of
|
||||
each page, so that pages with either of these flags can be found
|
||||
quickly.
|
||||
|
||||
The Dirty tag is primarily used by mpage_writepages - the default
|
||||
->writepages method. It uses the tag to find dirty pages to call
|
||||
->writepage on. If mpage_writepages is not used (i.e. the address_space
|
||||
provides its own ->writepages), the PAGECACHE_TAG_DIRTY tag is
|
||||
almost unused. write_inode_now and sync_inode do use it (through
|
||||
__sync_single_inode) to check if ->writepages has been successful in
|
||||
writing out the whole address_space.
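
The way a default ->writepages uses the Dirty tag can be sketched in
userspace as a loop that visits only dirty pages and hands each one to
->writepage. This is a simplified model under stated assumptions: a plain
array with a per-page "dirty" field stands in for the radix tree and its
tags, and every name (mock_page, mock_mapping, mock_writepage(),
mock_default_writepages()) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_NR_PAGES 8

/* Hypothetical stand-in for struct page. */
struct mock_page {
	int dirty;	/* plays the role of the Dirty tag */
	int writes;	/* how often ->writepage was called on this page */
};

/* Hypothetical stand-in for struct address_space plus its operations. */
struct mock_mapping {
	struct mock_page pages[MOCK_NR_PAGES];
	int (*writepage)(struct mock_page *page);
};

/* A trivial ->writepage: pretend the write succeeded at once. */
int mock_writepage(struct mock_page *page)
{
	page->dirty = 0;
	page->writes++;
	return 0;
}

/*
 * Default writepages: find pages tagged dirty and pass each of them to
 * ->writepage, stopping after nr_to_write pages. Returns the number of
 * pages written, or a negative error from ->writepage.
 */
int mock_default_writepages(struct mock_mapping *mapping, long nr_to_write)
{
	int written = 0;
	size_t i;

	for (i = 0; i < MOCK_NR_PAGES && nr_to_write > 0; i++) {
		struct mock_page *page = &mapping->pages[i];
		int err;

		if (!page->dirty)
			continue;	/* clean pages are never visited */
		err = mapping->writepage(page);
		if (err)
			return err;
		written++;
		nr_to_write--;
	}
	return written;
}
```

An address space that supplies its own writepages would replace this loop,
which is why the tag is then left mostly unused.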
|
||||
|
||||
The Writeback tag is used by filemap*wait* and sync_page* functions,
|
||||
via wait_on_page_writeback_range, to wait for all writeback to
|
||||
complete. While waiting ->sync_page (if defined) will be called on
|
||||
each page that is found to require writeback.
|
||||
|
||||
An address_space handler may attach extra information to a page,
|
||||
typically using the 'private' field in the 'struct page'. If such
|
||||
information is attached, the PG_Private flag should be set. This will
|
||||
cause various VM routines to make extra calls into the address_space
|
||||
handler to deal with that data.
|
||||
|
||||
An address space acts as an intermediary between storage and
|
||||
application. Data is read into the address space a whole page at a
|
||||
time, and provided to the application either by copying of the page,
|
||||
or by memory-mapping the page.
|
||||
Data is written into the address space by the application, and then
|
||||
written back to storage, typically in whole pages; however, the
|
||||
address_space has finer control of write sizes.
|
||||
|
||||
The read process essentially only requires 'readpage'. The write
|
||||
process is more complicated and uses prepare_write/commit_write or
|
||||
set_page_dirty to write data into the address_space, and writepage,
|
||||
sync_page, and writepages to write back data to storage.
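
The prepare_write/commit_write half of the write process can be sketched in
userspace as a three-step sequence: check that the write can complete, copy
the data into the page, then update the file size and mark the page dirty.
The names below (mock_page, mock_file, max_size, mock_prepare_write(),
mock_commit_write(), mock_write_to_page()) are hypothetical, and "max_size"
is an invented stand-in for space allocation.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define MOCK_PAGE_SIZE 4096u

/* Hypothetical stand-in for struct page, with its data inline. */
struct mock_page {
	unsigned long index;	/* page number within the file */
	int dirty;
	char data[MOCK_PAGE_SIZE];
};

/* Hypothetical per-file state. */
struct mock_file {
	unsigned long long size;	/* current file size in bytes */
	unsigned long long max_size;	/* space this mock can allocate */
};

/* Check that bytes [from, to) of the page can be written. */
int mock_prepare_write(struct mock_file *file, struct mock_page *page,
		       unsigned from, unsigned to)
{
	unsigned long long end;

	if (from > to || to > MOCK_PAGE_SIZE)
		return -EINVAL;
	end = (unsigned long long)page->index * MOCK_PAGE_SIZE + to;
	if (end > file->max_size)
		return -ENOSPC;	/* fail here, before any data is copied */
	return 0;
}

/* Housekeeping after the copy: grow the file and mark the page dirty. */
int mock_commit_write(struct mock_file *file, struct mock_page *page,
		      unsigned from, unsigned to)
{
	unsigned long long end =
		(unsigned long long)page->index * MOCK_PAGE_SIZE + to;

	(void)from;
	if (end > file->size)
		file->size = end;
	page->dirty = 1;
	return 0;
}

/* The generic write path for one page: prepare, copy, commit. */
int mock_write_to_page(struct mock_file *file, struct mock_page *page,
		       unsigned from, const char *buf, unsigned len)
{
	int err = mock_prepare_write(file, page, from, from + len);

	if (err)
		return err;
	memcpy(page->data + from, buf, len);
	return mock_commit_write(file, page, from, from + len);
}
```

All failure handling sits in the prepare step, so the commit step has
nothing left that can go wrong once the data is in the page.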
|
||||
|
||||
Adding and removing pages to/from an address_space is protected by the
|
||||
inode's i_mutex.
|
||||
|
||||
When data is written to a page, the PG_Dirty flag should be set. It
|
||||
typically remains set until writepage asks for it to be written. This
|
||||
should clear PG_Dirty and set PG_Writeback. It can be actually
|
||||
written at any point after PG_Dirty is clear. Once it is known to be
|
||||
safe, PG_Writeback is cleared.
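
The flag transitions in the paragraph above can be written down as a small
userspace state sketch. The flag values and function names
(MOCK_PG_DIRTY, MOCK_PG_WRITEBACK, mock_set_page_dirty(),
mock_start_writeback(), mock_end_writeback()) are hypothetical and only
model the ordering of the two flags, not the kernel's actual page flag
helpers.

```c
#include <assert.h>

#define MOCK_PG_DIRTY		0x1u
#define MOCK_PG_WRITEBACK	0x2u

/* Hypothetical stand-in for struct page: only the two flags matter here. */
struct mock_page {
	unsigned int flags;
};

/* Data was written to the page: set the dirty flag. */
void mock_set_page_dirty(struct mock_page *page)
{
	page->flags |= MOCK_PG_DIRTY;
}

/*
 * What a writepage-style method does first: clear dirty, set writeback.
 * Returns 1 if writeout was started, 0 if the page was not dirty.
 */
int mock_start_writeback(struct mock_page *page)
{
	if (!(page->flags & MOCK_PG_DIRTY))
		return 0;
	page->flags &= ~MOCK_PG_DIRTY;
	page->flags |= MOCK_PG_WRITEBACK;
	return 1;
}

/* The data is known to be safe on storage: clear the writeback flag. */
void mock_end_writeback(struct mock_page *page)
{
	page->flags &= ~MOCK_PG_WRITEBACK;
}
```

Because the dirty flag is cleared before the write is issued, a page that
is modified again while under writeback ends up dirty once more and is
picked up by a later pass.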
|
||||
|
||||
Writeback makes use of a writeback_control structure...
|
||||
|
||||
struct address_space_operations
|
||||
-------------------------------
|
||||
|
||||
This describes how the VFS can manipulate mapping of a file to page cache in
|
||||
your filesystem. As of kernel 2.6.16, the following members are defined:
|
||||
|
||||
struct address_space_operations {
|
||||
int (*writepage)(struct page *page, struct writeback_control *wbc);
|
||||
int (*readpage)(struct file *, struct page *);
|
||||
int (*sync_page)(struct page *);
|
||||
int (*writepages)(struct address_space *, struct writeback_control *);
|
||||
int (*set_page_dirty)(struct page *page);
|
||||
int (*readpages)(struct file *filp, struct address_space *mapping,
|
||||
struct list_head *pages, unsigned nr_pages);
|
||||
int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
|
||||
int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
|
||||
sector_t (*bmap)(struct address_space *, sector_t);
|
||||
int (*invalidatepage) (struct page *, unsigned long);
|
||||
int (*releasepage) (struct page *, int);
|
||||
ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
|
||||
loff_t offset, unsigned long nr_segs);
|
||||
struct page* (*get_xip_page)(struct address_space *, sector_t,
|
||||
int);
|
||||
/* migrate the contents of a page to the specified target */
|
||||
int (*migratepage) (struct page *, struct page *);
|
||||
};
|
||||
|
||||
writepage: called by the VM to write a dirty page to backing store.
|
||||
This may happen for data integrity reasons (i.e. 'sync'), or
|
||||
to free up memory (flush). The difference can be seen in
|
||||
wbc->sync_mode.
|
||||
The PG_Dirty flag has been cleared and PageLocked is true.
|
||||
writepage should start writeout, should set PG_Writeback,
|
||||
and should make sure the page is unlocked, either synchronously
|
||||
or asynchronously when the write operation completes.
|
||||
|
||||
If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
|
||||
try too hard if there are problems, and may choose to write out
|
||||
other pages from the mapping if that is easier (e.g. due to
|
||||
internal dependencies). If it chooses not to start writeout, it
|
||||
should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
|
||||
calling ->writepage on that page.
|
||||
|
||||
See the file "Locking" for more details.
|
||||
|
||||
readpage: called by the VM to read a page from backing store.
|
||||
The page will be Locked when readpage is called, and should be
|
||||
unlocked and marked uptodate once the read completes.
|
||||
If ->readpage discovers that it needs to unlock the page for
|
||||
some reason, it can do so, and then return AOP_TRUNCATED_PAGE.
|
||||
In this case, the page will be relocated, relocked and if
|
||||
that all succeeds, ->readpage will be called again.
|
||||
|
||||
sync_page: called by the VM to notify the backing store to perform all
|
||||
queued I/O operations for a page. I/O operations for other pages
|
||||
associated with this address_space object may also be performed.
|
||||
|
||||
This function is optional and is called only for pages with
|
||||
PG_Writeback set while waiting for the writeback to complete.
|
||||
|
||||
writepages: called by the VM to write out pages associated with the
|
||||
address_space object. If wbc->sync_mode is WB_SYNC_ALL, then
|
||||
the writeback_control will specify a range of pages that must be
|
||||
written out. If it is WB_SYNC_NONE, then a nr_to_write is given
|
||||
and that many pages should be written if possible.
|
||||
If no ->writepages is given, then mpage_writepages is used
|
||||
instead. This will choose pages from the address space that are
|
||||
tagged as DIRTY and will pass them to ->writepage.
|
||||
|
||||
set_page_dirty: called by the VM to set a page dirty.
|
||||
This is particularly needed if an address space attaches
|
||||
private data to a page, and that data needs to be updated when
|
||||
a page is dirtied. This is called, for example, when a memory
|
||||
mapped page gets modified.
|
||||
If defined, it should set the PageDirty flag, and the
|
||||
PAGECACHE_TAG_DIRTY tag in the radix tree.
|
||||
|
||||
readpages: called by the VM to read pages associated with the address_space
|
||||
object. This is essentially just a vector version of
|
||||
readpage. Instead of just one page, several pages are
|
||||
requested.
|
||||
readpages is only used for read-ahead, so read errors are
|
||||
ignored. If anything goes wrong, feel free to give up.
|
||||
|
||||
prepare_write: called by the generic write path in VM to set up a write
|
||||
request for a page. This indicates to the address space that
|
||||
the given range of bytes is about to be written. The
|
||||
address_space should check that the write will be able to
|
||||
complete, by allocating space if necessary and doing any other
|
||||
internal housekeeping. If the write will update parts of
|
||||
any basic-blocks on storage, then those blocks should be
|
||||
pre-read (if they haven't been read already) so that the
|
||||
updated blocks can be written out properly.
|
||||
The page will be locked. If prepare_write wants to unlock the
|
||||
page it, like readpage, may do so and return
|
||||
AOP_TRUNCATED_PAGE.
|
||||
In this case the prepare_write will be retried once the lock is
|
||||
regained.
|
||||
|
||||
Note: the page _must not_ be marked uptodate in this function
|
||||
(or anywhere else) unless it actually is uptodate right now. As
|
||||
soon as a page is marked uptodate, it is possible for a concurrent
|
||||
read(2) to copy it to userspace.
|
||||
|
||||
commit_write: If prepare_write succeeds, new data will be copied
|
||||
into the page and then commit_write will be called. It will
|
||||
typically update the size of the file (if appropriate) and
|
||||
mark the inode as dirty, and do any other related housekeeping
|
||||
operations. It should avoid returning an error if possible -
|
||||
errors should have been handled by prepare_write.
|
||||
|
||||
bmap: called by the VFS to map a logical block offset within an object to a
|
||||
physical block number. This method is used by the FIBMAP
|
||||
ioctl and for working with swap-files. To be able to swap to
|
||||
a file, the file must have a stable mapping to a block
|
||||
device. The swap system does not go through the filesystem
|
||||
but instead uses bmap to find out where the blocks in the file
|
||||
are and uses those addresses directly.
|
||||
|
||||
|
||||
invalidatepage: If a page has PagePrivate set, then invalidatepage
|
||||
will be called when part or all of the page is to be removed
|
||||
from the address space. This generally corresponds to either a
|
||||
truncation or a complete invalidation of the address space
|
||||
(in the latter case 'offset' will always be 0).
|
||||
Any private data associated with the page should be updated
|
||||
to reflect this truncation. If offset is 0, then
|
||||
the private data should be released, because the page
|
||||
must be able to be completely discarded. This may be done by
|
||||
calling the ->releasepage function, but in this case the
|
||||
release MUST succeed.
|
||||
|
||||
releasepage: releasepage is called on PagePrivate pages to indicate
|
||||
that the page should be freed if possible. ->releasepage
|
||||
should remove any private data from the page and clear the
|
||||
PagePrivate flag. It may also remove the page from the
|
||||
address_space. If this fails for some reason, it may indicate
|
||||
failure with a 0 return value.
|
||||
This is used in two distinct though related cases. The first
|
||||
is when the VM finds a clean page with no active users and
|
||||
wants to make it a free page. If ->releasepage succeeds, the
|
||||
page will be removed from the address_space and become free.
|
||||
|
||||
The second case is when a request has been made to invalidate
|
||||
some or all pages in an address_space. This can happen
|
||||
through the fadvise(POSIX_FADV_DONTNEED) system call or by the
|
||||
filesystem explicitly requesting it as nfs and 9fs do (when
|
||||
they believe the cache may be out of date with storage) by
|
||||
calling invalidate_inode_pages2().
|
||||
If the filesystem makes such a call, and needs to be certain
|
||||
that all pages are invalidated, then its releasepage will
|
||||
need to ensure this. Possibly it can clear the PageUptodate
|
||||
bit if it cannot free private data yet.
|
||||
|
||||
direct_IO: called by the generic read/write routines to perform
|
||||
direct_IO - that is IO requests which bypass the page cache
|
||||
and transfer data directly between the storage and the
|
||||
application's address space.
|
||||
|
||||
get_xip_page: called by the VM to translate a block number to a page.
|
||||
The page is valid until the corresponding filesystem is unmounted.
|
||||
Filesystems that want to use execute-in-place (XIP) need to implement
|
||||
it. An example implementation can be found in fs/ext2/xip.c.
|
||||
|
||||
migratepage: This is used to compact the physical memory usage.
|
||||
If the VM wants to relocate a page (maybe off a memory card
|
||||
that is signalling imminent failure) it will pass a new page
|
||||
and an old page to this function. migratepage should
|
||||
transfer any private data across and update any references
|
||||
that it has to the page.
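
The logical-to-physical translation that the bmap entry above describes can
be sketched in userspace with a small extent table. Everything here is
hypothetical (mock_sector_t, mock_extent, mock_mapping, mock_bmap()), and
returning 0 for an unmapped block is a convention assumed by this sketch.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long long mock_sector_t;

/* A run of file blocks stored contiguously on the device. */
struct mock_extent {
	mock_sector_t logical;	/* first block offset within the file */
	mock_sector_t physical;	/* first block number on the device */
	mock_sector_t len;	/* number of blocks in the run */
};

/* Hypothetical stand-in for an address_space with a block map. */
struct mock_mapping {
	const struct mock_extent *extents;
	size_t nr_extents;
};

/*
 * Map a logical block offset within the file to a physical block number.
 * Returns 0 when the block is not mapped (a hole).
 */
mock_sector_t mock_bmap(const struct mock_mapping *mapping, mock_sector_t block)
{
	size_t i;

	for (i = 0; i < mapping->nr_extents; i++) {
		const struct mock_extent *e = &mapping->extents[i];

		if (block >= e->logical && block - e->logical < e->len)
			return e->physical + (block - e->logical);
	}
	return 0;
}
```

A caller that needs stable block addresses, as the swap code does, would
run this translation once for every block in the file and refuse the file
if any block comes back unmapped.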
|
||||
|
||||
The File Object
|
||||
===============
|
||||
|
||||
A file object represents a file opened by a process.
|
||||
|
||||
|
||||
struct file_operations
|
||||
----------------------
|
||||
|
||||
This describes how the VFS can manipulate an open file. As of kernel
|
||||
2.6.17, the following members are defined:
|
||||
|
||||
struct file_operations {
|
||||
loff_t (*llseek) (struct file *, loff_t, int);
|
||||
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
|
||||
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
|
||||
ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
|
||||
ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
|
||||
int (*readdir) (struct file *, void *, filldir_t);
|
||||
unsigned int (*poll) (struct file *, struct poll_table_struct *);
|
||||
int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
|
||||
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
|
||||
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
|
||||
int (*mmap) (struct file *, struct vm_area_struct *);
|
||||
int (*open) (struct inode *, struct file *);
|
||||
int (*flush) (struct file *);
|
||||
int (*release) (struct inode *, struct file *);
|
||||
int (*fsync) (struct file *, struct dentry *, int datasync);
|
||||
int (*aio_fsync) (struct kiocb *, int datasync);
|
||||
int (*fasync) (int, struct file *, int);
|
||||
int (*lock) (struct file *, int, struct file_lock *);
|
||||
ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
|
||||
ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
|
||||
ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *);
|
||||
ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
|
||||
unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
|
||||
int (*check_flags)(int);
|
||||
int (*dir_notify)(struct file *filp, unsigned long arg);
|
||||
int (*flock) (struct file *, int, struct file_lock *);
|
||||
ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, size_t, unsigned
|
||||
int);
|
||||
ssize_t (*splice_read)(struct file *, struct pipe_inode_info *, size_t, unsigned
|
||||
int);
|
||||
};
|
||||
|
||||
Again, all methods are called without any locks being held, unless
|
||||
otherwise noted.
|
||||
|
||||
llseek: called when the VFS needs to move the file position index
|
||||
|
||||
read: called by read(2) and related system calls
|
||||
|
||||
aio_read: called by io_submit(2) and other asynchronous I/O operations
|
||||
|
||||
write: called by write(2) and related system calls
|
||||
|
||||
aio_write: called by io_submit(2) and other asynchronous I/O operations
|
||||
|
||||
readdir: called when the VFS needs to read the directory contents
|
||||
|
||||
poll: called by the VFS when a process wants to check if there is
|
||||
activity on this file and (optionally) go to sleep until there
|
||||
is activity. Called by the select(2) and poll(2) system calls
|
||||
|
||||
ioctl: called by the ioctl(2) system call
|
||||
|
||||
unlocked_ioctl: called by the ioctl(2) system call. Filesystems that do not
|
||||
require the BKL should use this method instead of the ioctl() above.
|
||||
|
||||
compat_ioctl: called by the ioctl(2) system call when 32 bit system calls
|
||||
are used on 64 bit kernels.
|
||||
|
||||
mmap: called by the mmap(2) system call
|
||||
|
||||
open: called by the VFS when an inode should be opened. When the VFS
|
||||
opens a file, it creates a new "struct file". It then calls the
|
||||
open method for the newly allocated file structure. You might
|
||||
think that the open method really belongs in
|
||||
"struct inode_operations", and you may be right. I think it's
|
||||
done the way it is because it makes filesystems simpler to
|
||||
implement. The open() method is a good place to initialize the
|
||||
"private_data" member in the file structure if you want to point
|
||||
to a device structure
|
||||
|
||||
flush: called by the close(2) system call to flush a file
|
||||
|
||||
release: called when the last reference to an open file is closed
|
||||
|
||||
fsync: called by the fsync(2) system call
|
||||
|
||||
fasync: called by the fcntl(2) system call when asynchronous
|
||||
(non-blocking) mode is enabled for a file
|
||||
|
||||
lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW
|
||||
commands
|
||||
|
||||
readv: called by the readv(2) system call
|
||||
|
||||
writev: called by the writev(2) system call
|
||||
|
||||
sendfile: called by the sendfile(2) system call
|
||||
|
||||
get_unmapped_area: called by the mmap(2) system call
|
||||
|
||||
check_flags: called by the fcntl(2) system call for F_SETFL command
|
||||
|
||||
dir_notify: called by the fcntl(2) system call for F_NOTIFY command
|
||||
|
||||
flock: called by the flock(2) system call
|
||||
|
||||
splice_write: called by the VFS to splice data from a pipe to a file. This
|
||||
method is used by the splice(2) system call
|
||||
|
||||
splice_read: called by the VFS to splice data from file to a pipe. This
|
||||
method is used by the splice(2) system call
|
||||
|
||||
Note that the file operations are implemented by the specific
|
||||
filesystem in which the inode resides. When opening a device node
|
||||
(character or block special) most filesystems will call special
|
||||
support routines in the VFS which will locate the required device
|
||||
driver information. These support routines replace the filesystem file
|
||||
operations with those for the device driver, and then proceed to call
|
||||
the new open() method for the file. This is how opening a device file
|
||||
in the filesystem eventually ends up calling the device driver open()
|
||||
method.
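
The hand-over from filesystem file operations to device driver file
operations can be sketched in userspace as follows. This is only a model of
the idea: mock_file, mock_file_operations, mock_find_driver(),
mock_special_open(), the device number 42 and the -ENODEV return are all
hypothetical choices made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct mock_file;

/* Hypothetical stand-in for struct file_operations (open only). */
struct mock_file_operations {
	int (*open)(struct mock_file *file);
};

/* Hypothetical stand-in for struct file. */
struct mock_file {
	const struct mock_file_operations *f_op;
	unsigned int rdev;	/* device number of the node being opened */
	void *private_data;
};

/* The "device driver": its open() points private_data at driver state. */
static int mock_driver_state = 7;

int mock_driver_open(struct mock_file *file)
{
	file->private_data = &mock_driver_state;
	return 0;
}

const struct mock_file_operations mock_driver_fops = { mock_driver_open };

/* Toy driver registry: exactly one driver, at device number 42. */
const struct mock_file_operations *mock_find_driver(unsigned int rdev)
{
	return rdev == 42 ? &mock_driver_fops : NULL;
}

/*
 * Support routine a filesystem installs for device nodes: look up the
 * driver, replace the file operations, then call the new open().
 */
int mock_special_open(struct mock_file *file)
{
	const struct mock_file_operations *fops = mock_find_driver(file->rdev);

	if (!fops)
		return -ENODEV;
	file->f_op = fops;
	if (file->f_op->open)
		return file->f_op->open(file);
	return 0;
}

const struct mock_file_operations mock_special_fops = { mock_special_open };
```

After a successful open, every later call through file->f_op reaches the
driver directly; the filesystem's own operations are no longer involved.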
|
||||
|
||||
|
||||
Directory Entry Cache (dcache)
|
||||
==============================
|
||||
|
||||
|
||||
struct dentry_operations
|
||||
------------------------
|
||||
|
||||
This describes how a filesystem can overload the standard dentry
|
||||
operations. Dentries and the dcache are the domain of the VFS and the
|
||||
individual filesystem implementations. Device drivers have no business
|
||||
here. These methods may be set to NULL, as they are either optional or
|
||||
the VFS uses a default. As of kernel 2.6.13, the following members are
|
||||
defined:
|
||||
|
||||
struct dentry_operations {
|
||||
int (*d_revalidate)(struct dentry *, struct nameidata *);
|
||||
int (*d_hash) (struct dentry *, struct qstr *);
|
||||
int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
|
||||
int (*d_delete)(struct dentry *);
|
||||
void (*d_release)(struct dentry *);
|
||||
void (*d_iput)(struct dentry *, struct inode *);
|
||||
};
|
||||
|
||||
d_revalidate: called when the VFS needs to revalidate a dentry. This
|
||||
is called whenever a name look-up finds a dentry in the
|
||||
dcache. Most filesystems leave this as NULL, because all their
|
||||
dentries in the dcache are valid
|
||||
|
||||
d_hash: called when the VFS adds a dentry to the hash table
|
||||
|
||||
d_compare: called when a dentry should be compared with another
|
||||
|
||||
d_delete: called when the last reference to a dentry is
|
||||
deleted. This means no-one is using the dentry, however it is
|
||||
still valid and in the dcache
|
||||
|
||||
d_release: called when a dentry is really deallocated
|
||||
|
||||
d_iput: called when a dentry loses its inode (just prior to its
|
||||
being deallocated). The default when this is NULL is that the
|
||||
VFS calls iput(). If you define this method, you must call
|
||||
iput() yourself
|
||||
|
||||
Each dentry has a pointer to its parent dentry, as well as a hash list
|
||||
of child dentries. Child dentries are basically like files in a
|
||||
directory.
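
As an illustration of why a filesystem might override d_hash and d_compare, the following userspace sketch models a case-insensitive name policy. It assumes the usual convention that the compare method returns 0 when two names match and that the hash method stores its result in the name structure. toy_qstr is a simplified, hypothetical stand-in for struct qstr, and the dentry arguments are omitted.

```c
#include <assert.h>
#include <ctype.h>

/* Simplified stand-in for the kernel's struct qstr (an assumption). */
struct toy_qstr {
	const char *name;
	unsigned int len;
	unsigned int hash;
};

/* Hash the case-folded name so "README" and "readme" share a bucket. */
int toy_ci_hash(struct toy_qstr *q)
{
	unsigned int h = 0;
	unsigned int i;

	assert(q && q->name);
	for (i = 0; i < q->len; i++)
		h = h * 31 + (unsigned int)tolower((unsigned char)q->name[i]);
	q->hash = h;
	return 0;
}

/* Return 0 when the two names match, ignoring case as tolower() sees it. */
int toy_ci_compare(const struct toy_qstr *a, const struct toy_qstr *b)
{
	unsigned int i;

	assert(a && b && a->name && b->name);
	if (a->len != b->len)
		return 1;
	for (i = 0; i < a->len; i++)
		if (tolower((unsigned char)a->name[i]) !=
		    tolower((unsigned char)b->name[i]))
			return 1;
	return 0;
}
```

The two methods have to agree: names that compare equal must also hash to the same value, otherwise a lookup would search the wrong hash chain.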
|
||||
|
||||
|
||||
Directory Entry Cache API
|
||||
--------------------------
|
||||
|
||||
There are a number of functions defined which permit a filesystem to
|
||||
manipulate dentries:
|
||||
|
||||
dget: open a new handle for an existing dentry (this just increments
|
||||
the usage count)
|
||||
|
||||
dput: close a handle for a dentry (decrements the usage count). If
|
||||
the usage count drops to 0, the "d_delete" method is called
|
||||
and the dentry is placed on the unused list if the dentry is
|
||||
still in its parent's hash list. Putting the dentry on the
|
||||
unused list just means that if the system needs some RAM, it
|
||||
goes through the unused list of dentries and deallocates them.
|
||||
If the dentry has already been unhashed when the usage count
drops to 0, the dentry is deallocated after the "d_delete"
method is called
|
||||
|
||||
d_drop: this unhashes a dentry from its parent's hash list. A
|
||||
subsequent call to dput() will deallocate the dentry if its
|
||||
usage count drops to 0
|
||||
|
||||
d_delete: delete a dentry. If there are no other open references to
|
||||
the dentry then the dentry is turned into a negative dentry
|
||||
(the d_iput() method is called). If there are other
|
||||
references, then d_drop() is called instead
|
||||
|
||||
d_add: adds a dentry to its parent's hash list and then calls
|
||||
d_instantiate()
|
||||
|
||||
d_instantiate: adds a dentry to the alias hash list for the inode and
|
||||
updates the "d_inode" member. The "i_count" member in the
|
||||
inode structure should be set/incremented. If the inode
|
||||
pointer is NULL, the dentry is called a "negative
|
||||
dentry". This function is commonly called when an inode is
|
||||
created for an existing negative dentry
|
||||
|
||||
d_lookup: look up a dentry given its parent and path name component.
|
||||
It looks up the child of that given name from the dcache
|
||||
hash table. If it is found, the reference count is incremented
|
||||
and the dentry is returned. The caller must use dput()
|
||||
to free the dentry when it finishes using it.
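
The reference-count rules above can be condensed into a toy state model. This is a userspace sketch only: toy_dentry and its fields are hypothetical, and the real dcache additionally involves locking, the d_delete method and the LRU machinery, none of which is modelled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a dentry's lifecycle state; not kernel code. */
struct toy_dentry {
	int count;		/* usage count */
	bool hashed;		/* still in its parent's hash list? */
	bool on_unused_list;	/* unreferenced but still cached */
	bool freed;		/* deallocated */
};

/* dget: open another handle (just increments the usage count). */
void toy_dget(struct toy_dentry *d)
{
	assert(!d->freed);
	d->count++;
}

/* d_drop: unhash from the parent; existing references stay valid. */
void toy_d_drop(struct toy_dentry *d)
{
	d->hashed = false;
}

/*
 * dput: close a handle. At zero, a hashed dentry is parked on the
 * unused list (reclaimable later); an unhashed one is deallocated.
 */
void toy_dput(struct toy_dentry *d)
{
	assert(d->count > 0);
	if (--d->count > 0)
		return;
	if (d->hashed)
		d->on_unused_list = true;
	else
		d->freed = true;
}
```

The model makes the key distinction visible: dropping the last reference frees a dentry only if it was unhashed first; otherwise it merely becomes a candidate for reclaim.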
|
||||
|
||||
For further information on dentry locking, please refer to the document
|
||||
Documentation/filesystems/dentry-locking.txt.
|
||||
|
||||
|
||||
Resources
|
||||
=========
|
||||
|
||||
(Note some of these resources are not up-to-date with the latest kernel
|
||||
version.)
|
||||
|
||||
Creating Linux virtual filesystems. 2002
|
||||
<http://lwn.net/Articles/13325/>
|
||||
|
||||
The Linux Virtual File-system Layer by Neil Brown. 1999
|
||||
<http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html>
|
||||
|
||||
A tour of the Linux VFS by Michael K. Johnson. 1996
|
||||
<http://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html>
|
||||
|
||||
A small trail through the Linux kernel by Andries Brouwer. 2001
|
||||
<http://www.win.tue.nl/~aeb/linux/vfs/trail.html>
|
||||
262
Documentation/filesystems/xfs.txt
Normal file
@@ -0,0 +1,262 @@
|
||||
|
||||
The SGI XFS Filesystem
|
||||
======================
|
||||
|
||||
XFS is a high performance journaling filesystem which originated
|
||||
on the SGI IRIX platform. It is completely multi-threaded, can
|
||||
support large files and large filesystems, extended attributes,
|
||||
variable block sizes, is extent based, and makes extensive use of
|
||||
Btrees (directories, extents, free space) to aid both performance
|
||||
and scalability.
|
||||
|
||||
Refer to the documentation at http://oss.sgi.com/projects/xfs/
|
||||
for further details. This implementation is on-disk compatible
|
||||
with the IRIX version of XFS.
|
||||
|
||||
|
||||
Mount Options
|
||||
=============
|
||||
|
||||
When mounting an XFS filesystem, the following options are accepted.
|
||||
|
||||
allocsize=size
|
||||
Sets the buffered I/O end-of-file preallocation size when
|
||||
doing delayed allocation writeout (default size is 64KiB).
|
||||
Valid values for this option are page size (typically 4KiB)
|
||||
through to 1GiB, inclusive, in power-of-2 increments.
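
A minimal sketch of the allocsize range check, reading "power-of-2 increments" as "the size must itself be a power of two". The function name and the explicit page_size parameter are assumptions made for illustration; this is not the XFS mount-option parser.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_ALLOCSIZE_MAX (UINT64_C(1) << 30)	/* 1GiB upper bound */

/*
 * Return true if size (in bytes) is a power of two between page_size
 * and 1GiB inclusive. page_size is supplied by the caller; 4096 is
 * typical.
 */
bool toy_allocsize_valid(uint64_t size, uint64_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	if (size < page_size || size > TOY_ALLOCSIZE_MAX)
		return false;
	return (size & (size - 1)) == 0;
}
```

With a 4KiB page size, 65536 (the default, 64KiB) passes, while 2048 and 69632 are rejected.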
|
||||
|
||||
attr2/noattr2
|
||||
These options enable/disable an "opportunistic" improvement in
the way inline extended attributes are stored on-disk (the
default is disabled, for on-disk backward compatibility).
|
||||
When the new form is used for the first time (by setting or
|
||||
removing extended attributes) the on-disk superblock feature
|
||||
bit field will be updated to reflect this format being in use.
|
||||
|
||||
barrier
|
||||
Enables the use of block layer write barriers for writes into
|
||||
the journal and unwritten extent conversion. This allows for
|
||||
drive level write caching to be enabled, for devices that
|
||||
support write barriers.
|
||||
|
||||
dmapi
|
||||
Enable the DMAPI (Data Management API) event callouts.
|
||||
Use with the "mtpt" option.
|
||||
|
||||
grpid/bsdgroups and nogrpid/sysvgroups
|
||||
These options define what group ID a newly created file gets.
|
||||
When grpid is set, it takes the group ID of the directory in
|
||||
which it is created; otherwise (the default) it takes the fsgid
|
||||
of the current process, unless the directory has the setgid bit
|
||||
set, in which case it takes the gid from the parent directory,
|
||||
and also gets the setgid bit set if it is a directory itself.
|
||||
|
||||
ihashsize=value
|
||||
Sets the number of hash buckets available for hashing the
|
||||
in-memory inodes of the specified mount point. If a value
|
||||
of zero is used, the value selected by the default algorithm
|
||||
will be displayed in /proc/mounts.
|
||||
|
||||
ikeep/noikeep
|
||||
When inode clusters are emptied of inodes, keep them around
|
||||
on the disk (ikeep) - this is the traditional XFS behaviour
|
||||
and is still the default for now. Using the noikeep option,
|
||||
inode clusters are returned to the free space pool.
|
||||
|
||||
inode64
|
||||
Indicates that XFS is allowed to create inodes at any location
|
||||
in the filesystem, including those which will result in inode
|
||||
numbers occupying more than 32 bits of significance. This is
|
||||
provided for backwards compatibility, but causes problems for
|
||||
backup applications that cannot handle large inode numbers.
|
||||
|
||||
largeio/nolargeio
|
||||
If "nolargeio" is specified, the optimal I/O reported in
|
||||
st_blksize by stat(2) will be as small as possible to allow user
|
||||
applications to avoid inefficient read/modify/write I/O.
|
||||
If "largeio" is specified, a filesystem that has a "swidth" specified
|
||||
will return the "swidth" value (in bytes) in st_blksize. If the
|
||||
filesystem does not have a "swidth" specified but does specify
|
||||
an "allocsize" then "allocsize" (in bytes) will be returned
|
||||
instead.
|
||||
If neither of these two options is specified, then the filesystem
|
||||
will behave as if "nolargeio" was specified.
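
An application can observe the effect of largeio/nolargeio by reading st_blksize from stat(2). The helper below is a small sketch; the name toy_preferred_io_size is made up for this example.

```c
#include <assert.h>
#include <sys/stat.h>

/* Return the st_blksize reported by stat(2) for path, or -1 on error. */
long toy_preferred_io_size(const char *path)
{
	struct stat st;

	assert(path);
	if (stat(path, &st) != 0)
		return -1;
	return (long)st.st_blksize;
}
```

Calling it on a file in an XFS filesystem mounted with "largeio" would be expected to return the stripe width or allocsize in bytes, as described above, rather than a small block size.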
|
||||
|
||||
logbufs=value
|
||||
Set the number of in-memory log buffers. Valid numbers range
|
||||
from 2-8 inclusive.
|
||||
The default value is 8 buffers for filesystems with a
|
||||
blocksize of 64KiB, 4 buffers for filesystems with a blocksize
|
||||
of 32KiB, 3 buffers for filesystems with a blocksize of 16KiB
|
||||
and 2 buffers for all other configurations. Increasing the
|
||||
number of buffers may increase performance on some workloads
|
||||
at the cost of the memory used for the additional log buffers
|
||||
and their associated control structures.
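
The default selection described above amounts to a small lookup on the filesystem block size. The sketch below restates it; the function names are hypothetical and the code is not taken from the XFS sources.

```c
#include <assert.h>

/* Default number of log buffers for a filesystem block size in bytes. */
int toy_default_logbufs(unsigned int blocksize)
{
	switch (blocksize) {
	case 64 * 1024:
		return 8;
	case 32 * 1024:
		return 4;
	case 16 * 1024:
		return 3;
	default:
		return 2;
	}
}

/* A user-supplied logbufs= value must lie in the range 2-8 inclusive. */
int toy_logbufs_valid(int value)
{
	return value >= 2 && value <= 8;
}
```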
|
||||
|
||||
logbsize=value
|
||||
Set the size of each in-memory log buffer.
|
||||
Size may be specified in bytes, or in kilobytes with a "k" suffix.
|
||||
Valid sizes for version 1 and version 2 logs are 16384 (16k) and
|
||||
32768 (32k). Valid sizes for version 2 logs also include
|
||||
65536 (64k), 131072 (128k) and 262144 (256k).
|
||||
The default value for machines with more than 32MiB of memory
|
||||
is 32768; machines with less memory use 16384 by default.
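
The size rules for logbsize can be written down as a short check. This is a sketch under the rules stated above; the function names and the log_version parameter are illustrative assumptions, not XFS code.

```c
#include <assert.h>

/* Return 1 if bytes is an accepted logbsize for the given log version. */
int toy_logbsize_valid(unsigned int bytes, int log_version)
{
	switch (bytes) {
	case 16384:
	case 32768:
		return log_version == 1 || log_version == 2;
	case 65536:
	case 131072:
	case 262144:
		return log_version == 2;
	default:
		return 0;
	}
}

/* Default log buffer size in bytes for a machine with mem_mib MiB of RAM. */
unsigned int toy_default_logbsize(unsigned long mem_mib)
{
	return mem_mib > 32 ? 32768u : 16384u;
}
```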
|
||||
|
||||
logdev=device and rtdev=device
|
||||
Use an external log (metadata journal) and/or real-time device.
|
||||
An XFS filesystem has up to three parts: a data section, a log
|
||||
section, and a real-time section. The real-time section is
|
||||
optional, and the log section can be separate from the data
|
||||
section or contained within it.
|
||||
|
||||
mtpt=mountpoint
|
||||
Use with the "dmapi" option. The value specified here will be
|
||||
included in the DMAPI mount event, and should be the path of
|
||||
the actual mountpoint that is used.
|
||||
|
||||
noalign
|
||||
Data allocations will not be aligned at stripe unit boundaries.
|
||||
|
||||
noatime
|
||||
Access timestamps are not updated when a file is read.
|
||||
|
||||
norecovery
|
||||
The filesystem will be mounted without running log recovery.
|
||||
If the filesystem was not cleanly unmounted, it is likely to
|
||||
be inconsistent when mounted in "norecovery" mode.
|
||||
Some files or directories may not be accessible because of this.
|
||||
Filesystems mounted "norecovery" must be mounted read-only or
|
||||
the mount will fail.
|
||||
|
||||
nouuid
|
||||
Don't check for double mounted filesystems using the filesystem UUID.
|
||||
This is useful to mount LVM snapshot volumes.
|
||||
|
||||
osyncisosync
|
||||
Make O_SYNC writes implement true O_SYNC. WITHOUT this option,
|
||||
Linux XFS behaves as if an "osyncisdsync" option is used,
|
||||
which will make writes to files opened with the O_SYNC flag set
|
||||
behave as if the O_DSYNC flag had been used instead.
|
||||
This can result in better performance without compromising
|
||||
data safety.
|
||||
However if this option is not in effect, timestamp updates from
|
||||
O_SYNC writes can be lost if the system crashes.
|
||||
If timestamp updates are critical, use the osyncisosync option.
|
||||
|
||||
uquota/usrquota/uqnoenforce/quota
|
||||
User disk quota accounting enabled, and limits (optionally)
|
||||
enforced. Refer to xfs_quota(8) for further details.
|
||||
|
||||
gquota/grpquota/gqnoenforce
|
||||
Group disk quota accounting enabled and limits (optionally)
|
||||
enforced. Refer to xfs_quota(8) for further details.
|
||||
|
||||
pquota/prjquota/pqnoenforce
|
||||
Project disk quota accounting enabled and limits (optionally)
|
||||
enforced. Refer to xfs_quota(8) for further details.
|
||||
|
||||
sunit=value and swidth=value
|
||||
Used to specify the stripe unit and width for a RAID device or
|
||||
a stripe volume. "value" must be specified in 512-byte block
|
||||
units.
|
||||
If this option is not specified and the filesystem was made on
|
||||
a stripe volume or the stripe width or unit were specified for
|
||||
the RAID device at mkfs time, then the mount system call will
|
||||
restore the value from the superblock. For filesystems that
|
||||
are made directly on RAID devices, these options can be used
|
||||
to override the information in the superblock if the underlying
|
||||
disk layout changes after the filesystem has been created.
|
||||
The "swidth" option is required if the "sunit" option has been
|
||||
specified, and must be a multiple of the "sunit" value.
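
Because sunit and swidth are given in 512-byte units rather than bytes, a conversion step is easy to get wrong. The helpers below are a sketch of that arithmetic and of the "multiple of sunit" rule; their names are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_SECTOR_SIZE 512u	/* sunit/swidth are given in 512-byte units */

/* Convert a byte count (a multiple of 512) to 512-byte units. */
uint64_t toy_bytes_to_sectors(uint64_t bytes)
{
	assert(bytes % TOY_SECTOR_SIZE == 0);
	return bytes / TOY_SECTOR_SIZE;
}

/* swidth must be a non-zero multiple of a non-zero sunit. */
int toy_stripe_valid(uint64_t sunit, uint64_t swidth)
{
	return sunit != 0 && swidth != 0 && swidth % sunit == 0;
}
```

For example, a 64KiB stripe unit across four data disks works out to sunit=128 and swidth=512.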
|
||||
|
||||
swalloc
|
||||
Data allocations will be rounded up to stripe width boundaries
|
||||
when the current end of file is being extended and the file
|
||||
size is larger than the stripe width size.
|
||||
|
||||
|
||||
sysctls
|
||||
=======
|
||||
|
||||
The following sysctls are available for the XFS filesystem:
|
||||
|
||||
fs.xfs.stats_clear (Min: 0 Default: 0 Max: 1)
|
||||
Setting this to "1" clears accumulated XFS statistics
|
||||
in /proc/fs/xfs/stat. It then immediately resets to "0".
|
||||
|
||||
fs.xfs.xfssyncd_centisecs (Min: 100 Default: 3000 Max: 720000)
|
||||
The interval at which the xfssyncd thread flushes metadata
|
||||
out to disk. This thread will flush log activity out, and
|
||||
do some processing on unlinked inodes.
|
||||
|
||||
fs.xfs.xfsbufd_centisecs (Min: 50 Default: 100 Max: 3000)
|
||||
The interval at which xfsbufd scans the dirty metadata buffers list.
|
||||
|
||||
fs.xfs.age_buffer_centisecs (Min: 100 Default: 1500 Max: 720000)
|
||||
The age at which xfsbufd flushes dirty metadata buffers to disk.
|
||||
|
||||
fs.xfs.error_level (Min: 0 Default: 3 Max: 11)
|
||||
A volume knob for error reporting when internal errors occur.
|
||||
This will generate detailed messages & backtraces for filesystem
|
||||
shutdowns, for example. Current threshold values are:
|
||||
|
||||
	XFS_ERRLEVEL_OFF:	0
	XFS_ERRLEVEL_LOW:	1
	XFS_ERRLEVEL_HIGH:	5
|
||||
|
||||
fs.xfs.panic_mask (Min: 0 Default: 0 Max: 127)
|
||||
Causes certain error conditions to call BUG(). Value is a bitmask;
|
||||
OR together the tags which represent errors which should cause panics:
|
||||
|
||||
	XFS_NO_PTAG                     0
	XFS_PTAG_IFLUSH                 0x00000001
	XFS_PTAG_LOGRES                 0x00000002
	XFS_PTAG_AILDELETE              0x00000004
	XFS_PTAG_ERROR_REPORT           0x00000008
	XFS_PTAG_SHUTDOWN_CORRUPT       0x00000010
	XFS_PTAG_SHUTDOWN_IOERROR       0x00000020
	XFS_PTAG_SHUTDOWN_LOGERROR      0x00000040
|
||||
|
||||
This option is intended for debugging only.
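
Since panic_mask is a bitmask, the value to write is obtained by combining the chosen tags with bitwise OR. The sketch below copies the tag values from the table above; toy_panic_mask is a hypothetical helper, not part of XFS.

```c
#include <assert.h>

/* Tag values copied from the table above. */
#define XFS_NO_PTAG			0
#define XFS_PTAG_IFLUSH			0x00000001
#define XFS_PTAG_LOGRES			0x00000002
#define XFS_PTAG_AILDELETE		0x00000004
#define XFS_PTAG_ERROR_REPORT		0x00000008
#define XFS_PTAG_SHUTDOWN_CORRUPT	0x00000010
#define XFS_PTAG_SHUTDOWN_IOERROR	0x00000020
#define XFS_PTAG_SHUTDOWN_LOGERROR	0x00000040

#define TOY_PANIC_MASK_MAX		127	/* documented maximum */

/* OR together n tags and check the result stays within the sysctl range. */
unsigned int toy_panic_mask(const unsigned int *tags, unsigned int n)
{
	unsigned int mask = XFS_NO_PTAG;
	unsigned int i;

	for (i = 0; i < n; i++)
		mask |= tags[i];
	assert(mask <= TOY_PANIC_MASK_MAX);
	return mask;
}
```

Selecting XFS_PTAG_SHUTDOWN_CORRUPT and XFS_PTAG_SHUTDOWN_IOERROR gives 0x30, so the value to set in fs.xfs.panic_mask would be 48.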
|
||||
|
||||
fs.xfs.irix_symlink_mode (Min: 0 Default: 0 Max: 1)
|
||||
Controls whether symlinks are created with mode 0777 (default)
|
||||
or whether their mode is affected by the umask (irix mode).
|
||||
|
||||
fs.xfs.irix_sgid_inherit (Min: 0 Default: 0 Max: 1)
|
||||
Controls files created in SGID directories.
|
||||
If the group ID of the new file does not match the effective group
|
||||
ID or one of the supplementary group IDs of the parent dir, the
|
||||
ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl
|
||||
is set.
|
||||
|
||||
fs.xfs.restrict_chown (Min: 0 Default: 1 Max: 1)
|
||||
Controls whether unprivileged users can use chown to "give away"
|
||||
a file to another user.
|
||||
|
||||
fs.xfs.inherit_sync (Min: 0 Default: 1 Max: 1)
|
||||
Setting this to "1" will cause the "sync" flag set
|
||||
by the xfs_io(8) chattr command on a directory to be
|
||||
inherited by files in that directory.
|
||||
|
||||
fs.xfs.inherit_nodump (Min: 0 Default: 1 Max: 1)
|
||||
Setting this to "1" will cause the "nodump" flag set
|
||||
by the xfs_io(8) chattr command on a directory to be
|
||||
inherited by files in that directory.
|
||||
|
||||
fs.xfs.inherit_noatime (Min: 0 Default: 1 Max: 1)
|
||||
Setting this to "1" will cause the "noatime" flag set
|
||||
by the xfs_io(8) chattr command on a directory to be
|
||||
inherited by files in that directory.
|
||||
|
||||
fs.xfs.inherit_nosymlinks (Min: 0 Default: 1 Max: 1)
|
||||
Setting this to "1" will cause the "nosymlinks" flag set
|
||||
by the xfs_io(8) chattr command on a directory to be
|
||||
inherited by files in that directory.
|
||||
|
||||
fs.xfs.rotorstep (Min: 1 Default: 1 Max: 256)
|
||||
In "inode32" allocation mode, this option determines how many
|
||||
files the allocator attempts to allocate in the same allocation
|
||||
group before moving to the next allocation group. The intent
|
||||
is to control the rate at which the allocator moves between
|
||||
allocation groups when allocating extents for new files.
|
||||
67
Documentation/filesystems/xip.txt
Normal file
@@ -0,0 +1,67 @@
|
||||
Execute-in-place for file mappings
|
||||
----------------------------------
|
||||
|
||||
Motivation
|
||||
----------
|
||||
File mappings are performed by mapping page cache pages to userspace. In
|
||||
addition, read and write type file operations also transfer data from/to the page
|
||||
cache.
|
||||
|
||||
For memory backed storage devices that use the block device interface, the page
|
||||
cache pages are in fact copies of the original storage. Various approaches
|
||||
exist to work around the need for an extra copy. The ramdisk driver, for example,
reads the data into the page cache, keeps a reference, and later discards the
original data.
|
||||
|
||||
Execute-in-place solves this issue the other way around: instead of keeping
|
||||
data in the page cache, the need to have a page cache copy is eliminated
|
||||
completely. With execute-in-place, read and write type operations are performed
|
||||
directly from/to the memory backed storage device. For file mappings, the
|
||||
storage device itself is mapped directly into userspace.
|
||||
|
||||
This implementation was initially written for shared memory segments between
|
||||
different virtual machines on s390 hardware to allow multiple machines to
|
||||
share the same binaries and libraries.
|
||||
|
||||
Implementation
|
||||
--------------
|
||||
Execute-in-place is implemented in three steps: block device operation,
|
||||
address space operation, and file operations.
|
||||
|
||||
A block device operation named direct_access is used to retrieve a
|
||||
reference (pointer) to a block on-disk. The reference is supposed to be a
cpu-addressable physical address and remain valid until the release operation
|
||||
is performed. A struct block_device reference is used to address the device,
|
||||
and a sector_t argument is used to identify the individual block. As an
|
||||
alternative, memory technology devices can be used for this.
|
||||
|
||||
The block device operation is optional; these block devices support it as of
|
||||
today:
|
||||
- dcssblk: s390 dcss block device driver
|
||||
|
||||
An address space operation named get_xip_page is used to retrieve a reference
to a struct page. To address the target page, a reference to an address_space
and a sector number are provided. A third argument indicates whether the
|
||||
function should allocate blocks if needed.
|
||||
|
||||
This address space operation is mutually exclusive with readpage and writepage,
which do page cache read/write operations.
|
||||
The following filesystems support it as of today:
|
||||
- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
|
||||
|
||||
A set of file operations that utilize get_xip_page can be found in
mm/filemap_xip.c. The following file operation implementations are provided:
|
||||
- aio_read/aio_write
|
||||
- readv/writev
|
||||
- sendfile
|
||||
|
||||
The generic file operations do_sync_read/do_sync_write can be used to implement
|
||||
classic synchronous IO calls.
|
||||
|
||||
Shortcomings
|
||||
------------
|
||||
This implementation is limited to storage devices that are cpu addressable at
|
||||
all times (no highmem or such). It works well on rom/ram, but enhancements are
|
||||
needed to make it work with flash in read+write mode.
|
||||
Putting the Linux kernel and/or its modules on a xip filesystem does not mean
|
||||
they are not copied.
|
||||