Creation of Cybook 2416 (actually Gen4) repository

This commit is contained in:
mlt
2009-12-18 17:10:00 +00:00
committed by godzil
commit 76f20f4d40
13791 changed files with 6812321 additions and 0 deletions

View File

@@ -0,0 +1,94 @@
00-INDEX
- this file (info on some of the filesystems supported by linux).
Exporting
- explanation of how to make filesystems exportable.
Locking
- info on locking rules as they pertain to Linux VFS.
9p.txt
- 9p (v9fs) is an implementation of the Plan 9 remote fs protocol.
adfs.txt
- info and mount options for the Acorn Advanced Disc Filing System.
afs.txt
- info and examples for the distributed AFS (Andrew File System) fs.
affs.txt
- info and mount options for the Amiga Fast File System.
automount-support.txt
- information about filesystem automount support.
befs.txt
- information about the BeOS filesystem for Linux.
bfs.txt
- info for the SCO UnixWare Boot Filesystem (BFS).
cifs.txt
- description of the CIFS filesystem.
coda.txt
- description of the CODA filesystem.
configfs/
- directory containing configfs documentation and example code.
cramfs.txt
- info on the cram filesystem for small storage (ROMs etc).
dentry-locking.txt
- info on the RCU-based dcache locking model.
directory-locking
- info about the locking scheme used for directory operations.
dlmfs.txt
- info on the userspace interface to the OCFS2 DLM.
ext2.txt
- info, mount options and specifications for the Ext2 filesystem.
ext3.txt
- info, mount options and specifications for the Ext3 filesystem.
ext4.txt
- info, mount options and specifications for the Ext4 filesystem.
files.txt
- info on file management in the Linux kernel.
fuse.txt
- info on the Filesystem in User SpacE including mount options.
hfs.txt
- info on the Macintosh HFS Filesystem for Linux.
hpfs.txt
- info and mount options for the OS/2 HPFS.
isofs.txt
- info and mount options for the ISO 9660 (CDROM) filesystem.
jfs.txt
- info and mount options for the JFS filesystem.
ncpfs.txt
- info on Novell Netware(tm) filesystem using NCP protocol.
ntfs.txt
- info and mount options for the NTFS filesystem (Windows NT).
ocfs2.txt
- info and mount options for the OCFS2 clustered filesystem.
porting
- various information on filesystem porting.
proc.txt
- info on Linux's /proc filesystem.
ramfs-rootfs-initramfs.txt
- info on the 'in memory' filesystems ramfs, rootfs and initramfs.
reiser4.txt
- info on the Reiser4 filesystem based on dancing tree algorithms.
relay.txt
- info on relay, for efficient streaming from kernel to user space.
romfs.txt
- description of the ROMFS filesystem.
smbfs.txt
- info on using filesystems with the SMB protocol (Win 3.11 and NT).
spufs.txt
- info and mount options for the SPU filesystem used on Cell.
sysfs-pci.txt
- info on accessing PCI device resources through sysfs.
sysfs.txt
- info on sysfs, a ram-based filesystem for exporting kernel objects.
sysv-fs.txt
- info on the SystemV/V7/Xenix/Coherent filesystem.
tmpfs.txt
- info on tmpfs, a filesystem that holds all files in virtual memory.
udf.txt
- info and mount options for the UDF filesystem.
ufs.txt
- info on the ufs filesystem.
vfat.txt
- info on using the VFAT filesystem used in Windows NT and Windows 95
vfs.txt
- overview of the Virtual File System
xfs.txt
- info and mount options for the XFS filesystem.
xip.txt
- info on execute-in-place for file mappings.

View File

@@ -0,0 +1,118 @@
v9fs: Plan 9 Resource Sharing for Linux
=======================================
ABOUT
=====
v9fs is a Unix implementation of the Plan 9 9p remote filesystem protocol.
This software was originally developed by Ron Minnich <rminnich@lanl.gov>
and Maya Gokhale <maya@lanl.gov>. Additional development by Greg Watson
<gwatson@lanl.gov> and most recently Eric Van Hensbergen
<ericvh@gmail.com>, Latchesar Ionkov <lucho@ionkov.net> and Russ Cox
<rsc@swtch.com>.
USAGE
=====
For remote file server:
mount -t 9p 10.10.1.2 /mnt/9
For Plan 9 From User Space applications (http://swtch.com/plan9)
mount -t 9p `namespace`/acme /mnt/9 -o proto=unix,uname=$USER
OPTIONS
=======
proto=name select an alternative transport. Valid options are
currently:
unix - specifying a named pipe mount point
tcp - specifying a normal TCP/IP connection
fd - used passed file descriptors for connection
(see rfdno and wfdno)
uname=name user name to attempt mount as on the remote server. The
server may override or ignore this value. Certain user
names may require authentication.
aname=name aname specifies the file tree to access when the server is
offering several exported file systems.
cache=mode specifies a cacheing policy. By default, no caches are used.
loose = no attempts are made at consistency,
intended for exclusive, read-only mounts
debug=n specifies debug level. The debug level is a bitmask.
0x01 = display verbose error messages
0x02 = developer debug (DEBUG_CURRENT)
0x04 = display 9p trace
0x08 = display VFS trace
0x10 = display Marshalling debug
0x20 = display RPC debug
0x40 = display transport debug
0x80 = display allocation debug
rfdno=n the file descriptor for reading with proto=fd
wfdno=n the file descriptor for writing with proto=fd
maxdata=n the number of bytes to use for 9p packet payload (msize)
port=n port to connect to on the remote server
noextend force legacy mode (no 9p2000.u semantics)
uid attempt to mount as a particular uid
gid attempt to mount with a particular gid
afid security channel - used by Plan 9 authentication protocols
nodevmap do not map special files - represent them as normal files.
This can be used to share devices/named pipes/sockets between
hosts. This functionality will be expanded in later versions.
RESOURCES
=========
Our current recommendation is to use Inferno (http://www.vitanuova.com/inferno)
as the 9p server. You can start a 9p server under Inferno by issuing the
following command:
; styxlisten -A tcp!*!564 export '#U*'
The -A specifies an unauthenticated export. The 564 is the port # (you may
have to choose a higher port number if running as a normal user). The '#U*'
specifies exporting the root of the Linux name space. You may specify a
subset of the namespace by extending the path: '#U*'/tmp would just export
/tmp. For more information, see the Inferno manual pages covering styxlisten
and export.
A Linux version of the 9p server is now maintained under the npfs project
on sourceforge (http://sourceforge.net/projects/npfs). There is also a
more stable single-threaded version of the server (named spfs) available from
the same CVS repository.
There are user and developer mailing lists available through the v9fs project
on sourceforge (http://sourceforge.net/projects/v9fs).
News and other information is maintained on SWiK (http://swik.net/v9fs).
Bug reports may be issued through the kernel.org bugzilla
(http://bugzilla.kernel.org)
For more information on the Plan 9 Operating System check out
http://plan9.bell-labs.com/plan9
For information on Plan 9 from User Space (Plan 9 applications and libraries
ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9
STATUS
======
The 2.6 kernel support is working on PPC and x86.
PLEASE USE THE KERNEL BUGZILLA TO REPORT PROBLEMS. (http://bugzilla.kernel.org)

View File

@@ -0,0 +1,176 @@
Making Filesystems Exportable
=============================
Most filesystem operations require a dentry (or two) as a starting
point. Local applications have a reference-counted hold on suitable
dentrys via open file descriptors or cwd/root. However remote
applications that access a filesystem via a remote filesystem protocol
such as NFS may not be able to hold such a reference, and so need a
different way to refer to a particular dentry. As the alternative
form of reference needs to be stable across renames, truncates, and
server-reboot (among other things, though these tend to be the most
problematic), there is no simple answer like 'filename'.
The mechanism discussed here allows each filesystem implementation to
specify how to generate an opaque (out side of the filesystem) byte
string for any dentry, and how to find an appropriate dentry for any
given opaque byte string.
This byte string will be called a "filehandle fragment" as it
corresponds to part of an NFS filehandle.
A filesystem which supports the mapping between filehandle fragments
and dentrys will be termed "exportable".
Dcache Issues
-------------
The dcache normally contains a proper prefix of any given filesystem
tree. This means that if any filesystem object is in the dcache, then
all of the ancestors of that filesystem object are also in the dcache.
As normal access is by filename this prefix is created naturally and
maintained easily (by each object maintaining a reference count on
its parent).
However when objects are included into the dcache by interpreting a
filehandle fragment, there is no automatic creation of a path prefix
for the object. This leads to two related but distinct features of
the dcache that are not needed for normal filesystem access.
1/ The dcache must sometimes contain objects that are not part of the
proper prefix. i.e that are not connected to the root.
2/ The dcache must be prepared for a newly found (via ->lookup) directory
to already have a (non-connected) dentry, and must be able to move
that dentry into place (based on the parent and name in the
->lookup). This is particularly needed for directories as
it is a dcache invariant that directories only have one dentry.
To implement these features, the dcache has:
a/ A dentry flag DCACHE_DISCONNECTED which is set on
any dentry that might not be part of the proper prefix.
This is set when anonymous dentries are created, and cleared when a
dentry is noticed to be a child of a dentry which is in the proper
prefix.
b/ A per-superblock list "s_anon" of dentries which are the roots of
subtrees that are not in the proper prefix. These dentries, as
well as the proper prefix, need to be released at unmount time. As
these dentries will not be hashed, they are linked together on the
d_hash list_head.
c/ Helper routines to allocate anonymous dentries, and to help attach
loose directory dentries at lookup time. They are:
d_alloc_anon(inode) will return a dentry for the given inode.
If the inode already has a dentry, one of those is returned.
If it doesn't, a new anonymous (IS_ROOT and
DCACHE_DISCONNECTED) dentry is allocated and attached.
In the case of a directory, care is taken that only one dentry
can ever be attached.
d_splice_alias(inode, dentry) will make sure that there is a
dentry with the same name and parent as the given dentry, and
which refers to the given inode.
If the inode is a directory and already has a dentry, then that
dentry is d_moved over the given dentry.
If the passed dentry gets attached, care is taken that this is
mutually exclusive to a d_alloc_anon operation.
If the passed dentry is used, NULL is returned, else the used
dentry is returned. This corresponds to the calling pattern of
->lookup.
Filesystem Issues
-----------------
For a filesystem to be exportable it must:
1/ provide the filehandle fragment routines described below.
2/ make sure that d_splice_alias is used rather than d_add
when ->lookup finds an inode for a given parent and name.
Typically the ->lookup routine will end:
if (inode)
return d_splice(inode, dentry);
d_add(dentry, inode);
return NULL;
}
A file system implementation declares that instances of the filesystem
are exportable by setting the s_export_op field in the struct
super_block. This field must point to a "struct export_operations"
struct which could potentially be full of NULLs, though normally at
least get_parent will be set.
The primary operations are decode_fh and encode_fh.
decode_fh takes a filehandle fragment and tries to find or create a
dentry for the object referred to by the filehandle.
encode_fh takes a dentry and creates a filehandle fragment which can
later be used to find/create a dentry for the same object.
decode_fh will probably make use of "find_exported_dentry".
This function lives in the "exportfs" module which a filesystem does
not need unless it is being exported. So rather that calling
find_exported_dentry directly, each filesystem should call it through
the find_exported_dentry pointer in it's export_operations table.
This field is set correctly by the exporting agent (e.g. nfsd) when a
filesystem is exported, and before any export operations are called.
find_exported_dentry needs three support functions from the
filesystem:
get_name. When given a parent dentry and a child dentry, this
should find a name in the directory identified by the parent
dentry, which leads to the object identified by the child dentry.
If no get_name function is supplied, a default implementation is
provided which uses vfs_readdir to find potential names, and
matches inode numbers to find the correct match.
get_parent. When given a dentry for a directory, this should return
a dentry for the parent. Quite possibly the parent dentry will
have been allocated by d_alloc_anon.
The default get_parent function just returns an error so any
filehandle lookup that requires finding a parent will fail.
->lookup("..") is *not* used as a default as it can leave ".."
entries in the dcache which are too messy to work with.
get_dentry. When given an opaque datum, this should find the
implied object and create a dentry for it (possibly with
d_alloc_anon).
The opaque datum is whatever is passed down by the decode_fh
function, and is often simply a fragment of the filehandle
fragment.
decode_fh passes two datums through find_exported_dentry. One that
should be used to identify the target object, and one that can be
used to identify the object's parent, should that be necessary.
The default get_dentry function assumes that the datum contains an
inode number and a generation number, and it attempts to get the
inode using "iget" and check it's validity by matching the
generation number. A filesystem should only depend on the default
if iget can safely be used this way.
If decode_fh and/or encode_fh are left as NULL, then default
implementations are used. These defaults are suitable for ext2 and
extremely similar filesystems (like ext3).
The default encode_fh creates a filehandle fragment from the inode
number and generation number of the target together with the inode
number and generation number of the parent (if the parent is
required).
The default decode_fh extract the target and parent datums from the
filehandle assuming the format used by the default encode_fh and
passed them to find_exported_dentry.
A filehandle fragment consists of an array of 1 or more 4byte words,
together with a one byte "type".
The decode_fh routine should not depend on the stated size that is
passed to it. This size may be larger than the original filehandle
generated by encode_fh, in which case it will have been padded with
nuls. Rather, the encode_fh routine should choose a "type" which
indicates the decode_fh how much of the filehandle is valid, and how
it should be interpreted.

View File

@@ -0,0 +1,527 @@
The text below describes the locking rules for VFS-related methods.
It is (believed to be) up-to-date. *Please*, if you change anything in
prototypes or locking protocols - update this file. And update the relevant
instances in the tree, don't leave that to maintainers of filesystems/devices/
etc. At the very least, put the list of dubious cases in the end of this file.
Don't turn it into log - maintainers of out-of-the-tree code are supposed to
be able to use diff(1).
Thing currently missing here: socket operations. Alexey?
--------------------------- dentry_operations --------------------------
prototypes:
int (*d_revalidate)(struct dentry *, int);
int (*d_hash) (struct dentry *, struct qstr *);
int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
int (*d_delete)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
locking rules:
none have BKL
dcache_lock rename_lock ->d_lock may block
d_revalidate: no no no yes
d_hash no no no yes
d_compare: no yes no no
d_delete: yes no yes no
d_release: no no no yes
d_iput: no no no yes
--------------------------- inode_operations ---------------------------
prototypes:
int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameid
ata *);
int (*link) (struct dentry *,struct inode *,struct dentry *);
int (*unlink) (struct inode *,struct dentry *);
int (*symlink) (struct inode *,struct dentry *,const char *);
int (*mkdir) (struct inode *,struct dentry *,int);
int (*rmdir) (struct inode *,struct dentry *);
int (*mknod) (struct inode *,struct dentry *,int,dev_t);
int (*rename) (struct inode *, struct dentry *,
struct inode *, struct dentry *);
int (*readlink) (struct dentry *, char __user *,int);
int (*follow_link) (struct dentry *, struct nameidata *);
void (*truncate) (struct inode *);
int (*permission) (struct inode *, int, struct nameidata *);
int (*setattr) (struct dentry *, struct iattr *);
int (*getattr) (struct vfsmount *, struct dentry *, struct kstat *);
int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*removexattr) (struct dentry *, const char *);
locking rules:
all may block, none have BKL
i_sem(inode)
lookup: yes
create: yes
link: yes (both)
mknod: yes
symlink: yes
mkdir: yes
unlink: yes (both)
rmdir: yes (both) (see below)
rename: yes (all) (see below)
readlink: no
follow_link: no
truncate: yes (see below)
setattr: yes
permission: no
getattr: no
setxattr: yes
getxattr: no
listxattr: no
removexattr: yes
Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_sem on
victim.
cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
->truncate() is never called directly - it's a callback, not a
method. It's called by vmtruncate() - library function normally used by
->setattr(). Locking information above applies to that call (i.e. is
inherited from ->setattr() - vmtruncate() is used when ATTR_SIZE had been
passed).
See Documentation/filesystems/directory-locking for more detailed discussion
of the locking scheme for directory operations.
--------------------------- super_operations ---------------------------
prototypes:
struct inode *(*alloc_inode)(struct super_block *sb);
void (*destroy_inode)(struct inode *);
void (*read_inode) (struct inode *);
void (*dirty_inode) (struct inode *);
int (*write_inode) (struct inode *, int);
void (*put_inode) (struct inode *);
void (*drop_inode) (struct inode *);
void (*delete_inode) (struct inode *);
void (*put_super) (struct super_block *);
void (*write_super) (struct super_block *);
int (*sync_fs)(struct super_block *sb, int wait);
void (*write_super_lockfs) (struct super_block *);
void (*unlockfs) (struct super_block *);
int (*statfs) (struct dentry *, struct kstatfs *);
int (*remount_fs) (struct super_block *, int *, char *);
void (*clear_inode) (struct inode *);
void (*umount_begin) (struct super_block *);
int (*show_options)(struct seq_file *, struct vfsmount *);
ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
locking rules:
All may block.
BKL s_lock s_umount
alloc_inode: no no no
destroy_inode: no
read_inode: no (see below)
dirty_inode: no (must not sleep)
write_inode: no
put_inode: no
drop_inode: no !!!inode_lock!!!
delete_inode: no
put_super: yes yes no
write_super: no yes read
sync_fs: no no read
write_super_lockfs: ?
unlockfs: ?
statfs: no no no
remount_fs: yes yes maybe (see below)
clear_inode: no
umount_begin: yes no no
show_options: no (vfsmount->sem)
quota_read: no no no (see below)
quota_write: no no no (see below)
->read_inode() is not a method - it's a callback used in iget().
->remount_fs() will have the s_umount lock if it's already mounted.
When called from get_sb_single, it does NOT have the s_umount lock.
->quota_read() and ->quota_write() functions are both guaranteed to
be the only ones operating on the quota file by the quota code (via
dqio_sem) (unless an admin really wants to screw up something and
writes to quota files with quotas on). For other details about locking
see also dquot_operations section.
--------------------------- file_system_type ---------------------------
prototypes:
int (*get_sb) (struct file_system_type *, int,
const char *, void *, struct vfsmount *);
void (*kill_sb) (struct super_block *);
locking rules:
may block BKL
get_sb yes yes
kill_sb yes yes
->get_sb() returns error or 0 with locked superblock attached to the vfsmount
(exclusive on ->s_umount).
->kill_sb() takes a write-locked superblock, does all shutdown work on it,
unlocks and drops the reference.
--------------------------- address_space_operations --------------------------
prototypes:
int (*writepage)(struct page *page, struct writeback_control *wbc);
int (*readpage)(struct file *, struct page *);
int (*sync_page)(struct page *);
int (*writepages)(struct address_space *, struct writeback_control *);
int (*set_page_dirty)(struct page *page);
int (*readpages)(struct file *filp, struct address_space *mapping,
struct list_head *pages, unsigned nr_pages);
int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
sector_t (*bmap)(struct address_space *, sector_t);
int (*invalidatepage) (struct page *, unsigned long);
int (*releasepage) (struct page *, int);
int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
loff_t offset, unsigned long nr_segs);
int (*launder_page) (struct page *);
locking rules:
All except set_page_dirty may block
BKL PageLocked(page)
writepage: no yes, unlocks (see below)
readpage: no yes, unlocks
sync_page: no maybe
writepages: no
set_page_dirty no no
readpages: no
prepare_write: no yes
commit_write: no yes
bmap: yes
invalidatepage: no yes
releasepage: no yes
direct_IO: no
launder_page: no yes
->prepare_write(), ->commit_write(), ->sync_page() and ->readpage()
may be called from the request handler (/dev/loop).
->readpage() unlocks the page, either synchronously or via I/O
completion.
->readpages() populates the pagecache with the passed pages and starts
I/O against them. They come unlocked upon I/O completion.
->writepage() is used for two purposes: for "memory cleansing" and for
"sync". These are quite different operations and the behaviour may differ
depending upon the mode.
If writepage is called for sync (wbc->sync_mode != WBC_SYNC_NONE) then
it *must* start I/O against the page, even if that would involve
blocking on in-progress I/O.
If writepage is called for memory cleansing (sync_mode ==
WBC_SYNC_NONE) then its role is to get as much writeout underway as
possible. So writepage should try to avoid blocking against
currently-in-progress I/O.
If the filesystem is not called for "sync" and it determines that it
would need to block against in-progress I/O to be able to start new I/O
against the page the filesystem should redirty the page with
redirty_page_for_writepage(), then unlock the page and return zero.
This may also be done to avoid internal deadlocks, but rarely.
If the filesytem is called for sync then it must wait on any
in-progress I/O and then start new I/O.
The filesystem should unlock the page synchronously, before returning to the
caller, unless ->writepage() returns special WRITEPAGE_ACTIVATE
value. WRITEPAGE_ACTIVATE means that page cannot really be written out
currently, and VM should stop calling ->writepage() on this page for some
time. VM does this by moving page to the head of the active list, hence the
name.
Unless the filesystem is going to redirty_page_for_writepage(), unlock the page
and return zero, writepage *must* run set_page_writeback() against the page,
followed by unlocking it. Once set_page_writeback() has been run against the
page, write I/O can be submitted and the write I/O completion handler must run
end_page_writeback() once the I/O is complete. If no I/O is submitted, the
filesystem must run end_page_writeback() against the page before returning from
writepage.
That is: after 2.5.12, pages which are under writeout are *not* locked. Note,
if the filesystem needs the page to be locked during writeout, that is ok, too,
the page is allowed to be unlocked at any point in time between the calls to
set_page_writeback() and end_page_writeback().
Note, failure to run either redirty_page_for_writepage() or the combination of
set_page_writeback()/end_page_writeback() on a page submitted to writepage
will leave the page itself marked clean but it will be tagged as dirty in the
radix tree. This incoherency can lead to all sorts of hard-to-debug problems
in the filesystem like having dirty inodes at umount and losing written data.
->sync_page() locking rules are not well-defined - usually it is called
with lock on page, but that is not guaranteed. Considering the currently
existing instances of this method ->sync_page() itself doesn't look
well-defined...
->writepages() is used for periodic writeback and for syscall-initiated
sync operations. The address_space should start I/O against at least
*nr_to_write pages. *nr_to_write must be decremented for each page which is
written. The address_space implementation may write more (or less) pages
than *nr_to_write asks for, but it should try to be reasonably close. If
nr_to_write is NULL, all dirty pages must be written.
writepages should _only_ write pages which are present on
mapping->io_pages.
->set_page_dirty() is called from various places in the kernel
when the target page is marked as needing writeback. It may be called
under spinlock (it cannot block) and is sometimes called with the page
not locked.
->bmap() is currently used by legacy ioctl() (FIBMAP) provided by some
filesystems and by the swapper. The latter will eventually go away. All
instances do not actually need the BKL. Please, keep it that way and don't
breed new callers.
->invalidatepage() is called when the filesystem must attempt to drop
some or all of the buffers from the page when it is being truncated. It
returns zero on success. If ->invalidatepage is zero, the kernel uses
block_invalidatepage() instead.
->releasepage() is called when the kernel is about to try to drop the
buffers from the page in preparation for freeing it. It returns zero to
indicate that the buffers are (or may be) freeable. If ->releasepage is zero,
the kernel assumes that the fs has no private interest in the buffers.
->launder_page() may be called prior to releasing a page if
it is still found to be dirty. It returns zero if the page was successfully
cleaned, or an error value if not. Note that in order to prevent the page
getting mapped back in and redirtied, it needs to be kept locked
across the entire operation.
Note: currently almost all instances of address_space methods are
using BKL for internal serialization and that's one of the worst sources
of contention. Normally they are calling library functions (in fs/buffer.c)
and pass foo_get_block() as a callback (on local block-based filesystems,
indeed). BKL is not needed for library stuff and is usually taken by
foo_get_block(). It's an overkill, since block bitmaps can be protected by
internal fs locking and real critical areas are much smaller than the areas
filesystems protect now.
----------------------- file_lock_operations ------------------------------
prototypes:
void (*fl_insert)(struct file_lock *); /* lock insertion callback */
void (*fl_remove)(struct file_lock *); /* lock removal callback */
void (*fl_copy_lock)(struct file_lock *, struct file_lock *);
void (*fl_release_private)(struct file_lock *);
locking rules:
BKL may block
fl_insert: yes no
fl_remove: yes no
fl_copy_lock: yes no
fl_release_private: yes yes
----------------------- lock_manager_operations ---------------------------
prototypes:
int (*fl_compare_owner)(struct file_lock *, struct file_lock *);
void (*fl_notify)(struct file_lock *); /* unblock callback */
void (*fl_copy_lock)(struct file_lock *, struct file_lock *);
void (*fl_release_private)(struct file_lock *);
void (*fl_break)(struct file_lock *); /* break_lease callback */
locking rules:
BKL may block
fl_compare_owner: yes no
fl_notify: yes no
fl_copy_lock: yes no
fl_release_private: yes yes
fl_break: yes no
Currently only NFSD and NLM provide instances of this class. None of the
them block. If you have out-of-tree instances - please, show up. Locking
in that area will change.
--------------------------- buffer_head -----------------------------------
prototypes:
void (*b_end_io)(struct buffer_head *bh, int uptodate);
locking rules:
called from interrupts. In other words, extreme care is needed here.
bh is locked, but that's all warranties we have here. Currently only RAID1,
highmem, fs/buffer.c, and fs/ntfs/aops.c are providing these. Block devices
call this method upon the IO completion.
--------------------------- block_device_operations -----------------------
prototypes:
int (*open) (struct inode *, struct file *);
int (*release) (struct inode *, struct file *);
int (*ioctl) (struct inode *, struct file *, unsigned, unsigned long);
int (*media_changed) (struct gendisk *);
int (*revalidate_disk) (struct gendisk *);
locking rules:
BKL bd_sem
open: yes yes
release: yes yes
ioctl: yes no
media_changed: no no
revalidate_disk: no no
The last two are called only from check_disk_change().
--------------------------- file_operations -------------------------------
prototypes:
loff_t (*llseek) (struct file *, loff_t, int);
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
int (*readdir) (struct file *, void *, filldir_t);
unsigned int (*poll) (struct file *, struct poll_table_struct *);
int (*ioctl) (struct inode *, struct file *, unsigned int,
unsigned long);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *);
int (*release) (struct inode *, struct file *);
int (*fsync) (struct file *, struct dentry *, int datasync);
int (*aio_fsync) (struct kiocb *, int datasync);
int (*fasync) (int, struct file *, int);
int (*lock) (struct file *, int, struct file_lock *);
ssize_t (*readv) (struct file *, const struct iovec *, unsigned long,
loff_t *);
ssize_t (*writev) (struct file *, const struct iovec *, unsigned long,
loff_t *);
ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t,
void __user *);
ssize_t (*sendpage) (struct file *, struct page *, int, size_t,
loff_t *, int);
unsigned long (*get_unmapped_area)(struct file *, unsigned long,
unsigned long, unsigned long, unsigned long);
int (*check_flags)(int);
int (*dir_notify)(struct file *, unsigned long);
};
locking rules:
All except ->poll() may block.
BKL
llseek: no (see below)
read: no
aio_read: no
write: no
aio_write: no
readdir: no
poll: no
ioctl: yes (see below)
unlocked_ioctl: no (see below)
compat_ioctl: no
mmap: no
open: maybe (see below)
flush: no
release: no
fsync: no (see below)
aio_fsync: no
fasync: yes (see below)
lock: yes
readv: no
writev: no
sendfile: no
sendpage: no
get_unmapped_area: no
check_flags: no
dir_notify: no
->llseek() locking has moved from llseek to the individual llseek
implementations. If your fs is not using generic_file_llseek, you
need to acquire and release the appropriate locks in your ->llseek().
For many filesystems, it is probably safe to acquire the inode
semaphore. Note some filesystems (i.e. remote ones) provide no
protection for i_size so you will need to use the BKL.
->open() locking is in-transit: big lock partially moved into the methods.
The only exception is ->open() in the instances of file_operations that never
end up in ->i_fop/->proc_fops, i.e. ones that belong to character devices
(chrdev_open() takes lock before replacing ->f_op and calling the secondary
method. As soon as we fix the handling of module reference counters all
instances of ->open() will be called without the BKL.
Note: ext2_release() was *the* source of contention on fs-intensive
loads and dropping BKL on ->release() helps to get rid of that (we still
grab BKL for cases when we close a file that had been opened r/w, but that
can and should be done using the internal locking with smaller critical areas).
Current worst offender is ext2_get_block()...
->fasync() is a mess. This area needs a big cleanup and that will probably
affect locking.
->readdir() and ->ioctl() on directories must be changed. Ideally we would
move ->readdir() to inode_operations and use a separate method for directory
->ioctl() or kill the latter completely. One of the problems is that for
anything that resembles union-mount we won't have a struct file for all
components. And there are other reasons why the current interface is a mess...
->ioctl() on regular files is superceded by the ->unlocked_ioctl() that
doesn't take the BKL.
->read on directories probably must go away - we should just enforce -EISDIR
in sys_read() and friends.
->fsync() has i_sem on inode.
--------------------------- dquot_operations -------------------------------
prototypes:
int (*initialize) (struct inode *, int);
int (*drop) (struct inode *);
int (*alloc_space) (struct inode *, qsize_t, int);
int (*alloc_inode) (const struct inode *, unsigned long);
int (*free_space) (struct inode *, qsize_t);
int (*free_inode) (const struct inode *, unsigned long);
int (*transfer) (struct inode *, struct iattr *);
int (*write_dquot) (struct dquot *);
int (*acquire_dquot) (struct dquot *);
int (*release_dquot) (struct dquot *);
int (*mark_dirty) (struct dquot *);
int (*write_info) (struct super_block *, int);
These operations are intended to be more or less wrapping functions that ensure
a proper locking wrt the filesystem and call the generic quota operations.
What filesystem should expect from the generic quota functions:
FS recursion Held locks when called
initialize: yes maybe dqonoff_sem
drop: yes -
alloc_space: ->mark_dirty() -
alloc_inode: ->mark_dirty() -
free_space: ->mark_dirty() -
free_inode: ->mark_dirty() -
transfer: yes -
write_dquot: yes dqonoff_sem or dqptr_sem
acquire_dquot: yes dqonoff_sem or dqptr_sem
release_dquot: yes dqonoff_sem or dqptr_sem
mark_dirty: no -
write_info: yes dqonoff_sem
FS recursion means calling ->quota_read() and ->quota_write() from superblock
operations.
->alloc_space(), ->alloc_inode(), ->free_space(), ->free_inode() are called
only directly by the filesystem and do not call any fs functions only
the ->mark_dirty() operation.
More details about quota locking can be found in fs/dquot.c.
--------------------------- vm_operations_struct -----------------------------
prototypes:
void (*open)(struct vm_area_struct*);
void (*close)(struct vm_area_struct*);
struct page *(*nopage)(struct vm_area_struct*, unsigned long, int *);
locking rules:
BKL mmap_sem
open: no yes
close: no yes
nopage: no yes
================================================================================
Dubious stuff
(if you break something or notice that it is broken and do not fix it yourself
- at least put it here)
ipc/shm.c::shm_delete() - may need BKL.
->read() and ->write() in many drivers are (probably) missing BKL.
drivers/sgi/char/graphics.c::sgi_graphics_nopage() - may need BKL.

View File

@@ -0,0 +1,57 @@
Mount options for ADFS
----------------------
uid=nnn All files in the partition will be owned by
user id nnn. Default 0 (root).
gid=nnn All files in the partition will be in group
nnn. Default 0 (root).
ownmask=nnn The permission mask for ADFS 'owner' permissions
will be nnn. Default 0700.
othmask=nnn The permission mask for ADFS 'other' permissions
will be nnn. Default 0077.
Mapping of ADFS permissions to Linux permissions
------------------------------------------------
ADFS permissions consist of the following:
Owner read
Owner write
Other read
Other write
(In older versions, an 'execute' permission did exist, but this
does not hold the same meaning as the Linux 'execute' permission
and is now obsolete).
The mapping is performed as follows:
Owner read -> -r--r--r--
Owner write -> --w--w---w
Owner read and filetype UnixExec -> ---x--x--x
These are then masked by ownmask, eg 700 -> -rwx------
Possible owner mode permissions -> -rwx------
Other read -> -r--r--r--
Other write -> --w--w--w-
Other read and filetype UnixExec -> ---x--x--x
These are then masked by othmask, eg 077 -> ----rwxrwx
Possible other mode permissions -> ----rwxrwx
Hence, with the default masks, if a file is owner read/write, and
not a UnixExec filetype, then the permissions will be:
-rw-------
However, if the masks were ownmask=0770,othmask=0007, then this would
be modified to:
-rw-rw----
There is no restriction on what you can do with these masks. You may
wish that either read bits give read access to the file for all, but
keep the default write protection (ownmask=0755,othmask=0577):
-rw-r--r--
You can therefore tailor the permission translation to whatever you
desire the permissions should be under Linux.

View File

@@ -0,0 +1,219 @@
Overview of Amiga Filesystems
=============================
Not all varieties of the Amiga filesystems are supported for reading and
writing. The Amiga currently knows six different filesystems:
DOS\0 The old or original filesystem, not really suited for
hard disks and normally not used on them, either.
Supported read/write.
DOS\1 The original Fast File System. Supported read/write.
DOS\2 The old "international" filesystem. International means that
a bug has been fixed so that accented ("international") letters
in file names are case-insensitive, as they ought to be.
Supported read/write.
DOS\3 The "international" Fast File System. Supported read/write.
DOS\4 The original filesystem with directory cache. The directory
cache speeds up directory accesses on floppies considerably,
but slows down file creation/deletion. Doesn't make much
sense on hard disks. Supported read only.
DOS\5 The Fast File System with directory cache. Supported read only.
All of the above filesystems allow block sizes from 512 to 32K bytes.
Supported block sizes are: 512, 1024, 2048 and 4096 bytes. Larger blocks
speed up almost everything at the expense of wasted disk space. The speed
gain above 4K seems not really worth the price, so you don't lose too
much here, either.
The muFS (multi user File System) equivalents of the above file systems
are supported, too.
Mount options for the AFFS
==========================
protect If this option is set, the protection bits cannot be altered.
setuid[=uid] This sets the owner of all files and directories in the file
system to uid or the uid of the current user, respectively.
setgid[=gid] Same as above, but for gid.
mode=mode Sets the mode flags to the given (octal) value, regardless
of the original permissions. Directories will get an x
permission if the corresponding r bit is set.
This is useful since most of the plain AmigaOS files
will map to 600.
reserved=num Sets the number of reserved blocks at the start of the
partition to num. You should never need this option.
Default is 2.
root=block Sets the block number of the root block. This should never
be necessary.
bs=blksize Sets the blocksize to blksize. Valid block sizes are 512,
1024, 2048 and 4096. Like the root option, this should
never be necessary, as the affs can figure it out itself.
quiet The file system will not return an error for disallowed
mode changes.
verbose The volume name, file system type and block size will
be written to the syslog when the filesystem is mounted.
mufs The filesystem is really a muFS, also it doesn't
identify itself as one. This option is necessary if
the filesystem wasn't formatted as muFS, but is used
as one.
prefix=path Path will be prefixed to every absolute path name of
symbolic links on an AFFS partition. Default = "/".
(See below.)
volume=name When symbolic links with an absolute path are created
on an AFFS partition, name will be prepended as the
volume name. Default = "" (empty string).
(See below.)
Handling of the Users/Groups and protection flags
=================================================
Amiga -> Linux:
The Amiga protection flags RWEDRWEDHSPARWED are handled as follows:
- R maps to r for user, group and others. On directories, R implies x.
- If both W and D are allowed, w will be set.
- E maps to x.
- H and P are always retained and ignored under Linux.
- A is always reset when a file is written to.
User id and group id will be used unless set[gu]id are given as mount
options. Since most of the Amiga file systems are single user systems
they will be owned by root. The root directory (the mount point) of the
Amiga filesystem will be owned by the user who actually mounts the
filesystem (the root directory doesn't have uid/gid fields).
Linux -> Amiga:
The Linux rwxrwxrwx file mode is handled as follows:
- r permission will set R for user, group and others.
- w permission will set W and D for user, group and others.
- x permission of the user will set E for plain files.
- All other flags (suid, sgid, ...) are ignored and will
not be retained.
Newly created files and directories will get the user and group ID
of the current user and a mode according to the umask.
Symbolic links
==============
Although the Amiga and Linux file systems resemble each other, there
are some, not always subtle, differences. One of them becomes apparent
with symbolic links. While Linux has a file system with exactly one
root directory, the Amiga has a separate root directory for each
file system (for example, partition, floppy disk, ...). With the Amiga,
these entities are called "volumes". They have symbolic names which
can be used to access them. Thus, symbolic links can point to a
different volume. AFFS turns the volume name into a directory name
and prepends the prefix path (see prefix option) to it.
Example:
You mount all your Amiga partitions under /amiga/<volume> (where
<volume> is the name of the volume), and you give the option
"prefix=/amiga/" when mounting all your AFFS partitions. (They
might be "User", "WB" and "Graphics", the mount points /amiga/User,
/amiga/WB and /amiga/Graphics). A symbolic link referring to
"User:sc/include/dos/dos.h" will be followed to
"/amiga/User/sc/include/dos/dos.h".
Examples
========
Command line:
mount Archive/Amiga/Workbench3.1.adf /mnt -t affs -o loop,verbose
mount /dev/sda3 /Amiga -t affs
/etc/fstab entry:
/dev/sdb5 /amiga/Workbench affs noauto,user,exec,verbose 0 0
IMPORTANT NOTE
==============
If you boot Windows 95 (don't know about 3.x, 98 and NT) while you
have an Amiga harddisk connected to your PC, it will overwrite
the bytes 0x00dc..0x00df of block 0 with garbage, thus invalidating
the Rigid Disk Block. Sheer luck has it that this is an unused
area of the RDB, so only the checksum doesn't match anymore.
Linux will ignore this garbage and recognize the RDB anyway, but
before you connect that drive to your Amiga again, you must
restore or repair your RDB. So please do make a backup copy of it
before booting Windows!
If the damage is already done, the following should fix the RDB
(where <disk> is the device name).
DO AT YOUR OWN RISK:
dd if=/dev/<disk> of=rdb.tmp count=1
cp rdb.tmp rdb.fixed
dd if=/dev/zero of=rdb.fixed bs=1 seek=220 count=4
dd if=rdb.fixed of=/dev/<disk>
Bugs, Restrictions, Caveats
===========================
Quite a few things may not work as advertised. Not everything is
tested, though several hundred MB have been read and written using
this fs. For a most up-to-date list of bugs please consult
fs/affs/Changes.
Filenames are truncated to 30 characters without warning (this
can be changed by setting the compile-time option AFFS_NO_TRUNCATE
in include/linux/amigaffs.h).
Case is ignored by the affs in filename matching, but Linux shells
do care about the case. Example (with /wb being an affs mounted fs):
rm /wb/WRONGCASE
will remove /mnt/wrongcase, but
rm /wb/WR*
will not since the names are matched by the shell.
The block allocation is designed for hard disk partitions. If more
than 1 process writes to a (small) diskette, the blocks are allocated
in an ugly way (but the real AFFS doesn't do much better). This
is also true when space gets tight.
You cannot execute programs on an OFS (Old File System), since the
program files cannot be memory mapped due to the 488 byte blocks.
For the same reason you cannot mount an image on such a filesystem
via the loopback device.
The bitmap valid flag in the root block may not be accurate when the
system crashes while an affs partition is mounted. There's currently
no way to fix a garbled filesystem without an Amiga (disk validator)
or manually (who would do this?). Maybe later.
If you mount affs partitions on system startup, you may want to tell
fsck that the fs should not be checked (place a '0' in the sixth field
of /etc/fstab).
It's not possible to read floppy disks with a normal PC or workstation
due to an incompatibility with the Amiga floppy controller.
If you are interested in an Amiga Emulator for Linux, look at
http://www.freiburg.linux.de/~uae/

View File

@@ -0,0 +1,155 @@
kAFS: AFS FILESYSTEM
====================
ABOUT
=====
This filesystem provides a fairly simple AFS filesystem driver. It is under
development and only provides very basic facilities. It does not yet support
the following AFS features:
(*) Write support.
(*) Communications security.
(*) Local caching.
(*) pioctl() system call.
(*) Automatic mounting of embedded mountpoints.
USAGE
=====
When inserting the driver modules the root cell must be specified along with a
list of volume location server IP addresses:
insmod rxrpc.o
insmod kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91
The first module is a driver for the RxRPC remote operation protocol, and the
second is the actual filesystem driver for the AFS filesystem.
Once the module has been loaded, more modules can be added by the following
procedure:
echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells
Where the parameters to the "add" command are the name of a cell and a list of
volume location servers within that cell.
Filesystems can be mounted anywhere by commands similar to the following:
mount -t afs "%cambridge.redhat.com:root.afs." /afs
mount -t afs "#cambridge.redhat.com:root.cell." /afs/cambridge
mount -t afs "#root.afs." /afs
mount -t afs "#root.cell." /afs/cambridge
NB: When using this on Linux 2.4, the mount command has to be different,
since the filesystem doesn't have access to the device name argument:
mount -t afs none /afs -ovol="#root.afs."
Where the initial character is either a hash or a percent symbol depending on
whether you definitely want a R/W volume (hash) or whether you'd prefer a R/O
volume, but are willing to use a R/W volume instead (percent).
The name of the volume can be suffixes with ".backup" or ".readonly" to
specify connection to only volumes of those types.
The name of the cell is optional, and if not given during a mount, then the
named volume will be looked up in the cell specified during insmod.
Additional cells can be added through /proc (see later section).
MOUNTPOINTS
===========
AFS has a concept of mountpoints. These are specially formatted symbolic links
(of the same form as the "device name" passed to mount). kAFS presents these
to the user as directories that have special properties:
(*) They cannot be listed. Running a program like "ls" on them will incur an
EREMOTE error (Object is remote).
(*) Other objects can't be looked up inside of them. This also incurs an
EREMOTE error.
(*) They can be queried with the readlink() system call, which will return
the name of the mountpoint to which they point. The "readlink" program
will also work.
(*) They can be mounted on (which symbolic links can't).
PROC FILESYSTEM
===============
The rxrpc module creates a number of files in various places in the /proc
filesystem:
(*) Firstly, some information files are made available in a directory called
"/proc/net/rxrpc/". These list the extant transport endpoint, peer,
connection and call records.
(*) Secondly, some control files are made available in a directory called
"/proc/sys/rxrpc/". Currently, all these files can be used for is to
turn on various levels of tracing.
The AFS modules creates a "/proc/fs/afs/" directory and populates it:
(*) A "cells" file that lists cells currently known to the afs module.
(*) A directory per cell that contains files that list volume location
servers, volumes, and active servers known within that cell.
THE CELL DATABASE
=================
The filesystem maintains an internal database of all the cells it knows and
the IP addresses of the volume location servers for those cells. The cell to
which the computer belongs is added to the database when insmod is performed
by the "rootcell=" argument.
Further cells can be added by commands similar to the following:
echo add CELLNAME VLADDR[:VLADDR][:VLADDR]... >/proc/fs/afs/cells
echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells
No other cell database operations are available at this time.
EXAMPLES
========
Here's what I use to test this. Some of the names and IP addresses are local
to my internal DNS. My "root.afs" partition has a mount point within it for
some public volumes volumes.
insmod -S /tmp/rxrpc.o
insmod -S /tmp/kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91
mount -t afs \%root.afs. /afs
mount -t afs \%cambridge.redhat.com:root.cell. /afs/cambridge.redhat.com/
echo add grand.central.org 18.7.14.88:128.2.191.224 > /proc/fs/afs/cells
mount -t afs "#grand.central.org:root.cell." /afs/grand.central.org/
mount -t afs "#grand.central.org:root.archive." /afs/grand.central.org/archive
mount -t afs "#grand.central.org:root.contrib." /afs/grand.central.org/contrib
mount -t afs "#grand.central.org:root.doc." /afs/grand.central.org/doc
mount -t afs "#grand.central.org:root.project." /afs/grand.central.org/project
mount -t afs "#grand.central.org:root.service." /afs/grand.central.org/service
mount -t afs "#grand.central.org:root.software." /afs/grand.central.org/software
mount -t afs "#grand.central.org:root.user." /afs/grand.central.org/user
umount /afs/grand.central.org/user
umount /afs/grand.central.org/software
umount /afs/grand.central.org/service
umount /afs/grand.central.org/project
umount /afs/grand.central.org/doc
umount /afs/grand.central.org/contrib
umount /afs/grand.central.org/archive
umount /afs/grand.central.org
umount /afs/cambridge.redhat.com
umount /afs
rmmod kafs
rmmod rxrpc

View File

@@ -0,0 +1,118 @@
Support is available for filesystems that wish to do automounting support (such
as kAFS which can be found in fs/afs/). This facility includes allowing
in-kernel mounts to be performed and mountpoint degradation to be
requested. The latter can also be requested by userspace.
======================
IN-KERNEL AUTOMOUNTING
======================
A filesystem can now mount another filesystem on one of its directories by the
following procedure:
(1) Give the directory a follow_link() operation.
When the directory is accessed, the follow_link op will be called, and
it will be provided with the location of the mountpoint in the nameidata
structure (vfsmount and dentry).
(2) Have the follow_link() op do the following steps:
(a) Call vfs_kern_mount() to call the appropriate filesystem to set up a
superblock and gain a vfsmount structure representing it.
(b) Copy the nameidata provided as an argument and substitute the dentry
argument into it the copy.
(c) Call do_add_mount() to install the new vfsmount into the namespace's
mountpoint tree, thus making it accessible to userspace. Use the
nameidata set up in (b) as the destination.
If the mountpoint will be automatically expired, then do_add_mount()
should also be given the location of an expiration list (see further
down).
(d) Release the path in the nameidata argument and substitute in the new
vfsmount and its root dentry. The ref counts on these will need
incrementing.
Then from userspace, you can just do something like:
[root@andromeda root]# mount -t afs \#root.afs. /afs
[root@andromeda root]# ls /afs
asd cambridge cambridge.redhat.com grand.central.org
[root@andromeda root]# ls /afs/cambridge
afsdoc
[root@andromeda root]# ls /afs/cambridge/afsdoc/
ChangeLog html LICENSE pdf RELNOTES-1.2.2
And then if you look in the mountpoint catalogue, you'll see something like:
[root@andromeda root]# cat /proc/mounts
...
#root.afs. /afs afs rw 0 0
#root.cell. /afs/cambridge.redhat.com afs rw 0 0
#afsdoc. /afs/cambridge.redhat.com/afsdoc afs rw 0 0
===========================
AUTOMATIC MOUNTPOINT EXPIRY
===========================
Automatic expiration of mountpoints is easy, provided you've mounted the
mountpoint to be expired in the automounting procedure outlined above.
To do expiration, you need to follow these steps:
(3) Create at least one list off which the vfsmounts to be expired can be
hung. Access to this list will be governed by the vfsmount_lock.
(4) In step (2c) above, the call to do_add_mount() should be provided with a
pointer to this list. It will hang the vfsmount off of it if it succeeds.
(5) When you want mountpoints to be expired, call mark_mounts_for_expiry()
with a pointer to this list. This will process the list, marking every
vfsmount thereon for potential expiry on the next call.
If a vfsmount was already flagged for expiry, and if its usage count is 1
(it's only referenced by its parent vfsmount), then it will be deleted
from the namespace and thrown away (effectively unmounted).
It may prove simplest to simply call this at regular intervals, using
some sort of timed event to drive it.
The expiration flag is cleared by calls to mntput. This means that expiration
will only happen on the second expiration request after the last time the
mountpoint was accessed.
If a mountpoint is moved, it gets removed from the expiration list. If a bind
mount is made on an expirable mount, the new vfsmount will not be on the
expiration list and will not expire.
If a namespace is copied, all mountpoints contained therein will be copied,
and the copies of those that are on an expiration list will be added to the
same expiration list.
=======================
USERSPACE DRIVEN EXPIRY
=======================
As an alternative, it is possible for userspace to request expiry of any
mountpoint (though some will be rejected - the current process's idea of the
rootfs for example). It does this by passing the MNT_EXPIRE flag to
umount(). This flag is considered incompatible with MNT_FORCE and MNT_DETACH.
If the mountpoint in question is in referenced by something other than
umount() or its parent mountpoint, an EBUSY error will be returned and the
mountpoint will not be marked for expiration or unmounted.
If the mountpoint was not already marked for expiry at that time, an EAGAIN
error will be given and it won't be unmounted.
Otherwise if it was already marked and it wasn't referenced, unmounting will
take place as usual.
Again, the expiration flag is cleared every time anything other than umount()
looks at a mountpoint.

View File

@@ -0,0 +1,117 @@
BeOS filesystem for Linux
Document last updated: Dec 6, 2001
WARNING
=======
Make sure you understand that this is alpha software. This means that the
implementation is neither complete nor well-tested.
I DISCLAIM ALL RESPONSIBILITY FOR ANY POSSIBLE BAD EFFECTS OF THIS CODE!
LICENSE
=====
This software is covered by the GNU General Public License.
See the file COPYING for the complete text of the license.
Or the GNU website: <http://www.gnu.org/licenses/licenses.html>
AUTHOR
=====
The largest part of the code written by Will Dyson <will_dyson@pobox.com>
He has been working on the code since Aug 13, 2001. See the changelog for
details.
Original Author: Makoto Kato <m_kato@ga2.so-net.ne.jp>
His original code can still be found at:
<http://hp.vector.co.jp/authors/VA008030/bfs/>
Does anyone know of a more current email address for Makoto? He doesn't
respond to the address given above...
Current maintainer: Sergey S. Kostyliov <rathamahata@php4.ru>
WHAT IS THIS DRIVER?
==================
This module implements the native filesystem of BeOS <http://www.be.com/>
for the linux 2.4.1 and later kernels. Currently it is a read-only
implementation.
Which is it, BFS or BEFS?
================
Be, Inc said, "BeOS Filesystem is officially called BFS, not BeFS".
But Unixware Boot Filesystem is called bfs, too. And they are already in
the kernel. Because of this naming conflict, on Linux the BeOS
filesystem is called befs.
HOW TO INSTALL
==============
step 1. Install the BeFS patch into the source code tree of linux.
Apply the patchfile to your kernel source tree.
Assuming that your kernel source is in /foo/bar/linux and the patchfile
is called patch-befs-xxx, you would do the following:
cd /foo/bar/linux
patch -p1 < /path/to/patch-befs-xxx
if the patching step fails (i.e. there are rejected hunks), you can try to
figure it out yourself (it shouldn't be hard), or mail the maintainer
(Will Dyson <will_dyson@pobox.com>) for help.
step 2. Configuration & make kernel
The linux kernel has many compile-time options. Most of them are beyond the
scope of this document. I suggest the Kernel-HOWTO document as a good general
reference on this topic. <http://www.linux.com/howto/Kernel-HOWTO.html>
However, to use the BeFS module, you must enable it at configure time.
cd /foo/bar/linux
make menuconfig (or xconfig)
The BeFS module is not a standard part of the linux kernel, so you must first
enable support for experimental code under the "Code maturity level" menu.
Then, under the "Filesystems" menu will be an option called "BeFS
filesystem (experimental)", or something like that. Enable that option
(it is fine to make it a module).
Save your kernel configuration and then build your kernel.
step 3. Install
See the kernel howto <http://www.linux.com/howto/Kernel-HOWTO.html> for
instructions on this critical step.
USING BFS
=========
To use the BeOS filesystem, use filesystem type 'befs'.
ex)
mount -t befs /dev/fd0 /beos
MOUNT OPTIONS
=============
uid=nnn All files in the partition will be owned by user id nnn.
gid=nnn All files in the partition will be in group nnn.
iocharset=xxx Use xxx as the name of the NLS translation table.
debug The driver will output debugging information to the syslog.
HOW TO GET LASTEST VERSION
==========================
The latest version is currently available at:
<http://befs-driver.sourceforge.net/>
ANY KNOWN BUGS?
===========
As of Jan 20, 2002:
None
SPECIAL THANKS
==============
Dominic Giampalo ... Writing "Practical file system design with Be filesystem"
Hiroyuki Yamada ... Testing LinuxPPC.

View File

@@ -0,0 +1,57 @@
BFS FILESYSTEM FOR LINUX
========================
The BFS filesystem is used by SCO UnixWare OS for the /stand slice, which
usually contains the kernel image and a few other files required for the
boot process.
In order to access /stand partition under Linux you obviously need to
know the partition number and the kernel must support UnixWare disk slices
(CONFIG_UNIXWARE_DISKLABEL config option). However BFS support does not
depend on having UnixWare disklabel support because one can also mount
BFS filesystem via loopback:
# losetup /dev/loop0 stand.img
# mount -t bfs /dev/loop0 /mnt/stand
where stand.img is a file containing the image of BFS filesystem.
When you have finished using it and umounted you need to also deallocate
/dev/loop0 device by:
# losetup -d /dev/loop0
You can simplify mounting by just typing:
# mount -t bfs -o loop stand.img /mnt/stand
this will allocate the first available loopback device (and load loop.o
kernel module if necessary) automatically. If the loopback driver is not
loaded automatically, make sure that your kernel is compiled with kmod
support (CONFIG_KMOD) enabled. Beware that umount will not
deallocate /dev/loopN device if /etc/mtab file on your system is a
symbolic link to /proc/mounts. You will need to do it manually using
"-d" switch of losetup(8). Read losetup(8) manpage for more info.
To create the BFS image under UnixWare you need to find out first which
slice contains it. The command prtvtoc(1M) is your friend:
# prtvtoc /dev/rdsk/c0b0t0d0s0
(assuming your root disk is on target=0, lun=0, bus=0, controller=0). Then you
look for the slice with tag "STAND", which is usually slice 10. With this
information you can use dd(1) to create the BFS image:
# umount /stand
# dd if=/dev/rdsk/c0b0t0d0sa of=stand.img bs=512
Just in case, you can verify that you have done the right thing by checking
the magic number:
# od -Ad -tx4 stand.img | more
The first 4 bytes should be 0x1badface.
If you have any patches, questions or suggestions regarding this BFS
implementation please contact the author:
Tigran Aivazian <tigran@aivazian.fsnet.co.uk>

View File

@@ -0,0 +1,51 @@
This is the client VFS module for the Common Internet File System
(CIFS) protocol which is the successor to the Server Message Block
(SMB) protocol, the native file sharing mechanism for most early
PC operating systems. CIFS is fully supported by current network
file servers such as Windows 2000, Windows 2003 (including
Windows XP) as well by Samba (which provides excellent CIFS
server support for Linux and many other operating systems), so
this network filesystem client can mount to a wide variety of
servers. The smbfs module should be used instead of this cifs module
for mounting to older SMB servers such as OS/2. The smbfs and cifs
modules can coexist and do not conflict. The CIFS VFS filesystem
module is designed to work well with servers that implement the
newer versions (dialects) of the SMB/CIFS protocol such as Samba,
the program written by Andrew Tridgell that turns any Unix host
into a SMB/CIFS file server.
The intent of this module is to provide the most advanced network
file system function for CIFS compliant servers, including better
POSIX compliance, secure per-user session establishment, high
performance safe distributed caching (oplock), optional packet
signing, large files, Unicode support and other internationalization
improvements. Since both Samba server and this filesystem client support
the CIFS Unix extensions, the combination can provide a reasonable
alternative to NFSv4 for fileserving in some Linux to Linux environments,
not just in Linux to Windows environments.
This filesystem has an optional mount utility (mount.cifs) that can
be obtained from the project page and installed in the path in the same
directory with the other mount helpers (such as mount.smbfs).
Mounting using the cifs filesystem without installing the mount helper
requires specifying the server's ip address.
For Linux 2.4:
mount //anything/here /mnt_target -o
user=username,pass=password,unc=//ip_address_of_server/sharename
For Linux 2.5:
mount //ip_address_of_server/sharename /mnt_target -o user=username, pass=password
For more information on the module see the project page at
http://us1.samba.org/samba/Linux_CIFS_client.html
For more information on CIFS see:
http://www.snia.org/tech_activities/CIFS
or the Samba site:
http://www.samba.org

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,434 @@
configfs - Userspace-driven kernel object configuration.
Joel Becker <joel.becker@oracle.com>
Updated: 31 March 2005
Copyright (c) 2005 Oracle Corporation,
Joel Becker <joel.becker@oracle.com>
[What is configfs?]
configfs is a ram-based filesystem that provides the converse of
sysfs's functionality. Where sysfs is a filesystem-based view of
kernel objects, configfs is a filesystem-based manager of kernel
objects, or config_items.
With sysfs, an object is created in kernel (for example, when a device
is discovered) and it is registered with sysfs. Its attributes then
appear in sysfs, allowing userspace to read the attributes via
readdir(3)/read(2). It may allow some attributes to be modified via
write(2). The important point is that the object is created and
destroyed in kernel, the kernel controls the lifecycle of the sysfs
representation, and sysfs is merely a window on all this.
A configfs config_item is created via an explicit userspace operation:
mkdir(2). It is destroyed via rmdir(2). The attributes appear at
mkdir(2) time, and can be read or modified via read(2) and write(2).
As with sysfs, readdir(3) queries the list of items and/or attributes.
symlink(2) can be used to group items together. Unlike sysfs, the
lifetime of the representation is completely driven by userspace. The
kernel modules backing the items must respond to this.
Both sysfs and configfs can and should exist together on the same
system. One is not a replacement for the other.
[Using configfs]
configfs can be compiled as a module or into the kernel. You can access
it by doing
mount -t configfs none /config
The configfs tree will be empty unless client modules are also loaded.
These are modules that register their item types with configfs as
subsystems. Once a client subsystem is loaded, it will appear as a
subdirectory (or more than one) under /config. Like sysfs, the
configfs tree is always there, whether mounted on /config or not.
An item is created via mkdir(2). The item's attributes will also
appear at this time. readdir(3) can determine what the attributes are,
read(2) can query their default values, and write(2) can store new
values. Like sysfs, attributes should be ASCII text files, preferably
with only one value per file. The same efficiency caveats from sysfs
apply. Don't mix more than one attribute in one attribute file.
Like sysfs, configfs expects write(2) to store the entire buffer at
once. When writing to configfs attributes, userspace processes should
first read the entire file, modify the portions they wish to change, and
then write the entire buffer back. Attribute files have a maximum size
of one page (PAGE_SIZE, 4096 on i386).
When an item needs to be destroyed, remove it with rmdir(2). An
item cannot be destroyed if any other item has a link to it (via
symlink(2)). Links can be removed via unlink(2).
[Configuring FakeNBD: an Example]
Imagine there's a Network Block Device (NBD) driver that allows you to
access remote block devices. Call it FakeNBD. FakeNBD uses configfs
for its configuration. Obviously, there will be a nice program that
sysadmins use to configure FakeNBD, but somehow that program has to tell
the driver about it. Here's where configfs comes in.
When the FakeNBD driver is loaded, it registers itself with configfs.
readdir(3) sees this just fine:
# ls /config
fakenbd
A fakenbd connection can be created with mkdir(2). The name is
arbitrary, but likely the tool will make some use of the name. Perhaps
it is a uuid or a disk name:
# mkdir /config/fakenbd/disk1
# ls /config/fakenbd/disk1
target device rw
The target attribute contains the IP address of the server FakeNBD will
connect to. The device attribute is the device on the server.
Predictably, the rw attribute determines whether the connection is
read-only or read-write.
# echo 10.0.0.1 > /config/fakenbd/disk1/target
# echo /dev/sda1 > /config/fakenbd/disk1/device
# echo 1 > /config/fakenbd/disk1/rw
That's it. That's all there is. Now the device is configured, via the
shell no less.
[Coding With configfs]
Every object in configfs is a config_item. A config_item reflects an
object in the subsystem. It has attributes that match values on that
object. configfs handles the filesystem representation of that object
and its attributes, allowing the subsystem to ignore all but the
basic show/store interaction.
Items are created and destroyed inside a config_group. A group is a
collection of items that share the same attributes and operations.
Items are created by mkdir(2) and removed by rmdir(2), but configfs
handles that. The group has a set of operations to perform these tasks
A subsystem is the top level of a client module. During initialization,
the client module registers the subsystem with configfs, the subsystem
appears as a directory at the top of the configfs filesystem. A
subsystem is also a config_group, and can do everything a config_group
can.
[struct config_item]
struct config_item {
char *ci_name;
char ci_namebuf[UOBJ_NAME_LEN];
struct kref ci_kref;
struct list_head ci_entry;
struct config_item *ci_parent;
struct config_group *ci_group;
struct config_item_type *ci_type;
struct dentry *ci_dentry;
};
void config_item_init(struct config_item *);
void config_item_init_type_name(struct config_item *,
const char *name,
struct config_item_type *type);
struct config_item *config_item_get(struct config_item *);
void config_item_put(struct config_item *);
Generally, struct config_item is embedded in a container structure, a
structure that actually represents what the subsystem is doing. The
config_item portion of that structure is how the object interacts with
configfs.
Whether statically defined in a source file or created by a parent
config_group, a config_item must have one of the _init() functions
called on it. This initializes the reference count and sets up the
appropriate fields.
All users of a config_item should have a reference on it via
config_item_get(), and drop the reference when they are done via
config_item_put().
By itself, a config_item cannot do much more than appear in configfs.
Usually a subsystem wants the item to display and/or store attributes,
among other things. For that, it needs a type.
[struct config_item_type]
struct configfs_item_operations {
void (*release)(struct config_item *);
ssize_t (*show_attribute)(struct config_item *,
struct configfs_attribute *,
char *);
ssize_t (*store_attribute)(struct config_item *,
struct configfs_attribute *,
const char *, size_t);
int (*allow_link)(struct config_item *src,
struct config_item *target);
int (*drop_link)(struct config_item *src,
struct config_item *target);
};
struct config_item_type {
struct module *ct_owner;
struct configfs_item_operations *ct_item_ops;
struct configfs_group_operations *ct_group_ops;
struct configfs_attribute **ct_attrs;
};
The most basic function of a config_item_type is to define what
operations can be performed on a config_item. All items that have been
allocated dynamically will need to provide the ct_item_ops->release()
method. This method is called when the config_item's reference count
reaches zero. Items that wish to display an attribute need to provide
the ct_item_ops->show_attribute() method. Similarly, storing a new
attribute value uses the store_attribute() method.
[struct configfs_attribute]
struct configfs_attribute {
char *ca_name;
struct module *ca_owner;
mode_t ca_mode;
};
When a config_item wants an attribute to appear as a file in the item's
configfs directory, it must define a configfs_attribute describing it.
It then adds the attribute to the NULL-terminated array
config_item_type->ct_attrs. When the item appears in configfs, the
attribute file will appear with the configfs_attribute->ca_name
filename. configfs_attribute->ca_mode specifies the file permissions.
If an attribute is readable and the config_item provides a
ct_item_ops->show_attribute() method, that method will be called
whenever userspace asks for a read(2) on the attribute. The converse
will happen for write(2).
[struct config_group]
A config_item cannot live in a vacuum. The only way one can be created
is via mkdir(2) on a config_group. This will trigger creation of a
child item.
struct config_group {
struct config_item cg_item;
struct list_head cg_children;
struct configfs_subsystem *cg_subsys;
struct config_group **default_groups;
};
void config_group_init(struct config_group *group);
void config_group_init_type_name(struct config_group *group,
const char *name,
struct config_item_type *type);
The config_group structure contains a config_item. Properly configuring
that item means that a group can behave as an item in its own right.
However, it can do more: it can create child items or groups. This is
accomplished via the group operations specified on the group's
config_item_type.
struct configfs_group_operations {
struct config_item *(*make_item)(struct config_group *group,
const char *name);
struct config_group *(*make_group)(struct config_group *group,
const char *name);
int (*commit_item)(struct config_item *item);
void (*drop_item)(struct config_group *group,
struct config_item *item);
};
A group creates child items by providing the
ct_group_ops->make_item() method. If provided, this method is called from mkdir(2) in the group's directory. The subsystem allocates a new
config_item (or more likely, its container structure), initializes it,
and returns it to configfs. Configfs will then populate the filesystem
tree to reflect the new item.
If the subsystem wants the child to be a group itself, the subsystem
provides ct_group_ops->make_group(). Everything else behaves the same,
using the group _init() functions on the group.
Finally, when userspace calls rmdir(2) on the item or group,
ct_group_ops->drop_item() is called. As a config_group is also a
config_item, it is not necessary for a separate drop_group() method.
The subsystem must config_item_put() the reference that was initialized
upon item allocation. If a subsystem has no work to do, it may omit
the ct_group_ops->drop_item() method, and configfs will call
config_item_put() on the item on behalf of the subsystem.
IMPORTANT: drop_item() is void, and as such cannot fail. When rmdir(2)
is called, configfs WILL remove the item from the filesystem tree
(assuming that it has no children to keep it busy). The subsystem is
responsible for responding to this. If the subsystem has references to
the item in other threads, the memory is safe. It may take some time
for the item to actually disappear from the subsystem's usage. But it
is gone from configfs.
A config_group cannot be removed while it still has child items. This
is implemented in the configfs rmdir(2) code. ->drop_item() will not be
called, as the item has not been dropped. rmdir(2) will fail, as the
directory is not empty.
[struct configfs_subsystem]
A subsystem must register itself, usually at module_init time. This
tells configfs to make the subsystem appear in the file tree.
struct configfs_subsystem {
struct config_group su_group;
struct semaphore su_sem;
};
int configfs_register_subsystem(struct configfs_subsystem *subsys);
void configfs_unregister_subsystem(struct configfs_subsystem *subsys);
A subsystem consists of a toplevel config_group and a semaphore.
The group is where child config_items are created. For a subsystem,
this group is usually defined statically. Before calling
configfs_register_subsystem(), the subsystem must have initialized the
group via the usual group _init() functions, and it must also have
initialized the semaphore.
When the register call returns, the subsystem is live, and it
will be visible via configfs. At that point, mkdir(2) can be called and
the subsystem must be ready for it.
[An Example]
The best example of these basic concepts is the simple_children
subsystem/group and the simple_child item in configfs_example.c It
shows a trivial object displaying and storing an attribute, and a simple
group creating and destroying these children.
[Hierarchy Navigation and the Subsystem Semaphore]
There is an extra bonus that configfs provides. The config_groups and
config_items are arranged in a hierarchy due to the fact that they
appear in a filesystem. A subsystem is NEVER to touch the filesystem
parts, but the subsystem might be interested in this hierarchy. For
this reason, the hierarchy is mirrored via the config_group->cg_children
and config_item->ci_parent structure members.
A subsystem can navigate the cg_children list and the ci_parent pointer
to see the tree created by the subsystem. This can race with configfs'
management of the hierarchy, so configfs uses the subsystem semaphore to
protect modifications. Whenever a subsystem wants to navigate the
hierarchy, it must do so under the protection of the subsystem
semaphore.
A subsystem will be prevented from acquiring the semaphore while a newly
allocated item has not been linked into this hierarchy. Similarly, it
will not be able to acquire the semaphore while a dropping item has not
yet been unlinked. This means that an item's ci_parent pointer will
never be NULL while the item is in configfs, and that an item will only
be in its parent's cg_children list for the same duration. This allows
a subsystem to trust ci_parent and cg_children while they hold the
semaphore.
[Item Aggregation Via symlink(2)]
configfs provides a simple group via the group->item parent/child
relationship. Often, however, a larger environment requires aggregation
outside of the parent/child connection. This is implemented via
symlink(2).
A config_item may provide the ct_item_ops->allow_link() and
ct_item_ops->drop_link() methods. If the ->allow_link() method exists,
symlink(2) may be called with the config_item as the source of the link.
These links are only allowed between configfs config_items. Any
symlink(2) attempt outside the configfs filesystem will be denied.
When symlink(2) is called, the source config_item's ->allow_link()
method is called with itself and a target item. If the source item
allows linking to target item, it returns 0. A source item may wish to
reject a link if it only wants links to a certain type of object (say,
in its own subsystem).
When unlink(2) is called on the symbolic link, the source item is
notified via the ->drop_link() method. Like the ->drop_item() method,
this is a void function and cannot return failure. The subsystem is
responsible for responding to the change.
A config_item cannot be removed while it links to any other item, nor
can it be removed while an item links to it. Dangling symlinks are not
allowed in configfs.
[Automatically Created Subgroups]
A new config_group may want to have two types of child config_items.
While this could be codified by magic names in ->make_item(), it is much
more explicit to have a method whereby userspace sees this divergence.
Rather than have a group where some items behave differently than
others, configfs provides a method whereby one or many subgroups are
automatically created inside the parent at its creation. Thus,
mkdir("parent) results in "parent", "parent/subgroup1", up through
"parent/subgroupN". Items of type 1 can now be created in
"parent/subgroup1", and items of type N can be created in
"parent/subgroupN".
These automatic subgroups, or default groups, do not preclude other
children of the parent group. If ct_group_ops->make_group() exists,
other child groups can be created on the parent group directly.
A configfs subsystem specifies default groups by filling in the
NULL-terminated array default_groups on the config_group structure.
Each group in that array is populated in the configfs tree at the same
time as the parent group. Similarly, they are removed at the same time
as the parent. No extra notification is provided. When a ->drop_item()
method call notifies the subsystem the parent group is going away, it
also means every default group child associated with that parent group.
As a consequence of this, default_groups cannot be removed directly via
rmdir(2). They also are not considered when rmdir(2) on the parent
group is checking for children.
[Committable Items]
NOTE: Committable items are currently unimplemented.
Some config_items cannot have a valid initial state. That is, no
default values can be specified for the item's attributes such that the
item can do its work. Userspace must configure one or more attributes,
after which the subsystem can start whatever entity this item
represents.
Consider the FakeNBD device from above. Without a target address *and*
a target device, the subsystem has no idea what block device to import.
The simple example assumes that the subsystem merely waits until all the
appropriate attributes are configured, and then connects. This will,
indeed, work, but now every attribute store must check if the attributes
are initialized. Every attribute store must fire off the connection if
that condition is met.
Far better would be an explicit action notifying the subsystem that the
config_item is ready to go. More importantly, an explicit action allows
the subsystem to provide feedback as to whether the attributes are
initialized in a way that makes sense. configfs provides this as
committable items.
configfs still uses only normal filesystem operations. An item is
committed via rename(2). The item is moved from a directory where it
can be modified to a directory where it cannot.
Any group that provides the ct_group_ops->commit_item() method has
committable items. When this group appears in configfs, mkdir(2) will
not work directly in the group. Instead, the group will have two
subdirectories: "live" and "pending". The "live" directory does not
support mkdir(2) or rmdir(2) either. It only allows rename(2). The
"pending" directory does allow mkdir(2) and rmdir(2). An item is
created in the "pending" directory. Its attributes can be modified at
will. Userspace commits the item by renaming it into the "live"
directory. At this point, the subsystem receives the ->commit_item()
callback. If all required attributes are filled to satisfaction, the
method returns zero and the item is moved to the "live" directory.
As rmdir(2) does not work in the "live" directory, an item must be
shutdown, or "uncommitted". Again, this is done via rename(2), this
time from the "live" directory back to the "pending" one. The subsystem
is notified by the ct_group_ops->uncommit_object() method.

View File

@@ -0,0 +1,487 @@
/*
* vim: noexpandtab ts=8 sts=0 sw=8:
*
* configfs_example.c - This file is a demonstration module containing
* a number of configfs subsystems.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public
* License as published by the Free Software Foundation; either
* version 2 of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*
* You should have received a copy of the GNU General Public
* License along with this program; if not, write to the
* Free Software Foundation, Inc., 59 Temple Place - Suite 330,
* Boston, MA 021110-1307, USA.
*
* Based on sysfs:
* sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel
*
* configfs Copyright (C) 2005 Oracle. All rights reserved.
*/
#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/configfs.h>
/*
* 01-childless
*
* This first example is a childless subsystem. It cannot create
* any config_items. It just has attributes.
*
* Note that we are enclosing the configfs_subsystem inside a container.
* This is not necessary if a subsystem has no attributes directly
* on the subsystem. See the next example, 02-simple-children, for
* such a subsystem.
*/
struct childless {
struct configfs_subsystem subsys;
int showme;
int storeme;
};
struct childless_attribute {
struct configfs_attribute attr;
ssize_t (*show)(struct childless *, char *);
ssize_t (*store)(struct childless *, const char *, size_t);
};
static inline struct childless *to_childless(struct config_item *item)
{
return item ? container_of(to_configfs_subsystem(to_config_group(item)), struct childless, subsys) : NULL;
}
static ssize_t childless_showme_read(struct childless *childless,
char *page)
{
ssize_t pos;
pos = sprintf(page, "%d\n", childless->showme);
childless->showme++;
return pos;
}
static ssize_t childless_storeme_read(struct childless *childless,
char *page)
{
return sprintf(page, "%d\n", childless->storeme);
}
static ssize_t childless_storeme_write(struct childless *childless,
const char *page,
size_t count)
{
unsigned long tmp;
char *p = (char *) page;
tmp = simple_strtoul(p, &p, 10);
if (!p || (*p && (*p != '\n')))
return -EINVAL;
if (tmp > INT_MAX)
return -ERANGE;
childless->storeme = tmp;
return count;
}
static ssize_t childless_description_read(struct childless *childless,
char *page)
{
return sprintf(page,
"[01-childless]\n"
"\n"
"The childless subsystem is the simplest possible subsystem in\n"
"configfs. It does not support the creation of child config_items.\n"
"It only has a few attributes. In fact, it isn't much different\n"
"than a directory in /proc.\n");
}
static struct childless_attribute childless_attr_showme = {
.attr = { .ca_owner = THIS_MODULE, .ca_name = "showme", .ca_mode = S_IRUGO },
.show = childless_showme_read,
};
static struct childless_attribute childless_attr_storeme = {
.attr = { .ca_owner = THIS_MODULE, .ca_name = "storeme", .ca_mode = S_IRUGO | S_IWUSR },
.show = childless_storeme_read,
.store = childless_storeme_write,
};
static struct childless_attribute childless_attr_description = {
.attr = { .ca_owner = THIS_MODULE, .ca_name = "description", .ca_mode = S_IRUGO },
.show = childless_description_read,
};
static struct configfs_attribute *childless_attrs[] = {
&childless_attr_showme.attr,
&childless_attr_storeme.attr,
&childless_attr_description.attr,
NULL,
};
static ssize_t childless_attr_show(struct config_item *item,
struct configfs_attribute *attr,
char *page)
{
struct childless *childless = to_childless(item);
struct childless_attribute *childless_attr =
container_of(attr, struct childless_attribute, attr);
ssize_t ret = 0;
if (childless_attr->show)
ret = childless_attr->show(childless, page);
return ret;
}
static ssize_t childless_attr_store(struct config_item *item,
struct configfs_attribute *attr,
const char *page, size_t count)
{
struct childless *childless = to_childless(item);
struct childless_attribute *childless_attr =
container_of(attr, struct childless_attribute, attr);
ssize_t ret = -EINVAL;
if (childless_attr->store)
ret = childless_attr->store(childless, page, count);
return ret;
}
static struct configfs_item_operations childless_item_ops = {
.show_attribute = childless_attr_show,
.store_attribute = childless_attr_store,
};
static struct config_item_type childless_type = {
.ct_item_ops = &childless_item_ops,
.ct_attrs = childless_attrs,
.ct_owner = THIS_MODULE,
};
static struct childless childless_subsys = {
.subsys = {
.su_group = {
.cg_item = {
.ci_namebuf = "01-childless",
.ci_type = &childless_type,
},
},
},
};
/* ----------------------------------------------------------------- */
/*
* 02-simple-children
*
* This example merely has a simple one-attribute child. Note that
* there is no extra attribute structure, as the child's attribute is
* known from the get-go. Also, there is no container for the
* subsystem, as it has no attributes of its own.
*/
struct simple_child {
struct config_item item;
int storeme;
};
static inline struct simple_child *to_simple_child(struct config_item *item)
{
return item ? container_of(item, struct simple_child, item) : NULL;
}
static struct configfs_attribute simple_child_attr_storeme = {
.ca_owner = THIS_MODULE,
.ca_name = "storeme",
.ca_mode = S_IRUGO | S_IWUSR,
};
static struct configfs_attribute *simple_child_attrs[] = {
&simple_child_attr_storeme,
NULL,
};
static ssize_t simple_child_attr_show(struct config_item *item,
struct configfs_attribute *attr,
char *page)
{
ssize_t count;
struct simple_child *simple_child = to_simple_child(item);
count = sprintf(page, "%d\n", simple_child->storeme);
return count;
}
static ssize_t simple_child_attr_store(struct config_item *item,
struct configfs_attribute *attr,
const char *page, size_t count)
{
struct simple_child *simple_child = to_simple_child(item);
unsigned long tmp;
char *p = (char *) page;
tmp = simple_strtoul(p, &p, 10);
if (!p || (*p && (*p != '\n')))
return -EINVAL;
if (tmp > INT_MAX)
return -ERANGE;
simple_child->storeme = tmp;
return count;
}
static void simple_child_release(struct config_item *item)
{
kfree(to_simple_child(item));
}
static struct configfs_item_operations simple_child_item_ops = {
.release = simple_child_release,
.show_attribute = simple_child_attr_show,
.store_attribute = simple_child_attr_store,
};
static struct config_item_type simple_child_type = {
.ct_item_ops = &simple_child_item_ops,
.ct_attrs = simple_child_attrs,
.ct_owner = THIS_MODULE,
};
struct simple_children {
struct config_group group;
};
static inline struct simple_children *to_simple_children(struct config_item *item)
{
return item ? container_of(to_config_group(item), struct simple_children, group) : NULL;
}
static struct config_item *simple_children_make_item(struct config_group *group, const char *name)
{
struct simple_child *simple_child;
simple_child = kmalloc(sizeof(struct simple_child), GFP_KERNEL);
if (!simple_child)
return NULL;
memset(simple_child, 0, sizeof(struct simple_child));
config_item_init_type_name(&simple_child->item, name,
&simple_child_type);
simple_child->storeme = 0;
return &simple_child->item;
}
static struct configfs_attribute simple_children_attr_description = {
.ca_owner = THIS_MODULE,
.ca_name = "description",
.ca_mode = S_IRUGO,
};
static struct configfs_attribute *simple_children_attrs[] = {
&simple_children_attr_description,
NULL,
};
static ssize_t simple_children_attr_show(struct config_item *item,
struct configfs_attribute *attr,
char *page)
{
return sprintf(page,
"[02-simple-children]\n"
"\n"
"This subsystem allows the creation of child config_items. These\n"
"items have only one attribute that is readable and writeable.\n");
}
static void simple_children_release(struct config_item *item)
{
kfree(to_simple_children(item));
}
static struct configfs_item_operations simple_children_item_ops = {
.release = simple_children_release,
.show_attribute = simple_children_attr_show,
};
/*
* Note that, since no extra work is required on ->drop_item(),
* no ->drop_item() is provided.
*/
static struct configfs_group_operations simple_children_group_ops = {
.make_item = simple_children_make_item,
};
static struct config_item_type simple_children_type = {
.ct_item_ops = &simple_children_item_ops,
.ct_group_ops = &simple_children_group_ops,
.ct_attrs = simple_children_attrs,
.ct_owner = THIS_MODULE,
};
static struct configfs_subsystem simple_children_subsys = {
.su_group = {
.cg_item = {
.ci_namebuf = "02-simple-children",
.ci_type = &simple_children_type,
},
},
};
/* ----------------------------------------------------------------- */
/*
* 03-group-children
*
* This example reuses the simple_children group from above. However,
* the simple_children group is not the subsystem itself, it is a
* child of the subsystem. Creation of a group in the subsystem creates
* a new simple_children group. That group can then have simple_child
* children of its own.
*/
static struct config_group *group_children_make_group(struct config_group *group, const char *name)
{
struct simple_children *simple_children;
simple_children = kmalloc(sizeof(struct simple_children),
GFP_KERNEL);
if (!simple_children)
return NULL;
memset(simple_children, 0, sizeof(struct simple_children));
config_group_init_type_name(&simple_children->group, name,
&simple_children_type);
return &simple_children->group;
}
static struct configfs_attribute group_children_attr_description = {
.ca_owner = THIS_MODULE,
.ca_name = "description",
.ca_mode = S_IRUGO,
};
static struct configfs_attribute *group_children_attrs[] = {
&group_children_attr_description,
NULL,
};
static ssize_t group_children_attr_show(struct config_item *item,
struct configfs_attribute *attr,
char *page)
{
return sprintf(page,
"[03-group-children]\n"
"\n"
"This subsystem allows the creation of child config_groups. These\n"
"groups are like the subsystem simple-children.\n");
}
static struct configfs_item_operations group_children_item_ops = {
.show_attribute = group_children_attr_show,
};
/*
* Note that, since no extra work is required on ->drop_item(),
* no ->drop_item() is provided.
*/
static struct configfs_group_operations group_children_group_ops = {
.make_group = group_children_make_group,
};
static struct config_item_type group_children_type = {
.ct_item_ops = &group_children_item_ops,
.ct_group_ops = &group_children_group_ops,
.ct_attrs = group_children_attrs,
.ct_owner = THIS_MODULE,
};
static struct configfs_subsystem group_children_subsys = {
.su_group = {
.cg_item = {
.ci_namebuf = "03-group-children",
.ci_type = &group_children_type,
},
},
};
/* ----------------------------------------------------------------- */
/*
* We're now done with our subsystem definitions.
* For convenience in this module, here's a list of them all. It
* allows the init function to easily register them. Most modules
* will only have one subsystem, and will only call register_subsystem
* on it directly.
*/
static struct configfs_subsystem *example_subsys[] = {
&childless_subsys.subsys,
&simple_children_subsys,
&group_children_subsys,
NULL,
};
static int __init configfs_example_init(void)
{
int ret;
int i;
struct configfs_subsystem *subsys;
for (i = 0; example_subsys[i]; i++) {
subsys = example_subsys[i];
config_group_init(&subsys->su_group);
init_MUTEX(&subsys->su_sem);
ret = configfs_register_subsystem(subsys);
if (ret) {
printk(KERN_ERR "Error %d while registering subsystem %s\n",
ret,
subsys->su_group.cg_item.ci_namebuf);
goto out_unregister;
}
}
return 0;
out_unregister:
for (; i >= 0; i--) {
configfs_unregister_subsystem(example_subsys[i]);
}
return ret;
}
static void __exit configfs_example_exit(void)
{
int i;
for (i = 0; example_subsys[i]; i++) {
configfs_unregister_subsystem(example_subsys[i]);
}
}
module_init(configfs_example_init);
module_exit(configfs_example_exit);
MODULE_LICENSE("GPL");

View File

@@ -0,0 +1,76 @@
Cramfs - cram a filesystem onto a small ROM
cramfs is designed to be simple and small, and to compress things well.
It uses the zlib routines to compress a file one page at a time, and
allows random page access. The meta-data is not compressed, but is
expressed in a very terse representation to make it use much less
diskspace than traditional filesystems.
You can't write to a cramfs filesystem (making it compressible and
compact also makes it _very_ hard to update on-the-fly), so you have to
create the disk image with the "mkcramfs" utility.
Usage Notes
-----------
File sizes are limited to less than 16MB.
Maximum filesystem size is a little over 256MB. (The last file on the
filesystem is allowed to extend past 256MB.)
Only the low 8 bits of gid are stored. The current version of
mkcramfs simply truncates to 8 bits, which is a potential security
issue.
Hard links are supported, but hard linked files
will still have a link count of 1 in the cramfs image.
Cramfs directories have no `.' or `..' entries. Directories (like
every other file on cramfs) always have a link count of 1. (There's
no need to use -noleaf in `find', btw.)
No timestamps are stored in a cramfs, so these default to the epoch
(1970 GMT). Recently-accessed files may have updated timestamps, but
the update lasts only as long as the inode is cached in memory, after
which the timestamp reverts to 1970, i.e. moves backwards in time.
Currently, cramfs must be written and read with architectures of the
same endianness, and can be read only by kernels with PAGE_CACHE_SIZE
== 4096. At least the latter of these is a bug, but it hasn't been
decided what the best fix is. For the moment if you have larger pages
you can just change the #define in mkcramfs.c, so long as you don't
mind the filesystem becoming unreadable to future kernels.
For /usr/share/magic
--------------------
0 ulelong 0x28cd3d45 Linux cramfs offset 0
>4 ulelong x size %d
>8 ulelong x flags 0x%x
>12 ulelong x future 0x%x
>16 string >\0 signature "%.16s"
>32 ulelong x fsid.crc 0x%x
>36 ulelong x fsid.edition %d
>40 ulelong x fsid.blocks %d
>44 ulelong x fsid.files %d
>48 string >\0 name "%.16s"
512 ulelong 0x28cd3d45 Linux cramfs offset 512
>516 ulelong x size %d
>520 ulelong x flags 0x%x
>524 ulelong x future 0x%x
>528 string >\0 signature "%.16s"
>544 ulelong x fsid.crc 0x%x
>548 ulelong x fsid.edition %d
>552 ulelong x fsid.blocks %d
>556 ulelong x fsid.files %d
>560 string >\0 name "%.16s"
Hacker Notes
------------
See fs/cramfs/README for filesystem layout and implementation notes.

View File

@@ -0,0 +1,173 @@
RCU-based dcache locking model
==============================
On many workloads, the most common operation on dcache is to look up a
dentry, given a parent dentry and the name of the child. Typically,
for every open(), stat() etc., the dentry corresponding to the
pathname will be looked up by walking the tree starting with the first
component of the pathname and using that dentry along with the next
component to look up the next level and so on. Since it is a frequent
operation for workloads like multiuser environments and web servers,
it is important to optimize this path.
Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus in
every component during path look-up. Since 2.5.10 onwards, fast-walk
algorithm changed this by holding the dcache_lock at the beginning and
walking as many cached path component dentries as possible. This
significantly decreases the number of acquisition of
dcache_lock. However it also increases the lock hold time
significantly and affects performance in large SMP machines. Since
2.5.62 kernel, dcache has been using a new locking model that uses RCU
to make dcache look-up lock-free.
The current dcache locking model is not very different from the
existing dcache locking model. Prior to 2.5.62 kernel, dcache_lock
protected the hash chain, d_child, d_alias, d_lru lists as well as
d_inode and several other things like mount look-up. RCU-based changes
affect only the way the hash chain is protected. For everything else
the dcache_lock must be taken for both traversing as well as
updating. The hash chain updates too take the dcache_lock. The
significant change is the way d_lookup traverses the hash chain, it
doesn't acquire the dcache_lock for this and rely on RCU to ensure
that the dentry has not been *freed*.
Dcache locking details
======================
For many multi-user workloads, open() and stat() on files are very
frequently occurring operations. Both involve walking of path names to
find the dentry corresponding to the concerned file. In 2.4 kernel,
dcache_lock was held during look-up of each path component. Contention
and cache-line bouncing of this global lock caused significant
scalability problems. With the introduction of RCU in Linux kernel,
this was worked around by making the look-up of path components during
path walking lock-free.
Safe lock-free look-up of dcache hash table
===========================================
Dcache is a complex data structure with the hash table entries also
linked together in other lists. In 2.4 kernel, dcache_lock protected
all the lists. We applied RCU only on hash chain walking. The rest of
the lists are still protected by dcache_lock. Some of the important
changes are :
1. The deletion from hash chain is done using hlist_del_rcu() macro
which doesn't initialize next pointer of the deleted dentry and
this allows us to walk safely lock-free while a deletion is
happening.
2. Insertion of a dentry into the hash table is done using
hlist_add_head_rcu() which take care of ordering the writes - the
writes to the dentry must be visible before the dentry is
inserted. This works in conjunction with hlist_for_each_rcu() while
walking the hash chain. The only requirement is that all
initialization to the dentry must be done before
hlist_add_head_rcu() since we don't have dcache_lock protection
while traversing the hash chain. This isn't different from the
existing code.
3. The dentry looked up without holding dcache_lock by cannot be
returned for walking if it is unhashed. It then may have a NULL
d_inode or other bogosity since RCU doesn't protect the other
fields in the dentry. We therefore use a flag DCACHE_UNHASHED to
indicate unhashed dentries and use this in conjunction with a
per-dentry lock (d_lock). Once looked up without the dcache_lock,
we acquire the per-dentry lock (d_lock) and check if the dentry is
unhashed. If so, the look-up is failed. If not, the reference count
of the dentry is increased and the dentry is returned.
4. Once a dentry is looked up, it must be ensured during the path walk
for that component it doesn't go away. In pre-2.5.10 code, this was
done holding a reference to the dentry. dcache_rcu does the same.
In some sense, dcache_rcu path walking looks like the pre-2.5.10
version.
5. All dentry hash chain updates must take the dcache_lock as well as
the per-dentry lock in that order. dput() does this to ensure that
a dentry that has just been looked up in another CPU doesn't get
deleted before dget() can be done on it.
6. There are several ways to do reference counting of RCU protected
objects. One such example is in ipv4 route cache where deferred
freeing (using call_rcu()) is done as soon as the reference count
goes to zero. This cannot be done in the case of dentries because
tearing down of dentries require blocking (dentry_iput()) which
isn't supported from RCU callbacks. Instead, tearing down of
dentries happen synchronously in dput(), but actual freeing happens
later when RCU grace period is over. This allows safe lock-free
walking of the hash chains, but a matched dentry may have been
partially torn down. The checking of DCACHE_UNHASHED flag with
d_lock held detects such dentries and prevents them from being
returned from look-up.
Maintaining POSIX rename semantics
==================================
Since look-up of dentries is lock-free, it can race against a
concurrent rename operation. For example, during rename of file A to
B, look-up of either A or B must succeed. So, if look-up of B happens
after A has been removed from the hash chain but not added to the new
hash chain, it may fail. Also, a comparison while the name is being
written concurrently by a rename may result in false positive matches
violating rename semantics. Issues related to race with rename are
handled as described below :
1. Look-up can be done in two ways - d_lookup() which is safe from
simultaneous renames and __d_lookup() which is not. If
__d_lookup() fails, it must be followed up by a d_lookup() to
correctly determine whether a dentry is in the hash table or
not. d_lookup() protects look-ups using a sequence lock
(rename_lock).
2. The name associated with a dentry (d_name) may be changed if a
rename is allowed to happen simultaneously. To avoid memcmp() in
__d_lookup() go out of bounds due to a rename and false positive
comparison, the name comparison is done while holding the
per-dentry lock. This prevents concurrent renames during this
operation.
3. Hash table walking during look-up may move to a different bucket as
the current dentry is moved to a different bucket due to rename.
But we use hlists in dcache hash table and they are
null-terminated. So, even if a dentry moves to a different bucket,
hash chain walk will terminate. [with a list_head list, it may not
since termination is when the list_head in the original bucket is
reached]. Since we redo the d_parent check and compare name while
holding d_lock, lock-free look-up will not race against d_move().
4. There can be a theoretical race when a dentry keeps coming back to
original bucket due to double moves. Due to this look-up may
consider that it has never moved and can end up in a infinite loop.
But this is not any worse that theoretical livelocks we already
have in the kernel.
Important guidelines for filesystem developers related to dcache_rcu
====================================================================
1. Existing dcache interfaces (pre-2.5.62) exported to filesystem
don't change. Only dcache internal implementation changes. However
filesystems *must not* delete from the dentry hash chains directly
using the list macros like allowed earlier. They must use dcache
APIs like d_drop() or __d_drop() depending on the situation.
2. d_flags is now protected by a per-dentry lock (d_lock). All access
to d_flags must be protected by it.
3. For a hashed dentry, checking of d_count needs to be protected by
d_lock.
Papers and other documentation on dcache locking
================================================
1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124).
2. http://lse.sourceforge.net/locking/dcache/dcache.html

View File

@@ -0,0 +1,113 @@
Locking scheme used for directory operations is based on two
kinds of locks - per-inode (->i_sem) and per-filesystem (->s_vfs_rename_sem).
For our purposes all operations fall in 5 classes:
1) read access. Locking rules: caller locks directory we are accessing.
2) object creation. Locking rules: same as above.
3) object removal. Locking rules: caller locks parent, finds victim,
locks victim and calls the method.
4) rename() that is _not_ cross-directory. Locking rules: caller locks
the parent, finds source and target, if target already exists - locks it
and then calls the method.
5) link creation. Locking rules:
* lock parent
* check that source is not a directory
* lock source
* call the method.
6) cross-directory rename. The trickiest in the whole bunch. Locking
rules:
* lock the filesystem
* lock parents in "ancestors first" order.
* find source and target.
* if old parent is equal to or is a descendent of target
fail with -ENOTEMPTY
* if new parent is equal to or is a descendent of source
fail with -ELOOP
* if target exists - lock it.
* call the method.
The rules above obviously guarantee that all directories that are going to be
read, modified or removed by method will be locked by caller.
If no directory is its own ancestor, the scheme above is deadlock-free.
Proof:
First of all, at any moment we have a partial ordering of the
objects - A < B iff A is an ancestor of B.
That ordering can change. However, the following is true:
(1) if object removal or non-cross-directory rename holds lock on A and
attempts to acquire lock on B, A will remain the parent of B until we
acquire the lock on B. (Proof: only cross-directory rename can change
the parent of object and it would have to lock the parent).
(2) if cross-directory rename holds the lock on filesystem, order will not
change until rename acquires all locks. (Proof: other cross-directory
renames will be blocked on filesystem lock and we don't start changing
the order until we had acquired all locks).
(3) any operation holds at most one lock on non-directory object and
that lock is acquired after all other locks. (Proof: see descriptions
of operations).
Now consider the minimal deadlock. Each process is blocked on
attempt to acquire some lock and already holds at least one lock. Let's
consider the set of contended locks. First of all, filesystem lock is
not contended, since any process blocked on it is not holding any locks.
Thus all processes are blocked on ->i_sem.
Non-directory objects are not contended due to (3). Thus link
creation can't be a part of deadlock - it can't be blocked on source
and it means that it doesn't hold any locks.
Any contended object is either held by cross-directory rename or
has a child that is also contended. Indeed, suppose that it is held by
operation other than cross-directory rename. Then the lock this operation
is blocked on belongs to child of that object due to (1).
It means that one of the operations is cross-directory rename.
Otherwise the set of contended objects would be infinite - each of them
would have a contended child and we had assumed that no object is its
own descendent. Moreover, there is exactly one cross-directory rename
(see above).
Consider the object blocking the cross-directory rename. One
of its descendents is locked by cross-directory rename (otherwise we
would again have an infinite set of contended objects). But that
means that cross-directory rename is taking locks out of order. Due
to (2) the order hadn't changed since we had acquired filesystem lock.
But locking rules for cross-directory rename guarantee that we do not
try to acquire lock on descendent before the lock on ancestor.
Contradiction. I.e. deadlock is impossible. Q.E.D.
These operations are guaranteed to avoid loop creation. Indeed,
the only operation that could introduce loops is cross-directory rename.
Since the only new (parent, child) pair added by rename() is (new parent,
source), such loop would have to contain these objects and the rest of it
would have to exist before rename(). I.e. at the moment of loop creation
rename() responsible for that would be holding filesystem lock and new parent
would have to be equal to or a descendent of source. But that means that
new parent had been equal to or a descendent of source since the moment when
we had acquired filesystem lock and rename() would fail with -ELOOP in that
case.
While this locking scheme works for arbitrary DAGs, it relies on
ability to check that directory is a descendent of another object. Current
implementation assumes that directory graph is a tree. This assumption is
also preserved by all operations (cross-directory rename on a tree that would
not introduce a cycle will leave it a tree and link() fails for directories).
Notice that "directory" in the above == "anything that might have
children", so if we are going to introduce hybrid objects we will need
either to make sure that link(2) doesn't work for them or to make changes
in is_subdir() that would make it work even in presence of such beasts.

View File

@@ -0,0 +1,130 @@
dlmfs
==================
A minimal DLM userspace interface implemented via a virtual file
system.
dlmfs is built with OCFS2 as it requires most of its infrastructure.
Project web page: http://oss.oracle.com/projects/ocfs2
Tools web page: http://oss.oracle.com/projects/ocfs2-tools
OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/
All code copyright 2005 Oracle except when otherwise noted.
CREDITS
=======
Some code taken from ramfs which is Copyright (C) 2000 Linus Torvalds
and Transmeta Corp.
Mark Fasheh <mark.fasheh@oracle.com>
Caveats
=======
- Right now it only works with the OCFS2 DLM, though support for other
DLM implementations should not be a major issue.
Mount options
=============
None
Usage
=====
If you're just interested in OCFS2, then please see ocfs2.txt. The
rest of this document will be geared towards those who want to use
dlmfs for easy to setup and easy to use clustered locking in
userspace.
Setup
=====
dlmfs requires that the OCFS2 cluster infrastructure be in
place. Please download ocfs2-tools from the above url and configure a
cluster.
You'll want to start heartbeating on a volume which all the nodes in
your lockspace can access. The easiest way to do this is via
ocfs2_hb_ctl (distributed with ocfs2-tools). Right now it requires
that an OCFS2 file system be in place so that it can automatically
find it's heartbeat area, though it will eventually support heartbeat
against raw disks.
Please see the ocfs2_hb_ctl and mkfs.ocfs2 manual pages distributed
with ocfs2-tools.
Once you're heartbeating, DLM lock 'domains' can be easily created /
destroyed and locks within them accessed.
Locking
=======
Users may access dlmfs via standard file system calls, or they can use
'libo2dlm' (distributed with ocfs2-tools) which abstracts the file
system calls and presents a more traditional locking api.
dlmfs handles lock caching automatically for the user, so a lock
request for an already acquired lock will not generate another DLM
call. Userspace programs are assumed to handle their own local
locking.
Two levels of locks are supported - Shared Read, and Exclusive.
Also supported is a Trylock operation.
For information on the libo2dlm interface, please see o2dlm.h,
distributed with ocfs2-tools.
Lock value blocks can be read and written to a resource via read(2)
and write(2) against the fd obtained via your open(2) call. The
maximum currently supported LVB length is 64 bytes (though that is an
OCFS2 DLM limitation). Through this mechanism, users of dlmfs can share
small amounts of data amongst their nodes.
mkdir(2) signals dlmfs to join a domain (which will have the same name
as the resulting directory)
rmdir(2) signals dlmfs to leave the domain
Locks for a given domain are represented by regular inodes inside the
domain directory. Locking against them is done via the open(2) system
call.
The open(2) call will not return until your lock has been granted or
an error has occurred, unless it has been instructed to do a trylock
operation. If the lock succeeds, you'll get an fd.
open(2) with O_CREAT to ensure the resource inode is created - dlmfs does
not automatically create inodes for existing lock resources.
Open Flag Lock Request Type
--------- -----------------
O_RDONLY Shared Read
O_RDWR Exclusive
Open Flag Resulting Locking Behavior
--------- --------------------------
O_NONBLOCK Trylock operation
You must provide exactly one of O_RDONLY or O_RDWR.
If O_NONBLOCK is also provided and the trylock operation was valid but
could not lock the resource then open(2) will return ETXTBUSY.
close(2) drops the lock associated with your fd.
Modes passed to mkdir(2) or open(2) are adhered to locally. Chown is
supported locally as well. This means you can use them to restrict
access to the resources via dlmfs on your local node only.
The resource LVB may be read from the fd in either Shared Read or
Exclusive modes via the read(2) system call. It can be written via
write(2) only when open in Exclusive mode.
Once written, an LVB will be visible to other nodes who obtain Read
Only or higher level locks on the resource.
See Also
========
http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf
For more information on the VMS distributed locking API.

View File

@@ -0,0 +1,382 @@
The Second Extended Filesystem
==============================
ext2 was originally released in January 1993. Written by R\'emy Card,
Theodore Ts'o and Stephen Tweedie, it was a major rewrite of the
Extended Filesystem. It is currently still (April 2001) the predominant
filesystem in use by Linux. There are also implementations available
for NetBSD, FreeBSD, the GNU HURD, Windows 95/98/NT, OS/2 and RISC OS.
Options
=======
Most defaults are determined by the filesystem superblock, and can be
set using tune2fs(8). Kernel-determined defaults are indicated by (*).
bsddf (*) Makes `df' act like BSD.
minixdf Makes `df' act like Minix.
check=none, nocheck (*) Don't do extra checking of bitmaps on mount
(check=normal and check=strict options removed)
debug Extra debugging information is sent to the
kernel syslog. Useful for developers.
errors=continue Keep going on a filesystem error.
errors=remount-ro Remount the filesystem read-only on an error.
errors=panic Panic and halt the machine if an error occurs.
grpid, bsdgroups Give objects the same group ID as their parent.
nogrpid, sysvgroups New objects have the group ID of their creator.
nouid32 Use 16-bit UIDs and GIDs.
oldalloc Enable the old block allocator. Orlov should
have better performance, we'd like to get some
feedback if it's the contrary for you.
orlov (*) Use the Orlov block allocator.
(See http://lwn.net/Articles/14633/ and
http://lwn.net/Articles/14446/.)
resuid=n The user ID which may use the reserved blocks.
resgid=n The group ID which may use the reserved blocks.
sb=n Use alternate superblock at this location.
user_xattr Enable "user." POSIX Extended Attributes
(requires CONFIG_EXT2_FS_XATTR).
See also http://acl.bestbits.at
nouser_xattr Don't support "user." extended attributes.
acl Enable POSIX Access Control Lists support
(requires CONFIG_EXT2_FS_POSIX_ACL).
See also http://acl.bestbits.at
noacl Don't support POSIX ACLs.
nobh Do not attach buffer_heads to file pagecache.
xip Use execute in place (no caching) if possible
grpquota,noquota,quota,usrquota Quota options are silently ignored by ext2.
Specification
=============
ext2 shares many properties with traditional Unix filesystems. It has
the concepts of blocks, inodes and directories. It has space in the
specification for Access Control Lists (ACLs), fragments, undeletion and
compression though these are not yet implemented (some are available as
separate patches). There is also a versioning mechanism to allow new
features (such as journalling) to be added in a maximally compatible
manner.
Blocks
------
The space in the device or file is split up into blocks. These are
a fixed size, of 1024, 2048 or 4096 bytes (8192 bytes on Alpha systems),
which is decided when the filesystem is created. Smaller blocks mean
less wasted space per file, but require slightly more accounting overhead,
and also impose other limits on the size of files and the filesystem.
Block Groups
------------
Blocks are clustered into block groups in order to reduce fragmentation
and minimise the amount of head seeking when reading a large amount
of consecutive data. Information about each block group is kept in a
descriptor table stored in the block(s) immediately after the superblock.
Two blocks near the start of each group are reserved for the block usage
bitmap and the inode usage bitmap which show which blocks and inodes
are in use. Since each bitmap is limited to a single block, this means
that the maximum size of a block group is 8 times the size of a block.
The block(s) following the bitmaps in each block group are designated
as the inode table for that block group and the remainder are the data
blocks. The block allocation algorithm attempts to allocate data blocks
in the same block group as the inode which contains them.
The Superblock
--------------
The superblock contains all the information about the configuration of
the filing system. The primary copy of the superblock is stored at an
offset of 1024 bytes from the start of the device, and it is essential
to mounting the filesystem. Since it is so important, backup copies of
the superblock are stored in block groups throughout the filesystem.
The first version of ext2 (revision 0) stores a copy at the start of
every block group, along with backups of the group descriptor block(s).
Because this can consume a considerable amount of space for large
filesystems, later revisions can optionally reduce the number of backup
copies by only putting backups in specific groups (this is the sparse
superblock feature). The groups chosen are 0, 1 and powers of 3, 5 and 7.
The information in the superblock contains fields such as the total
number of inodes and blocks in the filesystem and how many are free,
how many inodes and blocks are in each block group, when the filesystem
was mounted (and if it was cleanly unmounted), when it was modified,
what version of the filesystem it is (see the Revisions section below)
and which OS created it.
If the filesystem is revision 1 or higher, then there are extra fields,
such as a volume name, a unique identification number, the inode size,
and space for optional filesystem features to store configuration info.
All fields in the superblock (as in all other ext2 structures) are stored
on the disc in little endian format, so a filesystem is portable between
machines without having to know what machine it was created on.
Inodes
------
The inode (index node) is a fundamental concept in the ext2 filesystem.
Each object in the filesystem is represented by an inode. The inode
structure contains pointers to the filesystem blocks which contain the
data held in the object and all of the metadata about an object except
its name. The metadata about an object includes the permissions, owner,
group, flags, size, number of blocks used, access time, change time,
modification time, deletion time, number of links, fragments, version
(for NFS) and extended attributes (EAs) and/or Access Control Lists (ACLs).
There are some reserved fields which are currently unused in the inode
structure and several which are overloaded. One field is reserved for the
directory ACL if the inode is a directory and alternately for the top 32
bits of the file size if the inode is a regular file (allowing file sizes
larger than 2GB). The translator field is unused under Linux, but is used
by the HURD to reference the inode of a program which will be used to
interpret this object. Most of the remaining reserved fields have been
used up for both Linux and the HURD for larger owner and group fields,
The HURD also has a larger mode field so it uses another of the remaining
fields to store the extra more bits.
There are pointers to the first 12 blocks which contain the file's data
in the inode. There is a pointer to an indirect block (which contains
pointers to the next set of blocks), a pointer to a doubly-indirect
block (which contains pointers to indirect blocks) and a pointer to a
trebly-indirect block (which contains pointers to doubly-indirect blocks).
The flags field contains some ext2-specific flags which aren't catered
for by the standard chmod flags. These flags can be listed with lsattr
and changed with the chattr command, and allow specific filesystem
behaviour on a per-file basis. There are flags for secure deletion,
undeletable, compression, synchronous updates, immutability, append-only,
dumpable, no-atime, indexed directories, and data-journaling. Not all
of these are supported yet.
Directories
-----------
A directory is a filesystem object and has an inode just like a file.
It is a specially formatted file containing records which associate
each name with an inode number. Later revisions of the filesystem also
encode the type of the object (file, directory, symlink, device, fifo,
socket) to avoid the need to check the inode itself for this information
(support for taking advantage of this feature does not yet exist in
Glibc 2.2).
The inode allocation code tries to assign inodes which are in the same
block group as the directory in which they are first created.
The current implementation of ext2 uses a singly-linked list to store
the filenames in the directory; a pending enhancement uses hashing of the
filenames to allow lookup without the need to scan the entire directory.
The current implementation never removes empty directory blocks once they
have been allocated to hold more files.
Special files
-------------
Symbolic links are also filesystem objects with inodes. They deserve
special mention because the data for them is stored within the inode
itself if the symlink is less than 60 bytes long. It uses the fields
which would normally be used to store the pointers to data blocks.
This is a worthwhile optimisation as it we avoid allocating a full
block for the symlink, and most symlinks are less than 60 characters long.
Character and block special devices never have data blocks assigned to
them. Instead, their device number is stored in the inode, again reusing
the fields which would be used to point to the data blocks.
Reserved Space
--------------
In ext2, there is a mechanism for reserving a certain number of blocks
for a particular user (normally the super-user). This is intended to
allow for the system to continue functioning even if non-privileged users
fill up all the space available to them (this is independent of filesystem
quotas). It also keeps the filesystem from filling up entirely which
helps combat fragmentation.
Filesystem check
----------------
At boot time, most systems run a consistency check (e2fsck) on their
filesystems. The superblock of the ext2 filesystem contains several
fields which indicate whether fsck should actually run (since checking
the filesystem at boot can take a long time if it is large). fsck will
run if the filesystem was not cleanly unmounted, if the maximum mount
count has been exceeded or if the maximum time between checks has been
exceeded.
Feature Compatibility
---------------------
The compatibility feature mechanism used in ext2 is sophisticated.
It safely allows features to be added to the filesystem, without
unnecessarily sacrificing compatibility with older versions of the
filesystem code. The feature compatibility mechanism is not supported by
the original revision 0 (EXT2_GOOD_OLD_REV) of ext2, but was introduced in
revision 1. There are three 32-bit fields, one for compatible features
(COMPAT), one for read-only compatible (RO_COMPAT) features and one for
incompatible (INCOMPAT) features.
These feature flags have specific meanings for the kernel as follows:
A COMPAT flag indicates that a feature is present in the filesystem,
but the on-disk format is 100% compatible with older on-disk formats, so
a kernel which didn't know anything about this feature could read/write
the filesystem without any chance of corrupting the filesystem (or even
making it inconsistent). This is essentially just a flag which says
"this filesystem has a (hidden) feature" that the kernel or e2fsck may
want to be aware of (more on e2fsck and feature flags later). The ext3
HAS_JOURNAL feature is a COMPAT flag because the ext3 journal is simply
a regular file with data blocks in it so the kernel does not need to
take any special notice of it if it doesn't understand ext3 journaling.
An RO_COMPAT flag indicates that the on-disk format is 100% compatible
with older on-disk formats for reading (i.e. the feature does not change
the visible on-disk format). However, an old kernel writing to such a
filesystem would/could corrupt the filesystem, so this is prevented. The
most common such feature, SPARSE_SUPER, is an RO_COMPAT feature because
sparse groups allow file data blocks where superblock/group descriptor
backups used to live, and ext2_free_blocks() refuses to free these blocks,
which would leading to inconsistent bitmaps. An old kernel would also
get an error if it tried to free a series of blocks which crossed a group
boundary, but this is a legitimate layout in a SPARSE_SUPER filesystem.
An INCOMPAT flag indicates the on-disk format has changed in some
way that makes it unreadable by older kernels, or would otherwise
cause a problem if an old kernel tried to mount it. FILETYPE is an
INCOMPAT flag because older kernels would think a filename was longer
than 256 characters, which would lead to corrupt directory listings.
The COMPRESSION flag is an obvious INCOMPAT flag - if the kernel
doesn't understand compression, you would just get garbage back from
read() instead of it automatically decompressing your data. The ext3
RECOVER flag is needed to prevent a kernel which does not understand the
ext3 journal from mounting the filesystem without replaying the journal.
For e2fsck, it needs to be more strict with the handling of these
flags than the kernel. If it doesn't understand ANY of the COMPAT,
RO_COMPAT, or INCOMPAT flags it will refuse to check the filesystem,
because it has no way of verifying whether a given feature is valid
or not. Allowing e2fsck to succeed on a filesystem with an unknown
feature is a false sense of security for the user. Refusing to check
a filesystem with unknown features is a good incentive for the user to
update to the latest e2fsck. This also means that anyone adding feature
flags to ext2 also needs to update e2fsck to verify these features.
Metadata
--------
It is frequently claimed that the ext2 implementation of writing
asynchronous metadata is faster than the ffs synchronous metadata
scheme but less reliable. Both methods are equally resolvable by their
respective fsck programs.
If you're exceptionally paranoid, there are 3 ways of making metadata
writes synchronous on ext2:
per-file if you have the program source: use the O_SYNC flag to open()
per-file if you don't have the source: use "chattr +S" on the file
per-filesystem: add the "sync" option to mount (or in /etc/fstab)
the first and last are not ext2 specific but do force the metadata to
be written synchronously. See also Journaling below.
Limitations
-----------
There are various limits imposed by the on-disk layout of ext2. Other
limits are imposed by the current implementation of the kernel code.
Many of the limits are determined at the time the filesystem is first
created, and depend upon the block size chosen. The ratio of inodes to
data blocks is fixed at filesystem creation time, so the only way to
increase the number of inodes is to increase the size of the filesystem.
No tools currently exist which can change the ratio of inodes to blocks.
Most of these limits could be overcome with slight changes in the on-disk
format and using a compatibility flag to signal the format change (at
the expense of some compatibility).
Filesystem block size: 1kB 2kB 4kB 8kB
File size limit: 16GB 256GB 2048GB 2048GB
Filesystem size limit: 2047GB 8192GB 16384GB 32768GB
There is a 2.4 kernel limit of 2048GB for a single block device, so no
filesystem larger than that can be created at this time. There is also
an upper limit on the block size imposed by the page size of the kernel,
so 8kB blocks are only allowed on Alpha systems (and other architectures
which support larger pages).
There is an upper limit of 32768 subdirectories in a single directory.
There is a "soft" upper limit of about 10-15k files in a single directory
with the current linear linked-list directory implementation. This limit
stems from performance problems when creating and deleting (and also
finding) files in such large directories. Using a hashed directory index
(under development) allows 100k-1M+ files in a single directory without
performance problems (although RAM size becomes an issue at this point).
The (meaningless) absolute upper limit of files in a single directory
(imposed by the file size, the realistic limit is obviously much less)
is over 130 trillion files. It would be higher except there are not
enough 4-character names to make up unique directory entries, so they
have to be 8 character filenames, even then we are fairly close to
running out of unique filenames.
Journaling
----------
A journaling extension to the ext2 code has been developed by Stephen
Tweedie. It avoids the risks of metadata corruption and the need to
wait for e2fsck to complete after a crash, without requiring a change
to the on-disk ext2 layout. In a nutshell, the journal is a regular
file which stores whole metadata (and optionally data) blocks that have
been modified, prior to writing them into the filesystem. This means
it is possible to add a journal to an existing ext2 filesystem without
the need for data conversion.
When changes to the filesystem (e.g. a file is renamed) they are stored in
a transaction in the journal and can either be complete or incomplete at
the time of a crash. If a transaction is complete at the time of a crash
(or in the normal case where the system does not crash), then any blocks
in that transaction are guaranteed to represent a valid filesystem state,
and are copied into the filesystem. If a transaction is incomplete at
the time of the crash, then there is no guarantee of consistency for
the blocks in that transaction so they are discarded (which means any
filesystem changes they represent are also lost).
Check Documentation/filesystems/ext3.txt if you want to read more about
ext3 and journaling.
References
==========
The kernel source file:/usr/src/linux/fs/ext2/
e2fsprogs (e2fsck) http://e2fsprogs.sourceforge.net/
Design & Implementation http://e2fsprogs.sourceforge.net/ext2intro.html
Journaling (ext3) ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/
Filesystem Resizing http://ext2resize.sourceforge.net/
Compression (*) http://e2compr.sourceforge.net/
Implementations for:
Windows 95/98/NT/2000 http://uranus.it.swin.edu.au/~jn/linux/Explore2fs.htm
Windows 95 (*) http://www.yipton.demon.co.uk/content.html#FSDEXT2
DOS client (*) ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/
OS/2 http://perso.wanadoo.fr/matthieu.willm/ext2-os2/
RISC OS client ftp://ftp.barnet.ac.uk/pub/acorn/armlinux/iscafs/
(*) no longer actively developed/supported (as of Apr 2001)

View File

@@ -0,0 +1,198 @@
Ext3 Filesystem
===============
Ext3 was originally released in September 1999. Written by Stephen Tweedie
for the 2.2 branch, and ported to 2.4 kernels by Peter Braam, Andreas Dilger,
Andrew Morton, Alexander Viro, Ted Ts'o and Stephen Tweedie.
Ext3 is the ext2 filesystem enhanced with journalling capabilities.
Options
=======
When mounting an ext3 filesystem, the following option are accepted:
(*) == default
journal=update Update the ext3 file system's journal to the current
format.
journal=inum When a journal already exists, this option is ignored.
Otherwise, it specifies the number of the inode which
will represent the ext3 file system's journal file.
journal_dev=devnum When the external journal device's major/minor numbers
have changed, this option allows the user to specify
the new journal location. The journal device is
identified through its new major/minor numbers encoded
in devnum.
noload Don't load the journal on mounting.
data=journal All data are committed into the journal prior to being
written into the main file system.
data=ordered (*) All data are forced directly out to the main file
system prior to its metadata being committed to the
journal.
data=writeback Data ordering is not preserved, data may be written
into the main file system after its metadata has been
committed to the journal.
commit=nrsec (*) Ext3 can be told to sync all its data and metadata
every 'nrsec' seconds. The default value is 5 seconds.
This means that if you lose your power, you will lose
as much as the latest 5 seconds of work (your
filesystem will not be damaged though, thanks to the
journaling). This default value (or any low value)
will hurt performance, but it's good for data-safety.
Setting it to 0 will have the same effect as leaving
it at the default (5 seconds).
Setting it to very large values will improve
performance.
barrier=1 This enables/disables barriers. barrier=0 disables
it, barrier=1 enables it.
orlov (*) This enables the new Orlov block allocator. It is
enabled by default.
oldalloc This disables the Orlov block allocator and enables
the old block allocator. Orlov should have better
performance - we'd like to get some feedback if it's
the contrary for you.
user_xattr Enables Extended User Attributes. Additionally, you
need to have extended attribute support enabled in the
kernel configuration (CONFIG_EXT3_FS_XATTR). See the
attr(5) manual page and http://acl.bestbits.at/ to
learn more about extended attributes.
nouser_xattr Disables Extended User Attributes.
acl Enables POSIX Access Control Lists support.
Additionally, you need to have ACL support enabled in
the kernel configuration (CONFIG_EXT3_FS_POSIX_ACL).
See the acl(5) manual page and http://acl.bestbits.at/
for more information.
noacl This option disables POSIX Access Control List
support.
reservation
noreservation
bsddf (*) Make 'df' act like BSD.
minixdf Make 'df' act like Minix.
check=none Don't do extra checking of bitmaps on mount.
nocheck
debug Extra debugging information is sent to syslog.
errors=remount-ro(*) Remount the filesystem read-only on an error.
errors=continue Keep going on a filesystem error.
errors=panic Panic and halt the machine if an error occurs.
grpid Give objects the same group ID as their creator.
bsdgroups
nogrpid (*) New objects have the group ID of their creator.
sysvgroups
resgid=n The group ID which may use the reserved blocks.
resuid=n The user ID which may use the reserved blocks.
sb=n Use alternate superblock at this location.
quota
noquota
grpquota
usrquota
bh (*) ext3 associates buffer heads to data pages to
nobh (a) cache disk block mapping information
(b) link pages into transaction to provide
ordering guarantees.
"bh" option forces use of buffer heads.
"nobh" option tries to avoid associating buffer
heads (supported only for "writeback" mode).
Specification
=============
Ext3 shares all disk implementation with the ext2 filesystem, and adds
transactions capabilities to ext2. Journaling is done by the Journaling Block
Device layer.
Journaling Block Device layer
-----------------------------
The Journaling Block Device layer (JBD) isn't ext3 specific. It was design to
add journaling capabilities on a block device. The ext3 filesystem code will
inform the JBD of modifications it is performing (called a transaction). The
journal supports the transactions start and stop, and in case of crash, the
journal can replayed the transactions to put the partition back in a
consistent state fast.
Handles represent a single atomic update to a filesystem. JBD can handle an
external journal on a block device.
Data Mode
---------
There are 3 different data modes:
* writeback mode
In data=writeback mode, ext3 does not journal data at all. This mode provides
a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
mode - metadata journaling. A crash+recovery can cause incorrect data to
appear in files which were written shortly before the crash. This mode will
typically provide the best ext3 performance.
* ordered mode
In data=ordered mode, ext3 only officially journals metadata, but it logically
groups metadata and data blocks into a single unit called a transaction. When
it's time to write the new metadata out to disk, the associated data blocks
are written first. In general, this mode performs slightly slower than
writeback but significantly faster than journal mode.
* journal mode
data=journal mode provides full data and metadata journaling. All new data is
written to the journal first, and then to its final location.
In the event of a crash, the journal can be replayed, bringing both data and
metadata into a consistent state. This mode is the slowest except when data
needs to be read from and written to disk at the same time where it
outperforms all others modes.
Compatibility
-------------
Ext2 partitions can be easily convert to ext3, with `tune2fs -j <dev>`.
Ext3 is fully compatible with Ext2. Ext3 partitions can easily be mounted as
Ext2.
External Tools
==============
See manual pages to learn more.
tune2fs: create a ext3 journal on a ext2 partition with the -j flag.
mke2fs: create a ext3 partition with the -j flag.
debugfs: ext2 and ext3 file system debugger.
ext2online: online (mounted) ext2 and ext3 filesystem resizer
References
==========
kernel source: <file:fs/ext3/>
<file:fs/jbd/>
programs: http://e2fsprogs.sourceforge.net/
http://ext2resize.sourceforge.net
useful links: http://www.zip.com.au/~akpm/linux/ext3/ext3-usage.html
http://www-106.ibm.com/developerworks/linux/library/l-fs7/
http://www-106.ibm.com/developerworks/linux/library/l-fs8/

View File

@@ -0,0 +1,236 @@
Ext4 Filesystem
===============
This is a development version of the ext4 filesystem, an advanced level
of the ext3 filesystem which incorporates scalability and reliability
enhancements for supporting large filesystems (64 bit) in keeping with
increasing disk capacities and state-of-the-art feature requirements.
Mailing list: linux-ext4@vger.kernel.org
1. Quick usage instructions:
===========================
- Grab updated e2fsprogs from
ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim/
This is a patchset on top of e2fsprogs-1.39, which can be found at
ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/
- It's still mke2fs -j /dev/hda1
- mount /dev/hda1 /wherever -t ext4dev
- To enable extents,
mount /dev/hda1 /wherever -t ext4dev -o extents
- The filesystem is compatible with the ext3 driver until you add a file
which has extents (ie: `mount -o extents', then create a file).
NOTE: The "extents" mount flag is temporary. It will soon go away and
extents will be enabled by the "-o extents" flag to mke2fs or tune2fs
- When comparing performance with other filesystems, remember that
ext3/4 by default offers higher data integrity guarantees than most. So
when comparing with a metadata-only journalling filesystem, use `mount -o
data=writeback'. And you might as well use `mount -o nobh' too along
with it. Making the journal larger than the mke2fs default often helps
performance with metadata-intensive workloads.
2. Features
===========
2.1 Currently available
* ability to use filesystems > 16TB
* extent format reduces metadata overhead (RAM, IO for access, transactions)
* extent format more robust in face of on-disk corruption due to magics,
* internal redunancy in tree
2.1 Previously available, soon to be enabled by default by "mkefs.ext4":
* dir_index and resize inode will be on by default
* large inodes will be used by default for fast EAs, nsec timestamps, etc
2.2 Candidate features for future inclusion
There are several under discussion, whether they all make it in is
partly a function of how much time everyone has to work on them:
* improved file allocation (multi-block alloc, delayed alloc; basically done)
* fix 32000 subdirectory limit (patch exists, needs some e2fsck work)
* nsec timestamps for mtime, atime, ctime, create time (patch exists,
needs some e2fsck work)
* inode version field on disk (NFSv4, Lustre; prototype exists)
* reduced mke2fs/e2fsck time via uninitialized groups (prototype exists)
* journal checksumming for robustness, performance (prototype exists)
* persistent file preallocation (e.g for streaming media, databases)
Features like metadata checksumming have been discussed and planned for
a bit but no patches exist yet so I'm not sure they're in the near-term
roadmap.
The big performance win will come with mballoc and delalloc. CFS has
been using mballoc for a few years already with Lustre, and IBM + Bull
did a lot of benchmarking on it. The reason it isn't in the first set of
patches is partly a manageability issue, and partly because it doesn't
directly affect the on-disk format (outside of much better allocation)
so it isn't critical to get into the first round of changes. I believe
Alex is working on a new set of patches right now.
3. Options
==========
When mounting an ext4 filesystem, the following option are accepted:
(*) == default
extents ext4 will use extents to address file data. The
file system will no longer be mountable by ext3.
journal=update Update the ext4 file system's journal to the current
format.
journal=inum When a journal already exists, this option is ignored.
Otherwise, it specifies the number of the inode which
will represent the ext4 file system's journal file.
journal_dev=devnum When the external journal device's major/minor numbers
have changed, this option allows the user to specify
the new journal location. The journal device is
identified through its new major/minor numbers encoded
in devnum.
noload Don't load the journal on mounting.
data=journal All data are committed into the journal prior to being
written into the main file system.
data=ordered (*) All data are forced directly out to the main file
system prior to its metadata being committed to the
journal.
data=writeback Data ordering is not preserved, data may be written
into the main file system after its metadata has been
committed to the journal.
commit=nrsec (*) Ext4 can be told to sync all its data and metadata
every 'nrsec' seconds. The default value is 5 seconds.
This means that if you lose your power, you will lose
as much as the latest 5 seconds of work (your
filesystem will not be damaged though, thanks to the
journaling). This default value (or any low value)
will hurt performance, but it's good for data-safety.
Setting it to 0 will have the same effect as leaving
it at the default (5 seconds).
Setting it to very large values will improve
performance.
barrier=1 This enables/disables barriers. barrier=0 disables
it, barrier=1 enables it.
orlov (*) This enables the new Orlov block allocator. It is
enabled by default.
oldalloc This disables the Orlov block allocator and enables
the old block allocator. Orlov should have better
performance - we'd like to get some feedback if it's
the contrary for you.
user_xattr Enables Extended User Attributes. Additionally, you
need to have extended attribute support enabled in the
kernel configuration (CONFIG_EXT4_FS_XATTR). See the
attr(5) manual page and http://acl.bestbits.at/ to
learn more about extended attributes.
nouser_xattr Disables Extended User Attributes.
acl Enables POSIX Access Control Lists support.
Additionally, you need to have ACL support enabled in
the kernel configuration (CONFIG_EXT4_FS_POSIX_ACL).
See the acl(5) manual page and http://acl.bestbits.at/
for more information.
noacl This option disables POSIX Access Control List
support.
reservation
noreservation
bsddf (*) Make 'df' act like BSD.
minixdf Make 'df' act like Minix.
check=none Don't do extra checking of bitmaps on mount.
nocheck
debug Extra debugging information is sent to syslog.
errors=remount-ro(*) Remount the filesystem read-only on an error.
errors=continue Keep going on a filesystem error.
errors=panic Panic and halt the machine if an error occurs.
grpid Give objects the same group ID as their creator.
bsdgroups
nogrpid (*) New objects have the group ID of their creator.
sysvgroups
resgid=n The group ID which may use the reserved blocks.
resuid=n The user ID which may use the reserved blocks.
sb=n Use alternate superblock at this location.
quota
noquota
grpquota
usrquota
bh (*) ext4 associates buffer heads to data pages to
nobh (a) cache disk block mapping information
(b) link pages into transaction to provide
ordering guarantees.
"bh" option forces use of buffer heads.
"nobh" option tries to avoid associating buffer
heads (supported only for "writeback" mode).
Data Mode
---------
There are 3 different data modes:
* writeback mode
In data=writeback mode, ext4 does not journal data at all. This mode provides
a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
mode - metadata journaling. A crash+recovery can cause incorrect data to
appear in files which were written shortly before the crash. This mode will
typically provide the best ext4 performance.
* ordered mode
In data=ordered mode, ext4 only officially journals metadata, but it logically
groups metadata and data blocks into a single unit called a transaction. When
it's time to write the new metadata out to disk, the associated data blocks
are written first. In general, this mode performs slightly slower than
writeback but significantly faster than journal mode.
* journal mode
data=journal mode provides full data and metadata journaling. All new data is
written to the journal first, and then to its final location.
In the event of a crash, the journal can be replayed, bringing both data and
metadata into a consistent state. This mode is the slowest except when data
needs to be read from and written to disk at the same time where it
outperforms all others modes.
References
==========
kernel source: <file:fs/ext4/>
<file:fs/jbd2/>
programs: http://e2fsprogs.sourceforge.net/
http://ext2resize.sourceforge.net
useful links: http://fedoraproject.org/wiki/ext3-devel
http://www.bullopensource.org/ext4/

View File

@@ -0,0 +1,123 @@
File management in the Linux kernel
-----------------------------------
This document describes how locking for files (struct file)
and file descriptor table (struct files) works.
Up until 2.6.12, the file descriptor table has been protected
with a lock (files->file_lock) and reference count (files->count).
->file_lock protected accesses to all the file related fields
of the table. ->count was used for sharing the file descriptor
table between tasks cloned with CLONE_FILES flag. Typically
this would be the case for posix threads. As with the common
refcounting model in the kernel, the last task doing
a put_files_struct() frees the file descriptor (fd) table.
The files (struct file) themselves are protected using
reference count (->f_count).
In the new lock-free model of file descriptor management,
the reference counting is similar, but the locking is
based on RCU. The file descriptor table contains multiple
elements - the fd sets (open_fds and close_on_exec, the
array of file pointers, the sizes of the sets and the array
etc.). In order for the updates to appear atomic to
a lock-free reader, all the elements of the file descriptor
table are in a separate structure - struct fdtable.
files_struct contains a pointer to struct fdtable through
which the actual fd table is accessed. Initially the
fdtable is embedded in files_struct itself. On a subsequent
expansion of fdtable, a new fdtable structure is allocated
and files->fdtab points to the new structure. The fdtable
structure is freed with RCU and lock-free readers either
see the old fdtable or the new fdtable making the update
appear atomic. Here are the locking rules for
the fdtable structure -
1. All references to the fdtable must be done through
the files_fdtable() macro :
struct fdtable *fdt;
rcu_read_lock();
fdt = files_fdtable(files);
....
if (n <= fdt->max_fds)
....
...
rcu_read_unlock();
files_fdtable() uses rcu_dereference() macro which takes care of
the memory barrier requirements for lock-free dereference.
The fdtable pointer must be read within the read-side
critical section.
2. Reading of the fdtable as described above must be protected
by rcu_read_lock()/rcu_read_unlock().
3. For any update to the fd table, files->file_lock must
be held.
4. To look up the file structure given an fd, a reader
must use either fcheck() or fcheck_files() APIs. These
take care of barrier requirements due to lock-free lookup.
An example :
struct file *file;
rcu_read_lock();
file = fcheck(fd);
if (file) {
...
}
....
rcu_read_unlock();
5. Handling of the file structures is special. Since the look-up
of the fd (fget()/fget_light()) are lock-free, it is possible
that look-up may race with the last put() operation on the
file structure. This is avoided using the rcuref APIs
on ->f_count :
rcu_read_lock();
file = fcheck_files(files, fd);
if (file) {
if (rcuref_inc_lf(&file->f_count))
*fput_needed = 1;
else
/* Didn't get the reference, someone's freed */
file = NULL;
}
rcu_read_unlock();
....
return file;
rcuref_inc_lf() detects if refcounts is already zero or
goes to zero during increment. If it does, we fail
fget()/fget_light().
6. Since both fdtable and file structures can be looked up
lock-free, they must be installed using rcu_assign_pointer()
API. If they are looked up lock-free, rcu_dereference()
must be used. However it is advisable to use files_fdtable()
and fcheck()/fcheck_files() which take care of these issues.
7. While updating, the fdtable pointer must be looked up while
holding files->file_lock. If ->file_lock is dropped, then
another thread expand the files thereby creating a new
fdtable and making the earlier fdtable pointer stale.
For example :
spin_lock(&files->file_lock);
fd = locate_fd(files, file, start);
if (fd >= 0) {
/* locate_fd() may have expanded fdtable, load the ptr */
fdt = files_fdtable(files);
FD_SET(fd, fdt->open_fds);
FD_CLR(fd, fdt->close_on_exec);
spin_unlock(&files->file_lock);
.....
Since locate_fd() can drop ->file_lock (and reacquire ->file_lock),
the fdtable pointer (fdt) must be loaded after locate_fd().

View File

@@ -0,0 +1,423 @@
Definitions
~~~~~~~~~~~
Userspace filesystem:
A filesystem in which data and metadata are provided by an ordinary
userspace process. The filesystem can be accessed normally through
the kernel interface.
Filesystem daemon:
The process(es) providing the data and metadata of the filesystem.
Non-privileged mount (or user mount):
A userspace filesystem mounted by a non-privileged (non-root) user.
The filesystem daemon is running with the privileges of the mounting
user. NOTE: this is not the same as mounts allowed with the "user"
option in /etc/fstab, which is not discussed here.
Filesystem connection:
A connection between the filesystem daemon and the kernel. The
connection exists until either the daemon dies, or the filesystem is
umounted. Note that detaching (or lazy umounting) the filesystem
does _not_ break the connection, in this case it will exist until
the last reference to the filesystem is released.
Mount owner:
The user who does the mounting.
User:
The user who is performing filesystem operations.
What is FUSE?
~~~~~~~~~~~~~
FUSE is a userspace filesystem framework. It consists of a kernel
module (fuse.ko), a userspace library (libfuse.*) and a mount utility
(fusermount).
One of the most important features of FUSE is allowing secure,
non-privileged mounts. This opens up new possibilities for the use of
filesystems. A good example is sshfs: a secure network filesystem
using the sftp protocol.
The userspace library and utilities are available from the FUSE
homepage:
http://fuse.sourceforge.net/
Filesystem type
~~~~~~~~~~~~~~~
The filesystem type given to mount(2) can be one of the following:
'fuse'
This is the usual way to mount a FUSE filesystem. The first
argument of the mount system call may contain an arbitrary string,
which is not interpreted by the kernel.
'fuseblk'
The filesystem is block device based. The first argument of the
mount system call is interpreted as the name of the device.
Mount options
~~~~~~~~~~~~~
'fd=N'
The file descriptor to use for communication between the userspace
filesystem and the kernel. The file descriptor must have been
obtained by opening the FUSE device ('/dev/fuse').
'rootmode=M'
The file mode of the filesystem's root in octal representation.
'user_id=N'
The numeric user id of the mount owner.
'group_id=N'
The numeric group id of the mount owner.
'default_permissions'
By default FUSE doesn't check file access permissions, the
filesystem is free to implement it's access policy or leave it to
the underlying file access mechanism (e.g. in case of network
filesystems). This option enables permission checking, restricting
access based on file mode. It is usually useful together with the
'allow_other' mount option.
'allow_other'
This option overrides the security measure restricting file access
to the user mounting the filesystem. This option is by default only
allowed to root, but this restriction can be removed with a
(userspace) configuration option.
'max_read=N'
With this option the maximum size of read operations can be set.
The default is infinite. Note that the size of read requests is
limited anyway to 32 pages (which is 128kbyte on i386).
'blksize=N'
Set the block size for the filesystem. The default is 512. This
option is only valid for 'fuseblk' type mounts.
Control filesystem
~~~~~~~~~~~~~~~~~~
There's a control filesystem for FUSE, which can be mounted by:
mount -t fusectl none /sys/fs/fuse/connections
Mounting it under the '/sys/fs/fuse/connections' directory makes it
backwards compatible with earlier versions.
Under the fuse control filesystem each connection has a directory
named by a unique number.
For each connection the following files exist within this directory:
'waiting'
The number of requests which are waiting to be transferred to
userspace or being processed by the filesystem daemon. If there is
no filesystem activity and 'waiting' is non-zero, then the
filesystem is hung or deadlocked.
'abort'
Writing anything into this file will abort the filesystem
connection. This means that all waiting requests will be aborted an
error returned for all aborted and new requests.
Only the owner of the mount may read or write these files.
Interrupting filesystem operations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If a process issuing a FUSE filesystem request is interrupted, the
following will happen:
1) If the request is not yet sent to userspace AND the signal is
fatal (SIGKILL or unhandled fatal signal), then the request is
dequeued and returns immediately.
2) If the request is not yet sent to userspace AND the signal is not
fatal, then an 'interrupted' flag is set for the request. When
the request has been successfully transferred to userspace and
this flag is set, an INTERRUPT request is queued.
3) If the request is already sent to userspace, then an INTERRUPT
request is queued.
INTERRUPT requests take precedence over other requests, so the
userspace filesystem will receive queued INTERRUPTs before any others.
The userspace filesystem may ignore the INTERRUPT requests entirely,
or may honor them by sending a reply to the _original_ request, with
the error set to EINTR.
It is also possible that there's a race between processing the
original request and it's INTERRUPT request. There are two possibilities:
1) The INTERRUPT request is processed before the original request is
processed
2) The INTERRUPT request is processed after the original request has
been answered
If the filesystem cannot find the original request, it should wait for
some timeout and/or a number of new requests to arrive, after which it
should reply to the INTERRUPT request with an EAGAIN error. In case
1) the INTERRUPT request will be requeued. In case 2) the INTERRUPT
reply will be ignored.
Aborting a filesystem connection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It is possible to get into certain situations where the filesystem is
not responding. Reasons for this may be:
a) Broken userspace filesystem implementation
b) Network connection down
c) Accidental deadlock
d) Malicious deadlock
(For more on c) and d) see later sections)
In either of these cases it may be useful to abort the connection to
the filesystem. There are several ways to do this:
- Kill the filesystem daemon. Works in case of a) and b)
- Kill the filesystem daemon and all users of the filesystem. Works
in all cases except some malicious deadlocks
- Use forced umount (umount -f). Works in all cases but only if
filesystem is still attached (it hasn't been lazy unmounted)
- Abort filesystem through the FUSE control filesystem. Most
powerful method, always works.
How do non-privileged mounts work?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Since the mount() system call is a privileged operation, a helper
program (fusermount) is needed, which is installed setuid root.
The implication of providing non-privileged mounts is that the mount
owner must not be able to use this capability to compromise the
system. Obvious requirements arising from this are:
A) mount owner should not be able to get elevated privileges with the
help of the mounted filesystem
B) mount owner should not get illegitimate access to information from
other users' and the super user's processes
C) mount owner should not be able to induce undesired behavior in
other users' or the super user's processes
How are requirements fulfilled?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A) The mount owner could gain elevated privileges by either:
1) creating a filesystem containing a device file, then opening
this device
2) creating a filesystem containing a suid or sgid application,
then executing this application
The solution is not to allow opening device files and ignore
setuid and setgid bits when executing programs. To ensure this
fusermount always adds "nosuid" and "nodev" to the mount options
for non-privileged mounts.
B) If another user is accessing files or directories in the
filesystem, the filesystem daemon serving requests can record the
exact sequence and timing of operations performed. This
information is otherwise inaccessible to the mount owner, so this
counts as an information leak.
The solution to this problem will be presented in point 2) of C).
C) There are several ways in which the mount owner can induce
undesired behavior in other users' processes, such as:
1) mounting a filesystem over a file or directory which the mount
owner could otherwise not be able to modify (or could only
make limited modifications).
This is solved in fusermount, by checking the access
permissions on the mountpoint and only allowing the mount if
the mount owner can do unlimited modification (has write
access to the mountpoint, and mountpoint is not a "sticky"
directory)
2) Even if 1) is solved the mount owner can change the behavior
of other users' processes.
i) It can slow down or indefinitely delay the execution of a
filesystem operation creating a DoS against the user or the
whole system. For example a suid application locking a
system file, and then accessing a file on the mount owner's
filesystem could be stopped, and thus causing the system
file to be locked forever.
ii) It can present files or directories of unlimited length, or
directory structures of unlimited depth, possibly causing a
system process to eat up diskspace, memory or other
resources, again causing DoS.
The solution to this as well as B) is not to allow processes
to access the filesystem, which could otherwise not be
monitored or manipulated by the mount owner. Since if the
mount owner can ptrace a process, it can do all of the above
without using a FUSE mount, the same criteria as used in
ptrace can be used to check if a process is allowed to access
the filesystem or not.
Note that the ptrace check is not strictly necessary to
prevent B/2/i, it is enough to check if mount owner has enough
privilege to send signal to the process accessing the
filesystem, since SIGSTOP can be used to get a similar effect.
I think these limitations are unacceptable?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If a sysadmin trusts the users enough, or can ensure through other
measures, that system processes will never enter non-privileged
mounts, it can relax the last limitation with a "user_allow_other"
config option. If this config option is set, the mounting user can
add the "allow_other" mount option which disables the check for other
users' processes.
Kernel - userspace interface
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The following diagram shows how a filesystem operation (in this
example unlink) is performed in FUSE.
NOTE: everything in this description is greatly simplified
| "rm /mnt/fuse/file" | FUSE filesystem daemon
| |
| | >sys_read()
| | >fuse_dev_read()
| | >request_wait()
| | [sleep on fc->waitq]
| |
| >sys_unlink() |
| >fuse_unlink() |
| [get request from |
| fc->unused_list] |
| >request_send() |
| [queue req on fc->pending] |
| [wake up fc->waitq] | [woken up]
| >request_wait_answer() |
| [sleep on req->waitq] |
| | <request_wait()
| | [remove req from fc->pending]
| | [copy req to read buffer]
| | [add req to fc->processing]
| | <fuse_dev_read()
| | <sys_read()
| |
| | [perform unlink]
| |
| | >sys_write()
| | >fuse_dev_write()
| | [look up req in fc->processing]
| | [remove from fc->processing]
| | [copy write buffer to req]
| [woken up] | [wake up req->waitq]
| | <fuse_dev_write()
| | <sys_write()
| <request_wait_answer() |
| <request_send() |
| [add request to |
| fc->unused_list] |
| <fuse_unlink() |
| <sys_unlink() |
There are a couple of ways in which to deadlock a FUSE filesystem.
Since we are talking about unprivileged userspace programs,
something must be done about these.
Scenario 1 - Simple deadlock
-----------------------------
| "rm /mnt/fuse/file" | FUSE filesystem daemon
| |
| >sys_unlink("/mnt/fuse/file") |
| [acquire inode semaphore |
| for "file"] |
| >fuse_unlink() |
| [sleep on req->waitq] |
| | <sys_read()
| | >sys_unlink("/mnt/fuse/file")
| | [acquire inode semaphore
| | for "file"]
| | *DEADLOCK*
The solution for this is to allow the filesystem to be aborted.
Scenario 2 - Tricky deadlock
----------------------------
This one needs a carefully crafted filesystem. It's a variation on
the above, only the call back to the filesystem is not explicit,
but is caused by a pagefault.
| Kamikaze filesystem thread 1 | Kamikaze filesystem thread 2
| |
| [fd = open("/mnt/fuse/file")] | [request served normally]
| [mmap fd to 'addr'] |
| [close fd] | [FLUSH triggers 'magic' flag]
| [read a byte from addr] |
| >do_page_fault() |
| [find or create page] |
| [lock page] |
| >fuse_readpage() |
| [queue READ request] |
| [sleep on req->waitq] |
| | [read request to buffer]
| | [create reply header before addr]
| | >sys_write(addr - headerlength)
| | >fuse_dev_write()
| | [look up req in fc->processing]
| | [remove from fc->processing]
| | [copy write buffer to req]
| | >do_page_fault()
| | [find or create page]
| | [lock page]
| | * DEADLOCK *
Solution is basically the same as above.
An additional problem is that while the write buffer is being copied
to the request, the request must not be interrupted/aborted. This is
because the destination address of the copy may not be valid after the
request has returned.
This is solved with doing the copy atomically, and allowing abort
while the page(s) belonging to the write buffer are faulted with
get_user_pages(). The 'req->locked' flag indicates when the copy is
taking place, and abort is delayed until this flag is unset.

View File

@@ -0,0 +1,43 @@
Global File System
------------------
http://sources.redhat.com/cluster/
GFS is a cluster file system. It allows a cluster of computers to
simultaneously use a block device that is shared between them (with FC,
iSCSI, NBD, etc). GFS reads and writes to the block device like a local
file system, but also uses a lock module to allow the computers coordinate
their I/O so file system consistency is maintained. One of the nifty
features of GFS is perfect consistency -- changes made to the file system
on one machine show up immediately on all other machines in the cluster.
GFS uses interchangable inter-node locking mechanisms. Different lock
modules can plug into GFS and each file system selects the appropriate
lock module at mount time. Lock modules include:
lock_nolock -- allows gfs to be used as a local file system
lock_dlm -- uses a distributed lock manager (dlm) for inter-node locking
The dlm is found at linux/fs/dlm/
In addition to interfacing with an external locking manager, a gfs lock
module is responsible for interacting with external cluster management
systems. Lock_dlm depends on user space cluster management systems found
at the URL above.
To use gfs as a local file system, no external clustering systems are
needed, simply:
$ mkfs -t gfs2 -p lock_nolock -j 1 /dev/block_device
$ mount -t gfs2 /dev/block_device /dir
GFS2 is not on-disk compatible with previous versions of GFS.
The following man pages can be found at the URL above:
gfs2_fsck to repair a filesystem
gfs2_grow to expand a filesystem online
gfs2_jadd to add journals to a filesystem online
gfs2_tool to manipulate, examine and tune a filesystem
gfs2_quota to examine and change quota values in a filesystem
mount.gfs2 to help mount(8) mount a filesystem
mkfs.gfs2 to make a filesystem

View File

@@ -0,0 +1,83 @@
Macintosh HFS Filesystem for Linux
==================================
HFS stands for ``Hierarchical File System'' and is the filesystem used
by the Mac Plus and all later Macintosh models. Earlier Macintosh
models used MFS (``Macintosh File System''), which is not supported,
MacOS 8.1 and newer support a filesystem called HFS+ that's similar to
HFS but is extended in various areas. Use the hfsplus filesystem driver
to access such filesystems from Linux.
Mount options
=============
When mounting an HFS filesystem, the following options are accepted:
creator=cccc, type=cccc
Specifies the creator/type values as shown by the MacOS finder
used for creating new files. Default values: '????'.
uid=n, gid=n
Specifies the user/group that owns all files on the filesystems.
Default: user/group id of the mounting process.
dir_umask=n, file_umask=n, umask=n
Specifies the umask used for all files , all directories or all
files and directories. Defaults to the umask of the mounting process.
session=n
Select the CDROM session to mount as HFS filesystem. Defaults to
leaving that decision to the CDROM driver. This option will fail
with anything but a CDROM as underlying devices.
part=n
Select partition number n from the devices. Does only makes
sense for CDROMS because they can't be partitioned under Linux.
For disk devices the generic partition parsing code does this
for us. Defaults to not parsing the partition table at all.
quiet
Ignore invalid mount options instead of complaining.
Writing to HFS Filesystems
==========================
HFS is not a UNIX filesystem, thus it does not have the usual features you'd
expect:
o You can't modify the set-uid, set-gid, sticky or executable bits or the uid
and gid of files.
o You can't create hard- or symlinks, device files, sockets or FIFOs.
HFS does on the other have the concepts of multiple forks per file. These
non-standard forks are represented as hidden additional files in the normal
filesystems namespace which is kind of a cludge and makes the semantics for
the a little strange:
o You can't create, delete or rename resource forks of files or the
Finder's metadata.
o They are however created (with default values), deleted and renamed
along with the corresponding data fork or directory.
o Copying files to a different filesystem will loose those attributes
that are essential for MacOS to work.
Creating HFS filesystems
===================================
The hfsutils package from Robert Leslie contains a program called
hformat that can be used to create HFS filesystem. See
<http://www.mars.org/home/rob/proj/hfs/> for details.
Credits
=======
The HFS drivers was written by Paul H. Hargrovea (hargrove@sccm.Stanford.EDU)
and is now maintained by Roman Zippel (roman@ardistech.com) at Ardis
Technologies.
Roman rewrote large parts of the code and brought in btree routines derived
from Brad Boyer's hfsplus driver (also maintained by Roman now).

View File

@@ -0,0 +1,296 @@
Read/Write HPFS 2.09
1998-2004, Mikulas Patocka
email: mikulas@artax.karlin.mff.cuni.cz
homepage: http://artax.karlin.mff.cuni.cz/~mikulas/vyplody/hpfs/index-e.cgi
CREDITS:
Chris Smith, 1993, original read-only HPFS, some code and hpfs structures file
is taken from it
Jacques Gelinas, MSDos mmap, Inspired by fs/nfs/mmap.c (Jon Tombs 15 Aug 1993)
Werner Almesberger, 1992, 1993, MSDos option parser & CR/LF conversion
Mount options
uid=xxx,gid=xxx,umask=xxx (default uid=gid=0 umask=default_system_umask)
Set owner/group/mode for files that do not have it specified in extended
attributes. Mode is inverted umask - for example umask 027 gives owner
all permission, group read permission and anybody else no access. Note
that for files mode is anded with 0666. If you want files to have 'x'
rights, you must use extended attributes.
case=lower,asis (default asis)
File name lowercasing in readdir.
conv=binary,text,auto (default binary)
CR/LF -> LF conversion, if auto, decision is made according to extension
- there is a list of text extensions (I thing it's better to not convert
text file than to damage binary file). If you want to change that list,
change it in the source. Original readonly HPFS contained some strange
heuristic algorithm that I removed. I thing it's danger to let the
computer decide whether file is text or binary. For example, DJGPP
binaries contain small text message at the beginning and they could be
misidentified and damaged under some circumstances.
check=none,normal,strict (default normal)
Check level. Selecting none will cause only little speedup and big
danger. I tried to write it so that it won't crash if check=normal on
corrupted filesystems. check=strict means many superfluous checks -
used for debugging (for example it checks if file is allocated in
bitmaps when accessing it).
errors=continue,remount-ro,panic (default remount-ro)
Behaviour when filesystem errors found.
chkdsk=no,errors,always (default errors)
When to mark filesystem dirty so that OS/2 checks it.
eas=no,ro,rw (default rw)
What to do with extended attributes. 'no' - ignore them and use always
values specified in uid/gid/mode options. 'ro' - read extended
attributes but do not create them. 'rw' - create extended attributes
when you use chmod/chown/chgrp/mknod/ln -s on the filesystem.
timeshift=(-)nnn (default 0)
Shifts the time by nnn seconds. For example, if you see under linux
one hour more, than under os/2, use timeshift=-3600.
File names
As in OS/2, filenames are case insensitive. However, shell thinks that names
are case sensitive, so for example when you create a file FOO, you can use
'cat FOO', 'cat Foo', 'cat foo' or 'cat F*' but not 'cat f*'. Note, that you
also won't be able to compile linux kernel (and maybe other things) on HPFS
because kernel creates different files with names like bootsect.S and
bootsect.s. When searching for file thats name has characters >= 128, codepages
are used - see below.
OS/2 ignores dots and spaces at the end of file name, so this driver does as
well. If you create 'a. ...', the file 'a' will be created, but you can still
access it under names 'a.', 'a..', 'a . . . ' etc.
Extended attributes
On HPFS partitions, OS/2 can associate to each file a special information called
extended attributes. Extended attributes are pairs of (key,value) where key is
an ascii string identifying that attribute and value is any string of bytes of
variable length. OS/2 stores window and icon positions and file types there. So
why not use it for unix-specific info like file owner or access rights? This
driver can do it. If you chown/chgrp/chmod on a hpfs partition, extended
attributes with keys "UID", "GID" or "MODE" and 2-byte values are created. Only
that extended attributes those value differs from defaults specified in mount
options are created. Once created, the extended attributes are never deleted,
they're just changed. It means that when your default uid=0 and you type
something like 'chown luser file; chown root file' the file will contain
extended attribute UID=0. And when you umount the fs and mount it again with
uid=luser_uid, the file will be still owned by root! If you chmod file to 444,
extended attribute "MODE" will not be set, this special case is done by setting
read-only flag. When you mknod a block or char device, besides "MODE", the
special 4-byte extended attribute "DEV" will be created containing the device
number. Currently this driver cannot resize extended attributes - it means
that if somebody (I don't know who?) has set "UID", "GID", "MODE" or "DEV"
attributes with different sizes, they won't be rewritten and changing these
values doesn't work.
Symlinks
You can do symlinks on HPFS partition, symlinks are achieved by setting extended
attribute named "SYMLINK" with symlink value. Like on ext2, you can chown and
chgrp symlinks but I don't know what is it good for. chmoding symlink results
in chmoding file where symlink points. These symlinks are just for Linux use and
incompatible with OS/2. OS/2 PmShell symlinks are not supported because they are
stored in very crazy way. They tried to do it so that link changes when file is
moved ... sometimes it works. But the link is partly stored in directory
extended attributes and partly in OS2SYS.INI. I don't want (and don't know how)
to analyze or change OS2SYS.INI.
Codepages
HPFS can contain several uppercasing tables for several codepages and each
file has a pointer to codepage it's name is in. However OS/2 was created in
America where people don't care much about codepages and so multiple codepages
support is quite buggy. I have Czech OS/2 working in codepage 852 on my disk.
Once I booted English OS/2 working in cp 850 and I created a file on my 852
partition. It marked file name codepage as 850 - good. But when I again booted
Czech OS/2, the file was completely inaccessible under any name. It seems that
OS/2 uppercases the search pattern with its system code page (852) and file
name it's comparing to with its code page (850). These could never match. Is it
really what IBM developers wanted? But problems continued. When I created in
Czech OS/2 another file in that directory, that file was inaccessible too. OS/2
probably uses different uppercasing method when searching where to place a file
(note, that files in HPFS directory must be sorted) and when searching for
a file. Finally when I opened this directory in PmShell, PmShell crashed (the
funny thing was that, when rebooted, PmShell tried to reopen this directory
again :-). chkdsk happily ignores these errors and only low-level disk
modification saved me. Never mix different language versions of OS/2 on one
system although HPFS was designed to allow that.
OK, I could implement complex codepage support to this driver but I think it
would cause more problems than benefit with such buggy implementation in OS/2.
So this driver simply uses first codepage it finds for uppercasing and
lowercasing no matter what's file codepage index. Usually all file names are in
this codepage - if you don't try to do what I described above :-)
Known bugs
HPFS386 on OS/2 server is not supported. HPFS386 installed on normal OS/2 client
should work. If you have OS/2 server, use only read-only mode. I don't know how
to handle some HPFS386 structures like access control list or extended perm
list, I don't know how to delete them when file is deleted and how to not
overwrite them with extended attributes. Send me some info on these structures
and I'll make it. However, this driver should detect presence of HPFS386
structures, remount read-only and not destroy them (I hope).
When there's not enough space for extended attributes, they will be truncated
and no error is returned.
OS/2 can't access files if the path is longer than about 256 chars but this
driver allows you to do it. chkdsk ignores such errors.
Sometimes you won't be able to delete some files on a very full filesystem
(returning error ENOSPC). That's because file in non-leaf node in directory tree
(one directory, if it's large, has dirents in tree on HPFS) must be replaced
with another node when deleted. And that new file might have larger name than
the old one so the new name doesn't fit in directory node (dnode). And that
would result in directory tree splitting, that takes disk space. Workaround is
to delete other files that are leaf (probability that the file is non-leaf is
about 1/50) or to truncate file first to make some space.
You encounter this problem only if you have many directories so that
preallocated directory band is full i.e.
number_of_directories / size_of_filesystem_in_mb > 4.
You can't delete open directories.
You can't rename over directories (what is it good for?).
Renaming files so that only case changes doesn't work. This driver supports it
but vfs doesn't. Something like 'mv file FILE' won't work.
All atimes and directory mtimes are not updated. That's because of performance
reasons. If you extremely wish to update them, let me know, I'll write it (but
it will be slow).
When the system is out of memory and swap, it may slightly corrupt filesystem
(lost files, unbalanced directories). (I guess all filesystem may do it).
When compiled, you get warning: function declaration isn't a prototype. Does
anybody know what does it mean?
What does "unbalanced tree" message mean?
Old versions of this driver created sometimes unbalanced dnode trees. OS/2
chkdsk doesn't scream if the tree is unbalanced (and sometimes creates
unbalanced trees too :-) but both HPFS and HPFS386 contain bug that it rarely
crashes when the tree is not balanced. This driver handles unbalanced trees
correctly and writes warning if it finds them. If you see this message, this is
probably because of directories created with old version of this driver.
Workaround is to move all files from that directory to another and then back
again. Do it in Linux, not OS/2! If you see this message in directory that is
whole created by this driver, it is BUG - let me know about it.
Bugs in OS/2
When you have two (or more) lost directories pointing each to other, chkdsk
locks up when repairing filesystem.
Sometimes (I think it's random) when you create a file with one-char name under
OS/2, OS/2 marks it as 'long'. chkdsk then removes this flag saying "Minor fs
error corrected".
File names like "a .b" are marked as 'long' by OS/2 but chkdsk "corrects" it and
marks them as short (and writes "minor fs error corrected"). This bug is not in
HPFS386.
Codepage bugs described above.
If you don't install fixpacks, there are many, many more...
History
0.90 First public release
0.91 Fixed bug that caused shooting to memory when write_inode was called on
open inode (rarely happened)
0.92 Fixed a little memory leak in freeing directory inodes
0.93 Fixed bug that locked up the machine when there were too many filenames
with first 15 characters same
Fixed write_file to zero file when writing behind file end
0.94 Fixed a little memory leak when trying to delete busy file or directory
0.95 Fixed a bug that i_hpfs_parent_dir was not updated when moving files
1.90 First version for 2.1.1xx kernels
1.91 Fixed a bug that chk_sectors failed when sectors were at the end of disk
Fixed a race-condition when write_inode is called while deleting file
Fixed a bug that could possibly happen (with very low probability) when
using 0xff in filenames
Rewritten locking to avoid race-conditions
Mount option 'eas' now works
Fsync no longer returns error
Files beginning with '.' are marked hidden
Remount support added
Alloc is not so slow when filesystem becomes full
Atimes are no more updated because it slows down operation
Code cleanup (removed all commented debug prints)
1.92 Corrected a bug when sync was called just before closing file
1.93 Modified, so that it works with kernels >= 2.1.131, I don't know if it
works with previous versions
Fixed a possible problem with disks > 64G (but I don't have one, so I can't
test it)
Fixed a file overflow at 2G
Added new option 'timeshift'
Changed behaviour on HPFS386: It is now possible to operate on HPFS386 in
read-only mode
Fixed a bug that slowed down alloc and prevented allocating 100% space
(this bug was not destructive)
1.94 Added workaround for one bug in Linux
Fixed one buffer leak
Fixed some incompatibilities with large extended attributes (but it's still
not 100% ok, I have no info on it and OS/2 doesn't want to create them)
Rewritten allocation
Fixed a bug with i_blocks (du sometimes didn't display correct values)
Directories have no longer archive attribute set (some programs don't like
it)
Fixed a bug that it set badly one flag in large anode tree (it was not
destructive)
1.95 Fixed one buffer leak, that could happen on corrupted filesystem
Fixed one bug in allocation in 1.94
1.96 Added workaround for one bug in OS/2 (HPFS locked up, HPFS386 reported
error sometimes when opening directories in PMSHELL)
Fixed a possible bitmap race
Fixed possible problem on large disks
You can now delete open files
Fixed a nondestructive race in rename
1.97 Support for HPFS v3 (on large partitions)
Fixed a bug that it didn't allow creation of files > 128M (it should be 2G)
1.97.1 Changed names of global symbols
Fixed a bug when chmoding or chowning root directory
1.98 Fixed a deadlock when using old_readdir
Better directory handling; workaround for "unbalanced tree" bug in OS/2
1.99 Corrected a possible problem when there's not enough space while deleting
file
Now it tries to truncate the file if there's not enough space when deleting
Removed a lot of redundant code
2.00 Fixed a bug in rename (it was there since 1.96)
Better anti-fragmentation strategy
2.01 Fixed problem with directory listing over NFS
Directory lseek now checks for proper parameters
Fixed race-condition in buffer code - it is in all filesystems in Linux;
when reading device (cat /dev/hda) while creating files on it, files
could be damaged
2.02 Workaround for bug in breada in Linux. breada could cause accesses beyond
end of partition
2.03 Char, block devices and pipes are correctly created
Fixed non-crashing race in unlink (Alexander Viro)
Now it works with Japanese version of OS/2
2.04 Fixed error when ftruncate used to extend file
2.05 Fixed crash when got mount parameters without =
Fixed crash when allocation of anode failed due to full disk
Fixed some crashes when block io or inode allocation failed
2.06 Fixed some crash on corrupted disk structures
Better allocation strategy
Reschedule points added so that it doesn't lock CPU long time
It should work in read-only mode on Warp Server
2.07 More fixes for Warp Server. Now it really works
2.08 Creating new files is not so slow on large disks
An attempt to sync deleted file does not generate filesystem error
2.09 Fixed error on extremly fragmented files
vim: set textwidth=80:

View File

@@ -0,0 +1,269 @@
inotify
a powerful yet simple file change notification system
Document started 15 Mar 2005 by Robert Love <rml@novell.com>
(i) User Interface
Inotify is controlled by a set of three system calls and normal file I/O on a
returned file descriptor.
First step in using inotify is to initialise an inotify instance:
int fd = inotify_init ();
Each instance is associated with a unique, ordered queue.
Change events are managed by "watches". A watch is an (object,mask) pair where
the object is a file or directory and the mask is a bit mask of one or more
inotify events that the application wishes to receive. See <linux/inotify.h>
for valid events. A watch is referenced by a watch descriptor, or wd.
Watches are added via a path to the file.
Watches on a directory will return events on any files inside of the directory.
Adding a watch is simple:
int wd = inotify_add_watch (fd, path, mask);
Where "fd" is the return value from inotify_init(), path is the path to the
object to watch, and mask is the watch mask (see <linux/inotify.h>).
You can update an existing watch in the same manner, by passing in a new mask.
An existing watch is removed via
int ret = inotify_rm_watch (fd, wd);
Events are provided in the form of an inotify_event structure that is read(2)
from a given inotify instance. The filename is of dynamic length and follows
the struct. It is of size len. The filename is padded with null bytes to
ensure proper alignment. This padding is reflected in len.
You can slurp multiple events by passing a large buffer, for example
size_t len = read (fd, buf, BUF_LEN);
Where "buf" is a pointer to an array of "inotify_event" structures at least
BUF_LEN bytes in size. The above example will return as many events as are
available and fit in BUF_LEN.
Each inotify instance fd is also select()- and poll()-able.
You can find the size of the current event queue via the standard FIONREAD
ioctl on the fd returned by inotify_init().
All watches are destroyed and cleaned up on close.
(ii)
Prototypes:
int inotify_init (void);
int inotify_add_watch (int fd, const char *path, __u32 mask);
int inotify_rm_watch (int fd, __u32 mask);
(iii) Kernel Interface
Inotify's kernel API consists a set of functions for managing watches and an
event callback.
To use the kernel API, you must first initialize an inotify instance with a set
of inotify_operations. You are given an opaque inotify_handle, which you use
for any further calls to inotify.
struct inotify_handle *ih = inotify_init(my_event_handler);
You must provide a function for processing events and a function for destroying
the inotify watch.
void handle_event(struct inotify_watch *watch, u32 wd, u32 mask,
u32 cookie, const char *name, struct inode *inode)
watch - the pointer to the inotify_watch that triggered this call
wd - the watch descriptor
mask - describes the event that occurred
cookie - an identifier for synchronizing events
name - the dentry name for affected files in a directory-based event
inode - the affected inode in a directory-based event
void destroy_watch(struct inotify_watch *watch)
You may add watches by providing a pre-allocated and initialized inotify_watch
structure and specifying the inode to watch along with an inotify event mask.
You must pin the inode during the call. You will likely wish to embed the
inotify_watch structure in a structure of your own which contains other
information about the watch. Once you add an inotify watch, it is immediately
subject to removal depending on filesystem events. You must grab a reference if
you depend on the watch hanging around after the call.
inotify_init_watch(&my_watch->iwatch);
inotify_get_watch(&my_watch->iwatch); // optional
s32 wd = inotify_add_watch(ih, &my_watch->iwatch, inode, mask);
inotify_put_watch(&my_watch->iwatch); // optional
You may use the watch descriptor (wd) or the address of the inotify_watch for
other inotify operations. You must not directly read or manipulate data in the
inotify_watch. Additionally, you must not call inotify_add_watch() more than
once for a given inotify_watch structure, unless you have first called either
inotify_rm_watch() or inotify_rm_wd().
To determine if you have already registered a watch for a given inode, you may
call inotify_find_watch(), which gives you both the wd and the watch pointer for
the inotify_watch, or an error if the watch does not exist.
wd = inotify_find_watch(ih, inode, &watchp);
You may use container_of() on the watch pointer to access your own data
associated with a given watch. When an existing watch is found,
inotify_find_watch() bumps the refcount before releasing its locks. You must
put that reference with:
put_inotify_watch(watchp);
Call inotify_find_update_watch() to update the event mask for an existing watch.
inotify_find_update_watch() returns the wd of the updated watch, or an error if
the watch does not exist.
wd = inotify_find_update_watch(ih, inode, mask);
An existing watch may be removed by calling either inotify_rm_watch() or
inotify_rm_wd().
int ret = inotify_rm_watch(ih, &my_watch->iwatch);
int ret = inotify_rm_wd(ih, wd);
A watch may be removed while executing your event handler with the following:
inotify_remove_watch_locked(ih, iwatch);
Call inotify_destroy() to remove all watches from your inotify instance and
release it. If there are no outstanding references, inotify_destroy() will call
your destroy_watch op for each watch.
inotify_destroy(ih);
When inotify removes a watch, it sends an IN_IGNORED event to your callback.
You may use this event as an indication to free the watch memory. Note that
inotify may remove a watch due to filesystem events, as well as by your request.
If you use IN_ONESHOT, inotify will remove the watch after the first event, at
which point you may call the final inotify_put_watch.
(iv) Kernel Interface Prototypes
struct inotify_handle *inotify_init(struct inotify_operations *ops);
inotify_init_watch(struct inotify_watch *watch);
s32 inotify_add_watch(struct inotify_handle *ih,
struct inotify_watch *watch,
struct inode *inode, u32 mask);
s32 inotify_find_watch(struct inotify_handle *ih, struct inode *inode,
struct inotify_watch **watchp);
s32 inotify_find_update_watch(struct inotify_handle *ih,
struct inode *inode, u32 mask);
int inotify_rm_wd(struct inotify_handle *ih, u32 wd);
int inotify_rm_watch(struct inotify_handle *ih,
struct inotify_watch *watch);
void inotify_remove_watch_locked(struct inotify_handle *ih,
struct inotify_watch *watch);
void inotify_destroy(struct inotify_handle *ih);
void get_inotify_watch(struct inotify_watch *watch);
void put_inotify_watch(struct inotify_watch *watch);
(v) Internal Kernel Implementation
Each inotify instance is represented by an inotify_handle structure.
Inotify's userspace consumers also have an inotify_device which is
associated with the inotify_handle, and on which events are queued.
Each watch is associated with an inotify_watch structure. Watches are chained
off of each associated inotify_handle and each associated inode.
See fs/inotify.c and fs/inotify_user.c for the locking and lifetime rules.
(vi) Rationale
Q: What is the design decision behind not tying the watch to the open fd of
the watched object?
A: Watches are associated with an open inotify device, not an open file.
This solves the primary problem with dnotify: keeping the file open pins
the file and thus, worse, pins the mount. Dnotify is therefore infeasible
for use on a desktop system with removable media as the media cannot be
unmounted. Watching a file should not require that it be open.
Q: What is the design decision behind using an-fd-per-instance as opposed to
an fd-per-watch?
A: An fd-per-watch quickly consumes more file descriptors than are allowed,
more fd's than are feasible to manage, and more fd's than are optimally
select()-able. Yes, root can bump the per-process fd limit and yes, users
can use epoll, but requiring both is a silly and extraneous requirement.
A watch consumes less memory than an open file, separating the number
spaces is thus sensible. The current design is what user-space developers
want: Users initialize inotify, once, and add n watches, requiring but one
fd and no twiddling with fd limits. Initializing an inotify instance two
thousand times is silly. If we can implement user-space's preferences
cleanly--and we can, the idr layer makes stuff like this trivial--then we
should.
There are other good arguments. With a single fd, there is a single
item to block on, which is mapped to a single queue of events. The single
fd returns all watch events and also any potential out-of-band data. If
every fd was a separate watch,
- There would be no way to get event ordering. Events on file foo and
file bar would pop poll() on both fd's, but there would be no way to tell
which happened first. A single queue trivially gives you ordering. Such
ordering is crucial to existing applications such as Beagle. Imagine
"mv a b ; mv b a" events without ordering.
- We'd have to maintain n fd's and n internal queues with state,
versus just one. It is a lot messier in the kernel. A single, linear
queue is the data structure that makes sense.
- User-space developers prefer the current API. The Beagle guys, for
example, love it. Trust me, I asked. It is not a surprise: Who'd want
to manage and block on 1000 fd's via select?
- No way to get out of band data.
- 1024 is still too low. ;-)
When you talk about designing a file change notification system that
scales to 1000s of directories, juggling 1000s of fd's just does not seem
the right interface. It is too heavy.
Additionally, it _is_ possible to more than one instance and
juggle more than one queue and thus more than one associated fd. There
need not be a one-fd-per-process mapping; it is one-fd-per-queue and a
process can easily want more than one queue.
Q: Why the system call approach?
A: The poor user-space interface is the second biggest problem with dnotify.
Signals are a terrible, terrible interface for file notification. Or for
anything, for that matter. The ideal solution, from all perspectives, is a
file descriptor-based one that allows basic file I/O and poll/select.
Obtaining the fd and managing the watches could have been done either via a
device file or a family of new system calls. We decided to implement a
family of system calls because that is the preferred approach for new kernel
interfaces. The only real difference was whether we wanted to use open(2)
and ioctl(2) or a couple of new system calls. System calls beat ioctls.

View File

@@ -0,0 +1,42 @@
Mount options that are the same as for msdos and vfat partitions.
gid=nnn All files in the partition will be in group nnn.
uid=nnn All files in the partition will be owned by user id nnn.
umask=nnn The permission mask (see umask(1)) for the partition.
Mount options that are the same as vfat partitions. These are only useful
when using discs encoded using Microsoft's Joliet extensions.
iocharset=name Character set to use for converting from Unicode to
ASCII. Joliet filenames are stored in Unicode format, but
Unix for the most part doesn't know how to deal with Unicode.
There is also an option of doing UTF-8 translations with the
utf8 option.
utf8 Encode Unicode names in UTF-8 format. Default is no.
Mount options unique to the isofs filesystem.
block=512 Set the block size for the disk to 512 bytes
block=1024 Set the block size for the disk to 1024 bytes
block=2048 Set the block size for the disk to 2048 bytes
check=relaxed Matches filenames with different cases
check=strict Matches only filenames with the exact same case
cruft Try to handle badly formatted CDs.
map=off Do not map non-Rock Ridge filenames to lower case
map=normal Map non-Rock Ridge filenames to lower case
map=acorn As map=normal but also apply Acorn extensions if present
mode=xxx Sets the permissions on files to xxx
nojoliet Ignore Joliet extensions if they are present.
norock Ignore Rock Ridge extensions if they are present.
hide Completely strip hidden files from the file system.
showassoc Show files marked with the 'associated' bit
unhide Deprecated; showing hidden files is now default;
If given, it is a synonym for 'showassoc' which will
recreate previous unhide behavior
session=x Select number of session on multisession CD
sbsector=xxx Session begins from sector xxx
Recommended documents about ISO 9660 standard are located at:
http://www.y-adagio.com/public/standards/iso_cdromr/tocont.htm
ftp://ftp.ecma.ch/ecma-st/Ecma-119.pdf
Quoting from the PDF "This 2nd Edition of Standard ECMA-119 is technically
identical with ISO 9660.", so it is a valid and gratis substitute of the
official ISO specification.

View File

@@ -0,0 +1,35 @@
IBM's Journaled File System (JFS) for Linux
JFS Homepage: http://jfs.sourceforge.net/
The following mount options are supported:
iocharset=name Character set to use for converting from Unicode to
ASCII. The default is to do no conversion. Use
iocharset=utf8 for UTF-8 translations. This requires
CONFIG_NLS_UTF8 to be set in the kernel .config file.
iocharset=none specifies the default behavior explicitly.
resize=value Resize the volume to <value> blocks. JFS only supports
growing a volume, not shrinking it. This option is only
valid during a remount, when the volume is mounted
read-write. The resize keyword with no value will grow
the volume to the full size of the partition.
nointegrity Do not write to the journal. The primary use of this option
is to allow for higher performance when restoring a volume
from backup media. The integrity of the volume is not
guaranteed if the system abnormally abends.
integrity Default. Commit metadata changes to the journal. Use this
option to remount a volume where the nointegrity option was
previously specified in order to restore normal behavior.
errors=continue Keep going on a filesystem error.
errors=remount-ro Default. Remount the filesystem read-only on an error.
errors=panic Panic and halt the machine if an error occurs.
Please send bugs, comments, cards and letters to shaggy@austin.ibm.com.
The JFS mailing list can be subscribed to by using the link labeled
"Mail list Subscribe" at our web page http://jfs.sourceforge.net/

View File

@@ -0,0 +1,12 @@
The ncpfs filesystem understands the NCP protocol, designed by the
Novell Corporation for their NetWare(tm) product. NCP is functionally
similar to the NFS used in the TCP/IP community.
To mount a NetWare filesystem, you need a special mount program, which
can be found in the ncpfs package. The home site for ncpfs is
ftp.gwdg.de/pub/linux/misc/ncpfs, but sunsite and its many mirrors
will have it as well.
Related products are linware and mars_nwe, which will give Linux partial
NetWare server functionality. Linware's home site is
klokan.sh.cvut.cz/pub/linux/linware; mars_nwe can be found on
ftp.gwdg.de/pub/linux/misc/ncpfs.

View File

@@ -0,0 +1,714 @@
The Linux NTFS filesystem driver
================================
Table of contents
=================
- Overview
- Web site
- Features
- Supported mount options
- Known bugs and (mis-)features
- Using NTFS volume and stripe sets
- The Device-Mapper driver
- The Software RAID / MD driver
- Limitations when using the MD driver
- ChangeLog
Overview
========
Linux-NTFS comes with a number of user-space programs known as ntfsprogs.
These include mkntfs, a full-featured ntfs filesystem format utility,
ntfsundelete used for recovering files that were unintentionally deleted
from an NTFS volume and ntfsresize which is used to resize an NTFS partition.
See the web site for more information.
To mount an NTFS 1.2/3.x (Windows NT4/2000/XP/2003) volume, use the file
system type 'ntfs'. The driver currently supports read-only mode (with no
fault-tolerance, encryption or journalling) and very limited, but safe, write
support.
For fault tolerance and raid support (i.e. volume and stripe sets), you can
use the kernel's Software RAID / MD driver. See section "Using Software RAID
with NTFS" for details.
Web site
========
There is plenty of additional information on the linux-ntfs web site
at http://linux-ntfs.sourceforge.net/
The web site has a lot of additional information, such as a comprehensive
FAQ, documentation on the NTFS on-disk format, information on the Linux-NTFS
userspace utilities, etc.
Features
========
- This is a complete rewrite of the NTFS driver that used to be in the 2.4 and
earlier kernels. This new driver implements NTFS read support and is
functionally equivalent to the old ntfs driver and it also implements limited
write support. The biggest limitation at present is that files/directories
cannot be created or deleted. See below for the list of write features that
are so far supported. Another limitation is that writing to compressed files
is not implemented at all. Also, neither read nor write access to encrypted
files is so far implemented.
- The new driver has full support for sparse files on NTFS 3.x volumes which
the old driver isn't happy with.
- The new driver supports execution of binaries due to mmap() now being
supported.
- The new driver supports loopback mounting of files on NTFS which is used by
some Linux distributions to enable the user to run Linux from an NTFS
partition by creating a large file while in Windows and then loopback
mounting the file while in Linux and creating a Linux filesystem on it that
is used to install Linux on it.
- A comparison of the two drivers using:
time find . -type f -exec md5sum "{}" \;
run three times in sequence with each driver (after a reboot) on a 1.4GiB
NTFS partition, showed the new driver to be 20% faster in total time elapsed
(from 9:43 minutes on average down to 7:53). The time spent in user space
was unchanged but the time spent in the kernel was decreased by a factor of
2.5 (from 85 CPU seconds down to 33).
- The driver does not support short file names in general. For backwards
compatibility, we implement access to files using their short file names if
they exist. The driver will not create short file names however, and a
rename will discard any existing short file name.
- The new driver supports exporting of mounted NTFS volumes via NFS.
- The new driver supports async io (aio).
- The new driver supports fsync(2), fdatasync(2), and msync(2).
- The new driver supports readv(2) and writev(2).
- The new driver supports access time updates (including mtime and ctime).
- The new driver supports truncate(2) and open(2) with O_TRUNC. But at present
only very limited support for highly fragmented files, i.e. ones which have
their data attribute split across multiple extents, is included. Another
limitation is that at present truncate(2) will never create sparse files,
since to mark a file sparse we need to modify the directory entry for the
file and we do not implement directory modifications yet.
- The new driver supports write(2) which can both overwrite existing data and
extend the file size so that you can write beyond the existing data. Also,
writing into sparse regions is supported and the holes are filled in with
clusters. But at present only limited support for highly fragmented files,
i.e. ones which have their data attribute split across multiple extents, is
included. Another limitation is that write(2) will never create sparse
files, since to mark a file sparse we need to modify the directory entry for
the file and we do not implement directory modifications yet.
Supported mount options
=======================
In addition to the generic mount options described by the manual page for the
mount command (man 8 mount, also see man 5 fstab), the NTFS driver supports the
following mount options:
iocharset=name Deprecated option. Still supported but please use
nls=name in the future. See description for nls=name.
nls=name Character set to use when returning file names.
Unlike VFAT, NTFS suppresses names that contain
unconvertible characters. Note that most character
sets contain insufficient characters to represent all
possible Unicode characters that can exist on NTFS.
To be sure you are not missing any files, you are
advised to use nls=utf8 which is capable of
representing all Unicode characters.
utf8=<bool> Option no longer supported. Currently mapped to
nls=utf8 but please use nls=utf8 in the future and
make sure utf8 is compiled either as module or into
the kernel. See description for nls=name.
uid=
gid=
umask= Provide default owner, group, and access mode mask.
These options work as documented in mount(8). By
default, the files/directories are owned by root and
he/she has read and write permissions, as well as
browse permission for directories. No one else has any
access permissions. I.e. the mode on all files is by
default rw------- and for directories rwx------, a
consequence of the default fmask=0177 and dmask=0077.
Using a umask of zero will grant all permissions to
everyone, i.e. all files and directories will have mode
rwxrwxrwx.
fmask=
dmask= Instead of specifying umask which applies both to
files and directories, fmask applies only to files and
dmask only to directories.
sloppy=<BOOL> If sloppy is specified, ignore unknown mount options.
Otherwise the default behaviour is to abort mount if
any unknown options are found.
show_sys_files=<BOOL> If show_sys_files is specified, show the system files
in directory listings. Otherwise the default behaviour
is to hide the system files.
Note that even when show_sys_files is specified, "$MFT"
will not be visible due to bugs/mis-features in glibc.
Further, note that irrespective of show_sys_files, all
files are accessible by name, i.e. you can always do
"ls -l \$UpCase" for example to specifically show the
system file containing the Unicode upcase table.
case_sensitive=<BOOL> If case_sensitive is specified, treat all file names as
case sensitive and create file names in the POSIX
namespace. Otherwise the default behaviour is to treat
file names as case insensitive and to create file names
in the WIN32/LONG name space. Note, the Linux NTFS
driver will never create short file names and will
remove them on rename/delete of the corresponding long
file name.
Note that files remain accessible via their short file
name, if it exists. If case_sensitive, you will need
to provide the correct case of the short file name.
disable_sparse=<BOOL> If disable_sparse is specified, creation of sparse
regions, i.e. holes, inside files is disabled for the
volume (for the duration of this mount only). By
default, creation of sparse regions is enabled, which
is consistent with the behaviour of traditional Unix
filesystems.
errors=opt What to do when critical filesystem errors are found.
Following values can be used for "opt":
continue: DEFAULT, try to clean-up as much as
possible, e.g. marking a corrupt inode as
bad so it is no longer accessed, and then
continue.
recover: At present only supported is recovery of
the boot sector from the backup copy.
If read-only mount, the recovery is done
in memory only and not written to disk.
Note that the options are additive, i.e. specifying:
errors=continue,errors=recover
means the driver will attempt to recover and if that
fails it will clean-up as much as possible and
continue.
mft_zone_multiplier= Set the MFT zone multiplier for the volume (this
setting is not persistent across mounts and can be
changed from mount to mount but cannot be changed on
remount). Values of 1 to 4 are allowed, 1 being the
default. The MFT zone multiplier determines how much
space is reserved for the MFT on the volume. If all
other space is used up, then the MFT zone will be
shrunk dynamically, so this has no impact on the
amount of free space. However, it can have an impact
on performance by affecting fragmentation of the MFT.
In general use the default. If you have a lot of small
files then use a higher value. The values have the
following meaning:
Value MFT zone size (% of volume size)
1 12.5%
2 25%
3 37.5%
4 50%
Note this option is irrelevant for read-only mounts.
Known bugs and (mis-)features
=============================
- The link count on each directory inode entry is set to 1, due to Linux not
supporting directory hard links. This may well confuse some user space
applications, since the directory names will have the same inode numbers.
This also speeds up ntfs_read_inode() immensely. And we haven't found any
problems with this approach so far. If you find a problem with this, please
let us know.
Please send bug reports/comments/feedback/abuse to the Linux-NTFS development
list at sourceforge: linux-ntfs-dev@lists.sourceforge.net
Using NTFS volume and stripe sets
=================================
For support of volume and stripe sets, you can either use the kernel's
Device-Mapper driver or the kernel's Software RAID / MD driver. The former is
the recommended one to use for linear raid. But the latter is required for
raid level 5. For striping and mirroring, either driver should work fine.
The Device-Mapper driver
------------------------
You will need to create a table of the components of the volume/stripe set and
how they fit together and load this into the kernel using the dmsetup utility
(see man 8 dmsetup).
Linear volume sets, i.e. linear raid, has been tested and works fine. Even
though untested, there is no reason why stripe sets, i.e. raid level 0, and
mirrors, i.e. raid level 1 should not work, too. Stripes with parity, i.e.
raid level 5, unfortunately cannot work yet because the current version of the
Device-Mapper driver does not support raid level 5. You may be able to use the
Software RAID / MD driver for raid level 5, see the next section for details.
To create the table describing your volume you will need to know each of its
components and their sizes in sectors, i.e. multiples of 512-byte blocks.
For NT4 fault tolerant volumes you can obtain the sizes using fdisk. So for
example if one of your partitions is /dev/hda2 you would do:
$ fdisk -ul /dev/hda
Disk /dev/hda: 81.9 GB, 81964302336 bytes
255 heads, 63 sectors/track, 9964 cylinders, total 160086528 sectors
Units = sectors of 1 * 512 = 512 bytes
Device Boot Start End Blocks Id System
/dev/hda1 * 63 4209029 2104483+ 83 Linux
/dev/hda2 4209030 37768814 16779892+ 86 NTFS
/dev/hda3 37768815 46170809 4200997+ 83 Linux
And you would know that /dev/hda2 has a size of 37768814 - 4209030 + 1 =
33559785 sectors.
For Win2k and later dynamic disks, you can for example use the ldminfo utility
which is part of the Linux LDM tools (the latest version at the time of
writing is linux-ldm-0.0.8.tar.bz2). You can download it from:
http://linux-ntfs.sourceforge.net/downloads.html
Simply extract the downloaded archive (tar xvjf linux-ldm-0.0.8.tar.bz2), go
into it (cd linux-ldm-0.0.8) and change to the test directory (cd test). You
will find the precompiled (i386) ldminfo utility there. NOTE: You will not be
able to compile this yourself easily so use the binary version!
Then you would use ldminfo in dump mode to obtain the necessary information:
$ ./ldminfo --dump /dev/hda
This would dump the LDM database found on /dev/hda which describes all of your
dynamic disks and all the volumes on them. At the bottom you will see the
VOLUME DEFINITIONS section which is all you really need. You may need to look
further above to determine which of the disks in the volume definitions is
which device in Linux. Hint: Run ldminfo on each of your dynamic disks and
look at the Disk Id close to the top of the output for each (the PRIVATE HEADER
section). You can then find these Disk Ids in the VBLK DATABASE section in the
<Disk> components where you will get the LDM Name for the disk that is found in
the VOLUME DEFINITIONS section.
Note you will also need to enable the LDM driver in the Linux kernel. If your
distribution did not enable it, you will need to recompile the kernel with it
enabled. This will create the LDM partitions on each device at boot time. You
would then use those devices (for /dev/hda they would be /dev/hda1, 2, 3, etc)
in the Device-Mapper table.
You can also bypass using the LDM driver by using the main device (e.g.
/dev/hda) and then using the offsets of the LDM partitions into this device as
the "Start sector of device" when creating the table. Once again ldminfo would
give you the correct information to do this.
Assuming you know all your devices and their sizes things are easy.
For a linear raid the table would look like this (note all values are in
512-byte sectors):
--- cut here ---
# Offset into Size of this Raid type Device Start sector
# volume device of device
0 1028161 linear /dev/hda1 0
1028161 3903762 linear /dev/hdb2 0
4931923 2103211 linear /dev/hdc1 0
--- cut here ---
For a striped volume, i.e. raid level 0, you will need to know the chunk size
you used when creating the volume. Windows uses 64kiB as the default, so it
will probably be this unless you changes the defaults when creating the array.
For a raid level 0 the table would look like this (note all values are in
512-byte sectors):
--- cut here ---
# Offset Size Raid Number Chunk 1st Start 2nd Start
# into of the type of size Device in Device in
# volume volume stripes device device
0 2056320 striped 2 128 /dev/hda1 0 /dev/hdb1 0
--- cut here ---
If there are more than two devices, just add each of them to the end of the
line.
Finally, for a mirrored volume, i.e. raid level 1, the table would look like
this (note all values are in 512-byte sectors):
--- cut here ---
# Ofs Size Raid Log Number Region Should Number Source Start Target Start
# in of the type type of log size sync? of Device in Device in
# vol volume params mirrors Device Device
0 2056320 mirror core 2 16 nosync 2 /dev/hda1 0 /dev/hdb1 0
--- cut here ---
If you are mirroring to multiple devices you can specify further targets at the
end of the line.
Note the "Should sync?" parameter "nosync" means that the two mirrors are
already in sync which will be the case on a clean shutdown of Windows. If the
mirrors are not clean, you can specify the "sync" option instead of "nosync"
and the Device-Mapper driver will then copy the entirey of the "Source Device"
to the "Target Device" or if you specified multipled target devices to all of
them.
Once you have your table, save it in a file somewhere (e.g. /etc/ntfsvolume1),
and hand it over to dmsetup to work with, like so:
$ dmsetup create myvolume1 /etc/ntfsvolume1
You can obviously replace "myvolume1" with whatever name you like.
If it all worked, you will now have the device /dev/device-mapper/myvolume1
which you can then just use as an argument to the mount command as usual to
mount the ntfs volume. For example:
$ mount -t ntfs -o ro /dev/device-mapper/myvolume1 /mnt/myvol1
(You need to create the directory /mnt/myvol1 first and of course you can use
anything you like instead of /mnt/myvol1 as long as it is an existing
directory.)
It is advisable to do the mount read-only to see if the volume has been setup
correctly to avoid the possibility of causing damage to the data on the ntfs
volume.
The Software RAID / MD driver
-----------------------------
An alternative to using the Device-Mapper driver is to use the kernel's
Software RAID / MD driver. For which you need to set up your /etc/raidtab
appropriately (see man 5 raidtab).
Linear volume sets, i.e. linear raid, as well as stripe sets, i.e. raid level
0, have been tested and work fine (though see section "Limitations when using
the MD driver with NTFS volumes" especially if you want to use linear raid).
Even though untested, there is no reason why mirrors, i.e. raid level 1, and
stripes with parity, i.e. raid level 5, should not work, too.
You have to use the "persistent-superblock 0" option for each raid-disk in the
NTFS volume/stripe you are configuring in /etc/raidtab as the persistent
superblock used by the MD driver would damage the NTFS volume.
Windows by default uses a stripe chunk size of 64k, so you probably want the
"chunk-size 64k" option for each raid-disk, too.
For example, if you have a stripe set consisting of two partitions /dev/hda5
and /dev/hdb1 your /etc/raidtab would look like this:
raiddev /dev/md0
raid-level 0
nr-raid-disks 2
nr-spare-disks 0
persistent-superblock 0
chunk-size 64k
device /dev/hda5
raid-disk 0
device /dev/hdb1
raid-disl 1
For linear raid, just change the raid-level above to "raid-level linear", for
mirrors, change it to "raid-level 1", and for stripe sets with parity, change
it to "raid-level 5".
Note for stripe sets with parity you will also need to tell the MD driver
which parity algorithm to use by specifying the option "parity-algorithm
which", where you need to replace "which" with the name of the algorithm to
use (see man 5 raidtab for available algorithms) and you will have to try the
different available algorithms until you find one that works. Make sure you
are working read-only when playing with this as you may damage your data
otherwise. If you find which algorithm works please let us know (email the
linux-ntfs developers list linux-ntfs-dev@lists.sourceforge.net or drop in on
IRC in channel #ntfs on the irc.freenode.net network) so we can update this
documentation.
Once the raidtab is setup, run for example raid0run -a to start all devices or
raid0run /dev/md0 to start a particular md device, in this case /dev/md0.
Then just use the mount command as usual to mount the ntfs volume using for
example: mount -t ntfs -o ro /dev/md0 /mnt/myntfsvolume
It is advisable to do the mount read-only to see if the md volume has been
setup correctly to avoid the possibility of causing damage to the data on the
ntfs volume.
Limitations when using the Software RAID / MD driver
-----------------------------------------------------
Using the md driver will not work properly if any of your NTFS partitions have
an odd number of sectors. This is especially important for linear raid as all
data after the first partition with an odd number of sectors will be offset by
one or more sectors so if you mount such a partition with write support you
will cause massive damage to the data on the volume which will only become
apparent when you try to use the volume again under Windows.
So when using linear raid, make sure that all your partitions have an even
number of sectors BEFORE attempting to use it. You have been warned!
Even better is to simply use the Device-Mapper for linear raid and then you do
not have this problem with odd numbers of sectors.
ChangeLog
=========
Note, a technical ChangeLog aimed at kernel hackers is in fs/ntfs/ChangeLog.
2.1.28:
- Fix a deadlock.
2.1.27:
- Implement page migration support so the kernel can move memory used
by NTFS files and directories around for management purposes.
- Add support for writing to sparse files created with Windows XP SP2.
- Many minor improvements and bug fixes.
2.1.26:
- Implement support for sector sizes above 512 bytes (up to the maximum
supported by NTFS which is 4096 bytes).
- Enhance support for NTFS volumes which were supported by Windows but
not by Linux due to invalid attribute list attribute flags.
- A few minor updates and bug fixes.
2.1.25:
- Write support is now extended with write(2) being able to both
overwrite existing file data and to extend files. Also, if a write
to a sparse region occurs, write(2) will fill in the hole. Note,
mmap(2) based writes still do not support writing into holes or
writing beyond the initialized size.
- Write support has a new feature and that is that truncate(2) and
open(2) with O_TRUNC are now implemented thus files can be both made
smaller and larger.
- Note: Both write(2) and truncate(2)/open(2) with O_TRUNC still have
limitations in that they
- only provide limited support for highly fragmented files.
- only work on regular, i.e. uncompressed and unencrypted files.
- never create sparse files although this will change once directory
operations are implemented.
- Lots of bug fixes and enhancements across the board.
2.1.24:
- Support journals ($LogFile) which have been modified by chkdsk. This
means users can boot into Windows after we marked the volume dirty.
The Windows boot will run chkdsk and then reboot. The user can then
immediately boot into Linux rather than having to do a full Windows
boot first before rebooting into Linux and we will recognize such a
journal and empty it as it is clean by definition.
- Support journals ($LogFile) with only one restart page as well as
journals with two different restart pages. We sanity check both and
either use the only sane one or the more recent one of the two in the
case that both are valid.
- Lots of bug fixes and enhancements across the board.
2.1.23:
- Stamp the user space journal, aka transaction log, aka $UsnJrnl, if
it is present and active thus telling Windows and applications using
the transaction log that changes can have happened on the volume
which are not recorded in $UsnJrnl.
- Detect the case when Windows has been hibernated (suspended to disk)
and if this is the case do not allow (re)mounting read-write to
prevent data corruption when you boot back into the suspended
Windows session.
- Implement extension of resident files using the normal file write
code paths, i.e. most very small files can be extended to be a little
bit bigger but not by much.
- Add new mount option "disable_sparse". (See list of mount options
above for details.)
- Improve handling of ntfs volumes with errors and strange boot sectors
in particular.
- Fix various bugs including a nasty deadlock that appeared in recent
kernels (around 2.6.11-2.6.12 timeframe).
2.1.22:
- Improve handling of ntfs volumes with errors.
- Fix various bugs and race conditions.
2.1.21:
- Fix several race conditions and various other bugs.
- Many internal cleanups, code reorganization, optimizations, and mft
and index record writing code rewritten to fit in with the changes.
- Update Documentation/filesystems/ntfs.txt with instructions on how to
use the Device-Mapper driver with NTFS ftdisk/LDM raid.
2.1.20:
- Fix two stupid bugs introduced in 2.1.18 release.
2.1.19:
- Minor bugfix in handling of the default upcase table.
- Many internal cleanups and improvements. Many thanks to Linus
Torvalds and Al Viro for the help and advice with the sparse
annotations and cleanups.
2.1.18:
- Fix scheduling latencies at mount time. (Ingo Molnar)
- Fix endianness bug in a little traversed portion of the attribute
lookup code.
2.1.17:
- Fix bugs in mount time error code paths.
2.1.16:
- Implement access time updates (including mtime and ctime).
- Implement fsync(2), fdatasync(2), and msync(2) system calls.
- Enable the readv(2) and writev(2) system calls.
- Enable access via the asynchronous io (aio) API by adding support for
the aio_read(3) and aio_write(3) functions.
2.1.15:
- Invalidate quotas when (re)mounting read-write.
NOTE: This now only leave user space journalling on the side. (See
note for version 2.1.13, below.)
2.1.14:
- Fix an NFSd caused deadlock reported by several users.
2.1.13:
- Implement writing of inodes (access time updates are not implemented
yet so mounting with -o noatime,nodiratime is enforced).
- Enable writing out of resident files so you can now overwrite any
uncompressed, unencrypted, nonsparse file as long as you do not
change the file size.
- Add housekeeping of ntfs system files so that ntfsfix no longer needs
to be run after writing to an NTFS volume.
NOTE: This still leaves quota tracking and user space journalling on
the side but they should not cause data corruption. In the worst
case the charged quotas will be out of date ($Quota) and some
userspace applications might get confused due to the out of date
userspace journal ($UsnJrnl).
2.1.12:
- Fix the second fix to the decompression engine from the 2.1.9 release
and some further internals cleanups.
2.1.11:
- Driver internal cleanups.
2.1.10:
- Force read-only (re)mounting of volumes with unsupported volume
flags and various cleanups.
2.1.9:
- Fix two bugs in handling of corner cases in the decompression engine.
2.1.8:
- Read the $MFT mirror and compare it to the $MFT and if the two do not
match, force a read-only mount and do not allow read-write remounts.
- Read and parse the $LogFile journal and if it indicates that the
volume was not shutdown cleanly, force a read-only mount and do not
allow read-write remounts. If the $LogFile indicates a clean
shutdown and a read-write (re)mount is requested, empty $LogFile to
ensure that Windows cannot cause data corruption by replaying a stale
journal after Linux has written to the volume.
- Improve time handling so that the NTFS time is fully preserved when
converted to kernel time and only up to 99 nano-seconds are lost when
kernel time is converted to NTFS time.
2.1.7:
- Enable NFS exporting of mounted NTFS volumes.
2.1.6:
- Fix minor bug in handling of compressed directories that fixes the
erroneous "du" and "stat" output people reported.
2.1.5:
- Minor bug fix in attribute list attribute handling that fixes the
I/O errors on "ls" of certain fragmented files found by at least two
people running Windows XP.
2.1.4:
- Minor update allowing compilation with all gcc versions (well, the
ones the kernel can be compiled with anyway).
2.1.3:
- Major bug fixes for reading files and volumes in corner cases which
were being hit by Windows 2k/XP users.
2.1.2:
- Major bug fixes alleviating the hangs in statfs experienced by some
users.
2.1.1:
- Update handling of compressed files so people no longer get the
frequently reported warning messages about initialized_size !=
data_size.
2.1.0:
- Add configuration option for developmental write support.
- Initial implementation of file overwriting. (Writes to resident files
are not written out to disk yet, so avoid writing to files smaller
than about 1kiB.)
- Intercept/abort changes in file size as they are not implemented yet.
2.0.25:
- Minor bugfixes in error code paths and small cleanups.
2.0.24:
- Small internal cleanups.
- Support for sendfile system call. (Christoph Hellwig)
2.0.23:
- Massive internal locking changes to mft record locking. Fixes
various race conditions and deadlocks.
- Fix ntfs over loopback for compressed files by adding an
optimization barrier. (gcc was screwing up otherwise ?)
Thanks go to Christoph Hellwig for pointing these two out:
- Remove now unused function fs/ntfs/malloc.h::vmalloc_nofs().
- Fix ntfs_free() for ia64 and parisc.
2.0.22:
- Small internal cleanups.
2.0.21:
These only affect 32-bit architectures:
- Check for, and refuse to mount too large volumes (maximum is 2TiB).
- Check for, and refuse to open too large files and directories
(maximum is 16TiB).
2.0.20:
- Support non-resident directory index bitmaps. This means we now cope
with huge directories without problems.
- Fix a page leak that manifested itself in some cases when reading
directory contents.
- Internal cleanups.
2.0.19:
- Fix race condition and improvements in block i/o interface.
- Optimization when reading compressed files.
2.0.18:
- Fix race condition in reading of compressed files.
2.0.17:
- Cleanups and optimizations.
2.0.16:
- Fix stupid bug introduced in 2.0.15 in new attribute inode API.
- Big internal cleanup replacing the mftbmp access hacks by using the
new attribute inode API instead.
2.0.15:
- Bug fix in parsing of remount options.
- Internal changes implementing attribute (fake) inodes allowing all
attribute i/o to go via the page cache and to use all the normal
vfs/mm functionality.
2.0.14:
- Internal changes improving run list merging code and minor locking
change to not rely on BKL in ntfs_statfs().
2.0.13:
- Internal changes towards using iget5_locked() in preparation for
fake inodes and small cleanups to ntfs_volume structure.
2.0.12:
- Internal cleanups in address space operations made possible by the
changes introduced in the previous release.
2.0.11:
- Internal updates and cleanups introducing the first step towards
fake inode based attribute i/o.
2.0.10:
- Microsoft says that the maximum number of inodes is 2^32 - 1. Update
the driver accordingly to only use 32-bits to store inode numbers on
32-bit architectures. This improves the speed of the driver a little.
2.0.9:
- Change decompression engine to use a single buffer. This should not
affect performance except perhaps on the most heavy i/o on SMP
systems when accessing multiple compressed files from multiple
devices simultaneously.
- Minor updates and cleanups.
2.0.8:
- Remove now obsolete show_inodes and posix mount option(s).
- Restore show_sys_files mount option.
- Add new mount option case_sensitive, to determine if the driver
treats file names as case sensitive or not.
- Mostly drop support for short file names (for backwards compatibility
we only support accessing files via their short file name if one
exists).
- Fix dcache aliasing issues wrt short/long file names.
- Cleanups and minor fixes.
2.0.7:
- Just cleanups.
2.0.6:
- Major bugfix to make compatible with other kernel changes. This fixes
the hangs/oopses on umount.
- Locking cleanup in directory operations (remove BKL usage).
2.0.5:
- Major buffer overflow bug fix.
- Minor cleanups and updates for kernel 2.5.12.
2.0.4:
- Cleanups and updates for kernel 2.5.11.
2.0.3:
- Small bug fixes, cleanups, and performance improvements.
2.0.2:
- Use default fmask of 0177 so that files are no executable by default.
If you want owner executable files, just use fmask=0077.
- Update for kernel 2.5.9 but preserve backwards compatibility with
kernel 2.5.7.
- Minor bug fixes, cleanups, and updates.
2.0.1:
- Minor updates, primarily set the executable bit by default on files
so they can be executed.
2.0.0:
- Started ChangeLog.

View File

@@ -0,0 +1,59 @@
OCFS2 filesystem
==================
OCFS2 is a general purpose extent based shared disk cluster file
system with many similarities to ext3. It supports 64 bit inode
numbers, and has automatically extending metadata groups which may
also make it attractive for non-clustered use.
You'll want to install the ocfs2-tools package in order to at least
get "mount.ocfs2" and "ocfs2_hb_ctl".
Project web page: http://oss.oracle.com/projects/ocfs2
Tools web page: http://oss.oracle.com/projects/ocfs2-tools
OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/
All code copyright 2005 Oracle except when otherwise noted.
CREDITS:
Lots of code taken from ext3 and other projects.
Authors in alphabetical order:
Joel Becker <joel.becker@oracle.com>
Zach Brown <zach.brown@oracle.com>
Mark Fasheh <mark.fasheh@oracle.com>
Kurt Hackel <kurt.hackel@oracle.com>
Sunil Mushran <sunil.mushran@oracle.com>
Manish Singh <manish.singh@oracle.com>
Caveats
=======
Features which OCFS2 does not support yet:
- sparse files
- extended attributes
- shared writable mmap
- loopback is supported, but data written will not
be cluster coherent.
- quotas
- cluster aware flock
- cluster aware lockf
- Directory change notification (F_NOTIFY)
- Distributed Caching (F_SETLEASE/F_GETLEASE/break_lease)
- POSIX ACLs
- readpages / writepages (not user visible)
Mount options
=============
OCFS2 supports the following mount options:
(*) == default
barrier=1 This enables/disables barriers. barrier=0 disables it,
barrier=1 enables it.
errors=remount-ro(*) Remount the filesystem read-only on an error.
errors=panic Panic and halt the machine if an error occurs.
intr (*) Allow signals to interrupt cluster operations.
nointr Do not allow signals to interrupt cluster
operations.
atime_quantum=60(*) OCFS2 will not update atime unless this number
of seconds has passed since the last update.
Set to zero to always update atime.

View File

@@ -0,0 +1,267 @@
Changes since 2.5.0:
---
[recommended]
New helpers: sb_bread(), sb_getblk(), sb_find_get_block(), set_bh(),
sb_set_blocksize() and sb_min_blocksize().
Use them.
(sb_find_get_block() replaces 2.4's get_hash_table())
---
[recommended]
New methods: ->alloc_inode() and ->destroy_inode().
Remove inode->u.foo_inode_i
Declare
struct foo_inode_info {
/* fs-private stuff */
struct inode vfs_inode;
};
static inline struct foo_inode_info *FOO_I(struct inode *inode)
{
return list_entry(inode, struct foo_inode_info, vfs_inode);
}
Use FOO_I(inode) instead of &inode->u.foo_inode_i;
Add foo_alloc_inode() and foo_destory_inode() - the former should allocate
foo_inode_info and return the address of ->vfs_inode, the latter should free
FOO_I(inode) (see in-tree filesystems for examples).
Make them ->alloc_inode and ->destroy_inode in your super_operations.
Keep in mind that now you need explicit initialization of private data -
typically in ->read_inode() and after getting an inode from new_inode().
At some point that will become mandatory.
---
[mandatory]
Change of file_system_type method (->read_super to ->get_sb)
->read_super() is no more. Ditto for DECLARE_FSTYPE and DECLARE_FSTYPE_DEV.
Turn your foo_read_super() into a function that would return 0 in case of
success and negative number in case of error (-EINVAL unless you have more
informative error value to report). Call it foo_fill_super(). Now declare
int foo_get_sb(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data, struct vfsmount *mnt)
{
return get_sb_bdev(fs_type, flags, dev_name, data, foo_fill_super,
mnt);
}
(or similar with s/bdev/nodev/ or s/bdev/single/, depending on the kind of
filesystem).
Replace DECLARE_FSTYPE... with explicit initializer and have ->get_sb set as
foo_get_sb.
---
[mandatory]
Locking change: ->s_vfs_rename_sem is taken only by cross-directory renames.
Most likely there is no need to change anything, but if you relied on
global exclusion between renames for some internal purpose - you need to
change your internal locking. Otherwise exclusion warranties remain the
same (i.e. parents and victim are locked, etc.).
---
[informational]
Now we have the exclusion between ->lookup() and directory removal (by
->rmdir() and ->rename()). If you used to need that exclusion and do
it by internal locking (most of filesystems couldn't care less) - you
can relax your locking.
---
[mandatory]
->lookup(), ->truncate(), ->create(), ->unlink(), ->mknod(), ->mkdir(),
->rmdir(), ->link(), ->lseek(), ->symlink(), ->rename()
and ->readdir() are called without BKL now. Grab it on entry, drop upon return
- that will guarantee the same locking you used to have. If your method or its
parts do not need BKL - better yet, now you can shift lock_kernel() and
unlock_kernel() so that they would protect exactly what needs to be
protected.
---
[mandatory]
BKL is also moved from around sb operations. ->write_super() Is now called
without BKL held. BKL should have been shifted into individual fs sb_op
functions. If you don't need it, remove it.
---
[informational]
check for ->link() target not being a directory is done by callers. Feel
free to drop it...
---
[informational]
->link() callers hold ->i_sem on the object we are linking to. Some of your
problems might be over...
---
[mandatory]
new file_system_type method - kill_sb(superblock). If you are converting
an existing filesystem, set it according to ->fs_flags:
FS_REQUIRES_DEV - kill_block_super
FS_LITTER - kill_litter_super
neither - kill_anon_super
FS_LITTER is gone - just remove it from fs_flags.
---
[mandatory]
FS_SINGLE is gone (actually, that had happened back when ->get_sb()
went in - and hadn't been documented ;-/). Just remove it from fs_flags
(and see ->get_sb() entry for other actions).
---
[mandatory]
->setattr() is called without BKL now. Caller _always_ holds ->i_sem, so
watch for ->i_sem-grabbing code that might be used by your ->setattr().
Callers of notify_change() need ->i_sem now.
---
[recommended]
New super_block field "struct export_operations *s_export_op" for
explicit support for exporting, e.g. via NFS. The structure is fully
documented at its declaration in include/linux/fs.h, and in
Documentation/filesystems/Exporting.
Briefly it allows for the definition of decode_fh and encode_fh operations
to encode and decode filehandles, and allows the filesystem to use
a standard helper function for decode_fh, and provide file-system specific
support for this helper, particularly get_parent.
It is planned that this will be required for exporting once the code
settles down a bit.
[mandatory]
s_export_op is now required for exporting a filesystem.
isofs, ext2, ext3, resierfs, fat
can be used as examples of very different filesystems.
---
[mandatory]
iget4() and the read_inode2 callback have been superseded by iget5_locked()
which has the following prototype,
struct inode *iget5_locked(struct super_block *sb, unsigned long ino,
int (*test)(struct inode *, void *),
int (*set)(struct inode *, void *),
void *data);
'test' is an additional function that can be used when the inode
number is not sufficient to identify the actual file object. 'set'
should be a non-blocking function that initializes those parts of a
newly created inode to allow the test function to succeed. 'data' is
passed as an opaque value to both test and set functions.
When the inode has been created by iget5_locked(), it will be returned with
the I_NEW flag set and will still be locked. read_inode has not been
called so the file system still has to finalize the initialization. Once
the inode is initialized it must be unlocked by calling unlock_new_inode().
The filesystem is responsible for setting (and possibly testing) i_ino
when appropriate. There is also a simpler iget_locked function that
just takes the superblock and inode number as arguments and does the
test and set for you.
e.g.
inode = iget_locked(sb, ino);
if (inode->i_state & I_NEW) {
read_inode_from_disk(inode);
unlock_new_inode(inode);
}
---
[recommended]
->getattr() finally getting used. See instances in nfs, minix, etc.
---
[mandatory]
->revalidate() is gone. If your filesystem had it - provide ->getattr()
and let it call whatever you had as ->revlidate() + (for symlinks that
had ->revalidate()) add calls in ->follow_link()/->readlink().
---
[mandatory]
->d_parent changes are not protected by BKL anymore. Read access is safe
if at least one of the following is true:
* filesystem has no cross-directory rename()
* dcache_lock is held
* we know that parent had been locked (e.g. we are looking at
->d_parent of ->lookup() argument).
* we are called from ->rename().
* the child's ->d_lock is held
Audit your code and add locking if needed. Notice that any place that is
not protected by the conditions above is risky even in the old tree - you
had been relying on BKL and that's prone to screwups. Old tree had quite
a few holes of that kind - unprotected access to ->d_parent leading to
anything from oops to silent memory corruption.
---
[mandatory]
FS_NOMOUNT is gone. If you use it - just set MS_NOUSER in flags
(see rootfs for one kind of solution and bdev/socket/pipe for another).
---
[recommended]
Use bdev_read_only(bdev) instead of is_read_only(kdev). The latter
is still alive, but only because of the mess in drivers/s390/block/dasd.c.
As soon as it gets fixed is_read_only() will die.
---
[mandatory]
->permission() is called without BKL now. Grab it on entry, drop upon
return - that will guarantee the same locking you used to have. If
your method or its parts do not need BKL - better yet, now you can
shift lock_kernel() and unlock_kernel() so that they would protect
exactly what needs to be protected.
---
[mandatory]
->statfs() is now called without BKL held. BKL should have been
shifted into individual fs sb_op functions where it's not clear that
it's safe to remove it. If you don't need it, remove it.
---
[mandatory]
is_read_only() is gone; use bdev_read_only() instead.
---
[mandatory]
destroy_buffers() is gone; use invalidate_bdev().
---
[mandatory]
fsync_dev() is gone; use fsync_bdev(). NOTE: lvm breakage is
deliberate; as soon as struct block_device * is propagated in a reasonable
way by that code fixing will become trivial; until then nothing can be
done.

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,355 @@
ramfs, rootfs and initramfs
October 17, 2005
Rob Landley <rob@landley.net>
=============================
What is ramfs?
--------------
Ramfs is a very simple filesystem that exports Linux's disk caching
mechanisms (the page cache and dentry cache) as a dynamically resizable
ram-based filesystem.
Normally all files are cached in memory by Linux. Pages of data read from
backing store (usually the block device the filesystem is mounted on) are kept
around in case it's needed again, but marked as clean (freeable) in case the
Virtual Memory system needs the memory for something else. Similarly, data
written to files is marked clean as soon as it has been written to backing
store, but kept around for caching purposes until the VM reallocates the
memory. A similar mechanism (the dentry cache) greatly speeds up access to
directories.
With ramfs, there is no backing store. Files written into ramfs allocate
dentries and page cache as usual, but there's nowhere to write them to.
This means the pages are never marked clean, so they can't be freed by the
VM when it's looking to recycle memory.
The amount of code required to implement ramfs is tiny, because all the
work is done by the existing Linux caching infrastructure. Basically,
you're mounting the disk cache as a filesystem. Because of this, ramfs is not
an optional component removable via menuconfig, since there would be negligible
space savings.
ramfs and ramdisk:
------------------
The older "ram disk" mechanism created a synthetic block device out of
an area of ram and used it as backing store for a filesystem. This block
device was of fixed size, so the filesystem mounted on it was of fixed
size. Using a ram disk also required unnecessarily copying memory from the
fake block device into the page cache (and copying changes back out), as well
as creating and destroying dentries. Plus it needed a filesystem driver
(such as ext2) to format and interpret this data.
Compared to ramfs, this wastes memory (and memory bus bandwidth), creates
unnecessary work for the CPU, and pollutes the CPU caches. (There are tricks
to avoid this copying by playing with the page tables, but they're unpleasantly
complicated and turn out to be about as expensive as the copying anyway.)
More to the point, all the work ramfs is doing has to happen _anyway_,
since all file access goes through the page and dentry caches. The ram
disk is simply unnecessary, ramfs is internally much simpler.
Another reason ramdisks are semi-obsolete is that the introduction of
loopback devices offered a more flexible and convenient way to create
synthetic block devices, now from files instead of from chunks of memory.
See losetup (8) for details.
ramfs and tmpfs:
----------------
One downside of ramfs is you can keep writing data into it until you fill
up all memory, and the VM can't free it because the VM thinks that files
should get written to backing store (rather than swap space), but ramfs hasn't
got any backing store. Because of this, only root (or a trusted user) should
be allowed write access to a ramfs mount.
A ramfs derivative called tmpfs was created to add size limits, and the ability
to write the data to swap space. Normal users can be allowed write access to
tmpfs mounts. See Documentation/filesystems/tmpfs.txt for more information.
What is rootfs?
---------------
Rootfs is a special instance of ramfs (or tmpfs, if that's enabled), which is
always present in 2.6 systems. You can't unmount rootfs for approximately the
same reason you can't kill the init process; rather than having special code
to check for and handle an empty list, it's smaller and simpler for the kernel
to just make sure certain lists can't become empty.
Most systems just mount another filesystem over rootfs and ignore it. The
amount of space an empty instance of ramfs takes up is tiny.
What is initramfs?
------------------
All 2.6 Linux kernels contain a gzipped "cpio" format archive, which is
extracted into rootfs when the kernel boots up. After extracting, the kernel
checks to see if rootfs contains a file "init", and if so it executes it as PID
1. If found, this init process is responsible for bringing the system the
rest of the way up, including locating and mounting the real root device (if
any). If rootfs does not contain an init program after the embedded cpio
archive is extracted into it, the kernel will fall through to the older code
to locate and mount a root partition, then exec some variant of /sbin/init
out of that.
All this differs from the old initrd in several ways:
- The old initrd was always a separate file, while the initramfs archive is
linked into the linux kernel image. (The directory linux-*/usr is devoted
to generating this archive during the build.)
- The old initrd file was a gzipped filesystem image (in some file format,
such as ext2, that needed a driver built into the kernel), while the new
initramfs archive is a gzipped cpio archive (like tar only simpler,
see cpio(1) and Documentation/early-userspace/buffer-format.txt). The
kernel's cpio extraction code is not only extremely small, it's also
__init data that can be discarded during the boot process.
- The program run by the old initrd (which was called /initrd, not /init) did
some setup and then returned to the kernel, while the init program from
initramfs is not expected to return to the kernel. (If /init needs to hand
off control it can overmount / with a new root device and exec another init
program. See the switch_root utility, below.)
- When switching another root device, initrd would pivot_root and then
umount the ramdisk. But initramfs is rootfs: you can neither pivot_root
rootfs, nor unmount it. Instead delete everything out of rootfs to
free up the space (find -xdev / -exec rm '{}' ';'), overmount rootfs
with the new root (cd /newmount; mount --move . /; chroot .), attach
stdin/stdout/stderr to the new /dev/console, and exec the new init.
Since this is a remarkably persnickity process (and involves deleting
commands before you can run them), the klibc package introduced a helper
program (utils/run_init.c) to do all this for you. Most other packages
(such as busybox) have named this command "switch_root".
Populating initramfs:
---------------------
The 2.6 kernel build process always creates a gzipped cpio format initramfs
archive and links it into the resulting kernel binary. By default, this
archive is empty (consuming 134 bytes on x86).
The config option CONFIG_INITRAMFS_SOURCE (for some reason buried under
devices->block devices in menuconfig, and living in usr/Kconfig) can be used
to specify a source for the initramfs archive, which will automatically be
incorporated into the resulting binary. This option can point to an existing
gzipped cpio archive, a directory containing files to be archived, or a text
file specification such as the following example:
dir /dev 755 0 0
nod /dev/console 644 0 0 c 5 1
nod /dev/loop0 644 0 0 b 7 0
dir /bin 755 1000 1000
slink /bin/sh busybox 777 0 0
file /bin/busybox initramfs/busybox 755 0 0
dir /proc 755 0 0
dir /sys 755 0 0
dir /mnt 755 0 0
file /init initramfs/init.sh 755 0 0
Run "usr/gen_init_cpio" (after the kernel build) to get a usage message
documenting the above file format.
One advantage of the configuration file is that root access is not required to
set permissions or create device nodes in the new archive. (Note that those
two example "file" entries expect to find files named "init.sh" and "busybox" in
a directory called "initramfs", under the linux-2.6.* directory. See
Documentation/early-userspace/README for more details.)
The kernel does not depend on external cpio tools. If you specify a
directory instead of a configuration file, the kernel's build infrastructure
creates a configuration file from that directory (usr/Makefile calls
scripts/gen_initramfs_list.sh), and proceeds to package up that directory
using the config file (by feeding it to usr/gen_init_cpio, which is created
from usr/gen_init_cpio.c). The kernel's build-time cpio creation code is
entirely self-contained, and the kernel's boot-time extractor is also
(obviously) self-contained.
The one thing you might need external cpio utilities installed for is creating
or extracting your own preprepared cpio files to feed to the kernel build
(instead of a config file or directory).
The following command line can extract a cpio image (either by the above script
or by the kernel build) back into its component files:
cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames
The following shell script can create a prebuilt cpio archive you can
use in place of the above config file:
#!/bin/sh
# Copyright 2006 Rob Landley <rob@landley.net> and TimeSys Corporation.
# Licensed under GPL version 2
if [ $# -ne 2 ]
then
echo "usage: mkinitramfs directory imagename.cpio.gz"
exit 1
fi
if [ -d "$1" ]
then
echo "creating $2 from $1"
(cd "$1"; find . | cpio -o -H newc | gzip) > "$2"
else
echo "First argument must be a directory"
exit 1
fi
Note: The cpio man page contains some bad advice that will break your initramfs
archive if you follow it. It says "A typical way to generate the list
of filenames is with the find command; you should give find the -depth option
to minimize problems with permissions on directories that are unwritable or not
searchable." Don't do this when creating initramfs.cpio.gz images, it won't
work. The Linux kernel cpio extractor won't create files in a directory that
doesn't exist, so the directory entries must go before the files that go in
those directories. The above script gets them in the right order.
External initramfs images:
--------------------------
If the kernel has initrd support enabled, an external cpio.gz archive can also
be passed into a 2.6 kernel in place of an initrd. In this case, the kernel
will autodetect the type (initramfs, not initrd) and extract the external cpio
archive into rootfs before trying to run /init.
This has the memory efficiency advantages of initramfs (no ramdisk block
device) but the separate packaging of initrd (which is nice if you have
non-GPL code you'd like to run from initramfs, without conflating it with
the GPL licensed Linux kernel binary).
It can also be used to supplement the kernel's built-in initamfs image. The
files in the external archive will overwrite any conflicting files in
the built-in initramfs archive. Some distributors also prefer to customize
a single kernel image with task-specific initramfs images, without recompiling.
Contents of initramfs:
----------------------
An initramfs archive is a complete self-contained root filesystem for Linux.
If you don't already understand what shared libraries, devices, and paths
you need to get a minimal root filesystem up and running, here are some
references:
http://www.tldp.org/HOWTO/Bootdisk-HOWTO/
http://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html
http://www.linuxfromscratch.org/lfs/view/stable/
The "klibc" package (http://www.kernel.org/pub/linux/libs/klibc) is
designed to be a tiny C library to statically link early userspace
code against, along with some related utilities. It is BSD licensed.
I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net)
myself. These are LGPL and GPL, respectively. (A self-contained initramfs
package is planned for the busybox 1.3 release.)
In theory you could use glibc, but that's not well suited for small embedded
uses like this. (A "hello world" program statically linked against glibc is
over 400k. With uClibc it's 7k. Also note that glibc dlopens libnss to do
name lookups, even when otherwise statically linked.)
A good first step is to get initramfs to run a statically linked "hello world"
program as init, and test it under an emulator like qemu (www.qemu.org) or
User Mode Linux, like so:
cat > hello.c << EOF
#include <stdio.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
printf("Hello world!\n");
sleep(999999999);
}
EOF
gcc -static hello2.c -o init
echo init | cpio -o -H newc | gzip > test.cpio.gz
# Testing external initramfs using the initrd loading mechanism.
qemu -kernel /boot/vmlinuz -initrd test.cpio.gz /dev/zero
When debugging a normal root filesystem, it's nice to be able to boot with
"init=/bin/sh". The initramfs equivalent is "rdinit=/bin/sh", and it's
just as useful.
Why cpio rather than tar?
-------------------------
This decision was made back in December, 2001. The discussion started here:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1538.html
And spawned a second thread (specifically on tar vs cpio), starting here:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1587.html
The quick and dirty summary version (which is no substitute for reading
the above threads) is:
1) cpio is a standard. It's decades old (from the AT&T days), and already
widely used on Linux (inside RPM, Red Hat's device driver disks). Here's
a Linux Journal article about it from 1996:
http://www.linuxjournal.com/article/1213
It's not as popular as tar because the traditional cpio command line tools
require _truly_hideous_ command line arguments. But that says nothing
either way about the archive format, and there are alternative tools,
such as:
http://freshmeat.net/projects/afio/
2) The cpio archive format chosen by the kernel is simpler and cleaner (and
thus easier to create and parse) than any of the (literally dozens of)
various tar archive formats. The complete initramfs archive format is
explained in buffer-format.txt, created in usr/gen_init_cpio.c, and
extracted in init/initramfs.c. All three together come to less than 26k
total of human-readable text.
3) The GNU project standardizing on tar is approximately as relevant as
Windows standardizing on zip. Linux is not part of either, and is free
to make its own technical decisions.
4) Since this is a kernel internal format, it could easily have been
something brand new. The kernel provides its own tools to create and
extract this format anyway. Using an existing standard was preferable,
but not essential.
5) Al Viro made the decision (quote: "tar is ugly as hell and not going to be
supported on the kernel side"):
http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1540.html
explained his reasoning:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1550.html
http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1638.html
and, most importantly, designed and implemented the initramfs code.
Future directions:
------------------
Today (2.6.16), initramfs is always compiled in, but not always used. The
kernel falls back to legacy boot code that is reached only if initramfs does
not contain an /init program. The fallback is legacy code, there to ensure a
smooth transition and allowing early boot functionality to gradually move to
"early userspace" (I.E. initramfs).
The move to early userspace is necessary because finding and mounting the real
root device is complex. Root partitions can span multiple devices (raid or
separate journal). They can be out on the network (requiring dhcp, setting a
specific mac address, logging into a server, etc). They can live on removable
media, with dynamically allocated major/minor numbers and persistent naming
issues requiring a full udev implementation to sort out. They can be
compressed, encrypted, copy-on-write, loopback mounted, strangely partitioned,
and so on.
This kind of complexity (which inevitably includes policy) is rightly handled
in userspace. Both klibc and busybox/uClibc are working on simple initramfs
packages to drop into a kernel build.
The klibc package has now been accepted into Andrew Morton's 2.6.17-mm tree.
The kernel's current early boot code (partition detection, etc) will probably
be migrated into a default initramfs, automatically created and used by the
kernel build.

View File

@@ -0,0 +1,484 @@
relay interface (formerly relayfs)
==================================
The relay interface provides a means for kernel applications to
efficiently log and transfer large quantities of data from the kernel
to userspace via user-defined 'relay channels'.
A 'relay channel' is a kernel->user data relay mechanism implemented
as a set of per-cpu kernel buffers ('channel buffers'), each
represented as a regular file ('relay file') in user space. Kernel
clients write into the channel buffers using efficient write
functions; these automatically log into the current cpu's channel
buffer. User space applications mmap() or read() from the relay files
and retrieve the data as it becomes available. The relay files
themselves are files created in a host filesystem, e.g. debugfs, and
are associated with the channel buffers using the API described below.
The format of the data logged into the channel buffers is completely
up to the kernel client; the relay interface does however provide
hooks which allow kernel clients to impose some structure on the
buffer data. The relay interface doesn't implement any form of data
filtering - this also is left to the kernel client. The purpose is to
keep things as simple as possible.
This document provides an overview of the relay interface API. The
details of the function parameters are documented along with the
functions in the relay interface code - please see that for details.
Semantics
=========
Each relay channel has one buffer per CPU, each buffer has one or more
sub-buffers. Messages are written to the first sub-buffer until it is
too full to contain a new message, in which case it it is written to
the next (if available). Messages are never split across sub-buffers.
At this point, userspace can be notified so it empties the first
sub-buffer, while the kernel continues writing to the next.
When notified that a sub-buffer is full, the kernel knows how many
bytes of it are padding i.e. unused space occurring because a complete
message couldn't fit into a sub-buffer. Userspace can use this
knowledge to copy only valid data.
After copying it, userspace can notify the kernel that a sub-buffer
has been consumed.
A relay channel can operate in a mode where it will overwrite data not
yet collected by userspace, and not wait for it to be consumed.
The relay channel itself does not provide for communication of such
data between userspace and kernel, allowing the kernel side to remain
simple and not impose a single interface on userspace. It does
provide a set of examples and a separate helper though, described
below.
The read() interface both removes padding and internally consumes the
read sub-buffers; thus in cases where read(2) is being used to drain
the channel buffers, special-purpose communication between kernel and
user isn't necessary for basic operation.
One of the major goals of the relay interface is to provide a low
overhead mechanism for conveying kernel data to userspace. While the
read() interface is easy to use, it's not as efficient as the mmap()
approach; the example code attempts to make the tradeoff between the
two approaches as small as possible.
klog and relay-apps example code
================================
The relay interface itself is ready to use, but to make things easier,
a couple simple utility functions and a set of examples are provided.
The relay-apps example tarball, available on the relay sourceforge
site, contains a set of self-contained examples, each consisting of a
pair of .c files containing boilerplate code for each of the user and
kernel sides of a relay application. When combined these two sets of
boilerplate code provide glue to easily stream data to disk, without
having to bother with mundane housekeeping chores.
The 'klog debugging functions' patch (klog.patch in the relay-apps
tarball) provides a couple of high-level logging functions to the
kernel which allow writing formatted text or raw data to a channel,
regardless of whether a channel to write into exists or not, or even
whether the relay interface is compiled into the kernel or not. These
functions allow you to put unconditional 'trace' statements anywhere
in the kernel or kernel modules; only when there is a 'klog handler'
registered will data actually be logged (see the klog and kleak
examples for details).
It is of course possible to use the relay interface from scratch,
i.e. without using any of the relay-apps example code or klog, but
you'll have to implement communication between userspace and kernel,
allowing both to convey the state of buffers (full, empty, amount of
padding). The read() interface both removes padding and internally
consumes the read sub-buffers; thus in cases where read(2) is being
used to drain the channel buffers, special-purpose communication
between kernel and user isn't necessary for basic operation. Things
such as buffer-full conditions would still need to be communicated via
some channel though.
klog and the relay-apps examples can be found in the relay-apps
tarball on http://relayfs.sourceforge.net
The relay interface user space API
==================================
The relay interface implements basic file operations for user space
access to relay channel buffer data. Here are the file operations
that are available and some comments regarding their behavior:
open() enables user to open an _existing_ channel buffer.
mmap() results in channel buffer being mapped into the caller's
memory space. Note that you can't do a partial mmap - you
must map the entire file, which is NRBUF * SUBBUFSIZE.
read() read the contents of a channel buffer. The bytes read are
'consumed' by the reader, i.e. they won't be available
again to subsequent reads. If the channel is being used
in no-overwrite mode (the default), it can be read at any
time even if there's an active kernel writer. If the
channel is being used in overwrite mode and there are
active channel writers, results may be unpredictable -
users should make sure that all logging to the channel has
ended before using read() with overwrite mode. Sub-buffer
padding is automatically removed and will not be seen by
the reader.
sendfile() transfer data from a channel buffer to an output file
descriptor. Sub-buffer padding is automatically removed
and will not be seen by the reader.
poll() POLLIN/POLLRDNORM/POLLERR supported. User applications are
notified when sub-buffer boundaries are crossed.
close() decrements the channel buffer's refcount. When the refcount
reaches 0, i.e. when no process or kernel client has the
buffer open, the channel buffer is freed.
In order for a user application to make use of relay files, the
host filesystem must be mounted. For example,
mount -t debugfs debugfs /debug
NOTE: the host filesystem doesn't need to be mounted for kernel
clients to create or use channels - it only needs to be
mounted when user space applications need access to the buffer
data.
The relay interface kernel API
==============================
Here's a summary of the API the relay interface provides to in-kernel clients:
TBD(curr. line MT:/API/)
channel management functions:
relay_open(base_filename, parent, subbuf_size, n_subbufs,
callbacks, private_data)
relay_close(chan)
relay_flush(chan)
relay_reset(chan)
channel management typically called on instigation of userspace:
relay_subbufs_consumed(chan, cpu, subbufs_consumed)
write functions:
relay_write(chan, data, length)
__relay_write(chan, data, length)
relay_reserve(chan, length)
callbacks:
subbuf_start(buf, subbuf, prev_subbuf, prev_padding)
buf_mapped(buf, filp)
buf_unmapped(buf, filp)
create_buf_file(filename, parent, mode, buf, is_global)
remove_buf_file(dentry)
helper functions:
relay_buf_full(buf)
subbuf_start_reserve(buf, length)
Creating a channel
------------------
relay_open() is used to create a channel, along with its per-cpu
channel buffers. Each channel buffer will have an associated file
created for it in the host filesystem, which can be and mmapped or
read from in user space. The files are named basename0...basenameN-1
where N is the number of online cpus, and by default will be created
in the root of the filesystem (if the parent param is NULL). If you
want a directory structure to contain your relay files, you should
create it using the host filesystem's directory creation function,
e.g. debugfs_create_dir(), and pass the parent directory to
relay_open(). Users are responsible for cleaning up any directory
structure they create, when the channel is closed - again the host
filesystem's directory removal functions should be used for that,
e.g. debugfs_remove().
In order for a channel to be created and the host filesystem's files
associated with its channel buffers, the user must provide definitions
for two callback functions, create_buf_file() and remove_buf_file().
create_buf_file() is called once for each per-cpu buffer from
relay_open() and allows the user to create the file which will be used
to represent the corresponding channel buffer. The callback should
return the dentry of the file created to represent the channel buffer.
remove_buf_file() must also be defined; it's responsible for deleting
the file(s) created in create_buf_file() and is called during
relay_close().
Here are some typical definitions for these callbacks, in this case
using debugfs:
/*
* create_buf_file() callback. Creates relay file in debugfs.
*/
static struct dentry *create_buf_file_handler(const char *filename,
struct dentry *parent,
int mode,
struct rchan_buf *buf,
int *is_global)
{
return debugfs_create_file(filename, mode, parent, buf,
&relay_file_operations);
}
/*
* remove_buf_file() callback. Removes relay file from debugfs.
*/
static int remove_buf_file_handler(struct dentry *dentry)
{
debugfs_remove(dentry);
return 0;
}
/*
* relay interface callbacks
*/
static struct rchan_callbacks relay_callbacks =
{
.create_buf_file = create_buf_file_handler,
.remove_buf_file = remove_buf_file_handler,
};
And an example relay_open() invocation using them:
chan = relay_open("cpu", NULL, SUBBUF_SIZE, N_SUBBUFS, &relay_callbacks, NULL);
If the create_buf_file() callback fails, or isn't defined, channel
creation and thus relay_open() will fail.
The total size of each per-cpu buffer is calculated by multiplying the
number of sub-buffers by the sub-buffer size passed into relay_open().
The idea behind sub-buffers is that they're basically an extension of
double-buffering to N buffers, and they also allow applications to
easily implement random-access-on-buffer-boundary schemes, which can
be important for some high-volume applications. The number and size
of sub-buffers is completely dependent on the application and even for
the same application, different conditions will warrant different
values for these parameters at different times. Typically, the right
values to use are best decided after some experimentation; in general,
though, it's safe to assume that having only 1 sub-buffer is a bad
idea - you're guaranteed to either overwrite data or lose events
depending on the channel mode being used.
The create_buf_file() implementation can also be defined in such a way
as to allow the creation of a single 'global' buffer instead of the
default per-cpu set. This can be useful for applications interested
mainly in seeing the relative ordering of system-wide events without
the need to bother with saving explicit timestamps for the purpose of
merging/sorting per-cpu files in a postprocessing step.
To have relay_open() create a global buffer, the create_buf_file()
implementation should set the value of the is_global outparam to a
non-zero value in addition to creating the file that will be used to
represent the single buffer. In the case of a global buffer,
create_buf_file() and remove_buf_file() will be called only once. The
normal channel-writing functions, e.g. relay_write(), can still be
used - writes from any cpu will transparently end up in the global
buffer - but since it is a global buffer, callers should make sure
they use the proper locking for such a buffer, either by wrapping
writes in a spinlock, or by copying a write function from relay.h and
creating a local version that internally does the proper locking.
The private_data passed into relay_open() allows clients to associate
user-defined data with a channel, and is immediately available
(including in create_buf_file()) via chan->private_data or
buf->chan->private_data.
Channel 'modes'
---------------
relay channels can be used in either of two modes - 'overwrite' or
'no-overwrite'. The mode is entirely determined by the implementation
of the subbuf_start() callback, as described below. The default if no
subbuf_start() callback is defined is 'no-overwrite' mode. If the
default mode suits your needs, and you plan to use the read()
interface to retrieve channel data, you can ignore the details of this
section, as it pertains mainly to mmap() implementations.
In 'overwrite' mode, also known as 'flight recorder' mode, writes
continuously cycle around the buffer and will never fail, but will
unconditionally overwrite old data regardless of whether it's actually
been consumed. In no-overwrite mode, writes will fail, i.e. data will
be lost, if the number of unconsumed sub-buffers equals the total
number of sub-buffers in the channel. It should be clear that if
there is no consumer or if the consumer can't consume sub-buffers fast
enough, data will be lost in either case; the only difference is
whether data is lost from the beginning or the end of a buffer.
As explained above, a relay channel is made of up one or more
per-cpu channel buffers, each implemented as a circular buffer
subdivided into one or more sub-buffers. Messages are written into
the current sub-buffer of the channel's current per-cpu buffer via the
write functions described below. Whenever a message can't fit into
the current sub-buffer, because there's no room left for it, the
client is notified via the subbuf_start() callback that a switch to a
new sub-buffer is about to occur. The client uses this callback to 1)
initialize the next sub-buffer if appropriate 2) finalize the previous
sub-buffer if appropriate and 3) return a boolean value indicating
whether or not to actually move on to the next sub-buffer.
To implement 'no-overwrite' mode, the userspace client would provide
an implementation of the subbuf_start() callback something like the
following:
static int subbuf_start(struct rchan_buf *buf,
void *subbuf,
void *prev_subbuf,
unsigned int prev_padding)
{
if (prev_subbuf)
*((unsigned *)prev_subbuf) = prev_padding;
if (relay_buf_full(buf))
return 0;
subbuf_start_reserve(buf, sizeof(unsigned int));
return 1;
}
If the current buffer is full, i.e. all sub-buffers remain unconsumed,
the callback returns 0 to indicate that the buffer switch should not
occur yet, i.e. until the consumer has had a chance to read the
current set of ready sub-buffers. For the relay_buf_full() function
to make sense, the consumer is reponsible for notifying the relay
interface when sub-buffers have been consumed via
relay_subbufs_consumed(). Any subsequent attempts to write into the
buffer will again invoke the subbuf_start() callback with the same
parameters; only when the consumer has consumed one or more of the
ready sub-buffers will relay_buf_full() return 0, in which case the
buffer switch can continue.
The implementation of the subbuf_start() callback for 'overwrite' mode
would be very similar:
static int subbuf_start(struct rchan_buf *buf,
void *subbuf,
void *prev_subbuf,
unsigned int prev_padding)
{
if (prev_subbuf)
*((unsigned *)prev_subbuf) = prev_padding;
subbuf_start_reserve(buf, sizeof(unsigned int));
return 1;
}
In this case, the relay_buf_full() check is meaningless and the
callback always returns 1, causing the buffer switch to occur
unconditionally. It's also meaningless for the client to use the
relay_subbufs_consumed() function in this mode, as it's never
consulted.
The default subbuf_start() implementation, used if the client doesn't
define any callbacks, or doesn't define the subbuf_start() callback,
implements the simplest possible 'no-overwrite' mode, i.e. it does
nothing but return 0.
Header information can be reserved at the beginning of each sub-buffer
by calling the subbuf_start_reserve() helper function from within the
subbuf_start() callback. This reserved area can be used to store
whatever information the client wants. In the example above, room is
reserved in each sub-buffer to store the padding count for that
sub-buffer. This is filled in for the previous sub-buffer in the
subbuf_start() implementation; the padding value for the previous
sub-buffer is passed into the subbuf_start() callback along with a
pointer to the previous sub-buffer, since the padding value isn't
known until a sub-buffer is filled. The subbuf_start() callback is
also called for the first sub-buffer when the channel is opened, to
give the client a chance to reserve space in it. In this case the
previous sub-buffer pointer passed into the callback will be NULL, so
the client should check the value of the prev_subbuf pointer before
writing into the previous sub-buffer.
Writing to a channel
--------------------
Kernel clients write data into the current cpu's channel buffer using
relay_write() or __relay_write(). relay_write() is the main logging
function - it uses local_irqsave() to protect the buffer and should be
used if you might be logging from interrupt context. If you know
you'll never be logging from interrupt context, you can use
__relay_write(), which only disables preemption. These functions
don't return a value, so you can't determine whether or not they
failed - the assumption is that you wouldn't want to check a return
value in the fast logging path anyway, and that they'll always succeed
unless the buffer is full and no-overwrite mode is being used, in
which case you can detect a failed write in the subbuf_start()
callback by calling the relay_buf_full() helper function.
relay_reserve() is used to reserve a slot in a channel buffer which
can be written to later. This would typically be used in applications
that need to write directly into a channel buffer without having to
stage data in a temporary buffer beforehand. Because the actual write
may not happen immediately after the slot is reserved, applications
using relay_reserve() can keep a count of the number of bytes actually
written, either in space reserved in the sub-buffers themselves or as
a separate array. See the 'reserve' example in the relay-apps tarball
at http://relayfs.sourceforge.net for an example of how this can be
done. Because the write is under control of the client and is
separated from the reserve, relay_reserve() doesn't protect the buffer
at all - it's up to the client to provide the appropriate
synchronization when using relay_reserve().
Closing a channel
-----------------
The client calls relay_close() when it's finished using the channel.
The channel and its associated buffers are destroyed when there are no
longer any references to any of the channel buffers. relay_flush()
forces a sub-buffer switch on all the channel buffers, and can be used
to finalize and process the last sub-buffers before the channel is
closed.
Misc
----
Some applications may want to keep a channel around and re-use it
rather than open and close a new channel for each use. relay_reset()
can be used for this purpose - it resets a channel to its initial
state without reallocating channel buffer memory or destroying
existing mappings. It should however only be called when it's safe to
do so, i.e. when the channel isn't currently being written to.
Finally, there are a couple of utility callbacks that can be used for
different purposes. buf_mapped() is called whenever a channel buffer
is mmapped from user space and buf_unmapped() is called when it's
unmapped. The client can use this notification to trigger actions
within the kernel application, such as enabling/disabling logging to
the channel.
Resources
=========
For news, example code, mailing list, etc. see the relay interface homepage:
http://relayfs.sourceforge.net
Credits
=======
The ideas and specs for the relay interface came about as a result of
discussions on tracing involving the following:
Michel Dagenais <michel.dagenais@polymtl.ca>
Richard Moore <richardj_moore@uk.ibm.com>
Bob Wisniewski <bob@watson.ibm.com>
Karim Yaghmour <karim@opersys.com>
Tom Zanussi <zanussi@us.ibm.com>
Also thanks to Hubertus Franke for a lot of useful suggestions and bug
reports.

View File

@@ -0,0 +1,187 @@
ROMFS - ROM FILE SYSTEM
This is a quite dumb, read only filesystem, mainly for initial RAM
disks of installation disks. It has grown up by the need of having
modules linked at boot time. Using this filesystem, you get a very
similar feature, and even the possibility of a small kernel, with a
file system which doesn't take up useful memory from the router
functions in the basement of your office.
For comparison, both the older minix and xiafs (the latter is now
defunct) filesystems, compiled as module need more than 20000 bytes,
while romfs is less than a page, about 4000 bytes (assuming i586
code). Under the same conditions, the msdos filesystem would need
about 30K (and does not support device nodes or symlinks), while the
nfs module with nfsroot is about 57K. Furthermore, as a bit unfair
comparison, an actual rescue disk used up 3202 blocks with ext2, while
with romfs, it needed 3079 blocks.
To create such a file system, you'll need a user program named
genromfs. It is available via anonymous ftp on sunsite.unc.edu and
its mirrors, in the /pub/Linux/system/recovery/ directory.
As the name suggests, romfs could be also used (space-efficiently) on
various read-only media, like (E)EPROM disks if someone will have the
motivation.. :)
However, the main purpose of romfs is to have a very small kernel,
which has only this filesystem linked in, and then can load any module
later, with the current module utilities. It can also be used to run
some program to decide if you need SCSI devices, and even IDE or
floppy drives can be loaded later if you use the "initrd"--initial
RAM disk--feature of the kernel. This would not be really news
flash, but with romfs, you can even spare off your ext2 or minix or
maybe even affs filesystem until you really know that you need it.
For example, a distribution boot disk can contain only the cd disk
drivers (and possibly the SCSI drivers), and the ISO 9660 filesystem
module. The kernel can be small enough, since it doesn't have other
filesystems, like the quite large ext2fs module, which can then be
loaded off the CD at a later stage of the installation. Another use
would be for a recovery disk, when you are reinstalling a workstation
from the network, and you will have all the tools/modules available
from a nearby server, so you don't want to carry two disks for this
purpose, just because it won't fit into ext2.
romfs operates on block devices as you can expect, and the underlying
structure is very simple. Every accessible structure begins on 16
byte boundaries for fast access. The minimum space a file will take
is 32 bytes (this is an empty file, with a less than 16 character
name). The maximum overhead for any non-empty file is the header, and
the 16 byte padding for the name and the contents, also 16+14+15 = 45
bytes. This is quite rare however, since most file names are longer
than 3 bytes, and shorter than 15 bytes.
The layout of the filesystem is the following:
offset content
+---+---+---+---+
0 | - | r | o | m | \
+---+---+---+---+ The ASCII representation of those bytes
4 | 1 | f | s | - | / (i.e. "-rom1fs-")
+---+---+---+---+
8 | full size | The number of accessible bytes in this fs.
+---+---+---+---+
12 | checksum | The checksum of the FIRST 512 BYTES.
+---+---+---+---+
16 | volume name | The zero terminated name of the volume,
: : padded to 16 byte boundary.
+---+---+---+---+
xx | file |
: headers :
Every multi byte value (32 bit words, I'll use the longwords term from
now on) must be in big endian order.
The first eight bytes identify the filesystem, even for the casual
inspector. After that, in the 3rd longword, it contains the number of
bytes accessible from the start of this filesystem. The 4th longword
is the checksum of the first 512 bytes (or the number of bytes
accessible, whichever is smaller). The applied algorithm is the same
as in the AFFS filesystem, namely a simple sum of the longwords
(assuming bigendian quantities again). For details, please consult
the source. This algorithm was chosen because although it's not quite
reliable, it does not require any tables, and it is very simple.
The following bytes are now part of the file system; each file header
must begin on a 16 byte boundary.
offset content
+---+---+---+---+
0 | next filehdr|X| The offset of the next file header
+---+---+---+---+ (zero if no more files)
4 | spec.info | Info for directories/hard links/devices
+---+---+---+---+
8 | size | The size of this file in bytes
+---+---+---+---+
12 | checksum | Covering the meta data, including the file
+---+---+---+---+ name, and padding
16 | file name | The zero terminated name of the file,
: : padded to 16 byte boundary
+---+---+---+---+
xx | file data |
: :
Since the file headers begin always at a 16 byte boundary, the lowest
4 bits would be always zero in the next filehdr pointer. These four
bits are used for the mode information. Bits 0..2 specify the type of
the file; while bit 4 shows if the file is executable or not. The
permissions are assumed to be world readable, if this bit is not set,
and world executable if it is; except the character and block devices,
they are never accessible for other than owner. The owner of every
file is user and group 0, this should never be a problem for the
intended use. The mapping of the 8 possible values to file types is
the following:
mapping spec.info means
0 hard link link destination [file header]
1 directory first file's header
2 regular file unused, must be zero [MBZ]
3 symbolic link unused, MBZ (file data is the link content)
4 block device 16/16 bits major/minor number
5 char device - " -
6 socket unused, MBZ
7 fifo unused, MBZ
Note that hard links are specifically marked in this filesystem, but
they will behave as you can expect (i.e. share the inode number).
Note also that it is your responsibility to not create hard link
loops, and creating all the . and .. links for directories. This is
normally done correctly by the genromfs program. Please refrain from
using the executable bits for special purposes on the socket and fifo
special files, they may have other uses in the future. Additionally,
please remember that only regular files, and symlinks are supposed to
have a nonzero size field; they contain the number of bytes available
directly after the (padded) file name.
Another thing to note is that romfs works on file headers and data
aligned to 16 byte boundaries, but most hardware devices and the block
device drivers are unable to cope with smaller than block-sized data.
To overcome this limitation, the whole size of the file system must be
padded to an 1024 byte boundary.
If you have any problems or suggestions concerning this file system,
please contact me. However, think twice before wanting me to add
features and code, because the primary and most important advantage of
this file system is the small code. On the other hand, don't be
alarmed, I'm not getting that much romfs related mail. Now I can
understand why Avery wrote poems in the ARCnet docs to get some more
feedback. :)
romfs has also a mailing list, and to date, it hasn't received any
traffic, so you are welcome to join it to discuss your ideas. :)
It's run by ezmlm, so you can subscribe to it by sending a message
to romfs-subscribe@shadow.banki.hu, the content is irrelevant.
Pending issues:
- Permissions and owner information are pretty essential features of a
Un*x like system, but romfs does not provide the full possibilities.
I have never found this limiting, but others might.
- The file system is read only, so it can be very small, but in case
one would want to write _anything_ to a file system, he still needs
a writable file system, thus negating the size advantages. Possible
solutions: implement write access as a compile-time option, or a new,
similarly small writable filesystem for RAM disks.
- Since the files are only required to have alignment on a 16 byte
boundary, it is currently possibly suboptimal to read or execute files
from the filesystem. It might be resolved by reordering file data to
have most of it (i.e. except the start and the end) laying at "natural"
boundaries, thus it would be possible to directly map a big portion of
the file contents to the mm subsystem.
- Compression might be an useful feature, but memory is quite a
limiting factor in my eyes.
- Where it is used?
- Does it work on other architectures than intel and motorola?
Have fun,
Janos Farkas <chexum@shadow.banki.hu>

View File

@@ -0,0 +1,8 @@
Smbfs is a filesystem that implements the SMB protocol, which is the
protocol used by Windows for Workgroups, Windows 95 and Windows NT.
Smbfs was inspired by Samba, the program written by Andrew Tridgell
that turns any Unix host into a file server for DOS or Windows clients.
Smbfs is a SMB client, but uses parts of samba for it's operation. For
more info on samba, including documentation, please go to
http://www.samba.org/ and then on to your nearest mirror.

View File

@@ -0,0 +1,521 @@
SPUFS(2) Linux Programmer's Manual SPUFS(2)
NAME
spufs - the SPU file system
DESCRIPTION
The SPU file system is used on PowerPC machines that implement the Cell
Broadband Engine Architecture in order to access Synergistic Processor
Units (SPUs).
The file system provides a name space similar to posix shared memory or
message queues. Users that have write permissions on the file system
can use spu_create(2) to establish SPU contexts in the spufs root.
Every SPU context is represented by a directory containing a predefined
set of files. These files can be used for manipulating the state of the
logical SPU. Users can change permissions on those files, but not actu-
ally add or remove files.
MOUNT OPTIONS
uid=<uid>
set the user owning the mount point, the default is 0 (root).
gid=<gid>
set the group owning the mount point, the default is 0 (root).
FILES
The files in spufs mostly follow the standard behavior for regular sys-
tem calls like read(2) or write(2), but often support only a subset of
the operations supported on regular file systems. This list details the
supported operations and the deviations from the behaviour in the
respective man pages.
All files that support the read(2) operation also support readv(2) and
all files that support the write(2) operation also support writev(2).
All files support the access(2) and stat(2) family of operations, but
only the st_mode, st_nlink, st_uid and st_gid fields of struct stat
contain reliable information.
All files support the chmod(2)/fchmod(2) and chown(2)/fchown(2) opera-
tions, but will not be able to grant permissions that contradict the
possible operations, e.g. read access on the wbox file.
The current set of files is:
/mem
the contents of the local storage memory of the SPU. This can be
accessed like a regular shared memory file and contains both code and
data in the address space of the SPU. The possible operations on an
open mem file are:
read(2), pread(2), write(2), pwrite(2), lseek(2)
These operate as documented, with the exception that seek(2),
write(2) and pwrite(2) are not supported beyond the end of the
file. The file size is the size of the local storage of the SPU,
which normally is 256 kilobytes.
mmap(2)
Mapping mem into the process address space gives access to the
SPU local storage within the process address space. Only
MAP_SHARED mappings are allowed.
/mbox
The first SPU to CPU communication mailbox. This file is read-only and
can be read in units of 32 bits. The file can only be used in non-
blocking mode and it even poll() will not block on it. The possible
operations on an open mbox file are:
read(2)
If a count smaller than four is requested, read returns -1 and
sets errno to EINVAL. If there is no data available in the mail
box, the return value is set to -1 and errno becomes EAGAIN.
When data has been read successfully, four bytes are placed in
the data buffer and the value four is returned.
/ibox
The second SPU to CPU communication mailbox. This file is similar to
the first mailbox file, but can be read in blocking I/O mode, and the
poll family of system calls can be used to wait for it. The possible
operations on an open ibox file are:
read(2)
If a count smaller than four is requested, read returns -1 and
sets errno to EINVAL. If there is no data available in the mail
box and the file descriptor has been opened with O_NONBLOCK, the
return value is set to -1 and errno becomes EAGAIN.
If there is no data available in the mail box and the file
descriptor has been opened without O_NONBLOCK, the call will
block until the SPU writes to its interrupt mailbox channel.
When data has been read successfully, four bytes are placed in
the data buffer and the value four is returned.
poll(2)
Poll on the ibox file returns (POLLIN | POLLRDNORM) whenever
data is available for reading.
/wbox
The CPU to SPU communation mailbox. It is write-only and can be written
in units of 32 bits. If the mailbox is full, write() will block and
poll can be used to wait for it becoming empty again. The possible
operations on an open wbox file are: write(2) If a count smaller than
four is requested, write returns -1 and sets errno to EINVAL. If there
is no space available in the mail box and the file descriptor has been
opened with O_NONBLOCK, the return value is set to -1 and errno becomes
EAGAIN.
If there is no space available in the mail box and the file descriptor
has been opened without O_NONBLOCK, the call will block until the SPU
reads from its PPE mailbox channel. When data has been read success-
fully, four bytes are placed in the data buffer and the value four is
returned.
poll(2)
Poll on the ibox file returns (POLLOUT | POLLWRNORM) whenever
space is available for writing.
/mbox_stat
/ibox_stat
/wbox_stat
Read-only files that contain the length of the current queue, i.e. how
many words can be read from mbox or ibox or how many words can be
written to wbox without blocking. The files can be read only in 4-byte
units and return a big-endian binary integer number. The possible
operations on an open *box_stat file are:
read(2)
If a count smaller than four is requested, read returns -1 and
sets errno to EINVAL. Otherwise, a four byte value is placed in
the data buffer, containing the number of elements that can be
read from (for mbox_stat and ibox_stat) or written to (for
wbox_stat) the respective mail box without blocking or resulting
in EAGAIN.
/npc
/decr
/decr_status
/spu_tag_mask
/event_mask
/srr0
Internal registers of the SPU. The representation is an ASCII string
with the numeric value of the next instruction to be executed. These
can be used in read/write mode for debugging, but normal operation of
programs should not rely on them because access to any of them except
npc requires an SPU context save and is therefore very inefficient.
The contents of these files are:
npc Next Program Counter
decr SPU Decrementer
decr_status Decrementer Status
spu_tag_mask MFC tag mask for SPU DMA
event_mask Event mask for SPU interrupts
srr0 Interrupt Return address register
The possible operations on an open npc, decr, decr_status,
spu_tag_mask, event_mask or srr0 file are:
read(2)
When the count supplied to the read call is shorter than the
required length for the pointer value plus a newline character,
subsequent reads from the same file descriptor will result in
completing the string, regardless of changes to the register by
a running SPU task. When a complete string has been read, all
subsequent read operations will return zero bytes and a new file
descriptor needs to be opened to read the value again.
write(2)
A write operation on the file results in setting the register to
the value given in the string. The string is parsed from the
beginning to the first non-numeric character or the end of the
buffer. Subsequent writes to the same file descriptor overwrite
the previous setting.
/fpcr
This file gives access to the Floating Point Status and Control Regis-
ter as a four byte long file. The operations on the fpcr file are:
read(2)
If a count smaller than four is requested, read returns -1 and
sets errno to EINVAL. Otherwise, a four byte value is placed in
the data buffer, containing the current value of the fpcr regis-
ter.
write(2)
If a count smaller than four is requested, write returns -1 and
sets errno to EINVAL. Otherwise, a four byte value is copied
from the data buffer, updating the value of the fpcr register.
/signal1
/signal2
The two signal notification channels of an SPU. These are read-write
files that operate on a 32 bit word. Writing to one of these files
triggers an interrupt on the SPU. The value written to the signal
files can be read from the SPU through a channel read or from host user
space through the file. After the value has been read by the SPU, it
is reset to zero. The possible operations on an open signal1 or sig-
nal2 file are:
read(2)
If a count smaller than four is requested, read returns -1 and
sets errno to EINVAL. Otherwise, a four byte value is placed in
the data buffer, containing the current value of the specified
signal notification register.
write(2)
If a count smaller than four is requested, write returns -1 and
sets errno to EINVAL. Otherwise, a four byte value is copied
from the data buffer, updating the value of the specified signal
notification register. The signal notification register will
either be replaced with the input data or will be updated to the
bitwise OR or the old value and the input data, depending on the
contents of the signal1_type, or signal2_type respectively,
file.
/signal1_type
/signal2_type
These two files change the behavior of the signal1 and signal2 notifi-
cation files. The contain a numerical ASCII string which is read as
either "1" or "0". In mode 0 (overwrite), the hardware replaces the
contents of the signal channel with the data that is written to it. in
mode 1 (logical OR), the hardware accumulates the bits that are subse-
quently written to it. The possible operations on an open signal1_type
or signal2_type file are:
read(2)
When the count supplied to the read call is shorter than the
required length for the digit plus a newline character, subse-
quent reads from the same file descriptor will result in com-
pleting the string. When a complete string has been read, all
subsequent read operations will return zero bytes and a new file
descriptor needs to be opened to read the value again.
write(2)
A write operation on the file results in setting the register to
the value given in the string. The string is parsed from the
beginning to the first non-numeric character or the end of the
buffer. Subsequent writes to the same file descriptor overwrite
the previous setting.
EXAMPLES
/etc/fstab entry
none /spu spufs gid=spu 0 0
AUTHORS
Arnd Bergmann <arndb@de.ibm.com>, Mark Nutter <mnutter@us.ibm.com>,
Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
SEE ALSO
capabilities(7), close(2), spu_create(2), spu_run(2), spufs(7)
Linux 2005-09-28 SPUFS(2)
------------------------------------------------------------------------------
SPU_RUN(2) Linux Programmer's Manual SPU_RUN(2)
NAME
spu_run - execute an spu context
SYNOPSIS
#include <sys/spu.h>
int spu_run(int fd, unsigned int *npc, unsigned int *event);
DESCRIPTION
The spu_run system call is used on PowerPC machines that implement the
Cell Broadband Engine Architecture in order to access Synergistic Pro-
cessor Units (SPUs). It uses the fd that was returned from spu_cre-
ate(2) to address a specific SPU context. When the context gets sched-
uled to a physical SPU, it starts execution at the instruction pointer
passed in npc.
Execution of SPU code happens synchronously, meaning that spu_run does
not return while the SPU is still running. If there is a need to exe-
cute SPU code in parallel with other code on either the main CPU or
other SPUs, you need to create a new thread of execution first, e.g.
using the pthread_create(3) call.
When spu_run returns, the current value of the SPU instruction pointer
is written back to npc, so you can call spu_run again without updating
the pointers.
event can be a NULL pointer or point to an extended status code that
gets filled when spu_run returns. It can be one of the following con-
stants:
SPE_EVENT_DMA_ALIGNMENT
A DMA alignment error
SPE_EVENT_SPE_DATA_SEGMENT
A DMA segmentation error
SPE_EVENT_SPE_DATA_STORAGE
A DMA storage error
If NULL is passed as the event argument, these errors will result in a
signal delivered to the calling process.
RETURN VALUE
spu_run returns the value of the spu_status register or -1 to indicate
an error and set errno to one of the error codes listed below. The
spu_status register value contains a bit mask of status codes and
optionally a 14 bit code returned from the stop-and-signal instruction
on the SPU. The bit masks for the status codes are:
0x02 SPU was stopped by stop-and-signal.
0x04 SPU was stopped by halt.
0x08 SPU is waiting for a channel.
0x10 SPU is in single-step mode.
0x20 SPU has tried to execute an invalid instruction.
0x40 SPU has tried to access an invalid channel.
0x3fff0000
The bits masked with this value contain the code returned from
stop-and-signal.
There are always one or more of the lower eight bits set or an error
code is returned from spu_run.
ERRORS
EAGAIN or EWOULDBLOCK
fd is in non-blocking mode and spu_run would block.
EBADF fd is not a valid file descriptor.
EFAULT npc is not a valid pointer or status is neither NULL nor a valid
pointer.
EINTR A signal occurred while spu_run was in progress. The npc value
has been updated to the new program counter value if necessary.
EINVAL fd is not a file descriptor returned from spu_create(2).
ENOMEM Insufficient memory was available to handle a page fault result-
ing from an MFC direct memory access.
ENOSYS the functionality is not provided by the current system, because
either the hardware does not provide SPUs or the spufs module is
not loaded.
NOTES
spu_run is meant to be used from libraries that implement a more
abstract interface to SPUs, not to be used from regular applications.
See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
ommended libraries.
CONFORMING TO
This call is Linux specific and only implemented by the ppc64 architec-
ture. Programs using this system call are not portable.
BUGS
The code does not yet fully implement all features lined out here.
AUTHOR
Arnd Bergmann <arndb@de.ibm.com>
SEE ALSO
capabilities(7), close(2), spu_create(2), spufs(7)
Linux 2005-09-28 SPU_RUN(2)
------------------------------------------------------------------------------
SPU_CREATE(2) Linux Programmer's Manual SPU_CREATE(2)
NAME
spu_create - create a new spu context
SYNOPSIS
#include <sys/types.h>
#include <sys/spu.h>
int spu_create(const char *pathname, int flags, mode_t mode);
DESCRIPTION
The spu_create system call is used on PowerPC machines that implement
the Cell Broadband Engine Architecture in order to access Synergistic
Processor Units (SPUs). It creates a new logical context for an SPU in
pathname and returns a handle to associated with it. pathname must
point to a non-existing directory in the mount point of the SPU file
system (spufs). When spu_create is successful, a directory gets cre-
ated on pathname and it is populated with files.
The returned file handle can only be passed to spu_run(2) or closed,
other operations are not defined on it. When it is closed, all associ-
ated directory entries in spufs are removed. When the last file handle
pointing either inside of the context directory or to this file
descriptor is closed, the logical SPU context is destroyed.
The parameter flags can be zero or any bitwise or'd combination of the
following constants:
SPU_RAWIO
Allow mapping of some of the hardware registers of the SPU into
user space. This flag requires the CAP_SYS_RAWIO capability, see
capabilities(7).
The mode parameter specifies the permissions used for creating the new
directory in spufs. mode is modified with the user's umask(2) value
and then used for both the directory and the files contained in it. The
file permissions mask out some more bits of mode because they typically
support only read or write access. See stat(2) for a full list of the
possible mode values.
RETURN VALUE
spu_create returns a new file descriptor. It may return -1 to indicate
an error condition and set errno to one of the error codes listed
below.
ERRORS
EACCESS
The current user does not have write access on the spufs mount
point.
EEXIST An SPU context already exists at the given path name.
EFAULT pathname is not a valid string pointer in the current address
space.
EINVAL pathname is not a directory in the spufs mount point.
ELOOP Too many symlinks were found while resolving pathname.
EMFILE The process has reached its maximum open file limit.
ENAMETOOLONG
pathname was too long.
ENFILE The system has reached the global open file limit.
ENOENT Part of pathname could not be resolved.
ENOMEM The kernel could not allocate all resources required.
ENOSPC There are not enough SPU resources available to create a new
context or the user specific limit for the number of SPU con-
texts has been reached.
ENOSYS the functionality is not provided by the current system, because
either the hardware does not provide SPUs or the spufs module is
not loaded.
ENOTDIR
A part of pathname is not a directory.
NOTES
spu_create is meant to be used from libraries that implement a more
abstract interface to SPUs, not to be used from regular applications.
See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
ommended libraries.
FILES
pathname must point to a location beneath the mount point of spufs. By
convention, it gets mounted in /spu.
CONFORMING TO
This call is Linux specific and only implemented by the ppc64 architec-
ture. Programs using this system call are not portable.
BUGS
The code does not yet fully implement all features lined out here.
AUTHOR
Arnd Bergmann <arndb@de.ibm.com>
SEE ALSO
capabilities(7), close(2), spu_run(2), spufs(7)
Linux 2005-09-28 SPU_CREATE(2)

View File

@@ -0,0 +1,95 @@
Accessing PCI device resources through sysfs
--------------------------------------------
sysfs, usually mounted at /sys, provides access to PCI resources on platforms
that support it. For example, a given bus might look like this:
/sys/devices/pci0000:17
|-- 0000:17:00.0
| |-- class
| |-- config
| |-- device
| |-- irq
| |-- local_cpus
| |-- resource
| |-- resource0
| |-- resource1
| |-- resource2
| |-- rom
| |-- subsystem_device
| |-- subsystem_vendor
| `-- vendor
`-- ...
The topmost element describes the PCI domain and bus number. In this case,
the domain number is 0000 and the bus number is 17 (both values are in hex).
This bus contains a single function device in slot 0. The domain and bus
numbers are reproduced for convenience. Under the device directory are several
files, each with their own function.
file function
---- --------
class PCI class (ascii, ro)
config PCI config space (binary, rw)
device PCI device (ascii, ro)
irq IRQ number (ascii, ro)
local_cpus nearby CPU mask (cpumask, ro)
resource PCI resource host addresses (ascii, ro)
resource0..N PCI resource N, if present (binary, mmap)
rom PCI ROM resource, if present (binary, ro)
subsystem_device PCI subsystem device (ascii, ro)
subsystem_vendor PCI subsystem vendor (ascii, ro)
vendor PCI vendor (ascii, ro)
ro - read only file
rw - file is readable and writable
mmap - file is mmapable
ascii - file contains ascii text
binary - file contains binary data
cpumask - file contains a cpumask type
The read only files are informational, writes to them will be ignored, with
the exception of the 'rom' file. Writable files can be used to perform
actions on the device (e.g. changing config space, detaching a device).
mmapable files are available via an mmap of the file at offset 0 and can be
used to do actual device programming from userspace. Note that some platforms
don't support mmapping of certain resources, so be sure to check the return
value from any attempted mmap.
The 'rom' file is special in that it provides read-only access to the device's
ROM file, if available. It's disabled by default, however, so applications
should write the string "1" to the file to enable it before attempting a read
call, and disable it following the access by writing "0" to the file.
Accessing legacy resources through sysfs
----------------------------------------
Legacy I/O port and ISA memory resources are also provided in sysfs if the
underlying platform supports them. They're located in the PCI class hierarchy,
e.g.
/sys/class/pci_bus/0000:17/
|-- bridge -> ../../../devices/pci0000:17
|-- cpuaffinity
|-- legacy_io
`-- legacy_mem
The legacy_io file is a read/write file that can be used by applications to
do legacy port I/O. The application should open the file, seek to the desired
port (e.g. 0x3e8) and do a read or a write of 1, 2 or 4 bytes. The legacy_mem
file should be mmapped with an offset corresponding to the memory offset
desired, e.g. 0xa0000 for the VGA frame buffer. The application can then
simply dereference the returned pointer (after checking for errors of course)
to access legacy memory space.
Supporting PCI access on new platforms
--------------------------------------
In order to support PCI resource mapping as described above, Linux platform
code must define HAVE_PCI_MMAP and provide a pci_mmap_page_range function.
Platforms are free to only support subsets of the mmap functionality, but
useful return codes should be provided.
Legacy resources are protected by the HAVE_PCI_LEGACY define. Platforms
wishing to support legacy functionality should define it and provide
pci_legacy_read, pci_legacy_write and pci_mmap_legacy_page_range functions.

View File

@@ -0,0 +1,346 @@
sysfs - _The_ filesystem for exporting kernel objects.
Patrick Mochel <mochel@osdl.org>
10 January 2003
What it is:
~~~~~~~~~~~
sysfs is a ram-based filesystem initially based on ramfs. It provides
a means to export kernel data structures, their attributes, and the
linkages between them to userspace.
sysfs is tied inherently to the kobject infrastructure. Please read
Documentation/kobject.txt for more information concerning the kobject
interface.
Using sysfs
~~~~~~~~~~~
sysfs is always compiled in. You can access it by doing:
mount -t sysfs sysfs /sys
Directory Creation
~~~~~~~~~~~~~~~~~~
For every kobject that is registered with the system, a directory is
created for it in sysfs. That directory is created as a subdirectory
of the kobject's parent, expressing internal object hierarchies to
userspace. Top-level directories in sysfs represent the common
ancestors of object hierarchies; i.e. the subsystems the objects
belong to.
Sysfs internally stores the kobject that owns the directory in the
->d_fsdata pointer of the directory's dentry. This allows sysfs to do
reference counting directly on the kobject when the file is opened and
closed.
Attributes
~~~~~~~~~~
Attributes can be exported for kobjects in the form of regular files in
the filesystem. Sysfs forwards file I/O operations to methods defined
for the attributes, providing a means to read and write kernel
attributes.
Attributes should be ASCII text files, preferably with only one value
per file. It is noted that it may not be efficient to contain only
value per file, so it is socially acceptable to express an array of
values of the same type.
Mixing types, expressing multiple lines of data, and doing fancy
formatting of data is heavily frowned upon. Doing these things may get
you publically humiliated and your code rewritten without notice.
An attribute definition is simply:
struct attribute {
char * name;
mode_t mode;
};
int sysfs_create_file(struct kobject * kobj, struct attribute * attr);
void sysfs_remove_file(struct kobject * kobj, struct attribute * attr);
A bare attribute contains no means to read or write the value of the
attribute. Subsystems are encouraged to define their own attribute
structure and wrapper functions for adding and removing attributes for
a specific object type.
For example, the driver model defines struct device_attribute like:
struct device_attribute {
struct attribute attr;
ssize_t (*show)(struct device * dev, char * buf);
ssize_t (*store)(struct device * dev, const char * buf);
};
int device_create_file(struct device *, struct device_attribute *);
void device_remove_file(struct device *, struct device_attribute *);
It also defines this helper for defining device attributes:
#define DEVICE_ATTR(_name, _mode, _show, _store) \
struct device_attribute dev_attr_##_name = { \
.attr = {.name = __stringify(_name) , .mode = _mode }, \
.show = _show, \
.store = _store, \
};
For example, declaring
static DEVICE_ATTR(foo, S_IWUSR | S_IRUGO, show_foo, store_foo);
is equivalent to doing:
static struct device_attribute dev_attr_foo = {
.attr = {
.name = "foo",
.mode = S_IWUSR | S_IRUGO,
},
.show = show_foo,
.store = store_foo,
};
Subsystem-Specific Callbacks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When a subsystem defines a new attribute type, it must implement a
set of sysfs operations for forwarding read and write calls to the
show and store methods of the attribute owners.
struct sysfs_ops {
ssize_t (*show)(struct kobject *, struct attribute *, char *);
ssize_t (*store)(struct kobject *, struct attribute *, const char *);
};
[ Subsystems should have already defined a struct kobj_type as a
descriptor for this type, which is where the sysfs_ops pointer is
stored. See the kobject documentation for more information. ]
When a file is read or written, sysfs calls the appropriate method
for the type. The method then translates the generic struct kobject
and struct attribute pointers to the appropriate pointer types, and
calls the associated methods.
To illustrate:
#define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr)
#define to_dev(d) container_of(d, struct device, kobj)
static ssize_t
dev_attr_show(struct kobject * kobj, struct attribute * attr, char * buf)
{
struct device_attribute * dev_attr = to_dev_attr(attr);
struct device * dev = to_dev(kobj);
ssize_t ret = 0;
if (dev_attr->show)
ret = dev_attr->show(dev, buf);
return ret;
}
Reading/Writing Attribute Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To read or write attributes, show() or store() methods must be
specified when declaring the attribute. The method types should be as
simple as those defined for device attributes:
ssize_t (*show)(struct device * dev, char * buf);
ssize_t (*store)(struct device * dev, const char * buf);
IOW, they should take only an object and a buffer as parameters.
sysfs allocates a buffer of size (PAGE_SIZE) and passes it to the
method. Sysfs will call the method exactly once for each read or
write. This forces the following behavior on the method
implementations:
- On read(2), the show() method should fill the entire buffer.
Recall that an attribute should only be exporting one value, or an
array of similar values, so this shouldn't be that expensive.
This allows userspace to do partial reads and seeks arbitrarily over
the entire file at will.
- On write(2), sysfs expects the entire buffer to be passed during the
first write. Sysfs then passes the entire buffer to the store()
method.
When writing sysfs files, userspace processes should first read the
entire file, modify the values it wishes to change, then write the
entire buffer back.
Attribute method implementations should operate on an identical
buffer when reading and writing values.
Other notes:
- The buffer will always be PAGE_SIZE bytes in length. On i386, this
is 4096.
- show() methods should return the number of bytes printed into the
buffer. This is the return value of snprintf().
- show() should always use snprintf().
- store() should return the number of bytes used from the buffer. This
can be done using strlen().
- show() or store() can always return errors. If a bad value comes
through, be sure to return an error.
- The object passed to the methods will be pinned in memory via sysfs
referencing counting its embedded object. However, the physical
entity (e.g. device) the object represents may not be present. Be
sure to have a way to check this, if necessary.
A very simple (and naive) implementation of a device attribute is:
static ssize_t show_name(struct device *dev, struct device_attribute *attr, char *buf)
{
return snprintf(buf, PAGE_SIZE, "%s\n", dev->name);
}
static ssize_t store_name(struct device * dev, const char * buf)
{
sscanf(buf, "%20s", dev->name);
return strnlen(buf, PAGE_SIZE);
}
static DEVICE_ATTR(name, S_IRUGO, show_name, store_name);
(Note that the real implementation doesn't allow userspace to set the
name for a device.)
Top Level Directory Layout
~~~~~~~~~~~~~~~~~~~~~~~~~~
The sysfs directory arrangement exposes the relationship of kernel
data structures.
The top level sysfs directory looks like:
block/
bus/
class/
devices/
firmware/
net/
fs/
devices/ contains a filesystem representation of the device tree. It maps
directly to the internal kernel device tree, which is a hierarchy of
struct device.
bus/ contains flat directory layout of the various bus types in the
kernel. Each bus's directory contains two subdirectories:
devices/
drivers/
devices/ contains symlinks for each device discovered in the system
that point to the device's directory under root/.
drivers/ contains a directory for each device driver that is loaded
for devices on that particular bus (this assumes that drivers do not
span multiple bus types).
fs/ contains a directory for some filesystems. Currently each
filesystem wanting to export attributes must create its own hierarchy
below fs/ (see ./fuse.txt for an example).
More information can driver-model specific features can be found in
Documentation/driver-model/.
TODO: Finish this section.
Current Interfaces
~~~~~~~~~~~~~~~~~~
The following interface layers currently exist in sysfs:
- devices (include/linux/device.h)
----------------------------------
Structure:
struct device_attribute {
struct attribute attr;
ssize_t (*show)(struct device * dev, char * buf);
ssize_t (*store)(struct device * dev, const char * buf);
};
Declaring:
DEVICE_ATTR(_name, _str, _mode, _show, _store);
Creation/Removal:
int device_create_file(struct device *device, struct device_attribute * attr);
void device_remove_file(struct device * dev, struct device_attribute * attr);
- bus drivers (include/linux/device.h)
--------------------------------------
Structure:
struct bus_attribute {
struct attribute attr;
ssize_t (*show)(struct bus_type *, char * buf);
ssize_t (*store)(struct bus_type *, const char * buf);
};
Declaring:
BUS_ATTR(_name, _mode, _show, _store)
Creation/Removal:
int bus_create_file(struct bus_type *, struct bus_attribute *);
void bus_remove_file(struct bus_type *, struct bus_attribute *);
- device drivers (include/linux/device.h)
-----------------------------------------
Structure:
struct driver_attribute {
struct attribute attr;
ssize_t (*show)(struct device_driver *, char * buf);
ssize_t (*store)(struct device_driver *, const char * buf);
};
Declaring:
DRIVER_ATTR(_name, _mode, _show, _store)
Creation/Removal:
int driver_create_file(struct device_driver *, struct driver_attribute *);
void driver_remove_file(struct device_driver *, struct driver_attribute *);

View File

@@ -0,0 +1,197 @@
It implements all of
- Xenix FS,
- SystemV/386 FS,
- Coherent FS.
To install:
* Answer the 'System V and Coherent filesystem support' question with 'y'
when configuring the kernel.
* To mount a disk or a partition, use
mount [-r] -t sysv device mountpoint
The file system type names
-t sysv
-t xenix
-t coherent
may be used interchangeably, but the last two will eventually disappear.
Bugs in the present implementation:
- Coherent FS:
- The "free list interleave" n:m is currently ignored.
- Only file systems with no filesystem name and no pack name are recognized.
(See Coherent "man mkfs" for a description of these features.)
- SystemV Release 2 FS:
The superblock is only searched in the blocks 9, 15, 18, which
corresponds to the beginning of track 1 on floppy disks. No support
for this FS on hard disk yet.
These filesystems are rather similar. Here is a comparison with Minix FS:
* Linux fdisk reports on partitions
- Minix FS 0x81 Linux/Minix
- Xenix FS ??
- SystemV FS ??
- Coherent FS 0x08 AIX bootable
* Size of a block or zone (data allocation unit on disk)
- Minix FS 1024
- Xenix FS 1024 (also 512 ??)
- SystemV FS 1024 (also 512 and 2048)
- Coherent FS 512
* General layout: all have one boot block, one super block and
separate areas for inodes and for directories/data.
On SystemV Release 2 FS (e.g. Microport) the first track is reserved and
all the block numbers (including the super block) are offset by one track.
* Byte ordering of "short" (16 bit entities) on disk:
- Minix FS little endian 0 1
- Xenix FS little endian 0 1
- SystemV FS little endian 0 1
- Coherent FS little endian 0 1
Of course, this affects only the file system, not the data of files on it!
* Byte ordering of "long" (32 bit entities) on disk:
- Minix FS little endian 0 1 2 3
- Xenix FS little endian 0 1 2 3
- SystemV FS little endian 0 1 2 3
- Coherent FS PDP-11 2 3 0 1
Of course, this affects only the file system, not the data of files on it!
* Inode on disk: "short", 0 means non-existent, the root dir ino is:
- Minix FS 1
- Xenix FS, SystemV FS, Coherent FS 2
* Maximum number of hard links to a file:
- Minix FS 250
- Xenix FS ??
- SystemV FS ??
- Coherent FS >=10000
* Free inode management:
- Minix FS a bitmap
- Xenix FS, SystemV FS, Coherent FS
There is a cache of a certain number of free inodes in the super-block.
When it is exhausted, new free inodes are found using a linear search.
* Free block management:
- Minix FS a bitmap
- Xenix FS, SystemV FS, Coherent FS
Free blocks are organized in a "free list". Maybe a misleading term,
since it is not true that every free block contains a pointer to
the next free block. Rather, the free blocks are organized in chunks
of limited size, and every now and then a free block contains pointers
to the free blocks pertaining to the next chunk; the first of these
contains pointers and so on. The list terminates with a "block number"
0 on Xenix FS and SystemV FS, with a block zeroed out on Coherent FS.
* Super-block location:
- Minix FS block 1 = bytes 1024..2047
- Xenix FS block 1 = bytes 1024..2047
- SystemV FS bytes 512..1023
- Coherent FS block 1 = bytes 512..1023
* Super-block layout:
- Minix FS
unsigned short s_ninodes;
unsigned short s_nzones;
unsigned short s_imap_blocks;
unsigned short s_zmap_blocks;
unsigned short s_firstdatazone;
unsigned short s_log_zone_size;
unsigned long s_max_size;
unsigned short s_magic;
- Xenix FS, SystemV FS, Coherent FS
unsigned short s_firstdatazone;
unsigned long s_nzones;
unsigned short s_fzone_count;
unsigned long s_fzones[NICFREE];
unsigned short s_finode_count;
unsigned short s_finodes[NICINOD];
char s_flock;
char s_ilock;
char s_modified;
char s_rdonly;
unsigned long s_time;
short s_dinfo[4]; -- SystemV FS only
unsigned long s_free_zones;
unsigned short s_free_inodes;
short s_dinfo[4]; -- Xenix FS only
unsigned short s_interleave_m,s_interleave_n; -- Coherent FS only
char s_fname[6];
char s_fpack[6];
then they differ considerably:
Xenix FS
char s_clean;
char s_fill[371];
long s_magic;
long s_type;
SystemV FS
long s_fill[12 or 14];
long s_state;
long s_magic;
long s_type;
Coherent FS
unsigned long s_unique;
Note that Coherent FS has no magic.
* Inode layout:
- Minix FS
unsigned short i_mode;
unsigned short i_uid;
unsigned long i_size;
unsigned long i_time;
unsigned char i_gid;
unsigned char i_nlinks;
unsigned short i_zone[7+1+1];
- Xenix FS, SystemV FS, Coherent FS
unsigned short i_mode;
unsigned short i_nlink;
unsigned short i_uid;
unsigned short i_gid;
unsigned long i_size;
unsigned char i_zone[3*(10+1+1+1)];
unsigned long i_atime;
unsigned long i_mtime;
unsigned long i_ctime;
* Regular file data blocks are organized as
- Minix FS
7 direct blocks
1 indirect block (pointers to blocks)
1 double-indirect block (pointer to pointers to blocks)
- Xenix FS, SystemV FS, Coherent FS
10 direct blocks
1 indirect block (pointers to blocks)
1 double-indirect block (pointer to pointers to blocks)
1 triple-indirect block (pointer to pointers to pointers to blocks)
* Inode size, inodes per block
- Minix FS 32 32
- Xenix FS 64 16
- SystemV FS 64 16
- Coherent FS 64 8
* Directory entry on disk
- Minix FS
unsigned short inode;
char name[14/30];
- Xenix FS, SystemV FS, Coherent FS
unsigned short inode;
char name[14];
* Dir entry size, dir entries per block
- Minix FS 16/32 64/32
- Xenix FS 16 64
- SystemV FS 16 64
- Coherent FS 16 32
* How to implement symbolic links such that the host fsck doesn't scream:
- Minix FS normal
- Xenix FS kludge: as regular files with chmod 1000
- SystemV FS ??
- Coherent FS kludge: as regular files with chmod 1000
Notation: We often speak of a "block" but mean a zone (the allocation unit)
and not the disk driver's notion of "block".

View File

@@ -0,0 +1,124 @@
Tmpfs is a file system which keeps all files in virtual memory.
Everything in tmpfs is temporary in the sense that no files will be
created on your hard drive. If you unmount a tmpfs instance,
everything stored therein is lost.
tmpfs puts everything into the kernel internal caches and grows and
shrinks to accommodate the files it contains and is able to swap
unneeded pages out to swap space. It has maximum size limits which can
be adjusted on the fly via 'mount -o remount ...'
If you compare it to ramfs (which was the template to create tmpfs)
you gain swapping and limit checking. Another similar thing is the RAM
disk (/dev/ram*), which simulates a fixed size hard disk in physical
RAM, where you have to create an ordinary filesystem on top. Ramdisks
cannot swap and you do not have the possibility to resize them.
Since tmpfs lives completely in the page cache and on swap, all tmpfs
pages currently in memory will show up as cached. It will not show up
as shared or something like that. Further on you can check the actual
RAM+swap use of a tmpfs instance with df(1) and du(1).
tmpfs has the following uses:
1) There is always a kernel internal mount which you will not see at
all. This is used for shared anonymous mappings and SYSV shared
memory.
This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not
set, the user visible part of tmpfs is not build. But the internal
mechanisms are always present.
2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for
POSIX shared memory (shm_open, shm_unlink). Adding the following
line to /etc/fstab should take care of this:
tmpfs /dev/shm tmpfs defaults 0 0
Remember to create the directory that you intend to mount tmpfs on
if necessary.
This mount is _not_ needed for SYSV shared memory. The internal
mount is used for that. (In the 2.3 kernel versions it was
necessary to mount the predecessor of tmpfs (shm fs) to use SYSV
shared memory)
3) Some people (including me) find it very convenient to mount it
e.g. on /tmp and /var/tmp and have a big swap partition. And now
loop mounts of tmpfs files do work, so mkinitrd shipped by most
distributions should succeed with a tmpfs /tmp.
4) And probably a lot more I do not know about :-)
tmpfs has three mount options for sizing:
size: The limit of allocated bytes for this tmpfs instance. The
default is half of your physical RAM without swap. If you
oversize your tmpfs instances the machine will deadlock
since the OOM handler will not be able to free that memory.
nr_blocks: The same as size, but in blocks of PAGE_CACHE_SIZE.
nr_inodes: The maximum number of inodes for this instance. The default
is half of the number of your physical RAM pages, or (on a
machine with highmem) the number of lowmem RAM pages,
whichever is the lower.
These parameters accept a suffix k, m or g for kilo, mega and giga and
can be changed on remount. The size parameter also accepts a suffix %
to limit this tmpfs instance to that percentage of your physical RAM:
the default, when neither size nor nr_blocks is specified, is size=50%
If nr_blocks=0 (or size=0), blocks will not be limited in that instance;
if nr_inodes=0, inodes will not be limited. It is generally unwise to
mount with such options, since it allows any user with write access to
use up all the memory on the machine; but enhances the scalability of
that instance in a system with many cpus making intensive use of it.
tmpfs has a mount option to set the NUMA memory allocation policy for
all files in that instance (if CONFIG_NUMA is enabled) - which can be
adjusted on the fly via 'mount -o remount ...'
mpol=default prefers to allocate memory from the local node
mpol=prefer:Node prefers to allocate memory from the given Node
mpol=bind:NodeList allocates memory only from nodes in NodeList
mpol=interleave prefers to allocate from each node in turn
mpol=interleave:NodeList allocates from each node of NodeList in turn
NodeList format is a comma-separated list of decimal numbers and ranges,
a range being two hyphen-separated decimal numbers, the smallest and
largest node numbers in the range. For example, mpol=bind:0-3,5,7,9-15
Note that trying to mount a tmpfs with an mpol option will fail if the
running kernel does not support NUMA; and will fail if its nodelist
specifies a node >= MAX_NUMNODES. If your system relies on that tmpfs
being mounted, but from time to time runs a kernel built without NUMA
capability (perhaps a safe recovery kernel), or configured to support
fewer nodes, then it is advisable to omit the mpol option from automatic
mount options. It can be added later, when the tmpfs is already mounted
on MountPoint, by 'mount -o remount,mpol=Policy:NodeList MountPoint'.
To specify the initial root directory you can use the following mount
options:
mode: The permissions as an octal number
uid: The user id
gid: The group id
These options do not have any effect on remount. You can change these
parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem.
So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs'
will give you tmpfs instance on /mytmpfs which can allocate 10GB
RAM/SWAP in 10240 inodes and it is only accessible by root.
Author:
Christoph Rohland <cr@sap.com>, 1.12.01
Updated:
Hugh Dickins <hugh@veritas.com>, 19 February 2006

View File

@@ -0,0 +1,80 @@
*
* Documentation/filesystems/udf.txt
*
UDF Filesystem version 0.9.8.1
If you encounter problems with reading UDF discs using this driver,
please report them to linux_udf@hpesjro.fc.hp.com, which is the
developer's list.
Write support requires a block driver which supports writing. Currently
dvd+rw drives and media support true random sector writes, and so a udf
filesystem on such devices can be directly mounted read/write. CD-RW
media however, does not support this. Instead the media can be formatted
for packet mode using the utility cdrwtool, then the pktcdvd driver can
be bound to the underlying cd device to provide the required buffering
and read-modify-write cycles to allow the filesystem random sector writes
while providing the hardware with only full packet writes. While not
required for dvd+rw media, use of the pktcdvd driver often enhances
performance due to very poor read-modify-write support supplied internally
by drive firmware.
-------------------------------------------------------------------------------
The following mount options are supported:
gid= Set the default group.
umask= Set the default umask.
uid= Set the default user.
bs= Set the block size.
unhide Show otherwise hidden files.
undelete Show deleted files in lists.
adinicb Embed data in the inode (default)
noadinicb Don't embed data in the inode
shortad Use short ad's
longad Use long ad's (default)
nostrict Unset strict conformance
iocharset= Set the NLS character set
The uid= and gid= options need a bit more explaining. They will accept a
decimal numeric value which will be used as the default ID for that mount.
They will also accept the string "ignore" and "forget". For files on the disk
that are owned by nobody ( -1 ), they will instead look as if they are owned
by the default ID. The ignore option causes the default ID to override all
IDs on the disk, not just -1. The forget option causes all IDs to be written
to disk as -1, so when the media is later remounted, they will appear to be
owned by whatever default ID it is mounted with at that time.
For typical desktop use of removable media, you should set the ID to that
of the interactively logged on user, and also specify both the forget and
ignore options. This way the interactive user will always see the files
on the disk as belonging to him.
The remaining are for debugging and disaster recovery:
novrs Skip volume sequence recognition
The following expect a offset from 0.
session= Set the CDROM session (default= last session)
anchor= Override standard anchor location. (default= 256)
volume= Override the VolumeDesc location. (unused)
partition= Override the PartitionDesc location. (unused)
lastblock= Set the last block of the filesystem/
The following expect a offset from the partition root.
fileset= Override the fileset block location. (unused)
rootdir= Override the root directory location. (unused)
WARNING: overriding the rootdir to a non-directory may
yield highly unpredictable results.
-------------------------------------------------------------------------------
For the latest version and toolset see:
http://linux-udf.sourceforge.net/
Documentation on UDF and ECMA 167 is available FREE from:
http://www.osta.org/
http://www.ecma-international.org/
Ben Fennema <bfennema@falcon.csc.calpoly.edu>

View File

@@ -0,0 +1,60 @@
USING UFS
=========
mount -t ufs -o ufstype=type_of_ufs device dir
UFS OPTIONS
===========
ufstype=type_of_ufs
UFS is a file system widely used in different operating systems.
The problem are differences among implementations. Features of
some implementations are undocumented, so its hard to recognize
type of ufs automatically. That's why user must specify type of
ufs manually by mount option ufstype. Possible values are:
old old format of ufs
default value, supported as read-only
44bsd used in FreeBSD, NetBSD, OpenBSD
supported as read-write
ufs2 used in FreeBSD 5.x
supported as read-write
5xbsd synonym for ufs2
sun used in SunOS (Solaris)
supported as read-write
sunx86 used in SunOS for Intel (Solarisx86)
supported as read-write
hp used in HP-UX
supported as read-only
nextstep
used in NextStep
supported as read-only
nextstep-cd
used for NextStep CDROMs (block_size == 2048)
supported as read-only
openstep
used in OpenStep
supported as read-only
POSSIBLE PROBLEMS
=================
See next section, if you have any.
BUG REPORTS
===========
Any ufs bug report you can send to daniel.pirkl@email.cz or
to dushistov@mail.ru (do not send partition tables bug reports).

View File

@@ -0,0 +1,231 @@
USING VFAT
----------------------------------------------------------------------
To use the vfat filesystem, use the filesystem type 'vfat'. i.e.
mount -t vfat /dev/fd0 /mnt
No special partition formatter is required. mkdosfs will work fine
if you want to format from within Linux.
VFAT MOUNT OPTIONS
----------------------------------------------------------------------
umask=### -- The permission mask (for files and directories, see umask(1)).
The default is the umask of current process.
dmask=### -- The permission mask for the directory.
The default is the umask of current process.
fmask=### -- The permission mask for files.
The default is the umask of current process.
codepage=### -- Sets the codepage number for converting to shortname
characters on FAT filesystem.
By default, FAT_DEFAULT_CODEPAGE setting is used.
iocharset=name -- Character set to use for converting between the
encoding is used for user visible filename and 16 bit
Unicode characters. Long filenames are stored on disk
in Unicode format, but Unix for the most part doesn't
know how to deal with Unicode.
By default, FAT_DEFAULT_IOCHARSET setting is used.
There is also an option of doing UTF-8 translations
with the utf8 option.
NOTE: "iocharset=utf8" is not recommended. If unsure,
you should consider the following option instead.
utf8=<bool> -- UTF-8 is the filesystem safe version of Unicode that
is used by the console. It can be enabled for the
filesystem with this option. If 'uni_xlate' gets set,
UTF-8 gets disabled.
uni_xlate=<bool> -- Translate unhandled Unicode characters to special
escaped sequences. This would let you backup and
restore filenames that are created with any Unicode
characters. Until Linux supports Unicode for real,
this gives you an alternative. Without this option,
a '?' is used when no translation is possible. The
escape character is ':' because it is otherwise
illegal on the vfat filesystem. The escape sequence
that gets used is ':' and the four digits of hexadecimal
unicode.
nonumtail=<bool> -- When creating 8.3 aliases, normally the alias will
end in '~1' or tilde followed by some number. If this
option is set, then if the filename is
"longfilename.txt" and "longfile.txt" does not
currently exist in the directory, 'longfile.txt' will
be the short alias instead of 'longfi~1.txt'.
quiet -- Stops printing certain warning messages.
check=s|r|n -- Case sensitivity checking setting.
s: strict, case sensitive
r: relaxed, case insensitive
n: normal, default setting, currently case insensitive
shortname=lower|win95|winnt|mixed
-- Shortname display/create setting.
lower: convert to lowercase for display,
emulate the Windows 95 rule for create.
win95: emulate the Windows 95 rule for display/create.
winnt: emulate the Windows NT rule for display/create.
mixed: emulate the Windows NT rule for display,
emulate the Windows 95 rule for create.
Default setting is `lower'.
<bool>: 0,1,yes,no,true,false
TODO
----------------------------------------------------------------------
* Need to get rid of the raw scanning stuff. Instead, always use
a get next directory entry approach. The only thing left that uses
raw scanning is the directory renaming code.
POSSIBLE PROBLEMS
----------------------------------------------------------------------
* vfat_valid_longname does not properly checked reserved names.
* When a volume name is the same as a directory name in the root
directory of the filesystem, the directory name sometimes shows
up as an empty file.
* autoconv option does not work correctly.
BUG REPORTS
----------------------------------------------------------------------
If you have trouble with the VFAT filesystem, mail bug reports to
chaffee@bmrc.cs.berkeley.edu. Please specify the filename
and the operation that gave you trouble.
TEST SUITE
----------------------------------------------------------------------
If you plan to make any modifications to the vfat filesystem, please
get the test suite that comes with the vfat distribution at
http://bmrc.berkeley.edu/people/chaffee/vfat.html
This tests quite a few parts of the vfat filesystem and additional
tests for new features or untested features would be appreciated.
NOTES ON THE STRUCTURE OF THE VFAT FILESYSTEM
----------------------------------------------------------------------
(This documentation was provided by Galen C. Hunt <gchunt@cs.rochester.edu>
and lightly annotated by Gordon Chaffee).
This document presents a very rough, technical overview of my
knowledge of the extended FAT file system used in Windows NT 3.5 and
Windows 95. I don't guarantee that any of the following is correct,
but it appears to be so.
The extended FAT file system is almost identical to the FAT
file system used in DOS versions up to and including 6.223410239847
:-). The significant change has been the addition of long file names.
These names support up to 255 characters including spaces and lower
case characters as opposed to the traditional 8.3 short names.
Here is the description of the traditional FAT entry in the current
Windows 95 filesystem:
struct directory { // Short 8.3 names
unsigned char name[8]; // file name
unsigned char ext[3]; // file extension
unsigned char attr; // attribute byte
unsigned char lcase; // Case for base and extension
unsigned char ctime_ms; // Creation time, milliseconds
unsigned char ctime[2]; // Creation time
unsigned char cdate[2]; // Creation date
unsigned char adate[2]; // Last access date
unsigned char reserved[2]; // reserved values (ignored)
unsigned char time[2]; // time stamp
unsigned char date[2]; // date stamp
unsigned char start[2]; // starting cluster number
unsigned char size[4]; // size of the file
};
The lcase field specifies if the base and/or the extension of an 8.3
name should be capitalized. This field does not seem to be used by
Windows 95 but it is used by Windows NT. The case of filenames is not
completely compatible from Windows NT to Windows 95. It is not completely
compatible in the reverse direction, however. Filenames that fit in
the 8.3 namespace and are written on Windows NT to be lowercase will
show up as uppercase on Windows 95.
Note that the "start" and "size" values are actually little
endian integer values. The descriptions of the fields in this
structure are public knowledge and can be found elsewhere.
With the extended FAT system, Microsoft has inserted extra
directory entries for any files with extended names. (Any name which
legally fits within the old 8.3 encoding scheme does not have extra
entries.) I call these extra entries slots. Basically, a slot is a
specially formatted directory entry which holds up to 13 characters of
a file's extended name. Think of slots as additional labeling for the
directory entry of the file to which they correspond. Microsoft
prefers to refer to the 8.3 entry for a file as its alias and the
extended slot directory entries as the file name.
The C structure for a slot directory entry follows:
struct slot { // Up to 13 characters of a long name
unsigned char id; // sequence number for slot
unsigned char name0_4[10]; // first 5 characters in name
unsigned char attr; // attribute byte
unsigned char reserved; // always 0
unsigned char alias_checksum; // checksum for 8.3 alias
unsigned char name5_10[12]; // 6 more characters in name
unsigned char start[2]; // starting cluster number
unsigned char name11_12[4]; // last 2 characters in name
};
If the layout of the slots looks a little odd, it's only
because of Microsoft's efforts to maintain compatibility with old
software. The slots must be disguised to prevent old software from
panicking. To this end, a number of measures are taken:
1) The attribute byte for a slot directory entry is always set
to 0x0f. This corresponds to an old directory entry with
attributes of "hidden", "system", "read-only", and "volume
label". Most old software will ignore any directory
entries with the "volume label" bit set. Real volume label
entries don't have the other three bits set.
2) The starting cluster is always set to 0, an impossible
value for a DOS file.
Because the extended FAT system is backward compatible, it is
possible for old software to modify directory entries. Measures must
be taken to ensure the validity of slots. An extended FAT system can
verify that a slot does in fact belong to an 8.3 directory entry by
the following:
1) Positioning. Slots for a file always immediately proceed
their corresponding 8.3 directory entry. In addition, each
slot has an id which marks its order in the extended file
name. Here is a very abbreviated view of an 8.3 directory
entry and its corresponding long name slots for the file
"My Big File.Extension which is long":
<proceeding files...>
<slot #3, id = 0x43, characters = "h is long">
<slot #2, id = 0x02, characters = "xtension whic">
<slot #1, id = 0x01, characters = "My Big File.E">
<directory entry, name = "MYBIGFIL.EXT">
Note that the slots are stored from last to first. Slots
are numbered from 1 to N. The Nth slot is or'ed with 0x40
to mark it as the last one.
2) Checksum. Each slot has an "alias_checksum" value. The
checksum is calculated from the 8.3 name using the
following algorithm:
for (sum = i = 0; i < 11; i++) {
sum = (((sum&1)<<7)|((sum&0xfe)>>1)) + name[i]
}
3) If there is free space in the final slot, a Unicode NULL (0x0000)
is stored after the final character. After that, all unused
characters in the final slot are set to Unicode 0xFFFF.
Finally, note that the extended name is stored in Unicode. Each Unicode
character takes two bytes.

View File

@@ -0,0 +1,931 @@
Overview of the Linux Virtual File System
Original author: Richard Gooch <rgooch@atnf.csiro.au>
Last updated on October 28, 2005
Copyright (C) 1999 Richard Gooch
Copyright (C) 2005 Pekka Enberg
This file is released under the GPLv2.
Introduction
============
The Virtual File System (also known as the Virtual Filesystem Switch)
is the software layer in the kernel that provides the filesystem
interface to userspace programs. It also provides an abstraction
within the kernel which allows different filesystem implementations to
coexist.
VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so
on are called from a process context. Filesystem locking is described
in the document Documentation/filesystems/Locking.
Directory Entry Cache (dcache)
------------------------------
The VFS implements the open(2), stat(2), chmod(2), and similar system
calls. The pathname argument that is passed to them is used by the VFS
to search through the directory entry cache (also known as the dentry
cache or dcache). This provides a very fast look-up mechanism to
translate a pathname (filename) into a specific dentry. Dentries live
in RAM and are never saved to disc: they exist only for performance.
The dentry cache is meant to be a view into your entire filespace. As
most computers cannot fit all dentries in the RAM at the same time,
some bits of the cache are missing. In order to resolve your pathname
into a dentry, the VFS may have to resort to creating dentries along
the way, and then loading the inode. This is done by looking up the
inode.
The Inode Object
----------------
An individual dentry usually has a pointer to an inode. Inodes are
filesystem objects such as regular files, directories, FIFOs and other
beasts. They live either on the disc (for block device filesystems)
or in the memory (for pseudo filesystems). Inodes that live on the
disc are copied into the memory when required and changes to the inode
are written back to disc. A single inode can be pointed to by multiple
dentries (hard links, for example, do this).
To look up an inode requires that the VFS calls the lookup() method of
the parent directory inode. This method is installed by the specific
filesystem implementation that the inode lives in. Once the VFS has
the required dentry (and hence the inode), we can do all those boring
things like open(2) the file, or stat(2) it to peek at the inode
data. The stat(2) operation is fairly simple: once the VFS has the
dentry, it peeks at the inode data and passes some of it back to
userspace.
The File Object
---------------
Opening a file requires another operation: allocation of a file
structure (this is the kernel-side implementation of file
descriptors). The freshly allocated file structure is initialized with
a pointer to the dentry and a set of file operation member functions.
These are taken from the inode data. The open() file method is then
called so the specific filesystem implementation can do it's work. You
can see that this is another switch performed by the VFS. The file
structure is placed into the file descriptor table for the process.
Reading, writing and closing files (and other assorted VFS operations)
is done by using the userspace file descriptor to grab the appropriate
file structure, and then calling the required file structure method to
do whatever is required. For as long as the file is open, it keeps the
dentry in use, which in turn means that the VFS inode is still in use.
Registering and Mounting a Filesystem
=====================================
To register and unregister a filesystem, use the following API
functions:
#include <linux/fs.h>
extern int register_filesystem(struct file_system_type *);
extern int unregister_filesystem(struct file_system_type *);
The passed struct file_system_type describes your filesystem. When a
request is made to mount a device onto a directory in your filespace,
the VFS will call the appropriate get_sb() method for the specific
filesystem. The dentry for the mount point will then be updated to
point to the root inode for the new filesystem.
You can see all filesystems that are registered to the kernel in the
file /proc/filesystems.
struct file_system_type
-----------------------
This describes the filesystem. As of kernel 2.6.13, the following
members are defined:
struct file_system_type {
const char *name;
int fs_flags;
int (*get_sb) (struct file_system_type *, int,
const char *, void *, struct vfsmount *);
void (*kill_sb) (struct super_block *);
struct module *owner;
struct file_system_type * next;
struct list_head fs_supers;
};
name: the name of the filesystem type, such as "ext2", "iso9660",
"msdos" and so on
fs_flags: various flags (i.e. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.)
get_sb: the method to call when a new instance of this
filesystem should be mounted
kill_sb: the method to call when an instance of this filesystem
should be unmounted
owner: for internal VFS use: you should initialize this to THIS_MODULE in
most cases.
next: for internal VFS use: you should initialize this to NULL
The get_sb() method has the following arguments:
struct super_block *sb: the superblock structure. This is partially
initialized by the VFS and the rest must be initialized by the
get_sb() method
int flags: mount flags
const char *dev_name: the device name we are mounting.
void *data: arbitrary mount options, usually comes as an ASCII
string
int silent: whether or not to be silent on error
The get_sb() method must determine if the block device specified
in the superblock contains a filesystem of the type the method
supports. On success the method returns the superblock pointer, on
failure it returns NULL.
The most interesting member of the superblock structure that the
get_sb() method fills in is the "s_op" field. This is a pointer to
a "struct super_operations" which describes the next level of the
filesystem implementation.
Usually, a filesystem uses one of the generic get_sb() implementations
and provides a fill_super() method instead. The generic methods are:
get_sb_bdev: mount a filesystem residing on a block device
get_sb_nodev: mount a filesystem that is not backed by a device
get_sb_single: mount a filesystem which shares the instance between
all mounts
A fill_super() method implementation has the following arguments:
struct super_block *sb: the superblock structure. The method fill_super()
must initialize this properly.
void *data: arbitrary mount options, usually comes as an ASCII
string
int silent: whether or not to be silent on error
The Superblock Object
=====================
A superblock object represents a mounted filesystem.
struct super_operations
-----------------------
This describes how the VFS can manipulate the superblock of your
filesystem. As of kernel 2.6.13, the following members are defined:
struct super_operations {
struct inode *(*alloc_inode)(struct super_block *sb);
void (*destroy_inode)(struct inode *);
void (*read_inode) (struct inode *);
void (*dirty_inode) (struct inode *);
int (*write_inode) (struct inode *, int);
void (*put_inode) (struct inode *);
void (*drop_inode) (struct inode *);
void (*delete_inode) (struct inode *);
void (*put_super) (struct super_block *);
void (*write_super) (struct super_block *);
int (*sync_fs)(struct super_block *sb, int wait);
void (*write_super_lockfs) (struct super_block *);
void (*unlockfs) (struct super_block *);
int (*statfs) (struct dentry *, struct kstatfs *);
int (*remount_fs) (struct super_block *, int *, char *);
void (*clear_inode) (struct inode *);
void (*umount_begin) (struct super_block *);
void (*sync_inodes) (struct super_block *sb,
struct writeback_control *wbc);
int (*show_options)(struct seq_file *, struct vfsmount *);
ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
};
All methods are called without any locks being held, unless otherwise
noted. This means that most methods can block safely. All methods are
only called from a process context (i.e. not from an interrupt handler
or bottom half).
alloc_inode: this method is called by inode_alloc() to allocate memory
for struct inode and initialize it. If this function is not
defined, a simple 'struct inode' is allocated. Normally
alloc_inode will be used to allocate a larger structure which
contains a 'struct inode' embedded within it.
destroy_inode: this method is called by destroy_inode() to release
resources allocated for struct inode. It is only required if
->alloc_inode was defined and simply undoes anything done by
->alloc_inode.
read_inode: this method is called to read a specific inode from the
mounted filesystem. The i_ino member in the struct inode is
initialized by the VFS to indicate which inode to read. Other
members are filled in by this method.
You can set this to NULL and use iget5_locked() instead of iget()
to read inodes. This is necessary for filesystems for which the
inode number is not sufficient to identify an inode.
dirty_inode: this method is called by the VFS to mark an inode dirty.
write_inode: this method is called when the VFS needs to write an
inode to disc. The second parameter indicates whether the write
should be synchronous or not, not all filesystems check this flag.
put_inode: called when the VFS inode is removed from the inode
cache.
drop_inode: called when the last access to the inode is dropped,
with the inode_lock spinlock held.
This method should be either NULL (normal UNIX filesystem
semantics) or "generic_delete_inode" (for filesystems that do not
want to cache inodes - causing "delete_inode" to always be
called regardless of the value of i_nlink)
The "generic_delete_inode()" behavior is equivalent to the
old practice of using "force_delete" in the put_inode() case,
but does not have the races that the "force_delete()" approach
had.
delete_inode: called when the VFS wants to delete an inode
put_super: called when the VFS wishes to free the superblock
(i.e. unmount). This is called with the superblock lock held
write_super: called when the VFS superblock needs to be written to
disc. This method is optional
sync_fs: called when VFS is writing out all dirty data associated with
a superblock. The second parameter indicates whether the method
should wait until the write out has been completed. Optional.
write_super_lockfs: called when VFS is locking a filesystem and
forcing it into a consistent state. This method is currently
used by the Logical Volume Manager (LVM).
unlockfs: called when VFS is unlocking a filesystem and making it writable
again.
statfs: called when the VFS needs to get filesystem statistics. This
is called with the kernel lock held
remount_fs: called when the filesystem is remounted. This is called
with the kernel lock held
clear_inode: called then the VFS clears the inode. Optional
umount_begin: called when the VFS is unmounting a filesystem.
sync_inodes: called when the VFS is writing out dirty data associated with
a superblock.
show_options: called by the VFS to show mount options for /proc/<pid>/mounts.
quota_read: called by the VFS to read from filesystem quota file.
quota_write: called by the VFS to write to filesystem quota file.
The read_inode() method is responsible for filling in the "i_op"
field. This is a pointer to a "struct inode_operations" which
describes the methods that can be performed on individual inodes.
The Inode Object
================
An inode object represents an object within the filesystem.
struct inode_operations
-----------------------
This describes how the VFS can manipulate an inode in your
filesystem. As of kernel 2.6.13, the following members are defined:
struct inode_operations {
int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
int (*link) (struct dentry *,struct inode *,struct dentry *);
int (*unlink) (struct inode *,struct dentry *);
int (*symlink) (struct inode *,struct dentry *,const char *);
int (*mkdir) (struct inode *,struct dentry *,int);
int (*rmdir) (struct inode *,struct dentry *);
int (*mknod) (struct inode *,struct dentry *,int,dev_t);
int (*rename) (struct inode *, struct dentry *,
struct inode *, struct dentry *);
int (*readlink) (struct dentry *, char __user *,int);
void * (*follow_link) (struct dentry *, struct nameidata *);
void (*put_link) (struct dentry *, struct nameidata *, void *);
void (*truncate) (struct inode *);
int (*permission) (struct inode *, int, struct nameidata *);
int (*setattr) (struct dentry *, struct iattr *);
int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*removexattr) (struct dentry *, const char *);
};
Again, all methods are called without any locks being held, unless
otherwise noted.
create: called by the open(2) and creat(2) system calls. Only
required if you want to support regular files. The dentry you
get should not have an inode (i.e. it should be a negative
dentry). Here you will probably call d_instantiate() with the
dentry and the newly created inode
lookup: called when the VFS needs to look up an inode in a parent
directory. The name to look for is found in the dentry. This
method must call d_add() to insert the found inode into the
dentry. The "i_count" field in the inode structure should be
incremented. If the named inode does not exist a NULL inode
should be inserted into the dentry (this is called a negative
dentry). Returning an error code from this routine must only
be done on a real error, otherwise creating inodes with system
calls like create(2), mknod(2), mkdir(2) and so on will fail.
If you wish to overload the dentry methods then you should
initialise the "d_dop" field in the dentry; this is a pointer
to a struct "dentry_operations".
This method is called with the directory inode semaphore held
link: called by the link(2) system call. Only required if you want
to support hard links. You will probably need to call
d_instantiate() just as you would in the create() method
unlink: called by the unlink(2) system call. Only required if you
want to support deleting inodes
symlink: called by the symlink(2) system call. Only required if you
want to support symlinks. You will probably need to call
d_instantiate() just as you would in the create() method
mkdir: called by the mkdir(2) system call. Only required if you want
to support creating subdirectories. You will probably need to
call d_instantiate() just as you would in the create() method
rmdir: called by the rmdir(2) system call. Only required if you want
to support deleting subdirectories
mknod: called by the mknod(2) system call to create a device (char,
block) inode or a named pipe (FIFO) or socket. Only required
if you want to support creating these types of inodes. You
will probably need to call d_instantiate() just as you would
in the create() method
rename: called by the rename(2) system call to rename the object to
have the parent and name given by the second inode and dentry.
readlink: called by the readlink(2) system call. Only required if
you want to support reading symbolic links
follow_link: called by the VFS to follow a symbolic link to the
inode it points to. Only required if you want to support
symbolic links. This method returns a void pointer cookie
that is passed to put_link().
put_link: called by the VFS to release resources allocated by
follow_link(). The cookie returned by follow_link() is passed
to this method as the last parameter. It is used by
filesystems such as NFS where page cache is not stable
(i.e. page that was installed when the symbolic link walk
started might not be in the page cache at the end of the
walk).
truncate: called by the VFS to change the size of a file. The
i_size field of the inode is set to the desired size by the
VFS before this method is called. This method is called by
the truncate(2) system call and related functionality.
permission: called by the VFS to check for access rights on a POSIX-like
filesystem.
setattr: called by the VFS to set attributes for a file. This method
is called by chmod(2) and related system calls.
getattr: called by the VFS to get attributes of a file. This method
is called by stat(2) and related system calls.
setxattr: called by the VFS to set an extended attribute for a file.
Extended attribute is a name:value pair associated with an
inode. This method is called by setxattr(2) system call.
getxattr: called by the VFS to retrieve the value of an extended
attribute name. This method is called by getxattr(2) function
call.
listxattr: called by the VFS to list all extended attributes for a
given file. This method is called by listxattr(2) system call.
removexattr: called by the VFS to remove an extended attribute from
a file. This method is called by removexattr(2) system call.
The Address Space Object
========================
The address space object is used to group and manage pages in the page
cache. It can be used to keep track of the pages in a file (or
anything else) and also track the mapping of sections of the file into
process address spaces.
There are a number of distinct yet related services that an
address-space can provide. These include communicating memory
pressure, page lookup by address, and keeping track of pages tagged as
Dirty or Writeback.
The first can be used independently to the others. The VM can try to
either write dirty pages in order to clean them, or release clean
pages in order to reuse them. To do this it can call the ->writepage
method on dirty pages, and ->releasepage on clean pages with
PagePrivate set. Clean pages without PagePrivate and with no external
references will be released without notice being given to the
address_space.
To achieve this functionality, pages need to be placed on an LRU with
lru_cache_add and mark_page_active needs to be called whenever the
page is used.
Pages are normally kept in a radix tree index by ->index. This tree
maintains information about the PG_Dirty and PG_Writeback status of
each page, so that pages with either of these flags can be found
quickly.
The Dirty tag is primarily used by mpage_writepages - the default
->writepages method. It uses the tag to find dirty pages to call
->writepage on. If mpage_writepages is not used (i.e. the address
provides its own ->writepages) , the PAGECACHE_TAG_DIRTY tag is
almost unused. write_inode_now and sync_inode do use it (through
__sync_single_inode) to check if ->writepages has been successful in
writing out the whole address_space.
The Writeback tag is used by filemap*wait* and sync_page* functions,
via wait_on_page_writeback_range, to wait for all writeback to
complete. While waiting ->sync_page (if defined) will be called on
each page that is found to require writeback.
An address_space handler may attach extra information to a page,
typically using the 'private' field in the 'struct page'. If such
information is attached, the PG_Private flag should be set. This will
cause various VM routines to make extra calls into the address_space
handler to deal with that data.
An address space acts as an intermediate between storage and
application. Data is read into the address space a whole page at a
time, and provided to the application either by copying of the page,
or by memory-mapping the page.
Data is written into the address space by the application, and then
written-back to storage typically in whole pages, however the
address_space has finer control of write sizes.
The read process essentially only requires 'readpage'. The write
process is more complicated and uses prepare_write/commit_write or
set_page_dirty to write data into the address_space, and writepage,
sync_page, and writepages to writeback data to storage.
Adding and removing pages to/from an address_space is protected by the
inode's i_mutex.
When data is written to a page, the PG_Dirty flag should be set. It
typically remains set until writepage asks for it to be written. This
should clear PG_Dirty and set PG_Writeback. It can be actually
written at any point after PG_Dirty is clear. Once it is known to be
safe, PG_Writeback is cleared.
Writeback makes use of a writeback_control structure...
struct address_space_operations
-------------------------------
This describes how the VFS can manipulate mapping of a file to page cache in
your filesystem. As of kernel 2.6.16, the following members are defined:
struct address_space_operations {
int (*writepage)(struct page *page, struct writeback_control *wbc);
int (*readpage)(struct file *, struct page *);
int (*sync_page)(struct page *);
int (*writepages)(struct address_space *, struct writeback_control *);
int (*set_page_dirty)(struct page *page);
int (*readpages)(struct file *filp, struct address_space *mapping,
struct list_head *pages, unsigned nr_pages);
int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
sector_t (*bmap)(struct address_space *, sector_t);
int (*invalidatepage) (struct page *, unsigned long);
int (*releasepage) (struct page *, int);
ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
loff_t offset, unsigned long nr_segs);
struct page* (*get_xip_page)(struct address_space *, sector_t,
int);
/* migrate the contents of a page to the specified target */
int (*migratepage) (struct page *, struct page *);
};
writepage: called by the VM to write a dirty page to backing store.
This may happen for data integrity reasons (i.e. 'sync'), or
to free up memory (flush). The difference can be seen in
wbc->sync_mode.
The PG_Dirty flag has been cleared and PageLocked is true.
writepage should start writeout, should set PG_Writeback,
and should make sure the page is unlocked, either synchronously
or asynchronously when the write operation completes.
If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
try too hard if there are problems, and may choose to write out
other pages from the mapping if that is easier (e.g. due to
internal dependencies). If it chooses not to start writeout, it
should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
calling ->writepage on that page.
See the file "Locking" for more details.
readpage: called by the VM to read a page from backing store.
The page will be Locked when readpage is called, and should be
unlocked and marked uptodate once the read completes.
If ->readpage discovers that it needs to unlock the page for
some reason, it can do so, and then return AOP_TRUNCATED_PAGE.
In this case, the page will be relocated, relocked and if
that all succeeds, ->readpage will be called again.
sync_page: called by the VM to notify the backing store to perform all
queued I/O operations for a page. I/O operations for other pages
associated with this address_space object may also be performed.
This function is optional and is called only for pages with
PG_Writeback set while waiting for the writeback to complete.
writepages: called by the VM to write out pages associated with the
address_space object. If wbc->sync_mode is WBC_SYNC_ALL, then
the writeback_control will specify a range of pages that must be
written out. If it is WBC_SYNC_NONE, then a nr_to_write is given
and that many pages should be written if possible.
If no ->writepages is given, then mpage_writepages is used
instead. This will choose pages from the address space that are
tagged as DIRTY and will pass them to ->writepage.
set_page_dirty: called by the VM to set a page dirty.
This is particularly needed if an address space attaches
private data to a page, and that data needs to be updated when
a page is dirtied. This is called, for example, when a memory
mapped page gets modified.
If defined, it should set the PageDirty flag, and the
PAGECACHE_TAG_DIRTY tag in the radix tree.
readpages: called by the VM to read pages associated with the address_space
object. This is essentially just a vector version of
readpage. Instead of just one page, several pages are
requested.
readpages is only used for read-ahead, so read errors are
ignored. If anything goes wrong, feel free to give up.
prepare_write: called by the generic write path in VM to set up a write
request for a page. This indicates to the address space that
the given range of bytes is about to be written. The
address_space should check that the write will be able to
complete, by allocating space if necessary and doing any other
internal housekeeping. If the write will update parts of
any basic-blocks on storage, then those blocks should be
pre-read (if they haven't been read already) so that the
updated blocks can be written out properly.
The page will be locked. If prepare_write wants to unlock the
page it, like readpage, may do so and return
AOP_TRUNCATED_PAGE.
In this case the prepare_write will be retried one the lock is
regained.
Note: the page _must not_ be marked uptodate in this function
(or anywhere else) unless it actually is uptodate right now. As
soon as a page is marked uptodate, it is possible for a concurrent
read(2) to copy it to userspace.
commit_write: If prepare_write succeeds, new data will be copied
into the page and then commit_write will be called. It will
typically update the size of the file (if appropriate) and
mark the inode as dirty, and do any other related housekeeping
operations. It should avoid returning an error if possible -
errors should have been handled by prepare_write.
bmap: called by the VFS to map a logical block offset within object to
physical block number. This method is used by the FIBMAP
ioctl and for working with swap-files. To be able to swap to
a file, the file must have a stable mapping to a block
device. The swap system does not go through the filesystem
but instead uses bmap to find out where the blocks in the file
are and uses those addresses directly.
invalidatepage: If a page has PagePrivate set, then invalidatepage
will be called when part or all of the page is to be removed
from the address space. This generally corresponds to either a
truncation or a complete invalidation of the address space
(in the latter case 'offset' will always be 0).
Any private data associated with the page should be updated
to reflect this truncation. If offset is 0, then
the private data should be released, because the page
must be able to be completely discarded. This may be done by
calling the ->releasepage function, but in this case the
release MUST succeed.
releasepage: releasepage is called on PagePrivate pages to indicate
that the page should be freed if possible. ->releasepage
should remove any private data from the page and clear the
PagePrivate flag. It may also remove the page from the
address_space. If this fails for some reason, it may indicate
failure with a 0 return value.
This is used in two distinct though related cases. The first
is when the VM finds a clean page with no active users and
wants to make it a free page. If ->releasepage succeeds, the
page will be removed from the address_space and become free.
The second case if when a request has been made to invalidate
some or all pages in an address_space. This can happen
through the fadvice(POSIX_FADV_DONTNEED) system call or by the
filesystem explicitly requesting it as nfs and 9fs do (when
they believe the cache may be out of date with storage) by
calling invalidate_inode_pages2().
If the filesystem makes such a call, and needs to be certain
that all pages are invalidated, then its releasepage will
need to ensure this. Possibly it can clear the PageUptodate
bit if it cannot free private data yet.
direct_IO: called by the generic read/write routines to perform
direct_IO - that is IO requests which bypass the page cache
and transfer data directly between the storage and the
application's address space.
get_xip_page: called by the VM to translate a block number to a page.
The page is valid until the corresponding filesystem is unmounted.
Filesystems that want to use execute-in-place (XIP) need to implement
it. An example implementation can be found in fs/ext2/xip.c.
migrate_page: This is used to compact the physical memory usage.
If the VM wants to relocate a page (maybe off a memory card
that is signalling imminent failure) it will pass a new page
and an old page to this function. migrate_page should
transfer any private data across and update any references
that it has to the page.
The File Object
===============
A file object represents a file opened by a process.
struct file_operations
----------------------
This describes how the VFS can manipulate an open file. As of kernel
2.6.17, the following members are defined:
struct file_operations {
loff_t (*llseek) (struct file *, loff_t, int);
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
int (*readdir) (struct file *, void *, filldir_t);
unsigned int (*poll) (struct file *, struct poll_table_struct *);
int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *);
int (*release) (struct inode *, struct file *);
int (*fsync) (struct file *, struct dentry *, int datasync);
int (*aio_fsync) (struct kiocb *, int datasync);
int (*fasync) (int, struct file *, int);
int (*lock) (struct file *, int, struct file_lock *);
ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *);
ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
int (*check_flags)(int);
int (*dir_notify)(struct file *filp, unsigned long arg);
int (*flock) (struct file *, int, struct file_lock *);
ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, size_t, unsigned
int);
ssize_t (*splice_read)(struct file *, struct pipe_inode_info *, size_t, unsigned
int);
};
Again, all methods are called without any locks being held, unless
otherwise noted.
llseek: called when the VFS needs to move the file position index
read: called by read(2) and related system calls
aio_read: called by io_submit(2) and other asynchronous I/O operations
write: called by write(2) and related system calls
aio_write: called by io_submit(2) and other asynchronous I/O operations
readdir: called when the VFS needs to read the directory contents
poll: called by the VFS when a process wants to check if there is
activity on this file and (optionally) go to sleep until there
is activity. Called by the select(2) and poll(2) system calls
ioctl: called by the ioctl(2) system call
unlocked_ioctl: called by the ioctl(2) system call. Filesystems that do not
require the BKL should use this method instead of the ioctl() above.
compat_ioctl: called by the ioctl(2) system call when 32 bit system calls
are used on 64 bit kernels.
mmap: called by the mmap(2) system call
open: called by the VFS when an inode should be opened. When the VFS
opens a file, it creates a new "struct file". It then calls the
open method for the newly allocated file structure. You might
think that the open method really belongs in
"struct inode_operations", and you may be right. I think it's
done the way it is because it makes filesystems simpler to
implement. The open() method is a good place to initialize the
"private_data" member in the file structure if you want to point
to a device structure
flush: called by the close(2) system call to flush a file
release: called when the last reference to an open file is closed
fsync: called by the fsync(2) system call
fasync: called by the fcntl(2) system call when asynchronous
(non-blocking) mode is enabled for a file
lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW
commands
readv: called by the readv(2) system call
writev: called by the writev(2) system call
sendfile: called by the sendfile(2) system call
get_unmapped_area: called by the mmap(2) system call
check_flags: called by the fcntl(2) system call for F_SETFL command
dir_notify: called by the fcntl(2) system call for F_NOTIFY command
flock: called by the flock(2) system call
splice_write: called by the VFS to splice data from a pipe to a file. This
method is used by the splice(2) system call
splice_read: called by the VFS to splice data from file to a pipe. This
method is used by the splice(2) system call
Note that the file operations are implemented by the specific
filesystem in which the inode resides. When opening a device node
(character or block special) most filesystems will call special
support routines in the VFS which will locate the required device
driver information. These support routines replace the filesystem file
operations with those for the device driver, and then proceed to call
the new open() method for the file. This is how opening a device file
in the filesystem eventually ends up calling the device driver open()
method.
Directory Entry Cache (dcache)
==============================
struct dentry_operations
------------------------
This describes how a filesystem can overload the standard dentry
operations. Dentries and the dcache are the domain of the VFS and the
individual filesystem implementations. Device drivers have no business
here. These methods may be set to NULL, as they are either optional or
the VFS uses a default. As of kernel 2.6.13, the following members are
defined:
struct dentry_operations {
int (*d_revalidate)(struct dentry *, struct nameidata *);
int (*d_hash) (struct dentry *, struct qstr *);
int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
int (*d_delete)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
};
d_revalidate: called when the VFS needs to revalidate a dentry. This
is called whenever a name look-up finds a dentry in the
dcache. Most filesystems leave this as NULL, because all their
dentries in the dcache are valid
d_hash: called when the VFS adds a dentry to the hash table
d_compare: called when a dentry should be compared with another
d_delete: called when the last reference to a dentry is
deleted. This means no-one is using the dentry, however it is
still valid and in the dcache
d_release: called when a dentry is really deallocated
d_iput: called when a dentry loses its inode (just prior to its
being deallocated). The default when this is NULL is that the
VFS calls iput(). If you define this method, you must call
iput() yourself
Each dentry has a pointer to its parent dentry, as well as a hash list
of child dentries. Child dentries are basically like files in a
directory.
Directory Entry Cache API
--------------------------
There are a number of functions defined which permit a filesystem to
manipulate dentries:
dget: open a new handle for an existing dentry (this just increments
the usage count)
dput: close a handle for a dentry (decrements the usage count). If
the usage count drops to 0, the "d_delete" method is called
and the dentry is placed on the unused list if the dentry is
still in its parents hash list. Putting the dentry on the
unused list just means that if the system needs some RAM, it
goes through the unused list of dentries and deallocates them.
If the dentry has already been unhashed and the usage count
drops to 0, in this case the dentry is deallocated after the
"d_delete" method is called
d_drop: this unhashes a dentry from its parents hash list. A
subsequent call to dput() will deallocate the dentry if its
usage count drops to 0
d_delete: delete a dentry. If there are no other open references to
the dentry then the dentry is turned into a negative dentry
(the d_iput() method is called). If there are other
references, then d_drop() is called instead
d_add: add a dentry to its parents hash list and then calls
d_instantiate()
d_instantiate: add a dentry to the alias hash list for the inode and
updates the "d_inode" member. The "i_count" member in the
inode structure should be set/incremented. If the inode
pointer is NULL, the dentry is called a "negative
dentry". This function is commonly called when an inode is
created for an existing negative dentry
d_lookup: look up a dentry given its parent and path name component
It looks up the child of that given name from the dcache
hash table. If it is found, the reference count is incremented
and the dentry is returned. The caller must use d_put()
to free the dentry when it finishes using it.
For further information on dentry locking, please refer to the document
Documentation/filesystems/dentry-locking.txt.
Resources
=========
(Note some of these resources are not up-to-date with the latest kernel
version.)
Creating Linux virtual filesystems. 2002
<http://lwn.net/Articles/13325/>
The Linux Virtual File-system Layer by Neil Brown. 1999
<http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html>
A tour of the Linux VFS by Michael K. Johnson. 1996
<http://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html>
A small trail through the Linux kernel by Andries Brouwer. 2001
<http://www.win.tue.nl/~aeb/linux/vfs/trail.html>

View File

@@ -0,0 +1,262 @@
The SGI XFS Filesystem
======================
XFS is a high performance journaling filesystem which originated
on the SGI IRIX platform. It is completely multi-threaded, can
support large files and large filesystems, extended attributes,
variable block sizes, is extent based, and makes extensive use of
Btrees (directories, extents, free space) to aid both performance
and scalability.
Refer to the documentation at http://oss.sgi.com/projects/xfs/
for further details. This implementation is on-disk compatible
with the IRIX version of XFS.
Mount Options
=============
When mounting an XFS filesystem, the following options are accepted.
allocsize=size
Sets the buffered I/O end-of-file preallocation size when
doing delayed allocation writeout (default size is 64KiB).
Valid values for this option are page size (typically 4KiB)
through to 1GiB, inclusive, in power-of-2 increments.
attr2/noattr2
The options enable/disable (default is disabled for backward
compatibility on-disk) an "opportunistic" improvement to be
made in the way inline extended attributes are stored on-disk.
When the new form is used for the first time (by setting or
removing extended attributes) the on-disk superblock feature
bit field will be updated to reflect this format being in use.
barrier
Enables the use of block layer write barriers for writes into
the journal and unwritten extent conversion. This allows for
drive level write caching to be enabled, for devices that
support write barriers.
dmapi
Enable the DMAPI (Data Management API) event callouts.
Use with the "mtpt" option.
grpid/bsdgroups and nogrpid/sysvgroups
These options define what group ID a newly created file gets.
When grpid is set, it takes the group ID of the directory in
which it is created; otherwise (the default) it takes the fsgid
of the current process, unless the directory has the setgid bit
set, in which case it takes the gid from the parent directory,
and also gets the setgid bit set if it is a directory itself.
ihashsize=value
Sets the number of hash buckets available for hashing the
in-memory inodes of the specified mount point. If a value
of zero is used, the value selected by the default algorithm
will be displayed in /proc/mounts.
ikeep/noikeep
When inode clusters are emptied of inodes, keep them around
on the disk (ikeep) - this is the traditional XFS behaviour
and is still the default for now. Using the noikeep option,
inode clusters are returned to the free space pool.
inode64
Indicates that XFS is allowed to create inodes at any location
in the filesystem, including those which will result in inode
numbers occupying more than 32 bits of significance. This is
provided for backwards compatibility, but causes problems for
backup applications that cannot handle large inode numbers.
largeio/nolargeio
If "nolargeio" is specified, the optimal I/O reported in
st_blksize by stat(2) will be as small as possible to allow user
applications to avoid inefficient read/modify/write I/O.
If "largeio" specified, a filesystem that has a "swidth" specified
will return the "swidth" value (in bytes) in st_blksize. If the
filesystem does not have a "swidth" specified but does specify
an "allocsize" then "allocsize" (in bytes) will be returned
instead.
If neither of these two options are specified, then filesystem
will behave as if "nolargeio" was specified.
logbufs=value
Set the number of in-memory log buffers. Valid numbers range
from 2-8 inclusive.
The default value is 8 buffers for filesystems with a
blocksize of 64KiB, 4 buffers for filesystems with a blocksize
of 32KiB, 3 buffers for filesystems with a blocksize of 16KiB
and 2 buffers for all other configurations. Increasing the
number of buffers may increase performance on some workloads
at the cost of the memory used for the additional log buffers
and their associated control structures.
logbsize=value
Set the size of each in-memory log buffer.
Size may be specified in bytes, or in kilobytes with a "k" suffix.
Valid sizes for version 1 and version 2 logs are 16384 (16k) and
32768 (32k). Valid sizes for version 2 logs also include
65536 (64k), 131072 (128k) and 262144 (256k).
The default value for machines with more than 32MiB of memory
is 32768, machines with less memory use 16384 by default.
logdev=device and rtdev=device
Use an external log (metadata journal) and/or real-time device.
An XFS filesystem has up to three parts: a data section, a log
section, and a real-time section. The real-time section is
optional, and the log section can be separate from the data
section or contained within it.
mtpt=mountpoint
Use with the "dmapi" option. The value specified here will be
included in the DMAPI mount event, and should be the path of
the actual mountpoint that is used.
noalign
Data allocations will not be aligned at stripe unit boundaries.
noatime
Access timestamps are not updated when a file is read.
norecovery
The filesystem will be mounted without running log recovery.
If the filesystem was not cleanly unmounted, it is likely to
be inconsistent when mounted in "norecovery" mode.
Some files or directories may not be accessible because of this.
Filesystems mounted "norecovery" must be mounted read-only or
the mount will fail.
nouuid
Don't check for double mounted file systems using the file system uuid.
This is useful to mount LVM snapshot volumes.
osyncisosync
Make O_SYNC writes implement true O_SYNC. WITHOUT this option,
Linux XFS behaves as if an "osyncisdsync" option is used,
which will make writes to files opened with the O_SYNC flag set
behave as if the O_DSYNC flag had been used instead.
This can result in better performance without compromising
data safety.
However if this option is not in effect, timestamp updates from
O_SYNC writes can be lost if the system crashes.
If timestamp updates are critical, use the osyncisosync option.
uquota/usrquota/uqnoenforce/quota
User disk quota accounting enabled, and limits (optionally)
enforced. Refer to xfs_quota(8) for further details.
gquota/grpquota/gqnoenforce
Group disk quota accounting enabled and limits (optionally)
enforced. Refer to xfs_quota(8) for further details.
pquota/prjquota/pqnoenforce
Project disk quota accounting enabled and limits (optionally)
enforced. Refer to xfs_quota(8) for further details.
sunit=value and swidth=value
Used to specify the stripe unit and width for a RAID device or
a stripe volume. "value" must be specified in 512-byte block
units.
If this option is not specified and the filesystem was made on
a stripe volume or the stripe width or unit were specified for
the RAID device at mkfs time, then the mount system call will
restore the value from the superblock. For filesystems that
are made directly on RAID devices, these options can be used
to override the information in the superblock if the underlying
disk layout changes after the filesystem has been created.
The "swidth" option is required if the "sunit" option has been
specified, and must be a multiple of the "sunit" value.
swalloc
Data allocations will be rounded up to stripe width boundaries
when the current end of file is being extended and the file
size is larger than the stripe width size.
sysctls
=======
The following sysctls are available for the XFS filesystem:
fs.xfs.stats_clear (Min: 0 Default: 0 Max: 1)
Setting this to "1" clears accumulated XFS statistics
in /proc/fs/xfs/stat. It then immediately resets to "0".
fs.xfs.xfssyncd_centisecs (Min: 100 Default: 3000 Max: 720000)
The interval at which the xfssyncd thread flushes metadata
out to disk. This thread will flush log activity out, and
do some processing on unlinked inodes.
fs.xfs.xfsbufd_centisecs (Min: 50 Default: 100 Max: 3000)
The interval at which xfsbufd scans the dirty metadata buffers list.
fs.xfs.age_buffer_centisecs (Min: 100 Default: 1500 Max: 720000)
The age at which xfsbufd flushes dirty metadata buffers to disk.
fs.xfs.error_level (Min: 0 Default: 3 Max: 11)
A volume knob for error reporting when internal errors occur.
This will generate detailed messages & backtraces for filesystem
shutdowns, for example. Current threshold values are:
XFS_ERRLEVEL_OFF: 0
XFS_ERRLEVEL_LOW: 1
XFS_ERRLEVEL_HIGH: 5
fs.xfs.panic_mask (Min: 0 Default: 0 Max: 127)
Causes certain error conditions to call BUG(). Value is a bitmask;
AND together the tags which represent errors which should cause panics:
XFS_NO_PTAG 0
XFS_PTAG_IFLUSH 0x00000001
XFS_PTAG_LOGRES 0x00000002
XFS_PTAG_AILDELETE 0x00000004
XFS_PTAG_ERROR_REPORT 0x00000008
XFS_PTAG_SHUTDOWN_CORRUPT 0x00000010
XFS_PTAG_SHUTDOWN_IOERROR 0x00000020
XFS_PTAG_SHUTDOWN_LOGERROR 0x00000040
This option is intended for debugging only.
fs.xfs.irix_symlink_mode (Min: 0 Default: 0 Max: 1)
Controls whether symlinks are created with mode 0777 (default)
or whether their mode is affected by the umask (irix mode).
fs.xfs.irix_sgid_inherit (Min: 0 Default: 0 Max: 1)
Controls files created in SGID directories.
If the group ID of the new file does not match the effective group
ID or one of the supplementary group IDs of the parent dir, the
ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl
is set.
fs.xfs.restrict_chown (Min: 0 Default: 1 Max: 1)
Controls whether unprivileged users can use chown to "give away"
a file to another user.
fs.xfs.inherit_sync (Min: 0 Default: 1 Max: 1)
Setting this to "1" will cause the "sync" flag set
by the xfs_io(8) chattr command on a directory to be
inherited by files in that directory.
fs.xfs.inherit_nodump (Min: 0 Default: 1 Max: 1)
Setting this to "1" will cause the "nodump" flag set
by the xfs_io(8) chattr command on a directory to be
inherited by files in that directory.
fs.xfs.inherit_noatime (Min: 0 Default: 1 Max: 1)
Setting this to "1" will cause the "noatime" flag set
by the xfs_io(8) chattr command on a directory to be
inherited by files in that directory.
fs.xfs.inherit_nosymlinks (Min: 0 Default: 1 Max: 1)
Setting this to "1" will cause the "nosymlinks" flag set
by the xfs_io(8) chattr command on a directory to be
inherited by files in that directory.
fs.xfs.rotorstep (Min: 1 Default: 1 Max: 256)
In "inode32" allocation mode, this option determines how many
files the allocator attempts to allocate in the same allocation
group before moving to the next allocation group. The intent
is to control the rate at which the allocator moves between
allocation groups when allocating extents for new files.

View File

@@ -0,0 +1,67 @@
Execute-in-place for file mappings
----------------------------------
Motivation
----------
File mappings are performed by mapping page cache pages to userspace. In
addition, read&write type file operations also transfer data from/to the page
cache.
For memory backed storage devices that use the block device interface, the page
cache pages are in fact copies of the original storage. Various approaches
exist to work around the need for an extra copy. The ramdisk driver for example
does read the data into the page cache, keeps a reference, and discards the
original data behind later on.
Execute-in-place solves this issue the other way around: instead of keeping
data in the page cache, the need to have a page cache copy is eliminated
completely. With execute-in-place, read&write type operations are performed
directly from/to the memory backed storage device. For file mappings, the
storage device itself is mapped directly into userspace.
This implementation was initialy written for shared memory segments between
different virtual machines on s390 hardware to allow multiple machines to
share the same binaries and libraries.
Implementation
--------------
Execute-in-place is implemented in three steps: block device operation,
address space operation, and file operations.
A block device operation named direct_access is used to retrieve a
reference (pointer) to a block on-disk. The reference is supposed to be
cpu-addressable, physical address and remain valid until the release operation
is performed. A struct block_device reference is used to address the device,
and a sector_t argument is used to identify the individual block. As an
alternative, memory technology devices can be used for this.
The block device operation is optional, these block devices support it as of
today:
- dcssblk: s390 dcss block device driver
An address space operation named get_xip_page is used to retrieve reference
to a struct page. To address the target page, a reference to an address_space,
and a sector number is provided. A 3rd argument indicates whether the
function should allocate blocks if needed.
This address space operation is mutually exclusive with readpage&writepage that
do page cache read/write operations.
The following filesystems support it as of today:
- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
A set of file operations that do utilize get_xip_page can be found in
mm/filemap_xip.c . The following file operation implementations are provided:
- aio_read/aio_write
- readv/writev
- sendfile
The generic file operations do_sync_read/do_sync_write can be used to implement
classic synchronous IO calls.
Shortcomings
------------
This implementation is limited to storage devices that are cpu addressable at
all times (no highmem or such). It works well on rom/ram, but enhancements are
needed to make it work with flash in read+write mode.
Putting the Linux kernel and/or its modules on a xip filesystem does not mean
they are not copied.