Creation of Cybook 2416 (actually Gen4) repository

2009-12-18 17:10:00 +00:00
commit 76f20f4d40
13791 changed files with 6812321 additions and 0 deletions
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -0,0 +1,94 @@
+00-INDEX
+	- this file (info on some of the filesystems supported by linux).
+Exporting
+	- explanation of how to make filesystems exportable.
+Locking
+	- info on locking rules as they pertain to Linux VFS.
+9p.txt
+	- 9p (v9fs) is an implementation of the Plan 9 remote fs protocol.
+adfs.txt
+	- info and mount options for the Acorn Advanced Disc Filing System.
+afs.txt
+	- info and examples for the distributed AFS (Andrew File System) fs.
+affs.txt
+	- info and mount options for the Amiga Fast File System.
+automount-support.txt
+	- information about filesystem automount support.
+befs.txt
+	- information about the BeOS filesystem for Linux.
+bfs.txt
+	- info for the SCO UnixWare Boot Filesystem (BFS).
+cifs.txt
+	- description of the CIFS filesystem.
+coda.txt
+	- description of the CODA filesystem.
+configfs/
+	- directory containing configfs documentation and example code.
+cramfs.txt
+	- info on the cram filesystem for small storage (ROMs etc).
+dentry-locking.txt
+	- info on the RCU-based dcache locking model.
+directory-locking
+	- info about the locking scheme used for directory operations.
+dlmfs.txt
+	- info on the userspace interface to the OCFS2 DLM.
+ext2.txt
+	- info, mount options and specifications for the Ext2 filesystem.
+ext3.txt
+	- info, mount options and specifications for the Ext3 filesystem.
+ext4.txt
+	- info, mount options and specifications for the Ext4 filesystem.
+files.txt
+	- info on file management in the Linux kernel.
+fuse.txt
+	- info on the Filesystem in User SpacE including mount options.
+hfs.txt
+	- info on the Macintosh HFS Filesystem for Linux.
+hpfs.txt
+	- info and mount options for the OS/2 HPFS.
+isofs.txt
+	- info and mount options for the ISO 9660 (CDROM) filesystem.
+jfs.txt
+	- info and mount options for the JFS filesystem.
+ncpfs.txt
+	- info on Novell Netware(tm) filesystem using NCP protocol.
+ntfs.txt
+	- info and mount options for the NTFS filesystem (Windows NT).
+ocfs2.txt
+	- info and mount options for the OCFS2 clustered filesystem.
+porting
+	- various information on filesystem porting.
+proc.txt
+	- info on Linux's /proc filesystem.
+ramfs-rootfs-initramfs.txt
+	- info on the 'in memory' filesystems ramfs, rootfs and initramfs.
+reiser4.txt
+	- info on the Reiser4 filesystem based on dancing tree algorithms.
+relay.txt
+	- info on relay, for efficient streaming from kernel to user space.
+romfs.txt
+	- description of the ROMFS filesystem.
+smbfs.txt
+	- info on using filesystems with the SMB protocol (Win 3.11 and NT).
+spufs.txt
+	- info and mount options for the SPU filesystem used on Cell.
+sysfs-pci.txt
+	- info on accessing PCI device resources through sysfs.
+sysfs.txt
+	- info on sysfs, a ram-based filesystem for exporting kernel objects.
+sysv-fs.txt
+	- info on the SystemV/V7/Xenix/Coherent filesystem.
+tmpfs.txt
+	- info on tmpfs, a filesystem that holds all files in virtual memory.
+udf.txt
+	- info and mount options for the UDF filesystem.
+ufs.txt
+	- info on the ufs filesystem.
+vfat.txt
+	- info on using the VFAT filesystem used in Windows NT and Windows 95
+vfs.txt
+	- overview of the Virtual File System
+xfs.txt
+	- info and mount options for the XFS filesystem.
+xip.txt
+	- info on execute-in-place for file mappings.
--- a/Documentation/filesystems/9p.txt
+++ b/Documentation/filesystems/9p.txt
@@ -0,0 +1,118 @@
+	  	    v9fs: Plan 9 Resource Sharing for Linux
+		    =======================================
+
+ABOUT
+=====
+
+v9fs is a Unix implementation of the Plan 9 9p remote filesystem protocol.
+
+This software was originally developed by Ron Minnich <rminnich@lanl.gov>
+and Maya Gokhale <maya@lanl.gov>.  Additional development by Greg Watson
+<gwatson@lanl.gov> and most recently Eric Van Hensbergen
+<ericvh@gmail.com>, Latchesar Ionkov <lucho@ionkov.net> and Russ Cox
+<rsc@swtch.com>.
+
+USAGE
+=====
+
+For remote file server:
+
+	mount -t 9p 10.10.1.2 /mnt/9
+
+For Plan 9 From User Space applications (http://swtch.com/plan9)
+
+	mount -t 9p `namespace`/acme /mnt/9 -o proto=unix,uname=$USER
+
+OPTIONS
+=======
+
+  proto=name	select an alternative transport.  Valid options are
+  		currently:
+ 			unix - specifying a named pipe mount point
+ 			tcp  - specifying a normal TCP/IP connection
+ 			fd   - use passed file descriptors for connection
+                                (see rfdno and wfdno)
+
+  uname=name	user name to attempt mount as on the remote server.  The
+  		server may override or ignore this value.  Certain user
+		names may require authentication.
+
+  aname=name	aname specifies the file tree to access when the server is
+  		offering several exported file systems.
+
+  cache=mode	specifies a caching policy.  By default, no caches are used.
+			loose = no attempts are made at consistency,
+                                intended for exclusive, read-only mounts
+
+  debug=n	specifies debug level.  The debug level is a bitmask.
+  			0x01 = display verbose error messages
+			0x02 = developer debug (DEBUG_CURRENT)
+			0x04 = display 9p trace
+			0x08 = display VFS trace
+			0x10 = display Marshalling debug
+			0x20 = display RPC debug
+			0x40 = display transport debug
+			0x80 = display allocation debug
+
+  rfdno=n	the file descriptor for reading with proto=fd
+
+  wfdno=n	the file descriptor for writing with proto=fd
+
+  maxdata=n	the number of bytes to use for 9p packet payload (msize)
+
+  port=n	port to connect to on the remote server
+
+  noextend	force legacy mode (no 9p2000.u semantics)
+
+  uid		attempt to mount as a particular uid
+
+  gid		attempt to mount with a particular gid
+
+  afid		security channel - used by Plan 9 authentication protocols
+
+  nodevmap	do not map special files - represent them as normal files.
+  		This can be used to share devices/named pipes/sockets between
+		hosts.  This functionality will be expanded in later versions.
+
+RESOURCES
+=========
+
+Our current recommendation is to use Inferno (http://www.vitanuova.com/inferno)
+as the 9p server.  You can start a 9p server under Inferno by issuing the
+following command:
+   ; styxlisten -A tcp!*!564 export '#U*'
+
+The -A specifies an unauthenticated export.  The 564 is the port # (you may
+have to choose a higher port number if running as a normal user).  The '#U*'
+specifies exporting the root of the Linux name space.  You may specify a
+subset of the namespace by extending the path: '#U*'/tmp would just export
+/tmp.  For more information, see the Inferno manual pages covering styxlisten
+and export.
+
+A Linux version of the 9p server is now maintained under the npfs project
+on sourceforge (http://sourceforge.net/projects/npfs).  There is also a
+more stable single-threaded version of the server (named spfs) available from
+the same CVS repository.
+
+There are user and developer mailing lists available through the v9fs project
+on sourceforge (http://sourceforge.net/projects/v9fs).
+
+News and other information is maintained on SWiK (http://swik.net/v9fs).
+
+Bug reports may be issued through the kernel.org bugzilla 
+(http://bugzilla.kernel.org)
+
+For more information on the Plan 9 Operating System check out
+http://plan9.bell-labs.com/plan9
+
+For information on Plan 9 from User Space (Plan 9 applications and libraries
+ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9
+
+
+STATUS
+======
+
+The 2.6 kernel support is working on PPC and x86.
+
+PLEASE USE THE KERNEL BUGZILLA TO REPORT PROBLEMS. (http://bugzilla.kernel.org)
+
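Reviewer note on the debug=n option documented in 9p.txt above: because the debug level is a bitmask, several trace classes can be enabled at once by OR-ing their values. The following is a minimal userspace sketch of that arithmetic. The bit values are copied from the table in the file; the enum and function names are hypothetical labels chosen for this example and are not taken from the v9fs source.

```c
#include <assert.h>
#include <stddef.h>

/* Bit values as listed under "debug=n" in 9p.txt.
 * The SKETCH_* names are illustrative assumptions only. */
enum sketch_9p_debug {
	SKETCH_DBG_ERROR   = 0x01, /* verbose error messages */
	SKETCH_DBG_CURRENT = 0x02, /* developer debug */
	SKETCH_DBG_9P      = 0x04, /* 9p trace */
	SKETCH_DBG_VFS     = 0x08, /* VFS trace */
	SKETCH_DBG_MARSHAL = 0x10, /* marshalling debug */
	SKETCH_DBG_RPC     = 0x20, /* RPC debug */
	SKETCH_DBG_TRANS   = 0x40, /* transport debug */
	SKETCH_DBG_ALLOC   = 0x80  /* allocation debug */
};

/* Nonzero if the given class is enabled in mask. */
static int sketch_debug_enabled(unsigned int mask, unsigned int bit)
{
	return (mask & bit) != 0;
}

/* Nonzero if mask uses only the eight documented bits. */
static int sketch_debug_mask_valid(unsigned int mask)
{
	return (mask & ~0xffu) == 0;
}
```

For instance, verbose errors plus the 9p trace plus RPC debug combine to 0x01 | 0x04 | 0x20 = 0x25, so a mount requesting those three classes would pass that value as the debug level.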
--- a/Documentation/filesystems/Exporting
+++ b/Documentation/filesystems/Exporting
@@ -0,0 +1,176 @@
+
+Making Filesystems Exportable
+=============================
+
+Most filesystem operations require a dentry (or two) as a starting
+point.  Local applications have a reference-counted hold on suitable
+dentries via open file descriptors or cwd/root.  However, remote
+applications that access a filesystem via a remote filesystem protocol
+such as NFS may not be able to hold such a reference, and so need a
+different way to refer to a particular dentry.  As the alternative
+form of reference needs to be stable across renames, truncates, and
+server-reboot (among other things, though these tend to be the most
+problematic), there is no simple answer like 'filename'.
+
+The mechanism discussed here allows each filesystem implementation to
+specify how to generate an opaque (outside of the filesystem) byte
+string for any dentry, and how to find an appropriate dentry for any
+given opaque byte string.
+This byte string will be called a "filehandle fragment" as it
+corresponds to part of an NFS filehandle.
+
+A filesystem which supports the mapping between filehandle fragments
+and dentries will be termed "exportable".
+
+
+
+Dcache Issues
+-------------
+
+The dcache normally contains a proper prefix of any given filesystem
+tree.  This means that if any filesystem object is in the dcache, then
+all of the ancestors of that filesystem object are also in the dcache.
+As normal access is by filename this prefix is created naturally and
+maintained easily (by each object maintaining a reference count on
+its parent).
+
+However when objects are included into the dcache by interpreting a
+filehandle fragment, there is no automatic creation of a path prefix
+for the object.  This leads to two related but distinct features of
+the dcache that are not needed for normal filesystem access.
+
+1/ The dcache must sometimes contain objects that are not part of the
+   proper prefix, i.e. that are not connected to the root.
+2/ The dcache must be prepared for a newly found (via ->lookup) directory
+   to already have a (non-connected) dentry, and must be able to move
+   that dentry into place (based on the parent and name in the
+   ->lookup).   This is particularly needed for directories as
+   it is a dcache invariant that directories only have one dentry.
+
+To implement these features, the dcache has:
+
+a/ A dentry flag DCACHE_DISCONNECTED which is set on
+   any dentry that might not be part of the proper prefix.
+   This is set when anonymous dentries are created, and cleared when a
+   dentry is noticed to be a child of a dentry which is in the proper
+   prefix. 
+
+b/ A per-superblock list "s_anon" of dentries which are the roots of
+   subtrees that are not in the proper prefix.  These dentries, as
+   well as the proper prefix, need to be released at unmount time.  As
+   these dentries will not be hashed, they are linked together on the
+   d_hash list_head.
+
+c/ Helper routines to allocate anonymous dentries, and to help attach
+   loose directory dentries at lookup time. They are:
+    d_alloc_anon(inode) will return a dentry for the given inode.
+      If the inode already has a dentry, one of those is returned.
+      If it doesn't, a new anonymous (IS_ROOT and
+        DCACHE_DISCONNECTED) dentry is allocated and attached.
+      In the case of a directory, care is taken that only one dentry
+      can ever be attached.
+    d_splice_alias(inode, dentry) will make sure that there is a
+      dentry with the same name and parent as the given dentry, and
+      which refers to the given inode.
+      If the inode is a directory and already has a dentry, then that
+      dentry is d_moved over the given dentry.
+      If the passed dentry gets attached, care is taken that this is
+      mutually exclusive to a d_alloc_anon operation.
+      If the passed dentry is used, NULL is returned, else the used
+      dentry is returned.  This corresponds to the calling pattern of
+      ->lookup.
+  
+ 
+Filesystem Issues
+-----------------
+
+For a filesystem to be exportable it must:
+ 
+   1/ provide the filehandle fragment routines described below.
+   2/ make sure that d_splice_alias is used rather than d_add
+      when ->lookup finds an inode for a given parent and name.
+      Typically the ->lookup routine will end:
+		if (inode)
+			return d_splice_alias(inode, dentry);
+		d_add(dentry, inode);
+		return NULL;
+	}
+
+
+
+  A file system implementation declares that instances of the filesystem
+are exportable by setting the s_export_op field in the struct
+super_block.  This field must point to a "struct export_operations"
+struct which could potentially be full of NULLs, though normally at
+least get_parent will be set.
+
+ The primary operations are decode_fh and encode_fh.  
+decode_fh takes a filehandle fragment and tries to find or create a
+dentry for the object referred to by the filehandle.
+encode_fh takes a dentry and creates a filehandle fragment which can
+later be used to find/create a dentry for the same object.
+
+decode_fh will probably make use of "find_exported_dentry".
+This function lives in the "exportfs" module which a filesystem does
+not need unless it is being exported.  So rather than calling
+find_exported_dentry directly, each filesystem should call it through
+the find_exported_dentry pointer in its export_operations table.
+This field is set correctly by the exporting agent (e.g. nfsd) when a
+filesystem is exported, and before any export operations are called.
+
+find_exported_dentry needs three support functions from the
+filesystem:
+  get_name.  When given a parent dentry and a child dentry, this
+    should find a name in the directory identified by the parent
+    dentry, which leads to the object identified by the child dentry.
+    If no get_name function is supplied, a default implementation is
+    provided which uses vfs_readdir to find potential names, and
+    matches inode numbers to find the correct match.
+
+  get_parent.  When given a dentry for a directory, this should return 
+    a dentry for the parent.  Quite possibly the parent dentry will
+    have been allocated by d_alloc_anon.  
+    The default get_parent function just returns an error so any
+    filehandle lookup that requires finding a parent will fail.
+    ->lookup("..") is *not* used as a default as it can leave ".."
+    entries in the dcache which are too messy to work with.
+
+  get_dentry.  When given an opaque datum, this should find the
+    implied object and create a dentry for it (possibly with
+    d_alloc_anon). 
+    The opaque datum is whatever is passed down by the decode_fh
+    function, and is often simply a fragment of the filehandle
+    fragment.
+    decode_fh passes two datums through find_exported_dentry.  One that 
+    should be used to identify the target object, and one that can be
+    used to identify the object's parent, should that be necessary.
+    The default get_dentry function assumes that the datum contains an
+    inode number and a generation number, and it attempts to get the
+    inode using "iget" and check its validity by matching the
+    generation number.  A filesystem should only depend on the default
+    if iget can safely be used this way.
+
+If decode_fh and/or encode_fh are left as NULL, then default
+implementations are used.  These defaults are suitable for ext2 and 
+extremely similar filesystems (like ext3).
+
+The default encode_fh creates a filehandle fragment from the inode
+number and generation number of the target together with the inode
+number and generation number of the parent (if the parent is
+required).
+
+The default decode_fh extracts the target and parent datums from the
+filehandle assuming the format used by the default encode_fh and
+passes them to find_exported_dentry.
+
+
+A filehandle fragment consists of an array of 1 or more 4-byte words,
+together with a one byte "type".
+The decode_fh routine should not depend on the stated size that is
+passed to it.  This size may be larger than the original filehandle
+generated by encode_fh, in which case it will have been padded with
+nuls.  Rather, the encode_fh routine should choose a "type" which
+indicates to decode_fh how much of the filehandle is valid, and how
+it should be interpreted.
+
+ 
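Reviewer note on the filehandle fragment layout described in the Exporting file above: the text says the default encode_fh stores the target's inode number and generation number, optionally followed by the parent's, and that decode_fh should rely on the one-byte "type" rather than on the stated size, which may be padded. The following is a minimal userspace sketch of that idea, not kernel code. The type values (1 for target only, 2 for target plus parent), the struct, and all function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type codes; the document only says a one-byte type exists. */
#define SKETCH_FH_TARGET_ONLY 1
#define SKETCH_FH_WITH_PARENT 2

struct sketch_obj {
	uint32_t ino; /* inode number */
	uint32_t gen; /* generation number */
};

/* Fill words[] from target (and parent, if non-NULL).
 * On entry *nwords is the buffer capacity, on success it is the number
 * of words written.  Returns the type, or 0 if the buffer is too small. */
static uint8_t sketch_encode_fh(uint32_t *words, size_t *nwords,
				struct sketch_obj target,
				const struct sketch_obj *parent)
{
	size_t need = parent ? 4 : 2;

	if (*nwords < need)
		return 0;
	words[0] = target.ino;
	words[1] = target.gen;
	if (parent) {
		words[2] = parent->ino;
		words[3] = parent->gen;
	}
	*nwords = need;
	return parent ? SKETCH_FH_WITH_PARENT : SKETCH_FH_TARGET_ONLY;
}

/* Recover the datums.  The type, not nwords, decides how many words are
 * valid; nwords is only checked as a lower bound because the fragment
 * may have been padded.  Returns 0 on success, -1 on a bad fragment. */
static int sketch_decode_fh(const uint32_t *words, size_t nwords, uint8_t type,
			    struct sketch_obj *target,
			    struct sketch_obj *parent, int *has_parent)
{
	size_t valid;

	if (type == SKETCH_FH_TARGET_ONLY)
		valid = 2;
	else if (type == SKETCH_FH_WITH_PARENT)
		valid = 4;
	else
		return -1;
	if (nwords < valid)
		return -1;

	target->ino = words[0];
	target->gen = words[1];
	*has_parent = (valid == 4);
	if (*has_parent) {
		parent->ino = words[2];
		parent->gen = words[3];
	}
	return 0;
}
```

In this sketch, a two-word fragment that arrives padded out to six words still decodes as "target only", because the decoder trusts the type byte rather than the stated length.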
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -0,0 +1,527 @@
+	The text below describes the locking rules for VFS-related methods.
+It is (believed to be) up-to-date. *Please*, if you change anything in
+prototypes or locking protocols - update this file. And update the relevant
+instances in the tree, don't leave that to maintainers of filesystems/devices/
+etc. At the very least, put the list of dubious cases at the end of this file.
+Don't turn it into a log - maintainers of out-of-the-tree code are supposed to
+be able to use diff(1).
+	Things currently missing here: socket operations. Alexey?
+
+--------------------------- dentry_operations --------------------------
+prototypes:
+	int (*d_revalidate)(struct dentry *, int);
+	int (*d_hash) (struct dentry *, struct qstr *);
+	int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
+	int (*d_delete)(struct dentry *);
+	void (*d_release)(struct dentry *);
+	void (*d_iput)(struct dentry *, struct inode *);
+
+locking rules:
+	none have BKL
+		dcache_lock	rename_lock	->d_lock	may block
+d_revalidate:	no		no		no		yes
+d_hash		no		no		no		yes
+d_compare:	no		yes		no		no 
+d_delete:	yes		no		yes		no
+d_release:	no		no		no		yes
+d_iput:		no		no		no		yes
+
+--------------------------- inode_operations --------------------------- 
+prototypes:
+	int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
+	struct dentry * (*lookup) (struct inode *,struct dentry *,
+			struct nameidata *);
+	int (*link) (struct dentry *,struct inode *,struct dentry *);
+	int (*unlink) (struct inode *,struct dentry *);
+	int (*symlink) (struct inode *,struct dentry *,const char *);
+	int (*mkdir) (struct inode *,struct dentry *,int);
+	int (*rmdir) (struct inode *,struct dentry *);
+	int (*mknod) (struct inode *,struct dentry *,int,dev_t);
+	int (*rename) (struct inode *, struct dentry *,
+			struct inode *, struct dentry *);
+	int (*readlink) (struct dentry *, char __user *,int);
+	int (*follow_link) (struct dentry *, struct nameidata *);
+	void (*truncate) (struct inode *);
+	int (*permission) (struct inode *, int, struct nameidata *);
+	int (*setattr) (struct dentry *, struct iattr *);
+	int (*getattr) (struct vfsmount *, struct dentry *, struct kstat *);
+	int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
+	ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
+	ssize_t (*listxattr) (struct dentry *, char *, size_t);
+	int (*removexattr) (struct dentry *, const char *);
+
+locking rules:
+	all may block, none have BKL
+		i_sem(inode)
+lookup:		yes
+create:		yes
+link:		yes (both)
+mknod:		yes
+symlink:	yes
+mkdir:		yes
+unlink:		yes (both)
+rmdir:		yes (both)	(see below)
+rename:		yes (all)	(see below)
+readlink:	no
+follow_link:	no
+truncate:	yes		(see below)
+setattr:	yes
+permission:	no
+getattr:	no
+setxattr:	yes
+getxattr:	no
+listxattr:	no
+removexattr:	yes
+	Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_sem on
+victim.
+	cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
+	->truncate() is never called directly - it's a callback, not a
+method. It's called by vmtruncate() - library function normally used by
+->setattr(). Locking information above applies to that call (i.e. is
+inherited from ->setattr() - vmtruncate() is used when ATTR_SIZE had been
+passed).
+
+See Documentation/filesystems/directory-locking for more detailed discussion
+of the locking scheme for directory operations.
+
+--------------------------- super_operations ---------------------------
+prototypes:
+	struct inode *(*alloc_inode)(struct super_block *sb);
+	void (*destroy_inode)(struct inode *);
+	void (*read_inode) (struct inode *);
+	void (*dirty_inode) (struct inode *);
+	int (*write_inode) (struct inode *, int);
+	void (*put_inode) (struct inode *);
+	void (*drop_inode) (struct inode *);
+	void (*delete_inode) (struct inode *);
+	void (*put_super) (struct super_block *);
+	void (*write_super) (struct super_block *);
+	int (*sync_fs)(struct super_block *sb, int wait);
+	void (*write_super_lockfs) (struct super_block *);
+	void (*unlockfs) (struct super_block *);
+	int (*statfs) (struct dentry *, struct kstatfs *);
+	int (*remount_fs) (struct super_block *, int *, char *);
+	void (*clear_inode) (struct inode *);
+	void (*umount_begin) (struct super_block *);
+	int (*show_options)(struct seq_file *, struct vfsmount *);
+	ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
+	ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
+
+locking rules:
+	All may block.
+			BKL	s_lock	s_umount
+alloc_inode:		no	no	no
+destroy_inode:		no
+read_inode:		no				(see below)
+dirty_inode:		no				(must not sleep)
+write_inode:		no
+put_inode:		no
+drop_inode:		no				!!!inode_lock!!!
+delete_inode:		no
+put_super:		yes	yes	no
+write_super:		no	yes	read
+sync_fs:		no	no	read
+write_super_lockfs:	?
+unlockfs:		?
+statfs:			no	no	no
+remount_fs:		yes	yes	maybe		(see below)
+clear_inode:		no
+umount_begin:		yes	no	no
+show_options:		no				(vfsmount->sem)
+quota_read:		no	no	no		(see below)
+quota_write:		no	no	no		(see below)
+
+->read_inode() is not a method - it's a callback used in iget().
+->remount_fs() will have the s_umount lock if it's already mounted.
+When called from get_sb_single, it does NOT have the s_umount lock.
+->quota_read() and ->quota_write() functions are both guaranteed to
+be the only ones operating on the quota file by the quota code (via
+dqio_sem) (unless an admin really wants to screw up something and
+writes to quota files with quotas on). For other details about locking
+see also dquot_operations section.
+
+--------------------------- file_system_type ---------------------------
+prototypes:
+	int (*get_sb) (struct file_system_type *, int,
+		       const char *, void *, struct vfsmount *);
+	void (*kill_sb) (struct super_block *);
+locking rules:
+		may block	BKL
+get_sb		yes		yes
+kill_sb		yes		yes
+
+->get_sb() returns error or 0 with locked superblock attached to the vfsmount
+(exclusive on ->s_umount).
+->kill_sb() takes a write-locked superblock, does all shutdown work on it,
+unlocks and drops the reference.
+
+--------------------------- address_space_operations --------------------------
+prototypes:
+	int (*writepage)(struct page *page, struct writeback_control *wbc);
+	int (*readpage)(struct file *, struct page *);
+	int (*sync_page)(struct page *);
+	int (*writepages)(struct address_space *, struct writeback_control *);
+	int (*set_page_dirty)(struct page *page);
+	int (*readpages)(struct file *filp, struct address_space *mapping,
+			struct list_head *pages, unsigned nr_pages);
+	int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
+	int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
+	sector_t (*bmap)(struct address_space *, sector_t);
+	int (*invalidatepage) (struct page *, unsigned long);
+	int (*releasepage) (struct page *, int);
+	int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
+			loff_t offset, unsigned long nr_segs);
+	int (*launder_page) (struct page *);
+
+locking rules:
+	All except set_page_dirty may block
+
+			BKL	PageLocked(page)
+writepage:		no	yes, unlocks (see below)
+readpage:		no	yes, unlocks
+sync_page:		no	maybe
+writepages:		no
+set_page_dirty		no	no
+readpages:		no
+prepare_write:		no	yes
+commit_write:		no	yes
+bmap:			yes
+invalidatepage:		no	yes
+releasepage:		no	yes
+direct_IO:		no
+launder_page:		no	yes
+
+	->prepare_write(), ->commit_write(), ->sync_page() and ->readpage()
+may be called from the request handler (/dev/loop).
+
+	->readpage() unlocks the page, either synchronously or via I/O
+completion.
+
+	->readpages() populates the pagecache with the passed pages and starts
+I/O against them.  They come unlocked upon I/O completion.
+
+	->writepage() is used for two purposes: for "memory cleansing" and for
+"sync".  These are quite different operations and the behaviour may differ
+depending upon the mode.
+
+If writepage is called for sync (wbc->sync_mode != WBC_SYNC_NONE) then
+it *must* start I/O against the page, even if that would involve
+blocking on in-progress I/O.
+
+If writepage is called for memory cleansing (sync_mode ==
+WBC_SYNC_NONE) then its role is to get as much writeout underway as
+possible.  So writepage should try to avoid blocking against
+currently-in-progress I/O.
+
+If the filesystem is not called for "sync" and it determines that it
+would need to block against in-progress I/O to be able to start new I/O
+against the page the filesystem should redirty the page with
+redirty_page_for_writepage(), then unlock the page and return zero.
+This may also be done to avoid internal deadlocks, but rarely.
+
+If the filesystem is called for sync then it must wait on any
+in-progress I/O and then start new I/O.
+
+The filesystem should unlock the page synchronously, before returning to the
+caller, unless ->writepage() returns special WRITEPAGE_ACTIVATE
+value. WRITEPAGE_ACTIVATE means that page cannot really be written out
+currently, and VM should stop calling ->writepage() on this page for some
+time. VM does this by moving page to the head of the active list, hence the
+name.
+
+Unless the filesystem is going to redirty_page_for_writepage(), unlock the page
+and return zero, writepage *must* run set_page_writeback() against the page,
+followed by unlocking it.  Once set_page_writeback() has been run against the
+page, write I/O can be submitted and the write I/O completion handler must run
+end_page_writeback() once the I/O is complete.  If no I/O is submitted, the
+filesystem must run end_page_writeback() against the page before returning from
+writepage.
+
+That is: after 2.5.12, pages which are under writeout are *not* locked.  Note,
+if the filesystem needs the page to be locked during writeout, that is ok, too,
+the page is allowed to be unlocked at any point in time between the calls to
+set_page_writeback() and end_page_writeback().
+
+Note, failure to run either redirty_page_for_writepage() or the combination of
+set_page_writeback()/end_page_writeback() on a page submitted to writepage
+will leave the page itself marked clean but it will be tagged as dirty in the
+radix tree.  This incoherency can lead to all sorts of hard-to-debug problems
+in the filesystem like having dirty inodes at umount and losing written data.
+
+	->sync_page() locking rules are not well-defined - usually it is called
+with lock on page, but that is not guaranteed. Considering the currently
+existing instances of this method ->sync_page() itself doesn't look
+well-defined...
+
+	->writepages() is used for periodic writeback and for syscall-initiated
+sync operations.  The address_space should start I/O against at least
+*nr_to_write pages.  *nr_to_write must be decremented for each page which is
+written.  The address_space implementation may write more (or fewer) pages
+than *nr_to_write asks for, but it should try to be reasonably close.  If
+nr_to_write is NULL, all dirty pages must be written.
+
+writepages should _only_ write pages which are present on
+mapping->io_pages.
+
+	->set_page_dirty() is called from various places in the kernel
+when the target page is marked as needing writeback.  It may be called
+under spinlock (it cannot block) and is sometimes called with the page
+not locked.
+
+	->bmap() is currently used by legacy ioctl() (FIBMAP) provided by some
+filesystems and by the swapper. The latter will eventually go away. No
+instances actually need the BKL. Please, keep it that way and don't
+breed new callers.
+
+	->invalidatepage() is called when the filesystem must attempt to drop
+some or all of the buffers from the page when it is being truncated.  It
+returns zero on success.  If ->invalidatepage is zero, the kernel uses
+block_invalidatepage() instead.
+
+	->releasepage() is called when the kernel is about to try to drop the
+buffers from the page in preparation for freeing it.  It returns zero to
+indicate that the buffers are (or may be) freeable.  If ->releasepage is zero,
+the kernel assumes that the fs has no private interest in the buffers.
+
+	->launder_page() may be called prior to releasing a page if
+it is still found to be dirty. It returns zero if the page was successfully
+cleaned, or an error value if not. Note that in order to prevent the page
+getting mapped back in and redirtied, it needs to be kept locked
+across the entire operation.
+
+	Note: currently almost all instances of address_space methods are
+using BKL for internal serialization and that's one of the worst sources
+of contention. Normally they are calling library functions (in fs/buffer.c)
+and pass foo_get_block() as a callback (on local block-based filesystems,
+indeed). BKL is not needed for library stuff and is usually taken by
+foo_get_block(). It's an overkill, since block bitmaps can be protected by
+internal fs locking and real critical areas are much smaller than the areas
+filesystems protect now.
+
+----------------------- file_lock_operations ------------------------------
+prototypes:
+	void (*fl_insert)(struct file_lock *);	/* lock insertion callback */
+	void (*fl_remove)(struct file_lock *);	/* lock removal callback */
+	void (*fl_copy_lock)(struct file_lock *, struct file_lock *);
+	void (*fl_release_private)(struct file_lock *);
+
+
+locking rules:
+			BKL	may block
+fl_insert:		yes	no
+fl_remove:		yes	no
+fl_copy_lock:		yes	no
+fl_release_private:	yes	yes
+
+----------------------- lock_manager_operations ---------------------------
+prototypes:
+	int (*fl_compare_owner)(struct file_lock *, struct file_lock *);
+	void (*fl_notify)(struct file_lock *);  /* unblock callback */
+	void (*fl_copy_lock)(struct file_lock *, struct file_lock *);
+	void (*fl_release_private)(struct file_lock *);
+	void (*fl_break)(struct file_lock *); /* break_lease callback */
+
+locking rules:
+			BKL	may block
+fl_compare_owner:	yes	no
+fl_notify:		yes	no
+fl_copy_lock:		yes	no
+fl_release_private:	yes	yes
+fl_break:		yes	no
+
+	Currently only NFSD and NLM provide instances of this class. None of
+them block. If you have out-of-tree instances - please, show up. Locking
+in that area will change.
+--------------------------- buffer_head -----------------------------------
+prototypes:
+	void (*b_end_io)(struct buffer_head *bh, int uptodate);
+
+locking rules:
+	called from interrupts. In other words, extreme care is needed here.
+bh is locked, but that is the only guarantee we have here. Currently only RAID1,
+highmem, fs/buffer.c, and fs/ntfs/aops.c are providing these. Block devices
+call this method upon the IO completion.
+
+--------------------------- block_device_operations -----------------------
+prototypes:
+	int (*open) (struct inode *, struct file *);
+	int (*release) (struct inode *, struct file *);
+	int (*ioctl) (struct inode *, struct file *, unsigned, unsigned long);
+	int (*media_changed) (struct gendisk *);
+	int (*revalidate_disk) (struct gendisk *);
+
+locking rules:
+			BKL	bd_sem
+open:			yes	yes
+release:		yes	yes
+ioctl:			yes	no
+media_changed:		no	no
+revalidate_disk:	no	no
+
+The last two are called only from check_disk_change().
+
+--------------------------- file_operations -------------------------------
+prototypes:
+	loff_t (*llseek) (struct file *, loff_t, int);
+	ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
+	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
+	ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
+	ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
+	int (*readdir) (struct file *, void *, filldir_t);
+	unsigned int (*poll) (struct file *, struct poll_table_struct *);
+	int (*ioctl) (struct inode *, struct file *, unsigned int,
+			unsigned long);
+	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
+	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
+	int (*mmap) (struct file *, struct vm_area_struct *);
+	int (*open) (struct inode *, struct file *);
+	int (*flush) (struct file *);
+	int (*release) (struct inode *, struct file *);
+	int (*fsync) (struct file *, struct dentry *, int datasync);
+	int (*aio_fsync) (struct kiocb *, int datasync);
+	int (*fasync) (int, struct file *, int);
+	int (*lock) (struct file *, int, struct file_lock *);
+	ssize_t (*readv) (struct file *, const struct iovec *, unsigned long,
+			loff_t *);
+	ssize_t (*writev) (struct file *, const struct iovec *, unsigned long,
+			loff_t *);
+	ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t,
+			void __user *);
+	ssize_t (*sendpage) (struct file *, struct page *, int, size_t,
+			loff_t *, int);
+	unsigned long (*get_unmapped_area)(struct file *, unsigned long,
+			unsigned long, unsigned long, unsigned long);
+	int (*check_flags)(int);
+	int (*dir_notify)(struct file *, unsigned long);
+};
+
+locking rules:
+	All except ->poll() may block.
+			BKL
+llseek:			no	(see below)
+read:			no
+aio_read:		no
+write:			no
+aio_write:		no
+readdir: 		no
+poll:			no
+ioctl:			yes	(see below)
+unlocked_ioctl:		no	(see below)
+compat_ioctl:		no
+mmap:			no
+open:			maybe	(see below)
+flush:			no
+release:		no
+fsync:			no	(see below)
+aio_fsync:		no
+fasync:			yes	(see below)
+lock:			yes
+readv:			no
+writev:			no
+sendfile:		no
+sendpage:		no
+get_unmapped_area:	no
+check_flags:		no
+dir_notify:		no
+
+->llseek() locking has moved from llseek to the individual llseek
+implementations.  If your fs is not using generic_file_llseek, you
+need to acquire and release the appropriate locks in your ->llseek().
+For many filesystems, it is probably safe to acquire the inode
+semaphore.  Note some filesystems (i.e. remote ones) provide no
+protection for i_size so you will need to use the BKL.
+
+->open() locking is in-transit: big lock partially moved into the methods.
+The only exception is ->open() in the instances of file_operations that never
+end up in ->i_fop/->proc_fops, i.e. ones that belong to character devices
+(chrdev_open() takes the lock before replacing ->f_op and calling the secondary
+method). As soon as we fix the handling of module reference counters all
+instances of ->open() will be called without the BKL.
+
+Note: ext2_release() was *the* source of contention on fs-intensive
+loads and dropping BKL on ->release() helps to get rid of that (we still
+grab BKL for cases when we close a file that had been opened r/w, but that
+can and should be done using the internal locking with smaller critical areas).
+Current worst offender is ext2_get_block()...
+
+->fasync() is a mess. This area needs a big cleanup and that will probably
+affect locking.
+
+->readdir() and ->ioctl() on directories must be changed. Ideally we would
+move ->readdir() to inode_operations and use a separate method for directory
+->ioctl() or kill the latter completely. One of the problems is that for
+anything that resembles union-mount we won't have a struct file for all
+components. And there are other reasons why the current interface is a mess...
+
+->ioctl() on regular files is superseded by the ->unlocked_ioctl() that
+doesn't take the BKL.
+
+->read on directories probably must go away - we should just enforce -EISDIR
+in sys_read() and friends.
+
+->fsync() has i_sem on inode.
+
+--------------------------- dquot_operations -------------------------------
+prototypes:
+	int (*initialize) (struct inode *, int);
+	int (*drop) (struct inode *);
+	int (*alloc_space) (struct inode *, qsize_t, int);
+	int (*alloc_inode) (const struct inode *, unsigned long);
+	int (*free_space) (struct inode *, qsize_t);
+	int (*free_inode) (const struct inode *, unsigned long);
+	int (*transfer) (struct inode *, struct iattr *);
+	int (*write_dquot) (struct dquot *);
+	int (*acquire_dquot) (struct dquot *);
+	int (*release_dquot) (struct dquot *);
+	int (*mark_dirty) (struct dquot *);
+	int (*write_info) (struct super_block *, int);
+
+These operations are intended to be more or less wrapping functions that ensure
+proper locking wrt the filesystem and call the generic quota operations.
+
+What a filesystem should expect from the generic quota functions:
+
+		FS recursion	Held locks when called
+initialize:	yes		maybe dqonoff_sem
+drop:		yes		-
+alloc_space:	->mark_dirty()	-
+alloc_inode:	->mark_dirty()	-
+free_space:	->mark_dirty()	-
+free_inode:	->mark_dirty()	-
+transfer:	yes		-
+write_dquot:	yes		dqonoff_sem or dqptr_sem
+acquire_dquot:	yes		dqonoff_sem or dqptr_sem
+release_dquot:	yes		dqonoff_sem or dqptr_sem
+mark_dirty:	no		-
+write_info:	yes		dqonoff_sem
+
+FS recursion means calling ->quota_read() and ->quota_write() from superblock
+operations.
+
+->alloc_space(), ->alloc_inode(), ->free_space(), ->free_inode() are called
+only directly by the filesystem and do not call any fs functions other than
+the ->mark_dirty() operation.
+
+More details about quota locking can be found in fs/dquot.c.
+
+--------------------------- vm_operations_struct -----------------------------
+prototypes:
+	void (*open)(struct vm_area_struct*);
+	void (*close)(struct vm_area_struct*);
+	struct page *(*nopage)(struct vm_area_struct*, unsigned long, int *);
+
+locking rules:
+		BKL	mmap_sem
+open:		no	yes
+close:		no	yes
+nopage:		no	yes
+
+================================================================================
+			Dubious stuff
+
+(if you break something or notice that it is broken and do not fix it yourself
+- at least put it here)
+
+ipc/shm.c::shm_delete() - may need BKL.
+->read() and ->write() in many drivers are (probably) missing BKL.
+drivers/sgi/char/graphics.c::sgi_graphics_nopage() - may need BKL.
--- a/Documentation/filesystems/adfs.txt
+++ b/Documentation/filesystems/adfs.txt
@@ -0,0 +1,57 @@
+Mount options for ADFS
+----------------------
+
+  uid=nnn	All files in the partition will be owned by
+		user id nnn.  Default 0 (root).
+  gid=nnn	All files in the partition will be in group
+		nnn.  Default 0 (root).
+  ownmask=nnn	The permission mask for ADFS 'owner' permissions
+		will be nnn.  Default 0700.
+  othmask=nnn	The permission mask for ADFS 'other' permissions
+		will be nnn.  Default 0077.
+
+Mapping of ADFS permissions to Linux permissions
+------------------------------------------------
+
+  ADFS permissions consist of the following:
+
+	Owner read
+	Owner write
+	Other read
+	Other write
+
+  (In older versions, an 'execute' permission did exist, but this
+   does not hold the same meaning as the Linux 'execute' permission
+   and is now obsolete).
+
+  The mapping is performed as follows:
+
+	Owner read				-> -r--r--r--
+	Owner write				-> --w--w--w-
+	Owner read and filetype UnixExec	-> ---x--x--x
+    These are then masked by ownmask, eg 700	-> -rwx------
+	Possible owner mode permissions		-> -rwx------
+
+	Other read				-> -r--r--r--
+	Other write				-> --w--w--w-
+	Other read and filetype UnixExec	-> ---x--x--x
+    These are then masked by othmask, eg 077	-> ----rwxrwx
+	Possible other mode permissions		-> ----rwxrwx
+
+  Hence, with the default masks, if a file is owner read/write, and
+  not a UnixExec filetype, then the permissions will be:
+
+			-rw-------
+
+  However, if the masks were ownmask=0770,othmask=0007, then this would
+  be modified to:
+			-rw-rw----
+
+  There is no restriction on what you can do with these masks.  You may
+  wish that either read bit gives read access to the file for all, but
+  keep the default write protection (ownmask=0755,othmask=0577):
+
+			-rw-r--r--
+
+  You can therefore tailor the permission translation to whatever you
+  desire the permissions should be under Linux.
--- a/Documentation/filesystems/affs.txt
+++ b/Documentation/filesystems/affs.txt
@@ -0,0 +1,219 @@
+Overview of Amiga Filesystems
+=============================
+
+Not all varieties of the Amiga filesystems are supported for reading and
+writing. The Amiga currently knows six different filesystems:
+
+DOS\0		The old or original filesystem, not really suited for
+		hard disks and normally not used on them, either.
+		Supported read/write.
+
+DOS\1		The original Fast File System. Supported read/write.
+
+DOS\2		The old "international" filesystem. International means that
+		a bug has been fixed so that accented ("international") letters
+		in file names are case-insensitive, as they ought to be.
+		Supported read/write.
+
+DOS\3		The "international" Fast File System.  Supported read/write.
+
+DOS\4		The original filesystem with directory cache. The directory
+		cache speeds up directory accesses on floppies considerably,
+		but slows down file creation/deletion. Doesn't make much
+		sense on hard disks. Supported read only.
+
+DOS\5		The Fast File System with directory cache. Supported read only.
+
+All of the above filesystems allow block sizes from 512 to 32K bytes.
+Supported block sizes are: 512, 1024, 2048 and 4096 bytes. Larger blocks
+speed up almost everything at the expense of wasted disk space. The speed
+gain above 4K seems not really worth the price, so you don't lose too
+much here, either.
+
+The muFS (multi user File System) equivalents of the above file systems
+are supported, too.
+
+Mount options for the AFFS
+==========================
+
+protect		If this option is set, the protection bits cannot be altered.
+
+setuid[=uid]	This sets the owner of all files and directories in the file
+		system to uid or the uid of the current user, respectively.
+
+setgid[=gid]	Same as above, but for gid.
+
+mode=mode	Sets the mode flags to the given (octal) value, regardless
+		of the original permissions. Directories will get an x
+		permission if the corresponding r bit is set.
+		This is useful since most of the plain AmigaOS files
+		will map to 600.
+
+reserved=num	Sets the number of reserved blocks at the start of the
+		partition to num. You should never need this option.
+		Default is 2.
+
+root=block	Sets the block number of the root block. This should never
+		be necessary.
+
+bs=blksize	Sets the blocksize to blksize. Valid block sizes are 512,
+		1024, 2048 and 4096. Like the root option, this should
+		never be necessary, as the affs can figure it out itself.
+
+quiet		The file system will not return an error for disallowed
+		mode changes.
+
+verbose		The volume name, file system type and block size will
+		be written to the syslog when the filesystem is mounted.
+
+mufs		The filesystem is really a muFS, although it doesn't
+		identify itself as one. This option is necessary if
+		the filesystem wasn't formatted as muFS, but is used
+		as one.
+
+prefix=path	Path will be prefixed to every absolute path name of
+		symbolic links on an AFFS partition. Default = "/".
+		(See below.)
+
+volume=name	When symbolic links with an absolute path are created
+		on an AFFS partition, name will be prepended as the
+		volume name. Default = "" (empty string).
+		(See below.)
+
+Handling of the Users/Groups and protection flags
+=================================================
+
+Amiga -> Linux:
+
+The Amiga protection flags RWEDRWEDHSPARWED are handled as follows:
+
+  - R maps to r for user, group and others. On directories, R implies x.
+
+  - If both W and D are allowed, w will be set.
+
+  - E maps to x.
+
+  - H and P are always retained and ignored under Linux.
+
+  - A is always reset when a file is written to.
+
+User id and group id will be used unless set[gu]id are given as mount
+options. Since most of the Amiga file systems are single user systems
+they will be owned by root. The root directory (the mount point) of the
+Amiga filesystem will be owned by the user who actually mounts the
+filesystem (the root directory doesn't have uid/gid fields).
+
+Linux -> Amiga:
+
+The Linux rwxrwxrwx file mode is handled as follows:
+
+  - r permission will set R for user, group and others.
+
+  - w permission will set W and D for user, group and others.
+
+  - x permission of the user will set E for plain files.
+
+  - All other flags (suid, sgid, ...) are ignored and will
+    not be retained.
+    
+Newly created files and directories will get the user and group ID
+of the current user and a mode according to the umask.
+
+Symbolic links
+==============
+
+Although the Amiga and Linux file systems resemble each other, there
+are some, not always subtle, differences. One of them becomes apparent
+with symbolic links. While Linux has a file system with exactly one
+root directory, the Amiga has a separate root directory for each
+file system (for example, partition, floppy disk, ...). With the Amiga,
+these entities are called "volumes". They have symbolic names which
+can be used to access them. Thus, symbolic links can point to a
+different volume. AFFS turns the volume name into a directory name
+and prepends the prefix path (see prefix option) to it.
+
+Example:
+You mount all your Amiga partitions under /amiga/<volume> (where
+<volume> is the name of the volume), and you give the option
+"prefix=/amiga/" when mounting all your AFFS partitions. (They
+might be "User", "WB" and "Graphics", the mount points /amiga/User,
+/amiga/WB and /amiga/Graphics). A symbolic link referring to
+"User:sc/include/dos/dos.h" will be followed to
+"/amiga/User/sc/include/dos/dos.h".
+
+Examples
+========
+
+Command line:
+    mount  Archive/Amiga/Workbench3.1.adf /mnt -t affs -o loop,verbose
+    mount  /dev/sda3 /Amiga -t affs
+
+/etc/fstab entry:
+    /dev/sdb5	/amiga/Workbench    affs    noauto,user,exec,verbose 0 0
+
+IMPORTANT NOTE
+==============
+
+If you boot Windows 95 (don't know about 3.x, 98 and NT) while you
+have an Amiga harddisk connected to your PC, it will overwrite
+the bytes 0x00dc..0x00df of block 0 with garbage, thus invalidating
+the Rigid Disk Block. Sheer luck has it that this is an unused
+area of the RDB, so only the checksum doesn't match anymore.
+Linux will ignore this garbage and recognize the RDB anyway, but
+before you connect that drive to your Amiga again, you must
+restore or repair your RDB. So please do make a backup copy of it
+before booting Windows!
+
+If the damage is already done, the following should fix the RDB
+(where <disk> is the device name).
+DO AT YOUR OWN RISK:
+
+  dd if=/dev/<disk> of=rdb.tmp count=1
+  cp rdb.tmp rdb.fixed
+  dd if=/dev/zero of=rdb.fixed bs=1 seek=220 count=4
+  dd if=rdb.fixed of=/dev/<disk>
+
+Bugs, Restrictions, Caveats
+===========================
+
+Quite a few things may not work as advertised. Not everything is
+tested, though several hundred MB have been read and written using
+this fs. For the most up-to-date list of bugs please consult
+fs/affs/Changes.
+
+Filenames are truncated to 30 characters without warning (this
+can be changed by setting the compile-time option AFFS_NO_TRUNCATE
+in include/linux/amigaffs.h).
+
+Case is ignored by the affs in filename matching, but Linux shells
+do care about the case. Example (with /wb being an affs mounted fs):
+    rm /wb/WRONGCASE
+will remove /wb/wrongcase, but
+    rm /wb/WR*
+will not since the names are matched by the shell.
+
+The block allocation is designed for hard disk partitions. If more
+than 1 process writes to a (small) diskette, the blocks are allocated
+in an ugly way (but the real AFFS doesn't do much better). This
+is also true when space gets tight.
+
+You cannot execute programs on an OFS (Old File System), since the
+program files cannot be memory mapped due to the 488 byte blocks.
+For the same reason you cannot mount an image on such a filesystem
+via the loopback device.
+
+The bitmap valid flag in the root block may not be accurate when the
+system crashes while an affs partition is mounted. There's currently
+no way to fix a garbled filesystem without an Amiga (disk validator)
+or manually (who would do this?). Maybe later.
+
+If you mount affs partitions on system startup, you may want to tell
+fsck that the fs should not be checked (place a '0' in the sixth field
+of /etc/fstab).
+
+It's not possible to read floppy disks with a normal PC or workstation
+due to an incompatibility with the Amiga floppy controller.
+
+If you are interested in an Amiga Emulator for Linux, look at
+
+http://www.freiburg.linux.de/~uae/
--- a/Documentation/filesystems/afs.txt
+++ b/Documentation/filesystems/afs.txt
@@ -0,0 +1,155 @@
+			     kAFS: AFS FILESYSTEM
+			     ====================
+
+ABOUT
+=====
+
+This filesystem provides a fairly simple AFS filesystem driver. It is under
+development and only provides very basic facilities. It does not yet support
+the following AFS features:
+
+	(*) Write support.
+	(*) Communications security.
+	(*) Local caching.
+	(*) pioctl() system call.
+	(*) Automatic mounting of embedded mountpoints.
+
+
+USAGE
+=====
+
+When inserting the driver modules the root cell must be specified along with a
+list of volume location server IP addresses:
+
+	insmod rxrpc.o
+	insmod kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91
+
+The first module is a driver for the RxRPC remote operation protocol, and the
+second is the actual filesystem driver for the AFS filesystem.
+
+Once the module has been loaded, more cells can be added by the following
+procedure:
+
+	echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells
+
+Where the parameters to the "add" command are the name of a cell and a list of
+volume location servers within that cell.
+
+Filesystems can be mounted anywhere by commands similar to the following:
+
+	mount -t afs "%cambridge.redhat.com:root.afs." /afs
+	mount -t afs "#cambridge.redhat.com:root.cell." /afs/cambridge
+	mount -t afs "#root.afs." /afs
+	mount -t afs "#root.cell." /afs/cambridge
+
+  NB: When using this on Linux 2.4, the mount command has to be different,
+      since the filesystem doesn't have access to the device name argument:
+
+	mount -t afs none /afs -ovol="#root.afs."
+
+Where the initial character is either a hash or a percent symbol depending on
+whether you definitely want a R/W volume (percent) or whether you'd prefer a R/O
+volume, but are willing to use a R/W volume instead (hash).
+
+The name of the volume can be suffixed with ".backup" or ".readonly" to
+specify connection to only volumes of those types.
+
+The name of the cell is optional, and if not given during a mount, then the
+named volume will be looked up in the cell specified during insmod.
+
+Additional cells can be added through /proc (see later section).
+
+
+MOUNTPOINTS
+===========
+
+AFS has a concept of mountpoints. These are specially formatted symbolic links
+(of the same form as the "device name" passed to mount). kAFS presents these
+to the user as directories that have special properties:
+
+  (*) They cannot be listed. Running a program like "ls" on them will incur an
+      EREMOTE error (Object is remote).
+
+  (*) Other objects can't be looked up inside of them. This also incurs an
+      EREMOTE error.
+
+  (*) They can be queried with the readlink() system call, which will return
+      the name of the mountpoint to which they point. The "readlink" program
+      will also work.
+
+  (*) They can be mounted on (which symbolic links can't).
+
+
+PROC FILESYSTEM
+===============
+
+The rxrpc module creates a number of files in various places in the /proc
+filesystem:
+
+  (*) Firstly, some information files are made available in a directory called
+      "/proc/net/rxrpc/". These list the extant transport endpoint, peer,
+      connection and call records.
+
+  (*) Secondly, some control files are made available in a directory called
+      "/proc/sys/rxrpc/". Currently, all these files can be used for is to
+      turn on various levels of tracing.
+
+The AFS module creates a "/proc/fs/afs/" directory and populates it:
+
+  (*) A "cells" file that lists cells currently known to the afs module.
+
+  (*) A directory per cell that contains files that list volume location
+      servers, volumes, and active servers known within that cell.
+
+
+THE CELL DATABASE
+=================
+
+The filesystem maintains an internal database of all the cells it knows and
+the IP addresses of the volume location servers for those cells. The cell to
+which the computer belongs is added to the database when insmod is performed
+by the "rootcell=" argument.
+
+Further cells can be added by commands similar to the following:
+
+	echo add CELLNAME VLADDR[:VLADDR][:VLADDR]... >/proc/fs/afs/cells
+	echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells
+
+No other cell database operations are available at this time.
+
+
+EXAMPLES
+========
+
+Here's what I use to test this. Some of the names and IP addresses are local
+to my internal DNS. My "root.afs" partition has a mount point within it for
+some public volumes.
+
+insmod -S /tmp/rxrpc.o 
+insmod -S /tmp/kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91
+
+mount -t afs \%root.afs. /afs
+mount -t afs \%cambridge.redhat.com:root.cell. /afs/cambridge.redhat.com/
+
+echo add grand.central.org 18.7.14.88:128.2.191.224 > /proc/fs/afs/cells 
+mount -t afs "#grand.central.org:root.cell." /afs/grand.central.org/
+mount -t afs "#grand.central.org:root.archive." /afs/grand.central.org/archive
+mount -t afs "#grand.central.org:root.contrib." /afs/grand.central.org/contrib
+mount -t afs "#grand.central.org:root.doc." /afs/grand.central.org/doc
+mount -t afs "#grand.central.org:root.project." /afs/grand.central.org/project
+mount -t afs "#grand.central.org:root.service." /afs/grand.central.org/service
+mount -t afs "#grand.central.org:root.software." /afs/grand.central.org/software
+mount -t afs "#grand.central.org:root.user." /afs/grand.central.org/user
+
+umount /afs/grand.central.org/user
+umount /afs/grand.central.org/software
+umount /afs/grand.central.org/service
+umount /afs/grand.central.org/project
+umount /afs/grand.central.org/doc
+umount /afs/grand.central.org/contrib
+umount /afs/grand.central.org/archive
+umount /afs/grand.central.org
+umount /afs/cambridge.redhat.com
+umount /afs
+rmmod kafs
+rmmod rxrpc
--- a/Documentation/filesystems/automount-support.txt
+++ b/Documentation/filesystems/automount-support.txt
@@ -0,0 +1,118 @@
+Support is available for filesystems that wish to do automounting (such as
+kAFS, which can be found in fs/afs/). This facility includes allowing
+in-kernel mounts to be performed and mountpoint degradation to be
+requested. The latter can also be requested by userspace.
+
+
+======================
+IN-KERNEL AUTOMOUNTING
+======================
+
+A filesystem can now mount another filesystem on one of its directories by the
+following procedure:
+
+ (1) Give the directory a follow_link() operation.
+
+     When the directory is accessed, the follow_link op will be called, and
+     it will be provided with the location of the mountpoint in the nameidata
+     structure (vfsmount and dentry).
+
+ (2) Have the follow_link() op do the following steps:
+
+     (a) Call vfs_kern_mount() to call the appropriate filesystem to set up a
+         superblock and gain a vfsmount structure representing it.
+
+     (b) Copy the nameidata provided as an argument and substitute the dentry
+	 argument into the copy.
+
+     (c) Call do_add_mount() to install the new vfsmount into the namespace's
+	 mountpoint tree, thus making it accessible to userspace. Use the
+	 nameidata set up in (b) as the destination.
+
+	 If the mountpoint will be automatically expired, then do_add_mount()
+	 should also be given the location of an expiration list (see further
+	 down).
+
+     (d) Release the path in the nameidata argument and substitute in the new
+	 vfsmount and its root dentry. The ref counts on these will need
+	 incrementing.
+
+Then from userspace, you can just do something like:
+
+	[root@andromeda root]# mount -t afs \#root.afs. /afs
+	[root@andromeda root]# ls /afs
+	asd  cambridge  cambridge.redhat.com  grand.central.org
+	[root@andromeda root]# ls /afs/cambridge
+	afsdoc
+	[root@andromeda root]# ls /afs/cambridge/afsdoc/
+	ChangeLog  html  LICENSE  pdf  RELNOTES-1.2.2
+
+And then if you look in the mountpoint catalogue, you'll see something like:
+
+	[root@andromeda root]# cat /proc/mounts
+	...
+	#root.afs. /afs afs rw 0 0
+	#root.cell. /afs/cambridge.redhat.com afs rw 0 0
+	#afsdoc. /afs/cambridge.redhat.com/afsdoc afs rw 0 0
+
+
+===========================
+AUTOMATIC MOUNTPOINT EXPIRY
+===========================
+
+Automatic expiration of mountpoints is easy, provided you've mounted the
+mountpoint to be expired in the automounting procedure outlined above.
+
+To do expiration, you need to follow these steps:
+
+ (3) Create at least one list off which the vfsmounts to be expired can be
+     hung. Access to this list will be governed by the vfsmount_lock.
+
+ (4) In step (2c) above, the call to do_add_mount() should be provided with a
+     pointer to this list. It will hang the vfsmount off of it if it succeeds.
+
+ (5) When you want mountpoints to be expired, call mark_mounts_for_expiry()
+     with a pointer to this list. This will process the list, marking every
+     vfsmount thereon for potential expiry on the next call.
+
+     If a vfsmount was already flagged for expiry, and if its usage count is 1
+     (it's only referenced by its parent vfsmount), then it will be deleted
+     from the namespace and thrown away (effectively unmounted).
+
+     It may prove simplest to call this at regular intervals, using
+     some sort of timed event to drive it.
+
+The expiration flag is cleared by calls to mntput. This means that expiration
+will only happen on the second expiration request after the last time the
+mountpoint was accessed.
+
+If a mountpoint is moved, it gets removed from the expiration list. If a bind
+mount is made on an expirable mount, the new vfsmount will not be on the
+expiration list and will not expire.
+
+If a namespace is copied, all mountpoints contained therein will be copied,
+and the copies of those that are on an expiration list will be added to the
+same expiration list.
+
+
+=======================
+USERSPACE DRIVEN EXPIRY
+=======================
+
+As an alternative, it is possible for userspace to request expiry of any
+mountpoint (though some will be rejected - the current process's idea of the
+rootfs for example). It does this by passing the MNT_EXPIRE flag to
+umount(). This flag is considered incompatible with MNT_FORCE and MNT_DETACH.
+
+If the mountpoint in question is referenced by something other than
+umount() or its parent mountpoint, an EBUSY error will be returned and the
+mountpoint will not be marked for expiration or unmounted.
+
+If the mountpoint was not already marked for expiry at that time, an EAGAIN
+error will be given and it won't be unmounted.
+
+Otherwise if it was already marked and it wasn't referenced, unmounting will
+take place as usual.
+
+Again, the expiration flag is cleared every time anything other than umount()
+looks at a mountpoint.
--- a/Documentation/filesystems/befs.txt
+++ b/Documentation/filesystems/befs.txt
@@ -0,0 +1,117 @@
+BeOS filesystem for Linux
+
+Document last updated: Dec 6, 2001
+
+WARNING
+=======
+Make sure you understand that this is alpha software.  This means that the
+implementation is neither complete nor well-tested. 
+
+I DISCLAIM ALL RESPONSIBILITY FOR ANY POSSIBLE BAD EFFECTS OF THIS CODE!
+
+LICENSE
+=====
+This software is covered by the GNU General Public License. 
+See the file COPYING for the complete text of the license.
+Or the GNU website: <http://www.gnu.org/licenses/licenses.html>
+
+AUTHOR
+=====
+The largest part of the code was written by Will Dyson <will_dyson@pobox.com>
+He has been working on the code since Aug 13, 2001. See the changelog for
+details.
+
+Original Author: Makoto Kato <m_kato@ga2.so-net.ne.jp>
+His original code can still be found at:
+<http://hp.vector.co.jp/authors/VA008030/bfs/>
+Does anyone know of a more current email address for Makoto? He doesn't
+respond to the address given above...
+
+Current maintainer: Sergey S. Kostyliov <rathamahata@php4.ru>
+
+WHAT IS THIS DRIVER?
+==================
+This module implements the native filesystem of BeOS <http://www.be.com/>
+for the linux 2.4.1 and later kernels. Currently it is a read-only
+implementation.
+
+Which is it, BFS or BEFS?
+================
+Be, Inc said, "BeOS Filesystem is officially called BFS, not BeFS". 
+But the UnixWare Boot Filesystem is called bfs, too, and it is already in
+the kernel. Because of this naming conflict, on Linux the BeOS
+filesystem is called befs.
+
+HOW TO INSTALL
+==============
+step 1.  Install the BeFS  patch into the source code tree of linux.
+
+Apply the patchfile to your kernel source tree.
+Assuming that your kernel source is in /foo/bar/linux and the patchfile
+is called patch-befs-xxx, you would do the following:
+
+	cd /foo/bar/linux
+	patch -p1 < /path/to/patch-befs-xxx
+
+If the patching step fails (i.e. there are rejected hunks), you can try to
+figure it out yourself (it shouldn't be hard), or mail the maintainer 
+(Will Dyson <will_dyson@pobox.com>) for help.
+
+step 2.  Configuration & make kernel
+
+The linux kernel has many compile-time options. Most of them are beyond the
+scope of this document. I suggest the Kernel-HOWTO document as a good general
+reference on this topic. <http://www.linux.com/howto/Kernel-HOWTO.html>
+
+However, to use the BeFS module, you must enable it at configure time.
+
+	cd /foo/bar/linux
+	make menuconfig (or xconfig)
+
+The BeFS module is not a standard part of the linux kernel, so you must first
+enable support for experimental code under the "Code maturity level" menu.
+
+Then, under the "Filesystems" menu will be an option called "BeFS
+filesystem (experimental)", or something like that. Enable that option
+(it is fine to make it a module).
+
+Save your kernel configuration and then build your kernel.
+
+step 3.  Install
+
+See the kernel howto <http://www.linux.com/howto/Kernel-HOWTO.html> for
+instructions on this critical step.
+
+USING BFS
+=========
+To use the BeOS filesystem, use filesystem type 'befs'.
+
+ex)
+    mount -t befs /dev/fd0 /beos
+
+MOUNT OPTIONS
+=============
+uid=nnn        All files in the partition will be owned by user id nnn.
+gid=nnn	       All files in the partition will be in group nnn.
+iocharset=xxx  Use xxx as the name of the NLS translation table.
+debug          The driver will output debugging information to the syslog.
+
+HOW TO GET LATEST VERSION
+=========================
+
+The latest version is currently available at:
+<http://befs-driver.sourceforge.net/>
+
+ANY KNOWN BUGS?
+===========
+As of Jan 20, 2002:
+	
+	None
+
+SPECIAL THANKS
+==============
+Dominic Giampaolo ... Writing "Practical File System Design with the Be File System"
+Hiroyuki Yamada  ... Testing LinuxPPC.
+
+
+
--- a/Documentation/filesystems/bfs.txt
+++ b/Documentation/filesystems/bfs.txt
@@ -0,0 +1,57 @@
+BFS FILESYSTEM FOR LINUX
+========================
+
+The BFS filesystem is used by SCO UnixWare OS for the /stand slice, which
+usually contains the kernel image and a few other files required for the
+boot process.
+
+In order to access the /stand partition under Linux you obviously need to
+know the partition number and the kernel must support UnixWare disk slices
+(CONFIG_UNIXWARE_DISKLABEL config option). However, BFS support does not
+depend on having UnixWare disklabel support because one can also mount a
+BFS filesystem via loopback:
+
+# losetup /dev/loop0 stand.img
+# mount -t bfs /dev/loop0 /mnt/stand
+
+where stand.img is a file containing the image of a BFS filesystem.
+When you have finished using it and unmounted it, you also need to deallocate
+the /dev/loop0 device by:
+
+# losetup -d /dev/loop0
+
+You can simplify mounting by just typing:
+
+# mount -t bfs -o loop stand.img /mnt/stand
+
+this will allocate the first available loopback device (and load the loop.o
+kernel module if necessary) automatically. If the loopback driver is not
+loaded automatically, make sure that your kernel is compiled with kmod
+support (CONFIG_KMOD) enabled. Beware that umount will not
+deallocate the /dev/loopN device if the /etc/mtab file on your system is a
+symbolic link to /proc/mounts. You will need to do it manually using the
+"-d" switch of losetup(8). Read the losetup(8) manpage for more info.
+
+To create the BFS image under UnixWare you need to find out first which
+slice contains it. The command prtvtoc(1M) is your friend:
+
+# prtvtoc /dev/rdsk/c0b0t0d0s0
+
+(assuming your root disk is on target=0, lun=0, bus=0, controller=0). Then
+look for the slice with tag "STAND", which is usually slice 10 (device
+suffix "sa" in the dd(1) command below). With this information you can
+use dd(1) to create the BFS image:
+
+# umount /stand
+# dd if=/dev/rdsk/c0b0t0d0sa of=stand.img bs=512
+
+Just in case, you can verify that you have done the right thing by checking
+the magic number:
+
+# od -Ad -tx4 stand.img | more
+
+The first 4 bytes should be 0x1badface.
+
+If you have any patches, questions or suggestions regarding this BFS
+implementation please contact the author:
+
+Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
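The magic number check described in bfs.txt above can be scripted.  This is
a minimal sketch, assuming the image was taken from a little-endian
(x86) UnixWare system, so that 0x1badface is stored as the bytes
ce fa ad 1b.  The function name bfs_magic_ok is hypothetical.  Reading
single bytes with od(1) avoids depending on the byte order of the machine
running the check.

```shell
#!/bin/sh
# bfs_magic_ok IMAGE
# Returns 0 if the first four bytes of IMAGE are ce fa ad 1b, that is,
# 0x1badface stored least significant byte first (an assumption, see above).
bfs_magic_ok() {
    [ -r "$1" ] || return 1
    # -An: no offset column, -tx1: one hex byte per field, -N4: 4 bytes only.
    magic=$(od -An -tx1 -N4 "$1" | tr -d '[:space:]')
    [ "$magic" = "cefaad1b" ]
}

if bfs_magic_ok stand.img; then
    echo "stand.img looks like a BFS image"
fi
```

A non-zero return means either that the file is unreadable or that the
first four bytes differ, so a failed check is a reason to re-run
prtvtoc(1M) and confirm the slice before trying to mount the image.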
--- a/Documentation/filesystems/cifs.txt
+++ b/Documentation/filesystems/cifs.txt
@@ -0,0 +1,51 @@
+  This is the client VFS module for the Common Internet File System
+  (CIFS) protocol, which is the successor to the Server Message Block
+  (SMB) protocol, the native file sharing mechanism for most early
+  PC operating systems.  CIFS is fully supported by current network
+  file servers such as Windows 2000, Windows XP and Windows 2003,
+  as well as by Samba (which provides excellent CIFS
+  server support for Linux and many other operating systems), so
+  this network filesystem client can mount to a wide variety of
+  servers.  The smbfs module should be used instead of this cifs module
+  for mounting to older SMB servers such as OS/2.  The smbfs and cifs
+  modules can coexist and do not conflict.  The CIFS VFS filesystem
+  module is designed to work well with servers that implement the
+  newer versions (dialects) of the SMB/CIFS protocol such as Samba, 
+  the program written by Andrew Tridgell that turns any Unix host 
+  into a SMB/CIFS file server.
+
+  The intent of this module is to provide the most advanced network
+  file system function for CIFS compliant servers, including better
+  POSIX compliance, secure per-user session establishment, high
+  performance safe distributed caching (oplock), optional packet
+  signing, large files, Unicode support and other internationalization
+  improvements. Since both Samba server and this filesystem client support
+  the CIFS Unix extensions, the combination can provide a reasonable 
+  alternative to NFSv4 for fileserving in some Linux to Linux environments,
+  not just in Linux to Windows environments.
+
+  This filesystem has an optional mount utility (mount.cifs) that can
+  be obtained from the project page and installed in the same
+  directory as the other mount helpers (such as mount.smbfs).
+  Mounting using the cifs filesystem without installing the mount helper
+  requires specifying the server's IP address.
+
+  For Linux 2.4:
+    mount //anything/here /mnt_target -o \
+            user=username,pass=password,unc=//ip_address_of_server/sharename
+
+  For Linux 2.5:
+    mount //ip_address_of_server/sharename /mnt_target -o user=username,pass=password
+
+
+  For more information on the module see the project page at
+
+      http://us1.samba.org/samba/Linux_CIFS_client.html 
+
+  For more information on CIFS see:
+
+      http://www.snia.org/tech_activities/CIFS
+
+  or the Samba site:
+     
+      http://www.samba.org
--- a/Documentation/filesystems/coda.txt
+++ b/Documentation/filesystems/coda.txt
--- a/Documentation/filesystems/configfs/configfs.txt
+++ b/Documentation/filesystems/configfs/configfs.txt
@@ -0,0 +1,434 @@
+
+configfs - Userspace-driven kernel object configuration.
+
+Joel Becker <joel.becker@oracle.com>
+
+Updated: 31 March 2005
+
+Copyright (c) 2005 Oracle Corporation,
+	Joel Becker <joel.becker@oracle.com>
+
+
+[What is configfs?]
+
+configfs is a ram-based filesystem that provides the converse of
+sysfs's functionality.  Where sysfs is a filesystem-based view of
+kernel objects, configfs is a filesystem-based manager of kernel
+objects, or config_items.
+
+With sysfs, an object is created in kernel (for example, when a device
+is discovered) and it is registered with sysfs.  Its attributes then
+appear in sysfs, allowing userspace to read the attributes via
+readdir(3)/read(2).  It may allow some attributes to be modified via
+write(2).  The important point is that the object is created and
+destroyed in kernel, the kernel controls the lifecycle of the sysfs
+representation, and sysfs is merely a window on all this.
+
+A configfs config_item is created via an explicit userspace operation:
+mkdir(2).  It is destroyed via rmdir(2).  The attributes appear at
+mkdir(2) time, and can be read or modified via read(2) and write(2).
+As with sysfs, readdir(3) queries the list of items and/or attributes.
+symlink(2) can be used to group items together.  Unlike sysfs, the
+lifetime of the representation is completely driven by userspace.  The
+kernel modules backing the items must respond to this.
+
+Both sysfs and configfs can and should exist together on the same
+system.  One is not a replacement for the other.
+
+[Using configfs]
+
+configfs can be compiled as a module or into the kernel.  You can access
+it by doing
+
+	mount -t configfs none /config
+
+The configfs tree will be empty unless client modules are also loaded.
+These are modules that register their item types with configfs as
+subsystems.  Once a client subsystem is loaded, it will appear as a
+subdirectory (or more than one) under /config.  Like sysfs, the
+configfs tree is always there, whether mounted on /config or not.
+
+An item is created via mkdir(2).  The item's attributes will also
+appear at this time.  readdir(3) can determine what the attributes are,
+read(2) can query their default values, and write(2) can store new
+values.  Like sysfs, attributes should be ASCII text files, preferably
+with only one value per file.  The same efficiency caveats from sysfs
+apply.  Don't mix more than one attribute in one attribute file.
+
+Like sysfs, configfs expects write(2) to store the entire buffer at
+once.  When writing to configfs attributes, userspace processes should
+first read the entire file, modify the portions they wish to change, and
+then write the entire buffer back.  Attribute files have a maximum size
+of one page (PAGE_SIZE, 4096 on i386).
+
+When an item needs to be destroyed, remove it with rmdir(2).  An
+item cannot be destroyed if any other item has a link to it (via
+symlink(2)).  Links can be removed via unlink(2).
+
+[Configuring FakeNBD: an Example]
+
+Imagine there's a Network Block Device (NBD) driver that allows you to
+access remote block devices.  Call it FakeNBD.  FakeNBD uses configfs
+for its configuration.  Obviously, there will be a nice program that
+sysadmins use to configure FakeNBD, but somehow that program has to tell
+the driver about it.  Here's where configfs comes in.
+
+When the FakeNBD driver is loaded, it registers itself with configfs.
+readdir(3) sees this just fine:
+
+	# ls /config
+	fakenbd
+
+A fakenbd connection can be created with mkdir(2).  The name is
+arbitrary, but likely the tool will make some use of the name.  Perhaps
+it is a uuid or a disk name:
+
+	# mkdir /config/fakenbd/disk1
+	# ls /config/fakenbd/disk1
+	target device rw
+
+The target attribute contains the IP address of the server FakeNBD will
+connect to.  The device attribute is the device on the server.
+Predictably, the rw attribute determines whether the connection is
+read-only or read-write.
+
+	# echo 10.0.0.1 > /config/fakenbd/disk1/target
+	# echo /dev/sda1 > /config/fakenbd/disk1/device
+	# echo 1 > /config/fakenbd/disk1/rw
+
+That's it.  That's all there is.  Now the device is configured, via the
+shell no less.
+
+[Coding With configfs]
+
+Every object in configfs is a config_item.  A config_item reflects an
+object in the subsystem.  It has attributes that match values on that
+object.  configfs handles the filesystem representation of that object
+and its attributes, allowing the subsystem to ignore all but the
+basic show/store interaction.
+
+Items are created and destroyed inside a config_group.  A group is a
+collection of items that share the same attributes and operations.
+Items are created by mkdir(2) and removed by rmdir(2), but configfs
+handles that.  The group has a set of operations to perform these tasks.
+
+A subsystem is the top level of a client module.  During initialization,
+the client module registers the subsystem with configfs, and the
+subsystem appears as a directory at the top of the configfs filesystem.
+A subsystem is also a config_group, and can do everything a config_group
+can.
+
+[struct config_item]
+
+	struct config_item {
+		char                    *ci_name;
+		char                    ci_namebuf[CONFIGFS_ITEM_NAME_LEN];
+		struct kref             ci_kref;
+		struct list_head        ci_entry;
+		struct config_item      *ci_parent;
+		struct config_group     *ci_group;
+		struct config_item_type *ci_type;
+		struct dentry           *ci_dentry;
+	};
+
+	void config_item_init(struct config_item *);
+	void config_item_init_type_name(struct config_item *,
+					const char *name,
+					struct config_item_type *type);
+	struct config_item *config_item_get(struct config_item *);
+	void config_item_put(struct config_item *);
+
+Generally, struct config_item is embedded in a container structure, a
+structure that actually represents what the subsystem is doing.  The
+config_item portion of that structure is how the object interacts with
+configfs.
+
+Whether statically defined in a source file or created by a parent
+config_group, a config_item must have one of the _init() functions
+called on it.  This initializes the reference count and sets up the
+appropriate fields.
+
+All users of a config_item should have a reference on it via
+config_item_get(), and drop the reference when they are done via
+config_item_put().
+
+By itself, a config_item cannot do much more than appear in configfs.
+Usually a subsystem wants the item to display and/or store attributes,
+among other things.  For that, it needs a type.
+
+[struct config_item_type]
+
+	struct configfs_item_operations {
+		void (*release)(struct config_item *);
+		ssize_t (*show_attribute)(struct config_item *,
+					  struct configfs_attribute *,
+					  char *);
+		ssize_t (*store_attribute)(struct config_item *,
+					   struct configfs_attribute *,
+					   const char *, size_t);
+		int (*allow_link)(struct config_item *src,
+				  struct config_item *target);
+		int (*drop_link)(struct config_item *src,
+				 struct config_item *target);
+	};
+
+	struct config_item_type {
+		struct module                           *ct_owner;
+		struct configfs_item_operations         *ct_item_ops;
+		struct configfs_group_operations        *ct_group_ops;
+		struct configfs_attribute               **ct_attrs;
+	};
+
+The most basic function of a config_item_type is to define what
+operations can be performed on a config_item.  All items that have been
+allocated dynamically will need to provide the ct_item_ops->release()
+method.  This method is called when the config_item's reference count
+reaches zero.  Items that wish to display an attribute need to provide
+the ct_item_ops->show_attribute() method.  Similarly, storing a new
+attribute value uses the store_attribute() method.
+
+[struct configfs_attribute]
+
+	struct configfs_attribute {
+		char                    *ca_name;
+		struct module           *ca_owner;
+		mode_t                  ca_mode;
+	};
+
+When a config_item wants an attribute to appear as a file in the item's
+configfs directory, it must define a configfs_attribute describing it.
+It then adds the attribute to the NULL-terminated array
+config_item_type->ct_attrs.  When the item appears in configfs, the
+attribute file will appear with the configfs_attribute->ca_name
+filename.  configfs_attribute->ca_mode specifies the file permissions.
+
+If an attribute is readable and the config_item provides a
+ct_item_ops->show_attribute() method, that method will be called
+whenever userspace asks for a read(2) on the attribute.  The converse
+will happen for write(2).
+
+[struct config_group]
+
+A config_item cannot live in a vacuum.  The only way one can be created
+is via mkdir(2) on a config_group.  This will trigger creation of a
+child item.
+
+	struct config_group {
+		struct config_item		cg_item;
+		struct list_head		cg_children;
+		struct configfs_subsystem 	*cg_subsys;
+		struct config_group		**default_groups;
+	};
+
+	void config_group_init(struct config_group *group);
+	void config_group_init_type_name(struct config_group *group,
+					 const char *name,
+					 struct config_item_type *type);
+
+
+The config_group structure contains a config_item.  Properly configuring
+that item means that a group can behave as an item in its own right.
+However, it can do more: it can create child items or groups.  This is
+accomplished via the group operations specified on the group's
+config_item_type.
+
+	struct configfs_group_operations {
+		struct config_item *(*make_item)(struct config_group *group,
+						 const char *name);
+		struct config_group *(*make_group)(struct config_group *group,
+						   const char *name);
+		int (*commit_item)(struct config_item *item);
+		void (*drop_item)(struct config_group *group,
+				  struct config_item *item);
+	};
+
+A group creates child items by providing the
+ct_group_ops->make_item() method.  If provided, this method is called
+from mkdir(2) in the group's directory.  The subsystem allocates a new
+config_item (or more likely, its container structure), initializes it,
+and returns it to configfs.  Configfs will then populate the filesystem
+tree to reflect the new item.
+
+If the subsystem wants the child to be a group itself, the subsystem
+provides ct_group_ops->make_group().  Everything else behaves the same,
+using the group _init() functions on the group.
+
+Finally, when userspace calls rmdir(2) on the item or group,
+ct_group_ops->drop_item() is called.  As a config_group is also a
+config_item, a separate drop_group() method is not necessary.
+The subsystem must config_item_put() the reference that was initialized
+upon item allocation.  If a subsystem has no work to do, it may omit
+the ct_group_ops->drop_item() method, and configfs will call
+config_item_put() on the item on behalf of the subsystem.
+
+IMPORTANT: drop_item() is void, and as such cannot fail.  When rmdir(2)
+is called, configfs WILL remove the item from the filesystem tree
+(assuming that it has no children to keep it busy).  The subsystem is
+responsible for responding to this.  If the subsystem has references to
+the item in other threads, the memory is safe.  It may take some time
+for the item to actually disappear from the subsystem's usage.  But it
+is gone from configfs.
+
+A config_group cannot be removed while it still has child items.  This
+is implemented in the configfs rmdir(2) code.  ->drop_item() will not be
+called, as the item has not been dropped.  rmdir(2) will fail, as the
+directory is not empty.
+
+[struct configfs_subsystem]
+
+A subsystem must register itself, usually at module_init time.  This
+tells configfs to make the subsystem appear in the file tree.
+
+	struct configfs_subsystem {
+		struct config_group	su_group;
+		struct semaphore	su_sem;
+	};
+
+	int configfs_register_subsystem(struct configfs_subsystem *subsys);
+	void configfs_unregister_subsystem(struct configfs_subsystem *subsys);
+
+A subsystem consists of a toplevel config_group and a semaphore.
+The group is where child config_items are created.  For a subsystem,
+this group is usually defined statically.  Before calling
+configfs_register_subsystem(), the subsystem must have initialized the
+group via the usual group _init() functions, and it must also have
+initialized the semaphore.
+
+When the register call returns, the subsystem is live, and it
+will be visible via configfs.  At that point, mkdir(2) can be called and
+the subsystem must be ready for it.
+
+[An Example]
+
+The best example of these basic concepts is the simple_children
+subsystem/group and the simple_child item in configfs_example.c.  It
+shows a trivial object displaying and storing an attribute, and a simple
+group creating and destroying these children.
+
+[Hierarchy Navigation and the Subsystem Semaphore]
+
+There is an extra bonus that configfs provides.  The config_groups and
+config_items are arranged in a hierarchy due to the fact that they
+appear in a filesystem.  A subsystem is NEVER to touch the filesystem
+parts, but the subsystem might be interested in this hierarchy.  For
+this reason, the hierarchy is mirrored via the config_group->cg_children
+and config_item->ci_parent structure members.
+
+A subsystem can navigate the cg_children list and the ci_parent pointer
+to see the tree created by the subsystem.  This can race with configfs'
+management of the hierarchy, so configfs uses the subsystem semaphore to
+protect modifications.  Whenever a subsystem wants to navigate the
+hierarchy, it must do so under the protection of the subsystem
+semaphore.
+
+A subsystem will be prevented from acquiring the semaphore while a newly
+allocated item has not been linked into this hierarchy.   Similarly, it
+will not be able to acquire the semaphore while a dropping item has not
+yet been unlinked.  This means that an item's ci_parent pointer will
+never be NULL while the item is in configfs, and that an item will only
+be in its parent's cg_children list for the same duration.  This allows
+a subsystem to trust ci_parent and cg_children while it holds the
+semaphore.
+
+[Item Aggregation Via symlink(2)]
+
+configfs provides a simple group via the group->item parent/child
+relationship.  Often, however, a larger environment requires aggregation
+outside of the parent/child connection.  This is implemented via
+symlink(2).
+
+A config_item may provide the ct_item_ops->allow_link() and
+ct_item_ops->drop_link() methods.  If the ->allow_link() method exists,
+symlink(2) may be called with the config_item as the source of the link.
+These links are only allowed between configfs config_items.  Any
+symlink(2) attempt outside the configfs filesystem will be denied.
+
+When symlink(2) is called, the source config_item's ->allow_link()
+method is called with itself and a target item.  If the source item
+allows linking to target item, it returns 0.  A source item may wish to
+reject a link if it only wants links to a certain type of object (say,
+in its own subsystem).
+
+When unlink(2) is called on the symbolic link, the source item is
+notified via the ->drop_link() method.  Like the ->drop_item() method,
+this is a void function and cannot return failure.  The subsystem is
+responsible for responding to the change.
+
+A config_item cannot be removed while it links to any other item, nor
+can it be removed while an item links to it.  Dangling symlinks are not
+allowed in configfs.
+
+[Automatically Created Subgroups]
+
+A new config_group may want to have two types of child config_items.
+While this could be codified by magic names in ->make_item(), it is much
+more explicit to have a method whereby userspace sees this divergence.
+
+Rather than have a group where some items behave differently than
+others, configfs provides a method whereby one or many subgroups are
+automatically created inside the parent at its creation.  Thus,
+mkdir("parent") results in "parent", "parent/subgroup1", up through
+"parent/subgroupN".  Items of type 1 can now be created in
+"parent/subgroup1", and items of type N can be created in
+"parent/subgroupN".
+
+These automatic subgroups, or default groups, do not preclude other
+children of the parent group.  If ct_group_ops->make_group() exists,
+other child groups can be created on the parent group directly.
+
+A configfs subsystem specifies default groups by filling in the
+NULL-terminated array default_groups on the config_group structure.
+Each group in that array is populated in the configfs tree at the same
+time as the parent group.  Similarly, they are removed at the same time
+as the parent.  No extra notification is provided.  When a ->drop_item()
+method call notifies the subsystem that the parent group is going away,
+every default group child associated with that parent group is going
+away with it.
+
+As a consequence of this, default_groups cannot be removed directly via
+rmdir(2).  They also are not considered when rmdir(2) on the parent
+group is checking for children.
+
+[Committable Items]
+
+NOTE: Committable items are currently unimplemented.
+
+Some config_items cannot have a valid initial state.  That is, no
+default values can be specified for the item's attributes such that the
+item can do its work.  Userspace must configure one or more attributes,
+after which the subsystem can start whatever entity this item
+represents.
+
+Consider the FakeNBD device from above.  Without a target address *and*
+a target device, the subsystem has no idea what block device to import.
+The simple example assumes that the subsystem merely waits until all the
+appropriate attributes are configured, and then connects.  This will,
+indeed, work, but now every attribute store must check if the attributes
+are initialized.  Every attribute store must fire off the connection if
+that condition is met.
+
+Far better would be an explicit action notifying the subsystem that the
+config_item is ready to go.  More importantly, an explicit action allows
+the subsystem to provide feedback as to whether the attributes are
+initialized in a way that makes sense.  configfs provides this as
+committable items.
+
+configfs still uses only normal filesystem operations.  An item is
+committed via rename(2).  The item is moved from a directory where it
+can be modified to a directory where it cannot.
+
+Any group that provides the ct_group_ops->commit_item() method has
+committable items.  When this group appears in configfs, mkdir(2) will
+not work directly in the group.  Instead, the group will have two
+subdirectories: "live" and "pending".  The "live" directory does not
+support mkdir(2) or rmdir(2) either.  It only allows rename(2).  The
+"pending" directory does allow mkdir(2) and rmdir(2).  An item is
+created in the "pending" directory.  Its attributes can be modified at
+will.  Userspace commits the item by renaming it into the "live"
+directory.  At this point, the subsystem receives the ->commit_item()
+callback.  If all required attributes are filled to satisfaction, the
+method returns zero and the item is moved to the "live" directory.
+
+As rmdir(2) does not work in the "live" directory, an item must be
+shut down, or "uncommitted".  Again, this is done via rename(2), this
+time from the "live" directory back to the "pending" one.  The subsystem
+is notified by the ct_group_ops->uncommit_object() method.
+
+
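The userspace side of the FakeNBD walkthrough in configfs.txt above can be
wrapped in two small shell functions.  This is only a sketch: FakeNBD is
the imaginary driver from the text, the function names and the
CONFIGFS_ROOT variable are hypothetical, and /config is simply the mount
point used in the text.  Each attribute receives exactly one value in one
write, as the text recommends.

```shell
#!/bin/sh
# Where configfs is mounted; /config matches the examples in the text.
CONFIGFS_ROOT=${CONFIGFS_ROOT:-/config}

# fakenbd_create NAME TARGET DEVICE RW
# mkdir(2) creates the item, then one value is written per attribute.
fakenbd_create() {
    item="$CONFIGFS_ROOT/fakenbd/$1"
    mkdir "$item" || return 1
    printf '%s\n' "$2" > "$item/target" || return 1
    printf '%s\n' "$3" > "$item/device" || return 1
    printf '%s\n' "$4" > "$item/rw" || return 1
}

# fakenbd_destroy NAME
# rmdir(2) removes the item.  On configfs the attribute files disappear
# together with the item, so no separate cleanup of them is attempted.
fakenbd_destroy() {
    rmdir "$CONFIGFS_ROOT/fakenbd/$1"
}
```

With a driver like FakeNBD loaded, "fakenbd_create disk1 10.0.0.1 /dev/sda1 1"
would reproduce the three echo commands from the text, and
"fakenbd_destroy disk1" would remove the item again.  The rmdir(2) fails
if another item still holds a symlink(2) to it.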
--- a/Documentation/filesystems/configfs/configfs_example.c
+++ b/Documentation/filesystems/configfs/configfs_example.c
@@ -0,0 +1,487 @@
+/*
+ * vim: noexpandtab ts=8 sts=0 sw=8:
+ *
+ * configfs_example.c - This file is a demonstration module containing
+ *      a number of configfs subsystems.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ *
+ * Based on sysfs:
+ * 	sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel
+ *
+ * configfs Copyright (C) 2005 Oracle.  All rights reserved.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+
+#include <linux/configfs.h>
+
+
+
+/*
+ * 01-childless
+ *
+ * This first example is a childless subsystem.  It cannot create
+ * any config_items.  It just has attributes.
+ *
+ * Note that we are enclosing the configfs_subsystem inside a container.
+ * This is not necessary if a subsystem has no attributes directly
+ * on the subsystem.  See the next example, 02-simple-children, for
+ * such a subsystem.
+ */
+
+struct childless {
+	struct configfs_subsystem subsys;
+	int showme;
+	int storeme;
+};
+
+struct childless_attribute {
+	struct configfs_attribute attr;
+	ssize_t (*show)(struct childless *, char *);
+	ssize_t (*store)(struct childless *, const char *, size_t);
+};
+
+static inline struct childless *to_childless(struct config_item *item)
+{
+	return item ? container_of(to_configfs_subsystem(to_config_group(item)), struct childless, subsys) : NULL;
+}
+
+static ssize_t childless_showme_read(struct childless *childless,
+				     char *page)
+{
+	ssize_t pos;
+
+	pos = sprintf(page, "%d\n", childless->showme);
+	childless->showme++;
+
+	return pos;
+}
+
+static ssize_t childless_storeme_read(struct childless *childless,
+				      char *page)
+{
+	return sprintf(page, "%d\n", childless->storeme);
+}
+
+static ssize_t childless_storeme_write(struct childless *childless,
+				       const char *page,
+				       size_t count)
+{
+	unsigned long tmp;
+	char *p = (char *) page;
+
+	tmp = simple_strtoul(p, &p, 10);
+	if (!p || (*p && (*p != '\n')))
+		return -EINVAL;
+
+	if (tmp > INT_MAX)
+		return -ERANGE;
+
+	childless->storeme = tmp;
+
+	return count;
+}
+
+static ssize_t childless_description_read(struct childless *childless,
+					  char *page)
+{
+	return sprintf(page,
+"[01-childless]\n"
+"\n"
+"The childless subsystem is the simplest possible subsystem in\n"
+"configfs.  It does not support the creation of child config_items.\n"
+"It only has a few attributes.  In fact, it isn't much different\n"
+"than a directory in /proc.\n");
+}
+
+static struct childless_attribute childless_attr_showme = {
+	.attr	= { .ca_owner = THIS_MODULE, .ca_name = "showme", .ca_mode = S_IRUGO },
+	.show	= childless_showme_read,
+};
+static struct childless_attribute childless_attr_storeme = {
+	.attr	= { .ca_owner = THIS_MODULE, .ca_name = "storeme", .ca_mode = S_IRUGO | S_IWUSR },
+	.show	= childless_storeme_read,
+	.store	= childless_storeme_write,
+};
+static struct childless_attribute childless_attr_description = {
+	.attr = { .ca_owner = THIS_MODULE, .ca_name = "description", .ca_mode = S_IRUGO },
+	.show = childless_description_read,
+};
+
+static struct configfs_attribute *childless_attrs[] = {
+	&childless_attr_showme.attr,
+	&childless_attr_storeme.attr,
+	&childless_attr_description.attr,
+	NULL,
+};
+
+static ssize_t childless_attr_show(struct config_item *item,
+				   struct configfs_attribute *attr,
+				   char *page)
+{
+	struct childless *childless = to_childless(item);
+	struct childless_attribute *childless_attr =
+		container_of(attr, struct childless_attribute, attr);
+	ssize_t ret = 0;
+
+	if (childless_attr->show)
+		ret = childless_attr->show(childless, page);
+	return ret;
+}
+
+static ssize_t childless_attr_store(struct config_item *item,
+				    struct configfs_attribute *attr,
+				    const char *page, size_t count)
+{
+	struct childless *childless = to_childless(item);
+	struct childless_attribute *childless_attr =
+		container_of(attr, struct childless_attribute, attr);
+	ssize_t ret = -EINVAL;
+
+	if (childless_attr->store)
+		ret = childless_attr->store(childless, page, count);
+	return ret;
+}
+
+static struct configfs_item_operations childless_item_ops = {
+	.show_attribute		= childless_attr_show,
+	.store_attribute	= childless_attr_store,
+};
+
+static struct config_item_type childless_type = {
+	.ct_item_ops	= &childless_item_ops,
+	.ct_attrs	= childless_attrs,
+	.ct_owner	= THIS_MODULE,
+};
+
+static struct childless childless_subsys = {
+	.subsys = {
+		.su_group = {
+			.cg_item = {
+				.ci_namebuf = "01-childless",
+				.ci_type = &childless_type,
+			},
+		},
+	},
+};
+
+
+/* ----------------------------------------------------------------- */
+
+/*
+ * 02-simple-children
+ *
+ * This example merely has a simple one-attribute child.  Note that
+ * there is no extra attribute structure, as the child's attribute is
+ * known from the get-go.  Also, there is no container for the
+ * subsystem, as it has no attributes of its own.
+ */
+
+struct simple_child {
+	struct config_item item;
+	int storeme;
+};
+
+static inline struct simple_child *to_simple_child(struct config_item *item)
+{
+	return item ? container_of(item, struct simple_child, item) : NULL;
+}
+
+static struct configfs_attribute simple_child_attr_storeme = {
+	.ca_owner = THIS_MODULE,
+	.ca_name = "storeme",
+	.ca_mode = S_IRUGO | S_IWUSR,
+};
+
+static struct configfs_attribute *simple_child_attrs[] = {
+	&simple_child_attr_storeme,
+	NULL,
+};
+
+static ssize_t simple_child_attr_show(struct config_item *item,
+				      struct configfs_attribute *attr,
+				      char *page)
+{
+	ssize_t count;
+	struct simple_child *simple_child = to_simple_child(item);
+
+	count = sprintf(page, "%d\n", simple_child->storeme);
+
+	return count;
+}
+
+static ssize_t simple_child_attr_store(struct config_item *item,
+				       struct configfs_attribute *attr,
+				       const char *page, size_t count)
+{
+	struct simple_child *simple_child = to_simple_child(item);
+	unsigned long tmp;
+	char *p = (char *) page;
+
+	tmp = simple_strtoul(p, &p, 10);
+	if (!p || (*p && (*p != '\n')))
+		return -EINVAL;
+
+	if (tmp > INT_MAX)
+		return -ERANGE;
+
+	simple_child->storeme = tmp;
+
+	return count;
+}
+
+static void simple_child_release(struct config_item *item)
+{
+	kfree(to_simple_child(item));
+}
+
+static struct configfs_item_operations simple_child_item_ops = {
+	.release		= simple_child_release,
+	.show_attribute		= simple_child_attr_show,
+	.store_attribute	= simple_child_attr_store,
+};
+
+static struct config_item_type simple_child_type = {
+	.ct_item_ops	= &simple_child_item_ops,
+	.ct_attrs	= simple_child_attrs,
+	.ct_owner	= THIS_MODULE,
+};
+
+
+struct simple_children {
+	struct config_group group;
+};
+
+static inline struct simple_children *to_simple_children(struct config_item *item)
+{
+	return item ? container_of(to_config_group(item), struct simple_children, group) : NULL;
+}
+
+static struct config_item *simple_children_make_item(struct config_group *group, const char *name)
+{
+	struct simple_child *simple_child;
+
+	simple_child = kzalloc(sizeof(struct simple_child), GFP_KERNEL);
+	if (!simple_child)
+		return NULL;
+
+	config_item_init_type_name(&simple_child->item, name,
+				   &simple_child_type);
+
+	simple_child->storeme = 0;
+
+	return &simple_child->item;
+}
+
+static struct configfs_attribute simple_children_attr_description = {
+	.ca_owner = THIS_MODULE,
+	.ca_name = "description",
+	.ca_mode = S_IRUGO,
+};
+
+static struct configfs_attribute *simple_children_attrs[] = {
+	&simple_children_attr_description,
+	NULL,
+};
+
+static ssize_t simple_children_attr_show(struct config_item *item,
+			   		 struct configfs_attribute *attr,
+			   		 char *page)
+{
+	return sprintf(page,
+"[02-simple-children]\n"
+"\n"
+"This subsystem allows the creation of child config_items.  These\n"
+"items have only one attribute that is readable and writeable.\n");
+}
+
+static void simple_children_release(struct config_item *item)
+{
+	kfree(to_simple_children(item));
+}
+
+static struct configfs_item_operations simple_children_item_ops = {
+	.release 	= simple_children_release,
+	.show_attribute	= simple_children_attr_show,
+};
+
+/*
+ * Note that, since no extra work is required on ->drop_item(),
+ * no ->drop_item() is provided.
+ */
+static struct configfs_group_operations simple_children_group_ops = {
+	.make_item	= simple_children_make_item,
+};
+
+static struct config_item_type simple_children_type = {
+	.ct_item_ops	= &simple_children_item_ops,
+	.ct_group_ops	= &simple_children_group_ops,
+	.ct_attrs	= simple_children_attrs,
+	.ct_owner	= THIS_MODULE,
+};
+
+static struct configfs_subsystem simple_children_subsys = {
+	.su_group = {
+		.cg_item = {
+			.ci_namebuf = "02-simple-children",
+			.ci_type = &simple_children_type,
+		},
+	},
+};
+
+
+/* ----------------------------------------------------------------- */
+
+/*
+ * 03-group-children
+ *
+ * This example reuses the simple_children group from above.  However,
+ * the simple_children group is not the subsystem itself, it is a
+ * child of the subsystem.  Creation of a group in the subsystem creates
+ * a new simple_children group.  That group can then have simple_child
+ * children of its own.
+ */
+
+static struct config_group *group_children_make_group(struct config_group *group, const char *name)
+{
+	struct simple_children *simple_children;
+
+	simple_children = kzalloc(sizeof(struct simple_children),
+				  GFP_KERNEL);
+	if (!simple_children)
+		return NULL;
+
+	config_group_init_type_name(&simple_children->group, name,
+				    &simple_children_type);
+
+	return &simple_children->group;
+}
+
+static struct configfs_attribute group_children_attr_description = {
+	.ca_owner = THIS_MODULE,
+	.ca_name = "description",
+	.ca_mode = S_IRUGO,
+};
+
+static struct configfs_attribute *group_children_attrs[] = {
+	&group_children_attr_description,
+	NULL,
+};
+
+static ssize_t group_children_attr_show(struct config_item *item,
+			   		struct configfs_attribute *attr,
+			   		char *page)
+{
+	return sprintf(page,
+"[03-group-children]\n"
+"\n"
+"This subsystem allows the creation of child config_groups.  These\n"
+"groups are like the subsystem simple-children.\n");
+}
+
+static struct configfs_item_operations group_children_item_ops = {
+	.show_attribute	= group_children_attr_show,
+};
+
+/*
+ * Note that, since no extra work is required on ->drop_item(),
+ * no ->drop_item() is provided.
+ */
+static struct configfs_group_operations group_children_group_ops = {
+	.make_group	= group_children_make_group,
+};
+
+static struct config_item_type group_children_type = {
+	.ct_item_ops	= &group_children_item_ops,
+	.ct_group_ops	= &group_children_group_ops,
+	.ct_attrs	= group_children_attrs,
+	.ct_owner	= THIS_MODULE,
+};
+
+static struct configfs_subsystem group_children_subsys = {
+	.su_group = {
+		.cg_item = {
+			.ci_namebuf = "03-group-children",
+			.ci_type = &group_children_type,
+		},
+	},
+};
+
+/* ----------------------------------------------------------------- */
+
+/*
+ * We're now done with our subsystem definitions.
+ * For convenience in this module, here's a list of them all.  It
+ * allows the init function to easily register them.  Most modules
+ * will only have one subsystem, and will only call register_subsystem
+ * on it directly.
+ */
+static struct configfs_subsystem *example_subsys[] = {
+	&childless_subsys.subsys,
+	&simple_children_subsys,
+	&group_children_subsys,
+	NULL,
+};
+
+static int __init configfs_example_init(void)
+{
+	int ret;
+	int i;
+	struct configfs_subsystem *subsys;
+
+	for (i = 0; example_subsys[i]; i++) {
+		subsys = example_subsys[i];
+
+		config_group_init(&subsys->su_group);
+		init_MUTEX(&subsys->su_sem);
+		ret = configfs_register_subsystem(subsys);
+		if (ret) {
+			printk(KERN_ERR "Error %d while registering subsystem %s\n",
+			       ret,
+			       subsys->su_group.cg_item.ci_namebuf);
+			goto out_unregister;
+		}
+	}
+
+	return 0;
+
+out_unregister:
+	/* Entry i failed to register, so unwind from the one before it. */
+	for (i--; i >= 0; i--) {
+		configfs_unregister_subsystem(example_subsys[i]);
+	}
+
+	return ret;
+}
+
+static void __exit configfs_example_exit(void)
+{
+	int i;
+
+	for (i = 0; example_subsys[i]; i++) {
+		configfs_unregister_subsystem(example_subsys[i]);
+	}
+}
+
+module_init(configfs_example_init);
+module_exit(configfs_example_exit);
+MODULE_LICENSE("GPL");
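
The error path in configfs_example_init() is an instance of the usual
register-then-unwind pattern: register entries in order and, if one fails,
unregister only those that succeeded. A minimal userspace sketch follows.
toy_register(), toy_unregister(), toy_state and toy_fail_at are hypothetical
stand-ins for the configfs calls, not part of any kernel API.

```c
#include <assert.h>

#define TOY_COUNT 3

/* Hypothetical stand-ins: toy_state[i] is 1 while entry i is registered. */
static int toy_state[TOY_COUNT];
static int toy_fail_at = -1;	/* index whose registration fails, or -1 */

static int toy_register(int i)
{
	if (i == toy_fail_at)
		return -1;
	toy_state[i] = 1;
	return 0;
}

static void toy_unregister(int i)
{
	/* Unregistering something that never registered would be a bug. */
	assert(toy_state[i] == 1);
	toy_state[i] = 0;
}

/* Register every entry in order; on failure undo only the ones that succeeded. */
static int toy_init_all(void)
{
	int i, ret;

	for (i = 0; i < TOY_COUNT; i++) {
		ret = toy_register(i);
		if (ret)
			goto out_unregister;
	}
	return 0;

out_unregister:
	/* Entry i itself never registered, so start unwinding at i - 1. */
	for (i--; i >= 0; i--)
		toy_unregister(i);
	return ret;
}
```

Starting the unwind loop at i - 1 matters: the entry at index i is the one
whose registration failed, so there is nothing to undo for it.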
--- a/Documentation/filesystems/cramfs.txt
+++ b/Documentation/filesystems/cramfs.txt
@@ -0,0 +1,76 @@
+
+	Cramfs - cram a filesystem onto a small ROM
+
+cramfs is designed to be simple and small, and to compress things well. 
+
+It uses the zlib routines to compress a file one page at a time, and
+allows random page access.  The meta-data is not compressed, but is
+expressed in a very terse representation to make it use much less
+disk space than traditional filesystems.
+
+You can't write to a cramfs filesystem (making it compressible and
+compact also makes it _very_ hard to update on-the-fly), so you have to
+create the disk image with the "mkcramfs" utility.
+
+
+Usage Notes
+-----------
+
+File sizes are limited to less than 16MB.
+
+Maximum filesystem size is a little over 256MB.  (The last file on the
+filesystem is allowed to extend past 256MB.)
+
+Only the low 8 bits of gid are stored.  The current version of
+mkcramfs simply truncates to 8 bits, which is a potential security
+issue.
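
To see why the truncation matters, here is a small sketch of keeping only
the low 8 bits of a gid. toy_cramfs_gid() is a hypothetical helper written
for this note; it is not taken from mkcramfs.

```c
#include <assert.h>
#include <stdint.h>

/* Keep only the low 8 bits of a gid, as the text describes. */
static uint8_t toy_cramfs_gid(uint32_t gid)
{
	return (uint8_t)(gid & 0xffu);
}

/* Nonzero if the gid survives the truncation unchanged. */
static int toy_gid_fits(uint32_t gid)
{
	return toy_cramfs_gid(gid) == gid;
}

/* Usage: gid 256 collapses onto gid 0 once only the low byte is kept. */
static void toy_gid_demo(void)
{
	assert(toy_cramfs_gid(256) == 0);
	assert(toy_cramfs_gid(1000) == 232);
	assert(toy_gid_fits(100));
}
```

A file owned by group 256 would therefore appear to belong to group 0 in the
image, which is the kind of surprise the security warning refers to.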
+
+Hard links are supported, but hard linked files
+will still have a link count of 1 in the cramfs image.
+
+Cramfs directories have no `.' or `..' entries.  Directories (like
+every other file on cramfs) always have a link count of 1.  (There's
+no need to use -noleaf in `find', btw.)
+
+No timestamps are stored in a cramfs, so these default to the epoch
+(1970 GMT).  Recently-accessed files may have updated timestamps, but
+the update lasts only as long as the inode is cached in memory, after
+which the timestamp reverts to 1970, i.e. moves backwards in time.
+
+Currently, cramfs must be written and read with architectures of the
+same endianness, and can be read only by kernels with PAGE_CACHE_SIZE
+== 4096.  At least the latter of these is a bug, but it hasn't been
+decided what the best fix is.  For the moment if you have larger pages
+you can just change the #define in mkcramfs.c, so long as you don't
+mind the filesystem becoming unreadable to future kernels.
+
+
+For /usr/share/magic
+--------------------
+
+0	ulelong	0x28cd3d45	Linux cramfs offset 0
+>4	ulelong	x		size %d
+>8	ulelong	x		flags 0x%x
+>12	ulelong	x		future 0x%x
+>16	string	>\0		signature "%.16s"
+>32	ulelong	x		fsid.crc 0x%x
+>36	ulelong	x		fsid.edition %d
+>40	ulelong	x		fsid.blocks %d
+>44	ulelong	x		fsid.files %d
+>48	string	>\0		name "%.16s"
+512	ulelong	0x28cd3d45	Linux cramfs offset 512
+>516	ulelong	x		size %d
+>520	ulelong	x		flags 0x%x
+>524	ulelong	x		future 0x%x
+>528	string	>\0		signature "%.16s"
+>544	ulelong	x		fsid.crc 0x%x
+>548	ulelong	x		fsid.edition %d
+>552	ulelong	x		fsid.blocks %d
+>556	ulelong	x		fsid.files %d
+>560	string	>\0		name "%.16s"
+
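
The first line of each magic block above can be checked by hand. The sketch
below looks for the little-endian value 0x28cd3d45 at offset 0 or 512 of a
buffer. find_cramfs_magic() and read_le32() are hypothetical helpers written
for this note, not functions from the kernel or from file(1).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRAMFS_MAGIC 0x28cd3d45u

/* Read an unsigned 32-bit little-endian value ("ulelong" in magic terms). */
static uint32_t read_le32(const unsigned char *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Return the offset (0 or 512) at which the cramfs magic was found,
 * or -1 if neither offset holds the little-endian magic value.
 */
static int find_cramfs_magic(const unsigned char *buf, size_t len)
{
	static const size_t offsets[] = { 0, 512 };
	size_t i;

	assert(buf != NULL);
	for (i = 0; i < sizeof(offsets) / sizeof(offsets[0]); i++) {
		if (len >= offsets[i] + 4 &&
		    read_le32(buf + offsets[i]) == CRAMFS_MAGIC)
			return (int)offsets[i];
	}
	return -1;
}
```

The remaining fields in the table (size, flags, signature and so on) sit at
fixed offsets after whichever magic offset matched.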
+
+Hacker Notes
+------------
+
+See fs/cramfs/README for filesystem layout and implementation notes.
--- a/Documentation/filesystems/dentry-locking.txt
+++ b/Documentation/filesystems/dentry-locking.txt
@@ -0,0 +1,173 @@
+RCU-based dcache locking model
+==============================
+
+On many workloads, the most common operation on dcache is to look up a
+dentry, given a parent dentry and the name of the child. Typically,
+for every open(), stat() etc., the dentry corresponding to the
+pathname will be looked up by walking the tree starting with the first
+component of the pathname and using that dentry along with the next
+component to look up the next level and so on. Since it is a frequent
+operation for workloads like multiuser environments and web servers,
+it is important to optimize this path.
+
+Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus for
+every component during path look-up. From 2.5.10 onwards, the fast-walk
+algorithm changed this by holding the dcache_lock at the beginning and
+walking as many cached path component dentries as possible. This
+significantly decreases the number of acquisitions of
+dcache_lock. However, it also increases the lock hold time
+significantly and affects performance on large SMP machines. Since
+the 2.5.62 kernel, dcache has been using a new locking model that uses
+RCU to make dcache look-up lock-free.
+
+The current dcache locking model is not very different from the
+previous dcache locking model. Prior to the 2.5.62 kernel, dcache_lock
+protected the hash chain, d_child, d_alias, d_lru lists as well as
+d_inode and several other things like mount look-up. RCU-based changes
+affect only the way the hash chain is protected. For everything else
+the dcache_lock must be taken for both traversing as well as
+updating. The hash chain updates too take the dcache_lock.  The
+significant change is the way d_lookup traverses the hash chain: it
+doesn't acquire the dcache_lock for this and relies on RCU to ensure
+that the dentry has not been *freed*.
+
+
+Dcache locking details
+======================
+
+For many multi-user workloads, open() and stat() on files are very
+frequently occurring operations. Both involve walking of path names to
+find the dentry corresponding to the concerned file. In the 2.4 kernel,
+dcache_lock was held during look-up of each path component. Contention
+and cache-line bouncing of this global lock caused significant
+scalability problems. With the introduction of RCU in the Linux kernel,
+this was worked around by making the look-up of path components during
+path walking lock-free.
+
+
+Safe lock-free look-up of dcache hash table
+===========================================
+
+Dcache is a complex data structure with the hash table entries also
+linked together in other lists. In the 2.4 kernel, dcache_lock protected
+all the lists. We applied RCU only on hash chain walking. The rest of
+the lists are still protected by dcache_lock.  Some of the important
+changes are:
+
+1. The deletion from hash chain is done using hlist_del_rcu() macro
+   which doesn't initialize next pointer of the deleted dentry and
+   this allows us to walk safely lock-free while a deletion is
+   happening.
+
+2. Insertion of a dentry into the hash table is done using
+   hlist_add_head_rcu() which takes care of ordering the writes - the
+   writes to the dentry must be visible before the dentry is
+   inserted. This works in conjunction with hlist_for_each_rcu() while
+   walking the hash chain. The only requirement is that all
+   initialization to the dentry must be done before
+   hlist_add_head_rcu() since we don't have dcache_lock protection
+   while traversing the hash chain. This isn't different from the
+   existing code.
+
+3. A dentry looked up without holding dcache_lock cannot be
+   returned for walking if it is unhashed. It then may have a NULL
+   d_inode or other bogosity since RCU doesn't protect the other
+   fields in the dentry. We therefore use a flag DCACHE_UNHASHED to
+   indicate unhashed dentries and use this in conjunction with a
+   per-dentry lock (d_lock). Once looked up without the dcache_lock,
+   we acquire the per-dentry lock (d_lock) and check if the dentry is
+   unhashed. If so, the look-up is failed. If not, the reference count
+   of the dentry is increased and the dentry is returned.
+
+4. Once a dentry is looked up, it must be ensured that it doesn't go
+   away during the path walk for that component. In pre-2.5.10 code,
+   this was done by holding a reference to the dentry. dcache_rcu does
+   the same. In some sense, dcache_rcu path walking looks like the
+   pre-2.5.10 version.
+
+5. All dentry hash chain updates must take the dcache_lock as well as
+   the per-dentry lock in that order. dput() does this to ensure that
+   a dentry that has just been looked up in another CPU doesn't get
+   deleted before dget() can be done on it.
+
+6. There are several ways to do reference counting of RCU protected
+   objects. One such example is in ipv4 route cache where deferred
+   freeing (using call_rcu()) is done as soon as the reference count
+   goes to zero. This cannot be done in the case of dentries because
+   tearing down of dentries requires blocking (dentry_iput()) which
+   isn't supported from RCU callbacks. Instead, tearing down of
+   dentries happens synchronously in dput(), but actual freeing happens
+   later when the RCU grace period is over. This allows safe lock-free
+   walking of the hash chains, but a matched dentry may have been
+   partially torn down. The checking of DCACHE_UNHASHED flag with
+   d_lock held detects such dentries and prevents them from being
+   returned from look-up.
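
The re-check described in points 3 and 6 can be sketched in userspace. This
is a toy model under stated assumptions: struct toy_dentry, TOY_UNHASHED and
the atomic_flag spinlock are hypothetical stand-ins for the dentry,
DCACHE_UNHASHED and d_lock, and the lock-free hash walk that finds the
candidate is not shown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define TOY_UNHASHED 0x1u	/* stands in for DCACHE_UNHASHED */

struct toy_dentry {
	atomic_flag d_lock;	/* toy spinlock standing in for d_lock */
	unsigned int d_flags;
	int d_count;
};

static void toy_lock(struct toy_dentry *d)
{
	while (atomic_flag_test_and_set_explicit(&d->d_lock,
						 memory_order_acquire))
		;	/* spin until the lock is free */
}

static void toy_unlock(struct toy_dentry *d)
{
	atomic_flag_clear_explicit(&d->d_lock, memory_order_release);
}

/*
 * Second half of a lock-free look-up: the candidate was found on the
 * hash chain without dcache_lock, so re-check it under its own lock.
 * Returns the candidate with a reference taken, or NULL if it has been
 * unhashed in the meantime.
 */
static struct toy_dentry *toy_lookup_finish(struct toy_dentry *candidate)
{
	struct toy_dentry *found = NULL;

	assert(candidate != NULL);
	toy_lock(candidate);
	if (!(candidate->d_flags & TOY_UNHASHED)) {
		candidate->d_count++;	/* take a reference before returning */
		found = candidate;
	}
	toy_unlock(candidate);
	return found;
}
```

Checking the flag and taking the reference under the same lock is what keeps
a partially torn down dentry from escaping to the caller.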
+
+
+Maintaining POSIX rename semantics
+==================================
+
+Since look-up of dentries is lock-free, it can race against a
+concurrent rename operation. For example, during rename of file A to
+B, look-up of either A or B must succeed.  So, if look-up of B happens
+after A has been removed from the hash chain but not added to the new
+hash chain, it may fail.  Also, a comparison while the name is being
+written concurrently by a rename may result in false positive matches
+violating rename semantics.  Issues related to race with rename are
+handled as described below :
+
+1. Look-up can be done in two ways - d_lookup() which is safe from
+   simultaneous renames and __d_lookup() which is not.  If
+   __d_lookup() fails, it must be followed up by a d_lookup() to
+   correctly determine whether a dentry is in the hash table or
+   not. d_lookup() protects look-ups using a sequence lock
+   (rename_lock).
+
+2. The name associated with a dentry (d_name) may be changed if a
+   rename is allowed to happen simultaneously. To keep memcmp() in
+   __d_lookup() from going out of bounds due to a rename, and to avoid
+   false positive comparisons, the name comparison is done while
+   holding the per-dentry lock. This prevents concurrent renames
+   during this operation.
+
+3. Hash table walking during look-up may move to a different bucket as
+   the current dentry is moved to a different bucket due to rename.
+   But we use hlists in dcache hash table and they are
+   null-terminated.  So, even if a dentry moves to a different bucket,
+   hash chain walk will terminate. [with a list_head list, it may not
+   since termination is when the list_head in the original bucket is
+   reached].  Since we redo the d_parent check and compare name while
+   holding d_lock, lock-free look-up will not race against d_move().
+
+4. There can be a theoretical race when a dentry keeps coming back to
+   its original bucket due to double moves. Because of this, look-up
+   may consider that it has never moved and can end up in an infinite
+   loop. But this is not any worse than the theoretical livelocks we
+   already have in the kernel.
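
Point 1 relies on a sequence lock. The sketch below shows the general idea
of such a lock in userspace C11; it is not the kernel's seqlock code. The
toy_* names are hypothetical, a single writer is assumed, and the default
sequentially consistent atomics are used for simplicity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy sequence counter standing in for rename_lock; even = no writer active. */
struct toy_seqlock {
	atomic_uint sequence;
};

static unsigned int toy_read_begin(struct toy_seqlock *sl)
{
	unsigned int seq;

	/* Wait out any writer that is in the middle of an update. */
	do {
		seq = atomic_load(&sl->sequence);
	} while (seq & 1u);
	return seq;
}

/* True if a writer ran since toy_read_begin(), so the reader must retry. */
static bool toy_read_retry(struct toy_seqlock *sl, unsigned int start)
{
	return atomic_load(&sl->sequence) != start;
}

static void toy_write_begin(struct toy_seqlock *sl)
{
	unsigned int old = atomic_fetch_add(&sl->sequence, 1u);

	assert((old & 1u) == 0);	/* single writer assumed */
	(void)old;
}

static void toy_write_end(struct toy_seqlock *sl)
{
	atomic_fetch_add(&sl->sequence, 1u);
}
```

A reader in this scheme would call toy_read_begin(), do its lock-free hash
walk, and repeat the walk only if nothing was found and toy_read_retry()
reports that a rename ran in between. A failed look-up is thus only trusted
when no rename overlapped it.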
+
+
+Important guidelines for filesystem developers related to dcache_rcu
+====================================================================
+
+1. Existing dcache interfaces (pre-2.5.62) exported to filesystems
+   don't change. Only the dcache internal implementation changes. However,
+   filesystems *must not* delete from the dentry hash chains directly
+   using the list macros as was allowed earlier. They must use dcache
+   APIs like d_drop() or __d_drop() depending on the situation.
+
+2. d_flags is now protected by a per-dentry lock (d_lock). All access
+   to d_flags must be protected by it.
+
+3. For a hashed dentry, checking of d_count needs to be protected by
+   d_lock.
+
+
+Papers and other documentation on dcache locking
+================================================
+
+1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124).
+
+2. http://lse.sourceforge.net/locking/dcache/dcache.html
+
+
+
--- a/Documentation/filesystems/directory-locking
+++ b/Documentation/filesystems/directory-locking
@@ -0,0 +1,113 @@
+	Locking scheme used for directory operations is based on two
+kinds of locks - per-inode (->i_sem) and per-filesystem (->s_vfs_rename_sem).
+
+	For our purposes all operations fall in 6 classes:
+
+1) read access.  Locking rules: caller locks directory we are accessing.
+
+2) object creation.  Locking rules: same as above.
+
+3) object removal.  Locking rules: caller locks parent, finds victim,
+locks victim and calls the method.
+
+4) rename() that is _not_ cross-directory.  Locking rules: caller locks
+the parent, finds source and target, if target already exists - locks it
+and then calls the method.
+
+5) link creation.  Locking rules:
+	* lock parent
+	* check that source is not a directory
+	* lock source
+	* call the method.
+
+6) cross-directory rename.  The trickiest in the whole bunch.  Locking
+rules:
+	* lock the filesystem
+	* lock parents in "ancestors first" order.
+	* find source and target.
+	* if old parent is equal to or is a descendent of target
+		fail with -ENOTEMPTY
+	* if new parent is equal to or is a descendent of source
+		fail with -ELOOP
+	* if target exists - lock it.
+	* call the method.
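
The ordering and the two sanity checks for cross-directory rename can be
sketched on a toy tree. This is an illustration only: struct toy_dir and the
toy_* helpers are hypothetical, actual locking is omitted, and for unrelated
parents the sketch simply picks old_parent first, which is an assumption the
text does not spell out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_dir {
	struct toy_dir *parent;	/* NULL for the root */
};

/* Nonzero if d is equal to, or a descendant of, ancestor. */
static int toy_is_subdir(const struct toy_dir *d, const struct toy_dir *ancestor)
{
	for (; d; d = d->parent)
		if (d == ancestor)
			return 1;
	return 0;
}

/* "Ancestors first": return the parent that should be locked first. */
static struct toy_dir *toy_first_to_lock(struct toy_dir *old_parent,
					 struct toy_dir *new_parent)
{
	assert(old_parent != new_parent);	/* cross-directory only */
	if (toy_is_subdir(old_parent, new_parent))
		return new_parent;	/* new_parent is the ancestor */
	return old_parent;
}

/* The two checks from the list; target is NULL if it does not exist. */
static int toy_check_rename(const struct toy_dir *old_parent,
			    const struct toy_dir *new_parent,
			    const struct toy_dir *source,
			    const struct toy_dir *target)
{
	if (target && toy_is_subdir(old_parent, target))
		return -ENOTEMPTY;
	if (toy_is_subdir(new_parent, source))
		return -ELOOP;
	return 0;
}
```

The -ELOOP check is the one that prevents a directory from being moved
underneath itself, which would otherwise detach a cycle from the tree.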
+
+
+The rules above obviously guarantee that all directories that are going to be
+read, modified or removed by method will be locked by caller.
+
+
+If no directory is its own ancestor, the scheme above is deadlock-free.
+Proof:
+
+	First of all, at any moment we have a partial ordering of the
+objects - A < B iff A is an ancestor of B.
+
+	That ordering can change.  However, the following is true:
+
+(1) if object removal or non-cross-directory rename holds lock on A and
+    attempts to acquire lock on B, A will remain the parent of B until we
+    acquire the lock on B.  (Proof: only cross-directory rename can change
+    the parent of object and it would have to lock the parent).
+
+(2) if cross-directory rename holds the lock on filesystem, order will not
+    change until rename acquires all locks.  (Proof: other cross-directory
+    renames will be blocked on filesystem lock and we don't start changing
+    the order until we had acquired all locks).
+
+(3) any operation holds at most one lock on non-directory object and
+    that lock is acquired after all other locks.  (Proof: see descriptions
+    of operations).
+
+	Now consider the minimal deadlock.  Each process is blocked on
+attempt to acquire some lock and already holds at least one lock.  Let's
+consider the set of contended locks.  First of all, filesystem lock is
+not contended, since any process blocked on it is not holding any locks.
+Thus all processes are blocked on ->i_sem.
+
+	Non-directory objects are not contended due to (3).  Thus link
+creation can't be a part of deadlock - it can't be blocked on source
+and it means that it doesn't hold any locks.
+
+	Any contended object is either held by cross-directory rename or
+has a child that is also contended.  Indeed, suppose that it is held by
+operation other than cross-directory rename.  Then the lock this operation
+is blocked on belongs to child of that object due to (1).
+
+	It means that one of the operations is cross-directory rename.
+Otherwise the set of contended objects would be infinite - each of them
+would have a contended child and we had assumed that no object is its
+own descendent.  Moreover, there is exactly one cross-directory rename
+(see above).
+
+	Consider the object blocking the cross-directory rename.  One
+of its descendents is locked by cross-directory rename (otherwise we
+would again have an infinite set of contended objects).  But that
+means that cross-directory rename is taking locks out of order.  Due
+to (2) the order hadn't changed since we had acquired filesystem lock.
+But locking rules for cross-directory rename guarantee that we do not
+try to acquire lock on descendent before the lock on ancestor.
+Contradiction.  I.e.  deadlock is impossible.  Q.E.D.
+
+
+	These operations are guaranteed to avoid loop creation.  Indeed,
+the only operation that could introduce loops is cross-directory rename.
+Since the only new (parent, child) pair added by rename() is (new parent,
+source), such loop would have to contain these objects and the rest of it
+would have to exist before rename().  I.e. at the moment of loop creation
+rename() responsible for that would be holding filesystem lock and new parent
+would have to be equal to or a descendent of source.  But that means that
+new parent had been equal to or a descendent of source since the moment when
+we had acquired filesystem lock and rename() would fail with -ELOOP in that
+case.
+
+	While this locking scheme works for arbitrary DAGs, it relies on
+ability to check that directory is a descendent of another object.  Current
+implementation assumes that directory graph is a tree.  This assumption is
+also preserved by all operations (cross-directory rename on a tree that would
+not introduce a cycle will leave it a tree and link() fails for directories).
+
+	Notice that "directory" in the above == "anything that might have
+children", so if we are going to introduce hybrid objects we will need
+either to make sure that link(2) doesn't work for them or to make changes
+in is_subdir() that would make it work even in presence of such beasts.
--- a/Documentation/filesystems/dlmfs.txt
+++ b/Documentation/filesystems/dlmfs.txt
@@ -0,0 +1,130 @@
+dlmfs
+==================
+A minimal DLM userspace interface implemented via a virtual file
+system.
+
+dlmfs is built with OCFS2 as it requires most of its infrastructure.
+
+Project web page:    http://oss.oracle.com/projects/ocfs2
+Tools web page:      http://oss.oracle.com/projects/ocfs2-tools
+OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/
+
+All code copyright 2005 Oracle except when otherwise noted.
+
+CREDITS
+=======
+
+Some code taken from ramfs which is Copyright (C) 2000 Linus Torvalds
+and Transmeta Corp.
+
+Mark Fasheh <mark.fasheh@oracle.com>
+
+Caveats
+=======
+- Right now it only works with the OCFS2 DLM, though support for other
+  DLM implementations should not be a major issue.
+
+Mount options
+=============
+None
+
+Usage
+=====
+
+If you're just interested in OCFS2, then please see ocfs2.txt. The
+rest of this document will be geared towards those who want to use
+dlmfs for easy to set up and easy to use clustered locking in
+userspace.
+
+Setup
+=====
+
+dlmfs requires that the OCFS2 cluster infrastructure be in
+place. Please download ocfs2-tools from the above url and configure a
+cluster.
+
+You'll want to start heartbeating on a volume which all the nodes in
+your lockspace can access. The easiest way to do this is via
+ocfs2_hb_ctl (distributed with ocfs2-tools). Right now it requires
+that an OCFS2 file system be in place so that it can automatically
+find its heartbeat area, though it will eventually support heartbeat
+against raw disks.
+
+Please see the ocfs2_hb_ctl and mkfs.ocfs2 manual pages distributed
+with ocfs2-tools.
+
+Once you're heartbeating, DLM lock 'domains' can be easily created /
+destroyed and locks within them accessed.
+
+Locking
+=======
+
+Users may access dlmfs via standard file system calls, or they can use
+'libo2dlm' (distributed with ocfs2-tools) which abstracts the file
+system calls and presents a more traditional locking api.
+
+dlmfs handles lock caching automatically for the user, so a lock
+request for an already acquired lock will not generate another DLM
+call. Userspace programs are assumed to handle their own local
+locking.
+
+Two levels of locks are supported - Shared Read, and Exclusive.
+Also supported is a Trylock operation.
+
+For information on the libo2dlm interface, please see o2dlm.h,
+distributed with ocfs2-tools.
+
+Lock value blocks can be read and written to a resource via read(2)
+and write(2) against the fd obtained via your open(2) call. The
+maximum currently supported LVB length is 64 bytes (though that is an
+OCFS2 DLM limitation). Through this mechanism, users of dlmfs can share
+small amounts of data amongst their nodes.
+
+mkdir(2) signals dlmfs to join a domain (which will have the same name
+as the resulting directory).
+
+rmdir(2) signals dlmfs to leave the domain.
+
+Locks for a given domain are represented by regular inodes inside the
+domain directory.  Locking against them is done via the open(2) system
+call.
+
+The open(2) call will not return until your lock has been granted or
+an error has occurred, unless it has been instructed to do a trylock
+operation. If the lock succeeds, you'll get an fd.
+
+Call open(2) with O_CREAT to ensure the resource inode is created - dlmfs
+does not automatically create inodes for existing lock resources.
+
+Open Flag     Lock Request Type
+---------     -----------------
+O_RDONLY      Shared Read
+O_RDWR        Exclusive
+
+Open Flag     Resulting Locking Behavior
+---------     --------------------------
+O_NONBLOCK    Trylock operation
+
+You must provide exactly one of O_RDONLY or O_RDWR.
+
+If O_NONBLOCK is also provided and the trylock operation was valid but
+could not lock the resource then open(2) will fail with ETXTBSY.
+
+close(2) drops the lock associated with your fd.
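
The flag tables above map directly onto an open(2) call. A minimal sketch,
assuming a resource file inside an already joined domain directory:
toy_dlmfs_flags() and toy_dlmfs_lock() are hypothetical helpers, the 0600
mode is an arbitrary choice, and ETXTBSY is the errno name from <errno.h>
that this sketch takes to be the trylock failure the text describes.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>

/* Build the open(2) flags for a lock request, following the tables above. */
static int toy_dlmfs_flags(bool exclusive, bool trylock)
{
	/* O_CREAT makes sure the resource inode exists locally. */
	int flags = O_CREAT | (exclusive ? O_RDWR : O_RDONLY);

	if (trylock)
		flags |= O_NONBLOCK;
	return flags;
}

/*
 * Take a lock on a resource file inside a domain directory.
 * Returns the fd holding the lock, or -1 on failure; *busy is set when
 * a trylock failed because the resource could not be locked.
 * Calling close() on the returned fd drops the lock.
 */
static int toy_dlmfs_lock(const char *path, bool exclusive, bool trylock,
			  bool *busy)
{
	int fd;

	assert(path != NULL && busy != NULL);
	*busy = false;
	fd = open(path, toy_dlmfs_flags(exclusive, trylock), 0600);
	if (fd < 0 && trylock && errno == ETXTBSY)
		*busy = true;
	return fd;
}
```

Without O_NONBLOCK the open(2) call simply blocks until the lock is granted
or an error occurs, so the busy flag is only meaningful for trylocks.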
+
+Modes passed to mkdir(2) or open(2) are adhered to locally. Chown is
+supported locally as well. This means you can use them to restrict
+access to the resources via dlmfs on your local node only.
+
+The resource LVB may be read from the fd in either Shared Read or
+Exclusive modes via the read(2) system call. It can be written via
+write(2) only when open in Exclusive mode.
+
+Once written, an LVB will be visible to other nodes who obtain Shared
+Read or higher level locks on the resource.
+
+See Also
+========
+http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf
+
+For more information on the VMS distributed locking API.
--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -0,0 +1,382 @@
+
+The Second Extended Filesystem
+==============================
+
+ext2 was originally released in January 1993.  Written by Rémy Card,
+Theodore Ts'o and Stephen Tweedie, it was a major rewrite of the
+Extended Filesystem.  It is currently still (April 2001) the predominant
+filesystem in use by Linux.  There are also implementations available
+for NetBSD, FreeBSD, the GNU HURD, Windows 95/98/NT, OS/2 and RISC OS.
+
+Options
+=======
+
+Most defaults are determined by the filesystem superblock, and can be
+set using tune2fs(8). Kernel-determined defaults are indicated by (*).
+
+bsddf			(*)	Makes `df' act like BSD.
+minixdf				Makes `df' act like Minix.
+
+check=none, nocheck	(*)	Don't do extra checking of bitmaps on mount
+				(check=normal and check=strict options removed)
+
+debug				Extra debugging information is sent to the
+				kernel syslog.  Useful for developers.
+
+errors=continue			Keep going on a filesystem error.
+errors=remount-ro		Remount the filesystem read-only on an error.
+errors=panic			Panic and halt the machine if an error occurs.
+
+grpid, bsdgroups		Give objects the same group ID as their parent.
+nogrpid, sysvgroups		New objects have the group ID of their creator.
+
+nouid32				Use 16-bit UIDs and GIDs.
+
+oldalloc			Enable the old block allocator. Orlov should
+				have better performance, we'd like to get some
+				feedback if it's the contrary for you.
+orlov			(*)	Use the Orlov block allocator.
+				(See http://lwn.net/Articles/14633/ and
+				http://lwn.net/Articles/14446/.)
+
+resuid=n			The user ID which may use the reserved blocks.
+resgid=n			The group ID which may use the reserved blocks.
+
+sb=n				Use alternate superblock at this location.
+
+user_xattr			Enable "user." POSIX Extended Attributes
+				(requires CONFIG_EXT2_FS_XATTR).
+				See also http://acl.bestbits.at
+nouser_xattr			Don't support "user." extended attributes.
+
+acl				Enable POSIX Access Control Lists support
+				(requires CONFIG_EXT2_FS_POSIX_ACL).
+				See also http://acl.bestbits.at
+noacl				Don't support POSIX ACLs.
+
+nobh				Do not attach buffer_heads to file pagecache.
+
+xip				Use execute in place (no caching) if possible
+
+grpquota,noquota,quota,usrquota	Quota options are silently ignored by ext2.
+
+
+Specification
+=============
+
+ext2 shares many properties with traditional Unix filesystems.  It has
+the concepts of blocks, inodes and directories.  It has space in the
+specification for Access Control Lists (ACLs), fragments, undeletion and
+compression though these are not yet implemented (some are available as
+separate patches).  There is also a versioning mechanism to allow new
+features (such as journalling) to be added in a maximally compatible
+manner.
+
+Blocks
+------
+
+The space in the device or file is split up into blocks.  These are
+a fixed size, of 1024, 2048 or 4096 bytes (8192 bytes on Alpha systems),
+which is decided when the filesystem is created.  Smaller blocks mean
+less wasted space per file, but require slightly more accounting overhead,
+and also impose other limits on the size of files and the filesystem.
+
+Block Groups
+------------
+
+Blocks are clustered into block groups in order to reduce fragmentation
+and minimise the amount of head seeking when reading a large amount
+of consecutive data.  Information about each block group is kept in a
+descriptor table stored in the block(s) immediately after the superblock.
+Two blocks near the start of each group are reserved for the block usage
+bitmap and the inode usage bitmap which show which blocks and inodes
+are in use.  Since each bitmap is limited to a single block and each
+block needs one bit, the maximum number of blocks in a block group is
+8 times the block size in bytes (8192 blocks for 1024-byte blocks).
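
The block group limit follows from simple arithmetic: a bitmap that fills
one block has 8 bits per byte, and each bit tracks one block. A small sketch
of that calculation; the toy_* helpers are hypothetical and only accept the
block sizes listed in the Blocks section.

```c
#include <assert.h>
#include <stdint.h>

/* One bitmap block has block_size * 8 bits, and each bit tracks one block. */
static uint32_t toy_max_blocks_per_group(uint32_t block_size)
{
	assert(block_size == 1024 || block_size == 2048 ||
	       block_size == 4096 || block_size == 8192);
	return block_size * 8u;
}

/* Largest block group in bytes for a given block size. */
static uint64_t toy_max_group_bytes(uint32_t block_size)
{
	return (uint64_t)toy_max_blocks_per_group(block_size) * block_size;
}
```

With 1024-byte blocks this gives 8192 blocks (8 MiB) per group, and with
4096-byte blocks 32768 blocks (128 MiB) per group.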
+
+The block(s) following the bitmaps in each block group are designated
+as the inode table for that block group and the remainder are the data
+blocks.  The block allocation algorithm attempts to allocate data blocks
+in the same block group as the inode which contains them.
+
+The Superblock
+--------------
+
+The superblock contains all the information about the configuration of
+the filing system.  The primary copy of the superblock is stored at an
+offset of 1024 bytes from the start of the device, and it is essential
+to mounting the filesystem.  Since it is so important, backup copies of
+the superblock are stored in block groups throughout the filesystem.
+The first version of ext2 (revision 0) stores a copy at the start of
+every block group, along with backups of the group descriptor block(s).
+Because this can consume a considerable amount of space for large
+filesystems, later revisions can optionally reduce the number of backup
+copies by only putting backups in specific groups (this is the sparse
+superblock feature).  The groups chosen are 0, 1 and powers of 3, 5 and 7.
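
The group selection rule for sparse superblocks can be written as a short
predicate. This is a sketch of the rule as stated above (groups 0, 1 and
powers of 3, 5 and 7); toy_group_has_backup() is a hypothetical name and is
not claimed to match the kernel's own helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if n is a power of base (base^0 == 1 counts as a power). */
static bool toy_is_power_of(uint32_t n, uint32_t base)
{
	assert(base > 1);
	if (n == 0)
		return false;
	while (n % base == 0)
		n /= base;
	return n == 1;
}

/* True if this block group holds a superblock copy under the sparse rule. */
static bool toy_group_has_backup(uint32_t group)
{
	if (group <= 1)
		return true;
	return toy_is_power_of(group, 3) ||
	       toy_is_power_of(group, 5) ||
	       toy_is_power_of(group, 7);
}
```

Because the chosen group numbers grow geometrically, the number of copies
grows only logarithmically with the number of block groups.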
+
+The information in the superblock contains fields such as the total
+number of inodes and blocks in the filesystem and how many are free,
+how many inodes and blocks are in each block group, when the filesystem
+was mounted (and if it was cleanly unmounted), when it was modified,
+what version of the filesystem it is (see the Revisions section below)
+and which OS created it.
+
+If the filesystem is revision 1 or higher, then there are extra fields,
+such as a volume name, a unique identification number, the inode size,
+and space for optional filesystem features to store configuration info.
+
+All fields in the superblock (as in all other ext2 structures) are stored
+on the disc in little endian format, so a filesystem is portable between
+machines without having to know what machine it was created on.
+
+Inodes
+------
+
+The inode (index node) is a fundamental concept in the ext2 filesystem.
+Each object in the filesystem is represented by an inode.  The inode
+structure contains pointers to the filesystem blocks which contain the
+data held in the object and all of the metadata about an object except
+its name.  The metadata about an object includes the permissions, owner,
+group, flags, size, number of blocks used, access time, change time,
+modification time, deletion time, number of links, fragments, version
+(for NFS) and extended attributes (EAs) and/or Access Control Lists (ACLs).
+
+There are some reserved fields which are currently unused in the inode
+structure and several which are overloaded.  One field is reserved for the
+directory ACL if the inode is a directory and alternately for the top 32
+bits of the file size if the inode is a regular file (allowing file sizes
+larger than 2GB).  The translator field is unused under Linux, but is used
+by the HURD to reference the inode of a program which will be used to
+interpret this object.  Most of the remaining reserved fields have been
+used up for both Linux and the HURD for larger owner and group fields.
+The HURD also has a larger mode field so it uses another of the remaining
+fields to store the extra mode bits.
+
+There are pointers to the first 12 blocks which contain the file's data
+in the inode.  There is a pointer to an indirect block (which contains
+pointers to the next set of blocks), a pointer to a doubly-indirect
+block (which contains pointers to indirect blocks) and a pointer to a
+trebly-indirect block (which contains pointers to doubly-indirect blocks).
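+
+As a rough worked example, the number of blocks reachable through this
+pointer scheme can be computed from the block size.  The sketch assumes
+4-byte block pointers, which is not stated above; the names are
+illustrative only.
+
```python
POINTER_SIZE = 4      # assumption: each block pointer is a 32-bit number
DIRECT_POINTERS = 12  # direct pointers held in the inode, as described above


def addressable_blocks(block_size):
    """Blocks reachable via direct, indirect, doubly- and trebly-indirect pointers."""
    per_block = block_size // POINTER_SIZE  # pointers that fit in one block
    return DIRECT_POINTERS + per_block + per_block ** 2 + per_block ** 3


for block_size in (1024, 2048):
    size_gib = addressable_blocks(block_size) * block_size / 2 ** 30
    print(block_size, round(size_gib, 1))
```
+
+For 1kB and 2kB blocks this gives roughly 16GB and 256GB, in line with the
+file size limits listed under Limitations.  For larger block sizes the
+listed limits are lower than this pointer arithmetic alone would give, so
+other constraints apply there.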
+
+The flags field contains some ext2-specific flags which aren't catered
+for by the standard chmod flags.  These flags can be listed with lsattr
+and changed with the chattr command, and allow specific filesystem
+behaviour on a per-file basis.  There are flags for secure deletion,
+undeletable, compression, synchronous updates, immutability, append-only,
+dumpable, no-atime, indexed directories, and data-journaling.  Not all
+of these are supported yet.
+
+Directories
+-----------
+
+A directory is a filesystem object and has an inode just like a file.
+It is a specially formatted file containing records which associate
+each name with an inode number.  Later revisions of the filesystem also
+encode the type of the object (file, directory, symlink, device, fifo,
+socket) to avoid the need to check the inode itself for this information
+(support for taking advantage of this feature does not yet exist in
+Glibc 2.2).
+
+The inode allocation code tries to assign inodes which are in the same
+block group as the directory in which they are first created.
+
+The current implementation of ext2 uses a singly-linked list to store
+the filenames in the directory; a pending enhancement uses hashing of the
+filenames to allow lookup without the need to scan the entire directory.
+
+The current implementation never removes empty directory blocks once they
+have been allocated to hold more files.
+
+Special files
+-------------
+
+Symbolic links are also filesystem objects with inodes.  They deserve
+special mention because the data for them is stored within the inode
+itself if the symlink is less than 60 bytes long.  It uses the fields
+which would normally be used to store the pointers to data blocks.
+This is a worthwhile optimisation as it avoids allocating a full
+block for the symlink, and most symlinks are less than 60 characters long.
+
+Character and block special devices never have data blocks assigned to
+them.  Instead, their device number is stored in the inode, again reusing
+the fields which would be used to point to the data blocks.
+
+Reserved Space
+--------------
+
+In ext2, there is a mechanism for reserving a certain number of blocks
+for a particular user (normally the super-user).  This is intended to
+allow for the system to continue functioning even if non-privileged users
+fill up all the space available to them (this is independent of filesystem
+quotas).  It also keeps the filesystem from filling up entirely which
+helps combat fragmentation.
+
+Filesystem check
+----------------
+
+At boot time, most systems run a consistency check (e2fsck) on their
+filesystems.  The superblock of the ext2 filesystem contains several
+fields which indicate whether fsck should actually run (since checking
+the filesystem at boot can take a long time if it is large).  fsck will
+run if the filesystem was not cleanly unmounted, if the maximum mount
+count has been exceeded or if the maximum time between checks has been
+exceeded.
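+
+The three conditions can be written down as a small decision function.
+This is a sketch of the rule as described here, not e2fsck's actual logic;
+the parameter names are hypothetical, and treating a limit of zero as
+"disabled" is an assumption made for the example.
+
```python
def fsck_should_run(cleanly_unmounted, mount_count, max_mount_count,
                    last_check, check_interval, now):
    """Decide whether a boot-time check is due (times in seconds)."""
    if not cleanly_unmounted:
        return True
    # Assumption: a limit of 0 means that check is disabled.
    if max_mount_count > 0 and mount_count > max_mount_count:
        return True
    if check_interval > 0 and now - last_check > check_interval:
        return True
    return False


DAY = 24 * 60 * 60
print(fsck_should_run(True, 5, 20, last_check=0, check_interval=180 * DAY,
                      now=30 * DAY))
```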
+
+Feature Compatibility
+---------------------
+
+The compatibility feature mechanism used in ext2 is sophisticated.
+It safely allows features to be added to the filesystem, without
+unnecessarily sacrificing compatibility with older versions of the
+filesystem code.  The feature compatibility mechanism is not supported by
+the original revision 0 (EXT2_GOOD_OLD_REV) of ext2, but was introduced in
+revision 1.  There are three 32-bit fields, one for compatible features
+(COMPAT), one for read-only compatible (RO_COMPAT) features and one for
+incompatible (INCOMPAT) features.
+
+These feature flags have specific meanings for the kernel as follows:
+
+A COMPAT flag indicates that a feature is present in the filesystem,
+but the on-disk format is 100% compatible with older on-disk formats, so
+a kernel which didn't know anything about this feature could read/write
+the filesystem without any chance of corrupting the filesystem (or even
+making it inconsistent).  This is essentially just a flag which says
+"this filesystem has a (hidden) feature" that the kernel or e2fsck may
+want to be aware of (more on e2fsck and feature flags later).  The ext3
+HAS_JOURNAL feature is a COMPAT flag because the ext3 journal is simply
+a regular file with data blocks in it so the kernel does not need to
+take any special notice of it if it doesn't understand ext3 journaling.
+
+An RO_COMPAT flag indicates that the on-disk format is 100% compatible
+with older on-disk formats for reading (i.e. the feature does not change
+the visible on-disk format).  However, an old kernel writing to such a
+filesystem would/could corrupt the filesystem, so this is prevented. The
+most common such feature, SPARSE_SUPER, is an RO_COMPAT feature because
+sparse groups allow file data blocks where superblock/group descriptor
+backups used to live, and ext2_free_blocks() refuses to free these blocks,
+which would lead to inconsistent bitmaps.  An old kernel would also
+get an error if it tried to free a series of blocks which crossed a group
+boundary, but this is a legitimate layout in a SPARSE_SUPER filesystem.
+
+An INCOMPAT flag indicates the on-disk format has changed in some
+way that makes it unreadable by older kernels, or would otherwise
+cause a problem if an old kernel tried to mount it.  FILETYPE is an
+INCOMPAT flag because older kernels would think a filename was longer
+than 256 characters, which would lead to corrupt directory listings.
+The COMPRESSION flag is an obvious INCOMPAT flag - if the kernel
+doesn't understand compression, you would just get garbage back from
+read() instead of it automatically decompressing your data.  The ext3
+RECOVER flag is needed to prevent a kernel which does not understand the
+ext3 journal from mounting the filesystem without replaying the journal.
+
+e2fsck needs to be stricter than the kernel in its handling of these
+flags.  If it doesn't understand ANY of the COMPAT,
+RO_COMPAT, or INCOMPAT flags it will refuse to check the filesystem,
+because it has no way of verifying whether a given feature is valid
+or not.  Allowing e2fsck to succeed on a filesystem with an unknown
+feature gives the user a false sense of security.  Refusing to check
+a filesystem with unknown features is a good incentive for the user to
+update to the latest e2fsck.  This also means that anyone adding feature
+flags to ext2 also needs to update e2fsck to verify these features.
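+
+The different treatment of the three flag words by the kernel and by e2fsck
+can be summarised as bitmask tests.  This is a minimal sketch of the rules
+described above; the function names and the bit values used in the example
+are placeholders, not the real constants.
+
```python
def kernel_mount_mode(incompat, ro_compat, known_incompat, known_ro_compat):
    """Return 'refuse', 'read-only' or 'read-write' for a kernel mount."""
    if incompat & ~known_incompat:
        return "refuse"        # unknown INCOMPAT bit: do not mount at all
    if ro_compat & ~known_ro_compat:
        return "read-only"     # unknown RO_COMPAT bit: writing is prevented
    return "read-write"        # unknown COMPAT bits never matter to the kernel


def e2fsck_will_check(compat, ro_compat, incompat,
                      known_compat, known_ro_compat, known_incompat):
    """e2fsck refuses if ANY bit in ANY of the three words is unknown."""
    unknown = ((compat & ~known_compat)
               | (ro_compat & ~known_ro_compat)
               | (incompat & ~known_incompat))
    return unknown == 0


# Placeholder bit values: the filesystem sets RO_COMPAT bit 0x2, which this
# hypothetical old kernel (knowing only 0x1) does not understand.
print(kernel_mount_mode(incompat=0x0, ro_compat=0x2,
                        known_incompat=0x1, known_ro_compat=0x1))
```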
+
+Metadata
+--------
+
+It is frequently claimed that the ext2 implementation of writing
+asynchronous metadata is faster than the ffs synchronous metadata
+scheme but less reliable.  Both methods are equally resolvable by their
+respective fsck programs.
+
+If you're exceptionally paranoid, there are 3 ways of making metadata
+writes synchronous on ext2:
+
+per-file if you have the program source: use the O_SYNC flag to open()
+per-file if you don't have the source: use "chattr +S" on the file
+per-filesystem: add the "sync" option to mount (or in /etc/fstab)
+
+The first and last are not ext2 specific but do force the metadata to
+be written synchronously.  See also Journaling below.
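+
+For the first method, a program opens the file with O_SYNC.  The sketch
+below shows this from Python, whose os module exposes the same open() flag
+on Linux; the helper name is made up for the example.
+
```python
import os


def write_synchronously(path, data):
    """Write `data` to `path` through a descriptor opened with O_SYNC."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        view = memoryview(data)
        while view:                      # os.write() may write only part
            written = os.write(fd, view)
            view = view[written:]
    finally:
        os.close(fd)
```
+
+Each write() on such a descriptor returns only after the data has been
+handed to the storage device, so this trades speed for safety.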
+
+Limitations
+-----------
+
+There are various limits imposed by the on-disk layout of ext2.  Other
+limits are imposed by the current implementation of the kernel code.
+Many of the limits are determined at the time the filesystem is first
+created, and depend upon the block size chosen.  The ratio of inodes to
+data blocks is fixed at filesystem creation time, so the only way to
+increase the number of inodes is to increase the size of the filesystem.
+No tools currently exist which can change the ratio of inodes to blocks.
+
+Most of these limits could be overcome with slight changes in the on-disk
+format and using a compatibility flag to signal the format change (at
+the expense of some compatibility).
+
+Filesystem block size:     1kB        2kB        4kB        8kB
+
+File size limit:          16GB      256GB     2048GB     2048GB
+Filesystem size limit:  2047GB     8192GB    16384GB    32768GB
+
+There is a 2.4 kernel limit of 2048GB for a single block device, so no
+filesystem larger than that can be created at this time.  There is also
+an upper limit on the block size imposed by the page size of the kernel,
+so 8kB blocks are only allowed on Alpha systems (and other architectures
+which support larger pages).
+
+There is an upper limit of 32768 subdirectories in a single directory.
+
+There is a "soft" upper limit of about 10-15k files in a single directory
+with the current linear linked-list directory implementation.  This limit
+stems from performance problems when creating and deleting (and also
+finding) files in such large directories.  Using a hashed directory index
+(under development) allows 100k-1M+ files in a single directory without
+performance problems (although RAM size becomes an issue at this point).
+
+The (meaningless) absolute upper limit of files in a single directory
+(imposed by the file size, the realistic limit is obviously much less)
+is over 130 trillion files.  It would be higher except there are not
+enough 4-character names to make up unique directory entries, so they
+have to be 8 character filenames; even then we are fairly close to
+running out of unique filenames.
+
+Journaling
+----------
+
+A journaling extension to the ext2 code has been developed by Stephen
+Tweedie.  It avoids the risks of metadata corruption and the need to
+wait for e2fsck to complete after a crash, without requiring a change
+to the on-disk ext2 layout.  In a nutshell, the journal is a regular
+file which stores whole metadata (and optionally data) blocks that have
+been modified, prior to writing them into the filesystem.  This means
+it is possible to add a journal to an existing ext2 filesystem without
+the need for data conversion.
+
+When changes are made to the filesystem (e.g. a file is renamed), they are
+stored in a transaction in the journal, which can be either complete or
+incomplete at the time of a crash.  If a transaction is complete at the
+time of a crash
+(or in the normal case where the system does not crash), then any blocks
+in that transaction are guaranteed to represent a valid filesystem state,
+and are copied into the filesystem.  If a transaction is incomplete at
+the time of the crash, then there is no guarantee of consistency for
+the blocks in that transaction so they are discarded (which means any
+filesystem changes they represent are also lost).
+Check Documentation/filesystems/ext3.txt if you want to read more about
+ext3 and journaling.
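+
+The complete/incomplete rule described above can be modelled as a toy
+replay loop.  This is only a conceptual sketch: the dictionaries standing
+in for the filesystem and the journal are hypothetical and bear no relation
+to the real on-disk journal format.
+
```python
def replay(filesystem, journal):
    """Copy blocks of complete transactions into `filesystem`, in order."""
    for transaction in journal:
        if transaction["complete"]:
            filesystem.update(transaction["blocks"])
        # Incomplete transactions are discarded: their changes are lost,
        # but the filesystem never sees a half-applied update.
    return filesystem


fs = {10: "old dir block", 11: "old inode block"}
journal = [
    {"complete": True, "blocks": {10: "dir block after rename"}},
    {"complete": False, "blocks": {11: "half-written inode block"}},
]
print(replay(fs, journal))
```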
+
+References
+==========
+
+The kernel source	file:/usr/src/linux/fs/ext2/
+e2fsprogs (e2fsck)	http://e2fsprogs.sourceforge.net/
+Design & Implementation	http://e2fsprogs.sourceforge.net/ext2intro.html
+Journaling (ext3)	ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/
+Filesystem Resizing	http://ext2resize.sourceforge.net/
+Compression (*)		http://e2compr.sourceforge.net/
+
+Implementations for:
+Windows 95/98/NT/2000	http://uranus.it.swin.edu.au/~jn/linux/Explore2fs.htm
+Windows 95 (*)		http://www.yipton.demon.co.uk/content.html#FSDEXT2
+DOS client (*)		ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/
+OS/2			http://perso.wanadoo.fr/matthieu.willm/ext2-os2/
+RISC OS client		ftp://ftp.barnet.ac.uk/pub/acorn/armlinux/iscafs/
+
+(*) no longer actively developed/supported (as of Apr 2001)
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -0,0 +1,198 @@
+
+Ext3 Filesystem
+===============
+
+Ext3 was originally released in September 1999.  It was written by Stephen
+Tweedie for the 2.2 branch, and ported to 2.4 kernels by Peter Braam, Andreas
+Dilger, Andrew Morton, Alexander Viro, Ted Ts'o and Stephen Tweedie.
+
+Ext3 is the ext2 filesystem enhanced with journalling capabilities.
+
+Options
+=======
+
+When mounting an ext3 filesystem, the following options are accepted:
+(*) == default
+
+journal=update		Update the ext3 file system's journal to the current
+			format.
+
+journal=inum		When a journal already exists, this option is ignored.
+			Otherwise, it specifies the number of the inode which
+			will represent the ext3 file system's journal file.
+
+journal_dev=devnum	When the external journal device's major/minor numbers
+			have changed, this option allows the user to specify
+			the new journal location.  The journal device is
+			identified through its new major/minor numbers encoded
+			in devnum.
+
+noload			Don't load the journal on mounting.
+
+data=journal		All data are committed into the journal prior to being
+			written into the main file system.
+
+data=ordered	(*)	All data are forced directly out to the main file
+			system prior to its metadata being committed to the
+			journal.
+
+data=writeback		Data ordering is not preserved, data may be written
+			into the main file system after its metadata has been
+			committed to the journal.
+
+commit=nrsec	(*)	Ext3 can be told to sync all its data and metadata
+			every 'nrsec' seconds. The default value is 5 seconds.
+			This means that if you lose your power, you will lose
+			as much as the latest 5 seconds of work (your
+			filesystem will not be damaged though, thanks to the
+			journaling).  This default value (or any low value)
+			will hurt performance, but it's good for data-safety.
+			Setting it to 0 will have the same effect as leaving
+			it at the default (5 seconds).
+			Setting it to very large values will improve
+			performance.
+
+barrier=1		This enables/disables barriers.  barrier=0 disables
+			it, barrier=1 enables it.
+
+orlov		(*)	This enables the new Orlov block allocator. It is
+			enabled by default.
+
+oldalloc		This disables the Orlov block allocator and enables
+			the old block allocator.  Orlov should have better
+			performance - we'd like to get some feedback if it's
+			the contrary for you.
+
+user_xattr		Enables Extended User Attributes.  Additionally, you
+			need to have extended attribute support enabled in the
+			kernel configuration (CONFIG_EXT3_FS_XATTR).  See the
+			attr(5) manual page and http://acl.bestbits.at/ to
+			learn more about extended attributes.
+
+nouser_xattr		Disables Extended User Attributes.
+
+acl			Enables POSIX Access Control Lists support.
+			Additionally, you need to have ACL support enabled in
+			the kernel configuration (CONFIG_EXT3_FS_POSIX_ACL).
+			See the acl(5) manual page and http://acl.bestbits.at/
+			for more information.
+
+noacl			This option disables POSIX Access Control List
+			support.
+
+reservation
+
+noreservation
+
+bsddf 		(*)	Make 'df' act like BSD.
+minixdf			Make 'df' act like Minix.
+
+check=none		Don't do extra checking of bitmaps on mount.
+nocheck
+
+debug			Extra debugging information is sent to syslog.
+
+errors=remount-ro(*)	Remount the filesystem read-only on an error.
+errors=continue		Keep going on a filesystem error.
+errors=panic		Panic and halt the machine if an error occurs.
+
+grpid			New objects take the group ID of the directory in
+bsdgroups		which they are created.
+
+nogrpid		(*)	New objects have the group ID of their creator.
+sysvgroups
+
+resgid=n		The group ID which may use the reserved blocks.
+
+resuid=n		The user ID which may use the reserved blocks.
+
+sb=n			Use alternate superblock at this location.
+
+quota
+noquota
+grpquota
+usrquota
+
+bh		(*)	ext3 associates buffer heads to data pages to
+nobh			(a) cache disk block mapping information
+			(b) link pages into transaction to provide
+			    ordering guarantees.
+			"bh" option forces use of buffer heads.
+			"nobh" option tries to avoid associating buffer
+			heads (supported only for "writeback" mode).
+
+
+Specification
+=============
+Ext3 shares its on-disk layout with the ext2 filesystem, and adds
+transaction capabilities to ext2.  Journaling is done by the Journaling Block
+Device layer.
+
+Journaling Block Device layer
+-----------------------------
+The Journaling Block Device layer (JBD) isn't ext3 specific.  It was designed
+to add journaling capabilities to a block device.  The ext3 filesystem code
+informs the JBD of the modifications it is performing (called a transaction).
+The journal supports transaction start and stop, and in case of a crash, the
+journal can replay the transactions to quickly put the partition back in a
+consistent state.
+
+Handles represent a single atomic update to a filesystem.  JBD can handle an
+external journal on a block device.
+
+Data Mode
+---------
+There are 3 different data modes:
+
+* writeback mode
+In data=writeback mode, ext3 does not journal data at all.  This mode provides
+a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
+mode - metadata journaling.  A crash+recovery can cause incorrect data to
+appear in files which were written shortly before the crash.  This mode will
+typically provide the best ext3 performance.
+
+* ordered mode
+In data=ordered mode, ext3 only officially journals metadata, but it logically
+groups metadata and data blocks into a single unit called a transaction.  When
+it's time to write the new metadata out to disk, the associated data blocks
+are written first.  In general, this mode performs slightly slower than
+writeback but significantly faster than journal mode.
+
+* journal mode
+data=journal mode provides full data and metadata journaling.  All new data is
+written to the journal first, and then to its final location.
+In the event of a crash, the journal can be replayed, bringing both data and
+metadata into a consistent state.  This mode is the slowest except when data
+needs to be read from and written to disk at the same time, where it
+outperforms all other modes.
+
+Compatibility
+-------------
+
+Ext2 partitions can easily be converted to ext3 with `tune2fs -j <dev>`.
+Ext3 is fully compatible with Ext2.  Ext3 partitions can easily be mounted as
+Ext2.
+
+
+External Tools
+==============
+See manual pages to learn more.
+
+tune2fs: 	create an ext3 journal on an ext2 partition with the -j flag.
+mke2fs: 	create an ext3 partition with the -j flag.
+debugfs: 	ext2 and ext3 file system debugger.
+ext2online:	online (mounted) ext2 and ext3 filesystem resizer
+
+
+References
+==========
+
+kernel source:	<file:fs/ext3/>
+		<file:fs/jbd/>
+
+programs: 	http://e2fsprogs.sourceforge.net/
+		http://ext2resize.sourceforge.net
+
+useful links:	http://www.zip.com.au/~akpm/linux/ext3/ext3-usage.html
+		http://www-106.ibm.com/developerworks/linux/library/l-fs7/
+		http://www-106.ibm.com/developerworks/linux/library/l-fs8/
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -0,0 +1,236 @@
+
+Ext4 Filesystem
+===============
+
+This is a development version of the ext4 filesystem, an advanced level
+of the ext3 filesystem which incorporates scalability and reliability
+enhancements for supporting large filesystems (64 bit) in keeping with
+increasing disk capacities and state-of-the-art feature requirements.
+
+Mailing list: linux-ext4@vger.kernel.org
+
+
+1. Quick usage instructions:
+===========================
+
+  - Grab updated e2fsprogs from
+    ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim/
+    This is a patchset on top of e2fsprogs-1.39, which can be found at
+    ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/
+
+  - It's still mke2fs -j /dev/hda1
+
+  - mount /dev/hda1 /wherever -t ext4dev
+
+  - To enable extents,
+
+	mount /dev/hda1 /wherever -t ext4dev -o extents
+
+  - The filesystem is compatible with the ext3 driver until you add a file
+    which has extents (i.e. `mount -o extents', then create a file).
+
+    NOTE: The "extents" mount flag is temporary.  It will soon go away and
+    extents will be enabled by the "-o extents" flag to mke2fs or tune2fs.
+
+  - When comparing performance with other filesystems, remember that
+    ext3/4 by default offers higher data integrity guarantees than most.  So
+    when comparing with a metadata-only journalling filesystem, use `mount -o
+    data=writeback'.  And you might as well use `mount -o nobh' too along
+    with it.  Making the journal larger than the mke2fs default often helps
+    performance with metadata-intensive workloads.
+
+2. Features
+===========
+
+2.1 Currently available
+
+* ability to use filesystems > 16TB
+* extent format reduces metadata overhead (RAM, IO for access, transactions)
+* extent format more robust in face of on-disk corruption due to magics,
+  internal redundancy in tree
+
+2.1 Previously available, soon to be enabled by default by "mkfs.ext4":
+
+* dir_index and resize inode will be on by default
+* large inodes will be used by default for fast EAs, nsec timestamps, etc
+
+2.2 Candidate features for future inclusion
+
+There are several under discussion; whether they all make it in is
+partly a function of how much time everyone has to work on them:
+
+* improved file allocation (multi-block alloc, delayed alloc; basically done)
+* fix 32000 subdirectory limit (patch exists, needs some e2fsck work)
+* nsec timestamps for mtime, atime, ctime, create time (patch exists,
+  needs some e2fsck work)
+* inode version field on disk (NFSv4, Lustre; prototype exists)
+* reduced mke2fs/e2fsck time via uninitialized groups (prototype exists)
+* journal checksumming for robustness, performance (prototype exists)
+* persistent file preallocation (e.g. for streaming media, databases)
+
+Features like metadata checksumming have been discussed and planned for
+a bit but no patches exist yet so I'm not sure they're in the near-term
+roadmap.
+
+The big performance win will come with mballoc and delalloc.  CFS has
+been using mballoc for a few years already with Lustre, and IBM + Bull
+did a lot of benchmarking on it.  The reason it isn't in the first set of
+patches is partly a manageability issue, and partly because it doesn't
+directly affect the on-disk format (outside of much better allocation)
+so it isn't critical to get into the first round of changes.  I believe
+Alex is working on a new set of patches right now.
+
+3. Options
+==========
+
+When mounting an ext4 filesystem, the following options are accepted:
+(*) == default
+
+extents			ext4 will use extents to address file data.  The
+			file system will no longer be mountable by ext3.
+
+journal=update		Update the ext4 file system's journal to the current
+			format.
+
+journal=inum		When a journal already exists, this option is ignored.
+			Otherwise, it specifies the number of the inode which
+			will represent the ext4 file system's journal file.
+
+journal_dev=devnum	When the external journal device's major/minor numbers
+			have changed, this option allows the user to specify
+			the new journal location.  The journal device is
+			identified through its new major/minor numbers encoded
+			in devnum.
+
+noload			Don't load the journal on mounting.
+
+data=journal		All data are committed into the journal prior to being
+			written into the main file system.
+
+data=ordered	(*)	All data are forced directly out to the main file
+			system prior to its metadata being committed to the
+			journal.
+
+data=writeback		Data ordering is not preserved, data may be written
+			into the main file system after its metadata has been
+			committed to the journal.
+
+commit=nrsec	(*)	Ext4 can be told to sync all its data and metadata
+			every 'nrsec' seconds. The default value is 5 seconds.
+			This means that if you lose your power, you will lose
+			as much as the latest 5 seconds of work (your
+			filesystem will not be damaged though, thanks to the
+			journaling).  This default value (or any low value)
+			will hurt performance, but it's good for data-safety.
+			Setting it to 0 will have the same effect as leaving
+			it at the default (5 seconds).
+			Setting it to very large values will improve
+			performance.
+
+barrier=1		This enables/disables barriers.  barrier=0 disables
+			it, barrier=1 enables it.
+
+orlov		(*)	This enables the new Orlov block allocator. It is
+			enabled by default.
+
+oldalloc		This disables the Orlov block allocator and enables
+			the old block allocator.  Orlov should have better
+			performance - we'd like to get some feedback if it's
+			the contrary for you.
+
+user_xattr		Enables Extended User Attributes.  Additionally, you
+			need to have extended attribute support enabled in the
+			kernel configuration (CONFIG_EXT4_FS_XATTR).  See the
+			attr(5) manual page and http://acl.bestbits.at/ to
+			learn more about extended attributes.
+
+nouser_xattr		Disables Extended User Attributes.
+
+acl			Enables POSIX Access Control Lists support.
+			Additionally, you need to have ACL support enabled in
+			the kernel configuration (CONFIG_EXT4_FS_POSIX_ACL).
+			See the acl(5) manual page and http://acl.bestbits.at/
+			for more information.
+
+noacl			This option disables POSIX Access Control List
+			support.
+
+reservation
+
+noreservation
+
+bsddf		(*)	Make 'df' act like BSD.
+minixdf			Make 'df' act like Minix.
+
+check=none		Don't do extra checking of bitmaps on mount.
+nocheck
+
+debug			Extra debugging information is sent to syslog.
+
+errors=remount-ro(*)	Remount the filesystem read-only on an error.
+errors=continue		Keep going on a filesystem error.
+errors=panic		Panic and halt the machine if an error occurs.
+
+grpid			New objects take the group ID of the directory in
+bsdgroups		which they are created.
+
+nogrpid		(*)	New objects have the group ID of their creator.
+sysvgroups
+
+resgid=n		The group ID which may use the reserved blocks.
+
+resuid=n		The user ID which may use the reserved blocks.
+
+sb=n			Use alternate superblock at this location.
+
+quota
+noquota
+grpquota
+usrquota
+
+bh		(*)	ext4 associates buffer heads to data pages to
+nobh			(a) cache disk block mapping information
+			(b) link pages into transaction to provide
+			    ordering guarantees.
+			"bh" option forces use of buffer heads.
+			"nobh" option tries to avoid associating buffer
+			heads (supported only for "writeback" mode).
+
+
+Data Mode
+---------
+There are 3 different data modes:
+
+* writeback mode
+In data=writeback mode, ext4 does not journal data at all.  This mode provides
+a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
+mode - metadata journaling.  A crash+recovery can cause incorrect data to
+appear in files which were written shortly before the crash.  This mode will
+typically provide the best ext4 performance.
+
+* ordered mode
+In data=ordered mode, ext4 only officially journals metadata, but it logically
+groups metadata and data blocks into a single unit called a transaction.  When
+it's time to write the new metadata out to disk, the associated data blocks
+are written first.  In general, this mode performs slightly slower than
+writeback but significantly faster than journal mode.
+
+* journal mode
+data=journal mode provides full data and metadata journaling.  All new data is
+written to the journal first, and then to its final location.
+In the event of a crash, the journal can be replayed, bringing both data and
+metadata into a consistent state.  This mode is the slowest except when data
+needs to be read from and written to disk at the same time, where it
+outperforms all other modes.
+
+References
+==========
+
+kernel source:	<file:fs/ext4/>
+		<file:fs/jbd2/>
+
+programs:	http://e2fsprogs.sourceforge.net/
+		http://ext2resize.sourceforge.net
+
+useful links:	http://fedoraproject.org/wiki/ext3-devel
+		http://www.bullopensource.org/ext4/
--- a/Documentation/filesystems/files.txt
+++ b/Documentation/filesystems/files.txt
@@ -0,0 +1,123 @@
+File management in the Linux kernel
+-----------------------------------
+
+This document describes how locking for files (struct file)
+and file descriptor table (struct files) works.
+
+Up until 2.6.12, the file descriptor table has been protected
+with a lock (files->file_lock) and reference count (files->count).
+->file_lock protected accesses to all the file related fields
+of the table. ->count was used for sharing the file descriptor
+table between tasks cloned with CLONE_FILES flag. Typically
+this would be the case for posix threads. As with the common
+refcounting model in the kernel, the last task doing
+a put_files_struct() frees the file descriptor (fd) table.
+The files (struct file) themselves are protected using
+reference count (->f_count).
+
+In the new lock-free model of file descriptor management,
+the reference counting is similar, but the locking is
+based on RCU. The file descriptor table contains multiple
+elements - the fd sets (open_fds and close_on_exec), the
+array of file pointers, the sizes of the sets and the array
+etc. In order for the updates to appear atomic to
+a lock-free reader, all the elements of the file descriptor
+table are in a separate structure - struct fdtable.
+files_struct contains a pointer to struct fdtable through
+which the actual fd table is accessed. Initially the
+fdtable is embedded in files_struct itself. On a subsequent
+expansion of fdtable, a new fdtable structure is allocated
+and files->fdt points to the new structure. The fdtable
+structure is freed with RCU and lock-free readers either
+see the old fdtable or the new fdtable making the update
+appear atomic. Here are the locking rules for
+the fdtable structure -
+
+1. All references to the fdtable must be done through
+   the files_fdtable() macro :
+
+	struct fdtable *fdt;
+
+	rcu_read_lock();
+
+	fdt = files_fdtable(files);
+	....
+	if (n <= fdt->max_fds)
+		....
+	...
+	rcu_read_unlock();
+
+   files_fdtable() uses rcu_dereference() macro which takes care of
+   the memory barrier requirements for lock-free dereference.
+   The fdtable pointer must be read within the read-side
+   critical section.
+
+2. Reading of the fdtable as described above must be protected
+   by rcu_read_lock()/rcu_read_unlock().
+
+3. For any update to the fd table, files->file_lock must
+   be held.
+
+4. To look up the file structure given an fd, a reader
+   must use either fcheck() or fcheck_files() APIs. These
+   take care of barrier requirements due to lock-free lookup.
+   An example :
+
+	struct file *file;
+
+	rcu_read_lock();
+	file = fcheck(fd);
+	if (file) {
+		...
+	}
+	....
+	rcu_read_unlock();
+
+5. Handling of the file structures is special. Since the look-up
+   of the fd (fget()/fget_light()) is lock-free, it is possible
+   that look-up may race with the last put() operation on the
+   file structure. This is avoided using the rcuref APIs
+   on ->f_count :
+
+	rcu_read_lock();
+	file = fcheck_files(files, fd);
+	if (file) {
+		if (rcuref_inc_lf(&file->f_count))
+			*fput_needed = 1;
+		else
+			/* Didn't get the reference, someone's freed */
+			file = NULL;
+	}
+	rcu_read_unlock();
+	....
+	return file;
+
+   rcuref_inc_lf() detects if the refcount is already zero or
+   goes to zero during increment. If it does, we fail
+   fget()/fget_light().
+
+6. Since both fdtable and file structures can be looked up
+   lock-free, they must be installed using rcu_assign_pointer()
+   API. If they are looked up lock-free, rcu_dereference()
+   must be used. However it is advisable to use files_fdtable()
+   and fcheck()/fcheck_files() which take care of these issues.
+
+7. While updating, the fdtable pointer must be looked up while
+   holding files->file_lock. If ->file_lock is dropped, then
+   another thread may expand the files thereby creating a new
+   fdtable and making the earlier fdtable pointer stale.
+   For example :
+
+	spin_lock(&files->file_lock);
+	fd = locate_fd(files, file, start);
+	if (fd >= 0) {
+		/* locate_fd() may have expanded fdtable, load the ptr */
+		fdt = files_fdtable(files);
+		FD_SET(fd, fdt->open_fds);
+		FD_CLR(fd, fdt->close_on_exec);
+		spin_unlock(&files->file_lock);
+	.....
+
+   Since locate_fd() can drop ->file_lock (and reacquire ->file_lock),
+   the fdtable pointer (fdt) must be loaded after locate_fd().
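+
+   As a userspace analogy for rule 5 (an assumption for illustration,
+   not kernel code), the "take a reference unless the count is already
+   zero" behaviour described for rcuref_inc_lf() can be sketched with
+   C11 atomics.  The name ref_inc_not_zero() is hypothetical.
+
```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical userspace helper, not the kernel's rcuref API:
 * take a reference only if the count is still non-zero.
 */
static bool ref_inc_not_zero(atomic_int *count)
{
	int old = atomic_load(count);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(count, &old, old + 1))
			return true;
	}
	return false;	/* count is zero: the object is being freed */
}
```
+
+   A lock-free reader would treat the lookup as failed whenever such a
+   helper returns false, which mirrors the fget()/fget_light() failure
+   path described above.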
+
--- a/Documentation/filesystems/fuse.txt
+++ b/Documentation/filesystems/fuse.txt
@@ -0,0 +1,423 @@
+Definitions
+~~~~~~~~~~~
+
+Userspace filesystem:
+
+  A filesystem in which data and metadata are provided by an ordinary
+  userspace process.  The filesystem can be accessed normally through
+  the kernel interface.
+
+Filesystem daemon:
+
+  The process(es) providing the data and metadata of the filesystem.
+
+Non-privileged mount (or user mount):
+
+  A userspace filesystem mounted by a non-privileged (non-root) user.
+  The filesystem daemon is running with the privileges of the mounting
+  user.  NOTE: this is not the same as mounts allowed with the "user"
+  option in /etc/fstab, which is not discussed here.
+
+Filesystem connection:
+
+  A connection between the filesystem daemon and the kernel.  The
+  connection exists until either the daemon dies, or the filesystem is
+  umounted.  Note that detaching (or lazy umounting) the filesystem
+  does _not_ break the connection; in this case it will exist until
+  the last reference to the filesystem is released.
+
+Mount owner:
+
+  The user who does the mounting.
+
+User:
+
+  The user who is performing filesystem operations.
+
+What is FUSE?
+~~~~~~~~~~~~~
+
+FUSE is a userspace filesystem framework.  It consists of a kernel
+module (fuse.ko), a userspace library (libfuse.*) and a mount utility
+(fusermount).
+
+One of the most important features of FUSE is allowing secure,
+non-privileged mounts.  This opens up new possibilities for the use of
+filesystems.  A good example is sshfs: a secure network filesystem
+using the sftp protocol.
+
+The userspace library and utilities are available from the FUSE
+homepage:
+
+  http://fuse.sourceforge.net/
+
+Filesystem type
+~~~~~~~~~~~~~~~
+
+The filesystem type given to mount(2) can be one of the following:
+
+'fuse'
+
+  This is the usual way to mount a FUSE filesystem.  The first
+  argument of the mount system call may contain an arbitrary string,
+  which is not interpreted by the kernel.
+
+'fuseblk'
+
+  The filesystem is block device based.  The first argument of the
+  mount system call is interpreted as the name of the device.
+
+Mount options
+~~~~~~~~~~~~~
+
+'fd=N'
+
+  The file descriptor to use for communication between the userspace
+  filesystem and the kernel.  The file descriptor must have been
+  obtained by opening the FUSE device ('/dev/fuse').
+
+'rootmode=M'
+
+  The file mode of the filesystem's root in octal representation.
+
+'user_id=N'
+
+  The numeric user id of the mount owner.
+
+'group_id=N'
+
+  The numeric group id of the mount owner.
+
+'default_permissions'
+
+  By default FUSE doesn't check file access permissions; the
+  filesystem is free to implement its access policy or leave it to
+  the underlying file access mechanism (e.g. in case of network
+  filesystems).  This option enables permission checking, restricting
+  access based on file mode.  It is usually useful together with the
+  'allow_other' mount option.
+
+'allow_other'
+
+  This option overrides the security measure restricting file access
+  to the user mounting the filesystem.  This option is by default only
+  allowed to root, but this restriction can be removed with a
+  (userspace) configuration option.
+
+'max_read=N'
+
+  With this option the maximum size of read operations can be set.
+  The default is infinite.  Note that the size of read requests is
+  limited anyway to 32 pages (which is 128kbyte on i386).
+
+'blksize=N'
+
+  Set the block size for the filesystem.  The default is 512.  This
+  option is only valid for 'fuseblk' type mounts.
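+
+A minimal sketch of how a mount helper might assemble the data string
+for mount(2) from the 'fd', 'rootmode', 'user_id' and 'group_id'
+options listed above.  The helper name fuse_build_opts() is
+hypothetical, and the sketch only formats the string; it does not
+open '/dev/fuse' or call mount(2).
+
```c
#include <stdio.h>

/*
 * Hypothetical helper: format the fd, rootmode, user_id and group_id
 * mount options into 'buf'.  rootmode is printed in octal, as the
 * option description requires.
 * Returns the string length, or -1 if 'buf' is too small.
 */
static int fuse_build_opts(char *buf, size_t len, int fd,
			   unsigned int rootmode,
			   unsigned int uid, unsigned int gid)
{
	int n = snprintf(buf, len,
			 "fd=%d,rootmode=%o,user_id=%u,group_id=%u",
			 fd, rootmode, uid, gid);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```
+
+For a directory root (mode 040000) owned by uid/gid 1000 and a device
+descriptor of 3, this yields
+"fd=3,rootmode=40000,user_id=1000,group_id=1000".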
+
+Control filesystem
+~~~~~~~~~~~~~~~~~~
+
+There's a control filesystem for FUSE, which can be mounted by:
+
+  mount -t fusectl none /sys/fs/fuse/connections
+
+Mounting it under the '/sys/fs/fuse/connections' directory makes it
+backwards compatible with earlier versions.
+
+Under the fuse control filesystem each connection has a directory
+named by a unique number.
+
+For each connection the following files exist within this directory:
+
+ 'waiting'
+
+  The number of requests which are waiting to be transferred to
+  userspace or being processed by the filesystem daemon.  If there is
+  no filesystem activity and 'waiting' is non-zero, then the
+  filesystem is hung or deadlocked.
+
+ 'abort'
+
+  Writing anything into this file will abort the filesystem
+  connection.  This means that all waiting requests will be aborted and an
+  error returned for all aborted and new requests.
+
+Only the owner of the mount may read or write these files.
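+
+The two files can be driven from a small program.  The sketch below
+assumes that 'waiting' reads back as a decimal integer; the helper
+names and the connection directory passed in are hypothetical.
+
```c
#include <stdio.h>

/* Read the 'waiting' counter of one connection directory; -1 on error. */
static long fusectl_read_waiting(const char *conn_dir)
{
	char path[4096];
	long n = -1;
	FILE *f;

	snprintf(path, sizeof(path), "%s/waiting", conn_dir);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &n) != 1)
		n = -1;
	fclose(f);
	return n;
}

/* Abort the connection by writing to its 'abort' file; 0 on success. */
static int fusectl_abort(const char *conn_dir)
{
	char path[4096];
	FILE *f;

	snprintf(path, sizeof(path), "%s/abort", conn_dir);
	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fputs("1\n", f) == EOF) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}
```
+
+A caller would pass a directory such as
+'/sys/fs/fuse/connections/<number>', where <number> is the unique
+number of the connection.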
+
+Interrupting filesystem operations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If a process issuing a FUSE filesystem request is interrupted, the
+following will happen:
+
+  1) If the request is not yet sent to userspace AND the signal is
+     fatal (SIGKILL or unhandled fatal signal), then the request is
+     dequeued and returns immediately.
+
+  2) If the request is not yet sent to userspace AND the signal is not
+     fatal, then an 'interrupted' flag is set for the request.  When
+     the request has been successfully transferred to userspace and
+     this flag is set, an INTERRUPT request is queued.
+
+  3) If the request is already sent to userspace, then an INTERRUPT
+     request is queued.
+
+INTERRUPT requests take precedence over other requests, so the
+userspace filesystem will receive queued INTERRUPTs before any others.
+
+The userspace filesystem may ignore the INTERRUPT requests entirely,
+or may honor them by sending a reply to the _original_ request, with
+the error set to EINTR.
+
+It is also possible that there's a race between processing the
+original request and its INTERRUPT request.  There are two possibilities:
+
+  1) The INTERRUPT request is processed before the original request is
+     processed
+
+  2) The INTERRUPT request is processed after the original request has
+     been answered
+
+If the filesystem cannot find the original request, it should wait for
+some timeout and/or a number of new requests to arrive, after which it
+should reply to the INTERRUPT request with an EAGAIN error.  In case
+1) the INTERRUPT request will be requeued.  In case 2) the INTERRUPT
+reply will be ignored.
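+
+For a daemon that chooses to honor INTERRUPT requests, the rules above
+can be sketched as a small decision function.  The enum, the function
+name and the boolean inputs are assumptions made for illustration;
+they are not part of the FUSE interface.
+
```c
#include <errno.h>

enum intr_action {
	INTR_REPLY_ORIGINAL,	/* answer the original request with EINTR */
	INTR_KEEP_WAITING,	/* original not seen yet, timeout not reached */
	INTR_REPLY_EAGAIN,	/* answer the INTERRUPT itself with EAGAIN */
};

/*
 * Hypothetical helper: decide how to treat an INTERRUPT request.
 * 'original_found' says whether the original request is currently known,
 * 'timed_out' whether the daemon has waited long enough for it to show up.
 * The errno value to send (if any) is stored in '*error'.
 */
static enum intr_action intr_decide(int original_found, int timed_out,
				    int *error)
{
	if (original_found) {
		*error = EINTR;
		return INTR_REPLY_ORIGINAL;
	}
	if (!timed_out) {
		*error = 0;
		return INTR_KEEP_WAITING;
	}
	*error = EAGAIN;
	return INTR_REPLY_EAGAIN;
}
```
+
+After an EAGAIN reply the kernel either requeues the INTERRUPT (case 1)
+or ignores the reply (case 2), so the daemon does not need to tell the
+two cases apart.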
+
+Aborting a filesystem connection
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It is possible to get into certain situations where the filesystem is
+not responding.  Reasons for this may be:
+
+  a) Broken userspace filesystem implementation
+
+  b) Network connection down
+
+  c) Accidental deadlock
+
+  d) Malicious deadlock
+
+(For more on c) and d) see later sections)
+
+In any of these cases it may be useful to abort the connection to
+the filesystem.  There are several ways to do this:
+
+  - Kill the filesystem daemon.  Works in case of a) and b)
+
+  - Kill the filesystem daemon and all users of the filesystem.  Works
+    in all cases except some malicious deadlocks
+
+  - Use forced umount (umount -f).  Works in all cases but only if
+    filesystem is still attached (it hasn't been lazy unmounted)
+
+  - Abort filesystem through the FUSE control filesystem.  Most
+    powerful method, always works.
+
+How do non-privileged mounts work?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Since the mount() system call is a privileged operation, a helper
+program (fusermount) is needed, which is installed setuid root.
+
+The implication of providing non-privileged mounts is that the mount
+owner must not be able to use this capability to compromise the
+system.  Obvious requirements arising from this are:
+
+ A) mount owner should not be able to get elevated privileges with the
+    help of the mounted filesystem
+
+ B) mount owner should not get illegitimate access to information from
+    other users' and the super user's processes
+
+ C) mount owner should not be able to induce undesired behavior in
+    other users' or the super user's processes
+
+How are requirements fulfilled?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ A) The mount owner could gain elevated privileges by either:
+
+     1) creating a filesystem containing a device file, then opening
+	this device
+
+     2) creating a filesystem containing a suid or sgid application,
+	then executing this application
+
+    The solution is not to allow opening device files and ignore
+    setuid and setgid bits when executing programs.  To ensure this
+    fusermount always adds "nosuid" and "nodev" to the mount options
+    for non-privileged mounts.
+
+ B) If another user is accessing files or directories in the
+    filesystem, the filesystem daemon serving requests can record the
+    exact sequence and timing of operations performed.  This
+    information is otherwise inaccessible to the mount owner, so this
+    counts as an information leak.
+
+    The solution to this problem will be presented in point 2) of C).
+
+ C) There are several ways in which the mount owner can induce
+    undesired behavior in other users' processes, such as:
+
+     1) mounting a filesystem over a file or directory which the mount
+        owner would otherwise not be able to modify (or could only
+        make limited modifications).
+
+        This is solved in fusermount, by checking the access
+        permissions on the mountpoint and only allowing the mount if
+        the mount owner can do unlimited modification (has write
+        access to the mountpoint, and mountpoint is not a "sticky"
+        directory)
+
+     2) Even if 1) is solved the mount owner can change the behavior
+        of other users' processes.
+
+         i) It can slow down or indefinitely delay the execution of a
+           filesystem operation creating a DoS against the user or the
+           whole system.  For example a suid application locking a
+           system file, and then accessing a file on the mount owner's
+           filesystem could be stopped, thus causing the system
+           file to be locked forever.
+
+         ii) It can present files or directories of unlimited length, or
+           directory structures of unlimited depth, possibly causing a
+           system process to eat up diskspace, memory or other
+           resources, again causing DoS.
+
+	The solution to this as well as B) is not to allow processes
+	to access the filesystem, which could otherwise not be
+	monitored or manipulated by the mount owner.  Since if the
+	mount owner can ptrace a process, it can do all of the above
+	without using a FUSE mount, the same criteria as used in
+	ptrace can be used to check if a process is allowed to access
+	the filesystem or not.
+
+	Note that the ptrace check is not strictly necessary to
+	prevent C/2/i; it is enough to check if the mount owner has enough
+	privilege to send a signal to the process accessing the
+	filesystem, since SIGSTOP can be used to get a similar effect.
+
+I think these limitations are unacceptable?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If a sysadmin trusts the users enough, or can ensure through other
+measures, that system processes will never enter non-privileged
+mounts, it can relax the last limitation with a "user_allow_other"
+config option.  If this config option is set, the mounting user can
+add the "allow_other" mount option which disables the check for other
+users' processes.
+
+Kernel - userspace interface
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following diagram shows how a filesystem operation (in this
+example unlink) is performed in FUSE.
+
+NOTE: everything in this description is greatly simplified
+
+ |  "rm /mnt/fuse/file"               |  FUSE filesystem daemon
+ |                                    |
+ |                                    |  >sys_read()
+ |                                    |    >fuse_dev_read()
+ |                                    |      >request_wait()
+ |                                    |        [sleep on fc->waitq]
+ |                                    |
+ |  >sys_unlink()                     |
+ |    >fuse_unlink()                  |
+ |      [get request from             |
+ |       fc->unused_list]             |
+ |      >request_send()               |
+ |        [queue req on fc->pending]  |
+ |        [wake up fc->waitq]         |        [woken up]
+ |        >request_wait_answer()      |
+ |          [sleep on req->waitq]     |
+ |                                    |      <request_wait()
+ |                                    |      [remove req from fc->pending]
+ |                                    |      [copy req to read buffer]
+ |                                    |      [add req to fc->processing]
+ |                                    |    <fuse_dev_read()
+ |                                    |  <sys_read()
+ |                                    |
+ |                                    |  [perform unlink]
+ |                                    |
+ |                                    |  >sys_write()
+ |                                    |    >fuse_dev_write()
+ |                                    |      [look up req in fc->processing]
+ |                                    |      [remove from fc->processing]
+ |                                    |      [copy write buffer to req]
+ |          [woken up]                |      [wake up req->waitq]
+ |                                    |    <fuse_dev_write()
+ |                                    |  <sys_write()
+ |        <request_wait_answer()      |
+ |      <request_send()               |
+ |      [add request to               |
+ |       fc->unused_list]             |
+ |    <fuse_unlink()                  |
+ |  <sys_unlink()                     |
+
+There are a couple of ways in which to deadlock a FUSE filesystem.
+Since we are talking about unprivileged userspace programs,
+something must be done about these.
+
+Scenario 1 -  Simple deadlock
+-----------------------------
+
+ |  "rm /mnt/fuse/file"               |  FUSE filesystem daemon
+ |                                    |
+ |  >sys_unlink("/mnt/fuse/file")     |
+ |    [acquire inode semaphore        |
+ |     for "file"]                    |
+ |    >fuse_unlink()                  |
+ |      [sleep on req->waitq]         |
+ |                                    |  <sys_read()
+ |                                    |  >sys_unlink("/mnt/fuse/file")
+ |                                    |    [acquire inode semaphore
+ |                                    |     for "file"]
+ |                                    |    *DEADLOCK*
+
+The solution for this is to allow the filesystem to be aborted.
+
+Scenario 2 - Tricky deadlock
+----------------------------
+
+This one needs a carefully crafted filesystem.  It's a variation on
+the above, only the call back to the filesystem is not explicit,
+but is caused by a pagefault.
+
+ |  Kamikaze filesystem thread 1      |  Kamikaze filesystem thread 2
+ |                                    |
+ |  [fd = open("/mnt/fuse/file")]     |  [request served normally]
+ |  [mmap fd to 'addr']               |
+ |  [close fd]                        |  [FLUSH triggers 'magic' flag]
+ |  [read a byte from addr]           |
+ |    >do_page_fault()                |
+ |      [find or create page]         |
+ |      [lock page]                   |
+ |      >fuse_readpage()              |
+ |         [queue READ request]       |
+ |         [sleep on req->waitq]      |
+ |                                    |  [read request to buffer]
+ |                                    |  [create reply header before addr]
+ |                                    |  >sys_write(addr - headerlength)
+ |                                    |    >fuse_dev_write()
+ |                                    |      [look up req in fc->processing]
+ |                                    |      [remove from fc->processing]
+ |                                    |      [copy write buffer to req]
+ |                                    |        >do_page_fault()
+ |                                    |           [find or create page]
+ |                                    |           [lock page]
+ |                                    |           * DEADLOCK *
+
+Solution is basically the same as above.
+
+An additional problem is that while the write buffer is being copied
+to the request, the request must not be interrupted/aborted.  This is
+because the destination address of the copy may not be valid after the
+request has returned.
+
+This is solved by doing the copy atomically, and allowing abort
+while the page(s) belonging to the write buffer are faulted with
+get_user_pages().  The 'req->locked' flag indicates when the copy is
+taking place, and abort is delayed until this flag is unset.
--- a/Documentation/filesystems/gfs2.txt
+++ b/Documentation/filesystems/gfs2.txt
@@ -0,0 +1,43 @@
+Global File System
+------------------
+
+http://sources.redhat.com/cluster/
+
+GFS is a cluster file system. It allows a cluster of computers to
+simultaneously use a block device that is shared between them (with FC,
+iSCSI, NBD, etc).  GFS reads and writes to the block device like a local
+file system, but also uses a lock module to allow the computers to coordinate
+their I/O so file system consistency is maintained.  One of the nifty
+features of GFS is perfect consistency -- changes made to the file system
+on one machine show up immediately on all other machines in the cluster.
+
+GFS uses interchangeable inter-node locking mechanisms.  Different lock
+modules can plug into GFS and each file system selects the appropriate
+lock module at mount time.  Lock modules include:
+
+  lock_nolock -- allows gfs to be used as a local file system
+
+  lock_dlm -- uses a distributed lock manager (dlm) for inter-node locking
+  The dlm is found at linux/fs/dlm/
+
+In addition to interfacing with an external locking manager, a gfs lock
+module is responsible for interacting with external cluster management
+systems.  Lock_dlm depends on user space cluster management systems found
+at the URL above.
+
+To use gfs as a local file system, no external clustering systems are
+needed, simply:
+
+  $ mkfs -t gfs2 -p lock_nolock -j 1 /dev/block_device
+  $ mount -t gfs2 /dev/block_device /dir
+
+GFS2 is not on-disk compatible with previous versions of GFS.
+
+The following man pages can be found at the URL above:
+  gfs2_fsck	to repair a filesystem
+  gfs2_grow	to expand a filesystem online
+  gfs2_jadd	to add journals to a filesystem online
+  gfs2_tool	to manipulate, examine and tune a filesystem
+  gfs2_quota	to examine and change quota values in a filesystem
+  mount.gfs2	to help mount(8) mount a filesystem
+  mkfs.gfs2	to make a filesystem
--- a/Documentation/filesystems/hfs.txt
+++ b/Documentation/filesystems/hfs.txt
@@ -0,0 +1,83 @@
+
+Macintosh HFS Filesystem for Linux
+==================================
+
+HFS stands for ``Hierarchical File System'' and is the filesystem used
+by the Mac Plus and all later Macintosh models.  Earlier Macintosh
+models used MFS (``Macintosh File System''), which is not supported.
+MacOS 8.1 and newer support a filesystem called HFS+ that's similar to
+HFS but is extended in various areas.  Use the hfsplus filesystem driver
+to access such filesystems from Linux.
+
+
+Mount options
+=============
+
+When mounting an HFS filesystem, the following options are accepted:
+
+  creator=cccc, type=cccc
+	Specifies the creator/type values as shown by the MacOS finder
+	used for creating new files.  Default values: '????'.
+
+  uid=n, gid=n
+  	Specifies the user/group that owns all files on the filesystems.
+	Default:  user/group id of the mounting process.
+
+  dir_umask=n, file_umask=n, umask=n
+	Specifies the umask used for all directories, all files, or all
+	files and directories.  Defaults to the umask of the mounting process.
+
+  session=n
+  	Select the CDROM session to mount as HFS filesystem.  Defaults to
+	leaving that decision to the CDROM driver.  This option will fail
+	with anything but a CDROM as underlying device.
+
+  part=n
+	Select partition number n from the device.  This only makes
+	sense for CDROMs because they can't be partitioned under Linux.
+	For disk devices the generic partition parsing code does this
+	for us.  Defaults to not parsing the partition table at all.
+
+  quiet
+  	Ignore invalid mount options instead of complaining.
+
+
+Writing to HFS Filesystems
+==========================
+
+HFS is not a UNIX filesystem, thus it does not have the usual features you'd
+expect:
+
+ o You can't modify the set-uid, set-gid, sticky or executable bits or the uid
+   and gid of files.
+ o You can't create hard- or symlinks, device files, sockets or FIFOs.
+
+HFS does, on the other hand, have the concept of multiple forks per file.  These
+non-standard forks are represented as hidden additional files in the normal
+filesystem namespace, which is kind of a kludge and makes the semantics for
+them a little strange:
+
+ o You can't create, delete or rename resource forks of files or the
+   Finder's metadata.
+ o They are however created (with default values), deleted and renamed
+   along with the corresponding data fork or directory.
+ o Copying files to a different filesystem will lose those attributes
+   that are essential for MacOS to work.
+
+
+Creating HFS filesystems
+========================
+
+The hfsutils package from Robert Leslie contains a program called
+hformat that can be used to create an HFS filesystem. See
+<http://www.mars.org/home/rob/proj/hfs/> for details.
+
+
+Credits
+=======
+
+The HFS driver was written by Paul H. Hargrove (hargrove@sccm.Stanford.EDU)
+and is now maintained by Roman Zippel (roman@ardistech.com) at Ardis
+Technologies.
+Roman rewrote large parts of the code and brought in btree routines derived
+from Brad Boyer's hfsplus driver (also maintained by Roman now).
--- a/Documentation/filesystems/hpfs.txt
+++ b/Documentation/filesystems/hpfs.txt
@@ -0,0 +1,296 @@
+Read/Write HPFS 2.09
+1998-2004, Mikulas Patocka
+
+email: mikulas@artax.karlin.mff.cuni.cz
+homepage: http://artax.karlin.mff.cuni.cz/~mikulas/vyplody/hpfs/index-e.cgi
+
+CREDITS:
+Chris Smith, 1993, original read-only HPFS, some code and hpfs structures file
+	is taken from it
+Jacques Gelinas, MSDos mmap, Inspired by fs/nfs/mmap.c (Jon Tombs 15 Aug 1993)
+Werner Almesberger, 1992, 1993, MSDos option parser & CR/LF conversion
+
+Mount options
+
+uid=xxx,gid=xxx,umask=xxx (default uid=gid=0 umask=default_system_umask)
+	Set owner/group/mode for files that do not have it specified in extended
+	attributes. Mode is inverted umask - for example umask 027 gives owner
+	all permission, group read permission and anybody else no access. Note
+	that for files mode is anded with 0666. If you want files to have 'x'
+	rights, you must use extended attributes.
+case=lower,asis (default asis)
+	File name lowercasing in readdir.
+conv=binary,text,auto (default binary)
+	CR/LF -> LF conversion, if auto, decision is made according to extension
+	- there is a list of text extensions (I think it's better to not convert
+	text file than to damage binary file). If you want to change that list,
+	change it in the source. Original readonly HPFS contained some strange
+	heuristic algorithm that I removed. I think it's dangerous to let the
+	computer decide whether file is text or binary. For example, DJGPP
+	binaries contain small text message at the beginning and they could be
+	misidentified and damaged under some circumstances.
+check=none,normal,strict (default normal)
+	Check level. Selecting none will cause only little speedup and big
+	danger. I tried to write it so that it won't crash if check=normal on
+	corrupted filesystems. check=strict means many superfluous checks -
+	used for debugging (for example it checks if file is allocated in
+	bitmaps when accessing it).
+errors=continue,remount-ro,panic (default remount-ro)
+	Behaviour when filesystem errors are found.
+chkdsk=no,errors,always (default errors)
+	When to mark filesystem dirty so that OS/2 checks it.
+eas=no,ro,rw (default rw)
+	What to do with extended attributes. 'no' - ignore them and use always
+	values specified in uid/gid/mode options. 'ro' - read extended
+	attributes but do not create them. 'rw' - create extended attributes
+	when you use chmod/chown/chgrp/mknod/ln -s on the filesystem.
+timeshift=(-)nnn (default 0)
+	Shifts the time by nnn seconds. For example, if times under Linux
+	appear one hour later than under OS/2, use timeshift=-3600.
+
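+As a worked example of the "inverted umask" rule for uid/gid/umask
+above, the arithmetic can be sketched as follows.  The function names
+are hypothetical and this is not the driver's code; it only reproduces
+the rule as stated (invert the umask, and for files and the result
+with 0666).
+
```c
/* Default mode: the inverted umask, limited to the permission bits. */
static unsigned int hpfs_default_mode(unsigned int mask)
{
	return ~mask & 0777;
}

/* Regular files additionally lose the 'x' bits (mode is anded with 0666). */
static unsigned int hpfs_file_mode(unsigned int mask)
{
	return hpfs_default_mode(mask) & 0666;
}
```
+
+With umask=027 the inverted umask is 0750, so a file without a "MODE"
+extended attribute ends up with mode 0640.
+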
+
+File names
+
+As in OS/2, filenames are case insensitive. However, shell thinks that names
+are case sensitive, so for example when you create a file FOO, you can use
+'cat FOO', 'cat Foo', 'cat foo' or 'cat F*' but not 'cat f*'. Note, that you
+also won't be able to compile linux kernel (and maybe other things) on HPFS
+because kernel creates different files with names like bootsect.S and
+bootsect.s. When searching for a file whose name has characters >= 128, codepages
+are used - see below.
+OS/2 ignores dots and spaces at the end of file name, so this driver does as
+well. If you create 'a. ...', the file 'a' will be created, but you can still
+access it under names 'a.', 'a..', 'a .  . . ' etc.
+
+
+Extended attributes
+
+On HPFS partitions, OS/2 can associate with each file special information called
+extended attributes. Extended attributes are pairs of (key,value) where key is
+an ascii string identifying that attribute and value is any string of bytes of
+variable length. OS/2 stores window and icon positions and file types there. So
+why not use it for unix-specific info like file owner or access rights? This
+driver can do it. If you chown/chgrp/chmod on a hpfs partition, extended
+attributes with keys "UID", "GID" or "MODE" and 2-byte values are created. Only
+those extended attributes whose values differ from defaults specified in mount
+options are created. Once created, the extended attributes are never deleted,
+they're just changed. It means that when your default uid=0 and you type
+something like 'chown luser file; chown root file' the file will contain
+extended attribute UID=0. And when you umount the fs and mount it again with
+uid=luser_uid, the file will be still owned by root! If you chmod file to 444,
+extended attribute "MODE" will not be set, this special case is done by setting
+read-only flag. When you mknod a block or char device, besides "MODE", the
+special 4-byte extended attribute "DEV" will be created containing the device
+number. Currently this driver cannot resize extended attributes - it means
+that if somebody (I don't know who?) has set "UID", "GID", "MODE" or "DEV"
+attributes with different sizes, they won't be rewritten and changing these
+values doesn't work.
+
+
+Symlinks
+
+You can do symlinks on HPFS partition, symlinks are achieved by setting extended
+attribute named "SYMLINK" with symlink value. Like on ext2, you can chown and
+chgrp symlinks but I don't know what it is good for. chmoding symlink results
+in chmoding file where symlink points. These symlinks are just for Linux use and
+incompatible with OS/2. OS/2 PmShell symlinks are not supported because they are
+stored in very crazy way. They tried to do it so that link changes when file is
+moved ... sometimes it works. But the link is partly stored in directory
+extended attributes and partly in OS2SYS.INI. I don't want (and don't know how)
+to analyze or change OS2SYS.INI.
+
+
+Codepages
+
+HPFS can contain several uppercasing tables for several codepages and each
+file has a pointer to codepage its name is in. However OS/2 was created in
+America where people don't care much about codepages and so multiple codepages
+support is quite buggy. I have Czech OS/2 working in codepage 852 on my disk.
+Once I booted English OS/2 working in cp 850 and I created a file on my 852
+partition. It marked file name codepage as 850 - good. But when I again booted
+Czech OS/2, the file was completely inaccessible under any name. It seems that
+OS/2 uppercases the search pattern with its system code page (852) and file
+name it's comparing to with its code page (850). These could never match. Is it
+really what IBM developers wanted? But problems continued. When I created in
+Czech OS/2 another file in that directory, that file was inaccessible too. OS/2
+probably uses different uppercasing method when searching where to place a file
+(note, that files in HPFS directory must be sorted) and when searching for
+a file. Finally when I opened this directory in PmShell, PmShell crashed (the
+funny thing was that, when rebooted, PmShell tried to reopen this directory
+again :-). chkdsk happily ignores these errors and only low-level disk
+modification saved me.  Never mix different language versions of OS/2 on one
+system although HPFS was designed to allow that.
+OK, I could implement complex codepage support to this driver but I think it
+would cause more problems than benefit with such buggy implementation in OS/2.
+So this driver simply uses first codepage it finds for uppercasing and
+lowercasing no matter what's file codepage index. Usually all file names are in
+this codepage - if you don't try to do what I described above :-)
+
+
+Known bugs
+
+HPFS386 on OS/2 server is not supported. HPFS386 installed on normal OS/2 client
+should work. If you have OS/2 server, use only read-only mode. I don't know how
+to handle some HPFS386 structures like access control list or extended perm
+list, I don't know how to delete them when file is deleted and how to not
+overwrite them with extended attributes. Send me some info on these structures
+and I'll make it. However, this driver should detect presence of HPFS386
+structures, remount read-only and not destroy them (I hope).
+
+When there's not enough space for extended attributes, they will be truncated
+and no error is returned.
+
+OS/2 can't access files if the path is longer than about 256 chars but this
+driver allows you to do it. chkdsk ignores such errors.
+
+Sometimes you won't be able to delete some files on a very full filesystem
+(returning error ENOSPC). That's because file in non-leaf node in directory tree
+(one directory, if it's large, has dirents in tree on HPFS) must be replaced
+with another node when deleted. And that new file might have larger name than
+the old one so the new name doesn't fit in directory node (dnode). And that
+would result in directory tree splitting, that takes disk space. Workaround is
+to delete other files that are leaf (probability that the file is non-leaf is
+about 1/50) or to truncate file first to make some space.
+You encounter this problem only if you have many directories so that
+preallocated directory band is full i.e.
+	number_of_directories / size_of_filesystem_in_mb > 4.
+
+You can't delete open directories.
+
+You can't rename over directories (what is it good for?).
+
+Renaming a file so that only its case changes doesn't work. This driver
+supports it but the VFS doesn't. Something like 'mv file FILE' won't work.
+
+Atimes and directory mtimes are never updated, for performance reasons. If you
+really want them updated, let me know and I'll write it (but it will be slow).
+
+When the system is out of memory and swap, the driver may slightly corrupt the
+filesystem (lost files, unbalanced directories). (I guess any filesystem may
+do that.)
+
+When compiling, you get the warning "function declaration isn't a prototype".
+Does anybody know what it means?
+
+
+What does "unbalanced tree" message mean?
+
+Old versions of this driver sometimes created unbalanced dnode trees. OS/2
+chkdsk doesn't scream if the tree is unbalanced (and sometimes creates
+unbalanced trees too :-) but both HPFS and HPFS386 contain a bug that
+occasionally crashes them when the tree is not balanced. This driver handles
+unbalanced trees correctly and writes a warning if it finds them. If you see
+this message, it is probably because of directories created with an old
+version of this driver. The workaround is to move all files from that directory
+to another and then back again. Do it in Linux, not OS/2! If you see this
+message in a directory that was created entirely by this driver, it is a BUG -
+let me know about it.
+
+
+Bugs in OS/2
+
+When you have two (or more) lost directories pointing to each other, chkdsk
+locks up when repairing the filesystem.
+
+Sometimes (I think it's random) when you create a file with one-char name under
+OS/2, OS/2 marks it as 'long'. chkdsk then removes this flag saying "Minor fs
+error corrected".
+
+File names like "a .b" are marked as 'long' by OS/2 but chkdsk "corrects" it and
+marks them as short (and writes "minor fs error corrected"). This bug is not in
+HPFS386.
+
+Codepage bugs described above.
+
+If you don't install fixpacks, there are many, many more...
+
+
+History
+
+0.90 First public release
+0.91 Fixed bug that caused shooting to memory when write_inode was called on
+	open inode (rarely happened)
+0.92 Fixed a little memory leak in freeing directory inodes
+0.93 Fixed bug that locked up the machine when there were too many filenames
+	with first 15 characters same
+     Fixed write_file to zero file when writing behind file end
+0.94 Fixed a little memory leak when trying to delete busy file or directory
+0.95 Fixed a bug that i_hpfs_parent_dir was not updated when moving files
+1.90 First version for 2.1.1xx kernels
+1.91 Fixed a bug that chk_sectors failed when sectors were at the end of disk
+     Fixed a race-condition when write_inode is called while deleting file
+     Fixed a bug that could possibly happen (with very low probability) when
+     	using 0xff in filenames
+     Rewritten locking to avoid race-conditions
+     Mount option 'eas' now works
+     Fsync no longer returns error
+     Files beginning with '.' are marked hidden
+     Remount support added
+     Alloc is not so slow when filesystem becomes full
+     Atimes are no more updated because it slows down operation
+     Code cleanup (removed all commented debug prints)
+1.92 Corrected a bug when sync was called just before closing file
+1.93 Modified, so that it works with kernels >= 2.1.131, I don't know if it
+	works with previous versions
+     Fixed a possible problem with disks > 64G (but I don't have one, so I can't
+     	test it)
+     Fixed a file overflow at 2G
+     Added new option 'timeshift'
+     Changed behaviour on HPFS386: It is now possible to operate on HPFS386 in
+     	read-only mode
+     Fixed a bug that slowed down alloc and prevented allocating 100% space
+     	(this bug was not destructive)
+1.94 Added workaround for one bug in Linux
+     Fixed one buffer leak
+     Fixed some incompatibilities with large extended attributes (but it's still
+	not 100% ok, I have no info on it and OS/2 doesn't want to create them)
+     Rewritten allocation
+     Fixed a bug with i_blocks (du sometimes didn't display correct values)
+     Directories have no longer archive attribute set (some programs don't like
+	it)
+     Fixed a bug that it set badly one flag in large anode tree (it was not
+	destructive)
+1.95 Fixed one buffer leak, that could happen on corrupted filesystem
+     Fixed one bug in allocation in 1.94
+1.96 Added workaround for one bug in OS/2 (HPFS locked up, HPFS386 reported
+	error sometimes when opening directories in PMSHELL)
+     Fixed a possible bitmap race
+     Fixed possible problem on large disks
+     You can now delete open files
+     Fixed a nondestructive race in rename
+1.97 Support for HPFS v3 (on large partitions)
+     Fixed a bug that it didn't allow creation of files > 128M (it should be 2G)
+1.97.1 Changed names of global symbols
+       Fixed a bug when chmoding or chowning root directory
+1.98 Fixed a deadlock when using old_readdir
+     Better directory handling; workaround for "unbalanced tree" bug in OS/2
+1.99 Corrected a possible problem when there's not enough space while deleting
+	file
+     Now it tries to truncate the file if there's not enough space when deleting
+     Removed a lot of redundant code
+2.00 Fixed a bug in rename (it was there since 1.96)
+     Better anti-fragmentation strategy
+2.01 Fixed problem with directory listing over NFS
+     Directory lseek now checks for proper parameters
+     Fixed race-condition in buffer code - it is in all filesystems in Linux;
+        when reading device (cat /dev/hda) while creating files on it, files
+        could be damaged
+2.02 Workaround for bug in breada in Linux. breada could cause accesses beyond
+        end of partition
+2.03 Char, block devices and pipes are correctly created
+     Fixed non-crashing race in unlink (Alexander Viro)
+     Now it works with Japanese version of OS/2
+2.04 Fixed error when ftruncate used to extend file
+2.05 Fixed crash when got mount parameters without =
+     Fixed crash when allocation of anode failed due to full disk
+     Fixed some crashes when block io or inode allocation failed
+2.06 Fixed some crash on corrupted disk structures
+     Better allocation strategy
+     Reschedule points added so that it doesn't lock CPU long time
+     It should work in read-only mode on Warp Server
+2.07 More fixes for Warp Server. Now it really works
+2.08 Creating new files is not so slow on large disks
+     An attempt to sync deleted file does not generate filesystem error
+2.09 Fixed error on extremely fragmented files
+
+
+ vim: set textwidth=80:
--- a/Documentation/filesystems/inotify.txt
+++ b/Documentation/filesystems/inotify.txt
@@ -0,0 +1,269 @@
+				   inotify
+	    a powerful yet simple file change notification system
+
+
+
+Document started 15 Mar 2005 by Robert Love <rml@novell.com>
+
+
+(i) User Interface
+
+Inotify is controlled by a set of three system calls and normal file I/O on a
+returned file descriptor.
+
+The first step in using inotify is to initialise an inotify instance:
+
+	int fd = inotify_init ();
+
+Each instance is associated with a unique, ordered queue.
+
+Change events are managed by "watches".  A watch is an (object,mask) pair where
+the object is a file or directory and the mask is a bit mask of one or more
+inotify events that the application wishes to receive.  See <linux/inotify.h>
+for valid events.  A watch is referenced by a watch descriptor, or wd.
+
+Watches are added via a path to the file.
+
+Watches on a directory will return events on any files inside of the directory.
+
+Adding a watch is simple:
+
+	int wd = inotify_add_watch (fd, path, mask);
+
+Where "fd" is the return value from inotify_init(), path is the path to the
+object to watch, and mask is the watch mask (see <linux/inotify.h>).
+
+You can update an existing watch in the same manner, by passing in a new mask.
+
+An existing watch is removed via
+
+	int ret = inotify_rm_watch (fd, wd);
+
+Events are provided in the form of an inotify_event structure that is read(2)
+from a given inotify instance.  The filename is of dynamic length and follows
+the struct. It is of size len.  The filename is padded with null bytes to
+ensure proper alignment.  This padding is reflected in len.
+
+You can slurp multiple events by passing a large buffer, for example
+
+	ssize_t len = read (fd, buf, BUF_LEN);
+
+Where "buf" is a pointer to a buffer of at least BUF_LEN bytes that receives
+the "inotify_event" structures.  The above example will return as many events
+as are available and fit in BUF_LEN.
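+
+One way to walk the buffer that read(2) fills, sketched under the assumption
+that the records are laid out as described above (a fixed-size header followed
+by len bytes of null-padded name).  count_matching_events() is a hypothetical
+helper, not part of the inotify API; struct inotify_event and the IN_* masks
+come from <sys/inotify.h>.
+
```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/inotify.h>

/*
 * Count the events in buf[0..nread) whose mask intersects "mask".
 * Hypothetical helper for illustration only.
 */
size_t count_matching_events(const char *buf, size_t nread, uint32_t mask)
{
	size_t off = 0, hits = 0;

	while (nread - off >= sizeof(struct inotify_event)) {
		struct inotify_event ev;

		/* copy the fixed-size header out; buf may not be aligned */
		memcpy(&ev, buf + off, sizeof(ev));

		/* ev.len covers the name plus its null padding */
		if (ev.len > nread - off - sizeof(ev))
			break;	/* truncated record, stop walking */

		if (ev.mask & mask)
			hits++;

		/* the next record starts right after the padded name */
		off += sizeof(ev) + ev.len;
	}
	return hits;
}
```
+
+A caller would pass the buffer and the byte count returned by read(), for
+example count_matching_events(buf, len, IN_CREATE | IN_DELETE), after checking
+that len is positive.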
+
+Each inotify instance fd is also select()- and poll()-able.
+
+You can find the size of the current event queue via the standard FIONREAD
+ioctl on the fd returned by inotify_init().
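+
+A minimal sketch of that query, assuming "fd" came from inotify_init().
+queued_event_bytes() is a hypothetical wrapper; ioctl(2) and FIONREAD are the
+standard interfaces.
+
```c
#include <sys/ioctl.h>

/*
 * Return the number of bytes currently queued on an inotify fd,
 * or -1 if the ioctl fails (errno is left as set by ioctl).
 * Hypothetical wrapper for illustration only.
 */
int queued_event_bytes(int fd)
{
	int nbytes = 0;

	if (ioctl(fd, FIONREAD, &nbytes) < 0)
		return -1;

	return nbytes;
}
```
+
+The result can be used to size the buffer handed to read(2) so that all
+pending events are fetched in one call.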
+
+All watches are destroyed and cleaned up on close.
+
+
+(ii)
+
+Prototypes:
+
+	int inotify_init (void);
+	int inotify_add_watch (int fd, const char *path, __u32 mask);
+	int inotify_rm_watch (int fd, __u32 wd);
+
+
+(iii) Kernel Interface
+
+Inotify's kernel API consists of a set of functions for managing watches and an
+event callback.
+
+To use the kernel API, you must first initialize an inotify instance with a set
+of inotify_operations.  You are given an opaque inotify_handle, which you use
+for any further calls to inotify.
+
+    struct inotify_handle *ih = inotify_init(my_event_handler);
+
+You must provide a function for processing events and a function for destroying
+the inotify watch.
+
+    void handle_event(struct inotify_watch *watch, u32 wd, u32 mask,
+    	              u32 cookie, const char *name, struct inode *inode)
+
+	watch - the pointer to the inotify_watch that triggered this call
+	wd - the watch descriptor
+	mask - describes the event that occurred
+	cookie - an identifier for synchronizing events
+	name - the dentry name for affected files in a directory-based event
+	inode - the affected inode in a directory-based event
+
+    void destroy_watch(struct inotify_watch *watch)
+
+You may add watches by providing a pre-allocated and initialized inotify_watch
+structure and specifying the inode to watch along with an inotify event mask.
+You must pin the inode during the call.  You will likely wish to embed the
+inotify_watch structure in a structure of your own which contains other
+information about the watch.  Once you add an inotify watch, it is immediately
+subject to removal depending on filesystem events.  You must grab a reference if
+you depend on the watch hanging around after the call.
+
+    inotify_init_watch(&my_watch->iwatch);
+    get_inotify_watch(&my_watch->iwatch);	// optional
+    s32 wd = inotify_add_watch(ih, &my_watch->iwatch, inode, mask);
+    put_inotify_watch(&my_watch->iwatch);	// optional
+
+You may use the watch descriptor (wd) or the address of the inotify_watch for
+other inotify operations.  You must not directly read or manipulate data in the
+inotify_watch.  Additionally, you must not call inotify_add_watch() more than
+once for a given inotify_watch structure, unless you have first called either
+inotify_rm_watch() or inotify_rm_wd().
+
+To determine if you have already registered a watch for a given inode, you may
+call inotify_find_watch(), which gives you both the wd and the watch pointer for
+the inotify_watch, or an error if the watch does not exist.
+
+    wd = inotify_find_watch(ih, inode, &watchp);
+
+You may use container_of() on the watch pointer to access your own data
+associated with a given watch.  When an existing watch is found,
+inotify_find_watch() bumps the refcount before releasing its locks.  You must
+put that reference with:
+
+    put_inotify_watch(watchp);
+
+Call inotify_find_update_watch() to update the event mask for an existing watch.
+inotify_find_update_watch() returns the wd of the updated watch, or an error if
+the watch does not exist.
+
+    wd = inotify_find_update_watch(ih, inode, mask);
+
+An existing watch may be removed by calling either inotify_rm_watch() or
+inotify_rm_wd().
+
+    int ret = inotify_rm_watch(ih, &my_watch->iwatch);
+    int ret = inotify_rm_wd(ih, wd);
+
+A watch may be removed while executing your event handler with the following:
+
+    inotify_remove_watch_locked(ih, iwatch);
+
+Call inotify_destroy() to remove all watches from your inotify instance and
+release it.  If there are no outstanding references, inotify_destroy() will call
+your destroy_watch op for each watch.
+
+    inotify_destroy(ih);
+
+When inotify removes a watch, it sends an IN_IGNORED event to your callback.
+You may use this event as an indication to free the watch memory.  Note that
+inotify may remove a watch due to filesystem events, as well as by your request.
+If you use IN_ONESHOT, inotify will remove the watch after the first event, at
+which point you may call the final put_inotify_watch().
+
+(iv) Kernel Interface Prototypes
+
+	struct inotify_handle *inotify_init(struct inotify_operations *ops);
+
+	void inotify_init_watch(struct inotify_watch *watch);
+
+	s32 inotify_add_watch(struct inotify_handle *ih,
+		              struct inotify_watch *watch,
+			      struct inode *inode, u32 mask);
+
+	s32 inotify_find_watch(struct inotify_handle *ih, struct inode *inode,
+			       struct inotify_watch **watchp);
+
+	s32 inotify_find_update_watch(struct inotify_handle *ih,
+				      struct inode *inode, u32 mask);
+
+	int inotify_rm_wd(struct inotify_handle *ih, u32 wd);
+
+	int inotify_rm_watch(struct inotify_handle *ih,
+			     struct inotify_watch *watch);
+
+	void inotify_remove_watch_locked(struct inotify_handle *ih,
+					 struct inotify_watch *watch);
+
+	void inotify_destroy(struct inotify_handle *ih);
+
+	void get_inotify_watch(struct inotify_watch *watch);
+	void put_inotify_watch(struct inotify_watch *watch);
+
+
+(v) Internal Kernel Implementation
+
+Each inotify instance is represented by an inotify_handle structure.
+Inotify's userspace consumers also have an inotify_device which is
+associated with the inotify_handle, and on which events are queued.
+
+Each watch is associated with an inotify_watch structure.  Watches are chained
+off of each associated inotify_handle and each associated inode.
+
+See fs/inotify.c and fs/inotify_user.c for the locking and lifetime rules.
+
+
+(vi) Rationale
+
+Q: What is the design decision behind not tying the watch to the open fd of
+   the watched object?
+
+A: Watches are associated with an open inotify device, not an open file.
+   This solves the primary problem with dnotify: keeping the file open pins
+   the file and thus, worse, pins the mount.  Dnotify is therefore infeasible
+   for use on a desktop system with removable media as the media cannot be
+   unmounted.  Watching a file should not require that it be open.
+
+Q: What is the design decision behind using an fd-per-instance as opposed to
+   an fd-per-watch?
+
+A: An fd-per-watch quickly consumes more file descriptors than are allowed,
+   more fd's than are feasible to manage, and more fd's than are optimally
+   select()-able.  Yes, root can bump the per-process fd limit and yes, users
+   can use epoll, but requiring both is a silly and extraneous requirement.
+   A watch consumes less memory than an open file, so separating the number
+   spaces is sensible.  The current design is what user-space developers
+   want: Users initialize inotify, once, and add n watches, requiring but one
+   fd and no twiddling with fd limits.  Initializing an inotify instance two
+   thousand times is silly.  If we can implement user-space's preferences 
+   cleanly--and we can, the idr layer makes stuff like this trivial--then we 
+   should.
+
+   There are other good arguments.  With a single fd, there is a single
+   item to block on, which is mapped to a single queue of events.  The single
+   fd returns all watch events and also any potential out-of-band data.  If
+   every fd was a separate watch,
+
+   - There would be no way to get event ordering.  Events on file foo and
+     file bar would pop poll() on both fd's, but there would be no way to tell
+     which happened first.  A single queue trivially gives you ordering.  Such
+     ordering is crucial to existing applications such as Beagle.  Imagine
+     "mv a b ; mv b a" events without ordering.
+
+   - We'd have to maintain n fd's and n internal queues with state,
+     versus just one.  It is a lot messier in the kernel.  A single, linear
+     queue is the data structure that makes sense.
+
+   - User-space developers prefer the current API.  The Beagle guys, for
+     example, love it.  Trust me, I asked.  It is not a surprise: Who'd want
+     to manage and block on 1000 fd's via select?
+
+   - No way to get out of band data.
+
+   - 1024 is still too low.  ;-)
+
+   When you talk about designing a file change notification system that
+   scales to 1000s of directories, juggling 1000s of fd's just does not seem
+   the right interface.  It is too heavy.
+
+   Additionally, it _is_ possible to initialize more than one instance and
+   juggle more than one queue and thus more than one associated fd.  There
+   need not be a one-fd-per-process mapping; it is one-fd-per-queue and a
+   process can easily want more than one queue.
+
+Q: Why the system call approach?
+
+A: The poor user-space interface is the second biggest problem with dnotify.
+   Signals are a terrible, terrible interface for file notification.  Or for
+   anything, for that matter.  The ideal solution, from all perspectives, is a
+   file descriptor-based one that allows basic file I/O and poll/select.
+   Obtaining the fd and managing the watches could have been done either via a
+   device file or a family of new system calls.  We decided to implement a
+   family of system calls because that is the preferred approach for new kernel
+   interfaces.  The only real difference was whether we wanted to use open(2)
+   and ioctl(2) or a couple of new system calls.  System calls beat ioctls.
+
--- a/Documentation/filesystems/isofs.txt
+++ b/Documentation/filesystems/isofs.txt
@@ -0,0 +1,42 @@
+Mount options that are the same as for msdos and vfat partitions.
+
+  gid=nnn	All files in the partition will be in group nnn.
+  uid=nnn	All files in the partition will be owned by user id nnn.
+  umask=nnn	The permission mask (see umask(1)) for the partition.
+
+Mount options that are the same as for vfat partitions. These are only useful
+when using discs encoded using Microsoft's Joliet extensions.
+  iocharset=name Character set to use for converting from Unicode to
+		ASCII.  Joliet filenames are stored in Unicode format, but
+		Unix for the most part doesn't know how to deal with Unicode.
+		There is also an option of doing UTF-8 translations with the
+		utf8 option.
+  utf8          Encode Unicode names in UTF-8 format. Default is no.
+
+Mount options unique to the isofs filesystem.
+  block=512     Set the block size for the disk to 512 bytes
+  block=1024    Set the block size for the disk to 1024 bytes
+  block=2048    Set the block size for the disk to 2048 bytes
+  check=relaxed Matches filenames with different cases
+  check=strict  Matches only filenames with the exact same case
+  cruft         Try to handle badly formatted CDs.
+  map=off       Do not map non-Rock Ridge filenames to lower case
+  map=normal    Map non-Rock Ridge filenames to lower case
+  map=acorn     As map=normal but also apply Acorn extensions if present
+  mode=xxx      Sets the permissions on files to xxx
+  nojoliet      Ignore Joliet extensions if they are present.
+  norock        Ignore Rock Ridge extensions if they are present.
+  hide		Completely strip hidden files from the file system.
+  showassoc	Show files marked with the 'associated' bit
+  unhide	Deprecated; showing hidden files is now default;
+		If given, it is a synonym for 'showassoc' which will
+		recreate previous unhide behavior
+  session=x     Select number of session on multisession CD
+  sbsector=xxx  Session begins from sector xxx
+
+Recommended documents about ISO 9660 standard are located at:
+http://www.y-adagio.com/public/standards/iso_cdromr/tocont.htm
+ftp://ftp.ecma.ch/ecma-st/Ecma-119.pdf
+Quoting from the PDF "This 2nd Edition of Standard ECMA-119 is technically
+identical with ISO 9660.", so it is a valid and gratis substitute for the
+official ISO specification.
--- a/Documentation/filesystems/jfs.txt
+++ b/Documentation/filesystems/jfs.txt
@@ -0,0 +1,35 @@
+IBM's Journaled File System (JFS) for Linux
+
+JFS Homepage:  http://jfs.sourceforge.net/
+
+The following mount options are supported:
+
+iocharset=name	Character set to use for converting from Unicode to
+		ASCII.  The default is to do no conversion.  Use
+		iocharset=utf8 for UTF-8 translations.  This requires
+		CONFIG_NLS_UTF8 to be set in the kernel .config file.
+		iocharset=none specifies the default behavior explicitly.
+
+resize=value	Resize the volume to <value> blocks.  JFS only supports
+		growing a volume, not shrinking it.  This option is only
+		valid during a remount, when the volume is mounted
+		read-write.  The resize keyword with no value will grow
+		the volume to the full size of the partition.
+
+nointegrity	Do not write to the journal.  The primary use of this option
+		is to allow for higher performance when restoring a volume
+		from backup media.  The integrity of the volume is not
+		guaranteed if the system abends.
+
+integrity	Default.  Commit metadata changes to the journal.  Use this
+		option to remount a volume where the nointegrity option was
+		previously specified in order to restore normal behavior.
+
+errors=continue		Keep going on a filesystem error.
+errors=remount-ro	Default. Remount the filesystem read-only on an error.
+errors=panic		Panic and halt the machine if an error occurs.
+
+Please send bugs, comments, cards and letters to shaggy@austin.ibm.com.
+
+The JFS mailing list can be subscribed to by using the link labeled
+"Mail list Subscribe" at our web page http://jfs.sourceforge.net/
--- a/Documentation/filesystems/ncpfs.txt
+++ b/Documentation/filesystems/ncpfs.txt
@@ -0,0 +1,12 @@
+The ncpfs filesystem understands the NCP protocol, designed by the
+Novell Corporation for their NetWare(tm) product.  NCP is functionally
+similar to the NFS used in the TCP/IP community.
+To mount a NetWare filesystem, you need a special mount program, which
+can be found in the ncpfs package.  The home site for ncpfs is
+ftp.gwdg.de/pub/linux/misc/ncpfs, but sunsite and its many mirrors
+will have it as well.
+
+Related products are linware and mars_nwe, which will give Linux partial
+NetWare server functionality.  Linware's home site is
+klokan.sh.cvut.cz/pub/linux/linware; mars_nwe can be found on
+ftp.gwdg.de/pub/linux/misc/ncpfs.
--- a/Documentation/filesystems/ntfs.txt
+++ b/Documentation/filesystems/ntfs.txt
@@ -0,0 +1,714 @@
+The Linux NTFS filesystem driver
+================================
+
+
+Table of contents
+=================
+
+- Overview
+- Web site
+- Features
+- Supported mount options
+- Known bugs and (mis-)features
+- Using NTFS volume and stripe sets
+  - The Device-Mapper driver
+  - The Software RAID / MD driver
+  - Limitations when using the MD driver
+- ChangeLog
+
+
+Overview
+========
+
+Linux-NTFS comes with a number of user-space programs known as ntfsprogs.
+These include mkntfs, a full-featured ntfs filesystem format utility,
+ntfsundelete used for recovering files that were unintentionally deleted
+from an NTFS volume and ntfsresize which is used to resize an NTFS partition.
+See the web site for more information.
+
+To mount an NTFS 1.2/3.x (Windows NT4/2000/XP/2003) volume, use the file
+system type 'ntfs'.  The driver currently supports read-only mode (with no
+fault-tolerance, encryption or journalling) and very limited, but safe, write
+support.
+
+For fault tolerance and raid support (i.e. volume and stripe sets), you can
+use the kernel's Software RAID / MD driver.  See section "Using NTFS volume
+and stripe sets" for details.
+
+
+Web site
+========
+
+There is plenty of additional information on the linux-ntfs web site
+at http://linux-ntfs.sourceforge.net/
+
+The web site has a lot of additional information, such as a comprehensive
+FAQ, documentation on the NTFS on-disk format, information on the Linux-NTFS
+userspace utilities, etc.
+
+
+Features
+========
+
+- This is a complete rewrite of the NTFS driver that used to be in the 2.4 and
+  earlier kernels.  This new driver implements NTFS read support and is
+  functionally equivalent to the old ntfs driver and it also implements limited
+  write support.  The biggest limitation at present is that files/directories
+  cannot be created or deleted.  See below for the list of write features that
+  are so far supported.  Another limitation is that writing to compressed files
+  is not implemented at all.  Also, neither read nor write access to encrypted
+  files is so far implemented.
+- The new driver has full support for sparse files on NTFS 3.x volumes which
+  the old driver isn't happy with.
+- The new driver supports execution of binaries due to mmap() now being
+  supported.
+- The new driver supports loopback mounting of files on NTFS which is used by
+  some Linux distributions to enable the user to run Linux from an NTFS
+  partition by creating a large file while in Windows and then loopback
+  mounting the file while in Linux and creating a Linux filesystem on it that
+  is used to install Linux on it.
+- A comparison of the two drivers using:
+	time find . -type f -exec md5sum "{}" \;
+  run three times in sequence with each driver (after a reboot) on a 1.4GiB
+  NTFS partition, showed the new driver to be 20% faster in total time elapsed
+  (from 9:43 minutes on average down to 7:53).  The time spent in user space
+  was unchanged but the time spent in the kernel was decreased by a factor of
+  2.5 (from 85 CPU seconds down to 33).
+- The driver does not support short file names in general.  For backwards
+  compatibility, we implement access to files using their short file names if
+  they exist.  The driver will not create short file names however, and a
+  rename will discard any existing short file name.
+- The new driver supports exporting of mounted NTFS volumes via NFS.
+- The new driver supports async io (aio).
+- The new driver supports fsync(2), fdatasync(2), and msync(2).
+- The new driver supports readv(2) and writev(2).
+- The new driver supports access time updates (including mtime and ctime).
+- The new driver supports truncate(2) and open(2) with O_TRUNC.  But at present
+  only very limited support for highly fragmented files, i.e. ones which have
+  their data attribute split across multiple extents, is included.  Another
+  limitation is that at present truncate(2) will never create sparse files,
+  since to mark a file sparse we need to modify the directory entry for the
+  file and we do not implement directory modifications yet.
+- The new driver supports write(2) which can both overwrite existing data and
+  extend the file size so that you can write beyond the existing data.  Also,
+  writing into sparse regions is supported and the holes are filled in with
+  clusters.  But at present only limited support for highly fragmented files,
+  i.e. ones which have their data attribute split across multiple extents, is
+  included.  Another limitation is that write(2) will never create sparse
+  files, since to mark a file sparse we need to modify the directory entry for
+  the file and we do not implement directory modifications yet.
+
+Supported mount options
+=======================
+
+In addition to the generic mount options described by the manual page for the
+mount command (man 8 mount, also see man 5 fstab), the NTFS driver supports the
+following mount options:
+
+iocharset=name		Deprecated option.  Still supported but please use
+			nls=name in the future.  See description for nls=name.
+
+nls=name		Character set to use when returning file names.
+			Unlike VFAT, NTFS suppresses names that contain
+			unconvertible characters.  Note that most character
+			sets contain insufficient characters to represent all
+			possible Unicode characters that can exist on NTFS.
+			To be sure you are not missing any files, you are
+			advised to use nls=utf8 which is capable of
+			representing all Unicode characters.
+
+utf8=<bool>		Option no longer supported.  Currently mapped to
+			nls=utf8 but please use nls=utf8 in the future and
+			make sure utf8 is compiled either as module or into
+			the kernel.  See description for nls=name.
+
+uid=
+gid=
+umask=			Provide default owner, group, and access mode mask.
+			These options work as documented in mount(8).  By
+			default, the files/directories are owned by root and
+			he/she has read and write permissions, as well as
+			browse permission for directories.  No one else has any
+			access permissions.  I.e. the mode on all files is by
+			default rw------- and for directories rwx------, a
+			consequence of the default fmask=0177 and dmask=0077.
+			Using a umask of zero will grant all permissions to
+			everyone, i.e. all files and directories will have mode
+			rwxrwxrwx.
+
+fmask=
+dmask=			Instead of specifying umask which applies both to
+			files and directories, fmask applies only to files and
+			dmask only to directories.
+
+sloppy=<BOOL>		If sloppy is specified, ignore unknown mount options.
+			Otherwise the default behaviour is to abort mount if
+			any unknown options are found.
+
+show_sys_files=<BOOL>	If show_sys_files is specified, show the system files
+			in directory listings.  Otherwise the default behaviour
+			is to hide the system files.
+			Note that even when show_sys_files is specified, "$MFT"
+			will not be visible due to bugs/mis-features in glibc.
+			Further, note that irrespective of show_sys_files, all
+			files are accessible by name, i.e. you can always do
+			"ls -l \$UpCase" for example to specifically show the
+			system file containing the Unicode upcase table.
+
+case_sensitive=<BOOL>	If case_sensitive is specified, treat all file names as
+			case sensitive and create file names in the POSIX
+			namespace.  Otherwise the default behaviour is to treat
+			file names as case insensitive and to create file names
+			in the WIN32/LONG name space.  Note, the Linux NTFS
+			driver will never create short file names and will
+			remove them on rename/delete of the corresponding long
+			file name.
+			Note that files remain accessible via their short file
+			name, if it exists.  If case_sensitive, you will need
+			to provide the correct case of the short file name.
+
+disable_sparse=<BOOL>	If disable_sparse is specified, creation of sparse
+			regions, i.e. holes, inside files is disabled for the
+			volume (for the duration of this mount only).  By
+			default, creation of sparse regions is enabled, which
+			is consistent with the behaviour of traditional Unix
+			filesystems.
+
+errors=opt		What to do when critical filesystem errors are found.
+			The following values can be used for "opt":
+			  continue: DEFAULT, try to clean-up as much as
+				    possible, e.g. marking a corrupt inode as
+				    bad so it is no longer accessed, and then
+				    continue.
+			  recover:  At present the only recovery supported
+				    is that of the boot sector from the backup
+				    copy.  If read-only mount, the recovery is
+				    done in memory only and not written to
+				    disk.
+			Note that the options are additive, i.e. specifying:
+			   errors=continue,errors=recover
+			means the driver will attempt to recover and if that
+			fails it will clean-up as much as possible and
+			continue.
+
+mft_zone_multiplier=	Set the MFT zone multiplier for the volume (this
+			setting is not persistent across mounts and can be
+			changed from mount to mount but cannot be changed on
+			remount).  Values of 1 to 4 are allowed, 1 being the
+			default.  The MFT zone multiplier determines how much
+			space is reserved for the MFT on the volume.  If all
+			other space is used up, then the MFT zone will be
+			shrunk dynamically, so this has no impact on the
+			amount of free space.  However, it can have an impact
+			on performance by affecting fragmentation of the MFT.
+			In general use the default.  If you have a lot of small
+			files then use a higher value.  The values have the
+			following meaning:
+			      Value	     MFT zone size (% of volume size)
+				1		12.5%
+				2		25%
+				3		37.5%
+				4		50%
+			Note this option is irrelevant for read-only mounts.
+
+
+Known bugs and (mis-)features
+=============================
+
+- The link count on each directory inode entry is set to 1, due to Linux not
+  supporting directory hard links.  This may well confuse some user space
+  applications, since the directory names will have the same inode numbers.
+  This also speeds up ntfs_read_inode() immensely.  And we haven't found any
+  problems with this approach so far.  If you find a problem with this, please
+  let us know.
+
+
+Please send bug reports/comments/feedback/abuse to the Linux-NTFS development
+list at sourceforge: linux-ntfs-dev@lists.sourceforge.net
+
+
+Using NTFS volume and stripe sets
+=================================
+
+For support of volume and stripe sets, you can either use the kernel's
+Device-Mapper driver or the kernel's Software RAID / MD driver.  The former is
+the recommended one to use for linear raid.  But the latter is required for
+raid level 5.  For striping and mirroring, either driver should work fine.
+
+
+The Device-Mapper driver
+------------------------
+
+You will need to create a table of the components of the volume/stripe set and
+how they fit together and load this into the kernel using the dmsetup utility
+(see man 8 dmsetup).
+
+Linear volume sets, i.e. linear raid, have been tested and work fine.  Even
+though untested, there is no reason why stripe sets, i.e. raid level 0, and
+mirrors, i.e. raid level 1, should not work, too.  Stripes with parity, i.e.
+raid level 5, unfortunately cannot work yet because the current version of the
+Device-Mapper driver does not support raid level 5.  You may be able to use the
+Software RAID / MD driver for raid level 5, see the next section for details.
+
+To create the table describing your volume you will need to know each of its
+components and their sizes in sectors, i.e. multiples of 512-byte blocks.
+
+For NT4 fault tolerant volumes you can obtain the sizes using fdisk.  So for
+example if one of your partitions is /dev/hda2 you would do:
+
+$ fdisk -ul /dev/hda
+
+Disk /dev/hda: 81.9 GB, 81964302336 bytes
+255 heads, 63 sectors/track, 9964 cylinders, total 160086528 sectors
+Units = sectors of 1 * 512 = 512 bytes
+
+   Device Boot      Start         End      Blocks   Id  System
+   /dev/hda1   *          63     4209029     2104483+  83  Linux
+   /dev/hda2         4209030    37768814    16779892+  86  NTFS
+   /dev/hda3        37768815    46170809     4200997+  83  Linux
+
+And you would know that /dev/hda2 has a size of 37768814 - 4209030 + 1 =
+33559785 sectors.
+
+For Win2k and later dynamic disks, you can for example use the ldminfo utility
+which is part of the Linux LDM tools (the latest version at the time of
+writing is linux-ldm-0.0.8.tar.bz2).  You can download it from:
+	http://linux-ntfs.sourceforge.net/downloads.html
+Simply extract the downloaded archive (tar xvjf linux-ldm-0.0.8.tar.bz2), go
+into it (cd linux-ldm-0.0.8) and change to the test directory (cd test).  You
+will find the precompiled (i386) ldminfo utility there.  NOTE: You will not be
+able to compile this yourself easily so use the binary version!
+
+Then you would use ldminfo in dump mode to obtain the necessary information:
+
+$ ./ldminfo --dump /dev/hda
+
+This would dump the LDM database found on /dev/hda which describes all of your
+dynamic disks and all the volumes on them.  At the bottom you will see the
+VOLUME DEFINITIONS section which is all you really need.  You may need to look
+further above to determine which of the disks in the volume definitions is
+which device in Linux.  Hint: Run ldminfo on each of your dynamic disks and
+look at the Disk Id close to the top of the output for each (the PRIVATE HEADER
+section).  You can then find these Disk Ids in the VBLK DATABASE section in the
+<Disk> components where you will get the LDM Name for the disk that is found in
+the VOLUME DEFINITIONS section.
+
+Note you will also need to enable the LDM driver in the Linux kernel.  If your
+distribution did not enable it, you will need to recompile the kernel with it
+enabled.  This will create the LDM partitions on each device at boot time.  You
+would then use those devices (for /dev/hda they would be /dev/hda1, 2, 3, etc)
+in the Device-Mapper table.
+
+You can also bypass using the LDM driver by using the main device (e.g.
+/dev/hda) and then using the offsets of the LDM partitions into this device as
+the "Start sector of device" when creating the table.  Once again ldminfo would
+give you the correct information to do this.
+
+Assuming you know all your devices and their sizes things are easy.
+
+For a linear raid the table would look like this (note all values are in
+512-byte sectors):
+
+--- cut here ---
+# Offset into	Size of this	Raid type	Device		Start sector
+# volume	device						of device
+0		1028161		linear		/dev/hda1	0
+1028161		3903762		linear		/dev/hdb2	0
+4931923		2103211		linear		/dev/hdc1	0
+--- cut here ---
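+
+As a cross-check of the example table above, each "Offset into volume" is the
+sum of the sizes of all preceding components:
+
+	0 + 1028161       = 1028161	(offset of /dev/hdb2)
+	1028161 + 3903762 = 4931923	(offset of /dev/hdc1)
+	4931923 + 2103211 = 7035134	(total size of the volume)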
+
+For a striped volume, i.e. raid level 0, you will need to know the chunk size
+you used when creating the volume.  Windows uses 64kiB as the default, so it
+will probably be this unless you changed the defaults when creating the array.
+
+For a raid level 0 the table would look like this (note all values are in
+512-byte sectors):
+
+--- cut here ---
+# Offset   Size	    Raid     Number   Chunk  1st        Start	2nd	  Start
+# into     of the   type     of	      size   Device	in	Device	  in
+# volume   volume	     stripes			device		  device
+0	   2056320  striped  2	      128    /dev/hda1	0	/dev/hdb1 0
+--- cut here ---
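+
+The chunk size of 128 in the table above is the Windows default of 64kiB
+expressed in 512-byte sectors:
+
+	64kiB = 64 * 1024 bytes = 65536 bytes
+	65536 bytes / 512 bytes per sector = 128 sectors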
+
+If there are more than two devices, just add each of them to the end of the
+line.
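+
+For example, a sketch of a table for a three-device stripe set.  The third
+device /dev/hdc1 and the volume size are made up for illustration:
+
+--- cut here ---
+# Offset   Size	    Raid     Number   Chunk  Devices, each followed by its
+# into     of the   type     of	      size   start sector
+# volume   volume	     stripes
+0	   3084288  striped  3	      128    /dev/hda1 0 /dev/hdb1 0 /dev/hdc1 0
+--- cut here ---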
+
+Finally, for a mirrored volume, i.e. raid level 1, the table would look like
+this (note all values are in 512-byte sectors):
+
+--- cut here ---
+# Ofs Size   Raid   Log  Number Region Should Number Source  Start Target Start
+# in  of the type   type of log size   sync?  of     Device  in    Device in
+# vol volume		 params		     mirrors	     Device	  Device
+0    2056320 mirror core 2	16     nosync 2	   /dev/hda1 0   /dev/hdb1 0
+--- cut here ---
+
+If you are mirroring to multiple devices you can specify further targets at the
+end of the line.
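+
+For example, a sketch of a table for a three-way mirror.  The second target
+/dev/hdc1 is made up for illustration, and the number of mirrors becomes 3:
+
+--- cut here ---
+0 2056320 mirror core 2 16 nosync 3 /dev/hda1 0 /dev/hdb1 0 /dev/hdc1 0
+--- cut here ---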
+
+Note the "Should sync?" parameter "nosync" means that the two mirrors are
+already in sync which will be the case on a clean shutdown of Windows.  If the
+mirrors are not clean, you can specify the "sync" option instead of "nosync"
+and the Device-Mapper driver will then copy the entirety of the "Source Device"
+to the "Target Device" or, if you specified multiple target devices, to all of
+them.
+
+Once you have your table, save it in a file somewhere (e.g. /etc/ntfsvolume1),
+and hand it over to dmsetup to work with, like so:
+
+$ dmsetup create myvolume1 /etc/ntfsvolume1
+
+You can obviously replace "myvolume1" with whatever name you like.
+
+If it all worked, you will now have the device /dev/device-mapper/myvolume1
+which you can then just use as an argument to the mount command as usual to
+mount the ntfs volume.  For example:
+
+$ mount -t ntfs -o ro /dev/device-mapper/myvolume1 /mnt/myvol1
+
+(You need to create the directory /mnt/myvol1 first and of course you can use
+anything you like instead of /mnt/myvol1 as long as it is an existing
+directory.)
+
+It is advisable to do the mount read-only to see if the volume has been set up
+correctly to avoid the possibility of causing damage to the data on the ntfs
+volume.
+
+
+The Software RAID / MD driver
+-----------------------------
+
+An alternative to using the Device-Mapper driver is to use the kernel's
+Software RAID / MD driver, for which you need to set up your /etc/raidtab
+appropriately (see man 5 raidtab).
+
+Linear volume sets, i.e. linear raid, as well as stripe sets, i.e. raid level
+0, have been tested and work fine (though see section "Limitations when using
+the MD driver with NTFS volumes" especially if you want to use linear raid).
+Even though untested, there is no reason why mirrors, i.e. raid level 1, and
+stripes with parity, i.e. raid level 5, should not work, too.
+
+You have to use the "persistent-superblock 0" option for each raid-disk in the
+NTFS volume/stripe you are configuring in /etc/raidtab as the persistent
+superblock used by the MD driver would damage the NTFS volume.
+
+Windows by default uses a stripe chunk size of 64k, so you probably want the
+"chunk-size 64k" option for each raid-disk, too.
+
+For example, if you have a stripe set consisting of two partitions /dev/hda5
+and /dev/hdb1 your /etc/raidtab would look like this:
+
+raiddev /dev/md0
+	raid-level	0
+	nr-raid-disks	2
+	nr-spare-disks	0
+	persistent-superblock	0
+	chunk-size	64k
+	device		/dev/hda5
+	raid-disk	0
+	device		/dev/hdb1
+	raid-disk	1
+
+For linear raid, just change the raid-level above to "raid-level linear", for
+mirrors, change it to "raid-level 1", and for stripe sets with parity, change
+it to "raid-level 5".
+
+Note for stripe sets with parity you will also need to tell the MD driver
+which parity algorithm to use by specifying the option "parity-algorithm
+which", where you need to replace "which" with the name of the algorithm to
+use (see man 5 raidtab for available algorithms) and you will have to try the
+different available algorithms until you find one that works.  Make sure you
+are working read-only when playing with this as you may damage your data
+otherwise.  If you find which algorithm works please let us know (email the
+linux-ntfs developers list linux-ntfs-dev@lists.sourceforge.net or drop in on
+IRC in channel #ntfs on the irc.freenode.net network) so we can update this
+documentation.
+
+Once the raidtab is set up, run for example raid0run -a to start all devices or
+raid0run /dev/md0 to start a particular md device, in this case /dev/md0.
+
+Then just use the mount command as usual to mount the ntfs volume using for
+example:	mount -t ntfs -o ro /dev/md0 /mnt/myntfsvolume
+
+It is advisable to do the mount read-only to see if the md volume has been
+set up correctly to avoid the possibility of causing damage to the data on the
+ntfs volume.
+
+
+Limitations when using the Software RAID / MD driver
+-----------------------------------------------------
+
+Using the md driver will not work properly if any of your NTFS partitions have
+an odd number of sectors.  This is especially important for linear raid as all
+data after the first partition with an odd number of sectors will be offset by
+one or more sectors so if you mount such a partition with write support you
+will cause massive damage to the data on the volume which will only become
+apparent when you try to use the volume again under Windows.
+
+So when using linear raid, make sure that all your partitions have an even
+number of sectors BEFORE attempting to use it.  You have been warned!
+
+Even better is to simply use the Device-Mapper for linear raid and then you do
+not have this problem with odd numbers of sectors.
+
+
+ChangeLog
+=========
+
+Note, a technical ChangeLog aimed at kernel hackers is in fs/ntfs/ChangeLog.
+
+2.1.28:
+	- Fix a deadlock.
+2.1.27:
+	- Implement page migration support so the kernel can move memory used
+	  by NTFS files and directories around for management purposes.
+	- Add support for writing to sparse files created with Windows XP SP2.
+	- Many minor improvements and bug fixes.
+2.1.26:
+	- Implement support for sector sizes above 512 bytes (up to the maximum
+	  supported by NTFS which is 4096 bytes).
+	- Enhance support for NTFS volumes which were supported by Windows but
+	  not by Linux due to invalid attribute list attribute flags.
+	- A few minor updates and bug fixes.
+2.1.25:
+	- Write support is now extended with write(2) being able to both
+	  overwrite existing file data and to extend files.  Also, if a write
+	  to a sparse region occurs, write(2) will fill in the hole.  Note,
+	  mmap(2) based writes still do not support writing into holes or
+	  writing beyond the initialized size.
+	- Write support has a new feature and that is that truncate(2) and
+	  open(2) with O_TRUNC are now implemented thus files can be both made
+	  smaller and larger.
+	- Note: Both write(2) and truncate(2)/open(2) with O_TRUNC still have
+	  limitations in that they
+	  - only provide limited support for highly fragmented files.
+	  - only work on regular, i.e. uncompressed and unencrypted files.
+	  - never create sparse files although this will change once directory
+	    operations are implemented.
+	- Lots of bug fixes and enhancements across the board.
+2.1.24:
+	- Support journals ($LogFile) which have been modified by chkdsk.  This
+	  means users can boot into Windows after we marked the volume dirty.
+	  The Windows boot will run chkdsk and then reboot.  The user can then
+	  immediately boot into Linux rather than having to do a full Windows
+	  boot first before rebooting into Linux and we will recognize such a
+	  journal and empty it as it is clean by definition.
+	- Support journals ($LogFile) with only one restart page as well as
+	  journals with two different restart pages.  We sanity check both and
+	  either use the only sane one or the more recent one of the two in the
+	  case that both are valid.
+	- Lots of bug fixes and enhancements across the board.
+2.1.23:
+	- Stamp the user space journal, aka transaction log, aka $UsnJrnl, if
+	  it is present and active thus telling Windows and applications using
+	  the transaction log that changes can have happened on the volume
+	  which are not recorded in $UsnJrnl.
+	- Detect the case when Windows has been hibernated (suspended to disk)
+	  and if this is the case do not allow (re)mounting read-write to
+	  prevent data corruption when you boot back into the suspended
+	  Windows session.
+	- Implement extension of resident files using the normal file write
+	  code paths, i.e. most very small files can be extended to be a little
+	  bit bigger but not by much.
+	- Add new mount option "disable_sparse".  (See list of mount options
+	  above for details.)
+	- Improve handling of ntfs volumes with errors and strange boot sectors
+	  in particular.
+	- Fix various bugs including a nasty deadlock that appeared in recent
+	  kernels (around 2.6.11-2.6.12 timeframe).
+2.1.22:
+	- Improve handling of ntfs volumes with errors.
+	- Fix various bugs and race conditions.
+2.1.21:
+	- Fix several race conditions and various other bugs.
+	- Many internal cleanups, code reorganization, optimizations, and mft
+	  and index record writing code rewritten to fit in with the changes.
+	- Update Documentation/filesystems/ntfs.txt with instructions on how to
+	  use the Device-Mapper driver with NTFS ftdisk/LDM raid.
+2.1.20:
+	- Fix two stupid bugs introduced in 2.1.18 release.
+2.1.19:
+	- Minor bugfix in handling of the default upcase table.
+	- Many internal cleanups and improvements.  Many thanks to Linus
+	  Torvalds and Al Viro for the help and advice with the sparse
+	  annotations and cleanups.
+2.1.18:
+	- Fix scheduling latencies at mount time.  (Ingo Molnar)
+	- Fix endianness bug in a little traversed portion of the attribute
+	  lookup code.
+2.1.17:
+	- Fix bugs in mount time error code paths.
+2.1.16:
+	- Implement access time updates (including mtime and ctime).
+	- Implement fsync(2), fdatasync(2), and msync(2) system calls.
+	- Enable the readv(2) and writev(2) system calls.
+	- Enable access via the asynchronous io (aio) API by adding support for
+	  the aio_read(3) and aio_write(3) functions.
+2.1.15:
+	- Invalidate quotas when (re)mounting read-write.
+	  NOTE:  This now only leaves user space journalling on the side.  (See
+	  note for version 2.1.13, below.)
+2.1.14:
+	- Fix an NFSd caused deadlock reported by several users.
+2.1.13:
+	- Implement writing of inodes (access time updates are not implemented
+	  yet so mounting with -o noatime,nodiratime is enforced).
+	- Enable writing out of resident files so you can now overwrite any
+	  uncompressed, unencrypted, nonsparse file as long as you do not
+	  change the file size.
+	- Add housekeeping of ntfs system files so that ntfsfix no longer needs
+	  to be run after writing to an NTFS volume.
+	  NOTE:  This still leaves quota tracking and user space journalling on
+	  the side but they should not cause data corruption.  In the worst
+	  case the charged quotas will be out of date ($Quota) and some
+	  userspace applications might get confused due to the out of date
+	  userspace journal ($UsnJrnl).
+2.1.12:
+	- Fix the second fix to the decompression engine from the 2.1.9 release
+	  and some further internals cleanups.
+2.1.11:
+	- Driver internal cleanups.
+2.1.10:
+	- Force read-only (re)mounting of volumes with unsupported volume
+	  flags and various cleanups.
+2.1.9:
+	- Fix two bugs in handling of corner cases in the decompression engine.
+2.1.8:
+	- Read the $MFT mirror and compare it to the $MFT and if the two do not
+	  match, force a read-only mount and do not allow read-write remounts.
+	- Read and parse the $LogFile journal and if it indicates that the
+	  volume was not shutdown cleanly, force a read-only mount and do not
+	  allow read-write remounts.  If the $LogFile indicates a clean
+	  shutdown and a read-write (re)mount is requested, empty $LogFile to
+	  ensure that Windows cannot cause data corruption by replaying a stale
+	  journal after Linux has written to the volume.
+	- Improve time handling so that the NTFS time is fully preserved when
+	  converted to kernel time and only up to 99 nano-seconds are lost when
+	  kernel time is converted to NTFS time.
+2.1.7:
+	- Enable NFS exporting of mounted NTFS volumes.
+2.1.6:
+	- Fix minor bug in handling of compressed directories that fixes the
+	  erroneous "du" and "stat" output people reported.
+2.1.5:
+	- Minor bug fix in attribute list attribute handling that fixes the
+	  I/O errors on "ls" of certain fragmented files found by at least two
+	  people running Windows XP.
+2.1.4:
+	- Minor update allowing compilation with all gcc versions (well, the
+	  ones the kernel can be compiled with anyway).
+2.1.3:
+	- Major bug fixes for reading files and volumes in corner cases which
+	  were being hit by Windows 2k/XP users.
+2.1.2:
+	- Major bug fixes alleviating the hangs in statfs experienced by some
+	  users.
+2.1.1:
+	- Update handling of compressed files so people no longer get the
+	  frequently reported warning messages about initialized_size !=
+	  data_size.
+2.1.0:
+	- Add configuration option for developmental write support.
+	- Initial implementation of file overwriting. (Writes to resident files
+	  are not written out to disk yet, so avoid writing to files smaller
+	  than about 1kiB.)
+	- Intercept/abort changes in file size as they are not implemented yet.
+2.0.25:
+	- Minor bugfixes in error code paths and small cleanups.
+2.0.24:
+	- Small internal cleanups.
+	- Support for sendfile system call. (Christoph Hellwig)
+2.0.23:
+	- Massive internal locking changes to mft record locking. Fixes
+	  various race conditions and deadlocks.
+	- Fix ntfs over loopback for compressed files by adding an
+	  optimization barrier. (gcc was screwing up otherwise ?)
+	Thanks go to Christoph Hellwig for pointing these two out:
+	- Remove now unused function fs/ntfs/malloc.h::vmalloc_nofs().
+	- Fix ntfs_free() for ia64 and parisc.
+2.0.22:
+	- Small internal cleanups.
+2.0.21:
+	These only affect 32-bit architectures:
+	- Check for, and refuse to mount too large volumes (maximum is 2TiB).
+	- Check for, and refuse to open too large files and directories
+	  (maximum is 16TiB).
+2.0.20:
+	- Support non-resident directory index bitmaps. This means we now cope
+	  with huge directories without problems.
+	- Fix a page leak that manifested itself in some cases when reading
+	  directory contents.
+	- Internal cleanups.
+2.0.19:
+	- Fix race condition and improvements in block i/o interface.
+	- Optimization when reading compressed files.
+2.0.18:
+	- Fix race condition in reading of compressed files.
+2.0.17:
+	- Cleanups and optimizations.
+2.0.16:
+	- Fix stupid bug introduced in 2.0.15 in new attribute inode API.
+	- Big internal cleanup replacing the mftbmp access hacks by using the
+	  new attribute inode API instead.
+2.0.15:
+	- Bug fix in parsing of remount options.
+	- Internal changes implementing attribute (fake) inodes allowing all
+	  attribute i/o to go via the page cache and to use all the normal
+	  vfs/mm functionality.
+2.0.14:
+	- Internal changes improving run list merging code and minor locking
+	  change to not rely on BKL in ntfs_statfs().
+2.0.13:
+	- Internal changes towards using iget5_locked() in preparation for
+	  fake inodes and small cleanups to ntfs_volume structure.
+2.0.12:
+	- Internal cleanups in address space operations made possible by the
+	  changes introduced in the previous release.
+2.0.11:
+	- Internal updates and cleanups introducing the first step towards
+	  fake inode based attribute i/o.
+2.0.10:
+	- Microsoft says that the maximum number of inodes is 2^32 - 1. Update
+	  the driver accordingly to only use 32-bits to store inode numbers on
+	  32-bit architectures. This improves the speed of the driver a little.
+2.0.9:
+	- Change decompression engine to use a single buffer. This should not
+	  affect performance except perhaps on the most heavy i/o on SMP
+	  systems when accessing multiple compressed files from multiple
+	  devices simultaneously.
+	- Minor updates and cleanups.
+2.0.8:
+	- Remove now obsolete show_inodes and posix mount option(s).
+	- Restore show_sys_files mount option.
+	- Add new mount option case_sensitive, to determine if the driver
+	  treats file names as case sensitive or not.
+	- Mostly drop support for short file names (for backwards compatibility
+	  we only support accessing files via their short file name if one
+	  exists).
+	- Fix dcache aliasing issues wrt short/long file names.
+	- Cleanups and minor fixes.
+2.0.7:
+	- Just cleanups.
+2.0.6:
+	- Major bugfix to make compatible with other kernel changes. This fixes
+	  the hangs/oopses on umount.
+	- Locking cleanup in directory operations (remove BKL usage).
+2.0.5:
+	- Major buffer overflow bug fix.
+	- Minor cleanups and updates for kernel 2.5.12.
+2.0.4:
+	- Cleanups and updates for kernel 2.5.11.
+2.0.3:
+	- Small bug fixes, cleanups, and performance improvements.
+2.0.2:
+	- Use default fmask of 0177 so that files are not executable by default.
+	  If you want owner executable files, just use fmask=0077.
+	- Update for kernel 2.5.9 but preserve backwards compatibility with
+	  kernel 2.5.7.
+	- Minor bug fixes, cleanups, and updates.
+2.0.1:
+	- Minor updates, primarily set the executable bit by default on files
+	  so they can be executed.
+2.0.0:
+	- Started ChangeLog.
+
--- a/Documentation/filesystems/ocfs2.txt
+++ b/Documentation/filesystems/ocfs2.txt
@@ -0,0 +1,59 @@
+OCFS2 filesystem
+================
+OCFS2 is a general purpose extent based shared disk cluster file
+system with many similarities to ext3. It supports 64 bit inode
+numbers, and has automatically extending metadata groups which may
+also make it attractive for non-clustered use.
+
+You'll want to install the ocfs2-tools package in order to at least
+get "mount.ocfs2" and "ocfs2_hb_ctl".
+
+Project web page:    http://oss.oracle.com/projects/ocfs2
+Tools web page:      http://oss.oracle.com/projects/ocfs2-tools
+OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/
+
+All code copyright 2005 Oracle except when otherwise noted.
+
+CREDITS:
+Lots of code taken from ext3 and other projects.
+
+Authors in alphabetical order:
+Joel Becker   <joel.becker@oracle.com>
+Zach Brown    <zach.brown@oracle.com>
+Mark Fasheh   <mark.fasheh@oracle.com>
+Kurt Hackel   <kurt.hackel@oracle.com>
+Sunil Mushran <sunil.mushran@oracle.com>
+Manish Singh  <manish.singh@oracle.com>
+
+Caveats
+=======
+Features which OCFS2 does not support yet:
+	- sparse files
+	- extended attributes
+	- shared writable mmap
+	- loopback is supported, but data written will not
+	  be cluster coherent.
+	- quotas
+	- cluster aware flock
+	- cluster aware lockf
+	- Directory change notification (F_NOTIFY)
+	- Distributed Caching (F_SETLEASE/F_GETLEASE/break_lease)
+	- POSIX ACLs
+	- readpages / writepages (not user visible)
+
+Mount options
+=============
+
+OCFS2 supports the following mount options:
+(*) == default
+
+barrier=1		This enables/disables barriers. barrier=0 disables it,
+			barrier=1 enables it.
+errors=remount-ro(*)	Remount the filesystem read-only on an error.
+errors=panic		Panic and halt the machine if an error occurs.
+intr		(*)	Allow signals to interrupt cluster operations.
+nointr			Do not allow signals to interrupt cluster
+			operations.
+atime_quantum=60(*)	OCFS2 will not update atime unless this number
+			of seconds has passed since the last update.
+			Set to zero to always update atime.
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -0,0 +1,267 @@
+Changes since 2.5.0:
+
+--- 
+[recommended]
+
+New helpers: sb_bread(), sb_getblk(), sb_find_get_block(), set_bh(),
+	sb_set_blocksize() and sb_min_blocksize().
+
+Use them.
+
+(sb_find_get_block() replaces 2.4's get_hash_table())
+
+--- 
+[recommended]
+
+New methods: ->alloc_inode() and ->destroy_inode().
+
+Remove inode->u.foo_inode_i
+Declare
+	struct foo_inode_info {
+		/* fs-private stuff */
+		struct inode vfs_inode;
+	};
+	static inline struct foo_inode_info *FOO_I(struct inode *inode)
+	{
+		return list_entry(inode, struct foo_inode_info, vfs_inode);
+	}
+
+Use FOO_I(inode) instead of &inode->u.foo_inode_i;
+
+Add foo_alloc_inode() and foo_destroy_inode() - the former should allocate
+foo_inode_info and return the address of ->vfs_inode, the latter should free
+FOO_I(inode) (see in-tree filesystems for examples).
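+
+A minimal sketch of such a pair, modelled on the in-tree filesystems.  The slab
+cache foo_inode_cachep is a hypothetical name and is assumed to have been
+created at module init time:
+
+	static struct inode *foo_alloc_inode(struct super_block *sb)
+	{
+		struct foo_inode_info *fi;
+
+		fi = kmem_cache_alloc(foo_inode_cachep, GFP_KERNEL);
+		if (!fi)
+			return NULL;
+		/* hand the embedded VFS inode back to the caller */
+		return &fi->vfs_inode;
+	}
+
+	static void foo_destroy_inode(struct inode *inode)
+	{
+		/* FOO_I() maps the VFS inode back to its container */
+		kmem_cache_free(foo_inode_cachep, FOO_I(inode));
+	}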
+
+Make them ->alloc_inode and ->destroy_inode in your super_operations.
+
+Keep in mind that now you need explicit initialization of private data -
+typically in ->read_inode() and after getting an inode from new_inode().
+
+At some point that will become mandatory.
+
+---
+[mandatory]
+
+Change of file_system_type method (->read_super to ->get_sb)
+
+->read_super() is no more.  Ditto for DECLARE_FSTYPE and DECLARE_FSTYPE_DEV.
+
+Turn your foo_read_super() into a function that would return 0 in case of
+success and negative number in case of error (-EINVAL unless you have more
+informative error value to report).  Call it foo_fill_super().  Now declare
+
+int foo_get_sb(struct file_system_type *fs_type,
+	int flags, const char *dev_name, void *data, struct vfsmount *mnt)
+{
+	return get_sb_bdev(fs_type, flags, dev_name, data, foo_fill_super,
+			   mnt);
+}
+
+(or similar with s/bdev/nodev/ or s/bdev/single/, depending on the kind of
+filesystem).
+
+Replace DECLARE_FSTYPE... with explicit initializer and have ->get_sb set as
+foo_get_sb.
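+
+A sketch of such an initializer for a block device based filesystem (the names
+"foo" and foo_fs_type are placeholders; ->kill_sb and FS_REQUIRES_DEV are
+described in the kill_sb entry of this file):
+
+static struct file_system_type foo_fs_type = {
+	.owner		= THIS_MODULE,
+	.name		= "foo",
+	.get_sb		= foo_get_sb,
+	.kill_sb	= kill_block_super,
+	.fs_flags	= FS_REQUIRES_DEV,
+};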
+
+---
+[mandatory]
+
+Locking change: ->s_vfs_rename_sem is taken only by cross-directory renames.
+Most likely there is no need to change anything, but if you relied on
+global exclusion between renames for some internal purpose - you need to
+change your internal locking.  Otherwise exclusion warranties remain the
+same (i.e. parents and victim are locked, etc.).
+
+---
+[informational]
+
+Now we have the exclusion between ->lookup() and directory removal (by
+->rmdir() and ->rename()).  If you used to need that exclusion and do
+it by internal locking (most of filesystems couldn't care less) - you
+can relax your locking.
+
+---
+[mandatory]
+
+->lookup(), ->truncate(), ->create(), ->unlink(), ->mknod(), ->mkdir(),
+->rmdir(), ->link(), ->lseek(), ->symlink(), ->rename()
+and ->readdir() are called without BKL now.  Grab it on entry, drop upon return
+- that will guarantee the same locking you used to have.  If your method or its
+parts do not need BKL - better yet, now you can shift lock_kernel() and
+unlock_kernel() so that they would protect exactly what needs to be
+protected.
+
+---
+[mandatory]
+
+BKL is also moved from around sb operations.  ->write_super() is now called
+without BKL held.  BKL should have been shifted into individual fs sb_op
+functions.  If you don't need it, remove it.
+
+---
+[informational]
+
+check for ->link() target not being a directory is done by callers.  Feel
+free to drop it...
+
+---
+[informational]
+
+->link() callers hold ->i_sem on the object we are linking to.  Some of your
+problems might be over...
+
+---
+[mandatory]
+
+new file_system_type method - kill_sb(superblock).  If you are converting
+an existing filesystem, set it according to ->fs_flags:
+	FS_REQUIRES_DEV		-	kill_block_super
+	FS_LITTER		-	kill_litter_super
+	neither			-	kill_anon_super
+FS_LITTER is gone - just remove it from fs_flags.
+
+---
+[mandatory]
+
+	FS_SINGLE is gone (actually, that had happened back when ->get_sb()
+went in - and hadn't been documented ;-/).  Just remove it from fs_flags
+(and see ->get_sb() entry for other actions).
+
+---
+[mandatory]
+
+->setattr() is called without BKL now.  Caller _always_ holds ->i_sem, so
+watch for ->i_sem-grabbing code that might be used by your ->setattr().
+Callers of notify_change() need ->i_sem now.
+
+---
+[recommended]
+
+New super_block field "struct export_operations *s_export_op" for
+explicit support for exporting, e.g. via NFS.  The structure is fully
+documented at its declaration in include/linux/fs.h, and in
+Documentation/filesystems/Exporting.
+
+Briefly it allows for the definition of decode_fh and encode_fh operations
+to encode and decode filehandles, and allows the filesystem to use
+a standard helper function for decode_fh, and provide file-system specific
+support for this helper, particularly get_parent.
+
+It is planned that this will be required for exporting once the code
+settles down a bit.
+
+[mandatory]
+
+s_export_op is now required for exporting a filesystem.
+isofs, ext2, ext3, reiserfs, fat
+can be used as examples of very different filesystems.
+
+---
+[mandatory]
+
+iget4() and the read_inode2 callback have been superseded by iget5_locked()
+which has the following prototype,
+
+    struct inode *iget5_locked(struct super_block *sb, unsigned long ino,
+				int (*test)(struct inode *, void *),
+				int (*set)(struct inode *, void *),
+				void *data);
+
+'test' is an additional function that can be used when the inode
+number is not sufficient to identify the actual file object. 'set'
+should be a non-blocking function that initializes those parts of a
+newly created inode to allow the test function to succeed. 'data' is
+passed as an opaque value to both test and set functions.
+
+When the inode has been created by iget5_locked(), it will be returned with
+the I_NEW flag set and will still be locked. read_inode has not been
+called so the file system still has to finalize the initialization. Once
+the inode is initialized it must be unlocked by calling unlock_new_inode().
+
+The filesystem is responsible for setting (and possibly testing) i_ino
+when appropriate. There is also a simpler iget_locked function that
+just takes the superblock and inode number as arguments and does the
+test and set for you.
+
+e.g.
+       inode = iget_locked(sb, ino);
+       if (inode->i_state & I_NEW) {
+               read_inode_from_disk(inode);
+               unlock_new_inode(inode);
+       }
+
+---
+[recommended]
+
+->getattr() finally getting used.  See instances in nfs, minix, etc.
+
+---
+[mandatory]
+
+->revalidate() is gone.  If your filesystem had it - provide ->getattr()
+and let it call whatever you had as ->revalidate() + (for symlinks that
+had ->revalidate()) add calls in ->follow_link()/->readlink().
+
+---
+[mandatory]
+
+->d_parent changes are not protected by BKL anymore.  Read access is safe
+if at least one of the following is true:
+	* filesystem has no cross-directory rename()
+	* dcache_lock is held
+	* we know that parent had been locked (e.g. we are looking at
+	  ->d_parent of ->lookup() argument).
+	* we are called from ->rename().
+	* the child's ->d_lock is held
+Audit your code and add locking if needed.  Notice that any place that is
+not protected by the conditions above is risky even in the old tree - you
+had been relying on BKL and that's prone to screwups.  Old tree had quite
+a few holes of that kind - unprotected access to ->d_parent leading to
+anything from oops to silent memory corruption.
+
+---
+[mandatory]
+
+	FS_NOMOUNT is gone.  If you use it - just set MS_NOUSER in flags
+(see rootfs for one kind of solution and bdev/socket/pipe for another).
+
+---
+[recommended]
+
+	Use bdev_read_only(bdev) instead of is_read_only(kdev).  The latter
+is still alive, but only because of the mess in drivers/s390/block/dasd.c.
+As soon as it gets fixed is_read_only() will die.
+
+---
+[mandatory]
+
+->permission() is called without BKL now. Grab it on entry, drop upon
+return - that will guarantee the same locking you used to have.  If
+your method or its parts do not need BKL - better yet, now you can
+shift lock_kernel() and unlock_kernel() so that they would protect
+exactly what needs to be protected.
+
+---
+[mandatory]
+
+->statfs() is now called without BKL held.  BKL should have been
+shifted into individual fs sb_op functions where it's not clear that
+it's safe to remove it.  If you don't need it, remove it.
+
+---
+[mandatory]
+
+	is_read_only() is gone; use bdev_read_only() instead.
+
+---
+[mandatory]
+
+	destroy_buffers() is gone; use invalidate_bdev().
+
+---
+[mandatory]
+
+	fsync_dev() is gone; use fsync_bdev().  NOTE: lvm breakage is
+deliberate; as soon as struct block_device * is propagated in a reasonable
+way by that code fixing will become trivial; until then nothing can be
+done.
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/ramfs-rootfs-initramfs.txt
+++ b/Documentation/filesystems/ramfs-rootfs-initramfs.txt
@@ -0,0 +1,355 @@
+ramfs, rootfs and initramfs
+October 17, 2005
+Rob Landley <rob@landley.net>
+=============================
+
+What is ramfs?
+--------------
+
+Ramfs is a very simple filesystem that exports Linux's disk caching
+mechanisms (the page cache and dentry cache) as a dynamically resizable
+ram-based filesystem.
+
+Normally all files are cached in memory by Linux.  Pages of data read from
+backing store (usually the block device the filesystem is mounted on) are kept
+around in case they're needed again, but marked as clean (freeable) in case the
+Virtual Memory system needs the memory for something else.  Similarly, data
+written to files is marked clean as soon as it has been written to backing
+store, but kept around for caching purposes until the VM reallocates the
+memory.  A similar mechanism (the dentry cache) greatly speeds up access to
+directories.
+
+With ramfs, there is no backing store.  Files written into ramfs allocate
+dentries and page cache as usual, but there's nowhere to write them to.
+This means the pages are never marked clean, so they can't be freed by the
+VM when it's looking to recycle memory.
+
+The amount of code required to implement ramfs is tiny, because all the
+work is done by the existing Linux caching infrastructure.  Basically,
+you're mounting the disk cache as a filesystem.  Because of this, ramfs is not
+an optional component removable via menuconfig, since there would be negligible
+space savings.
+
+ramfs and ramdisk:
+------------------
+
+The older "ram disk" mechanism created a synthetic block device out of
+an area of ram and used it as backing store for a filesystem.  This block
+device was of fixed size, so the filesystem mounted on it was of fixed
+size.  Using a ram disk also required unnecessarily copying memory from the
+fake block device into the page cache (and copying changes back out), as well
+as creating and destroying dentries.  Plus it needed a filesystem driver
+(such as ext2) to format and interpret this data.
+
+Compared to ramfs, this wastes memory (and memory bus bandwidth), creates
+unnecessary work for the CPU, and pollutes the CPU caches.  (There are tricks
+to avoid this copying by playing with the page tables, but they're unpleasantly
+complicated and turn out to be about as expensive as the copying anyway.)
+More to the point, all the work ramfs is doing has to happen _anyway_,
+since all file access goes through the page and dentry caches.  The ram
+disk is simply unnecessary; ramfs is internally much simpler.
+
+Another reason ramdisks are semi-obsolete is that the introduction of
+loopback devices offered a more flexible and convenient way to create
+synthetic block devices, now from files instead of from chunks of memory.
+See losetup (8) for details.
+
+ramfs and tmpfs:
+----------------
+
+One downside of ramfs is you can keep writing data into it until you fill
+up all memory, and the VM can't free it because the VM thinks that files
+should get written to backing store (rather than swap space), but ramfs hasn't
+got any backing store.  Because of this, only root (or a trusted user) should
+be allowed write access to a ramfs mount.
+
+A ramfs derivative called tmpfs was created to add size limits, and the ability
+to write the data to swap space.  Normal users can be allowed write access to
+tmpfs mounts.  See Documentation/filesystems/tmpfs.txt for more information.
+
+What is rootfs?
+---------------
+
+Rootfs is a special instance of ramfs (or tmpfs, if that's enabled), which is
+always present in 2.6 systems.  You can't unmount rootfs for approximately the
+same reason you can't kill the init process; rather than having special code
+to check for and handle an empty list, it's smaller and simpler for the kernel
+to just make sure certain lists can't become empty.
+
+Most systems just mount another filesystem over rootfs and ignore it.  The
+amount of space an empty instance of ramfs takes up is tiny.
+
+What is initramfs?
+------------------
+
+All 2.6 Linux kernels contain a gzipped "cpio" format archive, which is
+extracted into rootfs when the kernel boots up.  After extracting, the kernel
+checks to see if rootfs contains a file "init", and if so it executes it as PID
+1.  If found, this init process is responsible for bringing the system the
+rest of the way up, including locating and mounting the real root device (if
+any).  If rootfs does not contain an init program after the embedded cpio
+archive is extracted into it, the kernel will fall through to the older code
+to locate and mount a root partition, then exec some variant of /sbin/init
+out of that.
+
+All this differs from the old initrd in several ways:
+
+  - The old initrd was always a separate file, while the initramfs archive is
+    linked into the linux kernel image.  (The directory linux-*/usr is devoted
+    to generating this archive during the build.)
+
+  - The old initrd file was a gzipped filesystem image (in some file format,
+    such as ext2, that needed a driver built into the kernel), while the new
+    initramfs archive is a gzipped cpio archive (like tar only simpler,
+    see cpio(1) and Documentation/early-userspace/buffer-format.txt).  The
+    kernel's cpio extraction code is not only extremely small, it's also
+    __init data that can be discarded during the boot process.
+
+  - The program run by the old initrd (which was called /initrd, not /init) did
+    some setup and then returned to the kernel, while the init program from
+    initramfs is not expected to return to the kernel.  (If /init needs to hand
+    off control it can overmount / with a new root device and exec another init
+    program.  See the switch_root utility, below.)
+
+  - When switching to another root device, initrd would pivot_root and then
+    umount the ramdisk.  But initramfs is rootfs: you can neither pivot_root
+    rootfs, nor unmount it.  Instead delete everything out of rootfs to
+    free up the space (find -xdev / -exec rm '{}' ';'), overmount rootfs
+    with the new root (cd /newmount; mount --move . /; chroot .), attach
+    stdin/stdout/stderr to the new /dev/console, and exec the new init.
+
+    Since this is a remarkably persnickety process (and involves deleting
+    commands before you can run them), the klibc package introduced a helper
+    program (utils/run_init.c) to do all this for you.  Most other packages
+    (such as busybox) have named this command "switch_root".
+
+Populating initramfs:
+---------------------
+
+The 2.6 kernel build process always creates a gzipped cpio format initramfs
+archive and links it into the resulting kernel binary.  By default, this
+archive is empty (consuming 134 bytes on x86).
+
+The config option CONFIG_INITRAMFS_SOURCE (for some reason buried under
+devices->block devices in menuconfig, and living in usr/Kconfig) can be used
+to specify a source for the initramfs archive, which will automatically be
+incorporated into the resulting binary.  This option can point to an existing
+gzipped cpio archive, a directory containing files to be archived, or a text
+file specification such as the following example:
+
+  dir /dev 755 0 0
+  nod /dev/console 644 0 0 c 5 1
+  nod /dev/loop0 644 0 0 b 7 0
+  dir /bin 755 1000 1000
+  slink /bin/sh busybox 777 0 0
+  file /bin/busybox initramfs/busybox 755 0 0
+  dir /proc 755 0 0
+  dir /sys 755 0 0
+  dir /mnt 755 0 0
+  file /init initramfs/init.sh 755 0 0
+
+Run "usr/gen_init_cpio" (after the kernel build) to get a usage message
+documenting the above file format.
+
+One advantage of the configuration file is that root access is not required to
+set permissions or create device nodes in the new archive.  (Note that those
+two example "file" entries expect to find files named "init.sh" and "busybox" in
+a directory called "initramfs", under the linux-2.6.* directory.  See
+Documentation/early-userspace/README for more details.)
+
+The kernel does not depend on external cpio tools.  If you specify a
+directory instead of a configuration file, the kernel's build infrastructure
+creates a configuration file from that directory (usr/Makefile calls
+scripts/gen_initramfs_list.sh), and proceeds to package up that directory
+using the config file (by feeding it to usr/gen_init_cpio, which is created
+from usr/gen_init_cpio.c).  The kernel's build-time cpio creation code is
+entirely self-contained, and the kernel's boot-time extractor is also
+(obviously) self-contained.
+
+The one thing you might need external cpio utilities installed for is creating
+or extracting your own preprepared cpio files to feed to the kernel build
+(instead of a config file or directory).
+
+The following command line can extract a cpio image (created either by the
+above script or by the kernel build) back into its component files:
+
+  cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames
+
+The following shell script can create a prebuilt cpio archive you can
+use in place of the above config file:
+
+  #!/bin/sh
+
+  # Copyright 2006 Rob Landley <rob@landley.net> and TimeSys Corporation.
+  # Licensed under GPL version 2
+
+  if [ $# -ne 2 ]
+  then
+    echo "usage: mkinitramfs directory imagename.cpio.gz"
+    exit 1
+  fi
+
+  if [ -d "$1" ]
+  then
+    echo "creating $2 from $1"
+    (cd "$1"; find . | cpio -o -H newc | gzip) > "$2"
+  else
+    echo "First argument must be a directory"
+    exit 1
+  fi
+
+Note: The cpio man page contains some bad advice that will break your initramfs
+archive if you follow it.  It says "A typical way to generate the list
+of filenames is with the find command; you should give find the -depth option
+to minimize problems with permissions on directories that are unwritable or not
+searchable."  Don't do this when creating initramfs.cpio.gz images, it won't
+work.  The Linux kernel cpio extractor won't create files in a directory that
+doesn't exist, so the directory entries must go before the files that go in
+those directories.  The above script gets them in the right order.
+
+External initramfs images:
+--------------------------
+
+If the kernel has initrd support enabled, an external cpio.gz archive can also
+be passed into a 2.6 kernel in place of an initrd.  In this case, the kernel
+will autodetect the type (initramfs, not initrd) and extract the external cpio
+archive into rootfs before trying to run /init.
+
+This has the memory efficiency advantages of initramfs (no ramdisk block
+device) but the separate packaging of initrd (which is nice if you have
+non-GPL code you'd like to run from initramfs, without conflating it with
+the GPL licensed Linux kernel binary).
+
+It can also be used to supplement the kernel's built-in initramfs image.  The
+files in the external archive will overwrite any conflicting files in
+the built-in initramfs archive.  Some distributors also prefer to customize
+a single kernel image with task-specific initramfs images, without recompiling.
+
+Contents of initramfs:
+----------------------
+
+An initramfs archive is a complete self-contained root filesystem for Linux.
+If you don't already understand what shared libraries, devices, and paths
+you need to get a minimal root filesystem up and running, here are some
+references:
+http://www.tldp.org/HOWTO/Bootdisk-HOWTO/
+http://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html
+http://www.linuxfromscratch.org/lfs/view/stable/
+
+The "klibc" package (http://www.kernel.org/pub/linux/libs/klibc) is
+designed to be a tiny C library to statically link early userspace
+code against, along with some related utilities.  It is BSD licensed.
+
+I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net)
+myself.  These are LGPL and GPL, respectively.  (A self-contained initramfs
+package is planned for the busybox 1.3 release.)
+
+In theory you could use glibc, but that's not well suited for small embedded
+uses like this.  (A "hello world" program statically linked against glibc is
+over 400k.  With uClibc it's 7k.  Also note that glibc dlopens libnss to do
+name lookups, even when otherwise statically linked.)
+
+A good first step is to get initramfs to run a statically linked "hello world"
+program as init, and test it under an emulator like qemu (www.qemu.org) or
+User Mode Linux, like so:
+
+  cat > hello.c << EOF
+  #include <stdio.h>
+  #include <unistd.h>
+
+  int main(int argc, char *argv[])
+  {
+    printf("Hello world!\n");
+    sleep(999999999);
+  }
+  EOF
+  gcc -static hello.c -o init
+  echo init | cpio -o -H newc | gzip > test.cpio.gz
+  # Testing external initramfs using the initrd loading mechanism.
+  qemu -kernel /boot/vmlinuz -initrd test.cpio.gz /dev/zero
+
+When debugging a normal root filesystem, it's nice to be able to boot with
+"init=/bin/sh".  The initramfs equivalent is "rdinit=/bin/sh", and it's
+just as useful.
+
+Why cpio rather than tar?
+-------------------------
+
+This decision was made back in December, 2001.  The discussion started here:
+
+  http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1538.html
+
+And spawned a second thread (specifically on tar vs cpio), starting here:
+
+  http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1587.html
+
+The quick and dirty summary version (which is no substitute for reading
+the above threads) is:
+
+1) cpio is a standard.  It's decades old (from the AT&T days), and already
+   widely used on Linux (inside RPM, Red Hat's device driver disks).  Here's
+   a Linux Journal article about it from 1996:
+
+      http://www.linuxjournal.com/article/1213
+
+   It's not as popular as tar because the traditional cpio command line tools
+   require _truly_hideous_ command line arguments.  But that says nothing
+   either way about the archive format, and there are alternative tools,
+   such as:
+
+     http://freshmeat.net/projects/afio/
+
+2) The cpio archive format chosen by the kernel is simpler and cleaner (and
+   thus easier to create and parse) than any of the (literally dozens of)
+   various tar archive formats.  The complete initramfs archive format is
+   explained in buffer-format.txt, created in usr/gen_init_cpio.c, and
+   extracted in init/initramfs.c.  All three together come to less than 26k
+   total of human-readable text.
+
+3) The GNU project standardizing on tar is approximately as relevant as
+   Windows standardizing on zip.  Linux is not part of either, and is free
+   to make its own technical decisions.
+
+4) Since this is a kernel internal format, it could easily have been
+   something brand new.  The kernel provides its own tools to create and
+   extract this format anyway.  Using an existing standard was preferable,
+   but not essential.
+
+5) Al Viro made the decision (quote: "tar is ugly as hell and not going to be
+   supported on the kernel side"):
+
+      http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1540.html
+
+   explained his reasoning:
+
+      http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1550.html
+      http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1638.html
+
+   and, most importantly, designed and implemented the initramfs code.
+
+Future directions:
+------------------
+
+Today (2.6.16), initramfs is always compiled in, but not always used.  The
+kernel falls back to legacy boot code that is reached only if initramfs does
+not contain an /init program.  The fallback is legacy code, there to ensure a
+smooth transition and to allow early boot functionality to gradually move to
+"early userspace" (i.e. initramfs).
+
+The move to early userspace is necessary because finding and mounting the real
+root device is complex.  Root partitions can span multiple devices (raid or
+separate journal).  They can be out on the network (requiring dhcp, setting a
+specific mac address, logging into a server, etc).  They can live on removable
+media, with dynamically allocated major/minor numbers and persistent naming
+issues requiring a full udev implementation to sort out.  They can be
+compressed, encrypted, copy-on-write, loopback mounted, strangely partitioned,
+and so on.
+
+This kind of complexity (which inevitably includes policy) is rightly handled
+in userspace.  Both klibc and busybox/uClibc are working on simple initramfs
+packages to drop into a kernel build.
+
+The klibc package has now been accepted into Andrew Morton's 2.6.17-mm tree.
+The kernel's current early boot code (partition detection, etc) will probably
+be migrated into a default initramfs, automatically created and used by the
+kernel build.
--- a/Documentation/filesystems/relay.txt
+++ b/Documentation/filesystems/relay.txt
@@ -0,0 +1,484 @@
+relay interface (formerly relayfs)
+==================================
+
+The relay interface provides a means for kernel applications to
+efficiently log and transfer large quantities of data from the kernel
+to userspace via user-defined 'relay channels'.
+
+A 'relay channel' is a kernel->user data relay mechanism implemented
+as a set of per-cpu kernel buffers ('channel buffers'), each
+represented as a regular file ('relay file') in user space.  Kernel
+clients write into the channel buffers using efficient write
+functions; these automatically log into the current cpu's channel
+buffer.  User space applications mmap() or read() from the relay files
+and retrieve the data as it becomes available.  The relay files
+themselves are files created in a host filesystem, e.g. debugfs, and
+are associated with the channel buffers using the API described below.
+
+The format of the data logged into the channel buffers is completely
+up to the kernel client; the relay interface does however provide
+hooks which allow kernel clients to impose some structure on the
+buffer data.  The relay interface doesn't implement any form of data
+filtering - this also is left to the kernel client.  The purpose is to
+keep things as simple as possible.
+
+This document provides an overview of the relay interface API.  The
+details of the function parameters are documented along with the
+functions in the relay interface code - please see that for details.
+
+Semantics
+=========
+
+Each relay channel has one buffer per CPU, each buffer has one or more
+sub-buffers.  Messages are written to the first sub-buffer until it is
+too full to contain a new message, in which case it is written to
+the next (if available).  Messages are never split across sub-buffers.
+At this point, userspace can be notified so it empties the first
+sub-buffer, while the kernel continues writing to the next.
+
+When notified that a sub-buffer is full, the kernel knows how many
+bytes of it are padding i.e. unused space occurring because a complete
+message couldn't fit into a sub-buffer.  Userspace can use this
+knowledge to copy only valid data.
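+
+The placement rule can be pictured with a small userspace model.  It is a
+sketch under simplifying assumptions: struct toy_buf and toy_write() are
+hypothetical names, only byte counts are tracked (no data is stored), and
+the next sub-buffer is assumed to be always available.
+
```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_SUBBUFS	8

struct toy_buf {
	size_t subbuf_size;
	size_t n_subbufs;
	size_t cur;		/* index of the current sub-buffer */
	size_t offset;		/* bytes used in the current sub-buffer */
	size_t padding[TOY_MAX_SUBBUFS]; /* unused tail of closed sub-buffers */
};

/*
 * Account for one message of 'len' bytes.  Returns the index of the
 * sub-buffer the message landed in, or -1 if it can never fit.
 */
int toy_write(struct toy_buf *buf, size_t len)
{
	assert(buf->n_subbufs > 0 && buf->n_subbufs <= TOY_MAX_SUBBUFS);

	if (len > buf->subbuf_size)
		return -1;

	if (buf->offset + len > buf->subbuf_size) {
		/* Close the current sub-buffer; the unused tail is padding. */
		buf->padding[buf->cur] = buf->subbuf_size - buf->offset;
		buf->cur = (buf->cur + 1) % buf->n_subbufs;
		buf->offset = 0;
	}
	/* The message is never split: it goes wholly into buf->cur. */
	buf->offset += len;
	return (int)buf->cur;
}
```
+
+With 10-byte sub-buffers, two 6-byte messages end up in sub-buffers 0 and
+1, leaving 4 bytes of padding at the end of sub-buffer 0.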
+
+After copying it, userspace can notify the kernel that a sub-buffer
+has been consumed.
+
+A relay channel can operate in a mode where it will overwrite data not
+yet collected by userspace, and not wait for it to be consumed.
+
+The relay channel itself does not provide for communication of such
+data between userspace and kernel, allowing the kernel side to remain
+simple and not impose a single interface on userspace.  It does
+provide a set of examples and a separate helper though, described
+below.
+
+The read() interface both removes padding and internally consumes the
+read sub-buffers; thus in cases where read(2) is being used to drain
+the channel buffers, special-purpose communication between kernel and
+user isn't necessary for basic operation.
+
+One of the major goals of the relay interface is to provide a low
+overhead mechanism for conveying kernel data to userspace.  While the
+read() interface is easy to use, it's not as efficient as the mmap()
+approach; the example code attempts to make the tradeoff between the
+two approaches as small as possible.
+
+klog and relay-apps example code
+================================
+
+The relay interface itself is ready to use, but to make things easier,
+a couple of simple utility functions and a set of examples are provided.
+
+The relay-apps example tarball, available on the relay sourceforge
+site, contains a set of self-contained examples, each consisting of a
+pair of .c files containing boilerplate code for each of the user and
+kernel sides of a relay application.  When combined these two sets of
+boilerplate code provide glue to easily stream data to disk, without
+having to bother with mundane housekeeping chores.
+
+The 'klog debugging functions' patch (klog.patch in the relay-apps
+tarball) provides a couple of high-level logging functions to the
+kernel which allow writing formatted text or raw data to a channel,
+regardless of whether a channel to write into exists or not, or even
+whether the relay interface is compiled into the kernel or not.  These
+functions allow you to put unconditional 'trace' statements anywhere
+in the kernel or kernel modules; only when there is a 'klog handler'
+registered will data actually be logged (see the klog and kleak
+examples for details).
+
+It is of course possible to use the relay interface from scratch,
+i.e. without using any of the relay-apps example code or klog, but
+you'll have to implement communication between userspace and kernel,
+allowing both to convey the state of buffers (full, empty, amount of
+padding).  The read() interface both removes padding and internally
+consumes the read sub-buffers; thus in cases where read(2) is being
+used to drain the channel buffers, special-purpose communication
+between kernel and user isn't necessary for basic operation.  Things
+such as buffer-full conditions would still need to be communicated via
+some channel though.
+
+klog and the relay-apps examples can be found in the relay-apps
+tarball on http://relayfs.sourceforge.net
+
+The relay interface user space API
+==================================
+
+The relay interface implements basic file operations for user space
+access to relay channel buffer data.  Here are the file operations
+that are available and some comments regarding their behavior:
+
+open()	    enables user to open an _existing_ channel buffer.
+
+mmap()      results in channel buffer being mapped into the caller's
+	    memory space. Note that you can't do a partial mmap - you
+	    must map the entire file, which is NRBUF * SUBBUFSIZE.
+
+read()      read the contents of a channel buffer.  The bytes read are
+	    'consumed' by the reader, i.e. they won't be available
+	    again to subsequent reads.  If the channel is being used
+	    in no-overwrite mode (the default), it can be read at any
+	    time even if there's an active kernel writer.  If the
+	    channel is being used in overwrite mode and there are
+	    active channel writers, results may be unpredictable -
+	    users should make sure that all logging to the channel has
+	    ended before using read() with overwrite mode.  Sub-buffer
+	    padding is automatically removed and will not be seen by
+	    the reader.
+
+sendfile()  transfer data from a channel buffer to an output file
+	    descriptor. Sub-buffer padding is automatically removed
+	    and will not be seen by the reader.
+
+poll()      POLLIN/POLLRDNORM/POLLERR supported.  User applications are
+	    notified when sub-buffer boundaries are crossed.
+
+close()     decrements the channel buffer's refcount.  When the refcount
+	    reaches 0, i.e. when no process or kernel client has the
+	    buffer open, the channel buffer is freed.
+
+In order for a user application to make use of relay files, the
+host filesystem must be mounted.  For example,
+
+	mount -t debugfs debugfs /debug
+
+NOTE:   the host filesystem doesn't need to be mounted for kernel
+	clients to create or use channels - it only needs to be
+	mounted when user space applications need access to the buffer
+	data.
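+
+Because read() strips padding and consumes sub-buffers internally, a
+userspace reader can be little more than a copy loop.  The following is a
+minimal sketch, not part of the relay code: drain_fd() is a hypothetical
+helper, and it assumes read() returns 0 once no more data is available.
+
```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Copy everything currently readable from in_fd to out_fd.
 * Returns the number of bytes copied, or -1 on error.
 */
long drain_fd(int in_fd, int out_fd)
{
	char chunk[4096];
	long total = 0;

	assert(in_fd >= 0 && out_fd >= 0);

	for (;;) {
		ssize_t n = read(in_fd, chunk, sizeof(chunk));

		if (n == 0)
			break;		/* nothing more to read */
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		/* write() may be partial, so loop until the chunk is out */
		for (ssize_t done = 0; done < n; ) {
			ssize_t w = write(out_fd, chunk + done,
					  (size_t)(n - done));

			if (w < 0) {
				if (errno == EINTR)
					continue;
				return -1;
			}
			done += w;
		}
		total += n;
	}
	return total;
}
```
+
+With debugfs mounted on /debug as above, a caller would open a relay file
+such as /debug/cpu0 with O_RDONLY and pass the descriptor as in_fd,
+typically after poll() has reported POLLIN.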
+
+
+The relay interface kernel API
+==============================
+
+Here's a summary of the API the relay interface provides to in-kernel clients:
+
+  channel management functions:
+
+    relay_open(base_filename, parent, subbuf_size, n_subbufs,
+               callbacks, private_data)
+    relay_close(chan)
+    relay_flush(chan)
+    relay_reset(chan)
+
+  channel management typically called on instigation of userspace:
+
+    relay_subbufs_consumed(chan, cpu, subbufs_consumed)
+
+  write functions:
+
+    relay_write(chan, data, length)
+    __relay_write(chan, data, length)
+    relay_reserve(chan, length)
+
+  callbacks:
+
+    subbuf_start(buf, subbuf, prev_subbuf, prev_padding)
+    buf_mapped(buf, filp)
+    buf_unmapped(buf, filp)
+    create_buf_file(filename, parent, mode, buf, is_global)
+    remove_buf_file(dentry)
+
+  helper functions:
+
+    relay_buf_full(buf)
+    subbuf_start_reserve(buf, length)
+
+
+Creating a channel
+------------------
+
+relay_open() is used to create a channel, along with its per-cpu
+channel buffers.  Each channel buffer will have an associated file
+created for it in the host filesystem, which can be mmapped or
+read from in user space.  The files are named basename0...basenameN-1
+where N is the number of online cpus, and by default will be created
+in the root of the filesystem (if the parent param is NULL).  If you
+want a directory structure to contain your relay files, you should
+create it using the host filesystem's directory creation function,
+e.g. debugfs_create_dir(), and pass the parent directory to
+relay_open().  Users are responsible for cleaning up any directory
+structure they create, when the channel is closed - again the host
+filesystem's directory removal functions should be used for that,
+e.g. debugfs_remove().
+
+In order for a channel to be created and the host filesystem's files
+associated with its channel buffers, the user must provide definitions
+for two callback functions, create_buf_file() and remove_buf_file().
+create_buf_file() is called once for each per-cpu buffer from
+relay_open() and allows the user to create the file which will be used
+to represent the corresponding channel buffer.  The callback should
+return the dentry of the file created to represent the channel buffer.
+remove_buf_file() must also be defined; it's responsible for deleting
+the file(s) created in create_buf_file() and is called during
+relay_close().
+
+Here are some typical definitions for these callbacks, in this case
+using debugfs:
+
+/*
+ * create_buf_file() callback.  Creates relay file in debugfs.
+ */
+static struct dentry *create_buf_file_handler(const char *filename,
+                                              struct dentry *parent,
+                                              int mode,
+                                              struct rchan_buf *buf,
+                                              int *is_global)
+{
+        return debugfs_create_file(filename, mode, parent, buf,
+	                           &relay_file_operations);
+}
+
+/*
+ * remove_buf_file() callback.  Removes relay file from debugfs.
+ */
+static int remove_buf_file_handler(struct dentry *dentry)
+{
+        debugfs_remove(dentry);
+
+        return 0;
+}
+
+/*
+ * relay interface callbacks
+ */
+static struct rchan_callbacks relay_callbacks =
+{
+        .create_buf_file = create_buf_file_handler,
+        .remove_buf_file = remove_buf_file_handler,
+};
+
+And an example relay_open() invocation using them:
+
+  chan = relay_open("cpu", NULL, SUBBUF_SIZE, N_SUBBUFS, &relay_callbacks, NULL);
+
+If the create_buf_file() callback fails, or isn't defined, channel
+creation and thus relay_open() will fail.
+
+The total size of each per-cpu buffer is calculated by multiplying the
+number of sub-buffers by the sub-buffer size passed into relay_open().
+The idea behind sub-buffers is that they're basically an extension of
+double-buffering to N buffers, and they also allow applications to
+easily implement random-access-on-buffer-boundary schemes, which can
+be important for some high-volume applications.  The number and size
+of sub-buffers is completely dependent on the application and even for
+the same application, different conditions will warrant different
+values for these parameters at different times.  Typically, the right
+values to use are best decided after some experimentation; in general,
+though, it's safe to assume that having only 1 sub-buffer is a bad
+idea - you're guaranteed to either overwrite data or lose events
+depending on the channel mode being used.
+
+The create_buf_file() implementation can also be defined in such a way
+as to allow the creation of a single 'global' buffer instead of the
+default per-cpu set.  This can be useful for applications interested
+mainly in seeing the relative ordering of system-wide events without
+the need to bother with saving explicit timestamps for the purpose of
+merging/sorting per-cpu files in a postprocessing step.
+
+To have relay_open() create a global buffer, the create_buf_file()
+implementation should set the value of the is_global outparam to a
+non-zero value in addition to creating the file that will be used to
+represent the single buffer.  In the case of a global buffer,
+create_buf_file() and remove_buf_file() will be called only once.  The
+normal channel-writing functions, e.g. relay_write(), can still be
+used - writes from any cpu will transparently end up in the global
+buffer - but since it is a global buffer, callers should make sure
+they use the proper locking for such a buffer, either by wrapping
+writes in a spinlock, or by copying a write function from relay.h and
+creating a local version that internally does the proper locking.
+
+The private_data passed into relay_open() allows clients to associate
+user-defined data with a channel, and is immediately available
+(including in create_buf_file()) via chan->private_data or
+buf->chan->private_data.
+
+Channel 'modes'
+---------------
+
+relay channels can be used in either of two modes - 'overwrite' or
+'no-overwrite'.  The mode is entirely determined by the implementation
+of the subbuf_start() callback, as described below.  The default if no
+subbuf_start() callback is defined is 'no-overwrite' mode.  If the
+default mode suits your needs, and you plan to use the read()
+interface to retrieve channel data, you can ignore the details of this
+section, as it pertains mainly to mmap() implementations.
+
+In 'overwrite' mode, also known as 'flight recorder' mode, writes
+continuously cycle around the buffer and will never fail, but will
+unconditionally overwrite old data regardless of whether it's actually
+been consumed.  In no-overwrite mode, writes will fail, i.e. data will
+be lost, if the number of unconsumed sub-buffers equals the total
+number of sub-buffers in the channel.  It should be clear that if
+there is no consumer or if the consumer can't consume sub-buffers fast
+enough, data will be lost in either case; the only difference is
+whether data is lost from the beginning or the end of a buffer.
+
+As explained above, a relay channel is made up of one or more
+per-cpu channel buffers, each implemented as a circular buffer
+subdivided into one or more sub-buffers.  Messages are written into
+the current sub-buffer of the channel's current per-cpu buffer via the
+write functions described below.  Whenever a message can't fit into
+the current sub-buffer, because there's no room left for it, the
+client is notified via the subbuf_start() callback that a switch to a
+new sub-buffer is about to occur.  The client uses this callback to 1)
+initialize the next sub-buffer if appropriate 2) finalize the previous
+sub-buffer if appropriate and 3) return a boolean value indicating
+whether or not to actually move on to the next sub-buffer.
+
+To implement 'no-overwrite' mode, the kernel client would provide
+an implementation of the subbuf_start() callback something like the
+following:
+
+static int subbuf_start(struct rchan_buf *buf,
+                        void *subbuf,
+			void *prev_subbuf,
+			unsigned int prev_padding)
+{
+	if (prev_subbuf)
+		*((unsigned *)prev_subbuf) = prev_padding;
+
+	if (relay_buf_full(buf))
+		return 0;
+
+	subbuf_start_reserve(buf, sizeof(unsigned int));
+
+	return 1;
+}
+
+If the current buffer is full, i.e. all sub-buffers remain unconsumed,
+the callback returns 0 to indicate that the buffer switch should not
+occur yet, i.e. until the consumer has had a chance to read the
+current set of ready sub-buffers.  For the relay_buf_full() function
+to make sense, the consumer is responsible for notifying the relay
+interface when sub-buffers have been consumed via
+relay_subbufs_consumed().  Any subsequent attempts to write into the
+buffer will again invoke the subbuf_start() callback with the same
+parameters; only when the consumer has consumed one or more of the
+ready sub-buffers will relay_buf_full() return 0, in which case the
+buffer switch can continue.
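+
+The interplay between the full check and the consumer's notification
+can be modelled in plain user-space C.  The sketch below is only a toy
+model, not kernel code: struct model_buf, its 'stalled' flag and the
+model_*() functions are names invented for this illustration, and the
+full test assumes that a buffer counts as full once the number of
+produced but unconsumed sub-buffers reaches the number of sub-buffers.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for one channel buffer; not the kernel's struct rchan_buf. */
struct model_buf {
	size_t n_subbufs;   /* total sub-buffers in the buffer */
	size_t produced;    /* sub-buffers filled by the writer so far */
	size_t consumed;    /* sub-buffers the consumer has reported as read */
	int stalled;        /* non-zero while a refused switch is pending */
};

/* Full when every sub-buffer holds data the consumer has not released. */
static int model_buf_full(const struct model_buf *buf)
{
	return buf->produced - buf->consumed >= buf->n_subbufs;
}

/*
 * Called when the next message does not fit into the current sub-buffer.
 * Returns 1 if the writer may move on to the next sub-buffer, 0 if the
 * switch is refused (no-overwrite policy).
 */
static int model_switch_subbuf(struct model_buf *buf)
{
	if (!buf->stalled)
		buf->produced++;   /* count the just-filled sub-buffer once */
	buf->stalled = model_buf_full(buf);
	return !buf->stalled;
}

/* What the consumer reports after it has read 'count' sub-buffers. */
static void model_subbufs_consumed(struct model_buf *buf, size_t count)
{
	assert(count <= buf->produced - buf->consumed);
	buf->consumed += count;
}
```

+With two sub-buffers, the first switch succeeds and the second is
+refused until model_subbufs_consumed() has been called; repeated
+attempts in between keep returning 0 without counting the filled
+sub-buffer again.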
+
+The implementation of the subbuf_start() callback for 'overwrite' mode
+would be very similar:
+
+static int subbuf_start(struct rchan_buf *buf,
+                        void *subbuf,
+			void *prev_subbuf,
+			unsigned int prev_padding)
+{
+	if (prev_subbuf)
+		*((unsigned *)prev_subbuf) = prev_padding;
+
+	subbuf_start_reserve(buf, sizeof(unsigned int));
+
+	return 1;
+}
+
+In this case, the relay_buf_full() check is meaningless and the
+callback always returns 1, causing the buffer switch to occur
+unconditionally.  It's also meaningless for the client to use the
+relay_subbufs_consumed() function in this mode, as it's never
+consulted.
+
+The default subbuf_start() implementation, used if the client doesn't
+define any callbacks, or doesn't define the subbuf_start() callback,
+implements the simplest possible 'no-overwrite' mode, i.e. it does
+nothing but return 0 while the buffer is full and 1 otherwise.
+
+Header information can be reserved at the beginning of each sub-buffer
+by calling the subbuf_start_reserve() helper function from within the
+subbuf_start() callback.  This reserved area can be used to store
+whatever information the client wants.  In the example above, room is
+reserved in each sub-buffer to store the padding count for that
+sub-buffer.  This is filled in for the previous sub-buffer in the
+subbuf_start() implementation; the padding value for the previous
+sub-buffer is passed into the subbuf_start() callback along with a
+pointer to the previous sub-buffer, since the padding value isn't
+known until a sub-buffer is filled.  The subbuf_start() callback is
+also called for the first sub-buffer when the channel is opened, to
+give the client a chance to reserve space in it.  In this case the
+previous sub-buffer pointer passed into the callback will be NULL, so
+the client should check the value of the prev_subbuf pointer before
+writing into the previous sub-buffer.
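+
+On the consuming side, the padding count stored by the example
+callbacks tells a reader how much of a finished sub-buffer is real
+data.  The following user-space sketch assumes the layout produced by
+the examples above (one unsigned int of padding at the very start of
+each sub-buffer) and a consumer running on the same machine, so that no
+byte-order conversion is needed.  HDR_SIZE and subbuf_payload() are
+hypothetical names, not part of the relay interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bytes reserved at the start of each sub-buffer by the example callback. */
#define HDR_SIZE sizeof(unsigned int)

/*
 * Return the number of valid payload bytes in a finished sub-buffer and
 * point *payload at the first of them.  subbuf_size is the sub-buffer
 * size the channel was opened with.
 */
static size_t subbuf_payload(const void *subbuf, size_t subbuf_size,
			     const unsigned char **payload)
{
	unsigned int padding;

	/* The header holds the count of unused bytes at the end. */
	memcpy(&padding, subbuf, sizeof(padding));
	assert(padding <= subbuf_size - HDR_SIZE);

	*payload = (const unsigned char *)subbuf + HDR_SIZE;
	return subbuf_size - HDR_SIZE - padding;
}
```

+Note that the header of the sub-buffer currently being written has not
+been filled in yet, so this only makes sense for sub-buffers that the
+writer has already left.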
+
+Writing to a channel
+--------------------
+
+Kernel clients write data into the current cpu's channel buffer using
+relay_write() or __relay_write().  relay_write() is the main logging
+function - it uses local_irq_save() to protect the buffer and should be
+used if you might be logging from interrupt context.  If you know
+you'll never be logging from interrupt context, you can use
+__relay_write(), which only disables preemption.  These functions
+don't return a value, so you can't determine whether or not they
+failed - the assumption is that you wouldn't want to check a return
+value in the fast logging path anyway, and that they'll always succeed
+unless the buffer is full and no-overwrite mode is being used, in
+which case you can detect a failed write in the subbuf_start()
+callback by calling the relay_buf_full() helper function.
+
+relay_reserve() is used to reserve a slot in a channel buffer which
+can be written to later.  This would typically be used in applications
+that need to write directly into a channel buffer without having to
+stage data in a temporary buffer beforehand.  Because the actual write
+may not happen immediately after the slot is reserved, applications
+using relay_reserve() can keep a count of the number of bytes actually
+written, either in space reserved in the sub-buffers themselves or as
+a separate array.  See the 'reserve' example in the relay-apps tarball
+at http://relayfs.sourceforge.net for an example of how this can be
+done.  Because the write is under control of the client and is
+separated from the reserve, relay_reserve() doesn't protect the buffer
+at all - it's up to the client to provide the appropriate
+synchronization when using relay_reserve().
+
+Closing a channel
+-----------------
+
+The client calls relay_close() when it's finished using the channel.
+The channel and its associated buffers are destroyed when there are no
+longer any references to any of the channel buffers.  relay_flush()
+forces a sub-buffer switch on all the channel buffers, and can be used
+to finalize and process the last sub-buffers before the channel is
+closed.
+
+Misc
+----
+
+Some applications may want to keep a channel around and re-use it
+rather than open and close a new channel for each use.  relay_reset()
+can be used for this purpose - it resets a channel to its initial
+state without reallocating channel buffer memory or destroying
+existing mappings.  It should however only be called when it's safe to
+do so, i.e. when the channel isn't currently being written to.
+
+Finally, there are a couple of utility callbacks that can be used for
+different purposes.  buf_mapped() is called whenever a channel buffer
+is mmapped from user space and buf_unmapped() is called when it's
+unmapped.  The client can use this notification to trigger actions
+within the kernel application, such as enabling/disabling logging to
+the channel.
+
+
+Resources
+=========
+
+For news, example code, mailing list, etc. see the relay interface homepage:
+
+    http://relayfs.sourceforge.net
+
+
+Credits
+=======
+
+The ideas and specs for the relay interface came about as a result of
+discussions on tracing involving the following:
+
+Michel Dagenais		<michel.dagenais@polymtl.ca>
+Richard Moore		<richardj_moore@uk.ibm.com>
+Bob Wisniewski		<bob@watson.ibm.com>
+Karim Yaghmour		<karim@opersys.com>
+Tom Zanussi		<zanussi@us.ibm.com>
+
+Also thanks to Hubertus Franke for a lot of useful suggestions and bug
+reports.
--- a/Documentation/filesystems/romfs.txt
+++ b/Documentation/filesystems/romfs.txt
@@ -0,0 +1,187 @@
+ROMFS - ROM FILE SYSTEM
+
+This is a quite dumb, read only filesystem, mainly for initial RAM
+disks of installation disks.  It grew out of the need to have
+modules linked at boot time.  Using this filesystem, you get a very
+similar feature, and even the possibility of a small kernel, with a
+file system which doesn't take up useful memory from the router
+functions in the basement of your office.
+
+For comparison, both the older minix and xiafs (the latter is now
+defunct) filesystems, compiled as module need more than 20000 bytes,
+while romfs is less than a page, about 4000 bytes (assuming i586
+code).  Under the same conditions, the msdos filesystem would need
+about 30K (and does not support device nodes or symlinks), while the
+nfs module with nfsroot is about 57K.  Furthermore, as a somewhat unfair
+comparison, an actual rescue disk used up 3202 blocks with ext2, while
+with romfs, it needed 3079 blocks.
+
+To create such a file system, you'll need a user program named
+genromfs.  It is available via anonymous ftp on sunsite.unc.edu and
+its mirrors, in the /pub/Linux/system/recovery/ directory.
+
+As the name suggests, romfs could also be used (space-efficiently) on
+various read-only media, like (E)EPROM disks, if someone has the
+motivation.. :)
+
+However, the main purpose of romfs is to have a very small kernel,
+which has only this filesystem linked in, and then can load any module
+later, with the current module utilities.  It can also be used to run
+some program to decide if you need SCSI devices, and even IDE or
+floppy drives can be loaded later if you use the "initrd"--initial
+RAM disk--feature of the kernel.  This is hardly a news flash, but
+with romfs, you can even leave out your ext2 or minix or maybe even
+affs filesystem until you really know that you need it.
+
+For example, a distribution boot disk can contain only the cd disk
+drivers (and possibly the SCSI drivers), and the ISO 9660 filesystem
+module.  The kernel can be small enough, since it doesn't have other
+filesystems, like the quite large ext2fs module, which can then be
+loaded off the CD at a later stage of the installation.  Another use
+would be for a recovery disk, when you are reinstalling a workstation
+from the network, and you will have all the tools/modules available
+from a nearby server, so you don't want to carry two disks for this
+purpose, just because it won't fit into ext2.
+
+romfs operates on block devices as you can expect, and the underlying
+structure is very simple.  Every accessible structure begins on 16
+byte boundaries for fast access.  The minimum space a file will take
+is 32 bytes (this is an empty file, with a less than 16 character
+name).  The maximum overhead for any non-empty file is the header, and
+the 16 byte padding for the name and the contents, i.e. 16+14+15 = 45
+bytes.  This is quite rare however, since most file names are longer
+than 3 bytes, and shorter than 15 bytes.
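+
+The size arithmetic above can be written down as a small helper.  This
+is only a sketch of the rule as stated in this text (16 byte header,
+zero terminated name padded to 16 bytes, contents padded to 16 bytes);
+romfs_pad16() and romfs_footprint() are names made up for the example.

```c
#include <stddef.h>

/* Round n up to the next multiple of 16. */
static size_t romfs_pad16(size_t n)
{
	return (n + 15) & ~(size_t)15;
}

/*
 * On-disk bytes used by one file: the fixed 16 byte header, the name
 * including its terminating zero, and the contents, each padded to a
 * 16 byte boundary.  name_len does not count the terminating zero.
 */
static size_t romfs_footprint(size_t name_len, size_t data_size)
{
	return 16 + romfs_pad16(name_len + 1) + romfs_pad16(data_size);
}
```

+An empty file with a name of up to 15 characters comes out at 32 bytes.
+A one byte file with a one character name takes 48 bytes for 3 bytes of
+name and contents, which is the 45 byte worst case mentioned above.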
+
+The layout of the filesystem is the following:
+
+offset	    content
+
+	+---+---+---+---+
+  0	| - | r | o | m |  \
+	+---+---+---+---+	The ASCII representation of those bytes
+  4	| 1 | f | s | - |  /	(i.e. "-rom1fs-")
+	+---+---+---+---+
+  8	|   full size	|	The number of accessible bytes in this fs.
+	+---+---+---+---+
+ 12	|    checksum	|	The checksum of the FIRST 512 BYTES.
+	+---+---+---+---+
+ 16	| volume name	|	The zero terminated name of the volume,
+	:               :	padded to 16 byte boundary.
+	+---+---+---+---+
+ xx	|     file	|
+	:    headers	:
+
+Every multi byte value (32 bit words, I'll use the longwords term from
+now on) must be in big endian order.
+
+The first eight bytes identify the filesystem, even for the casual
+inspector.  After that, in the 3rd longword, it contains the number of
+bytes accessible from the start of this filesystem.  The 4th longword
+is the checksum of the first 512 bytes (or the number of bytes
+accessible, whichever is smaller).  The applied algorithm is the same
+as in the AFFS filesystem, namely a simple sum of the longwords
+(assuming bigendian quantities again).  For details, please consult
+the source.  This algorithm was chosen because although it's not quite
+reliable, it does not require any tables, and it is very simple.
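+
+As an illustration of the superblock layout and the checksum, here is
+a user-space sketch that inspects the start of an image held in memory.
+It rests on one assumption not spelled out in this text: that a valid
+image is built so that the sum of the covered longwords, including the
+checksum word itself, is zero.  The function names are hypothetical,
+and the sketch simply rejects a covered span that is not a multiple of
+four bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read one big-endian longword regardless of host byte order. */
static uint32_t be32_at(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Sum the big-endian longwords in the first len bytes of img. */
static uint32_t romfs_sum(const unsigned char *img, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(len % 4 == 0);
	for (i = 0; i < len; i += 4)
		sum += be32_at(img + i);
	return sum;
}

/* Return 1 if img starts with a plausible romfs superblock, else 0. */
static int romfs_super_ok(const unsigned char *img, size_t avail)
{
	uint32_t full_size;
	size_t span;

	if (avail < 16 || memcmp(img, "-rom1fs-", 8) != 0)
		return 0;

	full_size = be32_at(img + 8);
	/* Checksum covers 512 bytes or the whole fs, whichever is smaller. */
	span = full_size < 512 ? full_size : 512;
	if (span > avail || span % 4 != 0)
		return 0;

	/* Assumption: a valid image sums to zero over the covered span. */
	return romfs_sum(img, span) == 0;
}
```

+Reading the words byte by byte keeps the code independent of the byte
+order of the machine doing the check.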
+
+The following bytes are now part of the file system; each file header
+must begin on a 16 byte boundary.
+
+offset	    content
+
+     	+---+---+---+---+
+  0	| next filehdr|X|	The offset of the next file header
+	+---+---+---+---+	  (zero if no more files)
+  4	|   spec.info	|	Info for directories/hard links/devices
+	+---+---+---+---+
+  8	|     size      |	The size of this file in bytes
+	+---+---+---+---+
+ 12	|   checksum	|	Covering the meta data, including the file
+	+---+---+---+---+	  name, and padding
+ 16	| file name     |	The zero terminated name of the file,
+	:               :	padded to 16 byte boundary
+	+---+---+---+---+
+ xx	| file data	|
+	:		:
+
+Since the file headers always begin at a 16 byte boundary, the lowest
+4 bits would always be zero in the next filehdr pointer.  These four
+bits are used for the mode information.  Bits 0..2 specify the type of
+the file, while bit 3 shows if the file is executable or not.  The
+permissions are assumed to be world readable if this bit is not set,
+and world executable if it is; the exceptions are character and block
+devices, which are never accessible to anyone other than the owner.
+The owner of every file is user and group 0; this should never be a
+problem for the intended use.  The mapping of the 8 possible values to
+file types is the following:
+
+	  mapping		spec.info means
+ 0	hard link	link destination [file header]
+ 1	directory	first file's header
+ 2	regular file	unused, must be zero [MBZ]
+ 3	symbolic link	unused, MBZ (file data is the link content)
+ 4	block device	16/16 bits major/minor number
+ 5	char device		    - " -
+ 6	socket		unused, MBZ
+ 7	fifo		unused, MBZ
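+
+Splitting the first longword of a file header into its parts might
+look like the sketch below.  It takes the word after conversion from
+big endian, treats the low three bits as the type from the table above
+and the one remaining low bit (mask 0x8) as the executable flag.  The
+enum, the struct and romfs_decode_next() are names invented for this
+example.

```c
#include <stdint.h>

/* File types in the order of the table above (values 0..7). */
enum romfs_type {
	ROMFS_HARDLINK, ROMFS_DIR, ROMFS_REG, ROMFS_SYMLINK,
	ROMFS_BLK, ROMFS_CHR, ROMFS_SOCK, ROMFS_FIFO
};

struct romfs_next {
	uint32_t next_offset;   /* offset of the next header, 0 if none */
	enum romfs_type type;   /* bits 0..2 */
	int executable;         /* the remaining low bit, mask 0x8 */
};

/* Split the "next filehdr" word, already converted to host order. */
static struct romfs_next romfs_decode_next(uint32_t word)
{
	struct romfs_next n;

	n.next_offset = word & ~(uint32_t)0xF;
	n.type = (enum romfs_type)(word & 0x7);
	n.executable = (word & 0x8) != 0;
	return n;
}
```

+For instance, the word 0x0000012A decodes to a next header at offset
+0x120, a regular file, with the executable flag set.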
+
+Note that hard links are specifically marked in this filesystem, but
+they will behave as you can expect (i.e. share the inode number).
+Note also that it is your responsibility not to create hard link
+loops, and to create all the . and .. links for directories.  This is
+normally done correctly by the genromfs program.  Please refrain from
+using the executable bits for special purposes on the socket and fifo
+special files, they may have other uses in the future.  Additionally,
+please remember that only regular files, and symlinks are supposed to
+have a nonzero size field; they contain the number of bytes available
+directly after the (padded) file name.
+
+Another thing to note is that romfs works on file headers and data
+aligned to 16 byte boundaries, but most hardware devices and the block
+device drivers are unable to cope with smaller than block-sized data.
+To overcome this limitation, the whole size of the file system must be
+padded to a 1024 byte boundary.
+
+If you have any problems or suggestions concerning this file system,
+please contact me.  However, think twice before wanting me to add
+features and code, because the primary and most important advantage of
+this file system is the small code.  On the other hand, don't be
+alarmed, I'm not getting that much romfs related mail.  Now I can
+understand why Avery wrote poems in the ARCnet docs to get some more
+feedback. :)
+
+romfs has also a mailing list, and to date, it hasn't received any
+traffic, so you are welcome to join it to discuss your ideas. :)
+
+It's run by ezmlm, so you can subscribe to it by sending a message
+to romfs-subscribe@shadow.banki.hu, the content is irrelevant.
+
+Pending issues:
+
+- Permissions and owner information are pretty essential features of a
+Un*x like system, but romfs does not provide the full possibilities.
+I have never found this limiting, but others might.
+
+- The file system is read only, so it can be very small, but in case
+one would want to write _anything_ to a file system, he still needs
+a writable file system, thus negating the size advantages.  Possible
+solutions: implement write access as a compile-time option, or a new,
+similarly small writable filesystem for RAM disks.
+
+- Since the files are only required to have alignment on a 16 byte
+boundary, it is currently possibly suboptimal to read or execute files
+from the filesystem.  It might be resolved by reordering file data to
+have most of it (i.e. except the start and the end) lying at "natural"
+boundaries, thus it would be possible to directly map a big portion of
+the file contents to the mm subsystem.
+
+- Compression might be a useful feature, but memory is quite a
+limiting factor in my eyes.
+
+- Where is it used?
+
+- Does it work on architectures other than Intel and Motorola?
+
+
+Have fun,
+Janos Farkas <chexum@shadow.banki.hu>
--- a/Documentation/filesystems/smbfs.txt
+++ b/Documentation/filesystems/smbfs.txt
@@ -0,0 +1,8 @@
+Smbfs is a filesystem that implements the SMB protocol, which is the
+protocol used by Windows for Workgroups, Windows 95 and Windows NT.
+Smbfs was inspired by Samba, the program written by Andrew Tridgell
+that turns any Unix host into a file server for DOS or Windows clients.
+
+Smbfs is an SMB client, but uses parts of samba for its operation. For
+more info on samba, including documentation, please go to
+http://www.samba.org/ and then on to your nearest mirror.
--- a/Documentation/filesystems/spufs.txt
+++ b/Documentation/filesystems/spufs.txt
@@ -0,0 +1,521 @@
+SPUFS(2)                   Linux Programmer's Manual                  SPUFS(2)
+
+
+
+NAME
+       spufs - the SPU file system
+
+
+DESCRIPTION
+       The SPU file system is used on PowerPC machines that implement the Cell
+       Broadband Engine Architecture in order to access Synergistic  Processor
+       Units (SPUs).
+
+       The file system provides a name space similar to POSIX shared memory or
+       message queues. Users that have write permissions on  the  file  system
+       can use spu_create(2) to establish SPU contexts in the spufs root.
+
+       Every SPU context is represented by a directory containing a predefined
+       set of files. These files can be used for manipulating the state of the
+       logical SPU. Users can change permissions on those files, but not actu-
+       ally add or remove files.
+
+
+MOUNT OPTIONS
+       uid=<uid>
+              set the user owning the mount point, the default is 0 (root).
+
+       gid=<gid>
+              set the group owning the mount point, the default is 0 (root).
+
+
+FILES
+       The files in spufs mostly follow the standard behavior for regular sys-
+       tem  calls like read(2) or write(2), but often support only a subset of
+       the operations supported on regular file systems. This list details the
+       supported  operations  and  the  deviations  from  the behaviour in the
+       respective man pages.
+
+       All files that support the read(2) operation also support readv(2)  and
+       all  files  that support the write(2) operation also support writev(2).
+       All files support the access(2) and stat(2) family of  operations,  but
+       only  the  st_mode,  st_nlink,  st_uid and st_gid fields of struct stat
+       contain reliable information.
+
+       All files support the chmod(2)/fchmod(2) and chown(2)/fchown(2)  opera-
+       tions,  but  will  not be able to grant permissions that contradict the
+       possible operations, e.g. read access on the wbox file.
+
+       The current set of files is:
+
+
+   /mem
+       the contents of the local storage memory  of  the  SPU.   This  can  be
+       accessed  like  a regular shared memory file and contains both code and
+       data in the address space of the SPU.  The possible  operations  on  an
+       open mem file are:
+
+       read(2), pread(2), write(2), pwrite(2), lseek(2)
+              These  operate  as  documented, with the exception that lseek(2),
+              write(2) and pwrite(2) are not supported beyond the end  of  the
+              file. The file size is the size of the local storage of the SPU,
+              which normally is 256 kilobytes.
+
+       mmap(2)
+              Mapping mem into the process address space gives access  to  the
+              SPU  local  storage  within  the  process  address  space.  Only
+              MAP_SHARED mappings are allowed.
+
+
+   /mbox
+       The first SPU to CPU communication mailbox. This file is read-only  and
+       can  be  read  in  units of 32 bits.  The file can only be used in
+       non-blocking mode and even poll() will not block on it.  The  possible
+       operations on an open mbox file are:
+
+       read(2)
+              If  a  count smaller than four is requested, read returns -1 and
+              sets errno to EINVAL.  If there is no data available in the mail
+              box,  the  return  value  is set to -1 and errno becomes EAGAIN.
+              When data has been read successfully, four bytes are  placed  in
+              the data buffer and the value four is returned.
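+
+       A caller might wrap the read(2) behaviour described above as in the
+       following sketch.  read_mbox_word() is a hypothetical helper, not
+       part of any library; it expects a descriptor for an already opened
+       mbox file and maps the documented outcomes to three return values.

```c
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Try to fetch one 32-bit message from an already opened mbox file.
 * Returns 1 and stores the word on success, 0 if the mailbox is empty
 * (read failed with EAGAIN), and -1 on any other failure.
 */
static int read_mbox_word(int fd, uint32_t *word)
{
	/* The count must be at least four, otherwise read gives EINVAL. */
	ssize_t n = read(fd, word, sizeof(*word));

	if (n == (ssize_t)sizeof(*word))
		return 1;
	if (n < 0 && errno == EAGAIN)
		return 0;
	return -1;
}
```

+       Because mbox never blocks, a caller that wants to wait for data
+       would have to retry, or use the ibox file instead.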
+
+
+   /ibox
+       The  second  SPU  to CPU communication mailbox. This file is similar to
+       the first mailbox file, but can be read in blocking I/O mode,  and  the
+       poll  family of system calls can be used to wait for it.  The  possible
+       operations on an open ibox file are:
+
+       read(2)
+              If a count smaller than four is requested, read returns  -1  and
+              sets errno to EINVAL.  If there is no data available in the mail
+              box and the file descriptor has been opened with O_NONBLOCK, the
+              return value is set to -1 and errno becomes EAGAIN.
+
+              If  there  is  no  data  available  in the mail box and the file
+              descriptor has been opened without  O_NONBLOCK,  the  call  will
+              block  until  the  SPU  writes to its interrupt mailbox channel.
+              When data has been read successfully, four bytes are  placed  in
+              the data buffer and the value four is returned.
+
+       poll(2)
+              Poll  on  the  ibox  file returns (POLLIN | POLLRDNORM) whenever
+              data is available for reading.
+
+
+   /wbox
+       The CPU to SPU communication mailbox. It is write-only and can be
+       written in units of 32 bits. If the mailbox is full, write() will
+       block and poll can be used to wait for it becoming empty again.  The
+       possible operations on an open wbox file are:
+
+       write(2)
+              If a count smaller than four is requested, write returns -1  and
+              sets errno to EINVAL.  If there is no space available in the mail
+              box and the file descriptor has been opened with O_NONBLOCK, the
+              return value is set to -1 and errno becomes EAGAIN.
+
+              If  there  is  no  space  available in the mail box and the file
+              descriptor has been opened without  O_NONBLOCK,  the  call  will
+              block  until  the  SPU  reads  from its PPE mailbox channel.
+              When data has been written successfully, the value four is
+              returned.
+
+       poll(2)
+              Poll  on  the  wbox file returns (POLLOUT | POLLWRNORM) whenever
+              space is available for writing.
+
+
+   /mbox_stat
+   /ibox_stat
+   /wbox_stat
+       Read-only files that contain the length of the current queue, i.e.  how
+       many  words  can  be  read  from  mbox or ibox or how many words can be
+       written to wbox without blocking.  The files can be read only in 4-byte
+       units  and  return  a  big-endian  binary integer number.  The possible
+       operations on an open *box_stat file are:
+
+       read(2)
+              If a count smaller than four is requested, read returns  -1  and
+              sets errno to EINVAL.  Otherwise, a four byte value is placed in
+              the data buffer, containing the number of elements that  can  be
+              read  from  (for  mbox_stat  and  ibox_stat)  or written to (for
+              wbox_stat) the respective mail box without blocking or resulting
+              in EAGAIN.
+
+
+   /npc
+   /decr
+   /decr_status
+   /spu_tag_mask
+   /event_mask
+   /srr0
+       Internal  registers  of  the SPU. The representation is an ASCII string
+       with the numeric value of the respective register.  These
+       can  be  used in read/write mode for debugging, but normal operation of
+       programs should not rely on them because access to any of  them  except
+       npc requires an SPU context save and is therefore very inefficient.
+
+       The contents of these files are:
+
+       npc                 Next Program Counter
+
+       decr                SPU Decrementer
+
+       decr_status         Decrementer Status
+
+       spu_tag_mask        MFC tag mask for SPU DMA
+
+       event_mask          Event mask for SPU interrupts
+
+       srr0                Interrupt Return address register
+
+
+       The   possible   operations   on   an   open  npc,  decr,  decr_status,
+       spu_tag_mask, event_mask or srr0 file are:
+
+       read(2)
+              When the count supplied to the read call  is  shorter  than  the
+              required  length for the pointer value plus a newline character,
+              subsequent reads from the same file descriptor  will  result  in
+              completing  the string, regardless of changes to the register by
+              a running SPU task.  When a complete string has been  read,  all
+              subsequent read operations will return zero bytes and a new file
+              descriptor needs to be opened to read the value again.
+
+       write(2)
+              A write operation on the file results in setting the register to
+              the  value  given  in  the string. The string is parsed from the
+              beginning to the first non-numeric character or the end  of  the
+              buffer.  Subsequent writes to the same file descriptor overwrite
+              the previous setting.
+
+
+   /fpcr
+       This file gives access to the Floating Point Status and Control  Regis-
+       ter as a four byte long file. The operations on the fpcr file are:
+
+       read(2)
+              If  a  count smaller than four is requested, read returns -1 and
+              sets errno to EINVAL.  Otherwise, a four byte value is placed in
+              the data buffer, containing the current value of the fpcr regis-
+              ter.
+
+       write(2)
+              If a count smaller than four is requested, write returns -1  and
+              sets  errno  to  EINVAL.  Otherwise, a four byte value is copied
+              from the data buffer, updating the value of the fpcr register.
+
+
+   /signal1
+   /signal2
+       The two signal notification channels of an SPU.  These  are  read-write
+       files  that  operate  on  a 32 bit word.  Writing to one of these files
+       triggers an interrupt on the SPU.  The  value  written  to  the  signal
+       files can be read from the SPU through a channel read or from host user
+       space through the file.  After the value has been read by the  SPU,  it
+       is  reset  to zero.  The possible operations on an open signal1 or sig-
+       nal2 file are:
+
+       read(2)
+              If a count smaller than four is requested, read returns  -1  and
+              sets errno to EINVAL.  Otherwise, a four byte value is placed in
+              the data buffer, containing the current value of  the  specified
+              signal notification register.
+
+       write(2)
+              If  a count smaller than four is requested, write returns -1 and
+              sets errno to EINVAL.  Otherwise, a four byte  value  is  copied
+              from the data buffer, updating the value of the specified signal
+              notification register.  The signal  notification  register  will
+              either be replaced with the input data or will be updated to the
+              bitwise OR of the old value and the input data, depending on the
+              contents  of  the  signal1_type,  or  signal2_type respectively,
+              file.
+
+
+   /signal1_type
+   /signal2_type
+       These two files change the behavior of the signal1 and signal2
+       notification files.  They contain a numerical ASCII string which is
+       read as either "1" or "0".  In mode 0 (overwrite), the hardware
+       replaces the contents of the signal channel with the data that is
+       written to it.  In mode 1 (logical OR), the hardware accumulates the
+       bits that are subsequently written to it.  The possible operations on
+       an open signal1_type or signal2_type file are:
+
+       read(2)
+              When the count supplied to the read call  is  shorter  than  the
+              required  length  for the digit plus a newline character, subse-
+              quent reads from the same file descriptor will  result  in  com-
+              pleting  the  string.  When a complete string has been read, all
+              subsequent read operations will return zero bytes and a new file
+              descriptor needs to be opened to read the value again.
+
+       write(2)
+              A write operation on the file results in setting the register to
+              the value given in the string. The string  is  parsed  from  the
+              beginning  to  the first non-numeric character or the end of the
+              buffer.  Subsequent writes to the same file descriptor overwrite
+              the previous setting.
+
+
+EXAMPLES
+       /etc/fstab entry
+              none      /spu      spufs     gid=spu   0    0
+
+
+AUTHORS
+       Arnd  Bergmann  <arndb@de.ibm.com>,  Mark  Nutter <mnutter@us.ibm.com>,
+       Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
+
+SEE ALSO
+       capabilities(7), close(2), spu_create(2), spu_run(2), spufs(7)
+
+
+
+Linux                             2005-09-28                          SPUFS(2)
+
+------------------------------------------------------------------------------
+
+SPU_RUN(2)                 Linux Programmer's Manual                SPU_RUN(2)
+
+
+
+NAME
+       spu_run - execute an spu context
+
+
+SYNOPSIS
+       #include <sys/spu.h>
+
+       int spu_run(int fd, unsigned int *npc, unsigned int *event);
+
+DESCRIPTION
+       The  spu_run system call is used on PowerPC machines that implement the
+       Cell Broadband Engine Architecture in order to access Synergistic  Pro-
+       cessor  Units  (SPUs).  It  uses the fd that was returned from spu_cre-
+       ate(2) to address a specific SPU context. When the context gets  sched-
+       uled  to a physical SPU, it starts execution at the instruction pointer
+       passed in npc.
+
+       Execution of SPU code happens synchronously, meaning that spu_run  does
+       not  return  while the SPU is still running. If there is a need to exe-
+       cute SPU code in parallel with other code on either  the  main  CPU  or
+       other  SPUs,  you  need to create a new thread of execution first, e.g.
+       using the pthread_create(3) call.
+
+       When spu_run returns, the current value of the SPU instruction  pointer
+       is  written back to npc, so you can call spu_run again without updating
+       the pointers.
+
+       event can be a NULL pointer or point to an extended  status  code  that
+       gets  filled  when spu_run returns. It can be one of the following con-
+       stants:
+
+       SPE_EVENT_DMA_ALIGNMENT
+              A DMA alignment error
+
+       SPE_EVENT_SPE_DATA_SEGMENT
+              A DMA segmentation error
+
+       SPE_EVENT_SPE_DATA_STORAGE
+              A DMA storage error
+
+       If NULL is passed as the event argument, these errors will result in  a
+       signal delivered to the calling process.
+
+RETURN VALUE
+       spu_run returns the value of the spu_status register. On error, it
+       returns -1 and sets errno to one of the error codes listed below. The
+       spu_status  register  value  contains  a  bit  mask of status codes and
+       optionally a 14 bit code returned from the stop-and-signal  instruction
+       on the SPU. The bit masks for the status codes are:
+
+       0x02   SPU was stopped by stop-and-signal.
+
+       0x04   SPU was stopped by halt.
+
+       0x08   SPU is waiting for a channel.
+
+       0x10   SPU is in single-step mode.
+
+       0x20   SPU has tried to execute an invalid instruction.
+
+       0x40   SPU has tried to access an invalid channel.
+
+       0x3fff0000
+              The  bits  masked with this value contain the code returned from
+              stop-and-signal.
+
+       Unless spu_run fails with an error code, at least one of the lower
+       eight bits is always set.
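+
+       The status word described above can be decoded with plain mask
+       operations. The following is a minimal sketch in C. The EX_* macro
+       names and ex_* function names are invented for this example and are
+       not taken from <sys/spu.h>; only the numeric masks come from the
+       table above.
+

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Masks copied from the table above; the names are hypothetical. */
#define EX_SPU_STOPPED_BY_STOP  0x02u
#define EX_SPU_STOPPED_BY_HALT  0x04u
#define EX_SPU_STOP_CODE_MASK   0x3fff0000u
#define EX_SPU_STOP_CODE_SHIFT  16

/* True if the SPU stopped by executing stop-and-signal. */
static bool ex_spu_stopped_by_signal(uint32_t status)
{
	return (status & EX_SPU_STOPPED_BY_STOP) != 0;
}

/* True if the SPU stopped by executing halt. */
static bool ex_spu_stopped_by_halt(uint32_t status)
{
	return (status & EX_SPU_STOPPED_BY_HALT) != 0;
}

/* Extract the 14 bit stop-and-signal code from the masked upper bits. */
static unsigned int ex_spu_stop_code(uint32_t status)
{
	return (status & EX_SPU_STOP_CODE_MASK) >> EX_SPU_STOP_CODE_SHIFT;
}
```

+
+       A caller would normally test ex_spu_stopped_by_signal() first and
+       only then look at the stop code, since the code bits are described
+       as optional.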
+
+ERRORS
+       EAGAIN or EWOULDBLOCK
+              fd is in non-blocking mode and spu_run would block.
+
+       EBADF  fd is not a valid file descriptor.
+
+       EFAULT npc is not a valid pointer or event is neither NULL nor a valid
+              pointer.
+
+       EINTR  A signal occurred while spu_run was in progress.  The npc  value
+              has  been updated to the new program counter value if necessary.
+
+       EINVAL fd is not a file descriptor returned from spu_create(2).
+
+       ENOMEM Insufficient memory was available to handle a page fault result-
+              ing from an MFC direct memory access.
+
+       ENOSYS The functionality is not provided by the current system, because
+              either the hardware does not provide SPUs or the spufs module is
+              not loaded.
+
+
+NOTES
+       spu_run  is  meant  to  be  used  from  libraries that implement a more
+       abstract interface to SPUs, not to be used from  regular  applications.
+       See  http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
+       ommended libraries.
+
+
+CONFORMING TO
+       This call is Linux specific and only implemented by the ppc64 architec-
+       ture. Programs using this system call are not portable.
+
+
+BUGS
+       The code does not yet fully implement all features outlined here.
+
+
+AUTHOR
+       Arnd Bergmann <arndb@de.ibm.com>
+
+SEE ALSO
+       capabilities(7), close(2), spu_create(2), spufs(7)
+
+
+
+Linux                             2005-09-28                        SPU_RUN(2)
+
+------------------------------------------------------------------------------
+
+SPU_CREATE(2)              Linux Programmer's Manual             SPU_CREATE(2)
+
+
+
+NAME
+       spu_create - create a new spu context
+
+
+SYNOPSIS
+       #include <sys/types.h>
+       #include <sys/spu.h>
+
+       int spu_create(const char *pathname, int flags, mode_t mode);
+
+DESCRIPTION
+       The  spu_create  system call is used on PowerPC machines that implement
+       the Cell Broadband Engine Architecture in order to  access  Synergistic
+       Processor  Units (SPUs). It creates a new logical context for an SPU in
+       pathname and returns a handle associated with it. pathname must
+       point  to  a  non-existing directory in the mount point of the SPU file
+       system (spufs).  When spu_create is successful, a directory  gets  cre-
+       ated on pathname and it is populated with files.
+
+       The returned file handle can only be passed to spu_run(2) or closed;
+       other operations are not defined on it. When it is closed, all
+       associated directory entries in spufs are removed. When the last file handle
+       pointing either inside  of  the  context  directory  or  to  this  file
+       descriptor is closed, the logical SPU context is destroyed.
+
+       The  parameter flags can be zero or any bitwise or'd combination of the
+       following constants:
+
+       SPU_RAWIO
+              Allow mapping of some of the hardware registers of the SPU  into
+              user space. This flag requires the CAP_SYS_RAWIO capability, see
+              capabilities(7).
+
+       The mode parameter specifies the permissions used for creating the  new
+       directory  in  spufs.   mode is modified with the user's umask(2) value
+       and then used for both the directory and the files contained in it. The
+       file permissions mask out some more bits of mode because they typically
+       support only read or write access. See stat(2) for a full list  of  the
+       possible mode values.
+
+
+RETURN VALUE
+       spu_create returns a new file descriptor. On error, it returns -1 and
+       sets errno to one of the error codes listed below.
+
+
+ERRORS
+       EACCES
+              The  current  user does not have write access on the spufs mount
+              point.
+
+       EEXIST An SPU context already exists at the given path name.
+
+       EFAULT pathname is not a valid string pointer in  the  current  address
+              space.
+
+       EINVAL pathname is not a directory in the spufs mount point.
+
+       ELOOP  Too many symlinks were found while resolving pathname.
+
+       EMFILE The process has reached its maximum open file limit.
+
+       ENAMETOOLONG
+              pathname was too long.
+
+       ENFILE The system has reached the global open file limit.
+
+       ENOENT Part of pathname could not be resolved.
+
+       ENOMEM The kernel could not allocate all resources required.
+
+       ENOSPC There  are  not  enough  SPU resources available to create a new
+              context or the user specific limit for the number  of  SPU  con-
+              texts has been reached.
+
+       ENOSYS The functionality is not provided by the current system, because
+              either the hardware does not provide SPUs or the spufs module is
+              not loaded.
+
+       ENOTDIR
+              A part of pathname is not a directory.
+
+
+
+NOTES
+       spu_create  is  meant  to  be used from libraries that implement a more
+       abstract interface to SPUs, not to be used from  regular  applications.
+       See  http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
+       ommended libraries.
+
+
+FILES
+       pathname must point to a location beneath the mount point of spufs.  By
+       convention, it gets mounted in /spu.
+
+
+CONFORMING TO
+       This call is Linux specific and only implemented by the ppc64 architec-
+       ture. Programs using this system call are not portable.
+
+
+BUGS
+       The code does not yet fully implement all features outlined here.
+
+
+AUTHOR
+       Arnd Bergmann <arndb@de.ibm.com>
+
+SEE ALSO
+       capabilities(7), close(2), spu_run(2), spufs(7)
+
+
+
+Linux                             2005-09-28                     SPU_CREATE(2)
--- a/Documentation/filesystems/sysfs-pci.txt
+++ b/Documentation/filesystems/sysfs-pci.txt
@@ -0,0 +1,95 @@
+Accessing PCI device resources through sysfs
+--------------------------------------------
+
+sysfs, usually mounted at /sys, provides access to PCI resources on platforms
+that support it.  For example, a given bus might look like this:
+
+     /sys/devices/pci0000:17
+     |-- 0000:17:00.0
+     |   |-- class
+     |   |-- config
+     |   |-- device
+     |   |-- irq
+     |   |-- local_cpus
+     |   |-- resource
+     |   |-- resource0
+     |   |-- resource1
+     |   |-- resource2
+     |   |-- rom
+     |   |-- subsystem_device
+     |   |-- subsystem_vendor
+     |   `-- vendor
+     `-- ...
+
+The topmost element describes the PCI domain and bus number.  In this case,
+the domain number is 0000 and the bus number is 17 (both values are in hex).
+This bus contains a single function device in slot 0.  The domain and bus
+numbers are reproduced for convenience.  Under the device directory are several
+files, each with their own function.
+
+       file		   function
+       ----		   --------
+       class		   PCI class (ascii, ro)
+       config		   PCI config space (binary, rw)
+       device		   PCI device (ascii, ro)
+       irq		   IRQ number (ascii, ro)
+       local_cpus	   nearby CPU mask (cpumask, ro)
+       resource		   PCI resource host addresses (ascii, ro)
+       resource0..N	   PCI resource N, if present (binary, mmap)
+       rom		   PCI ROM resource, if present (binary, ro)
+       subsystem_device	   PCI subsystem device (ascii, ro)
+       subsystem_vendor	   PCI subsystem vendor (ascii, ro)
+       vendor		   PCI vendor (ascii, ro)
+
+  ro - read only file
+  rw - file is readable and writable
+  mmap - file is mmapable
+  ascii - file contains ascii text
+  binary - file contains binary data
+  cpumask - file contains a cpumask type
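+
+The device directory name shown in the tree (0000:17:00.0) packs the domain,
+bus, slot and function numbers into one string.  As a rough sketch, a
+userspace tool could split it as follows.  The struct and function names are
+made up for this example, and the field widths assume the four-digit domain
+format shown above.
+

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper type for this sketch; not a kernel structure. */
struct ex_pci_addr {
	unsigned int domain;
	unsigned int bus;
	unsigned int slot;
	unsigned int func;
};

/*
 * Parse a sysfs device directory name such as "0000:17:00.0".
 * All fields are hexadecimal.  Returns 0 on success and -1 if the
 * name is malformed or has trailing characters.
 */
static int ex_parse_pci_addr(const char *name, struct ex_pci_addr *out)
{
	char trailing;
	int n;

	n = sscanf(name, "%4x:%2x:%2x.%1x%c", &out->domain, &out->bus,
		   &out->slot, &out->func, &trailing);
	return n == 4 ? 0 : -1;
}
```

+
+The extra %c conversion is only there to detect trailing garbage: a
+well-formed name yields exactly four conversions.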
+
+The read-only files are informational; writes to them are ignored, with
+the exception of the 'rom' file.  Writable files can be used to perform
+actions on the device (e.g. changing config space, detaching a device).
+mmapable files are available via an mmap of the file at offset 0 and can be
+used to do actual device programming from userspace.  Note that some platforms
+don't support mmapping of certain resources, so be sure to check the return
+value from any attempted mmap.
+
+The 'rom' file is special in that it provides read-only access to the device's
+ROM file, if available.  It's disabled by default, however, so applications
+should write the string "1" to the file to enable it before attempting a read
+call, and disable it following the access by writing "0" to the file.
+
+Accessing legacy resources through sysfs
+----------------------------------------
+
+Legacy I/O port and ISA memory resources are also provided in sysfs if the
+underlying platform supports them.  They're located in the PCI class hierarchy,
+e.g.
+
+	/sys/class/pci_bus/0000:17/
+	|-- bridge -> ../../../devices/pci0000:17
+	|-- cpuaffinity
+	|-- legacy_io
+	`-- legacy_mem
+
+The legacy_io file is a read/write file that can be used by applications to
+do legacy port I/O.  The application should open the file, seek to the desired
+port (e.g. 0x3e8) and do a read or a write of 1, 2 or 4 bytes.  The legacy_mem
+file should be mmapped with an offset corresponding to the memory offset
+desired, e.g. 0xa0000 for the VGA frame buffer.  The application can then
+simply dereference the returned pointer (after checking for errors of course)
+to access legacy memory space.
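+
+The open/seek/read sequence for legacy_io described above might look like
+the sketch below.  The function name is hypothetical, and the behaviour of
+the file on a given platform is an assumption; the code only shows the
+sequence of calls and reports any failure to the caller.
+

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read one byte from a legacy I/O port through a legacy_io file.
 * 'path' would be something like /sys/class/pci_bus/0000:17/legacy_io
 * and 'port' is used as the file offset (e.g. 0x3e8).
 * Returns 0 on success, -1 if any step fails.
 */
static int ex_legacy_io_read8(const char *path, off_t port, uint8_t *val)
{
	ssize_t n = -1;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* Seek to the port number, then read a single byte. */
	if (lseek(fd, port, SEEK_SET) == port)
		n = read(fd, val, 1);

	close(fd);
	return n == 1 ? 0 : -1;
}
```

+
+Reads of 2 or 4 bytes would follow the same pattern with a wider buffer.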
+
+Supporting PCI access on new platforms
+--------------------------------------
+
+In order to support PCI resource mapping as described above, Linux platform
+code must define HAVE_PCI_MMAP and provide a pci_mmap_page_range function.
+Platforms are free to only support subsets of the mmap functionality, but
+useful return codes should be provided.
+
+Legacy resources are protected by the HAVE_PCI_LEGACY define.  Platforms
+wishing to support legacy functionality should define it and provide
+pci_legacy_read, pci_legacy_write and pci_mmap_legacy_page_range functions.
--- a/Documentation/filesystems/sysfs.txt
+++ b/Documentation/filesystems/sysfs.txt
@@ -0,0 +1,346 @@
+
+sysfs - _The_ filesystem for exporting kernel objects. 
+
+Patrick Mochel	<mochel@osdl.org>
+
+10 January 2003
+
+
+What it is:
+~~~~~~~~~~~
+
+sysfs is a ram-based filesystem initially based on ramfs. It provides
+a means to export kernel data structures, their attributes, and the 
+linkages between them to userspace. 
+
+sysfs is tied inherently to the kobject infrastructure. Please read
+Documentation/kobject.txt for more information concerning the kobject
+interface. 
+
+
+Using sysfs
+~~~~~~~~~~~
+
+sysfs is always compiled in. You can access it by doing:
+
+    mount -t sysfs sysfs /sys 
+
+
+Directory Creation
+~~~~~~~~~~~~~~~~~~
+
+For every kobject that is registered with the system, a directory is
+created for it in sysfs. That directory is created as a subdirectory
+of the kobject's parent, expressing internal object hierarchies to
+userspace. Top-level directories in sysfs represent the common
+ancestors of object hierarchies; i.e. the subsystems the objects
+belong to. 
+
+Sysfs internally stores the kobject that owns the directory in the
+->d_fsdata pointer of the directory's dentry. This allows sysfs to do
+reference counting directly on the kobject when the file is opened and
+closed. 
+
+
+Attributes
+~~~~~~~~~~
+
+Attributes can be exported for kobjects in the form of regular files in
+the filesystem. Sysfs forwards file I/O operations to methods defined
+for the attributes, providing a means to read and write kernel
+attributes.
+
+Attributes should be ASCII text files, preferably with only one value
+per file. It is noted that it may not be efficient to contain only one
+value per file, so it is socially acceptable to express an array of
+values of the same type. 
+
+Mixing types, expressing multiple lines of data, and doing fancy
+formatting of data is heavily frowned upon. Doing these things may get
+you publicly humiliated and your code rewritten without notice. 
+
+
+An attribute definition is simply:
+
+struct attribute {
+        char                    * name;
+        mode_t                  mode;
+};
+
+
+int sysfs_create_file(struct kobject * kobj, struct attribute * attr);
+void sysfs_remove_file(struct kobject * kobj, struct attribute * attr);
+
+
+A bare attribute contains no means to read or write the value of the
+attribute. Subsystems are encouraged to define their own attribute
+structure and wrapper functions for adding and removing attributes for
+a specific object type. 
+
+For example, the driver model defines struct device_attribute like:
+
+struct device_attribute {
+        struct attribute        attr;
+        ssize_t (*show)(struct device * dev, char * buf);
+        ssize_t (*store)(struct device * dev, const char * buf);
+};
+
+int device_create_file(struct device *, struct device_attribute *);
+void device_remove_file(struct device *, struct device_attribute *);
+
+It also defines this helper for defining device attributes: 
+
+#define DEVICE_ATTR(_name, _mode, _show, _store)      \
+struct device_attribute dev_attr_##_name = {            \
+        .attr = {.name  = __stringify(_name) , .mode   = _mode },      \
+        .show   = _show,                                \
+        .store  = _store,                               \
+};
+
+For example, declaring
+
+static DEVICE_ATTR(foo, S_IWUSR | S_IRUGO, show_foo, store_foo);
+
+is equivalent to doing:
+
+static struct device_attribute dev_attr_foo = {
+       .attr	= {
+		.name = "foo",
+		.mode = S_IWUSR | S_IRUGO,
+	},
+	.show = show_foo,
+	.store = store_foo,
+};
+
+
+Subsystem-Specific Callbacks
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a subsystem defines a new attribute type, it must implement a
+set of sysfs operations for forwarding read and write calls to the
+show and store methods of the attribute owners. 
+
+struct sysfs_ops {
+        ssize_t (*show)(struct kobject *, struct attribute *, char *);
+        ssize_t (*store)(struct kobject *, struct attribute *, const char *);
+};
+
+[ Subsystems should have already defined a struct kobj_type as a
+descriptor for this type, which is where the sysfs_ops pointer is
+stored. See the kobject documentation for more information. ]
+
+When a file is read or written, sysfs calls the appropriate method
+for the type. The method then translates the generic struct kobject
+and struct attribute pointers to the appropriate pointer types, and
+calls the associated methods. 
+
+
+To illustrate:
+
+#define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr)
+#define to_dev(d) container_of(d, struct device, kobj)
+
+static ssize_t
+dev_attr_show(struct kobject * kobj, struct attribute * attr, char * buf)
+{
+        struct device_attribute * dev_attr = to_dev_attr(attr);
+        struct device * dev = to_dev(kobj);
+        ssize_t ret = 0;
+
+        if (dev_attr->show)
+                ret = dev_attr->show(dev, buf);
+        return ret;
+}
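+
+The pointer translation above can be tried outside the kernel.  The sketch
+below uses cut-down stand-ins for struct kobject, struct attribute and
+struct device, plus a simplified container_of; all of these are
+assumptions made for illustration and are much smaller than the real
+kernel definitions.
+

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>

#define EX_BUF_SIZE 64

/* Simplified container_of: step back from a member to its container. */
#define ex_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Minimal userspace stand-ins for the kernel structures. */
struct kobject { const char *name; };
struct attribute { const char *name; mode_t mode; };

struct device {
	int            id;
	struct kobject kobj;	/* embedded, as in the kernel */
};

struct device_attribute {
	struct attribute attr;
	ssize_t (*show)(struct device *dev, char *buf);
};

#define to_dev_attr(_attr) ex_container_of(_attr, struct device_attribute, attr)
#define to_dev(d)          ex_container_of(d, struct device, kobj)

/* Generic entry point: it only sees kobject and attribute pointers. */
static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr,
			     char *buf)
{
	struct device_attribute *dev_attr = to_dev_attr(attr);
	struct device *dev = to_dev(kobj);

	return dev_attr->show ? dev_attr->show(dev, buf) : 0;
}

/* Type-specific method, reached through the translation above. */
static ssize_t show_id(struct device *dev, char *buf)
{
	return snprintf(buf, EX_BUF_SIZE, "%d\n", dev->id);
}
```

+
+Calling dev_attr_show() with the address of the embedded kobj and attr
+members recovers the enclosing device and device_attribute, which is the
+whole point of embedding them rather than pointing to them.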
+
+
+
+Reading/Writing Attribute Data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To read or write attributes, show() or store() methods must be
+specified when declaring the attribute. The method types should be as
+simple as those defined for device attributes:
+
+        ssize_t (*show)(struct device * dev, char * buf);
+        ssize_t (*store)(struct device * dev, const char * buf);
+
+IOW, they should take only an object and a buffer as parameters. 
+
+
+sysfs allocates a buffer of size (PAGE_SIZE) and passes it to the
+method. Sysfs will call the method exactly once for each read or
+write. This forces the following behavior on the method
+implementations: 
+
+- On read(2), the show() method should fill the entire buffer. 
+  Recall that an attribute should only be exporting one value, or an
+  array of similar values, so this shouldn't be that expensive. 
+
+  This allows userspace to do partial reads and seeks arbitrarily over
+  the entire file at will. 
+
+- On write(2), sysfs expects the entire buffer to be passed during the
+  first write. Sysfs then passes the entire buffer to the store()
+  method. 
+  
+  When writing sysfs files, userspace processes should first read the
+  entire file, modify the values it wishes to change, then write the
+  entire buffer back. 
+
+  Attribute method implementations should operate on an identical
+  buffer when reading and writing values. 
+
+Other notes:
+
+- The buffer will always be PAGE_SIZE bytes in length. On i386, this
+  is 4096. 
+
+- show() methods should return the number of bytes printed into the
+  buffer. This is the return value of snprintf().
+
+- show() should always use snprintf(). 
+
+- store() should return the number of bytes used from the buffer. This
+  can be done using strlen().
+
+- show() or store() can always return errors. If a bad value comes
+  through, be sure to return an error.
+
+- The object passed to the methods will be pinned in memory via sysfs
+  reference counting its embedded object. However, the physical 
+  entity (e.g. device) the object represents may not be present. Be 
+  sure to have a way to check this, if necessary. 
+
+
+A very simple (and naive) implementation of a device attribute is:
+
+static ssize_t show_name(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	return snprintf(buf, PAGE_SIZE, "%s\n", dev->name);
+}
+
+static ssize_t store_name(struct device * dev, const char * buf)
+{
+	sscanf(buf, "%20s", dev->name);
+	return strnlen(buf, PAGE_SIZE);
+}
+
+static DEVICE_ATTR(name, S_IRUGO, show_name, store_name);
+
+
+(Note that the real implementation doesn't allow userspace to set the 
+name for a device.)
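+
+The notes above ask show() to report the number of bytes printed and
+store() to report the number of bytes used, or an error for a bad value.
+The following userspace sketch illustrates that contract for a single
+numeric attribute.  struct ex_device, the ex_* functions and EX_PAGE_SIZE
+are invented for this example; it is not the driver model's own code.
+

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

#define EX_PAGE_SIZE 4096	/* buffer size on i386, per the notes above */

/* Hypothetical device state holding one exported value. */
struct ex_device {
	long threshold;
};

/* Print the value and return the number of bytes written. */
static ssize_t ex_show_threshold(struct ex_device *dev, char *buf)
{
	return snprintf(buf, EX_PAGE_SIZE, "%ld\n", dev->threshold);
}

/* Parse one decimal value; reject anything else with -EINVAL. */
static ssize_t ex_store_threshold(struct ex_device *dev, const char *buf)
{
	char *end;
	long val;

	errno = 0;
	val = strtol(buf, &end, 10);
	if (end == buf || errno == ERANGE)
		return -EINVAL;		/* no digits, or out of range */
	if (*end == '\n')
		end++;			/* allow one trailing newline */
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage */

	dev->threshold = val;
	return end - buf;		/* bytes used from the buffer */
}
```

+
+Unlike the naive example, a malformed write leaves the stored value
+untouched and is reported back to the writer.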
+
+
+Top Level Directory Layout
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The sysfs directory arrangement exposes the relationship of kernel
+data structures. 
+
+The top level sysfs directory looks like:
+
+block/
+bus/
+class/
+devices/
+firmware/
+net/
+fs/
+
+devices/ contains a filesystem representation of the device tree. It maps
+directly to the internal kernel device tree, which is a hierarchy of
+struct device. 
+
+bus/ contains a flat directory layout of the various bus types in the
+kernel. Each bus's directory contains two subdirectories:
+
+	devices/
+	drivers/
+
+devices/ contains symlinks for each device discovered in the system
+that point to the device's directory under root/.
+
+drivers/ contains a directory for each device driver that is loaded
+for devices on that particular bus (this assumes that drivers do not
+span multiple bus types).
+
+fs/ contains a directory for some filesystems.  Currently each
+filesystem wanting to export attributes must create its own hierarchy
+below fs/ (see ./fuse.txt for an example).
+
+
+More information on driver-model specific features can be found in
+Documentation/driver-model/. 
+
+
+TODO: Finish this section.
+
+
+Current Interfaces
+~~~~~~~~~~~~~~~~~~
+
+The following interface layers currently exist in sysfs:
+
+
+- devices (include/linux/device.h)
+----------------------------------
+Structure:
+
+struct device_attribute {
+        struct attribute        attr;
+        ssize_t (*show)(struct device * dev, char * buf);
+        ssize_t (*store)(struct device * dev, const char * buf);
+};
+
+Declaring:
+
+DEVICE_ATTR(_name, _mode, _show, _store);
+
+Creation/Removal:
+
+int device_create_file(struct device *device, struct device_attribute * attr);
+void device_remove_file(struct device * dev, struct device_attribute * attr);
+
+
+- bus drivers (include/linux/device.h)
+--------------------------------------
+Structure:
+
+struct bus_attribute {
+        struct attribute        attr;
+        ssize_t (*show)(struct bus_type *, char * buf);
+        ssize_t (*store)(struct bus_type *, const char * buf);
+};
+
+Declaring:
+
+BUS_ATTR(_name, _mode, _show, _store)
+
+Creation/Removal:
+
+int bus_create_file(struct bus_type *, struct bus_attribute *);
+void bus_remove_file(struct bus_type *, struct bus_attribute *);
+
+
+- device drivers (include/linux/device.h)
+-----------------------------------------
+
+Structure:
+
+struct driver_attribute {
+        struct attribute        attr;
+        ssize_t (*show)(struct device_driver *, char * buf);
+        ssize_t (*store)(struct device_driver *, const char * buf);
+};
+
+Declaring:
+
+DRIVER_ATTR(_name, _mode, _show, _store)
+
+Creation/Removal:
+
+int driver_create_file(struct device_driver *, struct driver_attribute *);
+void driver_remove_file(struct device_driver *, struct driver_attribute *);
+
+
--- a/Documentation/filesystems/sysv-fs.txt
+++ b/Documentation/filesystems/sysv-fs.txt
@@ -0,0 +1,197 @@
+It implements all of
+  - Xenix FS,
+  - SystemV/386 FS,
+  - Coherent FS.
+
+To install:
+* Answer the 'System V and Coherent filesystem support' question with 'y'
+  when configuring the kernel.
+* To mount a disk or a partition, use
+    mount [-r] -t sysv device mountpoint
+  The file system type names
+               -t sysv
+               -t xenix
+               -t coherent
+  may be used interchangeably, but the last two will eventually disappear.
+
+Bugs in the present implementation:
+- Coherent FS:
+  - The "free list interleave" n:m is currently ignored.
+  - Only file systems with no filesystem name and no pack name are recognized.
+  (See Coherent "man mkfs" for a description of these features.)
+- SystemV Release 2 FS:
+  The superblock is only searched for in blocks 9, 15 and 18, which
+  correspond to the beginning of track 1 on floppy disks. No support
+  for this FS on hard disk yet.
+
+
+These filesystems are rather similar. Here is a comparison with Minix FS:
+
+* Linux fdisk reports on partitions
+  - Minix FS     0x81 Linux/Minix
+  - Xenix FS     ??
+  - SystemV FS   ??
+  - Coherent FS  0x08 AIX bootable
+
+* Size of a block or zone (data allocation unit on disk)
+  - Minix FS     1024
+  - Xenix FS     1024 (also 512 ??)
+  - SystemV FS   1024 (also 512 and 2048)
+  - Coherent FS   512
+
+* General layout: all have one boot block, one super block and
+  separate areas for inodes and for directories/data.
+  On SystemV Release 2 FS (e.g. Microport) the first track is reserved and
+  all the block numbers (including the super block) are offset by one track.
+
+* Byte ordering of "short" (16 bit entities) on disk:
+  - Minix FS     little endian  0 1
+  - Xenix FS     little endian  0 1
+  - SystemV FS   little endian  0 1
+  - Coherent FS  little endian  0 1
+  Of course, this affects only the file system, not the data of files on it!
+
+* Byte ordering of "long" (32 bit entities) on disk:
+  - Minix FS     little endian  0 1 2 3
+  - Xenix FS     little endian  0 1 2 3
+  - SystemV FS   little endian  0 1 2 3
+  - Coherent FS  PDP-11         2 3 0 1
+  Of course, this affects only the file system, not the data of files on it!
+
+* Inode on disk: "short", 0 means non-existent, the root dir ino is:
+  - Minix FS                            1
+  - Xenix FS, SystemV FS, Coherent FS   2
+
+* Maximum number of hard links to a file:
+  - Minix FS     250
+  - Xenix FS     ??
+  - SystemV FS   ??
+  - Coherent FS  >=10000
+
+* Free inode management:
+  - Minix FS                             a bitmap
+  - Xenix FS, SystemV FS, Coherent FS
+      There is a cache of a certain number of free inodes in the super-block.
+      When it is exhausted, new free inodes are found using a linear search.
+
+* Free block management:
+  - Minix FS                             a bitmap
+  - Xenix FS, SystemV FS, Coherent FS
+      Free blocks are organized in a "free list". Maybe a misleading term,
+      since it is not true that every free block contains a pointer to
+      the next free block. Rather, the free blocks are organized in chunks
+      of limited size, and every now and then a free block contains pointers
+      to the free blocks pertaining to the next chunk; the first of these
+      contains pointers and so on. The list terminates with a "block number"
+      0 on Xenix FS and SystemV FS, with a block zeroed out on Coherent FS.
+
+* Super-block location:
+  - Minix FS     block 1 = bytes 1024..2047
+  - Xenix FS     block 1 = bytes 1024..2047
+  - SystemV FS   bytes 512..1023
+  - Coherent FS  block 1 = bytes 512..1023
+
+* Super-block layout:
+  - Minix FS
+                    unsigned short s_ninodes;
+                    unsigned short s_nzones;
+                    unsigned short s_imap_blocks;
+                    unsigned short s_zmap_blocks;
+                    unsigned short s_firstdatazone;
+                    unsigned short s_log_zone_size;
+                    unsigned long s_max_size;
+                    unsigned short s_magic;
+  - Xenix FS, SystemV FS, Coherent FS
+                    unsigned short s_firstdatazone;
+                    unsigned long  s_nzones;
+                    unsigned short s_fzone_count;
+                    unsigned long  s_fzones[NICFREE];
+                    unsigned short s_finode_count;
+                    unsigned short s_finodes[NICINOD];
+                    char           s_flock;
+                    char           s_ilock;
+                    char           s_modified;
+                    char           s_rdonly;
+                    unsigned long  s_time;
+                    short          s_dinfo[4]; -- SystemV FS only
+                    unsigned long  s_free_zones;
+                    unsigned short s_free_inodes;
+                    short          s_dinfo[4]; -- Xenix FS only
+                    unsigned short s_interleave_m,s_interleave_n; -- Coherent FS only
+                    char           s_fname[6];
+                    char           s_fpack[6];
+    then they differ considerably:
+        Xenix FS
+                    char           s_clean;
+                    char           s_fill[371];
+                    long           s_magic;
+                    long           s_type;
+        SystemV FS
+                    long           s_fill[12 or 14];
+                    long           s_state;
+                    long           s_magic;
+                    long           s_type;
+        Coherent FS
+                    unsigned long  s_unique;
+    Note that Coherent FS has no magic.
+
+* Inode layout:
+  - Minix FS
+                    unsigned short i_mode;
+                    unsigned short i_uid;
+                    unsigned long  i_size;
+                    unsigned long  i_time;
+                    unsigned char  i_gid;
+                    unsigned char  i_nlinks;
+                    unsigned short i_zone[7+1+1];
+  - Xenix FS, SystemV FS, Coherent FS
+                    unsigned short i_mode;
+                    unsigned short i_nlink;
+                    unsigned short i_uid;
+                    unsigned short i_gid;
+                    unsigned long  i_size;
+                    unsigned char  i_zone[3*(10+1+1+1)];
+                    unsigned long  i_atime;
+                    unsigned long  i_mtime;
+                    unsigned long  i_ctime;
+
+* Regular file data blocks are organized as
+  - Minix FS
+               7 direct blocks
+               1 indirect block (pointers to blocks)
+               1 double-indirect block (pointer to pointers to blocks)
+  - Xenix FS, SystemV FS, Coherent FS
+              10 direct blocks
+               1 indirect block (pointers to blocks)
+               1 double-indirect block (pointer to pointers to blocks)
+               1 triple-indirect block (pointer to pointers to pointers to blocks)
+
+* Inode size, inodes per block
+  - Minix FS        32   32
+  - Xenix FS        64   16
+  - SystemV FS      64   16
+  - Coherent FS     64    8
+
+* Directory entry on disk
+  - Minix FS
+                    unsigned short inode;
+                    char name[14/30];
+  - Xenix FS, SystemV FS, Coherent FS
+                    unsigned short inode;
+                    char name[14];
+
+* Dir entry size, dir entries per block
+  - Minix FS     16/32    64/32
+  - Xenix FS     16       64
+  - SystemV FS   16       64
+  - Coherent FS  16       32
+
+* How to implement symbolic links such that the host fsck doesn't scream:
+  - Minix FS     normal
+  - Xenix FS     kludge: as regular files with  chmod 1000
+  - SystemV FS   ??
+  - Coherent FS  kludge: as regular files with  chmod 1000
+
+
+Notation: We often speak of a "block" but mean a zone (the allocation unit)
+and not the disk driver's notion of "block".
--- a/Documentation/filesystems/tmpfs.txt
+++ b/Documentation/filesystems/tmpfs.txt
@@ -0,0 +1,124 @@
+Tmpfs is a file system which keeps all files in virtual memory.
+
+
+Everything in tmpfs is temporary in the sense that no files will be
+created on your hard drive. If you unmount a tmpfs instance,
+everything stored therein is lost.
+
+tmpfs puts everything into the kernel internal caches and grows and
+shrinks to accommodate the files it contains and is able to swap
+unneeded pages out to swap space. It has maximum size limits which can
+be adjusted on the fly via 'mount -o remount ...'
+
+If you compare it to ramfs (which was the template to create tmpfs)
+you gain swapping and limit checking. Another similar thing is the RAM
+disk (/dev/ram*), which simulates a fixed size hard disk in physical
+RAM, where you have to create an ordinary filesystem on top. Ramdisks
+cannot swap and you do not have the possibility to resize them. 
+
+Since tmpfs lives completely in the page cache and on swap, all tmpfs
+pages currently in memory will show up as cached. It will not show up
+as shared or something like that. You can also check the actual
+RAM+swap use of a tmpfs instance with df(1) and du(1).
+
+
+tmpfs has the following uses:
+
+1) There is always a kernel internal mount which you will not see at
+   all. This is used for shared anonymous mappings and SYSV shared
+   memory. 
+
+   This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not
+   set, the user visible part of tmpfs is not built. But the internal
+   mechanisms are always present.
+
+2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for
+   POSIX shared memory (shm_open, shm_unlink). Adding the following
+   line to /etc/fstab should take care of this:
+
+	tmpfs	/dev/shm	tmpfs	defaults	0 0
+
+   Remember to create the directory that you intend to mount tmpfs on
+   if necessary.
+
+   This mount is _not_ needed for SYSV shared memory. The internal
+   mount is used for that. (In the 2.3 kernel versions it was
+   necessary to mount the predecessor of tmpfs (shm fs) to use SYSV
+   shared memory)
+
+3) Some people (including me) find it very convenient to mount it
+   e.g. on /tmp and /var/tmp and have a big swap partition. And now
+   loop mounts of tmpfs files do work, so mkinitrd shipped by most
+   distributions should succeed with a tmpfs /tmp.
+
+4) And probably a lot more I do not know about :-)
+
+
+tmpfs has three mount options for sizing:
+
+size:      The limit of allocated bytes for this tmpfs instance. The 
+           default is half of your physical RAM without swap. If you
+           oversize your tmpfs instances the machine will deadlock
+           since the OOM handler will not be able to free that memory.
+nr_blocks: The same as size, but in blocks of PAGE_CACHE_SIZE.
+nr_inodes: The maximum number of inodes for this instance. The default
+           is half of the number of your physical RAM pages, or (on a
+           machine with highmem) the number of lowmem RAM pages,
+           whichever is the lower.
+
+These parameters accept a suffix k, m or g for kilo, mega and giga and
+can be changed on remount.  The size parameter also accepts a suffix %
+to limit this tmpfs instance to that percentage of your physical RAM:
+the default, when neither size nor nr_blocks is specified, is size=50%
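+
+As an illustration of the suffix handling described above, here is a
+minimal userspace sketch in C. The helper name parse_size() is made up
+for this example and is not the kernel's own parser. It accepts decimal
+numbers only, does no overflow checking, and leaves out the % form,
+since that depends on the amount of physical RAM.
+
```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a size string such as "10G" or "10k" into
 * a plain number. Returns 0 on success and -1 on malformed input.
 */
int parse_size(const char *s, uint64_t *out)
{
	char *end;
	unsigned long long v;

	assert(s != NULL && out != NULL);
	if (*s < '0' || *s > '9')
		return -1;
	v = strtoull(s, &end, 10);

	/* Each suffix multiplies by a further factor of 1024. */
	switch (*end) {
	case 'g': case 'G':
		v <<= 10;	/* fall through */
	case 'm': case 'M':
		v <<= 10;	/* fall through */
	case 'k': case 'K':
		v <<= 10;
		end++;
		break;
	default:
		break;
	}
	if (*end != '\0')
		return -1;	/* trailing garbage or unknown suffix */
	*out = v;
	return 0;
}
```
+
+With this sketch, "10k" parses to 10240 and "10G" to 10 * 2^30.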
+
+If nr_blocks=0 (or size=0), blocks will not be limited in that instance;
+if nr_inodes=0, inodes will not be limited.  It is generally unwise to
+mount with such options, since it allows any user with write access to
+use up all the memory on the machine; but enhances the scalability of
+that instance in a system with many cpus making intensive use of it.
+
+
+tmpfs has a mount option to set the NUMA memory allocation policy for
+all files in that instance (if CONFIG_NUMA is enabled) - which can be
+adjusted on the fly via 'mount -o remount ...'
+
+mpol=default             prefers to allocate memory from the local node
+mpol=prefer:Node         prefers to allocate memory from the given Node
+mpol=bind:NodeList       allocates memory only from nodes in NodeList
+mpol=interleave          prefers to allocate from each node in turn
+mpol=interleave:NodeList allocates from each node of NodeList in turn
+
+NodeList format is a comma-separated list of decimal numbers and ranges,
+a range being two hyphen-separated decimal numbers, the smallest and
+largest node numbers in the range.  For example, mpol=bind:0-3,5,7,9-15
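+
+The NodeList format can be sketched in userspace C as follows. This is
+only an illustration: parse_nodelist() and TOY_MAX_NODES are names
+invented for the example, and TOY_MAX_NODES merely stands in for the
+kernel's MAX_NUMNODES limit.
+
```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_MAX_NODES 64	/* assumption of this sketch only */

/*
 * Parse a list such as "0-3,5,7,9-15" into a bitmask with one bit per
 * node. Returns 0 on success, -1 on malformed input or when a node
 * number is >= TOY_MAX_NODES.
 */
int parse_nodelist(const char *s, uint64_t *mask)
{
	uint64_t m = 0;

	assert(s != NULL && mask != NULL);
	for (;;) {
		char *end;
		unsigned long lo, hi;

		if (*s < '0' || *s > '9')
			return -1;
		lo = hi = strtoul(s, &end, 10);
		if (*end == '-') {		/* a range: lo-hi */
			s = end + 1;
			if (*s < '0' || *s > '9')
				return -1;
			hi = strtoul(s, &end, 10);
		}
		if (lo > hi || hi >= TOY_MAX_NODES)
			return -1;
		for (unsigned long n = lo; n <= hi; n++)
			m |= (uint64_t)1 << n;
		if (*end == '\0')
			break;
		if (*end != ',')
			return -1;
		s = end + 1;
	}
	*mask = m;
	return 0;
}
```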
+
+Note that trying to mount a tmpfs with an mpol option will fail if the
+running kernel does not support NUMA; and will fail if its nodelist
+specifies a node >= MAX_NUMNODES.  If your system relies on that tmpfs
+being mounted, but from time to time runs a kernel built without NUMA
+capability (perhaps a safe recovery kernel), or configured to support
+fewer nodes, then it is advisable to omit the mpol option from automatic
+mount options.  It can be added later, when the tmpfs is already mounted
+on MountPoint, by 'mount -o remount,mpol=Policy:NodeList MountPoint'.
+
+
+To specify the initial root directory you can use the following mount
+options:
+
+mode:	The permissions as an octal number
+uid:	The user id 
+gid:	The group id
+
+These options do not have any effect on remount. You can change these
+parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem.
+
+
+So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs'
+will give you a tmpfs instance on /mytmpfs which can allocate 10GB
+RAM/SWAP in 10240 inodes and it is only accessible by root.
+
+
+Author:
+   Christoph Rohland <cr@sap.com>, 1.12.01
+Updated:
+   Hugh Dickins <hugh@veritas.com>, 19 February 2006
--- a/Documentation/filesystems/udf.txt
+++ b/Documentation/filesystems/udf.txt
@@ -0,0 +1,80 @@
+*
+* Documentation/filesystems/udf.txt
+*
+UDF Filesystem version 0.9.8.1
+
+If you encounter problems with reading UDF discs using this driver,
+please report them to linux_udf@hpesjro.fc.hp.com, which is the
+developer's list.
+
+Write support requires a block driver which supports writing.  Currently
+dvd+rw drives and media support true random sector writes, and so a udf
+filesystem on such devices can be directly mounted read/write.  CD-RW
+media however, does not support this.  Instead the media can be formatted
+for packet mode using the utility cdrwtool, then the pktcdvd driver can
+be bound to the underlying cd device to provide the required buffering
+and read-modify-write cycles to allow the filesystem random sector writes
+while providing the hardware with only full packet writes.  While not
+required for dvd+rw media, use of the pktcdvd driver often enhances
+performance due to very poor read-modify-write support supplied internally
+by drive firmware.
+
+-------------------------------------------------------------------------------
+The following mount options are supported:
+
+	gid=		Set the default group.
+	umask=		Set the default umask.
+	uid=		Set the default user.
+	bs=		Set the block size.
+	unhide		Show otherwise hidden files.
+	undelete	Show deleted files in lists.
+	adinicb		Embed data in the inode (default)
+	noadinicb	Don't embed data in the inode
+	shortad		Use short ad's
+	longad		Use long ad's (default)
+	nostrict	Unset strict conformance
+	iocharset=	Set the NLS character set
+
+The uid= and gid= options need a bit more explaining.  They will accept a
+decimal numeric value which will be used as the default ID for that mount.
+They will also accept the strings "ignore" and "forget".  For files on the disk
+that are owned by nobody ( -1 ), they will instead look as if they are owned
+by the default ID.  The ignore option causes the default ID to override all
+IDs on the disk, not just -1.  The forget option causes all IDs to be written
+to disk as -1, so when the media is later remounted, they will appear to be
+owned by whatever default ID it is mounted with at that time.
+
+For typical desktop use of removable media, you should set the ID to that
+of the interactively logged on user, and also specify both the forget and
+ignore options.  This way the interactive user will always see the files
+on the disk as belonging to him.
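+
+The uid=/gid= rules above can be restated as a small C sketch. The
+function names are hypothetical and this is not the driver's code; it
+only encodes the behaviour described in the two paragraphs above,
+treating the on-disk value -1 as an unsigned 32-bit number.
+
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define UDF_ID_NOBODY ((uint32_t)-1)	/* the "-1" owner mentioned above */

/* ID shown to the user for a file whose on-disk owner is disk_id. */
uint32_t udf_shown_id(uint32_t disk_id, uint32_t default_id, bool ignore)
{
	if (ignore || disk_id == UDF_ID_NOBODY)
		return default_id;
	return disk_id;
}

/* ID written back to the media for a file owned by id. */
uint32_t udf_stored_id(uint32_t id, bool forget)
{
	return forget ? UDF_ID_NOBODY : id;
}
```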
+
+The remaining are for debugging and disaster recovery:
+
+	novrs		Skip volume sequence recognition 
+
+The following expect an offset from 0.
+
+	session=	Set the CDROM session (default= last session)
+	anchor=		Override standard anchor location. (default= 256)
+	volume=		Override the VolumeDesc location. (unused)
+	partition=	Override the PartitionDesc location. (unused)
+	lastblock=	Set the last block of the filesystem.
+
+The following expect an offset from the partition root.
+
+	fileset=	Override the fileset block location. (unused)
+	rootdir=	Override the root directory location. (unused)
+			WARNING: overriding the rootdir to a non-directory may
+				yield highly unpredictable results.
+-------------------------------------------------------------------------------
+
+
+For the latest version and toolset see:
+	http://linux-udf.sourceforge.net/
+
+Documentation on UDF and ECMA 167 is available FREE from:
+	http://www.osta.org/
+	http://www.ecma-international.org/
+
+Ben Fennema <bfennema@falcon.csc.calpoly.edu>
--- a/Documentation/filesystems/ufs.txt
+++ b/Documentation/filesystems/ufs.txt
@@ -0,0 +1,60 @@
+USING UFS
+=========
+
+mount -t ufs -o ufstype=type_of_ufs device dir
+
+
+UFS OPTIONS
+===========
+
+ufstype=type_of_ufs
+	UFS is a file system widely used in different operating systems.
+	The problem is the differences among implementations. Features of
+	some implementations are undocumented, so it's hard to recognize
+	the type of ufs automatically. That's why the user must specify the
+	type of ufs manually with the mount option ufstype. Possible values are:
+
+	old	old format of ufs
+		default value, supported as read-only
+
+	44bsd	used in FreeBSD, NetBSD, OpenBSD
+		supported as read-write
+
+	ufs2    used in FreeBSD 5.x
+		supported as read-write
+
+	5xbsd	synonym for ufs2
+
+	sun	used in SunOS (Solaris)
+		supported as read-write
+
+	sunx86	used in SunOS for Intel (Solarisx86)
+		supported as read-write
+
+	hp	used in HP-UX
+		supported as read-only
+
+	nextstep
+		used in NextStep
+		supported as read-only
+
+	nextstep-cd
+		used for NextStep CDROMs (block_size == 2048)
+		supported as read-only
+
+	openstep
+		used in OpenStep
+		supported as read-only
+
+
+POSSIBLE PROBLEMS
+=================
+
+See next section, if you have any.
+
+
+BUG REPORTS
+===========
+
+Send any ufs bug reports to daniel.pirkl@email.cz or
+to dushistov@mail.ru (do not send partition table bug reports).
--- a/Documentation/filesystems/vfat.txt
+++ b/Documentation/filesystems/vfat.txt
@@ -0,0 +1,231 @@
+USING VFAT
+----------------------------------------------------------------------
+To use the vfat filesystem, use the filesystem type 'vfat'.  i.e.
+  mount -t vfat /dev/fd0 /mnt
+
+No special partition formatter is required.  mkdosfs will work fine
+if you want to format from within Linux.
+
+VFAT MOUNT OPTIONS
+----------------------------------------------------------------------
+umask=###     -- The permission mask (for files and directories, see umask(1)).
+                 The default is the umask of current process.
+
+dmask=###     -- The permission mask for the directory.
+                 The default is the umask of current process.
+
+fmask=###     -- The permission mask for files.
+                 The default is the umask of current process.
+
+codepage=###  -- Sets the codepage number for converting to shortname
+		 characters on FAT filesystem.
+		 By default, FAT_DEFAULT_CODEPAGE setting is used.
+
+iocharset=name -- Character set to use for converting between the
+		 encoding used for user visible filenames and 16 bit
+		 Unicode characters. Long filenames are stored on disk
+		 in Unicode format, but Unix for the most part doesn't
+		 know how to deal with Unicode.
+		 By default, FAT_DEFAULT_IOCHARSET setting is used.
+
+		 There is also an option of doing UTF-8 translations
+		 with the utf8 option.
+
+		 NOTE: "iocharset=utf8" is not recommended. If unsure,
+		 you should consider the following option instead.
+
+utf8=<bool>   -- UTF-8 is the filesystem safe version of Unicode that
+		 is used by the console.  It can be enabled for the
+		 filesystem with this option. If 'uni_xlate' gets set,
+		 UTF-8 gets disabled.
+
+uni_xlate=<bool> -- Translate unhandled Unicode characters to special
+		 escaped sequences.  This would let you backup and
+		 restore filenames that are created with any Unicode
+		 characters.  Until Linux supports Unicode for real,
+		 this gives you an alternative.  Without this option,
+		 a '?' is used when no translation is possible.  The
+		 escape character is ':' because it is otherwise
+		 illegal on the vfat filesystem.  The escape sequence
+		 that gets used is ':' and the four digits of hexadecimal
+		 unicode.
+
+nonumtail=<bool> -- When creating 8.3 aliases, normally the alias will
+                 end in '~1' or tilde followed by some number.  If this
+                 option is set, then if the filename is 
+                 "longfilename.txt" and "longfile.txt" does not
+                 currently exist in the directory, 'longfile.txt' will
+                 be the short alias instead of 'longfi~1.txt'. 
+                  
+quiet         -- Stops printing certain warning messages.
+
+check=s|r|n   -- Case sensitivity checking setting.
+                 s: strict, case sensitive
+                 r: relaxed, case insensitive
+                 n: normal, default setting, currently case insensitive
+
+shortname=lower|win95|winnt|mixed
+	      -- Shortname display/create setting.
+		 lower: convert to lowercase for display,
+			emulate the Windows 95 rule for create.
+		 win95: emulate the Windows 95 rule for display/create.
+		 winnt: emulate the Windows NT rule for display/create.
+		 mixed: emulate the Windows NT rule for display,
+			emulate the Windows 95 rule for create.
+		 Default setting is `lower'.
+
+<bool>: 0,1,yes,no,true,false
+
+TODO
+----------------------------------------------------------------------
+* Need to get rid of the raw scanning stuff.  Instead, always use
+  a get next directory entry approach.  The only thing left that uses
+  raw scanning is the directory renaming code.
+
+
+POSSIBLE PROBLEMS
+----------------------------------------------------------------------
+* vfat_valid_longname does not properly check reserved names.
+* When a volume name is the same as a directory name in the root
+  directory of the filesystem, the directory name sometimes shows
+  up as an empty file.
+* autoconv option does not work correctly.
+
+BUG REPORTS
+----------------------------------------------------------------------
+If you have trouble with the VFAT filesystem, mail bug reports to
+chaffee@bmrc.cs.berkeley.edu.  Please specify the filename
+and the operation that gave you trouble.
+
+TEST SUITE
+----------------------------------------------------------------------
+If you plan to make any modifications to the vfat filesystem, please
+get the test suite that comes with the vfat distribution at
+
+  http://bmrc.berkeley.edu/people/chaffee/vfat.html
+
+This tests quite a few parts of the vfat filesystem and additional
+tests for new features or untested features would be appreciated.
+
+NOTES ON THE STRUCTURE OF THE VFAT FILESYSTEM
+----------------------------------------------------------------------
+(This documentation was provided by Galen C. Hunt <gchunt@cs.rochester.edu>
+ and lightly annotated by Gordon Chaffee).
+
+This document presents a very rough, technical overview of my
+knowledge of the extended FAT file system used in Windows NT 3.5 and
+Windows 95.  I don't guarantee that any of the following is correct,
+but it appears to be so.
+
+The extended FAT file system is almost identical to the FAT
+file system used in DOS versions up to and including 6.223410239847
+:-).  The significant change has been the addition of long file names.
+These names support up to 255 characters including spaces and lower
+case characters as opposed to the traditional 8.3 short names.
+
+Here is the description of the traditional FAT entry in the current
+Windows 95 filesystem:
+
+        struct directory { // Short 8.3 names 
+                unsigned char name[8];          // file name 
+                unsigned char ext[3];           // file extension 
+                unsigned char attr;             // attribute byte 
+		unsigned char lcase;		// Case for base and extension
+		unsigned char ctime_ms;		// Creation time, milliseconds
+		unsigned char ctime[2];		// Creation time
+		unsigned char cdate[2];		// Creation date
+		unsigned char adate[2];		// Last access date
+		unsigned char reserved[2];	// reserved values (ignored) 
+                unsigned char time[2];          // time stamp 
+                unsigned char date[2];          // date stamp 
+                unsigned char start[2];         // starting cluster number 
+                unsigned char size[4];          // size of the file 
+        };
+
+The lcase field specifies if the base and/or the extension of an 8.3
+name should be capitalized.  This field does not seem to be used by
+Windows 95 but it is used by Windows NT.  The case of filenames is not
+completely compatible from Windows NT to Windows 95.  It is not completely
+compatible in the reverse direction, however.  Filenames that fit in
+the 8.3 namespace and are written on Windows NT to be lowercase will
+show up as uppercase on Windows 95.
+
+Note that the "start" and "size" values are actually little
+endian integer values.  The descriptions of the fields in this
+structure are public knowledge and can be found elsewhere.
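+
+Because these fields are declared as byte arrays, a reader has to
+assemble the integers itself. A minimal sketch in C, with helper names
+(fat_le16, fat_le32) chosen for this example only:
+
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a 2-byte little-endian field such as "start". */
uint16_t fat_le16(const unsigned char b[2])
{
	assert(b != NULL);
	return (uint16_t)(b[0] | (b[1] << 8));
}

/* Decode a 4-byte little-endian field such as "size". */
uint32_t fat_le32(const unsigned char b[4])
{
	assert(b != NULL);
	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}
```
+
+Decoding byte by byte like this gives the same result on big endian
+and little endian hosts.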
+
+With the extended FAT system, Microsoft has inserted extra
+directory entries for any files with extended names.  (Any name which
+legally fits within the old 8.3 encoding scheme does not have extra
+entries.)  I call these extra entries slots.  Basically, a slot is a
+specially formatted directory entry which holds up to 13 characters of
+a file's extended name.  Think of slots as additional labeling for the
+directory entry of the file to which they correspond.  Microsoft
+prefers to refer to the 8.3 entry for a file as its alias and the
+extended slot directory entries as the file name. 
+
+The C structure for a slot directory entry follows:
+
+        struct slot { // Up to 13 characters of a long name 
+                unsigned char id;               // sequence number for slot 
+                unsigned char name0_4[10];      // first 5 characters in name 
+                unsigned char attr;             // attribute byte
+                unsigned char reserved;         // always 0 
+                unsigned char alias_checksum;   // checksum for 8.3 alias 
+                unsigned char name5_10[12];     // 6 more characters in name
+                unsigned char start[2];         // starting cluster number
+                unsigned char name11_12[4];     // last 2 characters in name
+        };
+
+If the layout of the slots looks a little odd, it's only
+because of Microsoft's efforts to maintain compatibility with old
+software.  The slots must be disguised to prevent old software from
+panicking.  To this end, a number of measures are taken:
+
+        1) The attribute byte for a slot directory entry is always set
+           to 0x0f.  This corresponds to an old directory entry with
+           attributes of "hidden", "system", "read-only", and "volume
+           label".  Most old software will ignore any directory
+           entries with the "volume label" bit set.  Real volume label
+           entries don't have the other three bits set.
+
+        2) The starting cluster is always set to 0, an impossible
+           value for a DOS file.
+
+Because the extended FAT system is backward compatible, it is
+possible for old software to modify directory entries.  Measures must
+be taken to ensure the validity of slots.  An extended FAT system can
+verify that a slot does in fact belong to an 8.3 directory entry by
+the following:
+
+        1) Positioning.  Slots for a file always immediately precede
+           their corresponding 8.3 directory entry.  In addition, each
+           slot has an id which marks its order in the extended file
+           name.  Here is a very abbreviated view of an 8.3 directory
+           entry and its corresponding long name slots for the file
+           "My Big File.Extension which is long":
+
+                <preceding files...>
+                <slot #3, id = 0x43, characters = "h is long">
+                <slot #2, id = 0x02, characters = "xtension whic">
+                <slot #1, id = 0x01, characters = "My Big File.E">
+                <directory entry, name = "MYBIGFIL.EXT">
+
+           Note that the slots are stored from last to first.  Slots
+           are numbered from 1 to N.  The Nth slot is or'ed with 0x40
+           to mark it as the last one.
+
+        2) Checksum.  Each slot has an "alias_checksum" value.  The
+           checksum is calculated from the 8.3 name using the
+           following algorithm:
+
+                for (sum = i = 0; i < 11; i++) {
+                        sum = (((sum&1)<<7)|((sum&0xfe)>>1)) + name[i];
+                }
+
+	3) If there is free space in the final slot, a Unicode NULL (0x0000) 
+	   is stored after the final character.  After that, all unused 
+	   characters in the final slot are set to Unicode 0xFFFF.
+
+Finally, note that the extended name is stored in Unicode.  Each Unicode
+character takes two bytes.
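+
+The checksum and slot id rules above can be written out as complete C
+functions. This is a sketch that follows the description in this
+section, not code taken from the driver. The function names are
+invented here, and the 11-byte input is assumed to be the name[8] and
+ext[3] fields of the 8.3 entry placed back to back.
+
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Checksum over the 11 bytes of the 8.3 name (8 name + 3 extension). */
unsigned char vfat_alias_checksum(const unsigned char name[11])
{
	unsigned char sum = 0;

	assert(name != NULL);
	for (int i = 0; i < 11; i++) {
		/* rotate right by one bit, then add the next byte mod 256 */
		sum = (unsigned char)((((sum & 1) << 7) |
				       ((sum & 0xfe) >> 1)) + name[i]);
	}
	return sum;
}

/* Position of a slot within the long name, with the 0x40 marker removed. */
unsigned slot_sequence(unsigned char id)
{
	return (unsigned char)(id & ~0x40);
}

/* True if the id carries the 0x40 "last slot" marker. */
bool slot_is_last(unsigned char id)
{
	return (id & 0x40) != 0;
}
```
+
+For the example directory listing above, slot id 0x43 decodes to
+sequence number 3 with the last-slot marker set.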
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -0,0 +1,931 @@
+
+	      Overview of the Linux Virtual File System
+
+	Original author: Richard Gooch <rgooch@atnf.csiro.au>
+
+		  Last updated on October 28, 2005
+
+  Copyright (C) 1999 Richard Gooch
+  Copyright (C) 2005 Pekka Enberg
+
+  This file is released under the GPLv2.
+
+
+Introduction
+============
+
+The Virtual File System (also known as the Virtual Filesystem Switch)
+is the software layer in the kernel that provides the filesystem
+interface to userspace programs. It also provides an abstraction
+within the kernel which allows different filesystem implementations to
+coexist.
+
+VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so
+on are called from a process context. Filesystem locking is described
+in the document Documentation/filesystems/Locking.
+
+
+Directory Entry Cache (dcache)
+------------------------------
+
+The VFS implements the open(2), stat(2), chmod(2), and similar system
+calls. The pathname argument that is passed to them is used by the VFS
+to search through the directory entry cache (also known as the dentry
+cache or dcache). This provides a very fast look-up mechanism to
+translate a pathname (filename) into a specific dentry. Dentries live
+in RAM and are never saved to disc: they exist only for performance.
+
+The dentry cache is meant to be a view into your entire filespace. As
+most computers cannot fit all dentries in the RAM at the same time,
+some bits of the cache are missing. In order to resolve your pathname
+into a dentry, the VFS may have to resort to creating dentries along
+the way, and then loading the inode. This is done by looking up the
+inode.
+
+
+The Inode Object
+----------------
+
+An individual dentry usually has a pointer to an inode. Inodes are
+filesystem objects such as regular files, directories, FIFOs and other
+beasts.  They live either on the disc (for block device filesystems)
+or in the memory (for pseudo filesystems). Inodes that live on the
+disc are copied into the memory when required and changes to the inode
+are written back to disc. A single inode can be pointed to by multiple
+dentries (hard links, for example, do this).
+
+To look up an inode requires that the VFS calls the lookup() method of
+the parent directory inode. This method is installed by the specific
+filesystem implementation that the inode lives in. Once the VFS has
+the required dentry (and hence the inode), we can do all those boring
+things like open(2) the file, or stat(2) it to peek at the inode
+data. The stat(2) operation is fairly simple: once the VFS has the
+dentry, it peeks at the inode data and passes some of it back to
+userspace.
+
+
+The File Object
+---------------
+
+Opening a file requires another operation: allocation of a file
+structure (this is the kernel-side implementation of file
+descriptors). The freshly allocated file structure is initialized with
+a pointer to the dentry and a set of file operation member functions.
+These are taken from the inode data. The open() file method is then
+called so the specific filesystem implementation can do its work. You
+can see that this is another switch performed by the VFS. The file
+structure is placed into the file descriptor table for the process.
+
+Reading, writing and closing files (and other assorted VFS operations)
+is done by using the userspace file descriptor to grab the appropriate
+file structure, and then calling the required file structure method to
+do whatever is required. For as long as the file is open, it keeps the
+dentry in use, which in turn means that the VFS inode is still in use.
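+
+The "switch" described above can be modelled in userspace with a table
+of function pointers. Everything in this sketch (the toy_ and memfs_
+names, the in-memory string used as file contents) is invented for
+illustration, and it skips the dentry that sits between the file and
+the inode in the real VFS.
+
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_file;

/* Per-filesystem method table, standing in for the file operations. */
struct toy_file_ops {
	long (*read)(struct toy_file *f, char *buf, size_t len);
};

struct toy_inode {
	const struct toy_file_ops *fops;	/* installed by the filesystem */
	const char *data;			/* toy file contents */
};

struct toy_file {
	struct toy_inode *inode;
	const struct toy_file_ops *ops;		/* taken from the inode at open */
	size_t pos;
};

/* "open": initialize the file structure from the inode data. */
void toy_open(struct toy_file *f, struct toy_inode *inode)
{
	assert(f != NULL && inode != NULL);
	f->inode = inode;
	f->ops = inode->fops;
	f->pos = 0;
}

/* "read": the generic layer only dispatches through the method table. */
long toy_read(struct toy_file *f, char *buf, size_t len)
{
	return f->ops->read(f, buf, len);
}

/* One concrete filesystem's read method: copy from an in-memory string. */
long memfs_read(struct toy_file *f, char *buf, size_t len)
{
	size_t left = strlen(f->inode->data) - f->pos;

	if (len > left)
		len = left;
	memcpy(buf, f->inode->data + f->pos, len);
	f->pos += len;
	return (long)len;
}

const struct toy_file_ops memfs_fops = { .read = memfs_read };
```
+
+The caller of toy_read() never needs to know which filesystem supplied
+the method table, which is the point of the switch.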
+
+
+Registering and Mounting a Filesystem
+=====================================
+
+To register and unregister a filesystem, use the following API
+functions:
+
+   #include <linux/fs.h>
+
+   extern int register_filesystem(struct file_system_type *);
+   extern int unregister_filesystem(struct file_system_type *);
+
+The passed struct file_system_type describes your filesystem. When a
+request is made to mount a device onto a directory in your filespace,
+the VFS will call the appropriate get_sb() method for the specific
+filesystem. The dentry for the mount point will then be updated to
+point to the root inode for the new filesystem.
+
+You can see all filesystems that are registered to the kernel in the
+file /proc/filesystems.
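+
+To make the idea of a registry of filesystem types concrete, here is a
+userspace toy model in C. It is not the kernel implementation: the
+toy_ names, the cut-down structure and the choice of error codes are
+all assumptions of this sketch.
+
```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Cut-down stand-in for struct file_system_type. */
struct toy_fs_type {
	const char *name;
	struct toy_fs_type *next;	/* list linkage, owned by the registry */
};

static struct toy_fs_type *toy_fs_list;

/* Return the link pointing at the entry called name, or the list tail. */
static struct toy_fs_type **toy_find(const char *name)
{
	struct toy_fs_type **p;

	for (p = &toy_fs_list; *p != NULL; p = &(*p)->next)
		if (strcmp((*p)->name, name) == 0)
			break;
	return p;
}

int toy_register_filesystem(struct toy_fs_type *fs)
{
	struct toy_fs_type **p;

	assert(fs != NULL && fs->name != NULL);
	if (fs->next != NULL)
		return -EBUSY;		/* already linked into a list */
	p = toy_find(fs->name);
	if (*p != NULL)
		return -EBUSY;		/* name already taken */
	*p = fs;			/* append at the tail */
	return 0;
}

int toy_unregister_filesystem(struct toy_fs_type *fs)
{
	struct toy_fs_type **p;

	for (p = &toy_fs_list; *p != NULL; p = &(*p)->next) {
		if (*p == fs) {
			*p = fs->next;
			fs->next = NULL;
			return 0;
		}
	}
	return -EINVAL;			/* was not registered */
}
```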
+
+
+struct file_system_type
+-----------------------
+
+This describes the filesystem. As of kernel 2.6.13, the following
+members are defined:
+
+struct file_system_type {
+	const char *name;
+	int fs_flags;
+        int (*get_sb) (struct file_system_type *, int,
+                       const char *, void *, struct vfsmount *);
+        void (*kill_sb) (struct super_block *);
+        struct module *owner;
+        struct file_system_type * next;
+        struct list_head fs_supers;
+};
+
+  name: the name of the filesystem type, such as "ext2", "iso9660",
+	"msdos" and so on
+
+  fs_flags: various flags (e.g. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.)
+
+  get_sb: the method to call when a new instance of this
+	filesystem should be mounted
+
+  kill_sb: the method to call when an instance of this filesystem
+	should be unmounted
+
+  owner: for internal VFS use: you should initialize this to THIS_MODULE in
+  	most cases.
+
+  next: for internal VFS use: you should initialize this to NULL
+
+The get_sb() method has the following arguments, matching the
+prototype shown above:
+
+  struct file_system_type *fs_type: describes the filesystem, partly
+	initialized by the specific filesystem code
+
+  int flags: mount flags
+
+  const char *dev_name: the device name we are mounting.
+
+  void *data: arbitrary mount options, usually comes as an ASCII
+	string
+
+  struct vfsmount *mnt: a VFS-internal representation of a mount point
+
+The get_sb() method must determine if the block device specified
+in dev_name contains a filesystem of the type the method
+supports. On success the method initializes a superblock for that
+filesystem and returns zero; on failure it returns an error.
+
+The most interesting member of the superblock structure that the
+get_sb() method fills in is the "s_op" field. This is a pointer to
+a "struct super_operations" which describes the next level of the
+filesystem implementation.
+
+Usually, a filesystem uses one of the generic get_sb() implementations
+and provides a fill_super() method instead. The generic methods are:
+
+  get_sb_bdev: mount a filesystem residing on a block device
+
+  get_sb_nodev: mount a filesystem that is not backed by a device
+
+  get_sb_single: mount a filesystem which shares the instance between
+  	all mounts
+
+A fill_super() method implementation has the following arguments:
+
+  struct super_block *sb: the superblock structure. The method fill_super()
+  	must initialize this properly.
+
+  void *data: arbitrary mount options, usually comes as an ASCII
+	string
+
+  int silent: whether or not to be silent on error
+
+
+The Superblock Object
+=====================
+
+A superblock object represents a mounted filesystem.
+
+
+struct super_operations
+-----------------------
+
+This describes how the VFS can manipulate the superblock of your
+filesystem. As of kernel 2.6.13, the following members are defined:
+
+struct super_operations {
+        struct inode *(*alloc_inode)(struct super_block *sb);
+        void (*destroy_inode)(struct inode *);
+
+        void (*read_inode) (struct inode *);
+
+        void (*dirty_inode) (struct inode *);
+        int (*write_inode) (struct inode *, int);
+        void (*put_inode) (struct inode *);
+        void (*drop_inode) (struct inode *);
+        void (*delete_inode) (struct inode *);
+        void (*put_super) (struct super_block *);
+        void (*write_super) (struct super_block *);
+        int (*sync_fs)(struct super_block *sb, int wait);
+        void (*write_super_lockfs) (struct super_block *);
+        void (*unlockfs) (struct super_block *);
+        int (*statfs) (struct dentry *, struct kstatfs *);
+        int (*remount_fs) (struct super_block *, int *, char *);
+        void (*clear_inode) (struct inode *);
+        void (*umount_begin) (struct super_block *);
+
+        void (*sync_inodes) (struct super_block *sb,
+                                struct writeback_control *wbc);
+        int (*show_options)(struct seq_file *, struct vfsmount *);
+
+        ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
+        ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
+};
+
+All methods are called without any locks being held, unless otherwise
+noted. This means that most methods can block safely. All methods are
+only called from a process context (i.e. not from an interrupt handler
+or bottom half).
+
+  alloc_inode: this method is called by inode_alloc() to allocate memory
+ 	for struct inode and initialize it.  If this function is not
+ 	defined, a simple 'struct inode' is allocated.  Normally
+ 	alloc_inode will be used to allocate a larger structure which
+ 	contains a 'struct inode' embedded within it.
+
+  destroy_inode: this method is called by destroy_inode() to release
+  	resources allocated for struct inode.  It is only required if
+  	->alloc_inode was defined and simply undoes anything done by
+	->alloc_inode.
+
+  read_inode: this method is called to read a specific inode from the
+        mounted filesystem.  The i_ino member in the struct inode is
+	initialized by the VFS to indicate which inode to read. Other
+	members are filled in by this method.
+
+	You can set this to NULL and use iget5_locked() instead of iget()
+	to read inodes.  This is necessary for filesystems for which the
+	inode number is not sufficient to identify an inode.
+
+  dirty_inode: this method is called by the VFS to mark an inode dirty.
+
+  write_inode: this method is called when the VFS needs to write an
+	inode to disc.  The second parameter indicates whether the write
+	should be synchronous or not; not all filesystems check this flag.
+
+  put_inode: called when the VFS inode is removed from the inode
+	cache.
+
+  drop_inode: called when the last access to the inode is dropped,
+	with the inode_lock spinlock held.
+
+	This method should be either NULL (normal UNIX filesystem
+	semantics) or "generic_delete_inode" (for filesystems that do not
+	want to cache inodes - causing "delete_inode" to always be
+	called regardless of the value of i_nlink)
+
+	The "generic_delete_inode()" behavior is equivalent to the
+	old practice of using "force_delete" in the put_inode() case,
+	but does not have the races that the "force_delete()" approach
+	had. 
+
+  delete_inode: called when the VFS wants to delete an inode
+
+  put_super: called when the VFS wishes to free the superblock
+	(i.e. unmount). This is called with the superblock lock held
+
+  write_super: called when the VFS superblock needs to be written to
+	disc. This method is optional
+
+  sync_fs: called when VFS is writing out all dirty data associated with
+  	a superblock. The second parameter indicates whether the method
+	should wait until the write out has been completed. Optional.
+
+  write_super_lockfs: called when VFS is locking a filesystem and
+  	forcing it into a consistent state.  This method is currently
+  	used by the Logical Volume Manager (LVM).
+
+  unlockfs: called when VFS is unlocking a filesystem and making it writable
+  	again.
+
+  statfs: called when the VFS needs to get filesystem statistics. This
+	is called with the kernel lock held
+
+  remount_fs: called when the filesystem is remounted. This is called
+	with the kernel lock held
+
+  clear_inode: called when the VFS clears the inode. Optional
+
+  umount_begin: called when the VFS is unmounting a filesystem.
+
+  sync_inodes: called when the VFS is writing out dirty data associated with
+  	a superblock.
+
+  show_options: called by the VFS to show mount options for /proc/<pid>/mounts.
+
+  quota_read: called by the VFS to read from filesystem quota file.
+
+  quota_write: called by the VFS to write to filesystem quota file.
+
+The read_inode() method is responsible for filling in the "i_op"
+field. This is a pointer to a "struct inode_operations" which
+describes the methods that can be performed on individual inodes.
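+
+The alloc_inode and destroy_inode descriptions above mention a larger
+structure with a 'struct inode' embedded in it. The pattern can be
+sketched in userspace C as follows. struct toy_inode stands in for the
+VFS inode, and the myfs_ names and MYFS_I() helper are hypothetical.
+
```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for the VFS inode. */
struct toy_inode {
	unsigned long i_ino;
};

/* Filesystem-private inode with the generic inode embedded in it. */
struct myfs_inode {
	int myfs_flags;			/* hypothetical private field */
	struct toy_inode vfs_inode;
};

/* Recover the containing structure from the embedded inode pointer. */
struct myfs_inode *MYFS_I(struct toy_inode *inode)
{
	return (struct myfs_inode *)((char *)inode -
				     offsetof(struct myfs_inode, vfs_inode));
}

/* alloc_inode analogue: allocate the larger structure and hand out
 * a pointer to the embedded generic inode. */
struct toy_inode *myfs_alloc_inode(void)
{
	struct myfs_inode *mi = calloc(1, sizeof(*mi));

	return mi ? &mi->vfs_inode : NULL;
}

/* destroy_inode analogue: undo what myfs_alloc_inode() did. */
void myfs_destroy_inode(struct toy_inode *inode)
{
	assert(inode != NULL);
	free(MYFS_I(inode));
}
```
+
+Generic code only ever sees the embedded inode pointer, while the
+filesystem gets back to its private data through MYFS_I().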
+
+
+The Inode Object
+================
+
+An inode object represents an object within the filesystem.
+
+
+struct inode_operations
+-----------------------
+
+This describes how the VFS can manipulate an inode in your
+filesystem. As of kernel 2.6.13, the following members are defined:
+
+struct inode_operations {
+	int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
+	struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
+	int (*link) (struct dentry *,struct inode *,struct dentry *);
+	int (*unlink) (struct inode *,struct dentry *);
+	int (*symlink) (struct inode *,struct dentry *,const char *);
+	int (*mkdir) (struct inode *,struct dentry *,int);
+	int (*rmdir) (struct inode *,struct dentry *);
+	int (*mknod) (struct inode *,struct dentry *,int,dev_t);
+	int (*rename) (struct inode *, struct dentry *,
+			struct inode *, struct dentry *);
+	int (*readlink) (struct dentry *, char __user *,int);
+        void * (*follow_link) (struct dentry *, struct nameidata *);
+        void (*put_link) (struct dentry *, struct nameidata *, void *);
+	void (*truncate) (struct inode *);
+	int (*permission) (struct inode *, int, struct nameidata *);
+	int (*setattr) (struct dentry *, struct iattr *);
+	int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
+	int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
+	ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
+	ssize_t (*listxattr) (struct dentry *, char *, size_t);
+	int (*removexattr) (struct dentry *, const char *);
+};
+
+Again, all methods are called without any locks being held, unless
+otherwise noted.
+
+  create: called by the open(2) and creat(2) system calls. Only
+	required if you want to support regular files. The dentry you
+	get should not have an inode (i.e. it should be a negative
+	dentry). Here you will probably call d_instantiate() with the
+	dentry and the newly created inode
+
+  lookup: called when the VFS needs to look up an inode in a parent
+	directory. The name to look for is found in the dentry. This
+	method must call d_add() to insert the found inode into the
+	dentry. The "i_count" field in the inode structure should be
+	incremented. If the named inode does not exist a NULL inode
+	should be inserted into the dentry (this is called a negative
+	dentry). Returning an error code from this routine must only
+	be done on a real error, otherwise creating inodes with system
+	calls like creat(2), mknod(2), mkdir(2) and so on will fail.
+	If you wish to overload the dentry methods then you should
+	initialise the "d_op" field in the dentry; this is a pointer
+	to a struct "dentry_operations".
+	This method is called with the directory inode semaphore held
+
+  link: called by the link(2) system call. Only required if you want
+	to support hard links. You will probably need to call
+	d_instantiate() just as you would in the create() method
+
+  unlink: called by the unlink(2) system call. Only required if you
+	want to support deleting inodes
+
+  symlink: called by the symlink(2) system call. Only required if you
+	want to support symlinks. You will probably need to call
+	d_instantiate() just as you would in the create() method
+
+  mkdir: called by the mkdir(2) system call. Only required if you want
+	to support creating subdirectories. You will probably need to
+	call d_instantiate() just as you would in the create() method
+
+  rmdir: called by the rmdir(2) system call. Only required if you want
+	to support deleting subdirectories
+
+  mknod: called by the mknod(2) system call to create a device (char,
+	block) inode or a named pipe (FIFO) or socket. Only required
+	if you want to support creating these types of inodes. You
+	will probably need to call d_instantiate() just as you would
+	in the create() method
+
+  rename: called by the rename(2) system call to rename the object to
+	have the parent and name given by the second inode and dentry.
+
+  readlink: called by the readlink(2) system call. Only required if
+	you want to support reading symbolic links
+
+  follow_link: called by the VFS to follow a symbolic link to the
+	inode it points to.  Only required if you want to support
+	symbolic links.  This method returns a void pointer cookie
+	that is passed to put_link().
+
+  put_link: called by the VFS to release resources allocated by
+  	follow_link().  The cookie returned by follow_link() is passed
+  	to this method as the last parameter.  It is used by
+  	filesystems such as NFS where page cache is not stable
+  	(i.e. page that was installed when the symbolic link walk
+  	started might not be in the page cache at the end of the
+  	walk).
+
+  truncate: called by the VFS to change the size of a file.  The
+ 	i_size field of the inode is set to the desired size by the
+ 	VFS before this method is called.  This method is called by
+ 	the truncate(2) system call and related functionality.
+
+  permission: called by the VFS to check for access rights on a POSIX-like
+  	filesystem.
+
+  setattr: called by the VFS to set attributes for a file. This method
+  	is called by chmod(2) and related system calls.
+
+  getattr: called by the VFS to get attributes of a file. This method
+  	is called by stat(2) and related system calls.
+
+  setxattr: called by the VFS to set an extended attribute for a file.
+  	Extended attribute is a name:value pair associated with an
+  	inode. This method is called by setxattr(2) system call.
+
+  getxattr: called by the VFS to retrieve the value of an extended
+  	attribute name. This method is called by the getxattr(2) system
+  	call.
+
+  listxattr: called by the VFS to list all extended attributes for a
+  	given file. This method is called by listxattr(2) system call.
+
+  removexattr: called by the VFS to remove an extended attribute from
+  	a file. This method is called by removexattr(2) system call.
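+
+The rule that lookup() must report a missing name as a negative dentry
+rather than as an error can be sketched in plain userspace C.  This is
+a toy model only: every sketch_* name below is hypothetical and none of
+it is kernel API.  It shows why a later create() depends on lookup()
+having succeeded with a NULL inode.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-ins for an inode and a dentry; hypothetical, not kernel types. */
struct sketch_inode { int ino; };
struct sketch_dentry {
	const char *name;            /* name to look for */
	struct sketch_inode *inode;  /* NULL means "negative dentry" */
};

/* The single inode this toy directory contains, under the name "present". */
static struct sketch_inode sketch_existing = { 42 };

/*
 * Models lookup(): returns 0 both when the name exists and when it does
 * not.  A negative value is reserved for real errors.
 */
static int sketch_lookup(struct sketch_dentry *d)
{
	if (d->name == NULL)
		return -1;                      /* a real error */
	if (strcmp(d->name, "present") == 0)
		d->inode = &sketch_existing;    /* positive dentry */
	else
		d->inode = NULL;                /* negative dentry, not an error */
	return 0;
}

/*
 * Models create(): only valid on a negative dentry, which it then
 * instantiates with the newly created inode.
 */
static int sketch_create(struct sketch_dentry *d, struct sketch_inode *fresh)
{
	assert(fresh != NULL);
	if (d->inode != NULL)
		return -1;                      /* dentry must be negative */
	d->inode = fresh;
	return 0;
}
```

+If sketch_lookup() returned an error for a missing name, the caller
+would never reach sketch_create(), which mirrors the failure mode
+described for lookup above.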
+
+
+The Address Space Object
+========================
+
+The address space object is used to group and manage pages in the page
+cache.  It can be used to keep track of the pages in a file (or
+anything else) and also track the mapping of sections of the file into
+process address spaces.
+
+There are a number of distinct yet related services that an
+address-space can provide.  These include communicating memory
+pressure, page lookup by address, and keeping track of pages tagged as
+Dirty or Writeback.
+
+The first can be used independently of the others.  The VM can try to
+either write dirty pages in order to clean them, or release clean
+pages in order to reuse them.  To do this it can call the ->writepage
+method on dirty pages, and ->releasepage on clean pages with
+PagePrivate set. Clean pages without PagePrivate and with no external
+references will be released without notice being given to the
+address_space.
+
+To achieve this functionality, pages need to be placed on an LRU with
+lru_cache_add and mark_page_accessed needs to be called whenever the
+page is used.
+
+Pages are normally kept in a radix tree indexed by ->index. This tree
+maintains information about the PG_Dirty and PG_Writeback status of
+each page, so that pages with either of these flags can be found
+quickly.
+
+The Dirty tag is primarily used by mpage_writepages - the default
+->writepages method.  It uses the tag to find dirty pages to call
+->writepage on.  If mpage_writepages is not used (i.e. the address_space
+provides its own ->writepages), the PAGECACHE_TAG_DIRTY tag is
+almost unused.  write_inode_now and sync_inode do use it (through
+__sync_single_inode) to check if ->writepages has been successful in
+writing out the whole address_space.
+
+The Writeback tag is used by filemap*wait* and sync_page* functions,
+via wait_on_page_writeback_range, to wait for all writeback to
+complete.  While waiting ->sync_page (if defined) will be called on
+each page that is found to require writeback.
+
+An address_space handler may attach extra information to a page,
+typically using the 'private' field in the 'struct page'.  If such
+information is attached, the PG_Private flag should be set.  This will
+cause various VM routines to make extra calls into the address_space
+handler to deal with that data.
+
+An address space acts as an intermediary between storage and
+application.  Data is read into the address space a whole page at a
+time, and provided to the application either by copying of the page,
+or by memory-mapping the page.
+Data is written into the address space by the application, and then
+written back to storage, typically in whole pages; however, the
+address_space has finer control of write sizes.
+
+The read process essentially only requires 'readpage'.  The write
+process is more complicated and uses prepare_write/commit_write or
+set_page_dirty to write data into the address_space, and writepage,
+sync_page, and writepages to writeback data to storage.
+
+Adding and removing pages to/from an address_space is protected by the
+inode's i_mutex.
+
+When data is written to a page, the PG_Dirty flag should be set.  It
+typically remains set until writepage asks for it to be written.  This
+should clear PG_Dirty and set PG_Writeback.  It can be actually
+written at any point after PG_Dirty is clear.  Once it is known to be
+safe, PG_Writeback is cleared.
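+
+The flag transitions described above can be sketched as a small state
+model in userspace C.  This is an illustration under simplifying
+assumptions, not kernel code: struct toy_page and the toy_* helpers
+are hypothetical, and the two booleans merely stand in for PG_Dirty
+and PG_Writeback.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the two page flags discussed above; not struct page. */
struct toy_page {
	bool dirty;      /* stands in for PG_Dirty */
	bool writeback;  /* stands in for PG_Writeback */
};

/* Data has been written into the page: mark it dirty. */
static void toy_write_data(struct toy_page *p)
{
	p->dirty = true;
}

/*
 * Models the start of writeout: Dirty is cleared and Writeback is set.
 * Returns false if the page was clean and there was nothing to write.
 */
static bool toy_start_writeout(struct toy_page *p)
{
	if (!p->dirty)
		return false;
	p->dirty = false;
	p->writeback = true;
	return true;
}

/* Models I/O completion: the data is known to be safe on storage. */
static void toy_end_writeout(struct toy_page *p)
{
	assert(p->writeback);
	p->writeback = false;
}
```

+Note that in this model a page can be dirtied again while writeback is
+still in flight, in which case both flags are set until the earlier
+write completes.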
+
+Writeback makes use of a writeback_control structure...
+
+struct address_space_operations
+-------------------------------
+
+This describes how the VFS can manipulate mapping of a file to page cache in
+your filesystem. As of kernel 2.6.16, the following members are defined:
+
+struct address_space_operations {
+	int (*writepage)(struct page *page, struct writeback_control *wbc);
+	int (*readpage)(struct file *, struct page *);
+	int (*sync_page)(struct page *);
+	int (*writepages)(struct address_space *, struct writeback_control *);
+	int (*set_page_dirty)(struct page *page);
+	int (*readpages)(struct file *filp, struct address_space *mapping,
+			struct list_head *pages, unsigned nr_pages);
+	int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
+	int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
+	sector_t (*bmap)(struct address_space *, sector_t);
+	int (*invalidatepage) (struct page *, unsigned long);
+	int (*releasepage) (struct page *, int);
+	ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
+			loff_t offset, unsigned long nr_segs);
+	struct page* (*get_xip_page)(struct address_space *, sector_t,
+			int);
+	/* migrate the contents of a page to the specified target */
+	int (*migratepage) (struct page *, struct page *);
+};
+
+  writepage: called by the VM to write a dirty page to backing store.
+      This may happen for data integrity reasons (i.e. 'sync'), or
+      to free up memory (flush).  The difference can be seen in
+      wbc->sync_mode.
+      The PG_Dirty flag has been cleared and PageLocked is true.
+      writepage should start writeout, should set PG_Writeback,
+      and should make sure the page is unlocked, either synchronously
+      or asynchronously when the write operation completes.
+
+      If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
+      try too hard if there are problems, and may choose to write out
+      other pages from the mapping if that is easier (e.g. due to
+      internal dependencies).  If it chooses not to start writeout, it
+      should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
+      calling ->writepage on that page.
+
+      See the file "Locking" for more details.
+
+  readpage: called by the VM to read a page from backing store.
+       The page will be Locked when readpage is called, and should be
+       unlocked and marked uptodate once the read completes.
+       If ->readpage discovers that it needs to unlock the page for
+       some reason, it can do so, and then return AOP_TRUNCATED_PAGE.
+       In this case, the page will be looked up again, relocked and if
+       that all succeeds, ->readpage will be called again.
+
+  sync_page: called by the VM to notify the backing store to perform all
+  	queued I/O operations for a page. I/O operations for other pages
+	associated with this address_space object may also be performed.
+
+	This function is optional and is called only for pages with
+  	PG_Writeback set while waiting for the writeback to complete.
+
+  writepages: called by the VM to write out pages associated with the
+  	address_space object.  If wbc->sync_mode is WB_SYNC_ALL, then
+  	the writeback_control will specify a range of pages that must be
+  	written out.  If it is WB_SYNC_NONE, then a nr_to_write is given
+	and that many pages should be written if possible.
+	If no ->writepages is given, then mpage_writepages is used
+  	instead.  This will choose pages from the address space that are
+  	tagged as DIRTY and will pass them to ->writepage.
+
+  set_page_dirty: called by the VM to set a page dirty.
+        This is particularly needed if an address space attaches
+        private data to a page, and that data needs to be updated when
+        a page is dirtied.  This is called, for example, when a memory
+	mapped page gets modified.
+	If defined, it should set the PageDirty flag, and the
+        PAGECACHE_TAG_DIRTY tag in the radix tree.
+
+  readpages: called by the VM to read pages associated with the address_space
+  	object. This is essentially just a vector version of
+  	readpage.  Instead of just one page, several pages are
+  	requested.
+	readpages is only used for read-ahead, so read errors are
+  	ignored.  If anything goes wrong, feel free to give up.
+
+  prepare_write: called by the generic write path in VM to set up a write
+  	request for a page.  This indicates to the address space that
+  	the given range of bytes is about to be written.  The
+  	address_space should check that the write will be able to
+  	complete, by allocating space if necessary and doing any other
+  	internal housekeeping.  If the write will update parts of
+  	any basic-blocks on storage, then those blocks should be
+  	pre-read (if they haven't been read already) so that the
+  	updated blocks can be written out properly.
+	The page will be locked.  If prepare_write wants to unlock the
+  	page it, like readpage, may do so and return
+  	AOP_TRUNCATED_PAGE.
+	In this case the prepare_write will be retried once the lock is
+  	regained.
+
+	Note: the page _must not_ be marked uptodate in this function
+	(or anywhere else) unless it actually is uptodate right now. As
+	soon as a page is marked uptodate, it is possible for a concurrent
+	read(2) to copy it to userspace.
+
+  commit_write: If prepare_write succeeds, new data will be copied
+        into the page and then commit_write will be called.  It will
+        typically update the size of the file (if appropriate) and
+        mark the inode as dirty, and do any other related housekeeping
+        operations.  It should avoid returning an error if possible -
+        errors should have been handled by prepare_write.
+
+  bmap: called by the VFS to map a logical block offset within object to
+  	physical block number. This method is used by the FIBMAP
+  	ioctl and for working with swap-files.  To be able to swap to
+  	a file, the file must have a stable mapping to a block
+  	device.  The swap system does not go through the filesystem
+  	but instead uses bmap to find out where the blocks in the file
+  	are and uses those addresses directly.
+
+
+  invalidatepage: If a page has PagePrivate set, then invalidatepage
+        will be called when part or all of the page is to be removed
+	from the address space.  This generally corresponds to either a
+	truncation or a complete invalidation of the address space
+	(in the latter case 'offset' will always be 0).
+	Any private data associated with the page should be updated
+	to reflect this truncation.  If offset is 0, then
+	the private data should be released, because the page
+	must be able to be completely discarded.  This may be done by
+        calling the ->releasepage function, but in this case the
+        release MUST succeed.
+
+  releasepage: releasepage is called on PagePrivate pages to indicate
+        that the page should be freed if possible.  ->releasepage
+        should remove any private data from the page and clear the
+        PagePrivate flag.  It may also remove the page from the
+        address_space.  If this fails for some reason, it may indicate
+        failure with a 0 return value.
+	This is used in two distinct though related cases.  The first
+        is when the VM finds a clean page with no active users and
+        wants to make it a free page.  If ->releasepage succeeds, the
+        page will be removed from the address_space and become free.
+
+	The second case is when a request has been made to invalidate
+        some or all pages in an address_space.  This can happen
+        through the fadvise(POSIX_FADV_DONTNEED) system call or by the
+        filesystem explicitly requesting it as nfs and 9fs do (when
+        they believe the cache may be out of date with storage) by
+        calling invalidate_inode_pages2().
+	If the filesystem makes such a call, and needs to be certain
+        that all pages are invalidated, then its releasepage will
+        need to ensure this.  Possibly it can clear the PageUptodate
+        bit if it cannot free private data yet.
+
+  direct_IO: called by the generic read/write routines to perform
+        direct_IO - that is IO requests which bypass the page cache
+        and transfer data directly between the storage and the
+        application's address space.
+
+  get_xip_page: called by the VM to translate a block number to a page.
+	The page is valid until the corresponding filesystem is unmounted.
+	Filesystems that want to use execute-in-place (XIP) need to implement
+	it.  An example implementation can be found in fs/ext2/xip.c.
+
+  migratepage:  This is used to compact the physical memory usage.
+        If the VM wants to relocate a page (maybe off a memory card
+        that is signalling imminent failure) it will pass a new page
+	and an old page to this function.  migratepage should
+	transfer any private data across and update any references
+        that it has to the page.
+
+The File Object
+===============
+
+A file object represents a file opened by a process.
+
+
+struct file_operations
+----------------------
+
+This describes how the VFS can manipulate an open file. As of kernel
+2.6.17, the following members are defined:
+
+struct file_operations {
+	loff_t (*llseek) (struct file *, loff_t, int);
+	ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
+	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
+	ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
+	ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
+	int (*readdir) (struct file *, void *, filldir_t);
+	unsigned int (*poll) (struct file *, struct poll_table_struct *);
+	int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
+	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
+	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
+	int (*mmap) (struct file *, struct vm_area_struct *);
+	int (*open) (struct inode *, struct file *);
+	int (*flush) (struct file *);
+	int (*release) (struct inode *, struct file *);
+	int (*fsync) (struct file *, struct dentry *, int datasync);
+	int (*aio_fsync) (struct kiocb *, int datasync);
+	int (*fasync) (int, struct file *, int);
+	int (*lock) (struct file *, int, struct file_lock *);
+	ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
+	ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
+	ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *);
+	ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
+	unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
+	int (*check_flags)(int);
+	int (*dir_notify)(struct file *filp, unsigned long arg);
+	int (*flock) (struct file *, int, struct file_lock *);
+	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, size_t, unsigned int);
+	ssize_t (*splice_read)(struct file *, struct pipe_inode_info *, size_t, unsigned int);
+};
+
+Again, all methods are called without any locks being held, unless
+otherwise noted.
+
+  llseek: called when the VFS needs to move the file position index
+
+  read: called by read(2) and related system calls
+
+  aio_read: called by io_submit(2) and other asynchronous I/O operations
+
+  write: called by write(2) and related system calls
+
+  aio_write: called by io_submit(2) and other asynchronous I/O operations
+
+  readdir: called when the VFS needs to read the directory contents
+
+  poll: called by the VFS when a process wants to check if there is
+	activity on this file and (optionally) go to sleep until there
+	is activity. Called by the select(2) and poll(2) system calls
+
+  ioctl: called by the ioctl(2) system call
+
+  unlocked_ioctl: called by the ioctl(2) system call. Filesystems that do not
+  	require the BKL should use this method instead of the ioctl() above.
+
+  compat_ioctl: called by the ioctl(2) system call when 32 bit system calls
+ 	 are used on 64 bit kernels.
+
+  mmap: called by the mmap(2) system call
+
+  open: called by the VFS when an inode should be opened. When the VFS
+	opens a file, it creates a new "struct file". It then calls the
+	open method for the newly allocated file structure. You might
+	think that the open method really belongs in
+	"struct inode_operations", and you may be right. I think it's
+	done the way it is because it makes filesystems simpler to
+	implement. The open() method is a good place to initialize the
+	"private_data" member in the file structure if you want to point
+	to a device structure
+
+  flush: called by the close(2) system call to flush a file
+
+  release: called when the last reference to an open file is closed
+
+  fsync: called by the fsync(2) system call
+
+  fasync: called by the fcntl(2) system call when asynchronous
+	(non-blocking) mode is enabled for a file
+
+  lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW
+  	commands
+
+  readv: called by the readv(2) system call
+
+  writev: called by the writev(2) system call
+
+  sendfile: called by the sendfile(2) system call
+
+  get_unmapped_area: called by the mmap(2) system call
+
+  check_flags: called by the fcntl(2) system call for F_SETFL command
+
+  dir_notify: called by the fcntl(2) system call for F_NOTIFY command
+
+  flock: called by the flock(2) system call
+
+  splice_write: called by the VFS to splice data from a pipe to a file. This
+		method is used by the splice(2) system call
+
+  splice_read: called by the VFS to splice data from file to a pipe. This
+	       method is used by the splice(2) system call
+
+Note that the file operations are implemented by the specific
+filesystem in which the inode resides. When opening a device node
+(character or block special) most filesystems will call special
+support routines in the VFS which will locate the required device
+driver information. These support routines replace the filesystem file
+operations with those for the device driver, and then proceed to call
+the new open() method for the file. This is how opening a device file
+in the filesystem eventually ends up calling the device driver open()
+method.
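+
+The hand-off described above, where opening a device node swaps the
+filesystem's file operations for the driver's and then calls the new
+open(), can be sketched as follows.  This is a userspace toy model:
+struct toy_file, struct toy_file_ops and every toy_* function are
+hypothetical names chosen for illustration, not the kernel's types.

```c
#include <assert.h>
#include <stddef.h>

struct toy_file;

/* Toy operations table with only the one method this sketch needs. */
struct toy_file_ops {
	int (*open)(struct toy_file *);
};

struct toy_file {
	const struct toy_file_ops *f_op;  /* current operations table */
	int opened_by_driver;             /* set by the driver's open() */
};

/* The device driver's open() method. */
static int toy_driver_open(struct toy_file *f)
{
	f->opened_by_driver = 1;
	return 0;
}

static const struct toy_file_ops toy_driver_fops = {
	.open = toy_driver_open,
};

/*
 * Stands in for the support routine a filesystem installs for device
 * nodes: it replaces the file's operations with the driver's and then
 * calls the new open() method.  The driver lookup is hard-wired here.
 */
static int toy_special_open(struct toy_file *f)
{
	f->f_op = &toy_driver_fops;
	assert(f->f_op != NULL);
	if (f->f_op->open)
		return f->f_op->open(f);
	return 0;
}

/* Operations the toy filesystem attaches to a device-node inode. */
static const struct toy_file_ops toy_fs_special_fops = {
	.open = toy_special_open,
};

/* Models the generic open path: call whatever open() is installed. */
static int toy_vfs_open(struct toy_file *f)
{
	return f->f_op->open ? f->f_op->open(f) : 0;
}
```

+After toy_vfs_open() returns, the file's f_op points at the driver's
+table, so later operations on the file would go straight to the
+driver without passing through the filesystem again.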
+
+
+Directory Entry Cache (dcache)
+==============================
+
+
+struct dentry_operations
+------------------------
+
+This describes how a filesystem can overload the standard dentry
+operations. Dentries and the dcache are the domain of the VFS and the
+individual filesystem implementations. Device drivers have no business
+here. These methods may be set to NULL, as they are either optional or
+the VFS uses a default. As of kernel 2.6.13, the following members are
+defined:
+
+struct dentry_operations {
+	int (*d_revalidate)(struct dentry *, struct nameidata *);
+	int (*d_hash) (struct dentry *, struct qstr *);
+	int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
+	int (*d_delete)(struct dentry *);
+	void (*d_release)(struct dentry *);
+	void (*d_iput)(struct dentry *, struct inode *);
+};
+
+  d_revalidate: called when the VFS needs to revalidate a dentry. This
+	is called whenever a name look-up finds a dentry in the
+	dcache. Most filesystems leave this as NULL, because all their
+	dentries in the dcache are valid
+
+  d_hash: called when the VFS adds a dentry to the hash table
+
+  d_compare: called when a dentry should be compared with another
+
+  d_delete: called when the last reference to a dentry is
+	deleted. This means no-one is using the dentry, however it is
+	still valid and in the dcache
+
+  d_release: called when a dentry is really deallocated
+
+  d_iput: called when a dentry loses its inode (just prior to its
+	being deallocated). The default when this is NULL is that the
+	VFS calls iput(). If you define this method, you must call
+	iput() yourself
+
+Each dentry has a pointer to its parent dentry, as well as a hash list
+of child dentries. Child dentries are basically like files in a
+directory.
+
+
+Directory Entry Cache API
+--------------------------
+
+There are a number of functions defined which permit a filesystem to
+manipulate dentries:
+
+  dget: open a new handle for an existing dentry (this just increments
+	the usage count)
+
+  dput: close a handle for a dentry (decrements the usage count). If
+	the usage count drops to 0, the "d_delete" method is called
+	and the dentry is placed on the unused list if the dentry is
+	still in its parent's hash list. Putting the dentry on the
+	unused list just means that if the system needs some RAM, it
+	goes through the unused list of dentries and deallocates them.
+	If the dentry has already been unhashed and the usage count
+	drops to 0, in this case the dentry is deallocated after the
+	"d_delete" method is called
+
+  d_drop: this unhashes a dentry from its parent's hash list. A
+	subsequent call to dput() will deallocate the dentry if its
+	usage count drops to 0
+
+  d_delete: delete a dentry. If there are no other open references to
+	the dentry then the dentry is turned into a negative dentry
+	(the d_iput() method is called). If there are other
+	references, then d_drop() is called instead
+
+  d_add: add a dentry to its parent's hash list and then calls
+	d_instantiate()
+
+  d_instantiate: add a dentry to the alias hash list for the inode and
+	updates the "d_inode" member. The "i_count" member in the
+	inode structure should be set/incremented. If the inode
+	pointer is NULL, the dentry is called a "negative
+	dentry". This function is commonly called when an inode is
+	created for an existing negative dentry
+
+  d_lookup: look up a dentry given its parent and path name component.
+	It looks up the child of that given name from the dcache
+	hash table. If it is found, the reference count is incremented
+	and the dentry is returned. The caller must use dput()
+	to free the dentry when it finishes using it.
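+
+The dget/dput/d_drop rules above amount to a small reference-counting
+state machine, which can be sketched in userspace C.  This is a
+simplified model, not the kernel implementation: struct model_dentry
+and the model_* helpers are hypothetical, and the "unused list" is
+represented only by a state value.

```c
#include <assert.h>
#include <stdbool.h>

/* Where a model dentry currently lives. */
enum model_state { MODEL_IN_USE, MODEL_UNUSED_LIST, MODEL_FREED };

/* Toy stand-in for a dentry; hypothetical, not struct dentry. */
struct model_dentry {
	int count;             /* usage count */
	bool hashed;           /* still in its parent's hash list? */
	int d_delete_calls;    /* times the "d_delete" method was called */
	enum model_state state;
};

/* dget: open a new handle, which just increments the usage count. */
static void model_dget(struct model_dentry *d)
{
	assert(d->state != MODEL_FREED);
	d->count++;
	d->state = MODEL_IN_USE;
}

/* d_drop: unhash the dentry from its parent's hash list. */
static void model_d_drop(struct model_dentry *d)
{
	d->hashed = false;
}

/*
 * dput: close a handle.  When the count reaches 0 the "d_delete" method
 * is called; a hashed dentry then moves to the unused list, while an
 * unhashed one is deallocated.
 */
static void model_dput(struct model_dentry *d)
{
	assert(d->state == MODEL_IN_USE && d->count > 0);
	if (--d->count > 0)
		return;
	d->d_delete_calls++;
	d->state = d->hashed ? MODEL_UNUSED_LIST : MODEL_FREED;
}
```

+In this model a dentry on the unused list can be revived with
+model_dget(), whereas a freed one cannot be touched again.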
+
+For further information on dentry locking, please refer to the document
+Documentation/filesystems/dentry-locking.txt.
+
+
+Resources
+=========
+
+(Note some of these resources are not up-to-date with the latest kernel
+ version.)
+
+Creating Linux virtual filesystems. 2002
+    <http://lwn.net/Articles/13325/>
+
+The Linux Virtual File-system Layer by Neil Brown. 1999
+    <http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html>
+
+A tour of the Linux VFS by Michael K. Johnson. 1996
+    <http://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html>
+
+A small trail through the Linux kernel by Andries Brouwer. 2001
+    <http://www.win.tue.nl/~aeb/linux/vfs/trail.html>
--- a/Documentation/filesystems/xfs.txt
+++ b/Documentation/filesystems/xfs.txt
@@ -0,0 +1,262 @@
+
+The SGI XFS Filesystem
+======================
+
+XFS is a high performance journaling filesystem which originated
+on the SGI IRIX platform.  It is completely multi-threaded, can
+support large files and large filesystems, extended attributes,
+variable block sizes, is extent based, and makes extensive use of
+Btrees (directories, extents, free space) to aid both performance
+and scalability.
+
+Refer to the documentation at http://oss.sgi.com/projects/xfs/
+for further details.  This implementation is on-disk compatible
+with the IRIX version of XFS.
+
+
+Mount Options
+=============
+
+When mounting an XFS filesystem, the following options are accepted.
+
+  allocsize=size
+	Sets the buffered I/O end-of-file preallocation size when
+	doing delayed allocation writeout (default size is 64KiB).
+	Valid values for this option are page size (typically 4KiB)
+	through to 1GiB, inclusive, in power-of-2 increments.
+
+  attr2/noattr2
+	These options enable/disable an "opportunistic" improvement
+	in the way inline extended attributes are stored on-disk (the
+	default is disabled, for on-disk backward compatibility).
+	When the new form is used for the first time (by setting or
+	removing extended attributes) the on-disk superblock feature
+	bit field will be updated to reflect this format being in use.
+
+  barrier
+	Enables the use of block layer write barriers for writes into
+	the journal and unwritten extent conversion.  This allows for
+	drive level write caching to be enabled, for devices that
+	support write barriers.
+
+  dmapi
+	Enable the DMAPI (Data Management API) event callouts.
+	Use with the "mtpt" option.
+
+  grpid/bsdgroups and nogrpid/sysvgroups
+	These options define what group ID a newly created file gets.
+	When grpid is set, it takes the group ID of the directory in
+	which it is created; otherwise (the default) it takes the fsgid
+	of the current process, unless the directory has the setgid bit
+	set, in which case it takes the gid from the parent directory,
+	and also gets the setgid bit set if it is a directory itself.
+
+  ihashsize=value
+	Sets the number of hash buckets available for hashing the
+	in-memory inodes of the specified mount point.  If a value
+	of zero is used, the value selected by the default algorithm
+	will be displayed in /proc/mounts.
+
+  ikeep/noikeep
+	When inode clusters are emptied of inodes, keep them around
+	on the disk (ikeep) - this is the traditional XFS behaviour
+	and is still the default for now.  Using the noikeep option,
+	inode clusters are returned to the free space pool.
+
+  inode64
+	Indicates that XFS is allowed to create inodes at any location
+	in the filesystem, including those which will result in inode
+	numbers occupying more than 32 bits of significance.  This is
+	provided for backwards compatibility, but causes problems for
+	backup applications that cannot handle large inode numbers.
+
+  largeio/nolargeio
+	If "nolargeio" is specified, the optimal I/O reported in
+	st_blksize by stat(2) will be as small as possible to allow user
+	applications to avoid inefficient read/modify/write I/O.
+	If "largeio" is specified, a filesystem that has a "swidth" specified
+	will return the "swidth" value (in bytes) in st_blksize. If the
+	filesystem does not have a "swidth" specified but does specify
+	an "allocsize" then "allocsize" (in bytes) will be returned
+	instead.
+	If neither of these two options is specified, then the filesystem
+	will behave as if "nolargeio" was specified.
+
+  logbufs=value
+	Set the number of in-memory log buffers.  Valid numbers range
+	from 2-8 inclusive.
+	The default value is 8 buffers for filesystems with a
+	blocksize of 64KiB, 4 buffers for filesystems with a blocksize
+	of 32KiB, 3 buffers for filesystems with a blocksize of 16KiB
+	and 2 buffers for all other configurations.  Increasing the
+	number of buffers may increase performance on some workloads
+	at the cost of the memory used for the additional log buffers
+	and their associated control structures.
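The default selection described above can be written down as a small lookup. This is only a sketch of the documented rule, not the kernel's code; the function names default_logbufs and is_valid_logbufs are hypothetical.

```python
KIB = 1024

# Documented defaults, keyed by filesystem block size in bytes.
_LOGBUFS_BY_BLOCKSIZE = {64 * KIB: 8, 32 * KIB: 4, 16 * KIB: 3}


def default_logbufs(blocksize_bytes):
    """Return the documented default number of log buffers.

    Block sizes of 64KiB, 32KiB and 16KiB map to 8, 4 and 3 buffers;
    every other configuration falls back to 2.
    """
    return _LOGBUFS_BY_BLOCKSIZE.get(blocksize_bytes, 2)


def is_valid_logbufs(value):
    """Return True if value lies in the documented range of 2 to 8."""
    return 2 <= value <= 8
```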
+
+  logbsize=value
+	Set the size of each in-memory log buffer.
+	Size may be specified in bytes, or in kilobytes with a "k" suffix.
+	Valid sizes for version 1 and version 2 logs are 16384 (16k) and
+	32768 (32k).  Valid sizes for version 2 logs also include
+	65536 (64k), 131072 (128k) and 262144 (256k).
+	The default value for machines with more than 32MiB of memory
+	is 32768; machines with less memory use 16384 by default.
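The size rules above can be checked before a value is passed to mount. The sketch below encodes only what the text states (plain bytes or a lowercase "k" suffix, and the per-log-version size lists); the function names are hypothetical and the real option parser may accept or reject other spellings.

```python
# Sizes valid for both log versions, and the larger set for version 2.
V1_SIZES = {16384, 32768}
V2_SIZES = V1_SIZES | {65536, 131072, 262144}


def parse_logbsize(text):
    """Convert a logbsize string such as "32768" or "32k" to bytes."""
    if text.endswith("k"):
        return int(text[:-1]) * 1024
    return int(text)


def is_valid_logbsize(text, log_version):
    """Return True if text is a documented size for the given log version."""
    allowed = V2_SIZES if log_version == 2 else V1_SIZES
    return parse_logbsize(text) in allowed
```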
+
+  logdev=device and rtdev=device
+	Use an external log (metadata journal) and/or real-time device.
+	An XFS filesystem has up to three parts: a data section, a log
+	section, and a real-time section.  The real-time section is
+	optional, and the log section can be separate from the data
+	section or contained within it.
+
+  mtpt=mountpoint
+	Use with the "dmapi" option.  The value specified here will be
+	included in the DMAPI mount event, and should be the path of
+	the actual mountpoint that is used.
+
+  noalign
+	Data allocations will not be aligned at stripe unit boundaries.
+
+  noatime
+	Access timestamps are not updated when a file is read.
+
+  norecovery
+	The filesystem will be mounted without running log recovery.
+	If the filesystem was not cleanly unmounted, it is likely to
+	be inconsistent when mounted in "norecovery" mode.
+	Some files or directories may not be accessible because of this.
+	Filesystems mounted "norecovery" must be mounted read-only or
+	the mount will fail.
+
+  nouuid
+	Don't check for double mounted file systems using the file system uuid.
+	This is useful to mount LVM snapshot volumes.
+
+  osyncisosync
+	Make O_SYNC writes implement true O_SYNC.  WITHOUT this option,
+	Linux XFS behaves as if an "osyncisdsync" option is used,
+	which will make writes to files opened with the O_SYNC flag set
+	behave as if the O_DSYNC flag had been used instead.
+	This can result in better performance without compromising
+	data safety.
+	However, if this option is not in effect, timestamp updates from
+	O_SYNC writes can be lost if the system crashes.
+	If timestamp updates are critical, use the osyncisosync option.
+
+  uquota/usrquota/uqnoenforce/quota
+	User disk quota accounting enabled, and limits (optionally)
+	enforced.  Refer to xfs_quota(8) for further details.
+
+  gquota/grpquota/gqnoenforce
+	Group disk quota accounting enabled and limits (optionally)
+	enforced.  Refer to xfs_quota(8) for further details.
+
+  pquota/prjquota/pqnoenforce
+	Project disk quota accounting enabled and limits (optionally)
+	enforced.  Refer to xfs_quota(8) for further details.
+
+  sunit=value and swidth=value
+	Used to specify the stripe unit and width for a RAID device or
+	a stripe volume.  "value" must be specified in 512-byte block
+	units.
+	If this option is not specified and the filesystem was made on
+	a stripe volume or the stripe width or unit were specified for
+	the RAID device at mkfs time, then the mount system call will
+	restore the value from the superblock.  For filesystems that
+	are made directly on RAID devices, these options can be used
+	to override the information in the superblock if the underlying
+	disk layout changes after the filesystem has been created.
+	The "swidth" option is required if the "sunit" option has been
+	specified, and must be a multiple of the "sunit" value.
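Because both values are given in 512-byte units, a conversion from the byte sizes of the array is usually needed. The sketch below shows that arithmetic. Deriving swidth as the stripe unit multiplied by the number of data disks is an assumption about the array layout, not something stated above; the text only requires swidth to be a multiple of sunit. The function name stripe_mount_options is hypothetical.

```python
SECTOR_BYTES = 512  # sunit and swidth are expressed in 512-byte units


def stripe_mount_options(stripe_unit_bytes, data_disks):
    """Build a "sunit=...,swidth=..." option string.

    stripe_unit_bytes is the per-disk stripe unit in bytes; data_disks
    is the assumed number of data-bearing disks in the stripe.
    """
    if stripe_unit_bytes <= 0 or stripe_unit_bytes % SECTOR_BYTES:
        raise ValueError("stripe unit must be a positive multiple of 512")
    if data_disks < 1:
        raise ValueError("at least one data disk is required")
    sunit = stripe_unit_bytes // SECTOR_BYTES
    swidth = sunit * data_disks  # always a multiple of sunit
    return "sunit=%d,swidth=%d" % (sunit, swidth)
```

For a 64KiB stripe unit across four data disks, the helper yields sunit=128 and swidth=512.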
+
+  swalloc
+	Data allocations will be rounded up to stripe width boundaries
+	when the current end of file is being extended and the file
+	size is larger than the stripe width size.
+
+
+sysctls
+=======
+
+The following sysctls are available for the XFS filesystem:
+
+  fs.xfs.stats_clear		(Min: 0  Default: 0  Max: 1)
+	Setting this to "1" clears accumulated XFS statistics
+	in /proc/fs/xfs/stat.  It then immediately resets to "0".
+
+  fs.xfs.xfssyncd_centisecs	(Min: 100  Default: 3000  Max: 720000)
+	The interval at which the xfssyncd thread flushes metadata
+	out to disk.  This thread will flush log activity out, and
+	do some processing on unlinked inodes.
+
+  fs.xfs.xfsbufd_centisecs	(Min: 50  Default: 100  Max: 3000)
+	The interval at which xfsbufd scans the dirty metadata buffers list.
+
+  fs.xfs.age_buffer_centisecs	(Min: 100  Default: 1500  Max: 720000)
+	The age at which xfsbufd flushes dirty metadata buffers to disk.
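The three interval sysctls above are expressed in centiseconds (hundredths of a second). The sketch below converts a value in seconds and checks it against the documented limits; the table is copied from the entries above and the function name seconds_to_centisecs is hypothetical.

```python
# (min, default, max) in centiseconds, copied from the list above.
CENTISEC_SYSCTLS = {
    "fs.xfs.xfssyncd_centisecs": (100, 3000, 720000),
    "fs.xfs.xfsbufd_centisecs": (50, 100, 3000),
    "fs.xfs.age_buffer_centisecs": (100, 1500, 720000),
}


def seconds_to_centisecs(name, seconds):
    """Convert seconds to centiseconds and enforce the documented range."""
    low, _default, high = CENTISEC_SYSCTLS[name]
    value = round(seconds * 100)
    if not low <= value <= high:
        raise ValueError("%s must be between %d and %d" % (name, low, high))
    return value
```

The default of 3000 for fs.xfs.xfssyncd_centisecs therefore corresponds to 30 seconds.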
+
+  fs.xfs.error_level		(Min: 0  Default: 3  Max: 11)
+	A volume knob for error reporting when internal errors occur.
+	This will generate detailed messages & backtraces for filesystem
+	shutdowns, for example.  Current threshold values are:
+
+		XFS_ERRLEVEL_OFF:       0
+		XFS_ERRLEVEL_LOW:       1
+		XFS_ERRLEVEL_HIGH:      5
+
+  fs.xfs.panic_mask		(Min: 0  Default: 0  Max: 127)
+	Causes certain error conditions to call BUG(). Value is a bitmask;
+	OR together the tags which represent errors which should cause panics:
+
+		XFS_NO_PTAG                     0
+		XFS_PTAG_IFLUSH                 0x00000001
+		XFS_PTAG_LOGRES                 0x00000002
+		XFS_PTAG_AILDELETE              0x00000004
+		XFS_PTAG_ERROR_REPORT           0x00000008
+		XFS_PTAG_SHUTDOWN_CORRUPT       0x00000010
+		XFS_PTAG_SHUTDOWN_IOERROR       0x00000020
+		XFS_PTAG_SHUTDOWN_LOGERROR      0x00000040
+
+	This option is intended for debugging only.
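Combining tags into a mask is a bitwise OR of the values listed above. The sketch below computes such a mask; the constants are copied from the table, while the helper name panic_mask is hypothetical.

```python
XFS_NO_PTAG = 0
XFS_PTAG_IFLUSH = 0x00000001
XFS_PTAG_LOGRES = 0x00000002
XFS_PTAG_AILDELETE = 0x00000004
XFS_PTAG_ERROR_REPORT = 0x00000008
XFS_PTAG_SHUTDOWN_CORRUPT = 0x00000010
XFS_PTAG_SHUTDOWN_IOERROR = 0x00000020
XFS_PTAG_SHUTDOWN_LOGERROR = 0x00000040

PANIC_MASK_MAX = 127  # documented maximum for fs.xfs.panic_mask


def panic_mask(*tags):
    """OR the given tags together and check the documented range."""
    mask = XFS_NO_PTAG
    for tag in tags:
        mask |= tag
    if not 0 <= mask <= PANIC_MASK_MAX:
        raise ValueError("mask out of range: %d" % mask)
    return mask
```

Selecting XFS_PTAG_SHUTDOWN_CORRUPT and XFS_PTAG_SHUTDOWN_IOERROR gives 0x30, that is, 48 in decimal.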
+
+  fs.xfs.irix_symlink_mode	(Min: 0  Default: 0  Max: 1)
+	Controls whether symlinks are created with mode 0777 (default)
+	or whether their mode is affected by the umask (irix mode).
+
+  fs.xfs.irix_sgid_inherit	(Min: 0  Default: 0  Max: 1)
+	Controls files created in SGID directories.
+	If the group ID of the new file does not match the effective group
+	ID or one of the supplementary group IDs of the parent dir, the
+	ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl
+	is set.
+
+  fs.xfs.restrict_chown		(Min: 0  Default: 1  Max: 1)
+	Controls whether unprivileged users can use chown to "give away"
+	a file to another user.
+
+  fs.xfs.inherit_sync		(Min: 0  Default: 1  Max: 1)
+	Setting this to "1" will cause the "sync" flag set
+	by the xfs_io(8) chattr command on a directory to be
+	inherited by files in that directory.
+
+  fs.xfs.inherit_nodump		(Min: 0  Default: 1  Max: 1)
+	Setting this to "1" will cause the "nodump" flag set
+	by the xfs_io(8) chattr command on a directory to be
+	inherited by files in that directory.
+
+  fs.xfs.inherit_noatime	(Min: 0  Default: 1  Max: 1)
+	Setting this to "1" will cause the "noatime" flag set
+	by the xfs_io(8) chattr command on a directory to be
+	inherited by files in that directory.
+
+  fs.xfs.inherit_nosymlinks	(Min: 0  Default: 1  Max: 1)
+	Setting this to "1" will cause the "nosymlinks" flag set
+	by the xfs_io(8) chattr command on a directory to be
+	inherited by files in that directory.
+
+  fs.xfs.rotorstep		(Min: 1  Default: 1  Max: 256)
+	In "inode32" allocation mode, this option determines how many
+	files the allocator attempts to allocate in the same allocation
+	group before moving to the next allocation group.  The intent
+	is to control the rate at which the allocator moves between
+	allocation groups when allocating extents for new files.
--- a/Documentation/filesystems/xip.txt
+++ b/Documentation/filesystems/xip.txt
@@ -0,0 +1,67 @@
+Execute-in-place for file mappings
+----------------------------------
+
+Motivation
+----------
+File mappings are performed by mapping page cache pages to userspace. In
+addition, read&write type file operations also transfer data from/to the page
+cache.
+
+For memory backed storage devices that use the block device interface, the page
+cache pages are in fact copies of the original storage. Various approaches
+exist to work around the need for an extra copy. The ramdisk driver, for
+example, reads the data into the page cache, keeps a reference, and later
+discards the original data.
+
+Execute-in-place solves this issue the other way around: instead of keeping
+data in the page cache, the need to have a page cache copy is eliminated
+completely. With execute-in-place, read&write type operations are performed
+directly from/to the memory backed storage device. For file mappings, the
+storage device itself is mapped directly into userspace.
+
+This implementation was initially written for shared memory segments between
+different virtual machines on s390 hardware to allow multiple machines to
+share the same binaries and libraries.
+
+Implementation
+--------------
+Execute-in-place is implemented in three steps: block device operation,
+address space operation, and file operations.
+
+A block device operation named direct_access is used to retrieve a
+reference (pointer) to a block on-disk. The reference is supposed to be a
+cpu-addressable physical address and to remain valid until the release
+operation is performed. A struct block_device reference is used to address
+the device, and a sector_t argument is used to identify the individual
+block. As an alternative, memory technology devices can be used for this.
+
+The block device operation is optional. The following block devices support
+it as of today:
+- dcssblk: s390 dcss block device driver
+
+An address space operation named get_xip_page is used to retrieve a reference
+to a struct page. To address the target page, a reference to an address_space
+and a sector number are provided. A third argument indicates whether the
+function should allocate blocks if needed.
+
+This address space operation is mutually exclusive with readpage and
+writepage, which perform page cache read/write operations.
+The following filesystems support it as of today:
+- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
+
+A set of file operations that utilize get_xip_page can be found in
+mm/filemap_xip.c. The following file operation implementations are provided:
+- aio_read/aio_write
+- readv/writev
+- sendfile
+
+The generic file operations do_sync_read/do_sync_write can be used to implement
+classic synchronous IO calls.
+
+Shortcomings
+------------
+This implementation is limited to storage devices that are cpu addressable at
+all times (no highmem or such). It works well on rom/ram, but enhancements are
+needed to make it work with flash in read+write mode.
+Putting the Linux kernel and/or its modules on a xip filesystem does not mean
+they are not copied.