Creation of Cybook 2416 (actually Gen4) repository
93
Documentation/vm/balance
Normal file
@@ -0,0 +1,93 @@
|
||||
Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com>
|
||||
|
||||
Memory balancing is needed for non __GFP_WAIT as well as for non
|
||||
__GFP_IO allocations.
|
||||
|
||||
There are two reasons to request non-__GFP_WAIT allocations: the caller
cannot sleep (typically because it runs in interrupt context), or it does
not want to incur the overhead of page stealing and possible swap I/O.
|
||||
|
||||
__GFP_IO allocation requests are made to prevent file system deadlocks.
|
||||
|
||||
In the absence of non-sleepable allocation requests, proactive balancing
seems detrimental. Page reclamation can be kicked off lazily, that is,
only when needed (i.e. when zone free memory is 0), instead of making it
a proactive process.
|
||||
|
||||
That being said, the kernel should try to fulfill requests for direct
|
||||
mapped pages from the direct mapped pool, instead of falling back on
|
||||
the dma pool, so as to keep the dma pool filled for dma requests (atomic
|
||||
or not). A similar argument applies to highmem and direct mapped pages.
|
||||
On the other hand, if there are a lot of free dma pages, it is preferable to satisfy
|
||||
regular memory requests by allocating one from the dma pool, instead
|
||||
of incurring the overhead of regular zone balancing.
|
||||
|
||||
In 2.2, memory balancing/page reclamation would kick off only when the
|
||||
_total_ number of free pages fell below 1/64th of total memory. With the
|
||||
right ratio of dma and regular memory, it is quite possible that balancing
|
||||
would not be done even when the dma zone was completely empty. 2.2 has
|
||||
been running production machines of varying memory sizes, and seems to be
|
||||
doing fine even with the presence of this problem. In 2.3, due to
|
||||
HIGHMEM, this problem is aggravated.
|
||||
|
||||
In 2.3, zone balancing can be done in one of two ways. In the first,
depending on the zone size (and possibly on the sizes of lower class
zones), we decide at init time how many free pages we should aim for while
balancing any zone. The good part is that, while balancing, we do not need
to look at the sizes of lower class zones; the bad part is that we might
balance too frequently because we ignore possibly lower usage in the lower
class zones. Also, with a slight change in the allocation routine, it is
possible to reduce the memclass() macro to a simple equality.
|
||||
|
||||
Another possible solution is that we balance only when the free memory
|
||||
of a zone _and_ all its lower class zones falls below 1/64th of the
|
||||
total memory in the zone and its lower class zones. This fixes the 2.2
|
||||
balancing problem, and stays as close to 2.2 behavior as possible. Also,
|
||||
the balancing algorithm works the same way on the various architectures,
|
||||
which have different numbers and types of zones. If we wanted to get
|
||||
fancy, we could assign different weights to free pages in different
|
||||
zones in the future.
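
The cumulative 1/64 rule of the second solution can be sketched in C as
follows. This is only an illustration under stated assumptions: struct
zone_sketch and zone_needs_balance() are hypothetical names invented for
this sketch, not kernel interfaces, and the zones are assumed to be stored
in an array ordered from the lowest class (dma) upwards.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified zone descriptor; not the kernel's zone struct. */
struct zone_sketch {
	unsigned long total_pages;
	unsigned long free_pages;
};

/*
 * zones[] is ordered from the lowest class (dma) upwards.
 * Returns 1 if the free pages of zone idx plus all lower class zones
 * have fallen below 1/64th of the total pages in those zones, else 0.
 */
int zone_needs_balance(const struct zone_sketch *zones, size_t nr_zones,
		       size_t idx)
{
	unsigned long total = 0, free_pages = 0;
	size_t i;

	assert(idx < nr_zones);
	for (i = 0; i <= idx; i++) {
		total += zones[i].total_pages;
		free_pages += zones[i].free_pages;
	}
	return free_pages < total / 64;
}
```

With this rule an empty dma zone is flagged for balancing on its own,
even when the regular zone above it still has plenty of free pages, which
is the case the 2.2 scheme could miss.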
|
||||
|
||||
Note that if the size of the regular zone is huge compared to dma zone,
|
||||
it becomes less significant to consider the free dma pages while
|
||||
deciding whether to balance the regular zone. The first solution
|
||||
becomes more attractive then.
|
||||
|
||||
The appended patch implements the second solution. It also "fixes" two
|
||||
problems: first, kswapd is woken up as in 2.2 on low memory conditions
|
||||
for non-sleepable allocations. Second, the HIGHMEM zone is also balanced,
|
||||
so as to give a fighting chance for replace_with_highmem() to get a
|
||||
HIGHMEM page, as well as to ensure that HIGHMEM allocations do not
|
||||
fall back into regular zone. This also makes sure that HIGHMEM pages
|
||||
are not leaked (for example, in situations where a HIGHMEM page is in
|
||||
the swapcache but is not being used by anyone).
|
||||
|
||||
kswapd also needs to know about the zones it should balance. kswapd is
|
||||
primarily needed in a situation where balancing can not be done,
|
||||
probably because all allocation requests are coming from intr context
|
||||
and all process contexts are sleeping. For 2.3, kswapd does not really
|
||||
need to balance the highmem zone, since intr context does not request
|
||||
highmem pages. kswapd looks at the zone_wake_kswapd field in the zone
|
||||
structure to decide whether a zone needs balancing.
|
||||
|
||||
Page stealing from process memory and shm is done if stealing the page would
|
||||
alleviate memory pressure on any zone in the page's node that has fallen below
|
||||
its watermark.
|
||||
|
||||
pages_min/pages_low/pages_high/low_on_memory/zone_wake_kswapd: These are
|
||||
per-zone fields, used to determine when a zone needs to be balanced. When
|
||||
the number of free pages falls below pages_min, the hysteretic field low_on_memory
|
||||
gets set. This stays set till the number of free pages becomes pages_high.
|
||||
When low_on_memory is set, page allocation requests will try to free some
|
||||
pages in the zone (provided __GFP_WAIT is set in the request). Orthogonal
|
||||
to this, is the decision to poke kswapd to free some zone pages. That
|
||||
decision is not hysteresis based, and is done when the number of free
|
||||
pages is below pages_low; in which case zone_wake_kswapd is also set.
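
The watermark logic described above can be sketched as a small state
update. This is a simplified illustration, not the kernel's code: struct
zone_marks and zone_update_flags() are hypothetical names, and clearing
zone_wake_kswapd as soon as free pages are no longer below pages_low is an
assumption of this sketch (the text only says when the flag gets set).

```c
#include <assert.h>

/* Hypothetical per-zone state; field names follow the text above. */
struct zone_marks {
	unsigned long free_pages;
	unsigned long pages_min, pages_low, pages_high;
	int low_on_memory;	/* hysteretic flag */
	int zone_wake_kswapd;	/* plain threshold flag in this sketch */
};

/* Recompute both flags after free_pages has changed. */
void zone_update_flags(struct zone_marks *z)
{
	assert(z->pages_min <= z->pages_low && z->pages_low <= z->pages_high);

	/* Hysteresis: set below pages_min, clear only at pages_high. */
	if (z->free_pages < z->pages_min)
		z->low_on_memory = 1;
	else if (z->free_pages >= z->pages_high)
		z->low_on_memory = 0;
	/* Between the two marks the previous value is kept. */

	/* No hysteresis (assumption): follows the pages_low threshold. */
	z->zone_wake_kswapd = z->free_pages < z->pages_low;
}
```

The point of the hysteresis is that a zone which dipped below pages_min
keeps asking allocators to free pages until it has recovered all the way
to pages_high, instead of flapping around a single threshold.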
|
||||
|
||||
|
||||
(Good) Ideas that I have heard:
|
||||
1. Dynamic experience should influence balancing: number of failed requests
|
||||
for a zone can be tracked and fed into the balancing scheme (jalvo@mbay.net)
|
||||
2. Implement a replace_with_highmem()-like replace_with_regular() to preserve
|
||||
dma pages. (lkd@tantalophile.demon.co.uk)
|
||||
294
Documentation/vm/hugetlbpage.txt
Normal file
@@ -0,0 +1,294 @@
|
||||
|
||||
The intent of this file is to give a brief summary of hugetlbpage support in
|
||||
the Linux kernel. This support is built on top of multiple page size support
|
||||
that is provided by most modern architectures. For example, i386
|
||||
architecture supports 4K and 4M (2M in PAE mode) page sizes, ia64
|
||||
architecture supports multiple page sizes 4K, 8K, 64K, 256K, 1M, 4M, 16M,
|
||||
256M and ppc64 supports 4K and 16M. A TLB is a cache of virtual-to-physical
translations. Typically this is a very scarce resource on a processor.
Operating systems try to make the best use of the limited number of TLB
resources. This optimization is more critical now as bigger and bigger
physical memories (several GBs) are more readily available.
|
||||
|
||||
Users can use the huge page support in the Linux kernel by either using the
mmap system call or the standard SysV shared memory system calls (shmget,
shmat).
|
||||
|
||||
First the Linux kernel needs to be built with the CONFIG_HUGETLBFS
|
||||
(present under "File systems") and CONFIG_HUGETLB_PAGE (selected
|
||||
automatically when CONFIG_HUGETLBFS is selected) configuration
|
||||
options.
|
||||
|
||||
The kernel built with hugepage support should show the number of configured
|
||||
hugepages in the system by running the "cat /proc/meminfo" command.
|
||||
|
||||
/proc/meminfo also provides information about the total number of hugetlb
|
||||
pages configured in the kernel. It also displays information about the
|
||||
number of free hugetlb pages at any time. It also displays information about
|
||||
the configured hugepage size - this is needed for generating the proper
|
||||
alignment and size of the arguments to the above system calls.
|
||||
|
||||
The output of "cat /proc/meminfo" will have lines like:
|
||||
|
||||
.....
|
||||
HugePages_Total: xxx
|
||||
HugePages_Free: yyy
|
||||
HugePages_Rsvd: www
|
||||
Hugepagesize: zzz kB
|
||||
|
||||
where:
|
||||
HugePages_Total is the size of the pool of hugepages.
|
||||
HugePages_Free is the number of hugepages in the pool that are not yet
|
||||
allocated.
|
||||
HugePages_Rsvd is short for "reserved," and is the number of hugepages
|
||||
for which a commitment to allocate from the pool has been made, but no
|
||||
allocation has yet been made. It's vaguely analogous to overcommit.
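
As an illustration of how an application might use the Hugepagesize line,
the following sketch extracts the value from one line of /proc/meminfo
and rounds a requested length up to a multiple of the hugepage size. The
helper names parse_hugepagesize_kb() and hugepage_round_up() are
hypothetical, the hugepage size is assumed to be a power of two, and
overflow of very large lengths is not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Returns 1 and stores the size in kB if line is a "Hugepagesize:" line
 * in the format shown above, otherwise returns 0.
 */
int parse_hugepagesize_kb(const char *line, unsigned long *kb)
{
	assert(line != NULL && kb != NULL);
	return sscanf(line, "Hugepagesize: %lu kB", kb) == 1;
}

/* Round len up to a multiple of hpage_size (assumed power of two). */
size_t hugepage_round_up(size_t len, size_t hpage_size)
{
	assert(hpage_size != 0 && (hpage_size & (hpage_size - 1)) == 0);
	return (len + hpage_size - 1) & ~(hpage_size - 1);
}
```

A caller would typically read /proc/meminfo line by line with fgets(),
pass each line to the parser until it returns 1, and then size its mmap
or shmget request with the rounded length.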
|
||||
|
||||
/proc/filesystems should also show a filesystem of type "hugetlbfs" configured
|
||||
in the kernel.
|
||||
|
||||
/proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb
pages in the kernel. The superuser can dynamically request more (or free some
pre-configured) hugepages.
Allocating hugetlb pages is possible only if there are enough physically
contiguous free pages in the system. Freeing hugepages is possible only if
there are enough free hugetlb pages that can be transferred back to the
regular memory pool.
|
||||
|
||||
Pages that are used as hugetlb pages are reserved inside the kernel and cannot
|
||||
be used for other purposes.
|
||||
|
||||
Once the kernel with Hugetlb page support is built and running, a user can
|
||||
use either the mmap system call or shared memory system calls to start using
|
||||
the huge pages. It is required that the system administrator preallocate
|
||||
enough memory for huge page purposes.
|
||||
|
||||
Use the following command to dynamically allocate/deallocate hugepages:
|
||||
|
||||
echo 20 > /proc/sys/vm/nr_hugepages
|
||||
|
||||
This command will try to configure 20 hugepages in the system. The success
or failure of allocation depends on the amount of physically contiguous
memory that is present in the system at this time. System administrators may
want to put this command in one of the local rc init files. This will enable
the kernel to request huge pages early in the boot process (when the
possibility of getting physically contiguous pages is still very high).
|
||||
|
||||
If the user applications are going to request hugepages using the mmap system
call, then the system administrator must mount a file system of
type hugetlbfs:
|
||||
|
||||
mount none /mnt/huge -t hugetlbfs <uid=value> <gid=value> <mode=value>
|
||||
<size=value> <nr_inodes=value>
|
||||
|
||||
This command mounts a (pseudo) filesystem of type hugetlbfs on the directory
/mnt/huge. Any file created on /mnt/huge uses hugepages. The uid and gid
options set the owner and group of the root of the file system. By default
the uid and gid of the current process are taken. The mode option sets the
mode of the root of the file system to value & 0777. This value is given in
octal. By default the value 0755 is picked. The size option sets the maximum
amount of memory (huge pages) allowed for that filesystem (/mnt/huge). The
size is rounded down to HPAGE_SIZE. The nr_inodes option sets the maximum
number of inodes that /mnt/huge can use. If the size or nr_inodes options are
not provided on the command line then no limits are set. For the size and
nr_inodes options, you can use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo.
For example, size=2K has the same meaning as size=2048. An example is given at
the end of this document.
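
The suffix handling and rounding described for the size option can be
illustrated with a small userspace sketch. It is not the kernel's parser:
parse_size_option() and round_down_to_hpage() are hypothetical helpers,
decimal input is assumed, and overflow of very large values is not checked.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Parse a value such as "2048", "2K", "16m" or "1G".
 * Returns 0 and stores the byte count in *out, or -1 on malformed input.
 */
int parse_size_option(const char *s, unsigned long long *out)
{
	char *end;
	unsigned long long v;

	assert(s != NULL && out != NULL);
	v = strtoull(s, &end, 10);
	if (end == s)
		return -1;	/* no digits at all */

	switch (*end) {
	case 'G': case 'g': v <<= 30; end++; break;
	case 'M': case 'm': v <<= 20; end++; break;
	case 'K': case 'k': v <<= 10; end++; break;
	default: break;
	}
	if (*end != '\0')
		return -1;	/* trailing garbage */

	*out = v;
	return 0;
}

/* Round size down to a multiple of the hugepage size. */
unsigned long long round_down_to_hpage(unsigned long long size,
				       unsigned long long hpage_size)
{
	assert(hpage_size != 0);
	return size - size % hpage_size;
}
```

For instance, with a 2 MB hugepage size, a request of size=5M would end
up as a 4 MB limit after rounding down.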
|
||||
|
||||
read and write system calls are not supported on files that reside on hugetlb
|
||||
file systems.
|
||||
|
||||
Regular chown, chgrp, and chmod commands (with right permissions) could be
|
||||
used to change the file attributes on hugetlbfs.
|
||||
|
||||
Also, it is important to note that no such mount command is required if the
applications are going to use only shmat/shmget system calls. Users who
wish to use hugetlb pages via a shared memory segment should be members of
a supplementary group, and the system admin needs to configure that gid into
/proc/sys/vm/hugetlb_shm_group. It is possible for the same or different
applications to use any combination of mmaps and shm* calls, though the
mount of the filesystem will be required for using mmap calls.
|
||||
|
||||
*******************************************************************
|
||||
|
||||
/*
|
||||
* Example of using hugepage memory in a user application using Sys V shared
|
||||
* memory system calls. In this example the app is requesting 256MB of
|
||||
* memory that is backed by huge pages. The application uses the flag
|
||||
* SHM_HUGETLB in the shmget system call to inform the kernel that it is
|
||||
* requesting hugepages.
|
||||
*
|
||||
* For the ia64 architecture, the Linux kernel reserves Region number 4 for
|
||||
* hugepages. That means the addresses starting with 0x800000... will need
|
||||
* to be specified. Specifying a fixed address is not required on ppc64,
|
||||
* i386 or x86_64.
|
||||
*
|
||||
* Note: The default shared memory limit is quite low on many kernels,
|
||||
* you may need to increase it via:
|
||||
*
|
||||
* echo 268435456 > /proc/sys/kernel/shmmax
|
||||
*
|
||||
* This will increase the maximum size per shared memory segment to 256MB.
|
||||
* The other limit that you will hit eventually is shmall which is the
|
||||
* total amount of shared memory in pages. To set it to 16GB on a system
|
||||
* with a 4kB pagesize do:
|
||||
*
|
||||
* echo 4194304 > /proc/sys/kernel/shmall
|
||||
*/
|
||||
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/mman.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000
#endif

#define LENGTH (256UL*1024*1024)

#define dprintf(x)  printf(x)

/* Only ia64 requires this */
#ifdef __ia64__
#define ADDR (void *)(0x8000000000000000UL)
#define SHMAT_FLAGS (SHM_RND)
#else
#define ADDR (void *)(0x0UL)
#define SHMAT_FLAGS (0)
#endif

int main(void)
{
	int shmid;
	unsigned long i;
	char *shmaddr;

	if ((shmid = shmget(2, LENGTH,
			    SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W)) < 0) {
		perror("shmget");
		exit(1);
	}
	printf("shmid: 0x%x\n", shmid);

	shmaddr = shmat(shmid, ADDR, SHMAT_FLAGS);
	if (shmaddr == (char *)-1) {
		perror("Shared memory attach failure");
		shmctl(shmid, IPC_RMID, NULL);
		exit(2);
	}
	printf("shmaddr: %p\n", shmaddr);

	dprintf("Starting the writes:\n");
	for (i = 0; i < LENGTH; i++) {
		shmaddr[i] = (char)(i);
		if (!(i % (1024 * 1024)))
			dprintf(".");
	}
	dprintf("\n");

	dprintf("Starting the Check...");
	for (i = 0; i < LENGTH; i++)
		if (shmaddr[i] != (char)i)
			printf("\nIndex %lu mismatched\n", i);
	dprintf("Done.\n");

	if (shmdt((const void *)shmaddr) != 0) {
		perror("Detach failure");
		shmctl(shmid, IPC_RMID, NULL);
		exit(3);
	}

	shmctl(shmid, IPC_RMID, NULL);

	return 0;
}
|
||||
|
||||
*******************************************************************
|
||||
|
||||
/*
|
||||
* Example of using hugepage memory in a user application using the mmap
|
||||
* system call. Before running this application, make sure that the
|
||||
* administrator has mounted the hugetlbfs filesystem (on some directory
|
||||
* like /mnt) using the command mount -t hugetlbfs nodev /mnt. In this
|
||||
* example, the app is requesting memory of size 256MB that is backed by
|
||||
* huge pages.
|
||||
*
|
||||
* For ia64 architecture, Linux kernel reserves Region number 4 for hugepages.
|
||||
* That means the addresses starting with 0x800000... will need to be
|
||||
* specified. Specifying a fixed address is not required on ppc64, i386
|
||||
* or x86_64.
|
||||
*/
|
||||
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>

#define FILE_NAME "/mnt/hugepagefile"
#define LENGTH (256UL*1024*1024)
#define PROTECTION (PROT_READ | PROT_WRITE)

/* Only ia64 requires this */
#ifdef __ia64__
#define ADDR (void *)(0x8000000000000000UL)
#define FLAGS (MAP_SHARED | MAP_FIXED)
#else
#define ADDR (void *)(0x0UL)
#define FLAGS (MAP_SHARED)
#endif

void check_bytes(char *addr)
{
	printf("First hex is %x\n", *((unsigned int *)addr));
}

void write_bytes(char *addr)
{
	unsigned long i;

	for (i = 0; i < LENGTH; i++)
		*(addr + i) = (char)i;
}

void read_bytes(char *addr)
{
	unsigned long i;

	check_bytes(addr);
	for (i = 0; i < LENGTH; i++)
		if (*(addr + i) != (char)i) {
			printf("Mismatch at %lu\n", i);
			break;
		}
}

int main(void)
{
	void *addr;
	int fd;

	fd = open(FILE_NAME, O_CREAT | O_RDWR, 0755);
	if (fd < 0) {
		perror("Open failed");
		exit(1);
	}

	addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, fd, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		unlink(FILE_NAME);
		exit(1);
	}

	printf("Returned address is %p\n", addr);
	check_bytes(addr);
	write_bytes(addr);
	read_bytes(addr);

	munmap(addr, LENGTH);
	close(fd);
	unlink(FILE_NAME);

	return 0;
}
|
||||
130
Documentation/vm/locking
Normal file
@@ -0,0 +1,130 @@
|
||||
Started Oct 1999 by Kanoj Sarcar <kanojsarcar@yahoo.com>
|
||||
|
||||
The intent of this file is to have an up-to-date, running commentary
from different people about how locking and synchronization are done
in the Linux vm code.
|
||||
|
||||
page_table_lock & mmap_sem
|
||||
--------------------------------------
|
||||
|
||||
Page stealers pick processes out of the process pool and scan for
|
||||
the best process to steal pages from. To guarantee the existence
|
||||
of the victim mm, a mm_count inc and a mmdrop are done in swap_out().
|
||||
Page stealers hold kernel_lock to protect against a bunch of races.
|
||||
The vma list of the victim mm is also scanned by the stealer,
|
||||
and the page_table_lock is used to preserve list sanity against the
|
||||
process adding/deleting to the list. This also guarantees existence
|
||||
of the vma. Vma existence is not guaranteed once try_to_swap_out()
|
||||
drops the page_table_lock. To guarantee the existence of the underlying
|
||||
file structure, a get_file is done before the swapout() method is
|
||||
invoked. The page passed into swapout() is guaranteed not to be reused
|
||||
for a different purpose because the page reference count due to being
|
||||
present in the user's pte is not released till after swapout() returns.
|
||||
|
||||
Any code that modifies the vmlist, or the vm_start/vm_end/
|
||||
vm_flags:VM_LOCKED/vm_next of any vma *in the list* must prevent
|
||||
kswapd from looking at the chain.
|
||||
|
||||
The rules are:
|
||||
1. To scan the vmlist (look but don't touch) you must hold the
|
||||
mmap_sem with read bias, i.e. down_read(&mm->mmap_sem)
|
||||
2. To modify the vmlist you need to hold the mmap_sem with
|
||||
read&write bias, i.e. down_write(&mm->mmap_sem) *AND*
|
||||
you need to take the page_table_lock.
|
||||
3. The swapper takes _just_ the page_table_lock, this is done
|
||||
because the mmap_sem can be an extremely long lived lock
|
||||
and the swapper just cannot sleep on that.
|
||||
4. The exception to this rule is expand_stack, which just
|
||||
takes the read lock and the page_table_lock, this is ok
|
||||
because it doesn't really modify fields anybody relies on.
|
||||
5. You must be able to guarantee that while holding page_table_lock
|
||||
or mmap_sem of mm A, you will not try to get either lock
|
||||
for mm B.
|
||||
|
||||
The caveats are:
|
||||
1. find_vma() makes use of, and updates, the mmap_cache pointer hint.
|
||||
The update of mmap_cache is racy (page stealer can race with other code
|
||||
that invokes find_vma with mmap_sem held), but that is okay, since it
|
||||
is a hint. This can be fixed, if desired, by having find_vma grab the
|
||||
page_table_lock.
|
||||
|
||||
|
||||
Code that adds/deletes elements from the vmlist chain:
|
||||
1. callers of insert_vm_struct
|
||||
2. callers of merge_segments
|
||||
3. callers of avl_remove
|
||||
|
||||
Code that changes vm_start/vm_end/vm_flags:VM_LOCKED of vma's on
|
||||
the list:
|
||||
1. expand_stack
|
||||
2. mprotect
|
||||
3. mlock
|
||||
4. mremap
|
||||
|
||||
It is advisable that changes to vm_start/vm_end be protected, although
in some cases it is not really needed. For example, vm_start is modified by
expand_stack(), yet it is hard to come up with a destructive scenario even
without the vmlist protection in this case.
|
||||
|
||||
The page_table_lock nests with the inode i_mmap_lock and the kmem cache
|
||||
c_spinlock spinlocks. This is okay, since the kmem code asks for pages after
|
||||
dropping c_spinlock. The page_table_lock also nests with pagecache_lock and
|
||||
pagemap_lru_lock spinlocks, and no code asks for memory with these locks
|
||||
held.
|
||||
|
||||
The page_table_lock is grabbed while holding the kernel_lock spinning monitor.
|
||||
|
||||
The page_table_lock is a spin lock.
|
||||
|
||||
Note: PTL can also be used to guarantee that no new clones using the
|
||||
mm start up ... this is a loose form of stability on mm_users. For
|
||||
example, it is used in copy_mm to protect against a racing tlb_gather_mmu
|
||||
single address space optimization, so that the zap_page_range (from
|
||||
vmtruncate) does not lose sending ipi's to cloned threads that might
|
||||
be spawned underneath it and go to user mode to drag in pte's into tlbs.
|
||||
|
||||
swap_lock
|
||||
--------------
|
||||
The swap devices are chained in priority order from the "swap_list" header.
|
||||
The "swap_list" is used for the round-robin swaphandle allocation strategy.
|
||||
The number of free swaphandles is maintained in "nr_swap_pages". These two together
|
||||
are protected by the swap_lock.
|
||||
|
||||
The swap_lock also protects all the device reference counts on the
|
||||
corresponding swaphandles, maintained in the "swap_map" array, and the
|
||||
"highest_bit" and "lowest_bit" fields.
|
||||
|
||||
The swap_lock is a spinlock, and is never acquired from intr level.
|
||||
|
||||
To prevent races between swap space deletion or async readahead swapins
|
||||
deciding whether a swap handle is being used, ie worthy of being read in
|
||||
from disk, and an unmap -> swap_free making the handle unused, the swap
|
||||
delete and readahead code grabs a temp reference on the swaphandle to
|
||||
prevent warning messages from swap_duplicate <- read_swap_cache_async.
|
||||
|
||||
Swap cache locking
|
||||
------------------
|
||||
Pages are added into the swap cache with kernel_lock held, to make sure
|
||||
that multiple pages are not being added (and hence lost) by associating
|
||||
all of them with the same swaphandle.
|
||||
|
||||
Pages are guaranteed not to be removed from the scache if the page is
|
||||
"shared": ie, other processes hold reference on the page or the associated
|
||||
swap handle. The only code that does not follow this rule is shrink_mmap,
|
||||
which deletes pages from the swap cache if no process has a reference on
|
||||
the page (multiple processes might have references on the corresponding
|
||||
swap handle though). lookup_swap_cache() races with shrink_mmap, when
|
||||
establishing a reference on a scache page, so, it must check whether the
|
||||
page it located is still in the swapcache, or shrink_mmap deleted it.
|
||||
(This race is due to the fact that shrink_mmap looks at the page ref
|
||||
count with pagecache_lock, but then drops pagecache_lock before deleting
|
||||
the page from the scache).
|
||||
|
||||
do_wp_page and do_swap_page have MP races in them while trying to figure
|
||||
out whether a page is "shared", by looking at the page_count + swap_count.
|
||||
To preserve the sum of the counts, the page lock _must_ be acquired before
|
||||
calling is_page_shared (else processes might switch their swap_count refs
|
||||
to the page count refs, after the page count ref has been snapshotted).
|
||||
|
||||
Swap device deletion code currently breaks all the scache assumptions,
|
||||
since it grabs neither mmap_sem nor page_table_lock.
|
||||
41
Documentation/vm/numa
Normal file
@@ -0,0 +1,41 @@
|
||||
Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com>
|
||||
|
||||
The intent of this file is to have an up-to-date, running commentary
from different people about NUMA specific code in the Linux vm.
|
||||
|
||||
What is NUMA? It is an architecture where the memory access times
for different regions of memory from a given processor vary
according to the "distance" of the memory region from the processor.
Each region of memory to which access times are the same from any
cpu is called a node. On such architectures, it is beneficial if
the kernel tries to minimize inter-node communication. Schemes
for this range from replicating kernel text and read-only data
across nodes to housing all the data structures that key components
of the kernel need in memory on that node.
|
||||
|
||||
Currently, all the numa support is to provide efficient handling
|
||||
of widely discontiguous physical memory, so architectures which
|
||||
are not NUMA but can have huge holes in the physical address space
|
||||
can use the same code. All this code is bracketed by CONFIG_DISCONTIGMEM.
|
||||
|
||||
The initial port includes NUMAizing the bootmem allocator code by
|
||||
encapsulating all the pieces of information into a bootmem_data_t
|
||||
structure. Node specific calls have been added to the allocator.
|
||||
In theory, any platform which uses the bootmem allocator should
|
||||
be able to put the bootmem and mem_map data structures anywhere
|
||||
it deems best.
|
||||
|
||||
Each node's page allocation data structures have also been encapsulated
|
||||
into a pg_data_t. The bootmem_data_t is just one part of this. To
|
||||
make the code look uniform between NUMA and regular UMA platforms,
|
||||
UMA platforms have a statically allocated pg_data_t too (contig_page_data).
|
||||
For the sake of uniformity, the function num_online_nodes() is also defined
|
||||
for all platforms. As we run benchmarks, we might decide to NUMAize
|
||||
more variables like low_on_memory, nr_free_pages etc into the pg_data_t.
|
||||
|
||||
The NUMA aware page allocation code currently tries to allocate pages
|
||||
from different nodes in a round robin manner. This will be changed to
|
||||
do a concentric circle search, starting from the current node, once the
|
||||
NUMA port achieves more maturity. The call alloc_pages_node has been
|
||||
added, so that drivers can make the call and not worry about whether
|
||||
it is running on a NUMA or UMA platform.
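
The round robin node selection mentioned above can be pictured with a
minimal sketch. next_node_round_robin() and its cursor argument are
hypothetical and only illustrate the policy; they do not reflect how the
kernel stores or advances its state.

```c
#include <assert.h>

/*
 * Return the node to try for the next allocation and advance the
 * cursor so that successive calls cycle through all nodes.
 */
int next_node_round_robin(int *cursor, int nr_nodes)
{
	int node;

	assert(cursor != 0 && nr_nodes > 0);
	assert(*cursor >= 0 && *cursor < nr_nodes);

	node = *cursor;
	*cursor = (*cursor + 1) % nr_nodes;
	return node;
}
```

Round robin spreads allocations evenly but ignores distance; a
nearest-first search would instead start from the caller's node and only
move outwards when that node cannot satisfy the request.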
|
||||
73
Documentation/vm/overcommit-accounting
Normal file
@@ -0,0 +1,73 @@
|
||||
The Linux kernel supports the following overcommit handling modes
|
||||
|
||||
0 - Heuristic overcommit handling. Obvious overcommits of
|
||||
address space are refused. Used for a typical system. It
|
||||
ensures a seriously wild allocation fails while allowing
|
||||
overcommit to reduce swap usage. root is allowed to
|
||||
allocate slightly more memory in this mode. This is the
|
||||
default.
|
||||
|
||||
1 - Always overcommit. Appropriate for some scientific
|
||||
applications.
|
||||
|
||||
2 - Don't overcommit. The total address space commit
|
||||
for the system is not permitted to exceed swap + a
|
||||
configurable percentage (default is 50) of physical RAM.
|
||||
Depending on the percentage you use, in most situations
|
||||
this means a process will not be killed while accessing
|
||||
pages but will receive errors on memory allocation as
|
||||
appropriate.
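
As a worked example of the mode 2 limit, the arithmetic "swap plus a
configurable percentage of physical RAM" can be written as below. This is
a simplified sketch: commit_limit_pages() is a hypothetical helper, all
quantities are in pages, and any further adjustments a real kernel may
apply to the RAM figure are ignored.

```c
#include <assert.h>
#include <limits.h>

/* Limit on committed address space in mode 2, in pages. */
unsigned long long commit_limit_pages(unsigned long long ram_pages,
				      unsigned long long swap_pages,
				      unsigned int ratio_percent)
{
	/* Guard the multiplication below against overflow. */
	assert(ratio_percent == 0 ||
	       ram_pages <= ULLONG_MAX / ratio_percent);

	return swap_pages + ram_pages * ratio_percent / 100;
}
```

With the default percentage of 50, a machine with 1000 pages of RAM and
500 pages of swap would get a limit of 500 + 1000 * 50 / 100 = 1000
pages.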
|
||||
|
||||
The overcommit policy is set via the sysctl `vm.overcommit_memory'.
|
||||
|
||||
The overcommit percentage is set via `vm.overcommit_ratio'.
|
||||
|
||||
The current overcommit limit and amount committed are viewable in
|
||||
/proc/meminfo as CommitLimit and Committed_AS respectively.
|
||||
|
||||
Gotchas
|
||||
-------
|
||||
|
||||
The C language stack growth does an implicit mremap. If you want absolute
|
||||
guarantees and run close to the edge you MUST mmap your stack for the
|
||||
largest size you think you will need. For typical stack usage this does
|
||||
not matter much, but it's a corner case if you really, really care.
|
||||
|
||||
In mode 2 the MAP_NORESERVE flag is ignored.
|
||||
|
||||
|
||||
How It Works
|
||||
------------
|
||||
|
||||
The overcommit is based on the following rules
|
||||
|
||||
For a file backed map
|
||||
SHARED or READ-only - 0 cost (the file is the map not swap)
|
||||
PRIVATE WRITABLE - size of mapping per instance
|
||||
|
||||
For an anonymous or /dev/zero map
|
||||
SHARED - size of mapping
|
||||
PRIVATE READ-only - 0 cost (but of little use)
|
||||
PRIVATE WRITABLE - size of mapping per instance
|
||||
|
||||
Additional accounting
|
||||
Pages made writable copies by mmap
|
||||
shmfs memory drawn from the same pool
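
The per-mapping rules above can be condensed into one function. This is
an illustrative sketch only: mapping_commit_cost() is a hypothetical
helper, it covers just the file backed and anonymous cases from the two
tables, and it leaves out the additional accounting items.

```c
#include <assert.h>

/*
 * Commit charge for a single mapping of the given size.
 * file_backed, shared and writable are booleans.
 */
unsigned long mapping_commit_cost(int file_backed, int shared,
				  int writable, unsigned long size)
{
	assert(size != 0);	/* a zero length mapping is not valid */

	if (file_backed)
		/* Only private writable file maps are charged. */
		return (!shared && writable) ? size : 0;

	if (shared)
		return size;	/* shared anonymous or /dev/zero map */

	/* Private anonymous: charged only when writable. */
	return writable ? size : 0;
}
```

The pattern behind the table is that a mapping is charged whenever its
pages may eventually need swap or RAM of their own, rather than being
backed by the file itself.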
|
||||
|
||||
Status
|
||||
------
|
||||
|
||||
o We account mmap memory mappings
|
||||
o We account mprotect changes in commit
|
||||
o We account mremap changes in size
|
||||
o We account brk
|
||||
o We account munmap
|
||||
o We report the commit status in /proc
|
||||
o Account and check on fork
|
||||
o Review stack handling/building on exec
|
||||
o SHMfs accounting
|
||||
o Implement actual limit enforcement
|
||||
|
||||
To Do
|
||||
-----
|
||||
o Account ptrace pages (this is hard)
|
||||
147
Documentation/vm/page_migration
Normal file
@@ -0,0 +1,147 @@
|
||||
Page migration
|
||||
--------------
|
||||
|
||||
Page migration allows the moving of the physical location of pages between
|
||||
nodes in a numa system while the process is running. This means that the
|
||||
virtual addresses that the process sees do not change. However, the
|
||||
system rearranges the physical location of those pages.
|
||||
|
||||
The main intent of page migration is to reduce the latency of memory access
|
||||
by moving pages near to the processor where the process accessing that memory
|
||||
is running.
|
||||
|
||||
Page migration allows a process to manually relocate the node on which its
pages are located through the MPOL_MF_MOVE and MPOL_MF_MOVE_ALL options while
setting a new memory policy via mbind(). The pages of a process can also be
relocated from another process using the sys_migrate_pages() function call.
The migrate_pages function call takes two sets of nodes and moves pages of a
process that are located on the from nodes to the destination nodes.
Page migration functions are provided by the numactl package by Andi Kleen
(a version later than 0.9.3 is required; get it from
ftp://ftp.suse.com/pub/people/ak). numactl provides libnuma, which
offers an interface for page migration similar to its other numa functionality.
cat /proc/<pid>/numa_maps allows an easy review of where the pages of
a process are located. See also the numa_maps manpage in the numactl package.
|
||||
|
||||
Manual migration is useful if for example the scheduler has relocated
|
||||
a process to a processor on a distant node. A batch scheduler or an
|
||||
administrator may detect the situation and move the pages of the process
|
||||
nearer to the new processor. The kernel itself only provides
|
||||
manual page migration support. Automatic page migration may be implemented
|
||||
through user space processes that move pages. A special function call
|
||||
"move_pages" allows the moving of individual pages within a process.
|
||||
A NUMA profiler may, for example, obtain a log showing frequent off-node
|
||||
accesses and may use the result to move pages to more advantageous
|
||||
locations.
|
||||
|
||||
Larger installations usually partition the system using cpusets into
|
||||
sections of nodes. Paul Jackson has equipped cpusets with the ability to
|
||||
move pages when a task is moved to another cpuset (See ../cpusets.txt).
|
||||
Cpusets allow the automation of process locality. If a task is moved to
|
||||
a new cpuset then all its pages are moved with it so that the
|
||||
performance of the process does not sink dramatically. Also the pages
|
||||
of processes in a cpuset are moved if the allowed memory nodes of a
|
||||
cpuset are changed.
|
||||
|
||||
Page migration preserves the relative location of pages within a group of
|
||||
nodes for all migration techniques, so a particular memory allocation
|
||||
pattern survives the migration of a process. This is necessary in order to
|
||||
preserve the memory latencies, so that processes run with similar
|
||||
performance after migration.
|
||||
|
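One way to picture the preservation of relative location is a mapping from
the i-th source node to the i-th destination node. The Python sketch below is
a toy model under that assumption, with both node sets required to have the
same size. The function names are hypothetical and this is not the kernel's
implementation.

```python
def build_node_map(from_nodes, to_nodes):
    """Pair the i-th source node with the i-th destination node."""
    src = sorted(from_nodes)
    dst = sorted(to_nodes)
    if len(src) != len(dst):
        raise ValueError("toy model assumes equally sized node sets")
    return dict(zip(src, dst))


def migrate_layout(page_nodes, from_nodes, to_nodes):
    """Return the node of each page after migration.

    Pages on nodes outside from_nodes stay where they are.
    """
    node_map = build_node_map(from_nodes, to_nodes)
    return [node_map.get(node, node) for node in page_nodes]


if __name__ == "__main__":
    # Pages interleaved over nodes 0 and 1, then moved to nodes 2 and 3.
    print(migrate_layout([0, 1, 0, 1, 5], {0, 1}, {2, 3}))
```

In this model an interleaved pattern over the source nodes remains an
interleaved pattern over the destination nodes.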
||||
Page migration occurs in several steps. First a high level
|
||||
description for those trying to use migrate_pages() from the kernel
|
||||
(for userspace usage see Andi Kleen's numactl package mentioned above)
|
||||
and then a low level description of how the low level details work.
|
||||
|
||||
A. In kernel use of migrate_pages()
|
||||
-----------------------------------
|
||||
|
||||
1. Remove pages from the LRU.
|
||||
|
||||
Lists of pages to be migrated are generated by scanning over
|
||||
pages and moving them into lists. This is done by
|
||||
calling isolate_lru_page().
|
||||
Calling isolate_lru_page increases the references to the page
|
||||
so that it cannot vanish while the page migration occurs.
|
||||
It also prevents the swapper or other scans from encountering
|
||||
the page.
|
||||
|
||||
2. We need to have a function of type new_page_t that can be
|
||||
passed to migrate_pages(). This function should figure out
|
||||
how to allocate the correct new page given the old page.
|
||||
|
||||
3. The migrate_pages() function is called which attempts
|
||||
to do the migration. It will call the function to allocate
|
||||
the new page for each page that is considered for
|
||||
moving.
|
||||
|
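The three steps above can be pictured with a small userspace toy model in
Python. It is not kernel code: the Page class and the toy_isolate_lru_page,
toy_migrate_pages and new_page_on_node names are hypothetical stand-ins that
only mirror the roles of isolate_lru_page(), migrate_pages() and a new_page_t
callback, and the model assumes every migration succeeds.

```python
from dataclasses import dataclass


@dataclass
class Page:
    node: int
    refcount: int = 1
    on_lru: bool = True


def toy_isolate_lru_page(page, pagelist):
    """Step 1: take the page off the LRU and hold an extra reference."""
    if not page.on_lru:
        return False
    page.on_lru = False
    page.refcount += 1
    pagelist.append(page)
    return True


def new_page_on_node(target_node):
    """Step 2: build a callback playing the role of a new_page_t function."""
    def get_new_page(old_page):
        return Page(node=target_node)
    return get_new_page


def toy_migrate_pages(pagelist, get_new_page):
    """Step 3: call the allocation callback once per isolated page."""
    moved = []
    for old in pagelist:
        moved.append(get_new_page(old))
        old.refcount -= 1  # drop the reference taken at isolation
    pagelist.clear()
    return moved


if __name__ == "__main__":
    pages = [Page(node=0), Page(node=0)]
    pagelist = []
    for page in pages:
        toy_isolate_lru_page(page, pagelist)
    print([p.node for p in toy_migrate_pages(pagelist, new_page_on_node(1))])
```

Passing the allocation policy as a callback keeps the migration loop
independent of how the caller picks the destination node.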
||||
B. How migrate_pages() works
|
||||
----------------------------
|
||||
|
||||
migrate_pages() does several passes over its list of pages. A page is moved
|
||||
if all references to a page are removable at the time. The page has
|
||||
already been removed from the LRU via isolate_lru_page() and the refcount
|
||||
is increased so that the page cannot be freed while page migration occurs.
|
||||
|
||||
Steps:
|
||||
|
||||
1. Lock the page to be migrated
|
||||
|
||||
2. Ensure that writeback is complete.
|
||||
|
||||
3. Prep the new page that we want to move to. It is locked
|
||||
and set to not being uptodate so that all accesses to the new
|
||||
page immediately lock while the move is in progress.
|
||||
|
||||
4. The new page is prepped with some settings from the old page so that
|
||||
accesses to the new page will discover a page with the correct settings.
|
||||
|
||||
5. All the page table references to the page are converted
|
||||
to migration entries or dropped (nonlinear vmas).
|
||||
This decreases the mapcount of a page. If the resulting
|
||||
mapcount is not zero then we do not migrate the page.
|
||||
All user space processes that attempt to access the page
|
||||
will now wait on the page lock.
|
||||
|
||||
6. The radix tree lock is taken. This will cause all processes trying
|
||||
to access the page via the mapping to block on the radix tree spinlock.
|
||||
|
||||
7. The refcount of the page is examined and we back out if references remain;
|
||||
otherwise we know that we are the only one referencing this page.
|
||||
|
||||
8. The radix tree is checked and if it does not contain the pointer to this
|
||||
page then we back out because someone else modified the radix tree.
|
||||
|
||||
9. The radix tree is changed to point to the new page.
|
||||
|
||||
10. The reference count of the old page is dropped because the radix tree
|
||||
reference is gone. A reference to the new page is established because
|
||||
the new page is referenced by the radix tree.
|
||||
|
||||
11. The radix tree lock is dropped. With that, lookups in the mapping
|
||||
become possible again. Processes will move from spinning on the tree_lock
|
||||
to sleeping on the locked new page.
|
||||
|
||||
12. The page contents are copied to the new page.
|
||||
|
||||
13. The remaining page flags are copied to the new page.
|
||||
|
||||
14. The old page flags are cleared to indicate that the page does
|
||||
not provide any information anymore.
|
||||
|
||||
15. Queued up writeback on the new page is triggered.
|
||||
|
||||
16. If migration entries were used then replace them with real ptes. Doing
|
||||
so will enable access for user space processes not already waiting for
|
||||
the page lock.
|
||||
|
||||
17. The page locks are dropped from the old and new page.
|
||||
Processes waiting on the page lock will redo their page faults
|
||||
and will reach the new page.
|
||||
|
||||
18. The new page is moved to the LRU and can be scanned by the swapper
|
||||
etc again.
|
||||
|
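The check-and-swap at the heart of steps 6 to 11 can be sketched as a
userspace toy model in Python. A dictionary stands in for the radix tree and a
threading.Lock for the tree lock; the Mapping and Page classes, the
replace_in_mapping name and the expected_refs parameter are all hypothetical,
and the model ignores page table handling and page locking.

```python
import threading


class Mapping:
    """Toy stand-in for a mapping: offset -> page, guarded by a lock."""

    def __init__(self):
        self.tree_lock = threading.Lock()
        self.tree = {}


class Page:
    def __init__(self, name, refcount=0):
        self.name = name
        self.refcount = refcount


def replace_in_mapping(mapping, offset, old, new, expected_refs):
    """Swap old for new in the tree only if nobody else references old."""
    with mapping.tree_lock:                      # step 6: take the tree lock
        if old.refcount != expected_refs:        # step 7: extra references?
            return False
        if mapping.tree.get(offset) is not old:  # step 8: tree was modified?
            return False
        mapping.tree[offset] = new               # step 9: point at new page
        old.refcount -= 1                        # step 10: tree ref to old gone
        new.refcount += 1                        #          tree now refs new
    return True                                  # step 11: lock dropped


if __name__ == "__main__":
    mapping = Mapping()
    # In this toy model the old page is referenced by the tree and by the
    # migration code itself, hence expected_refs=2.
    old, new = Page("old", refcount=2), Page("new", refcount=1)
    mapping.tree[0] = old
    print(replace_in_mapping(mapping, 0, old, new, expected_refs=2))
```

Backing out leaves both the tree and the reference counts untouched, which is
what allows a later pass over the page list to retry the page.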
||||
Christoph Lameter, May 8, 2006.
|
||||
|
||||