Creation of Cybook 2416 (actually Gen4) repository

This commit is contained in:
mlt
2009-12-18 17:10:00 +00:00
committed by godzil
commit 76f20f4d40
13791 changed files with 6812321 additions and 0 deletions

93
Documentation/vm/balance Normal file
View File

@@ -0,0 +1,93 @@
Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com>
Memory balancing is needed for non __GFP_WAIT as well as for non
__GFP_IO allocations.
There are two reasons to be requesting non __GFP_WAIT allocations:
the caller can not sleep (typically intr context), or does not want
to incur cost overheads of page stealing and possible swap io for
whatever reasons.
__GFP_IO allocation requests are made to prevent file system deadlocks.
In the absence of non sleepable allocation requests, it seems detrimental
to be doing balancing. Page reclamation can be kicked off lazily, that
is, only when needed (aka zone free memory is 0), instead of making it
a proactive process.
That being said, the kernel should try to fulfill requests for direct
mapped pages from the direct mapped pool, instead of falling back on
the dma pool, so as to keep the dma pool filled for dma requests (atomic
or not). A similar argument applies to highmem and direct mapped pages.
OTOH, if there is a lot of free dma pages, it is preferable to satisfy
regular memory requests by allocating one from the dma pool, instead
of incurring the overhead of regular zone balancing.
In 2.2, memory balancing/page reclamation would kick off only when the
_total_ number of free pages fell below 1/64 th of total memory. With the
right ratio of dma and regular memory, it is quite possible that balancing
would not be done even when the dma zone was completely empty. 2.2 has
been running production machines of varying memory sizes, and seems to be
doing fine even with the presence of this problem. In 2.3, due to
HIGHMEM, this problem is aggravated.
In 2.3, zone balancing can be done in one of two ways: depending on the
zone size (and possibly of the size of lower class zones), we can decide
at init time how many free pages we should aim for while balancing any
zone. The good part is, while balancing, we do not need to look at sizes
of lower class zones, the bad part is, we might do too frequent balancing
due to ignoring possibly lower usage in the lower class zones. Also,
with a slight change in the allocation routine, it is possible to reduce
the memclass() macro to be a simple equality.
Another possible solution is that we balance only when the free memory
of a zone _and_ all its lower class zones falls below 1/64th of the
total memory in the zone and its lower class zones. This fixes the 2.2
balancing problem, and stays as close to 2.2 behavior as possible. Also,
the balancing algorithm works the same way on the various architectures,
which have different numbers and types of zones. If we wanted to get
fancy, we could assign different weights to free pages in different
zones in the future.
Note that if the size of the regular zone is huge compared to dma zone,
it becomes less significant to consider the free dma pages while
deciding whether to balance the regular zone. The first solution
becomes more attractive then.
The appended patch implements the second solution. It also "fixes" two
problems: first, kswapd is woken up as in 2.2 on low memory conditions
for non-sleepable allocations. Second, the HIGHMEM zone is also balanced,
so as to give a fighting chance for replace_with_highmem() to get a
HIGHMEM page, as well as to ensure that HIGHMEM allocations do not
fall back into regular zone. This also makes sure that HIGHMEM pages
are not leaked (for example, in situations where a HIGHMEM page is in
the swapcache but is not being used by anyone)
kswapd also needs to know about the zones it should balance. kswapd is
primarily needed in a situation where balancing can not be done,
probably because all allocation requests are coming from intr context
and all process contexts are sleeping. For 2.3, kswapd does not really
need to balance the highmem zone, since intr context does not request
highmem pages. kswapd looks at the zone_wake_kswapd field in the zone
structure to decide whether a zone needs balancing.
Page stealing from process memory and shm is done if stealing the page would
alleviate memory pressure on any zone in the page's node that has fallen below
its watermark.
pages_min/pages_low/pages_high/low_on_memory/zone_wake_kswapd: These are
per-zone fields, used to determine when a zone needs to be balanced. When
the number of pages falls below pages_min, the hysteric field low_on_memory
gets set. This stays set till the number of free pages becomes pages_high.
When low_on_memory is set, page allocation requests will try to free some
pages in the zone (providing GFP_WAIT is set in the request). Orthogonal
to this, is the decision to poke kswapd to free some zone pages. That
decision is not hysteresis based, and is done when the number of free
pages is below pages_low; in which case zone_wake_kswapd is also set.
(Good) Ideas that I have heard:
1. Dynamic experience should influence balancing: number of failed requests
for a zone can be tracked and fed into the balancing scheme (jalvo@mbay.net)
2. Implement a replace_with_highmem()-like replace_with_regular() to preserve
dma pages. (lkd@tantalophile.demon.co.uk)

View File

@@ -0,0 +1,294 @@
The intent of this file is to give a brief summary of hugetlbpage support in
the Linux kernel. This support is built on top of multiple page size support
that is provided by most modern architectures. For example, i386
architecture supports 4K and 4M (2M in PAE mode) page sizes, ia64
architecture supports multiple page sizes 4K, 8K, 64K, 256K, 1M, 4M, 16M,
256M and ppc64 supports 4K and 16M. A TLB is a cache of virtual-to-physical
translations. Typically this is a very scarce resource on processor.
Operating systems try to make best use of limited number of TLB resources.
This optimization is more critical now as bigger and bigger physical memories
(several GBs) are more readily available.
Users can use the huge page support in Linux kernel by either using the mmap
system call or standard SYSv shared memory system calls (shmget, shmat).
First the Linux kernel needs to be built with the CONFIG_HUGETLBFS
(present under "File systems") and CONFIG_HUGETLB_PAGE (selected
automatically when CONFIG_HUGETLBFS is selected) configuration
options.
The kernel built with hugepage support should show the number of configured
hugepages in the system by running the "cat /proc/meminfo" command.
/proc/meminfo also provides information about the total number of hugetlb
pages configured in the kernel. It also displays information about the
number of free hugetlb pages at any time. It also displays information about
the configured hugepage size - this is needed for generating the proper
alignment and size of the arguments to the above system calls.
The output of "cat /proc/meminfo" will have lines like:
.....
HugePages_Total: xxx
HugePages_Free: yyy
HugePages_Rsvd: www
Hugepagesize: zzz kB
where:
HugePages_Total is the size of the pool of hugepages.
HugePages_Free is the number of hugepages in the pool that are not yet
allocated.
HugePages_Rsvd is short for "reserved," and is the number of hugepages
for which a commitment to allocate from the pool has been made, but no
allocation has yet been made. It's vaguely analogous to overcommit.
/proc/filesystems should also show a filesystem of type "hugetlbfs" configured
in the kernel.
/proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb
pages in the kernel. Super user can dynamically request more (or free some
pre-configured) hugepages.
The allocation (or deallocation) of hugetlb pages is possible only if there are
enough physically contiguous free pages in system (freeing of hugepages is
possible only if there are enough hugetlb pages free that can be transferred
back to regular memory pool).
Pages that are used as hugetlb pages are reserved inside the kernel and cannot
be used for other purposes.
Once the kernel with Hugetlb page support is built and running, a user can
use either the mmap system call or shared memory system calls to start using
the huge pages. It is required that the system administrator preallocate
enough memory for huge page purposes.
Use the following command to dynamically allocate/deallocate hugepages:
echo 20 > /proc/sys/vm/nr_hugepages
This command will try to configure 20 hugepages in the system. The success
or failure of allocation depends on the amount of physically contiguous
memory that is preset in system at this time. System administrators may want
to put this command in one of the local rc init files. This will enable the
kernel to request huge pages early in the boot process (when the possibility
of getting physical contiguous pages is still very high).
If the user applications are going to request hugepages using mmap system
call, then it is required that system administrator mount a file system of
type hugetlbfs:
mount none /mnt/huge -t hugetlbfs <uid=value> <gid=value> <mode=value>
<size=value> <nr_inodes=value>
This command mounts a (pseudo) filesystem of type hugetlbfs on the directory
/mnt/huge. Any files created on /mnt/huge uses hugepages. The uid and gid
options sets the owner and group of the root of the file system. By default
the uid and gid of the current process are taken. The mode option sets the
mode of root of file system to value & 0777. This value is given in octal.
By default the value 0755 is picked. The size option sets the maximum value of
memory (huge pages) allowed for that filesystem (/mnt/huge). The size is
rounded down to HPAGE_SIZE. The option nr_inodes sets the maximum number of
inodes that /mnt/huge can use. If the size or nr_inodes options are not
provided on command line then no limits are set. For size and nr_inodes
options, you can use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo. For
example, size=2K has the same meaning as size=2048. An example is given at
the end of this document.
read and write system calls are not supported on files that reside on hugetlb
file systems.
Regular chown, chgrp, and chmod commands (with right permissions) could be
used to change the file attributes on hugetlbfs.
Also, it is important to note that no such mount command is required if the
applications are going to use only shmat/shmget system calls. Users who
wish to use hugetlb page via shared memory segment should be a member of
a supplementary group and system admin needs to configure that gid into
/proc/sys/vm/hugetlb_shm_group. It is possible for same or different
applications to use any combination of mmaps and shm* calls, though the
mount of filesystem will be required for using mmap calls.
*******************************************************************
/*
* Example of using hugepage memory in a user application using Sys V shared
* memory system calls. In this example the app is requesting 256MB of
* memory that is backed by huge pages. The application uses the flag
* SHM_HUGETLB in the shmget system call to inform the kernel that it is
* requesting hugepages.
*
* For the ia64 architecture, the Linux kernel reserves Region number 4 for
* hugepages. That means the addresses starting with 0x800000... will need
* to be specified. Specifying a fixed address is not required on ppc64,
* i386 or x86_64.
*
* Note: The default shared memory limit is quite low on many kernels,
* you may need to increase it via:
*
* echo 268435456 > /proc/sys/kernel/shmmax
*
* This will increase the maximum size per shared memory segment to 256MB.
* The other limit that you will hit eventually is shmall which is the
* total amount of shared memory in pages. To set it to 16GB on a system
* with a 4kB pagesize do:
*
* echo 4194304 > /proc/sys/kernel/shmall
*/
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/mman.h>
#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000
#endif
#define LENGTH (256UL*1024*1024)
#define dprintf(x) printf(x)
/* Only ia64 requires this */
#ifdef __ia64__
#define ADDR (void *)(0x8000000000000000UL)
#define SHMAT_FLAGS (SHM_RND)
#else
#define ADDR (void *)(0x0UL)
#define SHMAT_FLAGS (0)
#endif
int main(void)
{
int shmid;
unsigned long i;
char *shmaddr;
if ((shmid = shmget(2, LENGTH,
SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W)) < 0) {
perror("shmget");
exit(1);
}
printf("shmid: 0x%x\n", shmid);
shmaddr = shmat(shmid, ADDR, SHMAT_FLAGS);
if (shmaddr == (char *)-1) {
perror("Shared memory attach failure");
shmctl(shmid, IPC_RMID, NULL);
exit(2);
}
printf("shmaddr: %p\n", shmaddr);
dprintf("Starting the writes:\n");
for (i = 0; i < LENGTH; i++) {
shmaddr[i] = (char)(i);
if (!(i % (1024 * 1024)))
dprintf(".");
}
dprintf("\n");
dprintf("Starting the Check...");
for (i = 0; i < LENGTH; i++)
if (shmaddr[i] != (char)i)
printf("\nIndex %lu mismatched\n", i);
dprintf("Done.\n");
if (shmdt((const void *)shmaddr) != 0) {
perror("Detach failure");
shmctl(shmid, IPC_RMID, NULL);
exit(3);
}
shmctl(shmid, IPC_RMID, NULL);
return 0;
}
*******************************************************************
/*
* Example of using hugepage memory in a user application using the mmap
* system call. Before running this application, make sure that the
* administrator has mounted the hugetlbfs filesystem (on some directory
* like /mnt) using the command mount -t hugetlbfs nodev /mnt. In this
* example, the app is requesting memory of size 256MB that is backed by
* huge pages.
*
* For ia64 architecture, Linux kernel reserves Region number 4 for hugepages.
* That means the addresses starting with 0x800000... will need to be
* specified. Specifying a fixed address is not required on ppc64, i386
* or x86_64.
*/
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>
#define FILE_NAME "/mnt/hugepagefile"
#define LENGTH (256UL*1024*1024)
#define PROTECTION (PROT_READ | PROT_WRITE)
/* Only ia64 requires this */
#ifdef __ia64__
#define ADDR (void *)(0x8000000000000000UL)
#define FLAGS (MAP_SHARED | MAP_FIXED)
#else
#define ADDR (void *)(0x0UL)
#define FLAGS (MAP_SHARED)
#endif
void check_bytes(char *addr)
{
printf("First hex is %x\n", *((unsigned int *)addr));
}
void write_bytes(char *addr)
{
unsigned long i;
for (i = 0; i < LENGTH; i++)
*(addr + i) = (char)i;
}
void read_bytes(char *addr)
{
unsigned long i;
check_bytes(addr);
for (i = 0; i < LENGTH; i++)
if (*(addr + i) != (char)i) {
printf("Mismatch at %lu\n", i);
break;
}
}
int main(void)
{
void *addr;
int fd;
fd = open(FILE_NAME, O_CREAT | O_RDWR, 0755);
if (fd < 0) {
perror("Open failed");
exit(1);
}
addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, fd, 0);
if (addr == MAP_FAILED) {
perror("mmap");
unlink(FILE_NAME);
exit(1);
}
printf("Returned address is %p\n", addr);
check_bytes(addr);
write_bytes(addr);
read_bytes(addr);
munmap(addr, LENGTH);
close(fd);
unlink(FILE_NAME);
return 0;
}

130
Documentation/vm/locking Normal file
View File

@@ -0,0 +1,130 @@
Started Oct 1999 by Kanoj Sarcar <kanojsarcar@yahoo.com>
The intent of this file is to have an uptodate, running commentary
from different people about how locking and synchronization is done
in the Linux vm code.
page_table_lock & mmap_sem
--------------------------------------
Page stealers pick processes out of the process pool and scan for
the best process to steal pages from. To guarantee the existence
of the victim mm, a mm_count inc and a mmdrop are done in swap_out().
Page stealers hold kernel_lock to protect against a bunch of races.
The vma list of the victim mm is also scanned by the stealer,
and the page_table_lock is used to preserve list sanity against the
process adding/deleting to the list. This also guarantees existence
of the vma. Vma existence is not guaranteed once try_to_swap_out()
drops the page_table_lock. To guarantee the existence of the underlying
file structure, a get_file is done before the swapout() method is
invoked. The page passed into swapout() is guaranteed not to be reused
for a different purpose because the page reference count due to being
present in the user's pte is not released till after swapout() returns.
Any code that modifies the vmlist, or the vm_start/vm_end/
vm_flags:VM_LOCKED/vm_next of any vma *in the list* must prevent
kswapd from looking at the chain.
The rules are:
1. To scan the vmlist (look but don't touch) you must hold the
mmap_sem with read bias, i.e. down_read(&mm->mmap_sem)
2. To modify the vmlist you need to hold the mmap_sem with
read&write bias, i.e. down_write(&mm->mmap_sem) *AND*
you need to take the page_table_lock.
3. The swapper takes _just_ the page_table_lock, this is done
because the mmap_sem can be an extremely long lived lock
and the swapper just cannot sleep on that.
4. The exception to this rule is expand_stack, which just
takes the read lock and the page_table_lock, this is ok
because it doesn't really modify fields anybody relies on.
5. You must be able to guarantee that while holding page_table_lock
or page_table_lock of mm A, you will not try to get either lock
for mm B.
The caveats are:
1. find_vma() makes use of, and updates, the mmap_cache pointer hint.
The update of mmap_cache is racy (page stealer can race with other code
that invokes find_vma with mmap_sem held), but that is okay, since it
is a hint. This can be fixed, if desired, by having find_vma grab the
page_table_lock.
Code that add/delete elements from the vmlist chain are
1. callers of insert_vm_struct
2. callers of merge_segments
3. callers of avl_remove
Code that changes vm_start/vm_end/vm_flags:VM_LOCKED of vma's on
the list:
1. expand_stack
2. mprotect
3. mlock
4. mremap
It is advisable that changes to vm_start/vm_end be protected, although
in some cases it is not really needed. Eg, vm_start is modified by
expand_stack(), it is hard to come up with a destructive scenario without
having the vmlist protection in this case.
The page_table_lock nests with the inode i_mmap_lock and the kmem cache
c_spinlock spinlocks. This is okay, since the kmem code asks for pages after
dropping c_spinlock. The page_table_lock also nests with pagecache_lock and
pagemap_lru_lock spinlocks, and no code asks for memory with these locks
held.
The page_table_lock is grabbed while holding the kernel_lock spinning monitor.
The page_table_lock is a spin lock.
Note: PTL can also be used to guarantee that no new clones using the
mm start up ... this is a loose form of stability on mm_users. For
example, it is used in copy_mm to protect against a racing tlb_gather_mmu
single address space optimization, so that the zap_page_range (from
vmtruncate) does not lose sending ipi's to cloned threads that might
be spawned underneath it and go to user mode to drag in pte's into tlbs.
swap_lock
--------------
The swap devices are chained in priority order from the "swap_list" header.
The "swap_list" is used for the round-robin swaphandle allocation strategy.
The #free swaphandles is maintained in "nr_swap_pages". These two together
are protected by the swap_lock.
The swap_lock also protects all the device reference counts on the
corresponding swaphandles, maintained in the "swap_map" array, and the
"highest_bit" and "lowest_bit" fields.
The swap_lock is a spinlock, and is never acquired from intr level.
To prevent races between swap space deletion or async readahead swapins
deciding whether a swap handle is being used, ie worthy of being read in
from disk, and an unmap -> swap_free making the handle unused, the swap
delete and readahead code grabs a temp reference on the swaphandle to
prevent warning messages from swap_duplicate <- read_swap_cache_async.
Swap cache locking
------------------
Pages are added into the swap cache with kernel_lock held, to make sure
that multiple pages are not being added (and hence lost) by associating
all of them with the same swaphandle.
Pages are guaranteed not to be removed from the scache if the page is
"shared": ie, other processes hold reference on the page or the associated
swap handle. The only code that does not follow this rule is shrink_mmap,
which deletes pages from the swap cache if no process has a reference on
the page (multiple processes might have references on the corresponding
swap handle though). lookup_swap_cache() races with shrink_mmap, when
establishing a reference on a scache page, so, it must check whether the
page it located is still in the swapcache, or shrink_mmap deleted it.
(This race is due to the fact that shrink_mmap looks at the page ref
count with pagecache_lock, but then drops pagecache_lock before deleting
the page from the scache).
do_wp_page and do_swap_page have MP races in them while trying to figure
out whether a page is "shared", by looking at the page_count + swap_count.
To preserve the sum of the counts, the page lock _must_ be acquired before
calling is_page_shared (else processes might switch their swap_count refs
to the page count refs, after the page count ref has been snapshotted).
Swap device deletion code currently breaks all the scache assumptions,
since it grabs neither mmap_sem nor page_table_lock.

41
Documentation/vm/numa Normal file
View File

@@ -0,0 +1,41 @@
Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com>
The intent of this file is to have an uptodate, running commentary
from different people about NUMA specific code in the Linux vm.
What is NUMA? It is an architecture where the memory access times
for different regions of memory from a given processor varies
according to the "distance" of the memory region from the processor.
Each region of memory to which access times are the same from any
cpu, is called a node. On such architectures, it is beneficial if
the kernel tries to minimize inter node communications. Schemes
for this range from kernel text and read-only data replication
across nodes, and trying to house all the data structures that
key components of the kernel need on memory on that node.
Currently, all the numa support is to provide efficient handling
of widely discontiguous physical memory, so architectures which
are not NUMA but can have huge holes in the physical address space
can use the same code. All this code is bracketed by CONFIG_DISCONTIGMEM.
The initial port includes NUMAizing the bootmem allocator code by
encapsulating all the pieces of information into a bootmem_data_t
structure. Node specific calls have been added to the allocator.
In theory, any platform which uses the bootmem allocator should
be able to put the bootmem and mem_map data structures anywhere
it deems best.
Each node's page allocation data structures have also been encapsulated
into a pg_data_t. The bootmem_data_t is just one part of this. To
make the code look uniform between NUMA and regular UMA platforms,
UMA platforms have a statically allocated pg_data_t too (contig_page_data).
For the sake of uniformity, the function num_online_nodes() is also defined
for all platforms. As we run benchmarks, we might decide to NUMAize
more variables like low_on_memory, nr_free_pages etc into the pg_data_t.
The NUMA aware page allocation code currently tries to allocate pages
from different nodes in a round robin manner. This will be changed to
do concentratic circle search, starting from current node, once the
NUMA port achieves more maturity. The call alloc_pages_node has been
added, so that drivers can make the call and not worry about whether
it is running on a NUMA or UMA platform.

View File

@@ -0,0 +1,73 @@
The Linux kernel supports the following overcommit handling modes
0 - Heuristic overcommit handling. Obvious overcommits of
address space are refused. Used for a typical system. It
ensures a seriously wild allocation fails while allowing
overcommit to reduce swap usage. root is allowed to
allocate slighly more memory in this mode. This is the
default.
1 - Always overcommit. Appropriate for some scientific
applications.
2 - Don't overcommit. The total address space commit
for the system is not permitted to exceed swap + a
configurable percentage (default is 50) of physical RAM.
Depending on the percentage you use, in most situations
this means a process will not be killed while accessing
pages but will receive errors on memory allocation as
appropriate.
The overcommit policy is set via the sysctl `vm.overcommit_memory'.
The overcommit percentage is set via `vm.overcommit_ratio'.
The current overcommit limit and amount committed are viewable in
/proc/meminfo as CommitLimit and Committed_AS respectively.
Gotchas
-------
The C language stack growth does an implicit mremap. If you want absolute
guarantees and run close to the edge you MUST mmap your stack for the
largest size you think you will need. For typical stack usage this does
not matter much but it's a corner case if you really really care
In mode 2 the MAP_NORESERVE flag is ignored.
How It Works
------------
The overcommit is based on the following rules
For a file backed map
SHARED or READ-only - 0 cost (the file is the map not swap)
PRIVATE WRITABLE - size of mapping per instance
For an anonymous or /dev/zero map
SHARED - size of mapping
PRIVATE READ-only - 0 cost (but of little use)
PRIVATE WRITABLE - size of mapping per instance
Additional accounting
Pages made writable copies by mmap
shmfs memory drawn from the same pool
Status
------
o We account mmap memory mappings
o We account mprotect changes in commit
o We account mremap changes in size
o We account brk
o We account munmap
o We report the commit status in /proc
o Account and check on fork
o Review stack handling/building on exec
o SHMfs accounting
o Implement actual limit enforcement
To Do
-----
o Account ptrace pages (this is hard)

View File

@@ -0,0 +1,147 @@
Page migration
--------------
Page migration allows the moving of the physical location of pages between
nodes in a numa system while the process is running. This means that the
virtual addresses that the process sees do not change. However, the
system rearranges the physical location of those pages.
The main intend of page migration is to reduce the latency of memory access
by moving pages near to the processor where the process accessing that memory
is running.
Page migration allows a process to manually relocate the node on which its
pages are located through the MF_MOVE and MF_MOVE_ALL options while setting
a new memory policy via mbind(). The pages of process can also be relocated
from another process using the sys_migrate_pages() function call. The
migrate_pages function call takes two sets of nodes and moves pages of a
process that are located on the from nodes to the destination nodes.
Page migration functions are provided by the numactl package by Andi Kleen
(a version later than 0.9.3 is required. Get it from
ftp://ftp.suse.com/pub/people/ak). numactl provided libnuma which
provides an interface similar to other numa functionality for page migration.
cat /proc/<pid>/numa_maps allows an easy review of where the pages of
a process are located. See also the numa_maps manpage in the numactl package.
Manual migration is useful if for example the scheduler has relocated
a process to a processor on a distant node. A batch scheduler or an
administrator may detect the situation and move the pages of the process
nearer to the new processor. The kernel itself does only provide
manual page migration support. Automatic page migration may be implemented
through user space processes that move pages. A special function call
"move_pages" allows the moving of individual pages within a process.
A NUMA profiler may f.e. obtain a log showing frequent off node
accesses and may use the result to move pages to more advantageous
locations.
Larger installations usually partition the system using cpusets into
sections of nodes. Paul Jackson has equipped cpusets with the ability to
move pages when a task is moved to another cpuset (See ../cpusets.txt).
Cpusets allows the automation of process locality. If a task is moved to
a new cpuset then also all its pages are moved with it so that the
performance of the process does not sink dramatically. Also the pages
of processes in a cpuset are moved if the allowed memory nodes of a
cpuset are changed.
Page migration allows the preservation of the relative location of pages
within a group of nodes for all migration techniques which will preserve a
particular memory allocation pattern generated even after migrating a
process. This is necessary in order to preserve the memory latencies.
Processes will run with similar performance after migration.
Page migration occurs in several steps. First a high level
description for those trying to use migrate_pages() from the kernel
(for userspace usage see the Andi Kleen's numactl package mentioned above)
and then a low level description of how the low level details work.
A. In kernel use of migrate_pages()
-----------------------------------
1. Remove pages from the LRU.
Lists of pages to be migrated are generated by scanning over
pages and moving them into lists. This is done by
calling isolate_lru_page().
Calling isolate_lru_page increases the references to the page
so that it cannot vanish while the page migration occurs.
It also prevents the swapper or other scans to encounter
the page.
2. We need to have a function of type new_page_t that can be
passed to migrate_pages(). This function should figure out
how to allocate the correct new page given the old page.
3. The migrate_pages() function is called which attempts
to do the migration. It will call the function to allocate
the new page for each page that is considered for
moving.
B. How migrate_pages() works
----------------------------
migrate_pages() does several passes over its list of pages. A page is moved
if all references to a page are removable at the time. The page has
already been removed from the LRU via isolate_lru_page() and the refcount
is increased so that the page cannot be freed while page migration occurs.
Steps:
1. Lock the page to be migrated
2. Insure that writeback is complete.
3. Prep the new page that we want to move to. It is locked
and set to not being uptodate so that all accesses to the new
page immediately lock while the move is in progress.
4. The new page is prepped with some settings from the old page so that
accesses to the new page will discover a page with the correct settings.
5. All the page table references to the page are converted
to migration entries or dropped (nonlinear vmas).
This decrease the mapcount of a page. If the resulting
mapcount is not zero then we do not migrate the page.
All user space processes that attempt to access the page
will now wait on the page lock.
6. The radix tree lock is taken. This will cause all processes trying
to access the page via the mapping to block on the radix tree spinlock.
7. The refcount of the page is examined and we back out if references remain
otherwise we know that we are the only one referencing this page.
8. The radix tree is checked and if it does not contain the pointer to this
page then we back out because someone else modified the radix tree.
9. The radix tree is changed to point to the new page.
10. The reference count of the old page is dropped because the radix tree
reference is gone. A reference to the new page is established because
the new page is referenced to by the radix tree.
11. The radix tree lock is dropped. With that lookups in the mapping
become possible again. Processes will move from spinning on the tree_lock
to sleeping on the locked new page.
12. The page contents are copied to the new page.
13. The remaining page flags are copied to the new page.
14. The old page flags are cleared to indicate that the page does
not provide any information anymore.
15. Queued up writeback on the new page is triggered.
16. If migration entries were page then replace them with real ptes. Doing
so will enable access for user space processes not already waiting for
the page lock.
19. The page locks are dropped from the old and new page.
Processes waiting on the page lock will redo their page faults
and will reach the new page.
20. The new page is moved to the LRU and can be scanned by the swapper
etc again.
Christoph Lameter, May 8, 2006.