Memory Management in Linux.


Introduction.

Memory management on any platform and on any operating system is a complex task, and it becomes more difficult still when the operating system must run on several platforms. Linux is a Unix-style operating system that is POSIX compatible. Linux runs on IBM PC compatible 80386 (ia32) processors and above, the Alpha AXP LCA4 21066 and above, Sun SPARC, Sun UltraSPARC, Apple Power Macintosh, Amiga, Atari, Acorn ARM, IBM ESA/390 mainframes, SGI Indy, MIPS, and even the new, still unreleased ia64 (Itanium, formerly Merced) processors from Intel. There are current projects to run Linux on IBM mainframes, palmtops and even the Nintendo 64.


The memory management routines are written in C, with platform-dependent routines written in assembly language. This is necessary because not only does each of these architectures address its memory management hardware differently, but some are 32-bit and some are 64-bit, and some are little endian while others are big endian. This makes Linux a complex but powerful system.


Scenario

When running Linux, virtual memory is used extensively. When you tell Linux to run a program, the program image is not loaded into RAM. Instead the image is allocated a section of virtual memory, and only the first part of the image, the part needed to start running the program, is actually loaded. When the process later requests a section of memory that lies within the virtual memory allocated to it but outside the part currently loaded (since only the running part of the program is in memory, not the entire image), the operating system receives a page fault from the processor. Linux handles this by first checking the type of page fault. If the access was illegal in some way (an attempt to access kernel space, or another process's memory), a memory fault is logged. If the access was legal, the operating system does one of two things: it either loads an already cached page back from the swap file, or it loads the page from the image.


The operating system first checks whether the page has been cached, and it does this by checking the page table entry for the address that was requested. The page table lists the pages a process is using in virtual memory; these tables map virtual memory onto physical memory. A process sees these pages as virtual pages, although each is backed by a logical (physical) page. The pages do not need to be held consecutively in physical memory, which is the reason for having page tables: the pages may be in any order when loaded in memory, but the process does not know this and addresses them as if they were consecutive logical pages.
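The lookup described above starts by splitting the faulting virtual address into a page number, used to index the page table, and an offset within that page. A minimal sketch in C, assuming 4 KB pages as is typical on ia32 (the names and constants here are illustrative, not kernel identifiers):

```c
#include <stdint.h>

/* Assumed 4 KB page size: the low 12 bits of an address are the
 * offset inside the page, the high bits are the virtual page number. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (PAGE_SIZE - 1)

/* Virtual page number: indexes the page table. */
uint32_t virt_page(uint32_t vaddr)   { return vaddr >> PAGE_SHIFT; }

/* Offset within the page: added to the physical frame address. */
uint32_t page_offset(uint32_t vaddr) { return vaddr & PAGE_MASK; }
```

Because the offset bits pass through unchanged, pages can live anywhere in physical memory while the process still sees one consecutive address space.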


The operating system kernel allocates logical pages to processes as and when they need them (within the kernel itself, allocation of this kind is done with kmalloc, a kernel routine that allocates and frees kernel memory). The page tables are set up so that when a process asks for memory in a page listed in its page table (and therefore allocated to it in virtual memory), the page is marked as either loaded into RAM, held in the swap cache, or not yet loaded at all. In the first case the process's memory request is simply remapped to the logical page it is requesting. In the second case the page is first pulled back into memory, the page table is updated, and the request is then handled as in the first case. In the third case the requested memory has to be loaded from the image file into a page, the page table is updated, and the process continues as in the first case.
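The three cases above amount to a small decision function. A sketch, purely illustrative; none of these names correspond to real kernel identifiers:

```c
/* The three states a page can be in, per the text, and the action
 * the fault handler takes for each.  Illustrative names only. */
enum page_state   { PAGE_IN_RAM, PAGE_IN_SWAP, PAGE_NOT_LOADED };
enum fault_action { REMAP_PAGE, SWAP_IN_THEN_REMAP, LOAD_FROM_IMAGE };

enum fault_action handle_legal_fault(enum page_state state)
{
    switch (state) {
    case PAGE_IN_RAM:  return REMAP_PAGE;          /* case 1: just remap   */
    case PAGE_IN_SWAP: return SWAP_IN_THEN_REMAP;  /* case 2: swap in first */
    default:           return LOAD_FROM_IMAGE;     /* case 3: read image   */
    }
}
```

Cases 2 and 3 both end by falling through to case 1's behaviour: once the page is in memory and the table updated, the access is remapped and the process resumes.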



This works fine until there are no free pages left to allocate when a process requests one.

In this situation the kernel process kswapd swaps out a page so that the new page can be loaded. kswapd runs at a predetermined interval, and when it runs it checks whether there are enough free pages. If there are fewer than its free_pages_high limit it tries to free up three pages before it next runs; if the number of free pages has fallen below the free_pages_low limit it tries to free up six. If the count was below the free_pages_low limit, kswapd also sleeps for less time than it normally does before running again. To free up pages it tries three methods in turn. The first is to reduce the page cache and buffers. If this fails to free enough pages, it tries to swap out shared memory pages. If that still fails to free enough, it swaps out and discards ordinary pages.
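The two thresholds above reduce to a small policy function. A sketch, assuming the limits are plain page counts; the names free_pages_high and free_pages_low come from the text, while the structure around them is an illustration:

```c
/* Illustrative kswapd policy: how many pages to try to free before
 * the next run, given the current number of free pages.  The target
 * counts (3 and 6) are the ones given in the text. */
struct swap_limits {
    int free_pages_high;
    int free_pages_low;
};

int pages_to_free(const struct swap_limits *l, int free_pages)
{
    if (free_pages < l->free_pages_low)  return 6; /* memory very low    */
    if (free_pages < l->free_pages_high) return 3; /* memory getting low */
    return 0;                                      /* enough free pages  */
}
```

In the urgent case (below free_pages_low) the real kswapd also shortens its sleep interval, so the larger target is attempted more often.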



Freeing memory that is in use will always cause a system slowdown, because a page that may have been cached before now has to be reloaded into memory. The cheapest way to free memory is simply to reduce the page cache and buffers. This affects every process equally, as the cache is reduced across all processes rather than one process having all its cached pages taken away; the same applies to buffer space. A very similar method is used on the shared memory pages used for inter-process communication (IPC): pages are examined to see whether they can be swapped (i.e. they are not being used, or have not been used for some time) and are then swapped out to disk. The last method is to swap out and discard ordinary pages. A page is swapped out only if it has been changed and the data it contains cannot be retrieved any other way. If the page can be retrieved another way (it may contain part of the image file, or other read-only memory that had to be loaded for reading or executing but is no longer needed), then the page is discarded without being written out to swap. This saves much time when freeing pages, but it does mean that if the page is needed again it will take much longer to restore.
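The swap-or-discard choice at the end of that sequence is a simple predicate. A sketch, assuming all we know about a page is whether it has been modified and whether its contents can be re-read from a backing file:

```c
#include <stdbool.h>

/* Illustrative: a clean page whose contents can be re-read from its
 * backing file (e.g. program text) is simply discarded; anything
 * else must be written out to swap before the page is reused. */
enum evict_method { DISCARD_PAGE, SWAP_OUT_PAGE };

enum evict_method evict(bool modified, bool file_backed)
{
    return (!modified && file_backed) ? DISCARD_PAGE : SWAP_OUT_PAGE;
}
```

Discarding is the cheap path now (no disk write), at the cost of a slower restore later when the data has to be fetched from the original file again.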



If a user process requests memory for storage of data, the entries needed are added to its page table, and the virtual memory is mapped to logical memory as and when the pages are read from or written to. In this way you may have many processes running that each require more memory (or whose total requirement is more) than you have in your system.



If you had seven processes (p1 to p7) needing 1 MB, 2.5 MB, 1.5 MB, 700 KB, 5 MB, 10 MB and 15 MB of memory (about 35.7 MB in total) running on a system with only 16 MB of RAM, only the pages actually being used would be in memory. When the programs were started, only the first part of each would be loaded; then, as each requested pages, those pages would be loaded into logical pages. The entire image of every program, however, would be mapped to virtual pages from the very start.
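The arithmetic of that example is worth making explicit: the seven demands add up to far more than the physical RAM, which is only workable because the images are mapped, not loaded. A quick check in C (sizes in KB, taking 1 MB = 1024 KB):

```c
/* The seven memory demands from the example, in KB. */
static const int demand_kb[] = {1024, 2560, 1536, 700, 5120, 10240, 15360};

int total_kb(const int *d, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += d[i];
    return sum;
}

/* total_kb(demand_kb, 7) gives 36540 KB, roughly 35.7 MB, against
 * 16 * 1024 = 16384 KB of physical RAM. */
```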



Linux can use many file systems, from the old DOS 8.3-style FAT (16 and 12) up to OS/2 HPFS and Windows VFAT (FAT16 with long file names, and FAT32), plus many others (Minix, UMSDOS, SMB, ISO 9660, ISO 9660 Joliet, AFFS, and NTFS in read-only mode). The most common file system to use with Linux is ext2 (the second extended file system). It has what is known as a superblock, which is roughly the equivalent of the DOS FAT. This superblock contains information about the file system on the partition, such as the block size, blocks per group, free blocks, free inodes and first inode. Each block group then has a block group descriptor, which contains the block bitmap (a bit-mapped field showing which blocks are free), the inode bitmap (the same, but for inodes), the inode table, the free blocks count, the free inodes count and the used directories count.
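The superblock fields listed above can be sketched as a struct. This is a deliberately simplified illustration; the real on-disk layout (struct ext2_super_block in the kernel sources) has many more fields and different names and types:

```c
#include <stdint.h>

/* Simplified sketch of an ext2 superblock, holding only the fields
 * mentioned in the text.  Not the real on-disk layout. */
struct superblock_sketch {
    uint32_t block_size;        /* bytes per data block      */
    uint32_t blocks_per_group;
    uint32_t free_blocks;
    uint32_t free_inodes;
    uint32_t first_inode;       /* first usable inode number */
};
```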



Every file, directory, link (symbolic or not) and FIFO has an inode. The inode describes it: file permissions, dates (created, modified), size, where the first 12 blocks of data are stored, and then links to further blocks that describe where the rest of the file is. This was done for efficiency, as many files under Linux are very small, and some are links, FIFOs or device files that take up no real space at all beyond what the inode itself uses.
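That layout, a dozen direct block addresses plus indirection for larger files, can also be sketched as a struct. Again simplified: the real struct ext2_inode differs in names, types and field count:

```c
#include <stdint.h>

/* Simplified inode sketch: 12 direct block addresses cover small
 * files cheaply; larger files go through the indirect pointers.
 * Names and types here are illustrative, not the real layout. */
#define NDIR_BLOCKS 12

struct inode_sketch {
    uint16_t mode;                 /* type and permission bits        */
    uint32_t size;                 /* file size in bytes              */
    uint32_t ctime, mtime;         /* created, modified               */
    uint32_t block[NDIR_BLOCKS];   /* direct block addresses          */
    uint32_t indirect;             /* block holding further addresses */
    uint32_t double_indirect;
    uint32_t triple_indirect;
};
```

A file that fits in 12 blocks never touches the indirect chain at all, which is why small files are so cheap to access.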



When a file has filled up its block, the file must use another block. To make file access faster and to prevent fragmentation, the file system preallocates blocks. These blocks are not really allocated; they are reserved for the file immediately before them, and they will only be given to other files as a last resort, if the file system is getting full and needs to write data to a different file. If there are preallocated blocks, the file can grow into those. If there are none, the file system tries to allocate the block just after the current end of the file. If this fails because that block is already allocated, the file system tries to find a block within 64 data blocks, or failing that within the same block group. If this fails, it tries to find eight consecutive free blocks and use those. If this too fails, the next free block anywhere is used. If the file size is reduced, the freed blocks are added to the preallocated block list if they are consecutive from the new end of the file; if they are not consecutive, they are simply marked as free.
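The fallback order above is essentially a fixed preference list. A sketch of it as an enumeration in the order tried; the names are illustrative, and the real logic lives in ext2's block allocator:

```c
/* The order in which ext2 tries to find a new block for a growing
 * file, per the description above.  Lower value = tried first. */
enum alloc_try {
    TRY_PREALLOCATED = 0,  /* grow into preallocated blocks        */
    TRY_NEXT_BLOCK,        /* block just past the end of the file  */
    TRY_WITHIN_64,         /* a block within 64 data blocks        */
    TRY_SAME_GROUP,        /* anywhere in the same block group     */
    TRY_EIGHT_RUN,         /* a run of 8 consecutive free blocks   */
    TRY_FIRST_FREE         /* the next free block anywhere         */
};
```

Each step trades a little search effort for locality: the earlier a strategy succeeds, the closer the new block sits to the rest of the file.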

Directories in ext2 are special files whose contents give, for each file in the directory, the file name length, the file name and the inode number. The inode itself holds the block addresses for the data. Every directory also holds the entries . and ..; in the / directory, both of these refer to the root itself.
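A directory entry as described, name length, name and inode, can be sketched as follows. Simplified: the real struct ext2_dir_entry also carries a record length used to chain entries within a block:

```c
#include <stdint.h>
#include <string.h>

/* Simplified directory entry: inode number, name length, name.
 * On disk the name is not NUL-terminated; name_len gives its size. */
struct dirent_sketch {
    uint32_t inode;
    uint8_t  name_len;
    char     name[255];
};

/* Illustrative helper: build an entry for a given name and inode. */
struct dirent_sketch make_dirent(uint32_t ino, const char *nm)
{
    struct dirent_sketch d = {0};
    d.inode = ino;
    d.name_len = (uint8_t)strlen(nm);
    memcpy(d.name, nm, d.name_len);
    return d;
}
```

Because only the name length is stored, names of any length up to the maximum pack tightly into the directory's data blocks.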



For processes run under Linux, there is an important distinction to be made between real-time and non-real-time processes. A real-time process needs things done the moment it requests them. A non-real-time process asks for something to happen and does not mind if it is not done immediately; quite often such a process could wait days for a signal and, when it received it, would not need to process it instantly. Non-real-time processes suit things like e-mail servers and web servers, while real-time processes are used where timing is important, so a system developed to drive a car, for example, would have to be real time.

Linux is not a real-time OS, but it will handle a real-time process as fast as it can: a real-time process has higher priority than almost anything else in the system. Real-time processes can have two different scheduling algorithms. The first is round robin, which gives every real-time process an equal amount of processor time. The second is first in, first out, which simply runs a process until it ends and then starts the next in the process queue, and so on.

For non-real-time processes, each process has a priority, and that priority determines its place in the queue. When a process has just run, the amount of time it spent running is subtracted from its priority and it is placed at the back of the queue. As a process moves up the queue, it may be chosen to run if its priority is high. In this way a process that has had a lot of CPU time but started with a high priority, and a process that has had no CPU time but has a low priority, will both get some processor time: the low-priority process will move to the front of the queue and be run, but as soon as its priority has been decremented to below that of the next process in the queue, that next process will get control of the processor.
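The two real-time algorithms correspond to the POSIX scheduling policies SCHED_RR (round robin) and SCHED_FIFO (first in, first out), which a process can request with sched_setscheduler. A sketch of requesting round-robin scheduling; note that selecting a real-time policy normally requires root privileges, so the call is allowed to fail here:

```c
#include <sched.h>
#include <stdio.h>

/* Ask the kernel to schedule the calling process (pid 0 = self) with
 * the round-robin real-time policy.  Returns 0 on success, -1 on
 * failure (e.g. when not running as root). */
int request_realtime_rr(void)
{
    struct sched_param sp;
    sp.sched_priority = 1;   /* a low real-time priority */
    if (sched_setscheduler(0, SCHED_RR, &sp) != 0) {
        perror("sched_setscheduler");
        return -1;
    }
    return 0;
}
```

Swapping SCHED_RR for SCHED_FIFO gives the run-until-done behaviour described above; both place the process ahead of every normally scheduled process in the system.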