On the 32-bit Windows platform, JVM programs can only ever use up to about 1.5–1.6 GiB of memory in RAM per Java VM process. Allocating a heap size greater than this amount will either not work, or force paging and poor performance.
From my own research on this issue (and taking into account Microsoft’s advice on KnowledgeBase about Win32 virtual memory allocation), this is what I understand about the limits of RAM usability, both for Win32 itself, and for 32-bit JVMs running on Windows. I haven’t researched if this limit applies for 64-bit Windows or JVMs, nor what Vista might be doing. Also, though the 4GiB limit is inherent to 32-bit machines, the 2GiB limit seems to be peculiar to Windows, and I’ve not seen it anywhere in Linux, Solaris or BSD: when unix runs on a 32-bit machine, there’s a 4GiB limit, but not 2GiB.
Any 32-bit binary processor has a hard limit of 4,294,967,296 which is the largest number that can be represented in a single 32 bit machine word: 232 = 4,294,967,296. On a byte-addressed computer like Intel IA32, that equates 4294967296 bytes = 4194304 KiB = 4096 MiB = 4GiB.
Normal memory access techniques used by 32-bit Windows programs use standard linear byte addressing, and so are limited to 4GiB of addressable memory, whether it is real or virtual.
On Windows, the amount of this 4GiB space that can actually occupy physical RAM is halved to 2GiB per process. Windows uses the other 2GiB of virutal address space as a per-process overhead for the kernel, and to speed up paging. This is really dumb because it means that if your process allocates > 2GiB heap memory, then Windows must page some to virtual memory, even if you have > 2GiB of actual RAM! Sort of like the DOS 640KiB limit reborn!
To overcome this dumb design, Windows has a memory addressing scheme called the AWE API, which allows a process to allocate up to 3GiB of memory and have that memory reside on chips. To use this, the program must be specially written to use the AWE API.There’s also another virtual memory technology in Windows called PAE. This is not useful to application programs — it is how the Windows kernel can use > 4GiB of real memory to allocate physical memory to all processes on the system. Each process is still limited to the 4GiB address space each, with 2GiB mapped to RAM (or 3GiB, if the program uses the AWE API) and the rest having to be virtual (paged to disc). PAE just lets Windows keep more than one of these big processes in memory at once even if their combined total heap is more than 4GiB (and assuming you have more than 4GiB of RAM of course).
Both Sun and BAE have refused to use the AWE API for implementing their Java VMs. This is probably because the AWE API does not allow for a contiguous address space of 3GiB, but rather breaks it at the 2GiB mark. The Java VM spec’ used to require a contiguous addressed heap (though this has changed for the JVMS 2nd Ed.…). So any Java VM running on Windows is still limited to at most < 2GiB per running program (actually, only about 1.5GiB is usable because of further overhead for the JVM itself). At least, so long as Win32 JVMs don’t use the AWE API. I’m not sure how difficult it would be for Sun or BAE to change their JVMs to make use of the AWE API in Windows, but the fact that they haven’t done so seems to indicate to me that it wouldn’t be easy. I have been unable to determine if IBM’s JVM uses it…
The only workable solution for utilizing greater than about 1.5–1.6 GiB per JVM process on a 32-bit host is to not run it in Windows (i.e. use Solaris, Linux or BSD). Real operating systems can let processes use 4GiB on 32-bit machines without special programming tricks. Or, you could switch to a 64-bit platform. Although there is a Windows for IA-64, I’m not sure about the availability of a 64-bit JVM for that platform.
It would be better to have a smaller heap on Win32, and if you need more, consider re-engineering the program to use less anyway. If your program is genuinely memory bound and you can’t get away from needing more than 1.5GiB heap, you could work around the Win32 memory limit by splitting your Java program into more than once process, each running on it’s own JVM, allocating 1.5GiB to each JVM, and then having the processes communicate with an IPC mechanism as needed, such as JMS. However there’s probably a lot more re-engineering work involved in this than there is to just migrate away from Win32 …
A reasonable approach that has been well established over the years is a layered system. Dijkstra's THE system (Fig. 1-25) was the first layered operating system. UNIX (Fig. 10-3) and Windows 2000 (Fig. 11-7) also have a layered structure, but the layering in both of them is more a way of trying to describe the system than a real guiding principle that was used in building the system.
For a new system, designers choosing to go this route should first very carefully choose the layers and define the functionality of each one. The bottom layer should always try to hide the worst idiosyncracies of the hardware, as the HAL does in Fig. 11-7. Probably the next layer should handle interrupts, context switching, and the MMU, so above this level, the code is mostly machine independent. Above this, different designers will have different tastes (and biases). One possibility is to have layer 3 manage threads, including scheduling and interthread synchronization, as shown in Fig. 12-1. The idea here is that starting at layer 4 we have proper threads that are scheduled normally and synchronize using a standard mechanism (e.g., mutexes).
In layer 4 we might find the device drivers, each one running as a separate thread, with its own state, program counter, registers, etc., possibly (but not necessarily) within the kernel address space. Such a design can greatly simplify the I/O structure because when an interrupt occurs, it can be converted into an unlock on a mutex and a call to the scheduler to (potentially) schedule the newly readied thread that was blocked on the mutex. MINIX uses this approach, but in UNIX, Linux, and Windows 2000, the interrupt handlers run in a kind of no-man's land, rather than as proper threads that can be scheduled, suspended, etc. Since a huge amount of the complexity of any operating system is in the I/O, any technique for making it more tractable and encapsulated is worth considering.
Above layer 4, we would expect to find virtual memory, one or more file systems, and the system call handlers. If the virtual memory is at a lower level than the file systems, then the block cache can be paged out, allowing the virtual memory manager to dynamically determine how the real memory should be divided among user pages and kernel pages, including the cache. Windows 2000 works this way.