Discussion:
Main memory as an I/O device
Thomas Koenig
2020-05-03 11:58:11 UTC
[F'up to comp.arch]

The first mainframe I worked on had 32 MB of main memory, two
processors, ran MVS and could handle 100 concurrent users under TSO.

Looking at the z15 processor, it has 960 MB of level 4 cache and 12
cores, and even the level 3 cache is 256 MB for 12 cores - more
memory per core than the old mainframe.

Given the (lack of) speed of today's main memory, I wonder if it
would be better to run the OS and application software in cache
and use the main memory just as an I/O device or backup storage.
Douglas Miller
2020-05-03 14:57:13 UTC
Post by Thomas Koenig
[F'up to comp.arch]
The first mainframe I worked on had 32 MB of main memory, two
processors, ran MVS and could handle 100 concurrent users under TSO.
Looking at the z15 processor, it has 960 MB of level 4 cache and 12
cores, and even the level 3 cache is 256 MB for 12 cores - more
memory per core than the old mainframe.
Given the (lack of) speed of today's main memory, I wonder if it
would be better to run the OS and application software in cache
and use the main memory just as an I/O device or backup storage.
It's a pretty complex subject, requiring more than I know. But I have dealt with situations where "running entirely from cache" was done.

Just one minor correction: it's not that modern memory is slow, but rather that modern processor speed has outpaced memory technology.

Also note that, while L3 and L4 cache are faster than main memory, they are still slower than the CPU. Even L1 cache is slower.

Running an entire system (OS, apps, etc.) entirely out of cache is a pretty tall order. You tend to throw away most of main memory. But tuning applications to run mostly or entirely out of cache is a reasonable thing, and something that has been going on in the industry for quite a while. "Cache warmth" has been a concept in Unix/Linux system tuning for several decades. Many supercomputer applications run mostly or entirely out of cache.
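A standard example of that kind of tuning is cache blocking (loop tiling). A minimal sketch in C, for a row-major matrix multiply; the tile size here is an assumption that would have to be tuned to the actual cache sizes of the machine:

#include <stddef.h>

#define BLOCK 64  /* assumed tile size: pick it so a few BLOCK x BLOCK
                     tiles of doubles fit comfortably in L1/L2 */

/* C += A * B for n x n row-major matrices (caller zeroes C first).
   The tiled loop order keeps a small working set resident in cache
   instead of streaming all of B through it once per row of A. */
void matmul_tiled(size_t n, const double *a, const double *b, double *c)
{
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t kk = 0; kk < n; kk += BLOCK)
            for (size_t jj = 0; jj < n; jj += BLOCK)
                for (size_t i = ii; i < ii + BLOCK && i < n; i++)
                    for (size_t k = kk; k < kk + BLOCK && k < n; k++) {
                        double aik = a[i*n + k];
                        for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                            c[i*n + j] += aik * b[k*n + j];
                    }
}

Same arithmetic, same results, but the "cache warm" version can be several times faster once n is large enough that a row no longer fits in cache.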

I'm not certain what you mean by "main memory just as an I/O device". Shared memory can be a medium of communication between cores/processors, but not for offline storage (unless you are using modern persistent-RAM technology), let alone any sort of external data exchange. Memory-mapped I/O does not actually use RAM; it just uses the address space.
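The difference shows up directly in code. A minimal POSIX sketch of the two cases (error handling elided, and the physical address below is invented purely for illustration):

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Memory-mapped I/O: these addresses decode to device registers.
       No DRAM sits behind them; only address space is consumed.
       0xfe000000 is a made-up physical address for illustration,
       and /dev/mem access normally needs root. */
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0xfe000000);

    /* Shared memory as a communication medium: this one *is* backed
       by RAM, and can be shared with a child process after fork(). */
    uint8_t *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    (void)regs; (void)shared;
    return 0;
}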

This is not even bringing process migration, CPU scheduling, NUMA, etc. into the discussion. Running "cache hot", or even "cache warm", tends to mean you have to sacrifice a lot of the modern OS amenities. It is usually a delicate balancing act.
Thomas Koenig
2020-05-03 19:46:55 UTC
Post by Douglas Miller
I'm not certain what you mean by "main memory just as an I/O
device".
I could probably have phrased that better.

What I mean is that a CPU has a certain amount of high-speed RAM,
at the speed of today's level-3 or level-4 caches.

When it needs data that is not present, it requests that data from
a bigger RAM with slow access times (like today's main memory);
this is then transferred into the fast RAM, and the core
gets an interrupt when it has arrived and can then continue
doing what it needs to do.

Very much like today's systems handle I/O to disks or other
devices. It might also involve some amount of "paging".
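To pin the model down, a hypothetical sketch in C. None of these functions exist in any real system; they are only names for the pieces such a design would need:

#include <stdint.h>

enum { PAGE = 4096 };

/* Hypothetical facilities, declared only so the sketch compiles. */
void *evict_and_pick_fast_slot(void);
void  dram_read_async(uintptr_t addr, void *dst, unsigned len);
void  block_current_task(void);
void  map_fast_slot(uintptr_t vaddr, void *slot);
void  wake_blocked_task(uintptr_t vaddr);

/* Miss in the fast on-chip RAM: start an asynchronous transfer from
   the big, slow RAM and let the core run something else meanwhile,
   just as an OS starts a disk read and blocks the caller. */
void on_fast_ram_miss(uintptr_t vaddr)
{
    void *slot = evict_and_pick_fast_slot();
    dram_read_async(vaddr & ~(uintptr_t)(PAGE - 1), slot, PAGE);
    block_current_task();
}

/* Raised by the "memory device" when the transfer is done, exactly
   like a completion interrupt from a disk controller. */
void dram_transfer_complete_irq(uintptr_t vaddr, void *slot)
{
    map_fast_slot(vaddr, slot);
    wake_blocked_task(vaddr);
}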
Peter Flass
2020-05-03 23:01:50 UTC
Post by Thomas Koenig
Post by Douglas Miller
I'm not certain what you mean by "main memory just as an I/O
device".
I could probably have phrased that better.
What I mean is that a CPU has a certain amount of high-speed RAM,
at the speed of today's level-3 or level-4 caches.
When it needs data that is not present, it requests that data from
a bigger RAM with slow access times (like today's main memory);
this is then transferred into the fast RAM, and the core
gets an interrupt when it has arrived and can then continue
doing what it needs to do.
Very much like today's systems handle I/O to disks or other
devices. It might also involve some amount of "paging".
I’m not sure it’s a good idea. First, the OS, and even the kernel,
contain some seldom-used bits of code, error recovery and such, so
putting those in fast memory doesn’t help. Second, the highly-used
bits of the kernel may be in the cache anyway; I know Linux is
designed to optimize the placement of pieces of code. Third, as
already mentioned, some numerical programs may contain heavily used
inner loops that might better be in cache than some parts of the
kernel. The present system is supposed to keep heavily used code
and data in cache anyway, and might well work best.
--
Pete
Scott Lurndal
2020-05-04 14:35:56 UTC
Post by Thomas Koenig
Post by Douglas Miller
I'm not certain what you mean by "main memory just as an I/O
device".
I could probably have phrased that better.
What I mean is that a CPU has a certain amount of high-speed RAM,
at the speed of today's level-3 or level-4 caches.
When it needs data that is not present, it requests that data from
a bigger RAM with slow access times (like today's main memory);
this is then transferred into the fast RAM, and the core
gets an interrupt when it has arrived and can then continue
doing what it needs to do.
But that's exactly what happens. If you miss in L1, you try
to fill from L2. If you miss in L2, you try to fill from L3
and if you miss in L3, you fill from DRAM, and if you miss
in DRAM, you fill from disk.

The difference from paging is that the caches work in units
of cache line size (32 to 128 bytes depending on processor).
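You can see that hierarchy directly with a dependent-load ("pointer
chasing") microbenchmark: each load depends on the previous one, so
the time per step is the real latency of whichever level the working
set currently fits in. A rough sketch (the timing harness is
deliberately crude):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Walk a single random cycle through a buffer of n pointers.
   Sattolo's shuffle guarantees one cycle, so all n slots are visited. */
static double ns_per_load(size_t n, size_t steps)
{
    size_t *next = malloc(n * sizeof *next);
    if (!next) return 0.0;
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {   /* Sattolo: j in [0, i) */
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
    struct timespec t0, t1;
    volatile size_t p = 0;                 /* volatile keeps the loop */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < steps; s++) p = next[p];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free(next);
    return ((t1.tv_sec - t0.tv_sec) * 1e9
            + (t1.tv_nsec - t0.tv_nsec)) / (double)steps;
}

int main(void)
{
    /* Working sets from 32 KB to 128 MB: expect latency plateaus at
       the L1, L2, L3 and DRAM boundaries of the machine at hand. */
    for (size_t n = 1u << 12; n <= 1u << 24; n <<= 2)
        printf("%9zu slots (%6zu KB): %6.1f ns/load\n",
               n, n * sizeof(size_t) / 1024, ns_per_load(n, 10000000));
    return 0;
}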
Theo Markettos
2020-05-06 20:47:24 UTC
Post by Scott Lurndal
But that's exactly what happens. If you miss in L1, you try
to fill from L2. If you miss in L2, you try to fill from L3
and if you miss in L3, you fill from DRAM, and if you miss
in DRAM, you fill from disk.
The L4 referred to by the OP is eDRAM anyway, so your cache is just
somewhat faster DRAM than the DRAM in the rest of the system.

Some Intel laptop CPUs today have 128MB of eDRAM, mainly intended for
the GPU but usable as general-purpose L4 cache. If you can fit your
workload in 128MB then you can run out of eDRAM - but it only cuts
your access time roughly in half compared with regular DRAM, so it
gains you something, but only a factor of about 2.

Theo
a***@math.uni.wroc.pl
2020-05-28 15:19:57 UTC
Post by Thomas Koenig
Post by Douglas Miller
I'm not certain what you mean by "main memory just as an I/O
device".
I could probably have phrased that better.
What I mean is that a CPU has a certain amount of high-speed RAM,
at the speed of today's level-3 or level-4 caches.
When it needs data that is not present, it requests that data from
a bigger RAM with slow access times (like today's main memory);
this is then transferred into the fast RAM, and the core
gets an interrupt when it has arrived and can then continue
doing what it needs to do.
Very much like today's systems handle I/O to disks or other
devices. It might also involve some amount of "paging".
Once you have paging you need to take into account the cost
of handling page faults. IIRC Lynn Wheeler wrote that
after extensive optimization in VM/370 he got page fault
handling down to a few hundred instructions. He also wrote
that other systems may need thousands of instructions.
With current DRAM access times, thousands of instructions
would add quite high overhead. A few hundred instructions
may be acceptable, but that is still costly. OTOH, putting
the equivalent of a few hundred instructions into hardware is
an acceptable increase in hardware complexity. So it makes
sense to put "page" handling in hardware. Since DRAM
transfer rate and latency are not so much worse than those
of the cache (the "memory" in this scheme), one is led to
simplified "page" handling. In particular, it makes sense
to decouple page protection and remapping (virtual memory)
from "page" replacement. Old analyses indicated that
for replacement a rather small "page" size gives better
results; proposed page sizes were in the range of 32 bytes
to 2 kilobytes. With a main memory of about 32 kilobytes
(on the order of a modern L1 cache), the optimal page sizes
come out pretty close to the cache line size of a modern
system. At deeper levels a bigger size may give slightly
better results, but using the same size at several levels
has advantages, and the gain from varying "page" size
between levels is probably too small to compensate.
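To put rough numbers on the handler cost (ballpark assumptions, not
measurements): with DRAM latency around 100 ns and a core retiring
on the order of 4 instructions per ns, a handler of 2000 instructions
costs about 500 ns, several times the miss it is servicing, while one
of 300 instructions costs under 100 ns, comparable to the miss itself.
A hardware state machine doing the equivalent work can overlap almost
all of it with the transfer, which is why putting it in hardware pays.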

Another issue is automatic versus manual control. In PC-class
machines, prefetch instructions give some amount of manual
control (see the sketch below). OTOH there are several stories
indicating that at large scale automatic systems can do better.
Early OS/360 depended on overlays, and swapping overlays
led to significant efficiency loss. Later, bigger memories
reduced the need for overlays. But it seems that paging, and
not bigger memories, was the main factor in almost eliminating
overlays. IIUC, in practice paging was not only easier
for programmers but also more efficient than overlays.
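The manual end of that spectrum looks like this in C on PC-class
machines. This uses the GCC/Clang __builtin_prefetch intrinsic; the
prefetch distance of 16 elements is only a guess that would need
tuning per machine:

#include <stddef.h>

/* Sum an array while hinting future cache lines into cache.  The
   second argument (0) means "for read", the third (3) means "high
   temporal locality".  Too small a distance hides no latency; too
   large and the line may be evicted again before it is used. */
double sum_with_prefetch(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 3);
        s += a[i];
    }
    return s;
}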

So, at the conceptual level, when doing performance analysis it
makes sense to think of main memory as an I/O device. OTOH, when
programming, it makes sense to hide the nonuniform structure
of memory and delegate the needed support to hardware.
--
Waldek Hebisch
t***@gmail.com
2020-05-06 14:02:43 UTC
Post by Thomas Koenig
[F'up to comp.arch]
The first mainframe I worked on had 32 MB of main memory, two
processors, ran MVS and could handle 100 concurrent users under TSO.
Looking at the z15 processor, it has 960 MB of level 4 cache and 12
cores, and even the level 3 cache is 256 MB for 12 cores - more
memory per core than the old mainframe.
Given the (lack of) speed of today's main memory, I wonder if it
would be better to run the OS and application software in cache
and use the main memory just as an I/O device or backup storage.
Congratulations, you just reinvented the IBM Cell chip.