Post by Robin Vowels
Post by J. Clarke
Post by John Levine
Post by J. Clarke
Post by Thomas Koenig
RISC vs. CISC: The really complex CISC-architectures died out.
What do you consider to be a "really complex CISC-architecture"?
The usual example is VAX.
I'd say IBM zSeries is pretty CISC but it has a unique niche.
You might want to compare those to Intel.
The instruction set reference for the VAX is a single chapter with 141
pages. The instruction set reference for Intel is three volumes with
more than 500 pages each.
Such a comparison completely misses the point. An important
design point for RISC was that instructions should be
implementable to execute in one cycle at a high clock
frequency.
.
The notion that RISC machines were a better method than
CISC machines was largely misguided.
RISC was more suited to simple microprocessors, with limited
instruction sets.
A CISC instruction such as a memory move, or a translate
instruction, did a lot of work. A RISC therefore needed a
clock rate about ten times faster than a CISC to achieve
the same speed.
What you write is extremely misleading. RISC design was
based on observing actual running programs and taking
statistics of instruction use. RISC got rid of infrequently
used complex instructions, but that does not mean that a
single RISC instruction does only a little work. For
example, autoincrement was a frequent feature. In a typical
program that marches through an array the RISC step would be
two instructions:
load with autoincrement
computing op
On i386 one could do this in a similar way or use two
different instructions:
compute with register indirect argument
increment address register
On the early 360 only the second possibility was available
(of course, each machine could also use longer sequences,
but I am interested in the most efficient one).
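To make this concrete, here is a minimal C sketch of such a
loop; the mnemonics in the comments are illustrative of the
two sequences above, not actual compiler output:

    /* March through an array, summing elements. */
    long sum_array(const long *p, long n)
    {
        long sum = 0;
        while (n-- > 0)
            sum += *p++;  /* ARM:  ldr r2, [r0], #4  load with
                                   add r3, r3, r2    autoincrement, op
                             i386: add eax, [esi]    reg-indirect op
                                   add esi, 4        bump address */
        return sum;
    }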
SPARC had register windows, and consequently procedure
entry and return did a lot of work in a single instruction.
Unlike STM on 360, the register window operation was done in
one clock. Later it turned out that procedure entry
and return, while frequent, are not frequent enough to
justify the cost of the hardware. Additionally, with better
compilers a RISC machine without register windows could do
calls only marginally slower than a machine with register
windows, so the gain was small and register windows went out
of fashion.
But they nicely illustrate that a single RISC instruction
could do a lot of work. The real question was which
instructions were important enough to allocate the hardware
resources needed to do the work, and which were
unimportant and offered a possibility of savings. Also,
part of the RISC philosophy was that multicycle instructions
can frequently be split into a sequence of single-cycle
ones. So while RISC may need more instructions for
given work, the number of cycles was usually smaller than
for CISC. This is very visible comparing i386 and
RISC of comparable complexity: i386 instructions were all
multi-cycle, frequently needing more than 3 cycles, while a
RISC could do most (or all) instructions in a single cycle.
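To illustrate the cycle arithmetic with made-up numbers: a
task taking 5 CISC instructions at an average of 4 cycles
each costs 20 cycles; the same task done in 12 single-cycle
RISC instructions costs 12 cycles, despite the higher
instruction count.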
Post by Robin Vowels
In other words, a programmer writing code for a RISC was
effectively writing microcode.
To be useful to an assembler programmer, a computer instruction
needed to do more work rather than less.
Do you have any experience writing RISC assembler? I have
worked on a compiler backend for ARM and have written a few
thousand lines of ARM assembly. Several ARM routines
had _fewer_ instructions than the routine performing the
equivalent function on i386. On average ARM seems to require
slightly more instructions than i386, but probably on the
order of a few percent more. Compiled code for ARM is longer
by about 30%, but the main reason is not the number of
instructions. Rather, ARM instructions (more precisely,
ARM 5) are all 32 bits.
i386 instructions on average tend to be shorter than
4 bytes, so this is one reason for shorter code. The other
is constants: one can put a 32-bit constant directly into an
i386 instruction (for instance, mov eax, 0x12345678 encodes
the constant inline), but on ARM only small constants can be
included in the instruction while others need to go into a
literal pool, so ldr r0, =0x12345678 becomes a PC-relative
load from the pool (the same happens on the old 360).
While I did not write anything substantial in assembler
for other RISCs, I saw reasonably large samples of assembler
for MIPS, SPARC and HPPA and I can assure you that none
requires many more instructions than CISC. I also compiled
(using GCC) a program of about 24000 lines of C for
different architectures. The longest executables were s390
and SPARC (IIRC on the order of 240 kB of object code), the
shortest i386 (on the order of 180 kB); HPPA was slightly
larger than i386 (IIRC something like 190 kB).
Post by Robin Vowels
Looking back at first generation computers, we see that
array operations were possible in 1951 on Pilot ACE,
and on DEUCE (1955). These operations included memory
move, array addition, array subtraction, etc.
These minimised the number of instructions needed to do
a given computation and, of course, reduced execution
time.
Such instructions did not seem important to designers
of second generation machines, with widespread use of
transistors.
Array operations did not reappear in computers until
the 1970s.
.
In 1980 that required drastic simplification
of instructions,
.
now one can have more complexity and
still fit in one cycle. CPU designers formulated
several features deemed necessary for a fast 1 IPC
implementation. This set of features became the
religious definition of RISC. The RISC versus CISC
war died out mostly because, due to advances in
manufacturing and CPU design, several of the religious
RISC features became almost irrelevant to CPU
speed.
VAX instructions do complex things, in particular
multiple memory references with interesting
addressing modes (a single ADDL3 with autoincrement
operands, for instance, makes three memory references).
That was impossible to implement in one cycle using
technology from 1990 (and probably still is impossible).
360 and 386 and their descendants
.
I disagree.
Most of the S/360 character instructions that move/compare/
translate/search character strings are a long way from RISC.
Sure, S/360 has a lot of complex instructions. But most of
them are either system instructions or can be replaced by
sequences of simpler S/360 operations. In a machine like the
VAX almost all is complex instructions; if you removed them
the machine would probably be useless. On S/360, if you want
fast code there is a good chance that your program uses
mainly simple instructions (they are the fast ones).
Post by Robin Vowels
Floating-point instructions also are far from RISC, especially
multiplication and division.
Huh? Every RISC that I used had an FPU.
Post by Robin Vowels
Even addition and subtraction can
require multiple steps for post-normalization.
Maybe you did not realize that RISC machines are pipelined?
FPU addition usually needs 2-3 pipeline stages; multiplication
may need between 2 and 5 (depending on the actual machine).
On a pipelined machine you may issue a new operation every
cycle, but you need to wait between providing the arguments
and using the result. That is, after issuing an FPU
instruction there must be some other instructions (possibly
other FPU instructions, possibly NOPs if you have no useful
work) before you may use the result. HPPA 712 had an FPU
multiply-and-add instruction and could execute loads in the
same cycle as an FPU operation. In effect a 60 MHz HPPA could
do 120 Mflops, since a multiply-add counts as two floating
point operations per cycle (usually it was less, but I saw
some small but real code running at that speed).
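A minimal C sketch of scheduling around FPU latency (the
two-accumulator trick; the function name and the assumed 2-3
cycle add latency are mine, for illustration):

    /* Sum an array with two independent accumulators, so
       consecutive FP adds need not wait for each other's
       results in the pipeline. */
    double sum2(const double *a, int n)
    {
        double s0 = 0.0, s1 = 0.0;
        int i;
        for (i = 0; i + 1 < n; i += 2) {
            s0 += a[i];      /* independent of the next add... */
            s1 += a[i + 1];  /* ...so the FPU pipeline stays busy */
        }
        if (i < n)
            s0 += a[i];
        return s0 + s1;
    }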
Very early RISCs had the FPU as a coprocessor that did FPU
work while the main CPU simultaneously executed integer
instructions. Clearly such a coprocessor was much slower than
later RISCs, but it was not much different from coprocessors
used on comparable CISCs. Of course, in early RISC times big
mainframes and supercomputers had better floating point speed
than RISCs, but the big machines had much more hardware and
cost hundreds if not thousands of times more than a RISC.
Post by Robin Vowels
(One of the few computational instructions that could have
been implemented in a RISC was Halve floating-point.)
And then there are the decimal instructions. Even addition
and subtraction require multiple steps (not to mention
multiplication and division). All these are CISC
instructions.
Packed decimal instructions on 360 do not fly either; they
are multicycle instructions. On machines where timings
are published it is clear that they are done by a microcode
loop. For example, on the 360-85 decimal addition costs
slightly more per byte than a 32-bit addition. With hardware
for the decimal step a RISC subroutine could do them at
comparable speed. Even on a RISC without decimal hardware a
decimal subroutine can run at reasonable speed. Again, on the
360-85 a single step of TR takes time equal to 3 ADDs, plus
substantial setup time. A RISC subroutine can do that at
comparable speed.
The same applies to string instructions: on RISC you
need a subroutine, but the subroutine can be quite fast.
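As a sketch of what such a decimal subroutine looks like (my
own illustration: unsigned packed BCD only, sign and overflow
handling omitted; the byte-per-step structure mirrors what
the microcode loop does):

    #include <stddef.h>

    /* Add two packed-BCD numbers (two decimal digits per
       byte, most significant byte first), one byte per
       loop step. */
    void bcd_add(unsigned char *dst, const unsigned char *src,
                 size_t len)
    {
        unsigned carry = 0;
        for (size_t i = len; i-- > 0; ) {  /* low byte first */
            unsigned lo = (dst[i] & 0x0F) + (src[i] & 0x0F) + carry;
            carry = lo > 9;
            if (carry) lo -= 10;
            unsigned hi = (dst[i] >> 4) + (src[i] >> 4) + carry;
            carry = hi > 9;
            if (carry) hi -= 10;
            dst[i] = (unsigned char)((hi << 4) | lo);
        }
    }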
Anyway, I do not consider the decimal instructions as core
instructions. I know that they are widely used in IBM
shops. However, somebody wanting the best speed would go
to binary: binary data is smaller, so when your main
data is on tapes or discs, transfer of binary data is
faster (a 9-digit value needs 5 bytes in packed decimal
but fits in a 4-byte binary word). The only reasons for
decimal are initial entry (which even in 1965 was not the
main computational cost), printing (again, printers were
much slower than CPUs, so conversion cost was not important)
and (the main reason) inertia. Granted, decimal makes a lot
of sense for a cards-only shop, but when RISC arrived
I think that punched cards as main storage were obsolete
and uneconomical (but there was enough inertia that
apparently some institutions used cards quite long).
Post by Robin Vowels
In the integer instruction set, multiplication and division are CISC.
Instructions such as Test and Set are complex, and possibly the
loop control instructions BXLE and BXH.
No. Early RISC skipped multiplication because at that time
one could not fit a fast multiplier on the chip. But
pretty quickly chip technology caught up and RISC chips
included multipliers. Similarly for other instructions; the
main point is whether an instruction is useful and can have
a fast implementation. There is nothing un-RISC-y in a loop
control instruction that simultaneously jumps and updates a
register.
Some RISCs have instructions of this sort. RISC
normally avoids instructions needing multiple memory
accesses, as most such instructions can be replaced
by sequences of simpler instructions. But pragmatic
RISC recognizes that atomics have to be done as one
instruction. You may call them CISC, but as long as
you can do them without microcode (just using hardwired
control) and you do not spoil the pipeline structure
they are OK. Similarly with division: it is CISC-y,
but if the divider does not blow up your transistor budget
and the rest of the chip stays RISC, then it is OK. Around
15 years ago chip technology advanced enough that
high end RISCs included dividers. Currently, a tiny
RISC (Cortex-M0) has a multiplier but no divider;
bigger (but still relatively small) chips have dividers.
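Where there is no divide instruction the compiler calls a
library routine instead; a minimal sketch (assuming 32-bit
unsigned) of the shift-and-subtract loop such routines
typically use, one quotient bit per step:

    /* Restoring division without a hardware divider;
       d must be nonzero. */
    unsigned udiv32(unsigned n, unsigned d, unsigned *rem)
    {
        unsigned q = 0, r = 0;
        for (int i = 31; i >= 0; i--) {
            r = (r << 1) | ((n >> i) & 1);  /* next dividend bit */
            if (r >= d) {                   /* subtract if it fits */
                r -= d;
                q |= 1u << i;
            }
        }
        *rem = r;
        return q;
    }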
Post by Robin Vowels
there are plenty of complex instructions of dubious utility.
But the core instruction set consists of instructions having
one memory access which, starting from around 1990, can be
implemented in a single cycle. They have complex instruction
encoding which requires extra chip space compared to
religious RISC. But in modern chips instruction decoders are
tiny compared to other parts. Around 1995 AMD and Intel
invented a trick so that the effective speed of instruction
decoders is very high and religious RISC has little if any
advantage over 386 (or 360) there.
To put this in historical context: I have a translation
of a computer architecture book from 1976 by Tanenbaum.
In this book Tanenbaum writes about implementing
very complex high-level style instructions using
microcode (a "Cobol" machine, a "Fortran" machine).
Tanenbaum was very positive about such machines
and advocated that future designs should be of this
sort. The RISC movement went in a completely different
direction, simplifying the instruction set and eliminating
microcode. In a sense, the RISC movement realised
that with moderate extra effort one could
turn the former microcode engines into actually
useful and very fast processors.
.
The problem with RISC design is that one needs many more
instructions to do the same amount of work. Many more
instructions need to be fetched (compared to CISC), tying up
the data bus at the same time that data is being fetched
from / stored to memory.
Most RISCs have a dedicated instruction fetch bus, separate
from the data bus, so there is no problem with tying up the
bus. Note that all RISCs that I used had caches, so the buses
were part of the CPU complex. Since the vast majority of
instructions comes from the cache, instruction fetch has
limited impact on main memory access (traffic between main
memory and cache). There is a disadvantage: longer and more
numerous instructions mean that a RISC needs a bigger cache
or has a lower hit rate for the same cache size. This is an
important factor explaining why i386 won: i386 made better
use of caches than RISC.
Modern ARM in 32-bit mode offers the option of mixed 16-bit
and 32-bit instructions (Thumb-2) -- an example of RISC
dropping one of the features that were claimed to be
essential for RISC (that is, fixed-length instructions).
--
Waldek Hebisch