The indi32 Architecture Purpose and Instruction Set

The indi32 has a 32 bit word, and so has an extra 16 bits for operand control every two instructions. The features considered for the extra bits were increasing the throughput of stack operations, providing virtual and/or cache memory features, and things relating to protected mode style operation. The extra bits occupy the high half word of any instruction word.

In the end it was decided that the 32 bit instruction set extension will use 2 of the extra 16 bits as a code and interrupt execution entry guard. This will prevent modification of p via the =p mode destination except when the guard bits are flagged correctly. This can be implemented efficiently and gives a significant improvement in code stability. Preserving 16 bit modulo pre and post indexing and carry execution semantics on the instructions will consume 1 bit * 2, and the last 6 bits * 2 of the instruction word provide fixed and ASIC function unit interfaces of 512 bits each (256 in and 256 out), along with a cache mode interface to fast RAM. This leads to a significant saving in MMU area while providing many of the benefits, especially for running 16 bit applications within a 32 bit protected environment.

The ADLY is a two bit accumulator delay select controlling whether an earlier accumulator (A) value should be output. A0 works as normal (the accumulator getting the current output), A1 is a single instruction delayed copy (the accumulator contents), and A2 and A3 are further delayed. This makes SWAP and ROT operations faster, amongst others. W16 controls writeback inhibit on the index registers, memory and accumulator high words. This limits code to being 16 bit compatible, as the carry is also set at the 16th bit if W16 is set. Default zeros give 32 bit execution.
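The delay select can be pictured with a small software model. The following is only a sketch under assumptions: the class and method names are invented for illustration, the history is taken to hold exactly three earlier accumulator values (for A1 to A3), and W16 is modelled as simply keeping the old high half word of the accumulator. Carry behaviour is not modelled.

```python
from collections import deque

MASK32 = 0xFFFFFFFF
MASK16 = 0x0000FFFF


class DelayedAccumulator:
    """Toy model of ADLY and W16 (illustrative names, assumed semantics)."""

    def __init__(self):
        # history[0] is the current accumulator contents, history[1] the
        # value it held one instruction earlier, and so on.
        self.history = deque([0, 0, 0], maxlen=3)

    def execute(self, result, adly=0, w16=False):
        """Commit `result` to the accumulator and return the ADLY output."""
        result &= MASK32
        if w16:
            # Assumed: writeback of the high half word is inhibited.
            result = (self.history[0] & ~MASK16 & MASK32) | (result & MASK16)
        # A0 is the new result, A1..A3 are progressively older values.
        taps = [result, *self.history]
        self.history.appendleft(result)  # the oldest value drops off the end
        return taps[adly]
```

With this model a SWAP of the top two values needs no temporary: the older value is still available on the A1 tap when the newer one is written.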

INX and OUTX modify the way a register is used to access memory and also provide access to the fixed and ASIC function interfaces. All OUTX and ADLY bit combinations for the case where the output mode is the special null .p mode are described later.

The extra 16 bits are placed in the high word of the instruction double word, with the MSB of the high word relating to the MSB instruction of the low word. The complementary arrangement holds for the LSB instruction. This makes it easier to control access to the protection of the high word in 16 bit mode, but does allow 32 bit operations within a block of protected code.

Bits 31 and 23 (POK bits) make a guard combination. Bit 31 when set allows setting of the p register to point to this instruction pair, or when clear inhibits the save to the p register and continues with program flow. The purpose of bit 23 is to indicate a cache volatile instruction.
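A rough sketch of how the guard bits might be picked out of an instruction word follows. It assumes that each of the two instructions in the low half word is 8 bits wide and that each has its own 8 bit control byte in the high half word, with bits 31 and 23 being the top bit of each control byte. The function and field names are invented here, and the ordering of the remaining 7 control bits per byte is deliberately left undecoded.

```python
def split_instruction_word(word):
    """Split a 32 bit word into guard bits, control bits and instructions.

    Field names are illustrative; the 8 bit instruction width is assumed.
    """
    word &= 0xFFFFFFFF
    return {
        "p_entry_ok": bool((word >> 31) & 1),      # bit 31: p may point here
        "cache_volatile": bool((word >> 23) & 1),  # bit 23: cache volatile
        "control_msb": (word >> 24) & 0x7F,        # other bits, MSB instruction
        "control_lsb": (word >> 16) & 0x7F,        # other bits, LSB instruction
        "instr_msb": (word >> 8) & 0xFF,
        "instr_lsb": word & 0xFF,
    }


def guarded_set_p(p, target, target_word):
    """Return the new p for a =p write aimed at `target`.

    If bit 31 of the word at the target is clear, the save is inhibited and
    p is returned unchanged, so normal program flow continues.
    """
    if split_instruction_word(target_word)["p_entry_ok"]:
        return target
    return p
```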

The bus design is based around, at its simplest, 32 bit memory ports. They could be combined into one, but this would limit memory bandwidth and ASIC IO addressing options. A writeback cache common to all ports reduces memory traffic. There could be fetch reordering to postpone writes in the SDRAM (what else) driver. Each port can read one machine word at a time.

The INX and OUTX modifiers (XBits) of an opcode select which memory portage is used in indirect fetches, making internal addressing 34 bit. The direct modes access the on chip fixed and ASIC register files. Of the indirect portages, 00 is defined as direct primary memory. 01 is defined as some kind of indirect index MMU transform or separate memory. 10 is defined as pointer (address relocation memory, a.k.a. MMU keys) or separate memory. Finally, 11 is ASIC defined or separate memory.
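As a minimal sketch of the 34 bit addressing, the two XBits can be treated as the top two bits above a 32 bit register value. That placement is an assumption made for illustration, as are the names below.

```python
PORTAGES = {
    0b00: "direct primary memory",
    0b01: "indirect index MMU transform or separate memory",
    0b10: "pointer (MMU keys) or separate memory",
    0b11: "ASIC defined or separate memory",
}


def internal_address(xbits, address):
    """Combine 2 XBits and a 32 bit address into a 34 bit internal address."""
    if xbits not in PORTAGES:
        raise ValueError("XBits must be in the range 0..3")
    return (xbits << 32) | (address & 0xFFFFFFFF)
```

For example, internal_address(0b10, 0x1000) selects the pointer portage, and the result fits in 34 bits.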

The fixed and ASIC registers are accessed in 32 bit word register groupings, with the ASIC registers presenting separate read and write ports to the ASIC logic. The ASIC registers are labelled <LSB-Xbit><Register>, e.g. 1Q and 0R. The fixed function registers are labelled using 2 and 3, e.g. 3S and 2P. They have alternate names and functions according to the following scheme.

The fixed function register set includes some useful constant reads, signed arithmetic utility, a JTAG interface to the ASIC section (with 32 bit alternating read and write; non alternation or read before write toggles programming mode) and a FourCC communications bus between cores. Data codecs may be built into the chip which intercept certain FourCC codes, and output to standardized IO FourCC devices. EXWA provides an exchange of the low and high 16 bit half words of the accumulator.

The operation of the multiplier-divider is paused for 1 instruction on a read or write to either of MQHI (Mul Hi, Div Rem) or MQLO (Mul Lo, Div Quot), but not to DRM1 (Divisor Minus One) or MRM1 (Multiplier Minus One), as they are not modified. Operation continues after the pause to produce a result in 16 instruction cycles or 8 instruction words. With $FFFF FFFF in DRM1 and MRM1, MQHI and MQLO will operate as a shift register, shifting 2 places left per instruction cycle, with overflow carries wrapping to the LSB. All arithmetic is performed unsigned. Internally MQHI:MQLO is shifted one place and a sum with DRM1:MRM1 with looped carry is also made (DRM1 is logically inverted). If the looped carry is 1 then MQHI:MQLO is loaded with the sum of itself shifted with NOT(DRM1):MRM1. It is assumed that any vector or SSE type demands will be factored into a FourCC core or within the ASIC section. It is impossible to perform both division and multiplication to full word size at the same time. DRM1 at -1 does no division and MRM1 at 0 does no multiplication.
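Only the shift register special case is simple enough to sketch safely. The following assumes that the wrap to LSB applies across the whole MQHI:MQLO pair, so that each instruction cycle is a 64 bit rotate left by 2 places. The function name is invented, and the general multiply and divide steps are not modelled.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def rotate_mq(mqhi, mqlo, cycles=1):
    """Model MQHI:MQLO with $FFFF FFFF in both DRM1 and MRM1.

    Assumed behaviour: a 64 bit rotate left by 2 places per instruction cycle.
    """
    value = ((mqhi & MASK32) << 32) | (mqlo & MASK32)
    shift = (2 * cycles) % 64
    value = ((value << shift) | (value >> (64 - shift))) & MASK64
    return value >> 32, value & MASK32
```

Under this reading, the full 16 instruction cycles rotate the pair by 32 places, which exchanges MQHI and MQLO.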

This combination of divider and multiplier in one unit is the most minimal I have found, and does present a double word architecture extension where (x:y)+(a:b) = (x-a-1:y+b+1) with cross carries. This is a new mathematical structure I call a ringfield.

The PNUL mode has the following execution semantics. A simple provision for carry flag state control is included. All unused combinations are reserved. The format is <ADLY><OUTX>

Assembly Language

 

Caching Strategies

All writes to memory are performed through a writeback shifting cache of fixed length. Any write to be added to the in end of the FIFO has its address tag checked against all cache elements in parallel, to stop an earlier pending write which would be replaced by the later write just added to the FIFO. A null write does not advance the FIFO, and so no write has to be done from the cache. A second set of parallel address tag comparators is used to recall any pending write for immediate read. Any memory mapped IO needing to be done is done through the cache, and so time must be left between instructions writing to the same location, along with no dependence on any effect of a location being pre-read before being written (as the pre-read may happen).
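The write combining and read recall can be sketched in software. This is a minimal model under assumptions: the class and method names are invented, main memory is a plain dictionary, the parallel tag comparators become simple loops, and a superseded entry is kept in place but marked invalid so that it retires without a memory write.

```python
from collections import deque


class WritebackFifo:
    """Toy fixed length writeback FIFO with tag matching (illustrative)."""

    def __init__(self, length, memory):
        self.length = length
        self.memory = memory      # dict standing in for main memory
        self.entries = deque()    # oldest on the left: [address, data, valid]
        self.memory_writes = 0

    def write(self, address, data):
        # First comparator set: cancel any pending write to the same address.
        for entry in self.entries:
            if entry[0] == address:
                entry[2] = False
        self.entries.append([address, data, True])
        if len(self.entries) > self.length:
            self._retire()

    def read(self, address):
        # Second comparator set: recall a pending write for immediate read.
        for addr, data, valid in reversed(self.entries):
            if valid and addr == address:
                return data
        return self.memory.get(address, 0)

    def flush(self):
        while self.entries:
            self._retire()

    def _retire(self):
        address, data, valid = self.entries.popleft()
        if valid:  # cancelled entries never reach memory
            self.memory[address] = data
            self.memory_writes += 1
```

Two writes to the same location in quick succession therefore cost one memory write, which is also why memory mapped IO needs time left between writes to the same location.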

A simple read cache for p register access of code would complement the writeback cache on the other 3 registers, as the read cache would not have to accept write updates. To avoid running stale code when there is new code, a read cached fetched instruction word may have its volatile bit set, and so will always be fetched from main memory. This has the effect of being able to compare fetched versus cached instruction words. A difference results in entering cascade invalidation of code until the next volatile instruction. That is to say that the cache has a load valid mode, as well as a cache and check volatile mode. The cache is N-way set associative and may include a ROM set.

A multitasking system which yields on a cache read wait state seems like a good performance option, and is easier to implement due to the low number of registers to be switched. A total read stall would be unlikely, but would involve continual retrying of all stalled read fetches from the caches.

A small fully associative fetch cache could be filled from any bus, and may be multiported. It may be combined with the writeback cache by allowing an apparent writeback entry, read from memory, when a p null mode write operation or stall does not occupy the write port (using a read mode). This would have a slight inefficiency of writing back unmodified read data, so a single modified flag per entry could prevent this.

A little further thought along the lines of a circular buffer, instead of arrays of serial shift registers, allows an apparent write deletion by unflagging dirty, and an insertion at the head of the list on a write combine. This leads to a slightly lower cache efficiency per cache line, but maybe to a more efficient dual ported RAM density, or maybe not. Dual ports may be optimized to a read modify write and flag clean or dirty. DRAM cells with the writeback high refresh rate may be possible. But as the cache macro cell of the cache line may be smaller, actual storage cache density may increase.

It may be possible to have the processor built into DRAM chips. It would involve using a number of bank split static columns functioning as cache. The cache would be bus multiplexed to many processors. When a write to the writeback cache occurs, it dirties the column bank section, and this dirtied-by information can be flushed in and out of the main DRAM array on a slightly wider column. The dirty information works as a write lock for any part column, and also as a read lock. When all writeback cache relating to this dirty page has been flushed from a processor, the dirty flags can be cleared. Any writeback cache updates from any processor other than the dirty lock holder have to be ignored (code like this is lock arbitration in an OS).

The small P read lazy snoop cache can be integrated with the writeback cache. A locked column part can still be flushed to the main DRAM array, so a separate dirty flag is needed in order to avoid writing back a column which has not been changed. Any changed columns can be flushed when the main DRAM to column bus is not occupied, and no precharge needs to happen as they are writes.

Along with the dirtied-by lock flags, information could also be stored on page access counts. This allows a full LRU replacement algorithm. If the access counts are stored in the main DRAM array, then further improvement is possible. If, say, an 8 bit access count is used, and all active columned pages have their counts divided by two every 100 accesses, then they would overflow infrequently, and provide a bias towards keeping just loaded pages in the column cache.
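The access count aging can be sketched as follows. The names are invented for illustration, the counts are assumed to saturate at 255 rather than wrap on the rare overflow, and the replacement victim is taken to be the page with the lowest count.

```python
class AgedAccessCounts:
    """Toy model of 8 bit page access counts halved every 100 accesses."""

    HALVE_EVERY = 100
    MAX_COUNT = 0xFF  # assumed: saturate instead of wrapping on overflow

    def __init__(self):
        self.counts = {}
        self.accesses = 0

    def touch(self, page):
        """Record one access to `page`, ageing all counts when due."""
        self.counts[page] = min(self.counts.get(page, 0) + 1, self.MAX_COUNT)
        self.accesses += 1
        if self.accesses % self.HALVE_EVERY == 0:
            for key in self.counts:
                self.counts[key] //= 2

    def victim(self):
        """Return the page with the lowest count as replacement candidate."""
        return min(self.counts, key=self.counts.get)
```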

An external DDR bus interface could compete for the memory just like a processor, and a burst read of 8 words would be provided from the sequential big-endian nibbles of 8 chips. This implies a memory transform from a 32 bit word to nibbles within eight words as the memory is programmed up with the code to be executed.
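One possible reading of this transform is a transpose of an 8 by 8 grid of nibbles, where each of the 8 chips supplies one nibble of every word in the burst. The sketch below follows that reading only. The function names and the choice of most significant nibble first are assumptions.

```python
def to_nibbles(word):
    """Split a 32 bit word into 8 nibbles, most significant first."""
    return [(word >> (28 - 4 * i)) & 0xF for i in range(8)]


def from_nibbles(nibbles):
    """Pack 8 nibbles, most significant first, into a 32 bit word."""
    word = 0
    for nibble in nibbles:
        word = (word << 4) | (nibble & 0xF)
    return word


def transpose_burst(words):
    """Transpose the 8x8 nibble grid formed by a burst of 8 words.

    Output word c collects nibble c from each of the 8 input words.
    """
    if len(words) != 8:
        raise ValueError("a burst is assumed to be exactly 8 words")
    rows = [to_nibbles(w) for w in words]
    return [from_nibbles([rows[r][c] for r in range(8)]) for c in range(8)]
```

Because a transpose is its own inverse, the same routine would serve both when programming the memory and when reading it back over the external bus.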

Default indi32 IO Modules