microwatt

Commit Graph

Author	SHA1	Message	Date
Paul Mackerras	dc6b1df653	execute1: Don't execute ld/st instruction when taking interrupt This fixes a bug in the logic where we would still send a load or store instruction to loadstore1 even though we have decided to take an asynchronous interrupt. Reported-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	56577827d4	Decode attn in the major opcode decode table This decodes attn using entry 0 of the major_decode_rom_array table instead of a special case in the decode1_1 process. This means that only the major opcode (the top 6 bits) is checked at decode time. To make sure the instruction is attn not some random illegal pattern, we now check bits 1-10 of the instruction at execute time and generate an illegal instruction interrupt if those bits are not 0100000000. This reduces LUT consumption by 42 LUTs on the Arty A7-100. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	6f7ef8b1b9	Decode sc in the major opcode decode table This decodes sc using entry 17 of the major_decode_rom_array table instead of a special case in the decode1_1 process. This means that only the major opcode (the top 6 bits) is checked at decode time. To make sure that the instruction is sc not scv, we now check bit 1 of the instruction at execute time and generate an illegal instruction interrupt if it is 0 (indicating scv). The level field of the sc instruction is now ignored. This reduces LUT consumption by 31 LUTs on the Arty A7-100. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	278ac5e0eb	Remove sim_config instruction It's not used any more, and it's not in the ISA. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	f5f17c24fd	execute1: Implement trap instructions properly This implements the trap instructions (tw, twi, td, tdi) using much of the same code as is used for the cmp/cmpl instructions. A 5-bit comparison value is generated, and for cmp/cmpl, the appropriate 3 bits are used to update the destination CR, and for trap instructions, the comparison value is ANDed with the TO field, and an exception is generated if any bit of the result is 1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	381149b2cc	Consolidate trap variants under a single OP_TRAP This replaces OP_TD, OP_TDI, OP_TW and OP_TWI with a single OP_TRAP, distinguishing the cases by the input_reg_b and is_32bit fields of the decode ROM. This adds the twi and td cases to the decode tables. For now we make all of the trap instructions unconditionally generate a trap-type program interrupt if the TO field of the instruction is all ones, and do nothing otherwise. This reduces the number of values in insn_type_t from 65 to 62, meaning that an insn_type_t can now be encoded in 6 bits rather than 7. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	d77033aa92	execute1: Simplify the interrupt logic a little This makes some simplifications to the interrupt logic which will help with later commits. - When irq_valid is set, don't set exception to 1 until we have a valid instruction. That means we can remove the if e_in.valid = '1' test from the exception = '1' block. - Don't assert stall_out on the first cycle of delivering an interrupt. If we do get another instruction in the next cycle, nothing will happen because we have ctrl.irq_state set and we will just continue writing the interrupt registers. - Make sure we deliver as many completions as we got instructions, otherwise the outstanding instruction count in control.vhdl gets out of sync. - In writeback, make sure all of the other write enables are ignored when e_in.exc_write_enable is set. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	fe077a116a	Rename OP_MCRF to OP_CROP and trim insn_type_t OP_MCRF covers the CR logical ops as well as mcrf since commit `c05441bf47` ("Implement CRNOR and friends"), so this renames OP_MCRF to OP_CROP. The OP_* values for the individual CR logical ops (OP_CRAND, etc.) are not used, so remove them from insn_type_t. No functional change. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	fb8f3da128	Give exceptions a separate path to writeback This adds separate fields in Execute1ToWritebackType for use in writing SRR0/1 (and in future other SPRs) on an interrupt. With this, we make timing once again on the Arty A7-100 -- previously we were missing by 0.2ns, presumably due to the result mux being wider than before. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Michael Neuling	5ef5604b65	Add sc, illegal and decrementer exceptions and some supervisor state This adds the following exceptions: - 0x700 program check (for illegal instructions) - 0x900 decrementer - 0xc00 system call This also adds some supervisor state: - decrementer - msr (SPRG0/1 and SRR0/1 already exist as fast SPRs) It also adds some supporting instructions: - rfid - mtmsrd - mfmsr - sc MSR state is added but only EE is used in this patch set. Other bits are read/written but are not used at all. This adds a 2 stage state machine to execute1.vhdl. This state machine allows fast SPRs SRR0/1 to be written in different cycles. This state machine can be extended later to add DAR and DSISR SPR writing for more complex exceptions like page faults. Signed-off-by: Michael Neuling <mikey@neuling.org>	4 years ago
Michael Neuling	594a19de37	Plumb attn instruction through to execute1 Currently we decode attn but we just mark it as an illegal. This adds a separate case statement in execute1 for attn to terminate the core. Illegals also do this currently but we are soon implementing a 0x700 exception for them. Signed-off-by: Michael Neuling <mikey@neuling.org>	4 years ago
Paul Mackerras	81369187c0	loadstore1: Add support for cache-inhibited load and store instructions This adds support for lbzcix, lhzcix, lwzcix, ldcix, stbcix, sthcix, stwcix and stdcix. The temporary hack where accesses to addresses of the form 0xc??????? are made non-cacheable is left in for now to avoid making existing programs non-functional. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	b349cc891a	loadstore1: Move logic from dcache to loadstore1 So that the dcache could in future be used by an MMU, this moves logic to do with data formatting, rA updates for update-form instructions, and handling of unaligned loads and stores out of dcache and into loadstore1. For now, dcache connects only to loadstore1, and loadstore1 now has the connection to writeback. Dcache generates a stall signal to loadstore1 which indicates that the request presented in the current cycle was not accepted and should be presented again. However, loadstore1 doesn't currently use it because we know that we can never hit the circumstances where it might be set. For unaligned transfers, loadstore1 generates two requests to dcache back-to-back, and then waits to see two acks back from dcache (cycles where d_in.valid is true). Loadstore1 now has a FSM for tracking how many acks we are expecting from dcache and for doing the rA update cycles when necessary. Handling for reservations and conditional stores is still in dcache. Loadstore1 now generates its own stall signal back to decode2, so we no longer need the logic in execute1 that generated the stall for the first two cycles. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	5d85ede97d	dcache: Implement load-reserve and store-conditional instructions This involves plumbing the (existing) 'reserve' and 'rc' bits in the decode tables down to dcache, and 'rc' and 'store_done' bits from dcache to writeback. It turns out that we had 'RC' set in the 'rc' column for several ordinary stores and for the attn instruction. This corrects them to 'NONE', and sets the 'rc' column to 'ONE' for the conditional stores. In writeback we now have logic to set CR0 when the input from dcache has rc = 1. In dcache we have the reservation itself, which has a valid bit and the address down to cache line granularity. We don't currently store the reservation length. For a store conditional which fails, we set a 'cancel_store' signal which inhibits the write to the cache and prevents the state machine from starting a bus cycle or going to the STORE_WAIT_ACK state. Instead we set r1.stcx_fail which causes the instruction to complete in the next cycle with rc=1 and store_done=0. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	1a244d3470	Remove single-issue constraint for most loads and stores This removes the constraint that loads and stores are single-issue, at the expense of a stall of at least 2 cycles for every load and store. To do this, we plumb the existing stall signal that was generated in dcache to core, where it gets ORed with the stall signal from execute1. Execute1 generates a stall signal for the first two cycles of each load and store, and dcache generates the stall signal in the 3rd and subsequent cycles if it needs to. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	441160d865	execute1: Use truth table embedded in instruction for CR logical ops It turns out that CR logical instructions have the truth table of the operation embedded in the instruction word. This means that we can collect the two input operand bits into a 2-bit value and use that as the index to select the appropriate bit from the instruction word. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	e08ca4ab8e	countzero: Add a register to help make timing This adds a register in the middle of the countzero computation, so that we now have two cycles to count leading or trailing zeroes instead of just one. Execute1 now outputs a one-cycle stall signal when it encounters a cntlz* or cnttz* instruction. With this, the countzero path no longer fails timing on the Artix-7 at 100MHz. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	5422007f83	Plumb loadstore1 input from execute1 not decode2 This allows us to use the bypass at the input of execute1 for the address and data operands for loadstore1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	b14d982011	execute: Implement bypass from output of execute1 to input This enables back-to-back execution of integer instructions where the first instruction writes a GPR and the second reads the same GPR. This is done with a set of multiplexers at the start of execute1 which enable any of the three input operands to be taken from the output of execute1 (i.e. r.e.write_data) rather than the input from decode2 (i.e. e_in.read_data[123]). This also requires changes to the hazard detection and handling. Decode2 generates a signal indicating that the GPR being written is available for bypass, which is true for instructions that are executed in execute1 (rather than loadstore1/dcache). The gpr_hazard module stores this "bypassable" bit, and if the same GPR needs to be read by a subsequent instruction, it outputs a "use_bypass" signal rather than generating a stall. The use_bypass signal is then latched at the output of decode2 and passed down to execute1 to control the input multiplexer. At the moment there is no bypass on the inputs to loadstore1, but that is OK because all load and store instructions are marked as single-issue. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	0c714f1be6	execute: Move popcnt and prty instructions into the logical unit This implements logic in the logical entity to calculate the results of the popcnt* and prty* instructions. We now have one insn_type_t value for the 3 popcnt variants and one for the two prty variants, using the length field of the decode_rom_t to distinguish between them. The implementations in logical.vhdl use recursive algorithms rather than the simple functions in ppc_fx_insns.vhdl. This gives a saving of about 140 slice LUTs on the A7-100 and improves timing slightly. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	d2ca625b3b	execute: Do comparisons using the main adder This handles OP_CMP like a subtraction; the main adder computes ~RA + RB + 1, and the condition codes are computed from the results. A direct comparison of the two input operands is used to calculate the EQ bit of the condition result. The LT and GT bits are computed from the MSB of the subtraction result, the carry out from the subtraction, and the MSBs of the operands. For a 32-bit comparison, the 32-bit carry and bit 31 of the result and input operands are used; for a 64-bit comparison, the 64-bit carry and bit 63 of the operands and result are used. It turns out to be more convenient to use the 'signed' field of the decode table to distinguish signed from unsigned comparisons, rather than the insn_type. Therefore this uses OP_CMP for both cmp and cmpl, which also has the benefit of reducing the number of values in insn_type_t. Doing this saves over 200 slice LUTs on the Arty A7-100 and improves timing slightly as well. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	d956846667	execute1: Move EXTS* instruction back into execute1 This moves the sign extension done by the extsb, extsh and extsw instructions back into execute1. This means that we no longer need any data formatting in writeback for results coming from execute1, so this modifies writeback so the data formatter inputs come directly from the loadstore unit output. The condition code updates for RC=1 form instructions are now done on the value from execute1 rather than the output of the data formatter, which should help timing. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	c9a2076dd3	execute1: Remember dest GPR, RC, OE, XER for slow operations For multiply and divide operations, execute1 now records the destination GPR number, RC and OE from the instruction, and the XER value. This means that the multiply and divide units don't need to record those values and then send them back to execute1. This makes the interface to those units a bit simpler. They simply report an overflow signal along with the result value, and execute1 takes care of updating XER if necessary. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	39d18d2738	Make divider hang off the side of execute1 With this, the divider is a unit that execute1 sends operands to and which sends its results back to execute1, which then sends them to writeback. Execute1 now sends a stall signal when it gets a divide or modulus instruction until it gets a valid signal back from the divider. Divide and modulus instructions are no longer marked as single-issue. The data formatting step that used to be done in decode2 for div and mod instructions is now done in execute1. We also do the absolute value operation in that same cycle instead of taking an extra cycle inside the divider for signed operations with a negative operand. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	2167186b5f	Make multiplier hang off the side of execute1 With this, the multiplier isn't a separate pipe that decode2 issues instructions to, but rather is a unit that execute1 sends operands to and which sends the result back to execute1, which then sends it to writeback. Execute1 now sends a stall signal when it gets a multiply instruction until it gets a valid signal back from the multiplier. This all means that we no longer need to mark the multiply instructions as single-issue. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Anton Blanchard	ad3db18dce	Fix a ghdlsynth inferred latch error in execute It should never happen in practice, but ghdlsynth is complaining about an inferred latch here. Fix it. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	4 years ago
Anton Blanchard	cc8a9e7893	Upper 32 bits of XER should read as 0s From the architecture: bits 0:31 and 35:43 are treated as reserved and return 0s when read using mfxer Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	4 years ago
Tom Vijlbrief	c05441bf47	Implement CRNOR and friends Signed-off-by: Tom Vijlbrief <tvijlbrief@gmail.com>	4 years ago
Benjamin Herrenschmidt	e4f475e17f	sprs: Store common SPRs in register file This stores the most common SPRs in the register file. This includes CTR and LR and a not yet final list of others. The register file is set to 64 entries for now. Specific types are defined that can represent a GPR index (gpr_index_t) or a GPR/SPR index (gspr_index_t) along with conversion functions between the two. In order to deal with some forms of branch updating both LR and CTR, we introduced a delayed update of LR after a branch link. Note: We currently stall the pipeline on such a delayed branch, but we could avoid stalling fetch in that specific case as we know we have a branch delay. We could also limit that to the specific case where we need to update both CTR and LR. This allows us to make bcreg, mtspr and mfspr pipelined. decode1 will automatically force the single issue flag on mfspr/mtspr to a "slow" SPR. [paulus@ozlabs.org - fix direction of decode2.stall_in] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	ec9b27660f	execute: Copy XER[SO] to CR for cmp[i] and cmpl[i] instructions We were copying in XER[SO] for the dot-form instructions but not the explicit compare instructions. Fix this. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Benjamin Herrenschmidt	501b6daf9b	Add basic XER support The carry is currently internal to execute1. We don't handle any of the other XER fields. This creates a type called "xer_common_t" that contains the commonly used XER bits (CA, CA32, SO, OV, OV32). The value is stored in the CR file (though it could be a separate module). The rest of the bits will be implemented as a separate SPR and the two parts reconciled in mfspr/mtspr in later commits. We always read XER in decode2 (there is little point not to) and send it down all pipeline branches as it will be needed in writeback for all types of instructions when CR0:SO needs to be updated (such forms exist for all pipeline branches even if we don't yet implement them). To avoid having to track XER hazards, we forward it back in EX1. This assumes that other pipeline branches that can modify it (mult and div) are running single issue for now. One additional hazard to beware of is an XER:SO modifying instruction in EX1 followed immediately by a store conditional. Due to our writeback latency, the store will go down the LSU with the previous XER value, thus the stcx. will set CR0:SO using an obsolete SO value. I doubt there exists any code relying on this behaviour being correct but we should account for it regardless, possibly by ensuring that stcx. remain single issue initially, or later by adding some minimal tracking or moving the LSU into the same pipeline as execute. Missing some obscure XER affecting instructions like addex or mcrxrx. [paulus@ozlabs.org - fix CA32 and OV32 for OP_ADD, fix order of arguments to set_ov] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Benjamin Herrenschmidt	83a8bb0238	spr: Cleanup decoding of SPR numbers Use a function to obtain the integer number and use constants with the architected numbers. Replace std_match with a case statement. This also has the side effect of returning 0 instead of some random previous result on mfspr of an unknown SPR. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Anton Blanchard	247d7d4aa0	Merge pull request #113 from mikey/exec-sim-remove Remove SIM generic from execute1	5 years ago
Michael Neuling	bd4ac06243	Remove SIM generic from execute1 This does nothing, so remove. Signed-off-by: Michael Neuling <mikey@neuling.org>	5 years ago
Benjamin Herrenschmidt	742b21480e	insn: Simplistic implementation of icbi We don't yet have a proper snooper for the icache, so for now make icbi just flush the whole thing Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	a0d95e791e	insn: Implement isync instruction The instruction works by redirecting fetch to nia+4 (hopefully using the same adder used to generate LR) and doing a backflush. Along with being single issue, this should guarantee that the next instruction only gets fetched after the pipe's been emptied. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Anton Blanchard	e67924f55e	isel takes a CR bit, not a CR field Fix a GHDL assert in isel. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Benjamin Herrenschmidt	bddc9327cc	execute1: Remove mux on "write_data" and "rc" outputs Only "write_enable" needs to change, this shrinks the core a bit more Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	da0bd89c43	crhelpers: Constraint "crnum" integer This seems to save quite a few LUTs Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	4437487ad0	execute1: Reformat No functional change Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Anton Blanchard	4433118c91	Merge pull request #105 from paulusmack/writeback Writeback	5 years ago
Paul Mackerras	f49a5a99a5	Remove execute2 stage Since the condition setting got moved to writeback, execute2 does nothing aside from wasting a cycle. This removes it. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	9646fe28b0	Do sign-extension instructions in writeback instead of execute1 This makes the exts[bhw] instructions do the sign extension in the writeback stage using the sign-extension logic there instead of having unique sign extension logic in execute1. This requires passing the data length and sign extend flag from decode2 down through execute1 and execute2 and into writeback. As a side bonus we reduce the number of values in insn_type_t by two. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	86c53aa3f7	Implement neg using OP_ADD We have all the machinery in place to implement the neg instruction as OP_ADD. Doing that means we can ditch OP_NEG, and saves about 66 slice LUTs on the A7-100. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Anton Blanchard	57b7f1ed71	Don't infer latch for newcrf Always initialize newcrf to avoid inferring a latch. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Paul Mackerras	24a4a796ce	execute: Consolidate count-leading/trailing-zeroes implementations This adds combinatorial logic that does 32-bit and 64-bit count leading and trailing zeroes in one unit, and consolidates the four instructions under a single OP_CNTZ opcode. This saves 84 slice LUTs on the Arty A7-100. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Anton Blanchard	b8fb721b81	Consolidate logical instructions Consolidate and/andc/nand, or/orc/nor and xor/eqv, using a common invert on the input and output. This saves us about 200 LUTs. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Paul Mackerras	f7c393ba7e	Add a rotate/mask/shift unit and use it in execute1 This adds a new entity 'rotator' which contains combinatorial logic for rotating and masking 64-bit values. It implements the operations of the rlwinm, rlwnm, rlwimi, rldicl, rldicr, rldic, rldimi, rldcl, rldcr, sld, slw, srd, srw, srad, sradi, sraw and srawi instructions. It consists of a 3-stage 64-bit rotator using 4:1 multiplexors at each stage, two mask generators, output logic and control logic. The insn_type_t values used for these instructions have been reduced to just 5: OP_RLC, OP_RLCL and OP_RLCR for the rotate and mask instructions (clear both left and right, clear left, clear right variants), OP_SHL for left shifts, and OP_SHR for right shifts. The control signals for the rotator are derived from the opcode and from the is_32bit and is_signed fields of the decode_rom_t. The rotator is instantiated as an entity in execute1 so that we can be sure we only have one of it. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	7fe84220a5	decode: Avoid multiplexing from instruction reg fields to regfile address ports This aims to simplify the logic between the instruction image and the register file read address ports and reduce the size of the decode tables. With this patch, the input_reg_a column of the decode tables can only select RA or zeroes, the input_reg_b column can only select RB or a constant (0, -1, or an immediate value from the instruction), and the input_reg_c columns can only select RS or zeroes. That means that the rotate/shift/logical ops now have their first input coming in via the input_reg_c column. That means we need to add a read_data3 field to the Decode2ToExecuteType record, but that will go away again when we split out the rotate/mask/logical ops to their own unit. As a related but not tightly connected change, this patch also sets the read1_enable signal to the register file to be 0 when RA=0 and the input_reg_a for the instruction is RA_OR_ZERO (previously it was 1). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	96b402a4bf	Consolidate add/subtract instructions into a single op All of the PPC add and subtract instructions, including carrying and extended versions, do much the same arithmetic operation: result = (I xor A) + B + C where A is the value from RA, I provides a logical inversion of A (i.e. I is 0 or -1), B is either from RB or is a constant 0 or -1, and C is 0, 1 or the carry bit from XER (CA). To consolidate all the add/subtract instructions into a single OP_ADD, we add a column to decode_rom_t to indicate when A should be inverted, and change the input_carry field to a 3-state selector to select C in the equation above. This also adds a new "CONST_M1" value for input_reg_b_t to indicate that B is a constant -1. This allows us to implement addme and subfme. The addex instruction appears not to exist, so the comments referring to it are removed. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
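
The identity described in commit 96b402a4bf, result = (I xor A) + B + C, can be checked with a small software model. What follows is a minimal Python sketch, not the VHDL from the repository: the names `op_add`, `MASK64` and the per-mnemonic wrappers are hypothetical, and the operand selections for each mnemonic are taken from the Power ISA definitions (for example, subf computes ~RA + RB + 1) rather than from the decode tables themselves.

```python
MASK64 = (1 << 64) - 1  # model 64-bit registers (assumption for this sketch)

def op_add(a, b, c, invert_a):
    """Return (result, carry_out) for result = (I xor A) + B + C."""
    i = MASK64 if invert_a else 0           # I is 0 or -1 (all ones)
    total = ((a ^ i) & MASK64) + (b & MASK64) + c
    return total & MASK64, total >> 64      # carry out of bit 63

# Each instruction is only a choice of I, B and C.
def add(ra, rb):
    return op_add(ra, rb, 0, False)[0]      # RA + RB

def subf(ra, rb):
    return op_add(ra, rb, 1, True)[0]       # ~RA + RB + 1 = RB - RA

def neg(ra):
    return op_add(ra, 0, 1, True)[0]        # ~RA + 1 = -RA

def addme(ra, ca):
    return op_add(ra, -1, ca, False)[0]     # RA + CA - 1 (B is the constant -1)

def subfme(ra, ca):
    return op_add(ra, -1, ca, True)[0]      # ~RA + CA - 1
```

In this model the constant -1 for B plays the role the commit message gives to "CONST_M1", and `neg` shows why a later commit (86c53aa3f7) could drop OP_NEG in favour of OP_ADD.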


70 Commits (f21f9dd5a0acbe9215703161f599252f1bf87d80)
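
Commit 441160d865 notes that the CR logical instructions carry their own truth table in the instruction word. A minimal Python sketch of that observation follows. The extended opcode (XO) values are the ones given in the Power ISA; the names `CR_XO`, `cr_op` and `REFERENCE` are hypothetical, and the bit positions are an assumption derived from those opcode values, not copied from execute1.vhdl.

```python
# 10-bit extended opcodes (XO) of the CR logical instructions, per the Power ISA.
CR_XO = {"crnor": 33, "crandc": 129, "crxor": 193, "crnand": 225,
         "crand": 257, "creqv": 289, "crorc": 417, "cror": 449}

def cr_op(xo, a, b):
    """Pick the result bit out of the opcode, indexed by the two input bits."""
    index = (a << 1) | b             # collect the operand bits into a 2-bit value
    return (xo >> (5 + index)) & 1   # XO bits 5..8 act as the truth table

# Conventional definitions, used only to cross-check the table lookup.
REFERENCE = {
    "crand":  lambda a, b: a & b,
    "cror":   lambda a, b: a | b,
    "crxor":  lambda a, b: a ^ b,
    "crnand": lambda a, b: 1 - (a & b),
    "crnor":  lambda a, b: 1 - (a | b),
    "creqv":  lambda a, b: 1 - (a ^ b),
    "crandc": lambda a, b: a & (1 - b),
    "crorc":  lambda a, b: a | (1 - b),
}

for name, xo in CR_XO.items():
    for a in (0, 1):
        for b in (0, 1):
            assert cr_op(xo, a, b) == REFERENCE[name](a, b)
```

Because XO sits one bit above the least significant bit of the 32-bit instruction, the same lookup against the full instruction word would use bits 6 to 9 (counting from the least significant bit as 0).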