microwatt

Commit Graph

Author	SHA1	Message	Date
Paul Mackerras	c2da82764f	core: Implement CFAR register This implements the CFAR SPR as a slow SPR stored in 'ctrl'. Taken branches and rfid update it to the address of the branch or rfid instruction. To simplify the logic, this makes rfid use the branch logic to generate its redirect (requiring SRR0 to come in to execute1 on the B input and SRR1 on the A input), and the masking of the bottom 2 bits of NIA is moved to fetch1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Benjamin Herrenschmidt	76e2c7d81c	ex1: Add SPR_TBU support It's used by the boot wrapper in Linux and possibly some userspace programs. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	4 years ago
Paul Mackerras	ec2fa61792	execute1: Reduce width of the result mux to help timing This reduces the number of different things that are assigned to the result variable. - The computations for the popcnt, prty, cmpb and exts instruction families are moved into the logical unit. - The result of mfspr from the slow SPRs is computed in 'spr_val' before being assigned to 'result'. - Writes to LR as a result of a blr or bclr instruction are done through the exc_write path to writeback. This eases timing considerably. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	6687aae4d6	core: Implement a simple branch predictor This implements a simple branch predictor in the decode1 stage. If it sees that the instruction is b or bc and the branch is predicted to be taken, it sends a flush and redirect upstream (to icache and fetch1) to redirect fetching to the branch target. The prediction is sent downstream with the branch instruction, and execute1 now only sends a flush/redirect upstream if the prediction was wrong. Unconditional branches are always predicted to be taken, and conditional branches are predicted to be taken if and only if the offset is negative. Branches that take the branch address from a register (bclr, bcctr) are predicted not taken, as we don't have any way to predict the branch address. Since we can now have a mflr being executed immediately after a bl or bcl, we now track the update to LR in the hazard tracker, using the second write register field that is used to track RA updates for update-form loads and stores. For those branches that update LR but don't write any other result (i.e. that don't decrement CTR), we now write back LR in the same cycle as the instruction rather than taking a second cycle for the LR writeback. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	209aa9ce3f	loadstore1: Reduce busy cycles This reduces the number of cycles where loadstore1 asserts its busy output, leading to increased throughput of loads and stores. Loads that hit in the cache can now be executed at the rate of one every two cycles. Stores take 4 cycles assuming the wishbone slave responds with an ack the cycle after we assert strobe. To achieve this, the state machine code is split into two parts, one for when we have an existing instruction in progress, and one for starting a new instruction. We can now combinatorially clear busy and start a new instruction in the same cycle that we get a done signal from the dcache; in other words we are completing one instruction and potentially writing back results in the same cycle that we start a new instruction and send its address and data to the dcache. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	6701e7346b	core: Use a busy signal rather than a stall This changes the instruction dependency tracking so that we can generate a "busy" signal from execute1 and loadstore1 which comes along one cycle later than the current "stall" signal. This will enable us to signal busy cycles only when we need to from loadstore1. The "busy" signal from execute1/loadstore1 indicates "I didn't take the thing you gave me on this cycle", as distinct from the previous stall signal which meant "I took that but don't give me anything next cycle". That means that decode2 proactively gives execute1 a new instruction as soon as it has taken the previous one (assuming there is a valid instruction available from decode1), and that then sits in decode2's output until execute1 can take it. So instructions are issued by decode2 somewhat earlier than they used to be. Decode2 now only signals a stall upstream when its output buffer is full, meaning that we can fill up bubbles in the upstream pipe while a long instruction is executing. This gives a small boost in performance. This also adds dependency tracking for rA updates by update-form load/store instructions. The GPR and CR hazard detection machinery now has one extra stage, which may not be strictly necessary. Some of the code now really only applies to PIPELINE_DEPTH=1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	9880fc7435	multiply: Move selection of result bits into execute1 This puts the logic that selects which bits of the multiplier result get written into the destination GPR into execute1, moved out from multiply. The multiplier is now expected to do an unsigned multiplication of 64-bit operands, optionally negate the result, detect 32-bit or 64-bit signed overflow of the result, and return a full 128-bit result. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	49a4d9f67a	Add core logging This logs 256 bits of data per cycle to a ring buffer in BRAM. The data collected can be read out through 2 new SPRs or through the debug interface. The new SPRs are LOG_ADDR (724) and LOG_DATA (725). LOG_ADDR contains the buffer write pointer in the upper 32 bits (in units of entries, i.e. 32 bytes) and the read pointer in the lower 32 bits (in units of doublewords, i.e. 8 bytes). Reading LOG_DATA gives the doubleword from the buffer at the read pointer and increments the read pointer. Setting bit 31 of LOG_ADDR inhibits the trace log system from writing to the log buffer, so the contents are stable and can be read. There are two new debug addresses which function similarly to the LOG_ADDR and LOG_DATA SPRs. The log is frozen while either or both of the LOG_ADDR SPR bit 31 or the debug LOG_ADDR register bit 31 are set. The buffer defaults to 2048 entries, i.e. 64kB. The size is set by the LOG_LENGTH generic on the core_debug module. Software can determine the length of the buffer because the length is ORed into the buffer write pointer in the upper 32 bits of LOG_ADDR. Hence the length of the buffer can be calculated as 1 << (31 - clz(LOG_ADDR)). There is a program to format the log entries in a somewhat readable fashion in scripts/fmt_log/fmt_log.c. The log_entry struct in that file describes the layout of the bits in the log entries. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	4a4a98d4b9	core: Do addpcis using the main adder (#189) By adding logic to decode2 to be able to send the instruction address down the A input, and making CONST_DX_HI (renamed to CONST_DXHI4) add 4 to the immediate value (easy since the bottom 16 bits were zero), we can do addpcis using the main adder. This reduces the width of the result mux and frees up one value in insn_type_t, since we can now use OP_ADD for addpcis. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	f089f2145a	Merge pull request #183 from shawnanastasio/addpcis Add support for the addpcis instruction	4 years ago
Shawn Anastasio	e606772aeb	Implement the addpcis instruction This commit adds support for the addpcis instruction from ISA 3.0. A new input_reg_b_t type, CONST_DX_HI, was added to support the shifted immediate value used in DX-Form instructions. Signed-off-by: Shawn Anastasio <shawn@anastas.io>	5 years ago
Benjamin Herrenschmidt	f86fb74bfe	irq: Simplify xics->core irq input Use a simple wire. common.vhdl types are better kept for things local to the core. We can add more wires later if we need to for HV irqs etc... Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Jonathan Balkind	cc532dd065	Changes for compilation with VCS: - Changing use of others in core files to satisfy VCS - Adding workaround for VCS subtype constraint inconsistencies in common.vhdl Signed-off-by: Jonathan Balkind <jbalkind@princeton.edu>	5 years ago
Paul Mackerras	a658766fcf	Implement slbia as a dTLB/iTLB flush Slbia (with IH=7) is used in the Linux kernel to flush the ERATs (our iTLB/dTLB), so make it do that. This moves the logic to work out whether to flush a single entry or the whole TLB from dcache and icache into mmu. We now invalidate all dTLB and iTLB entries when the AP (actual pagesize) field of RB is non-zero on a tlbie[l], as well as when IS is non-zero. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	01046527ba	MMU: Do radix page table walks on iTLB misses This hooks up the connections so that an OP_FETCH_FAILED coming down to loadstore1 will get sent to the MMU for it to do a radix tree walk for the instruction address. The MMU then sends the resulting PTE to the icache module to be installed in the iTLB. If no valid PTE can be found, the MMU sends an error signal back to loadstore1 which sends it on to execute1 to generate an ISI. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	3d4712ad43	Add TLB to icache This adds a direct-mapped TLB to the icache, with 64 entries by default. Execute1 now sends a "virt_mode" signal from MSR[IR] to fetch1 along with redirects to indicate whether instruction addresses should be translated through the TLB, and fetch1 sends that on to icache. Similarly a "priv_mode" signal is sent to indicate the privilege mode for instruction fetches. This means that changes to MSR[IR] or MSR[PR] don't take effect until the next redirect, meaning an isync, rfid, branch, etc. The icache uses a hash of the effective address (i.e. next instruction address) to index the TLB. The hash is an XOR of three fields of the address; with a 64-entry TLB, the fields are bits 12--17, 18--23 and 24--29 of the address. TLB invalidations simply invalidate the indexed TLB entry without checking the contents. If the icache detects a TLB miss with virt_mode=1, it will send a fetch_failed indication through fetch2 to decode1, which will turn it into a special OP_FETCH_FAILED opcode with unit=LDST. That will get sent down to loadstore1 which will currently just raise an Instruction Storage Interrupt (0x400) exception. One bit in the PTE obtained from the TLB is used to check whether an instruction access is allowed -- the privilege bit (bit 3). If bit 3 is 1 and priv_mode=0, then a fetch_failed indication is sent down to fetch2 and to decode1, which generates an OP_FETCH_FAILED. Any PTEs with PTE bit 0 (EAA[3]) clear or bit 8 (R) clear should not be put into the iTLB since such PTEs would not allow execution by any context. Tlbie operations get sent from mmu to icache over a new connection. Unfortunately the privileged instruction tests are broken for now. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	f6a0d7f9da	MMU: Implement data segment interrupts A data segment interrupt (DSegI) occurs when an address to be translated by the MMU is outside the range of the radix tree or the top two bits of the address (the quadrant) are 01 or 10. This is detected in a new state of the MMU state machine, and is sent back to loadstore1 as an error, which sends it on to execute1 to generate an interrupt to the 0x380 vector. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	d47fbf88d1	Implement access permission checks This adds logic to the dcache to check the permissions encoded in the PTE that it gets from the dTLB. The bits that are checked are: R must be 1; C must be 1 for a store; EAA(0): if this is 1, MSR[PR] must be 0; EAA(2) must be 1 for a store; EAA(1) | EAA(2) must be 1 for a load. In addition, ATT(0) is used to indicate a cache-inhibited access. This now implements DSISR bits 36, 38 and 45. (Bit numbers above correspond to the ISA, i.e. using big-endian numbering.) MSR[PR] is now conveyed to loadstore1 for use in permission checking. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	42d0fcc511	Implement data storage interrupts This adds a path from loadstore1 back to execute1 for reporting errors, and machinery in execute1 for generating data storage interrupts at vector 0x300. If dcache is given two requests in successive cycles and the first encounters an error (e.g. a TLB miss), it will now cancel the second request. Loadstore1 now responds to errors reported by dcache by sending an exception signal to execute1 and returning to the idle state. Execute1 then writes SRR0 and SRR1 and jumps to the 0x300 Data Storage Interrupt vector. DAR and DSISR are held in loadstore1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	750b3a8e28	dcache: Implement data TLB This adds a TLB to dcache, providing the ability to translate addresses for loads and stores. No protection mechanism has been implemented yet. The MSR_DR bit controls whether addresses are translated through the TLB. The TLB is a fixed-pagesize, set-associative cache. Currently the page size is 4kB and the TLB is 2-way set associative with 64 entries per set. This implements the tlbie instruction. RB bits 10 and 11 control whether the whole TLB is invalidated (if either bit is 1) or just a single entry corresponding to the effective page number in bits 12-63 of RB. As an extension until we get a hardware page table walk, a tlbie instruction with RB bits 9-11 set to 001 will load an entry into the TLB. The TLB entry value is in RS in the format of a radix PTE. Currently there is no proper handling of TLB misses. The load or store will not be performed but no interrupt is generated. In order to make timing at 100MHz on the Arty A7-100, we compare the real address from each way of the TLB with the tag from each way of the cache in parallel (requiring # TLB ways * # cache ways comparators). Then the result is selected based on which way hit in the TLB. That avoids a timing path going through the TLB EA comparators, the multiplexer that selects the RA, and the cache tag comparators. The hack where addresses of the form 0xc------- are marked as cache-inhibited is kept for now but restricted to real-mode accesses. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	635e316f9b	Pass mtspr/mfspr to MMU-related SPRs down to loadstore1 This arranges for some mfspr and mtspr to get sent to loadstore1 instead of being handled in execute1. In particular, DAR and DSISR are handled this way. They are therefore "slow" SPRs. While we're at it, fix the spelling of HEIR and remove mention of DAR and DSISR from the comments in execute1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	dd2e71930c	debug: Provide a way to examine GPRs, fast SPRs and MSR This provides commands on the debug interface to read the value of the MSR or any of the 64 GSPR register file entries. The GSPR values are read using the B port of the register file in a cycle when decode2 is not using it. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	5d282a950c	Improve architectural compliance of mfspr and mtspr Mfspr from an unimplemented SPR should be a no-op in privileged state, so in this case we need to write back whatever was previously in the destination register. For problem state, both mtspr and mfspr to unimplemented SPRs should cause a program interrupt. There are special cases in the architecture for SPRs 0, 4, 5 and 6 which we still don't implement. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	8a0a907e2f	Implement the extswsli instruction This mainly required the addition of an entry to the opcode 31 decode table and a 32-bit sign-extender in the rotator. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	102fbcfe9a	execute1: Fix interrupt delivery during slow instructions During slow instructions such as multiply or divide, if a decrementer (or other asynchronous) interrupt becomes pending, it disrupts the logic that keeps stall asserted until the end of the slow instruction, and the interrupt logic starts trying to deliver the interrupt before the slow instruction has finished. To fix that, make the interrupt logic wait until it sees e_in.valid set before setting exception to 1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	102b304db7	Merge remote-tracking branch 'remotes/origin/master'	5 years ago
Paul Mackerras	167e37d667	Plumb insn_type through to loadstore1 In preparation for adding a TLB to the dcache, this plumbs the insn_type from execute1 through to loadstore1, so that we can have other operations besides loads and stores (e.g. tlbie) going to loadstore1 and thence to the dcache. This also plumbs the unit field of the decode ROM from decode2 through to execute1 to simplify the logic around which ops need to go to loadstore1. The load and store data formatting are now not conditional on the op being OP_LOAD or OP_STORE. This eliminates the inferred latches clocked by each of the bits of r.op that we were getting previously. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	74db071067	execute1: Generate privileged instruction interrupts when MSR[PR] = 1 This adds logic to execute1 to check, when MSR[PR] = 1, whether each instruction arriving to be executed is a privileged instruction. If it is, a privileged-instruction type program interrupt is generated. For the mtspr and mfspr instructions, we need to look at bit 20 of the instruction (bit 4 of the SPR number) to determine if the SPR is privileged. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	b55c9cc298	execute1: Improve architecture compliance of MSR and related instructions This makes our treatment of the MSR conform better with the ISA. - On reset, initialize the MSR to have the SF and LE bits set and all the others reset. For good measure initialize r properly too. - Fix the bit numbering in msr_copy (the code was using big-endian bit numbers, not little-endian). - Use constants like MSR_EE to index MSR bits instead of expressions like '63 - 48', for readability. - Set MSR[SF, LE] and clear MSR[PR, IR, DR, RI] on interrupts. - Copy the relevant fields for rfid instead of using msr_copy, because the partial function fields of the MSR should be left unchanged, not zeroed. Our implementation of rfid is like the architecture description of hrfid, because we don't implement hypervisor mode. - Return the whole MSR for mfmsr. - Implement the L field for mtmsrd (L=1 copies just EE and RI). - For mtmsrd with L=0, leave out the HV, ME and LE bits as per the arch. - For mtmsrd and rfid, if PR ends up set, then also set EE, IR and DR as per the arch. - A few other minor tidyups (no semantic change). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Michael Neuling	b4f20c20b9	XICS interrupt controller New unified ICP and ICS XICS compliant interrupt controller. Configurable number of hardware sources. Fixed hardware source number based on hardware line taken. All hardware interrupts are a fixed priority. Level interrupts supported only. Hardwired to 0xc0004000 in SOC (UART is kept at 0xc0002000). Signed-off-by: Michael Neuling <mikey@neuling.org>	5 years ago
Paul Mackerras	dc6b1df653	execute1: Don't execute ld/st instruction when taking interrupt This fixes a bug in the logic where we would still send a load or store instruction to loadstore1 even though we have decided to take an asynchronous interrupt. Reported-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	56577827d4	Decode attn in the major opcode decode table This decodes attn using entry 0 of the major_decode_rom_array table instead of a special case in the decode1_1 process. This means that only the major opcode (the top 6 bits) is checked at decode time. To make sure the instruction is attn not some random illegal pattern, we now check bits 1-10 of the instruction at execute time and generate an illegal instruction interrupt if those bits are not 0100000000. This reduces LUT consumption by 42 LUTs on the Arty A7-100. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	6f7ef8b1b9	Decode sc in the major opcode decode table This decodes sc using entry 17 of the major_decode_rom_array table instead of a special case in the decode1_1 process. This means that only the major opcode (the top 6 bits) is checked at decode time. To make sure that the instruction is sc not scv, we now check bit 1 of the instruction at execute time and generate an illegal instruction interrupt if it is 0 (indicating scv). The level field of the sc instruction is now ignored. This reduces LUT consumption by 31 LUTs on the Arty A7-100. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	278ac5e0eb	Remove sim_config instruction It's not used any more, and it's not in the ISA. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	f5f17c24fd	execute1: Implement trap instructions properly This implements the trap instructions (tw, twi, td, tdi) using much of the same code as is used for the cmp/cmpl instructions. A 5-bit comparison value is generated, and for cmp/cmpl, the appropriate 3 bits are used to update the destination CR, and for trap instructions, the comparison value is ANDed with the TO field, and an exception is generated if any bit of the result is 1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	381149b2cc	Consolidate trap variants under a single OP_TRAP This replaces OP_TD, OP_TDI, OP_TW and OP_TWI with a single OP_TRAP, distinguishing the cases by the input_reg_b and is_32bit fields of the decode ROM. This adds the twi and td cases to the decode tables. For now we make all of the trap instructions unconditionally generate a trap-type program interrupt if the TO field of the instruction is all ones, and do nothing otherwise. This reduces the number of values in insn_type_t from 65 to 62, meaning that an insn_type_t can now be encoded in 6 bits rather than 7. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	d77033aa92	execute1: Simplify the interrupt logic a little This makes some simplifications to the interrupt logic which will help with later commits. - When irq_valid is set, don't set exception to 1 until we have a valid instruction. That means we can remove the if e_in.valid = '1' test from the exception = '1' block. - Don't assert stall_out on the first cycle of delivering an interrupt. If we do get another instruction in the next cycle, nothing will happen because we have ctrl.irq_state set and we will just continue writing the interrupt registers. - Make sure we deliver as many completions as we got instructions, otherwise the outstanding instruction count in control.vhdl gets out of sync. - In writeback, make sure all of the other write enables are ignored when e_in.exc_write_enable is set. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	fe077a116a	Rename OP_MCRF to OP_CROP and trim insn_type_t OP_MCRF covers the CR logical ops as well as mcrf since commit `c05441bf47` ("Implement CRNOR and friends"), so this renames OP_MCRF to OP_CROP. The OP_* values for the individual CR logical ops (OP_CRAND, etc.) are not used, so remove them from insn_type_t. No functional change. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	fb8f3da128	Give exceptions a separate path to writeback This adds separate fields in Execute1ToWritebackType for use in writing SRR0/1 (and in future other SPRs) on an interrupt. With this, we make timing once again on the Arty A7-100 -- previously we were missing by 0.2ns, presumably due to the result mux being wider than before. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Michael Neuling	5ef5604b65	Add sc, illegal and decrementer exceptions and some supervisor state This adds the following exceptions: - 0x700 program check (for illegal instructions) - 0x900 decrementer - 0xc00 system call This also adds some supervisor state: - decrementer - msr (SPRG0/1 and SRR0/1 already exist as fast SPRs) It also adds some supporting instructions: - rfid - mtmsrd - mfmsr - sc MSR state is added but only EE is used in this patch set. Other bits are read/written but are not used at all. This adds a 2-stage state machine to execute1.vhdl. This state machine allows fast SPRs SRR0/1 to be written in different cycles. This state machine can be extended later to add DAR and DSISR SPR writing for more complex exceptions like page faults. Signed-off-by: Michael Neuling <mikey@neuling.org>	5 years ago
Michael Neuling	594a19de37	Plumb attn instruction through to execute1 Currently we decode attn but we just mark it as illegal. This adds a separate case statement in execute1 for attn to terminate the core. Illegals also do this currently but we are soon implementing a 0x700 exception for them. Signed-off-by: Michael Neuling <mikey@neuling.org>	5 years ago
Paul Mackerras	81369187c0	loadstore1: Add support for cache-inhibited load and store instructions This adds support for lbzcix, lhzcix, lwzcix, ldcix, stbcix, sthcix, stwcix and stdcix. The temporary hack where accesses to addresses of the form 0xc??????? are made non-cacheable is left in for now to avoid making existing programs non-functional. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	b349cc891a	loadstore1: Move logic from dcache to loadstore1 So that the dcache could in future be used by an MMU, this moves logic to do with data formatting, rA updates for update-form instructions, and handling of unaligned loads and stores out of dcache and into loadstore1. For now, dcache connects only to loadstore1, and loadstore1 now has the connection to writeback. Dcache generates a stall signal to loadstore1 which indicates that the request presented in the current cycle was not accepted and should be presented again. However, loadstore1 doesn't currently use it because we know that we can never hit the circumstances where it might be set. For unaligned transfers, loadstore1 generates two requests to dcache back-to-back, and then waits to see two acks back from dcache (cycles where d_in.valid is true). Loadstore1 now has a FSM for tracking how many acks we are expecting from dcache and for doing the rA update cycles when necessary. Handling for reservations and conditional stores is still in dcache. Loadstore1 now generates its own stall signal back to decode2, so we no longer need the logic in execute1 that generated the stall for the first two cycles. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	5d85ede97d	dcache: Implement load-reserve and store-conditional instructions This involves plumbing the (existing) 'reserve' and 'rc' bits in the decode tables down to dcache, and 'rc' and 'store_done' bits from dcache to writeback. It turns out that we had 'RC' set in the 'rc' column for several ordinary stores and for the attn instruction. This corrects them to 'NONE', and sets the 'rc' column to 'ONE' for the conditional stores. In writeback we now have logic to set CR0 when the input from dcache has rc = 1. In dcache we have the reservation itself, which has a valid bit and the address down to cache line granularity. We don't currently store the reservation length. For a store conditional which fails, we set a 'cancel_store' signal which inhibits the write to the cache and prevents the state machine from starting a bus cycle or going to the STORE_WAIT_ACK state. Instead we set r1.stcx_fail which causes the instruction to complete in the next cycle with rc=1 and store_done=0. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	1a244d3470	Remove single-issue constraint for most loads and stores This removes the constraint that loads and stores are single-issue, at the expense of a stall of at least 2 cycles for every load and store. To do this, we plumb the existing stall signal that was generated in dcache to core, where it gets ORed with the stall signal from execute1. Execute1 generates a stall signal for the first two cycles of each load and store, and dcache generates the stall signal in the 3rd and subsequent cycles if it needs to. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	441160d865	execute1: Use truth table embedded in instruction for CR logical ops It turns out that CR logical instructions have the truth table of the operation embedded in the instruction word. This means that we can collect the two input operand bits into a 2-bit value and use that as the index to select the appropriate bit from the instruction word. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	e08ca4ab8e	countzero: Add a register to help make timing This adds a register in the middle of the countzero computation, so that we now have two cycles to count leading or trailing zeroes instead of just one. Execute1 now outputs a one-cycle stall signal when it encounters a cntlz* or cnttz* instruction. With this, the countzero path no longer fails timing on the Artix-7 at 100MHz. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	5422007f83	Plumb loadstore1 input from execute1 not decode2 This allows us to use the bypass at the input of execute1 for the address and data operands for loadstore1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	b14d982011	execute: Implement bypass from output of execute1 to input This enables back-to-back execution of integer instructions where the first instruction writes a GPR and the second reads the same GPR. This is done with a set of multiplexers at the start of execute1 which enable any of the three input operands to be taken from the output of execute1 (i.e. r.e.write_data) rather than the input from decode2 (i.e. e_in.read_data[123]). This also requires changes to the hazard detection and handling. Decode2 generates a signal indicating that the GPR being written is available for bypass, which is true for instructions that are executed in execute1 (rather than loadstore1/dcache). The gpr_hazard module stores this "bypassable" bit, and if the same GPR needs to be read by a subsequent instruction, it outputs a "use_bypass" signal rather than generating a stall. The use_bypass signal is then latched at the output of decode2 and passed down to execute1 to control the input multiplexer. At the moment there is no bypass on the inputs to loadstore1, but that is OK because all load and store instructions are marked as single-issue. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	0c714f1be6	execute: Move popcnt and prty instructions into the logical unit This implements logic in the logical entity to calculate the results of the popcnt* and prty* instructions. We now have one insn_type_t value for the 3 popcnt variants and one for the two prty variants, using the length field of the decode_rom_t to distinguish between them. The implementations in logical.vhdl use recursive algorithms rather than the simple functions in ppc_fx_insns.vhdl. This gives a saving of about 140 slice LUTs on the A7-100 and improves timing slightly. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	d2ca625b3b	execute: Do comparisons using the main adder This handles OP_CMP like a subtraction; the main adder computes ~RA + RB + 1, and the condition codes are computed from the results. A direct comparison of the two input operands is used to calculate the EQ bit of the condition result. The LT and GT bits are computed from the MSB of the subtraction result, the carry out from the subtraction, and the MSBs of the operands. For a 32-bit comparison, the 32-bit carry and bit 31 of the result and input operands are used; for a 64-bit comparison, the 64-bit carry and bit 63 of the operands and result are used. It turns out to be more convenient to use the 'signed' field of the decode table to distinguish signed from unsigned comparisons, rather than the insn_type. Therefore this uses OP_CMP for both cmp and cmpl, which also has the benefit of reducing the number of values in insn_type_t. Doing this saves over 200 slice LUTs on the Arty A7-100 and improves timing slightly as well. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	d956846667	execute1: Move EXTS* instruction back into execute1 This moves the sign extension done by the extsb, extsh and extsw instructions back into execute1. This means that we no longer need any data formatting in writeback for results coming from execute1, so this modifies writeback so the data formatter inputs come directly from the loadstore unit output. The condition code updates for RC=1 form instructions are now done on the value from execute1 rather than the output of the data formatter, which should help timing. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	c9a2076dd3	execute1: Remember dest GPR, RC, OE, XER for slow operations For multiply and divide operations, execute1 now records the destination GPR number, RC and OE from the instruction, and the XER value. This means that the multiply and divide units don't need to record those values and then send them back to execute1. This makes the interface to those units a bit simpler. They simply report an overflow signal along with the result value, and execute1 takes care of updating XER if necessary. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	39d18d2738	Make divider hang off the side of execute1 With this, the divider is a unit that execute1 sends operands to and which sends its results back to execute1, which then sends them to writeback. Execute1 now sends a stall signal when it gets a divide or modulus instruction until it gets a valid signal back from the divider. Divide and modulus instructions are no longer marked as single-issue. The data formatting step that used to be done in decode2 for div and mod instructions is now done in execute1. We also do the absolute value operation in that same cycle instead of taking an extra cycle inside the divider for signed operations with a negative operand. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	2167186b5f	Make multiplier hang off the side of execute1 With this, the multiplier isn't a separate pipe that decode2 issues instructions to, but rather is a unit that execute1 sends operands to and which sends the result back to execute1, which then sends it to writeback. Execute1 now sends a stall signal when it gets a multiply instruction until it gets a valid signal back from the multiplier. This all means that we no longer need to mark the multiply instructions as single-issue. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Anton Blanchard	ad3db18dce	Fix a ghdlsynth inferred latch error in execute It should never happen in practice, but ghdlsynth is complaining about an inferred latch here. Fix it. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Anton Blanchard	cc8a9e7893	Upper 32 bits of XER should read as 0s From the architecture: bits 0:31 and 35:43 are treated as reserved and return 0s when read using mfxer Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Tom Vijlbrief	c05441bf47	Implement CRNOR and friends Signed-off-by: Tom Vijlbrief <tvijlbrief@gmail.com>	5 years ago
Benjamin Herrenschmidt	e4f475e17f	sprs: Store common SPRs in register file This stores the most common SPRs in the register file. This includes CTR and LR and a not yet final list of others. The register file is set to 64 entries for now. Specific types are defined that can represent a GPR index (gpr_index_t) or a GPR/SPR index (gspr_index_t) along with conversion functions between the two. In order to deal with some forms of branch updating both LR and CTR, we introduced a delayed update of LR after a branch link. Note: We currently stall the pipeline on such a delayed branch, but we could avoid stalling fetch in that specific case as we know we have a branch delay. We could also limit that to the specific case where we need to update both CTR and LR. This allows us to make bcreg, mtspr and mfspr pipelined. decode1 will automatically force the single issue flag on mfspr/mtspr to a "slow" SPR. [paulus@ozlabs.org - fix direction of decode2.stall_in] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	ec9b27660f	execute: Copy XER[SO] to CR for cmp[i] and cmpl[i] instructions We were copying in XER[SO] for the dot-form instructions but not the explicit compare instructions. Fix this. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Benjamin Herrenschmidt	501b6daf9b	Add basic XER support The carry is currently internal to execute1. We don't handle any of the other XER fields. This creates a type called "xer_common_t" that contains the commonly used XER bits (CA, CA32, SO, OV, OV32). The value is stored in the CR file (though it could be a separate module). The rest of the bits will be implemented as a separate SPR and the two parts reconciled in mfspr/mtspr in later commits. We always read XER in decode2 (there is little point not to) and send it down all pipeline branches as it will be needed in writeback for all types of instructions when CR0:SO needs to be updated (such forms exist for all pipeline branches even if we don't yet implement them). To avoid having to track XER hazards, we forward it back in EX1. This assumes that other pipeline branches that can modify it (mult and div) are running single issue for now. One additional hazard to beware of is an XER:SO modifying instruction in EX1 followed immediately by a store conditional. Due to our writeback latency, the store will go down the LSU with the previous XER value, thus the stcx. will set CR0:SO using an obsolete SO value. I doubt there exists any code relying on this behaviour being correct but we should account for it regardless, possibly by ensuring that stcx. remains single issue initially, or later by adding some minimal tracking or moving the LSU into the same pipeline as execute. Missing some obscure XER affecting instructions like addex or mcrxrx. [paulus@ozlabs.org - fix CA32 and OV32 for OP_ADD, fix order of arguments to set_ov] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Benjamin Herrenschmidt	83a8bb0238	spr: Cleanup decoding of SPR numbers Use a function to obtain the integer number and use constants with the architected numbers. Replace std_match with a case statement. This also has the side effect of returning 0 instead of some random previous result on mfspr of an unknown SPR. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Anton Blanchard	247d7d4aa0	Merge pull request #113 from mikey/exec-sim-remove Remove SIM generic from execute1	5 years ago
Michael Neuling	bd4ac06243	Remove SIM generic from execute1 This does nothing, so remove. Signed-off-by: Michael Neuling <mikey@neuling.org>	5 years ago
Benjamin Herrenschmidt	742b21480e	insn: Simplistic implementation of icbi We don't yet have a proper snooper for the icache, so for now make icbi just flush the whole thing Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	a0d95e791e	insn: Implement isync instruction The instruction works by redirecting fetch to nia+4 (hopefully using the same adder used to generate LR) and doing a backflush. Along with being single issue, this should guarantee that the next instruction only gets fetched after the pipe's been emptied. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Anton Blanchard	e67924f55e	isel takes a CR bit, not a CR field Fix a GHDL assert in isel. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Benjamin Herrenschmidt	bddc9327cc	execute1: Remove mux on "write_data" and "rc" outputs Only "write_enable" needs to change, this shrinks the core a bit more Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	da0bd89c43	crhelpers: Constrain "crnum" integer This seems to save quite a few LUTs Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	4437487ad0	execute1: Reformat No functional change Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Anton Blanchard	4433118c91	Merge pull request #105 from paulusmack/writeback Writeback	5 years ago
Paul Mackerras	f49a5a99a5	Remove execute2 stage Since the condition setting got moved to writeback, execute2 does nothing aside from wasting a cycle. This removes it. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	9646fe28b0	Do sign-extension instructions in writeback instead of execute1 This makes the exts[bhw] instructions do the sign extension in the writeback stage using the sign-extension logic there instead of having unique sign extension logic in execute1. This requires passing the data length and sign extend flag from decode2 down through execute1 and execute2 and into writeback. As a side bonus we reduce the number of values in insn_type_t by two. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	86c53aa3f7	Implement neg using OP_ADD We have all the machinery in place to implement the neg instruction as OP_ADD. Doing that means we can ditch OP_NEG, and saves about 66 slice LUTs on the A7-100. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Anton Blanchard	57b7f1ed71	Don't infer latch for newcrf Always initialize newcrf to avoid inferring a latch. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Paul Mackerras	24a4a796ce	execute: Consolidate count-leading/trailing-zeroes implementations This adds combinatorial logic that does 32-bit and 64-bit count leading and trailing zeroes in one unit, and consolidates the four instructions under a single OP_CNTZ opcode. This saves 84 slice LUTs on the Arty A7-100. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Anton Blanchard	b8fb721b81	Consolidate logical instructions Consolidate and/andc/nand, or/orc/nor and xor/eqv, using a common invert on the input and output. This saves us about 200 LUTs. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Paul Mackerras	f7c393ba7e	Add a rotate/mask/shift unit and use it in execute1 This adds a new entity 'rotator' which contains combinatorial logic for rotating and masking 64-bit values. It implements the operations of the rlwinm, rlwnm, rlwimi, rldicl, rldicr, rldic, rldimi, rldcl, rldcr, sld, slw, srd, srw, srad, sradi, sraw and srawi instructions. It consists of a 3-stage 64-bit rotator using 4:1 multiplexors at each stage, two mask generators, output logic and control logic. The insn_type_t values used for these instructions have been reduced to just 5: OP_RLC, OP_RLCL and OP_RLCR for the rotate and mask instructions (clear both left and right, clear left, clear right variants), OP_SHL for left shifts, and OP_SHR for right shifts. The control signals for the rotator are derived from the opcode and from the is_32bit and is_signed fields of the decode_rom_t. The rotator is instantiated as an entity in execute1 so that we can be sure we only have one of it. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	7fe84220a5	decode: Avoid multiplexing from instruction reg fields to regfile address ports This aims to simplify the logic between the instruction image and the register file read address ports and reduce the size of the decode tables. With this patch, the input_reg_a column of the decode tables can only select RA or zeroes, the input_reg_b column can only select RB or a constant (0, -1, or an immediate value from the instruction), and the input_reg_c column can only select RS or zeroes. That means that the rotate/shift/logical ops now have their first input coming in via the input_reg_c column. That means we need to add a read_data3 field to the Decode2ToExecuteType record, but that will go away again when we split out the rotate/mask/logical ops to their own unit. As a related but not tightly connected change, this patch also sets the read1_enable signal to the register file to be 0 when RA=0 and the input_reg_a for the instruction is RA_OR_ZERO (previously it was 1). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	96b402a4bf	Consolidate add/subtract instructions into a single op All of the PPC add and subtract instructions, including carrying and extended versions, do much the same arithmetic operation: result = (I xor A) + B + C where A is the value from RA, I provides a logical inversion of A (i.e. I is 0 or -1), B is either from RB or is a constant 0 or -1, and C is 0, 1 or the carry bit from XER (CA). To consolidate all the add/subtract instructions into a single OP_ADD, we add a column to decode_rom_t to indicate when A should be inverted, and change the input_carry field to a 3-state selector to select C in the equation above. This also adds a new "CONST_M1" value for input_reg_b_t to indicate that B is a constant -1. This allows us to implement addme and subfme. The addex instruction appears not to exist, so the comments referring to it are removed. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	58b06eb5f3	decode: Remove const fields from decode_rom_t The const* fields of decode_rom_t drove multiplexers in decode2 that picked out various instruction fields and put them into the const* fields of the Decode2ToExecute1Type record, from where they were used in execute1. However, the code in execute1 can just as easily use the appropriate fields of the original instruction word, since that is now available in execute1. This therefore changes the code to do that, resulting in smaller decode tables. Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	bbae2d1eda	decode: Index minor op table with insn bits for opcode 31 This changes decode_op_31_array from being indexed by a ppc_insn_t (which is derived from the instruction word by a whole series of if/elsif statements) to being indexed directly by bits 10...1 of the instruction word. With this we no longer need ppc_insn. This then means that the decode1 stage doesn't distinguish between mfcr and mfocrf, or between mtcrf and mtocrf, since those are distinguished by the value in bit 20 of the instruction. To accommodate that, execute1 changes so that the one op value (OP_MFCR) does either the mfcr or the mfocrf behaviour depending on bit 20 of the instruction word; and similarly for mtcrf/mtocrf. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	21d3f8a5ed	decode: Index minor op table with insn bits for opcode 30 This comprises the 64-bit rotate and mask instructions. In order to reduce the table index to 3 bits, we combine rldcl and rldcr into a single op (OP_RLDCX), and choose the right mask at execute time based on bit 1 of the instruction word. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	00e9f801f6	decode: Index minor op table with insn bits for opcode 19 This changes the decoding of major opcode 19 from using the ppc_insn_t index to using bits of the instruction word directly. Opcode 19 has a 10-bit minor opcode field (bits 10..1) but the space is sparsely filled. Therefore we index a table of single-bit entries with the 10-bit minor opcode to filter out the illegal minor opcodes, and index a table using just 3 bits -- 5, 3 and 2 -- of the instruction to get the decode entry. This groups together all the instructions in 4 columns of the opcode map as a single entry. That means that mcrf and all the CR logical ops get grouped together, and bcctr, bclr and bctar get grouped together. At present the CR logical ops are not implemented, so their grouping has no impact. The code for bclr and bcctr in execute1 is now common, using a single op, and it now determines the branch address by looking at bit 10 of the instruction word at execute time. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	c9e92483b8	decode: Push mtspr/mfspr register decoding down into execute1 Instead of doing mfctr, mflr, mftb, mtctr, mtlr as separate ops, just pass down mfspr and mtspr ops with the spr number and let execute1 decode which SPR we're addressing. This will help reduce the number of instruction bits decode1 needs to look at. In fact we now pass down the whole instruction from decode2 to execute1. We will need more bits of the instruction in future, and the tools should just optimize away any that we don't end up using. Since the 'aa' bit was just a copy of an instruction bit, we can now remove it from the record. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Benjamin Herrenschmidt	3e6f656a90	Add MCRF instruction Hopefully it's not too timing catastrophic. The variable newcrf will be handy for the other CR ops when we implement them I suspect. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	554ae88540	Implement absolute branches Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	80a0e7fcf3	execute1: simplify flush_out It's always set when f_out.redirect is set, so may as well set it once at the end. It's all combo from the register. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Anton Blanchard	b57325ce29	Merge branch 'divider' of https://github.com/paulusmack/microwatt	5 years ago
Anton Blanchard	5a6f8d26d1	Rename OP_SUBFC -> OP_SUBFE, OP_ADDC -> OP_ADDE These were somewhat badly named. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Paul Mackerras	d5bc6c8824	Add a divider unit and a testbench for it This adds a divider unit, connected to the core in much the same way that the multiplier unit is connected. The division algorithm is very simple-minded, taking 64 clock cycles for any division (even 32-bit division instructions). The decoding is simplified by making use of regularities in the instruction encoding for div* and mod* instructions. Instead of having PPC_* encodings from the first-stage decoder for each of the different div* and mod* instructions, we now just have PPC_DIV and PPC_MOD, and the inputs to the divider that indicate what sort of division operation to do are derived from instruction word bits. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Anton Blanchard	6d85920068	execute1 no longer needs sim_console Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Michael Neuling	1e1b799382	Remove FIXME comment This was mistakenly left behind in `4d5abfb430` ("Remove dynamic ranges from code") Signed-off-by: Michael Neuling <mikey@neuling.org>	5 years ago
Anton Blanchard	a2df2a10a2	Remove sim console We can force all existing code to use the UART console by passing 0 in bit zero of the sim config register. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Anton Blanchard	a8f8c54b77	Move debug execute output into decode2 This covers all units, and we avoid double printing. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Anton Blanchard	92a7152370	Rework pipeline, add stall and flush signals This adds stall and flush signals to the pipeline. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Michael Neuling	4d5abfb430	Remove dynamic ranges from code Some VHDL compilers like verific [1] don't like these, so let's remove them. Lots of random code changes, but passes make check. Also add basic script to run verific and generate verilog. 1. https://www.verific.com/ Signed-off-by: Michael Neuling <mikey@neuling.org>	5 years ago
Anton Blanchard	0fd18c2455	Add srd and srw Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Anton Blanchard	73daacbcd4	Add sim only divw Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Anton Blanchard	5a29cb4699	Initial import of microwatt Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
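
Commit d2ca625b3b above describes doing comparisons on the main adder: it computes ~RA + RB + 1, takes EQ from a direct comparison of the operands, and derives LT and GT from the MSB of the result, the carry out, and the MSBs of the operands. The Python sketch below is one possible software model of that idea, not the VHDL from the commit. The function name cmp_via_adder and the exact way the MSBs and carry are combined are assumptions made for illustration.

```python
def cmp_via_adder(ra, rb, signed=True, bits=64):
    """Hypothetical model: compare ra with rb using only ~RA + RB + 1.

    Returns (lt, gt, eq) for "ra < rb", "ra > rb", "ra == rb".
    """
    mask = (1 << bits) - 1
    a, b = ra & mask, rb & mask

    total = (a ^ mask) + b + 1        # ~RA + RB + 1, i.e. RB - RA
    diff = total & mask               # subtraction result
    carry = total >> bits             # carry out: 1 iff b >= a (unsigned)

    def msb(x):
        return (x >> (bits - 1)) & 1

    eq = a == b                       # EQ comes from a direct comparison
    if signed:
        if msb(a) != msb(b):
            # Signs differ: the negative operand is the smaller one.
            lt = msb(a) == 1
        else:
            # Same sign: RB - RA cannot overflow, so its sign bit decides.
            lt = (not eq) and msb(diff) == 0
    else:
        # Unsigned: carry out of RB - RA means RB >= RA.
        lt = (not eq) and carry == 1
    gt = (not lt) and (not eq)
    return lt, gt, eq
```

Passing bits=32 models the 32-bit comparison, where bit 31 and the 32-bit carry are used instead of bit 63 and the 64-bit carry.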


150 Commits (c0f7f54276a0901dd605e4983007aac3b4a001fe)
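
Commit 96b402a4bf above consolidates the add and subtract instructions around a single formula, result = (I xor A) + B + C, and commit 86c53aa3f7 reuses the same machinery for neg. A minimal Python sketch of that formula on 64-bit values follows. The function name op_add and its parameters are hypothetical, and the per-instruction settings in the comments are the usual arithmetic identities rather than values copied from the decode tables.

```python
MASK64 = (1 << 64) - 1

def op_add(ra, rb, invert_a=False, carry_in=0):
    """Hypothetical model of result = (I xor A) + B + C.

    invert_a plays the role of I (0 or -1), rb is B, carry_in is C.
    Returns (64-bit result, carry out).
    """
    a = ra & MASK64
    if invert_a:
        a ^= MASK64                   # I = -1: logical inversion of A
    total = a + (rb & MASK64) + carry_in
    return total & MASK64, total >> 64

# Assumed operand selections for a few instructions:
#   add    : I = 0,  B = RB, C = 0
#   subf   : I = -1, B = RB, C = 1    (RB - RA)
#   neg    : I = -1, B = 0,  C = 1    (-RA)
#   addme  : I = 0,  B = -1, C = CA   (RA + CA - 1)
#   subfme : I = -1, B = -1, C = CA   (~RA + CA - 1)
```

For example, op_add(3, 10, invert_a=True, carry_in=1) models subf with RA = 3 and RB = 10 and yields a result of 7 with a carry out of 1.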