microwatt

Commit Graph

Author	SHA1	Message	Date
Paul Mackerras	856e9e955f	core: Add framework for an FPU This adds the skeleton of a floating-point unit and implements the mffs and mtfsf instructions. Execute1 sends FP instructions to the FPU and receives busy, exception, FP interrupt and illegal interrupt signals from it. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	9d285a265c	core: Add support for single-precision FP loads and stores This adds code to loadstore1 to convert between single-precision and double-precision formats, and implements the lfs* and stfs* instructions. The conversion processes are described in Power ISA v3.1 Book 1 sections 4.6.2 and 4.6.3. These conversions take one cycle, so lfs* and stfs* are one cycle slower than lfd* and stfd*. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	45cd8f4fc3	core: Add support for floating-point loads and stores This extends the register file so it can hold FPR values, and implements the FP loads and stores that do not require conversion between single and double precision. We now have the FP, FE0 and FE1 bits in MSR. FP loads and stores cause a FP unavailable interrupt if MSR[FP] = 0. The FPU facilities are optional and their presence is controlled by the HAS_FPU generic passed down from the top-level board file. It defaults to true for all except the A7-35 boards. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	b589d2d472	execute1: Implement trace interrupts Trace interrupts occur when the MSR[TE] field is non-zero and an instruction other than rfid has been successfully completed. A trace interrupt occurs before the next instruction is executed or any asynchronous interrupt is taken. Since the trace interrupt is defined to set SRR1 bits depending on whether the traced instruction is a load or an instruction treated as a load, or a store or an instruction treated as a store, we need to make sure the treated-as-a-load instructions (icbi, icbt, dcbt, dcbst, dcbf) and the treated-as-a-store instructions (dcbtst, dcbz) have the correct opcodes in decode1. Several of them were previously marked as OP_NOP. We don't yet implement the SIAR or SDAR registers, which should be set by trace interrupts. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	eee90a0815	loadstore1: Generate alignment interrupts for unaligned larx/stcx Load-and-reserve and store-conditional instructions are required to generate an alignment interrupt (0x600 vector) if their EA is not aligned. Implement this. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	033ee909fd	core: Implement 32-bit mode In 32-bit mode, effective addresses are truncated to 32 bits, both for instruction fetches and data accesses, and CR0 is set for Rc=1 (record form) instructions based on the lower 32 bits of the result rather than all 64 bits. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	2e7b371305	core: Implement big-endian mode Big-endian mode affects both instruction fetches and data accesses. For instruction fetches, we byte-swap each word read from memory when writing it into the icache data RAM, and use a tag bit to indicate whether each cache line contains instructions in BE or LE form. For data accesses, we simply need to invert the existing byte_reverse signal in BE mode. The only thing to be careful of is to get the sign bit from the correct place when doing a sign-extending load that crosses two doublewords of memory. For now, interrupts unconditionally set MSR[LE]. We will need some sort of interrupt-little-endian bit somewhere, perhaps in LPCR. This also fixes a debug report statement in fetch1.vhdl. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	0fb8967290	core: Implement the TAR register and the bctar instruction Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	535341961d	multiplier: Generalize interface to the multiplier This makes the interface to the multiplier more general so an instance of it can be used in the FPU. It now has a 128-bit addend that is added on to the product. Instead of an input to negate the output, it now has a "not_result" input to complement the output. Execute1 uses not_result=1 and addend=-1 to get the effect of negating the output. The interface is defined this way because this is what can be done easily with the Xilinx DSP slices in xilinx-mult.vhdl. This also adds clock enable signals to the DSP slices, mostly for the sake of reducing power consumption. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	91cbeee77c	loadstore1: Generate busy signal earlier This makes the calculation of busy as simple as possible and dependent only on register outputs. The timing of busy is critical, as it gates the valid signal for the next instruction, and therefore any delays in dropping busy at the end of a load or store directly impact the timing of a host of other paths. This also separates the 'done without error' and 'done with error' cases from the MMU into separate signals that are both driven directly from registers. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Jordan Niethe	17fc77cef2	core: Implement PVR register Microwatt has been allocated a PVR version of 0x0063. Implement a PVR with this value. Signed-off-by: Jordan Niethe <jniethe5@gmail.com>	4 years ago
Paul Mackerras	74062195ca	execute1: Do forwarding of the CR result to the next instruction This adds a path to allow the CR result of one instruction to be forwarded to the next instruction, so that sequences such as cmp; bc can avoid having a 1-cycle bubble. Forwarding is not available for dot-form (Rc=1) instructions, since the CR result for them is calculated in writeback. The decode.output_cr field is used to identify those instructions that compute the CR result in execute1. For some reason, the multiply instructions incorrectly had output_cr = 1 in the decode tables. This fixes that. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	0f0573903b	execute1: Add latch to redirect path This latches the redirect signal inside execute1, so that it is sent a cycle later to fetch1 (and to decode/icache as flush). This breaks a long combinatorial chain from the branch and interrupt detection in execute1 through the redirect/flush signals all the way back to fetch1, icache and decode. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	c2da82764f	core: Implement CFAR register This implements the CFAR SPR as a slow SPR stored in 'ctrl'. Taken branches and rfid update it to the address of the branch or rfid instruction. To simplify the logic, this makes rfid use the branch logic to generate its redirect (requiring SRR0 to come in to execute1 on the B input and SRR1 on the A input), and the masking of the bottom 2 bits of NIA is moved to fetch1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Benjamin Herrenschmidt	76e2c7d81c	ex1: Add SPR_TBU support It's used by the boot wrapper in Linux and possibly some userspace programs. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	4 years ago
Benjamin Herrenschmidt	5c2fc47e2c	xics: Add simple ICS Move the external interrupt generation to a separate module "ICS" (source controller) which has a register per source, currently containing only the priority control. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	4 years ago
Paul Mackerras	6687aae4d6	core: Implement a simple branch predictor This implements a simple branch predictor in the decode1 stage. If it sees that the instruction is b or bc and the branch is predicted to be taken, it sends a flush and redirect upstream (to icache and fetch1) to redirect fetching to the branch target. The prediction is sent downstream with the branch instruction, and execute1 now only sends a flush/redirect upstream if the prediction was wrong. Unconditional branches are always predicted to be taken, and conditional branches are predicted to be taken if and only if the offset is negative. Branches that take the branch address from a register (bclr, bcctr) are predicted not taken, as we don't have any way to predict the branch address. Since we can now have a mflr being executed immediately after a bl or bcl, we now track the update to LR in the hazard tracker, using the second write register field that is used to track RA updates for update-form loads and stores. For those branches that update LR but don't write any other result (i.e. that don't decrement CTR), we now write back LR in the same cycle as the instruction rather than taking a second cycle for the LR writeback. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	6701e7346b	core: Use a busy signal rather than a stall This changes the instruction dependency tracking so that we can generate a "busy" signal from execute1 and loadstore1 which comes along one cycle later than the current "stall" signal. This will enable us to signal busy cycles only when we need to from loadstore1. The "busy" signal from execute1/loadstore1 indicates "I didn't take the thing you gave me on this cycle", as distinct from the previous stall signal which meant "I took that but don't give me anything next cycle". That means that decode2 proactively gives execute1 a new instruction as soon as it has taken the previous one (assuming there is a valid instruction available from decode1), and that then sits in decode2's output until execute1 can take it. So instructions are issued by decode2 somewhat earlier than they used to be. Decode2 now only signals a stall upstream when its output buffer is full, meaning that we can fill up bubbles in the upstream pipe while a long instruction is executing. This gives a small boost in performance. This also adds dependency tracking for rA updates by update-form load/store instructions. The GPR and CR hazard detection machinery now has one extra stage, which may not be strictly necessary. Some of the code now really only applies to PIPELINE_DEPTH=1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	62b24a8dae	icache: Improve latencies when reloading cache lines The icache can now detect a hit on a line being refilled from memory, as we have an array of individual valid bits per row for the line that is currently being loaded. This enables the request that initiated the refill to be satisfied earlier, and also enables following requests to the same cache line to be satisfied before the line is completely refilled. Furthermore, the refill now starts at the row that is needed. This should reduce the latency for an icache miss. We now get a 'sequential' indication from fetch1, and use that to know when we can deliver an instruction word using the other half of the 64-bit doubleword that was read last cycle. This doesn't make much difference at the moment, but it frees up cycles where we could test whether the next line is present in the cache so that we could prefetch it if not. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	9880fc7435	multiply: Move selection of result bits into execute1 This puts the logic that selects which bits of the multiplier result get written into the destination GPR into execute1, moved out from multiply. The multiplier is now expected to do an unsigned multiplication of 64-bit operands, optionally negate the result, detect 32-bit or 64-bit signed overflow of the result, and return a full 128-bit result. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	b5a7dbb78d	core: Remove fetch2 pipeline stage The fetch2 stage existed primarily to provide a stash buffer for the output of icache when a stall occurred. However, we can get the same effect -- of having the input to decode1 stay unchanged on a stall cycle -- by using the read enable of the BRAMs in icache, and by adding logic to keep the outputs unchanged on a clock cycle when stall_in = 1. This reduces branch and interrupt latency by one cycle. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Benjamin Herrenschmidt	f86fb74bfe	irq: Simplify xics->core irq input Use a simple wire. common.vhdl types are better kept for things local to the core. We can add more wires later if we need to for HV irqs etc... Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Jonathan Balkind	cc532dd065	Changes for compilation with VCS: - Changing use of others in core files to satisfy VCS - Adding workaround for VCS subtype constraint inconsistencies in common.vhdl Signed-off-by: Jonathan Balkind <jbalkind@princeton.edu>	5 years ago
Paul Mackerras	2843c99a71	MMU: Implement reading of the process table This adds the PID register and repurposes SPR 720 as the PRTBL register, which points to the base of the process table. There doesn't seem to be any point to implementing the partition table given that we don't have hypervisor mode. The MMU caches entry 0 of the process table internally (in pgtbl3) plus the entry indexed by the value in the PID register (pgtbl0). Both caches are invalidated by a tlbie[l] with RIC=2 or by a move to PRTBL. The pgtbl0 cache is invalidated by a move to PID. The dTLB and iTLB are cleared by a move to either PRTBL or PID. Which of the two page table root pointers is used (pgtbl0 or pgtbl3) depends on the MSB of the address being translated. Since the segment checking ensures that address(63) = address(62), this is sufficient to map quadrants 0 and 3. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	a658766fcf	Implement slbia as a dTLB/iTLB flush Slbia (with IH=7) is used in the Linux kernel to flush the ERATs (our iTLB/dTLB), so make it do that. This moves the logic to work out whether to flush a single entry or the whole TLB from dcache and icache into mmu. We now invalidate all dTLB and iTLB entries when the AP (actual pagesize) field of RB is non-zero on a tlbie[l], as well as when IS is non-zero. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	01046527ba	MMU: Do radix page table walks on iTLB misses This hooks up the connections so that an OP_FETCH_FAILED coming down to loadstore1 will get sent to the MMU for it to do a radix tree walk for the instruction address. The MMU then sends the resulting PTE to the icache module to be installed in the iTLB. If no valid PTE can be found, the MMU sends an error signal back to loadstore1 which sends it on to execute1 to generate an ISI. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	3d4712ad43	Add TLB to icache This adds a direct-mapped TLB to the icache, with 64 entries by default. Execute1 now sends a "virt_mode" signal from MSR[IR] to fetch1 along with redirects to indicate whether instruction addresses should be translated through the TLB, and fetch1 sends that on to icache. Similarly a "priv_mode" signal is sent to indicate the privilege mode for instruction fetches. This means that changes to MSR[IR] or MSR[PR] don't take effect until the next redirect, meaning an isync, rfid, branch, etc. The icache uses a hash of the effective address (i.e. next instruction address) to index the TLB. The hash is an XOR of three fields of the address; with a 64-entry TLB, the fields are bits 12--17, 18--23 and 24--29 of the address. TLB invalidations simply invalidate the indexed TLB entry without checking the contents. If the icache detects a TLB miss with virt_mode=1, it will send a fetch_failed indication through fetch2 to decode1, which will turn it into a special OP_FETCH_FAILED opcode with unit=LDST. That will get sent down to loadstore1 which will currently just raise an Instruction Storage Interrupt (0x400) exception. One bit in the PTE obtained from the TLB is used to check whether an instruction access is allowed -- the privilege bit (bit 3). If bit 3 is 1 and priv_mode=0, then a fetch_failed indication is sent down to fetch2 and to decode1, which generates an OP_FETCH_FAILED. Any PTEs with PTE bit 0 (EAA[3]) clear or bit 8 (R) clear should not be put into the iTLB since such PTEs would not allow execution by any context. Tlbie operations get sent from mmu to icache over a new connection. Unfortunately the privileged instruction tests are broken for now. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	3eb07dc637	MMU: Refetch PTE on access fault This is required by the architecture. It means that the error bits reported in DSISR or SRR1 now come from the permission/RC check done on the refetched PTE rather than the TLB entry. Unfortunately that somewhat breaks the software-loaded TLB mode of operation in that DSISR/SRR1 always report no PTE rather than permission error or RC failure. This also restructures the loadstore1 state machine a bit, combining the FIRST_ACK_WAIT and LAST_ACK_WAIT states into a single state and the MMU_LOOKUP_1ST and MMU_LOOKUP_LAST states likewise. We now have a 'dwords_done' bit to say whether the first transfer of two (for an unaligned access) has been done. The cache paradox error (where a non-cacheable access finds a hit in the cache) is now the only cause of DSI from the dcache. This should probably be a machine check rather than DSI in fact. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	f6a0d7f9da	MMU: Implement data segment interrupts A data segment interrupt (DSegI) occurs when an address to be translated by the MMU is outside the range of the radix tree or the top two bits of the address (the quadrant) are 01 or 10. This is detected in a new state of the MMU state machine, and is sent back to loadstore1 as an error, which sends it on to execute1 to generate an interrupt to the 0x380 vector. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	4e6fc6811a	MMU: Implement radix page table machinery This adds the necessary machinery to the MMU for it to do radix page table walks. The core elements are a shifter that can shift the address right by between 0 and 47 bits, a mask generator that can generate a mask of between 5 and 16 bits, a final mask generator, and new states in the state machine. (The final mask generator is used for transferring bits of the original address into the resulting TLB entry when the leaf PTE corresponds to a page size larger than 4kB.) The hardware does not implement a partition table or a process table. Software is expected to load the appropriate process table entry into a new SPR called PGTBL0, SPR 720. The contents should be formatted as described in Book III section 5.7.6.2 of the Power ISA v3.0B. PGTBL0 is set to 0 on hard reset. At present, the top two bits of the address (the quadrant) are ignored. There is currently no caching of any step in the translation process or of the final result, other than the entry created in the dTLB. That entry is a 4k page entry even if the leaf PTE found in the walk corresponds to a larger page size. This implementation can handle almost any page table layout and any page size. The RTS field (in PGTBL0) can have any value between 0 and 31, corresponding to a total address space size between 2^31 and 2^62 bytes. The RPDS field of PGTBL0 can be any value between 5 and 16, except that a value of 0 is taken to disable radix page table walking (for use when one is using software loading of TLB entries). The NLS field of the page directory entries can have any value between 5 and 16. The minimum page size is 4kB, meaning that the sum of RPDS and the NLS values of the PDEs found on the path to a leaf PTE must be less than or equal to RTS + 31 - 12. The PGTBL0 SPR is in the mmu module; thus this adds a path for loadstore1 to read and write SPRs in mmu. This adds code in dcache to service doubleword read requests from the MMU, as well as requests to write dTLB entries. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	8160f4f821	Add framework for implementing an MMU This adds a new module to implement an MMU. At the moment it doesn't do very much. Tlbie instructions now get sent by loadstore1 to mmu, which sends them to dcache, rather than loadstore1 sending them directly to dcache. TLB misses from dcache now get sent by loadstore1 to mmu, which currently just returns an error. Loadstore1 then generates a DSI in response to the error return from mmu. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	d47fbf88d1	Implement access permission checks This adds logic to the dcache to check the permissions encoded in the PTE that it gets from the dTLB. The bits that are checked are: R must be 1; C must be 1 for a store; EAA(0): if this is 1, MSR[PR] must be 0; EAA(2) must be 1 for a store; EAA(1) | EAA(2) must be 1 for a load. In addition, ATT(0) is used to indicate a cache-inhibited access. This now implements DSISR bits 36, 38 and 45. (Bit numbers above correspond to the ISA, i.e. using big-endian numbering.) MSR[PR] is now conveyed to loadstore1 for use in permission checking. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	42d0fcc511	Implement data storage interrupts This adds a path from loadstore1 back to execute1 for reporting errors, and machinery in execute1 for generating data storage interrupts at vector 0x300. If dcache is given two requests in successive cycles and the first encounters an error (e.g. a TLB miss), it will now cancel the second request. Loadstore1 now responds to errors reported by dcache by sending an exception signal to execute1 and returning to the idle state. Execute1 then writes SRR0 and SRR1 and jumps to the 0x300 Data Storage Interrupt vector. DAR and DSISR are held in loadstore1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	750b3a8e28	dcache: Implement data TLB This adds a TLB to dcache, providing the ability to translate addresses for loads and stores. No protection mechanism has been implemented yet. The MSR_DR bit controls whether addresses are translated through the TLB. The TLB is a fixed-pagesize, set-associative cache. Currently the page size is 4kB and the TLB is 2-way set associative with 64 entries per set. This implements the tlbie instruction. RB bits 10 and 11 control whether the whole TLB is invalidated (if either bit is 1) or just a single entry corresponding to the effective page number in bits 12-63 of RB. As an extension until we get a hardware page table walk, a tlbie instruction with RB bits 9-11 set to 001 will load an entry into the TLB. The TLB entry value is in RS in the format of a radix PTE. Currently there is no proper handling of TLB misses. The load or store will not be performed but no interrupt is generated. In order to make timing at 100MHz on the Arty A7-100, we compare the real address from each way of the TLB with the tag from each way of the cache in parallel (requiring # TLB ways * # cache ways comparators). Then the result is selected based on which way hit in the TLB. That avoids a timing path going through the TLB EA comparators, the multiplexer that selects the RA, and the cache tag comparators. The hack where addresses of the form 0xc------- are marked as cache-inhibited is kept for now but restricted to real-mode accesses. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	635e316f9b	Pass mtspr/mfspr to MMU-related SPRs down to loadstore1 This arranges for some mfspr and mtspr to get sent to loadstore1 instead of being handled in execute1. In particular, DAR and DSISR are handled this way. They are therefore "slow" SPRs. While we're at it, fix the spelling of HEIR and remove mention of DAR and DSISR from the comments in execute1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	102b304db7	Merge remote-tracking branch 'remotes/origin/master'	5 years ago
Paul Mackerras	041d6bef60	dcache: Implement the dcbz instruction This adds logic to dcache and loadstore1 to implement dcbz. For now it zeroes a single cache line (by default 64 bytes), not 128 bytes like IBM Power processors do. The dcbz operation is performed much like a load miss, except that we are writing zeroes to memory instead of reading. As each ack comes back, we write zeroes to the BRAM instead of data from memory. In this way we zero the line in memory and also zero the line of cache memory, establishing the line in the cache if it wasn't already resident. If it was already resident then we overwrite the existing line in the cache. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	167e37d667	Plumb insn_type through to loadstore1 In preparation for adding a TLB to the dcache, this plumbs the insn_type from execute1 through to loadstore1, so that we can have other operations besides loads and stores (e.g. tlbie) going to loadstore1 and thence to the dcache. This also plumbs the unit field of the decode ROM from decode2 through to execute1 to simplify the logic around which ops need to go to loadstore1. The load and store data formatting are now not conditional on the op being OP_LOAD or OP_STORE. This eliminates the inferred latches clocked by each of the bits of r.op that we were getting previously. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	b55c9cc298	execute1: Improve architecture compliance of MSR and related instructions This makes our treatment of the MSR conform better with the ISA. - On reset, initialize the MSR to have the SF and LE bits set and all the others reset. For good measure initialize r properly too. - Fix the bit numbering in msr_copy (the code was using big-endian bit numbers, not little-endian). - Use constants like MSR_EE to index MSR bits instead of expressions like '63 - 48', for readability. - Set MSR[SF, LE] and clear MSR[PR, IR, DR, RI] on interrupts. - Copy the relevant fields for rfid instead of using msr_copy, because the partial function fields of the MSR should be left unchanged, not zeroed. Our implementation of rfid is like the architecture description of hrfid, because we don't implement hypervisor mode. - Return the whole MSR for mfmsr. - Implement the L field for mtmsrd (L=1 copies just EE and RI). - For mtmsrd with L=0, leave out the HV, ME and LE bits as per the arch. - For mtmsrd and rfid, if PR ends up set, then also set EE, IR and DR as per the arch. - A few other minor tidyups (no semantic change). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Michael Neuling	b4f20c20b9	XICS interrupt controller New unified ICP and ICS XICS compliant interrupt controller. Configurable number of hardware sources. Fixed hardware source number based on hardware line taken. All hardware interrupts are a fixed priority. Level interrupts supported only. Hardwired to 0xc0004000 in SOC (UART is kept at 0xc0002000). Signed-off-by: Michael Neuling <mikey@neuling.org>	5 years ago
Paul Mackerras	fb8f3da128	Give exceptions a separate path to writeback This adds separate fields in Execute1ToWritebackType for use in writing SRR0/1 (and in future other SPRs) on an interrupt. With this, we make timing once again on the Arty A7-100 -- previously we were missing by 0.2ns, presumably due to the result mux being wider than before. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Michael Neuling	5ef5604b65	Add sc, illegal and decrementer exceptions and some supervisor state This adds the following exceptions: - 0x700 program check (for illegal instructions) - 0x900 decrementer - 0xc00 system call This also adds some supervisor state: - decrementer - msr (SPRG0/1 and SRR0/1 already exist as fast SPRs) It also adds some supporting instructions: - rfid - mtmsrd - mfmsr - sc MSR state is added but only EE is used in this patch set. Other bits are read/written but are not used at all. This adds a 2 stage state machine to execute1.vhdl. This state machine allows fast SPRs SRR0/1 to be written in different cycles. This state machine can be extended later to add DAR and DSISR SPR writing for more complex exceptions like page faults. Signed-off-by: Michael Neuling <mikey@neuling.org>	5 years ago
Paul Mackerras	81369187c0	loadstore1: Add support for cache-inhibited load and store instructions This adds support for lbzcix, lhzcix, lwzcix, ldcix, stbcix, sthcix, stwcix and stdcix. The temporary hack where accesses to addresses of the form 0xc??????? are made non-cacheable is left in for now to avoid making existing programs non-functional. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	4e38c2cc21	loadstore1: Move load data formatting from writeback to loadstore1 This puts all the data formatting (byte rotation based on lowest three bits of the address, byte reversal, sign extension, zero extension) in loadstore1. Writeback now simply sends the data provided to the register files. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	b349cc891a	loadstore1: Move logic from dcache to loadstore1 So that the dcache could in future be used by an MMU, this moves logic to do with data formatting, rA updates for update-form instructions, and handling of unaligned loads and stores out of dcache and into loadstore1. For now, dcache connects only to loadstore1, and loadstore1 now has the connection to writeback. Dcache generates a stall signal to loadstore1 which indicates that the request presented in the current cycle was not accepted and should be presented again. However, loadstore1 doesn't currently use it because we know that we can never hit the circumstances where it might be set. For unaligned transfers, loadstore1 generates two requests to dcache back-to-back, and then waits to see two acks back from dcache (cycles where d_in.valid is true). Loadstore1 now has a FSM for tracking how many acks we are expecting from dcache and for doing the rA update cycles when necessary. Handling for reservations and conditional stores is still in dcache. Loadstore1 now generates its own stall signal back to decode2, so we no longer need the logic in execute1 that generated the stall for the first two cycles. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	81d777be02	dcache: Trim one cycle from the load hit path Currently we don't get the result from a load that hits in the dcache until the fourth cycle after the instruction was presented to loadstore1. This trims this back to 3 cycles by taking the low order bits of the address generated in loadstore1 into dcache directly (not via the output register of loadstore1) and using them to address the read port of the dcache data RAM. We use the lower 12 address bits here in the expectation that any reasonable data cache design will have a set size of 4kB or less in order to avoid the aliasing problems that can arise with a virtually-indexed physically-tagged cache if the set size is greater than the smallest page size provided by the MMU. With this we can get rid of r2 and drive the signals going to writeback from r1, since the load hit data is now available one cycle earlier. We need a multiplexer on the read address of the data cache RAM in order to handle the second doubleword of an unaligned access. One small complication is that we now need an extra cycle in the case of an unaligned load which misses in the data cache and which reads the 2nd-last and last doublewords of a cache line. This is the reason for the PRE_NEXT_DWORD state; if we just go straight to NEXT_DWORD then we end up having the write of the last doubleword of the cache line and the read of that same doubleword occurring in the same cycle, which means we read stale data rather than the just-fetched data. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	5d85ede97d	dcache: Implement load-reserve and store-conditional instructions This involves plumbing the (existing) 'reserve' and 'rc' bits in the decode tables down to dcache, and 'rc' and 'store_done' bits from dcache to writeback. It turns out that we had 'RC' set in the 'rc' column for several ordinary stores and for the attn instruction. This corrects them to 'NONE', and sets the 'rc' column to 'ONE' for the conditional stores. In writeback we now have logic to set CR0 when the input from dcache has rc = 1. In dcache we have the reservation itself, which has a valid bit and the address down to cache line granularity. We don't currently store the reservation length. For a store conditional which fails, we set a 'cancel_store' signal which inhibits the write to the cache and prevents the state machine from starting a bus cycle or going to the STORE_WAIT_ACK state. Instead we set r1.stcx_fail which causes the instruction to complete in the next cycle with rc=1 and store_done=0. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	5422007f83	Plumb loadstore1 input from execute1 not decode2 This allows us to use the bypass at the input of execute1 for the address and data operands for loadstore1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	b14d982011	execute: Implement bypass from output of execute1 to input This enables back-to-back execution of integer instructions where the first instruction writes a GPR and the second reads the same GPR. This is done with a set of multiplexers at the start of execute1 which enable any of the three input operands to be taken from the output of execute1 (i.e. r.e.write_data) rather than the input from decode2 (i.e. e_in.read_data[123]). This also requires changes to the hazard detection and handling. Decode2 generates a signal indicating that the GPR being written is available for bypass, which is true for instructions that are executed in execute1 (rather than loadstore1/dcache). The gpr_hazard module stores this "bypassable" bit, and if the same GPR needs to be read by a subsequent instruction, it outputs a "use_bypass" signal rather than generating a stall. The use_bypass signal is then latched at the output of decode2 and passed down to execute1 to control the input multiplexer. At the moment there is no bypass on the inputs to loadstore1, but that is OK because all load and store instructions are marked as single-issue. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	d956846667	execute1: Move EXTS* instruction back into execute1 This moves the sign extension done by the extsb, extsh and extsw instructions back into execute1. This means that we no longer need any data formatting in writeback for results coming from execute1, so this modifies writeback so the data formatter inputs come directly from the loadstore unit output. The condition code updates for RC=1 form instructions are now done on the value from execute1 rather than the output of the data formatter, which should help timing. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
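
The icache TLB commit (3d4712ad43) describes the iTLB index as an XOR of three address fields: bits 12--17, 18--23 and 24--29 for the default 64-entry TLB. The following is a minimal Python sketch of that hash, not the VHDL itself. It assumes the bit ranges are numbered from the least-significant bit (bit 0) and that the page size is 4kB; the names `TLB_BITS`, `PAGE_SHIFT` and `itlb_index` are illustrative and do not come from the source.

```python
# Behavioural sketch of the iTLB index hash described in commit 3d4712ad43.
# Assumptions (not from the source): bit 0 is the least-significant bit,
# pages are 4kB, and all names below are hypothetical.

TLB_BITS = 6      # log2(64 entries)
PAGE_SHIFT = 12   # log2(4kB page size)


def itlb_index(ea: int) -> int:
    """Return the direct-mapped TLB index for effective address `ea`."""
    mask = (1 << TLB_BITS) - 1
    field0 = (ea >> PAGE_SHIFT) & mask                    # bits 12..17
    field1 = (ea >> (PAGE_SHIFT + TLB_BITS)) & mask       # bits 18..23
    field2 = (ea >> (PAGE_SHIFT + 2 * TLB_BITS)) & mask   # bits 24..29
    return field0 ^ field1 ^ field2


if __name__ == "__main__":
    for addr in (0x0000_1000, 0x0004_1000, 0x0003_F000):
        print(f"{addr:#010x} -> entry {itlb_index(addr)}")
```

Because the TLB is direct-mapped and, per the commit message, invalidations clear the indexed entry without checking its contents, two addresses that hash to the same index simply evict each other.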

91 Commits (fb5115c9445fe03142946b195028d90a00c6acee)
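
The access permission commit (d47fbf88d1) lists the PTE bits the dcache checks for loads and stores. The rules can be sketched as a small Python predicate. This is an illustrative model only: the function name, argument names and the representation of EAA as a sequence indexed like EAA(0), EAA(1), EAA(2) in the commit message are assumptions, and the sketch does not model which DSISR bit is reported on failure or the cache-inhibited ATT(0) bit.

```python
# Behavioural sketch of the dcache permission rules listed in commit d47fbf88d1.
# All names are hypothetical; `eaa` is indexed the same way the commit message
# writes EAA(0), EAA(1), EAA(2).

from typing import Sequence


def access_allowed(r: bool, c: bool, eaa: Sequence[bool],
                   msr_pr: bool, is_store: bool) -> bool:
    """Return True if the PTE permits the access under the listed rules."""
    if not r:                      # R must be 1
        return False
    if eaa[0] and msr_pr:          # EAA(0) = 1 requires MSR[PR] = 0
        return False
    if is_store:
        # C must be 1 and EAA(2) must be 1 for a store
        return bool(c and eaa[2])
    # EAA(1) | EAA(2) must be 1 for a load
    return bool(eaa[1] or eaa[2])


if __name__ == "__main__":
    read_only = (False, True, False)
    print(access_allowed(True, False, read_only, msr_pr=True, is_store=False))
    print(access_allowed(True, True, read_only, msr_pr=True, is_store=True))
```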