microwatt

Commit Graph

Author	SHA1	Message	Date
Paul Mackerras	a4500c63a2	dcache: Reduce back-to-back store latency from 3 cycles to 2 This uses the machinery we already had for comparing the real address of a new request with the tag of a previous request (r1.reload_tag) to get better timing on comparing the address of a second store with the one in progress. The comparison is now on the set size rather than the page size, but since set size can't be larger than the page size (and usually will equal the page size), that is OK. The same comparison can also be used to tell when we can satisfy a load miss during a cache line refill. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	b595963233	dcache: Reduce latencies and improve timing This implements various improvements to the dcache with the aim of making it go faster. - We can now execute operations that don't need to access main memory (cacheable loads that hit in the cache and TLB operations) as soon as any previous operation has completed, without waiting for the state machine to become idle. - Cache line refills start with the doubleword that is needed to satisfy the load that initiated them. - Cacheable loads that miss return their data and complete as soon as the requested doubleword comes back from memory; they don't wait for the refill to finish. - We now have per-doubleword valid bits for the cache line being refilled, meaning that if a load comes in for a line that is in the process of being refilled, we can return the data and complete it within a couple of cycles of the doubleword coming in from memory. - There is now a bypass path for data being written to the cache RAM so that we can do a store hit followed immediately by a load hit to the same doubleword. This also makes the data from a refill available to load hits one cycle earlier than it would be otherwise. - Stores complete in the cycle where their wishbone operation is initiated, without waiting for the wishbone cycle to complete. - During the wishbone cycle for a store, if another store comes in that is to the same page, and we don't have a stall from the wishbone, we can send out the write for the second store in the same wishbone cycle and without going through the IDLE state first. We limit it to 7 outstanding writes that have not yet been acknowledged. - The cache tag RAM is now read on a clock edge rather than being combinatorial for reading. Its width is rounded up to a multiple of 8 bits per way so that byte enables can be used for writing individual tags. - The cache tag RAM is now written a cycle later than previously, in order to ease timing. - Data for a store hit is now written one cycle later than previously. This eases timing since we don't have to get through the tag matching and on to the write enable within a single cycle. The 2-stage bypass path means we can still handle a load hit on either of the two cycles after the store and return the correct data. (A load hit 3 or more cycles later will get the correct data from the BRAM.) - Operations can sit in r0 while there is an uncompleted operation in r1. Once the operation in r1 is completed, the operation in r0 spends one cycle in r0 for TLB/cache tag lookup and then gets put into r1.req. This can happen before r1 gets to the IDLE state. Some operations can then be completed before r1 gets to the IDLE state - a load miss to the cache line being refilled, or a store to the same page as a previous store. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	49a4d9f67a	Add core logging This logs 256 bits of data per cycle to a ring buffer in BRAM. The data collected can be read out through 2 new SPRs or through the debug interface. The new SPRs are LOG_ADDR (724) and LOG_DATA (725). LOG_ADDR contains the buffer write pointer in the upper 32 bits (in units of entries, i.e. 32 bytes) and the read pointer in the lower 32 bits (in units of doublewords, i.e. 8 bytes). Reading LOG_DATA gives the doubleword from the buffer at the read pointer and increments the read pointer. Setting bit 31 of LOG_ADDR inhibits the trace log system from writing to the log buffer, so the contents are stable and can be read. There are two new debug addresses which function similarly to the LOG_ADDR and LOG_DATA SPRs. The log is frozen while either or both of the LOG_ADDR SPR bit 31 or the debug LOG_ADDR register bit 31 are set. The buffer defaults to 2048 entries, i.e. 64kB. The size is set by the LOG_LENGTH generic on the core_debug module. Software can determine the length of the buffer because the length is ORed into the buffer write pointer in the upper 32 bits of LOG_ADDR. Hence the length of the buffer can be calculated as 1 << (31 - clz(LOG_ADDR)). There is a program to format the log entries in a somewhat readable fashion in scripts/fmt_log/fmt_log.c. The log_entry struct in that file describes the layout of the bits in the log entries. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Benjamin Herrenschmidt	ecaa5e2fb2	dcache: Rework RAM wrapper to synthetize better on Xilinx The global wr_en signal is causing Vivado to generate two TDP (True Dual Port) block RAMs instead of one SDP (Simple Dual Port) for each cache way. Remove it and instead apply a AND to the individual byte write enables. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Paul Mackerras	eca0fb5bf1	dcache: Fix bug in store hit after dcbz case This fixes a bug where a store that hits in the dcache immediately following a dcbz has its write to the cache RAM suppressed (but not its write to memory). If a load to the same location comes along before the cache line gets replaced, the load will return incorrect data. Fixes: `4db1676ef8` ("dcache: Don't assert on dcbz cache hit") Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	a658766fcf	Implement slbia as a dTLB/iTLB flush Slbia (with IH=7) is used in the Linux kernel to flush the ERATs (our iTLB/dTLB), so make it do that. This moves the logic to work out whether to flush a single entry or the whole TLB from dcache and icache into mmu. We now invalidate all dTLB and iTLB entries when the AP (actual pagesize) field of RB is non-zero on a tlbie[l], as well as when IS is non-zero. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	dee3783d79	MMU: Remove software-loaded dTLB mode This removes the hack where the tlbie instruction could be used to load entries directly into the dTLB, because we don't report the correct DSISR values for accesses that hit software-loaded dTLB entries and have privilege or permission errors. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	3eb07dc637	MMU: Refetch PTE on access fault This is required by the architecture. It means that the error bits reported in DSISR or SRR1 now come from the permission/RC check done on the refetched PTE rather than the TLB entry. Unfortunately that somewhat breaks the software-loaded TLB mode of operation in that DSISR/SRR1 always report no PTE rather than permission error or RC failure. This also restructures the loadstore1 state machine a bit, combining the FIRST_ACK_WAIT and LAST_ACK_WAIT states into a single state and the MMU_LOOKUP_1ST and MMU_LOOKUP_LAST states likewise. We now have a 'dwords_done' bit to say whether the first transfer of two (for an unaligned access) has been done. The cache paradox error (where a non-cacheable access finds a hit in the cache) is now the only cause of DSI from the dcache. This should probably be a machine check rather than DSI in fact. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	4e6fc6811a	MMU: Implement radix page table machinery This adds the necessary machinery to the MMU for it to do radix page table walks. The core elements are a shifter that can shift the address right by between 0 and 47 bits, a mask generator that can generate a mask of between 5 and 16 bits, a final mask generator, and new states in the state machine. (The final mask generator is used for transferring bits of the original address into the resulting TLB entry when the leaf PTE corresponds to a page size larger than 4kB.) The hardware does not implement a partition table or a process table. Software is expected to load the appropriate process table entry into a new SPR called PGTBL0, SPR 720. The contents should be formatted as described in Book III section 5.7.6.2 of the Power ISA v3.0B. PGTBL0 is set to 0 on hard reset. At present, the top two bits of the address (the quadrant) are ignored. There is currently no caching of any step in the translation process or of the final result, other than the entry created in the dTLB. That entry is a 4k page entry even if the leaf PTE found in the walk corresponds to a larger page size. This implementation can handle almost any page table layout and any page size. The RTS field (in PGTBL0) can have any value between 0 and 31, corresponding to a total address space size between 2^31 and 2^62 bytes. The RPDS field of PGTBL0 can be any value between 5 and 16, except that a value of 0 is taken to disable radix page table walking (for use when one is using software loading of TLB entries). The NLS field of the page directory entries can have any value between 5 and 16. The minimum page size is 4kB, meaning that the sum of RPDS and the NLS values of the PDEs found on the path to a leaf PTE must be less than or equal to RTS + 31 - 12. The PGTBL0 SPR is in the mmu module; thus this adds a path for loadstore1 to read and write SPRs in mmu. This adds code in dcache to service doubleword read requests from the MMU, as well as requests to write dTLB entries. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	8160f4f821	Add framework for implementing an MMU This adds a new module to implement an MMU. At the moment it doesn't do very much. Tlbie instructions now get sent by loadstore1 to mmu, which sends them to dcache, rather than loadstore1 sending them directly to dcache. TLB misses from dcache now get sent by loadstore1 to mmu, which currently just returns an error. Loadstore1 then generates a DSI in response to the error return from mmu. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	d47fbf88d1	Implement access permission checks This adds logic to the dcache to check the permissions encoded in the PTE that it gets from the dTLB. The bits that are checked are: R must be 1 C must be 1 for a store EAA(0) - if this is 1, MSR[PR] must be 0 EAA(2) must be 1 for a store EAA(1) \| EAA(2) must be 1 for a load In addition, ATT(0) is used to indicate a cache-inhibited access. This now implements DSISR bits 36, 38 and 45. (Bit numbers above correspond to the ISA, i.e. using big-endian numbering.) MSR[PR] is now conveyed to loadstore1 for use in permission checking. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	42d0fcc511	Implement data storage interrupts This adds a path from loadstore1 back to execute1 for reporting errors, and machinery in execute1 for generating data storage interrupts at vector 0x300. If dcache is given two requests in successive cycles and the first encounters an error (e.g. a TLB miss), it will now cancel the second request. Loadstore1 now responds to errors reported by dcache by sending an exception signal to execute1 and returning to the idle state. Execute1 then writes SRR0 and SRR1 and jumps to the 0x300 Data Storage Interrupt vector. DAR and DSISR are held in loadstore1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	750b3a8e28	dcache: Implement data TLB This adds a TLB to dcache, providing the ability to translate addresses for loads and stores. No protection mechanism has been implemented yet. The MSR_DR bit controls whether addresses are translated through the TLB. The TLB is a fixed-pagesize, set-associative cache. Currently the page size is 4kB and the TLB is 2-way set associative with 64 entries per set. This implements the tlbie instruction. RB bits 10 and 11 control whether the whole TLB is invalidated (if either bit is 1) or just a single entry corresponding to the effective page number in bits 12-63 of RB. As an extension until we get a hardware page table walk, a tlbie instruction with RB bits 9-11 set to 001 will load an entry into the TLB. The TLB entry value is in RS in the format of a radix PTE. Currently there is no proper handling of TLB misses. The load or store will not be performed but no interrupt is generated. In order to make timing at 100MHz on the Arty A7-100, we compare the real address from each way of the TLB with the tag from each way of the cache in parallel (requiring # TLB ways * # cache ways comparators). Then the result is selected based on which way hit in the TLB. That avoids a timing path going through the TLB EA comparators, the multiplexer that selects the RA, and the cache tag comparators. The hack where addresses of the form 0xc------- are marked as cache-inhibited is kept for now but restricted to real-mode accesses. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	4db1676ef8	dcache: Don't assert on dcbz cache hit We can hit the assert for req_op = OP_STORE_HIT and reloading in the case of dcbz, since it looks like a store. Therefore we need to exclude that case from the assert. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	041d6bef60	dcache: Implement the dcbz instruction This adds logic to dcache and loadstore1 to implement dcbz. For now it zeroes a single cache line (by default 64 bytes), not 128 bytes like IBM Power processors do. The dcbz operation is performed much like a load miss, except that we are writing zeroes to memory instead of reading. As each ack comes back, we write zeroes to the BRAM instead of data from memory. In this way we zero the line in memory and also zero the line of cache memory, establishing the line in the cache if it wasn't already resident. If it was already resident then we overwrite the existing line in the cache. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	b349cc891a	loadstore1: Move logic from dcache to loadstore1 So that the dcache could in future be used by an MMU, this moves logic to do with data formatting, rA updates for update-form instructions, and handling of unaligned loads and stores out of dcache and into loadstore1. For now, dcache connects only to loadstore1, and loadstore1 now has the connection to writeback. Dcache generates a stall signal to loadstore1 which indicates that the request presented in the current cycle was not accepted and should be presented again. However, loadstore1 doesn't currently use it because we know that we can never hit the circumstances where it might be set. For unaligned transfers, loadstore1 generates two requests to dcache back-to-back, and then waits to see two acks back from dcache (cycles where d_in.valid is true). Loadstore1 now has a FSM for tracking how many acks we are expecting from dcache and for doing the rA update cycles when necessary. Handling for reservations and conditional stores is still in dcache. Loadstore1 now generates its own stall signal back to decode2, so we no longer need the logic in execute1 that generated the stall for the first two cycles. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	ef9c1efd72	dcache: Remove LOAD_UPDATE2 state Since we removed one cycle from the load hit case, we actually no longer need the extra cycle provided by having the LOAD_UPDATE state. Therefore this makes the load hit case in the IDLE and NEXT_DWORD states go to LOAD_UPDATE2 rather than LOAD_UPDATE. Then we remove LOAD_UPDATE and then rename LOAD_UPDATE2 to LOAD_UPDATE. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	81d777be02	dcache: Trim one cycle from the load hit path Currently we don't get the result from a load that hits in the dcache until the fourth cycle after the instruction was presented to loadstore1. This trims this back to 3 cycles by taking the low order bits of the address generated in loadstore1 into dcache directly (not via the output register of loadstore1) and using them to address the read port of the dcache data RAM. We use the lower 12 address bits here in the expectation that any reasonable data cache design will have a set size of 4kB or less in order to avoid the aliasing problems that can arise with a virtually-indexed physically-tagged cache if the set size is greater than the smallest page size provided by the MMU. With this we can get rid of r2 and drive the signals going to writeback from r1, since the load hit data is now available one cycle earlier. We need a multiplexer on the read address of the data cache RAM in order to handle the second doubleword of an unaligned access. One small complication is that we now need an extra cycle in the case of an unaligned load which misses in the data cache and which reads the 2nd-last and last doublewords of a cache line. This is the reason for the PRE_NEXT_DWORD state; if we just go straight to NEXT_DWORD then we end up having the write of the last doubleword of the cache line and the read of that same doubleword occurring in the same cycle, which means we read stale data rather than the just-fetched data. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	5d85ede97d	dcache: Implement load-reserve and store-conditional instructions This involves plumbing the (existing) 'reserve' and 'rc' bits in the decode tables down to dcache, and 'rc' and 'store_done' bits from dcache to writeback. It turns out that we had 'RC' set in the 'rc' column for several ordinary stores and for the attn instruction. This corrects them to 'NONE', and sets the 'rc' column to 'ONE' for the conditional stores. In writeback we now have logic to set CR0 when the input from dcache has rc = 1. In dcache we have the reservation itself, which has a valid bit and the address down to cache line granularity. We don't currently store the reservation length. For a store conditional which fails, we set a 'cancel_store' signal which inhibits the write to the cache and prevents the state machine from starting a bus cycle or going to the STORE_WAIT_ACK state. Instead we set r1.stcx_fail which causes the instruction to complete in the next cycle with rc=1 and store_done=0. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	94dd8bc480	dcache: Add support for unaligned loads and stores For an unaligned load or store, we do the first doubleword (dword) of the transfer as normal, but then go to a new NEXT_DWORD state of the state machine to do the cache tag lookup for the second dword of the transfer. From the NEXT_DWORD state we have much the same transitions to other states as from the IDLE state (the transitions for OP_LOAD_HIT are a bit different but almost identical for the other op values). We now do the preparation of the data to be written in loadstore1, that is, byte reversal if necessary and rotation by a number of bytes based on the low 3 bits of the address. We do rotation not shifting so we have the bytes that need to go into the second doubleword in the right place in the low bytes of the data sent to dcache. The rotation and byte reversal are done in a single step with one multiplexer per byte by setting the select inputs for each byte appropriately. This also fixes writeback to not write the register value until it has received both pieces of an unaligned load value. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	1587d9e6eb	dcache: Fix obscure bug and minor cleanups The obscure bug is that a non-cacheable load with update would never do the update and would never complete the instruction. This is fixed by making state NC_LOAD_WAIT_ACK go to LOAD_UPDATE2 if r1.req.update is set. The slow load forms with update can go to LOAD_UPDATE2 at the end rather than LOAD_UPDATE, thus saving a cycle. Loads with a cache hit need the LOAD_UPDATE state in the third cycle since they are not writing back until the 4th cycle, when the state is LOAD_UPDATE2. Slow loads (cacheable loads that miss and non-cacheable loads) currently go to LOAD_UPDATE in the cycle after they see r1.wb.ack = 1 for the last time, but that cycle is the cycle where they write back, and the following cycle does nothing. Going to LOAD_UPDATE2 in those cases saves a cycle and makes them consistent with the load hit case. The logic in the RELOAD_WAIT_ACK case doesn't need to check r1.req.load = '1' since we only ever use RELOAD_WAIT_ACK for loads. There are also some whitespace fixes and a typo fix. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Benjamin Herrenschmidt	501b6daf9b	Add basic XER support The carry is currently internal to execute1. We don't handle any of the other XER fields. This creates type called "xer_common_t" that contains the commonly used XER bits (CA, CA32, SO, OV, OV32). The value is stored in the CR file (though it could be a separate module). The rest of the bits will be implemented as a separate SPR and the two parts reconciled in mfspr/mtspr in latter commits. We always read XER in decode2 (there is little point not to) and send it down all pipeline branches as it will be needed in writeback for all type of instructions when CR0:SO needs to be updated (such forms exist for all pipeline branches even if we don't yet implement them). To avoid having to track XER hazards, we forward it back in EX1. This assumes that other pipeline branches that can modify it (mult and div) are running single issue for now. One additional hazard to beware of is an XER:SO modifying instruction in EX1 followed immediately by a store conditional. Due to our writeback latency, the store will go down the LSU with the previous XER value, thus the stcx. will set CR0:SO using an obsolete SO value. I doubt there exist any code relying on this behaviour being correct but we should account for it regardless, possibly by ensuring that stcx. remain single issue initially, or later by adding some minimal tracking or moving the LSU into the same pipeline as execute. Missing some obscure XER affecting instructions like addex or mcrxrx. [paulus@ozlabs.org - fix CA32 and OV32 for OP_ADD, fix order of arguments to set_ov] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Benjamin Herrenschmidt	9a63c098a5	Move log2/ispow2 to a utils package (Out of icache and dcache) Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	d363daa692	dcache: Add wishbone pipelining support Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	1a63c39704	Make it possible to change wishbone address size All that needs to be changed now is the size in wishbone_types.vhdl and the address decoder in soc.vhdl Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	6e0ee0b0db	icache & dcache: Fix store way variable We used the variable "way" in the wrong state in the cache when updating a line valid bit after the end of the wishbone transactions, we need to use the latched "store_way". Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	587a5e3c45	dcache: Cleanup (mostly cosmetic) Clearly separate the 2 stages of load hits, improve naming and comments, clarify the writeback controls etc... Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	174378b190	dcache: Introduce an extra cycle latency to make timing This makes the BRAMs use an output buffer, introducing an extra cycle latency. Without this, Vivado won't make timing at 100Mhz. We stash all the necessary response data in delayed latches, the extra cycle is NOT a state in the state machine, thus it's fully pipelined and doesn't involve stalling. This introduces an extra non-pipelined cycle for loads with update to avoid collision on the writeback output between the now delayed load data and the register update. We could avoid it by moving the register update in the pipeline bubble created by the extra update state, but it's a bit trickier, so I leave that for a latter optimization. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	b513f0fb48	dcache: Add a dcache This replaces loadstore2 with a dcache The dcache unit is losely based on the icache one (same basic cache layout), but has some significant logic additions to deal with stores, loads with update, non-cachable accesses and other differences due to operating in the execution part of the pipeline rather than the fetch part. The cache is store-through, though a hit with an existing line will update the line rather than invalidate it. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago

29 Commits (fb5c16d05e31983f1127e2f8a97d60f0ce0b4d81)