microwatt

Commit Graph

Author	SHA1	Message	Date
Paul Mackerras	152eef1156	Merge pull request #446 from paulusmack/master Implement hypervisor doorbells	3 days ago
Paul Mackerras	d2bf3f3580	core: Implement hypervisor doorbell interrupt and msg* instructions This implements the hypervisor doorbell exception and interrupt and the msgsnd, msgclr and msgsync instructions (msgsync is a no-op). The msgsnd instruction can generate a hypervisor doorbell interrupt on any CPU in the system. To achieve this, each core sends its hypervisor doorbell messages to the soc level, which ORs together the bits for each CPU and sends it to that CPU. The privileged doorbell exception/interrupt and the msgsndp/msgclrp instructions are not required since we don't implement SMT. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	1 week ago
Paul Mackerras	ca872faede	core: Consolidate several OP_* values into a single OP_COMPUTE This replaces OP_ADDG6S, OP_BCD, OP_BREV, OP_CMPB, OP_CMPEQB, OP_CMPRB, OP_CROP, OP_EXTS, OP_EXTSWSLI, OP_ISEL, OP_LOGIC, OP_MFCR, OP_PRTY, OP_RLC, OP_RLCL, OP_RLCR, OP_SETB, OP_SHL, OP_SHR, and OP_XOR with a single OP_COMPUTE. The replaced operations are all ones which just compute a result value (for GPR or CR) in execute1, don't have any other side effects, and aren't used in decode2 to determine other signals. The operation to be performed is sufficiently defined by the result and subresult fields in the decode table. With the elimination of OP_SPARE, this reduces the number of insn_type_t values to 44. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	1 week ago
Paul Mackerras	a764fd464e	Merge pull request #445 from paulusmack/master Various improvements, including SMP support for the Acorn-CLE-215 board.	1 week ago
Paul Mackerras	8f6c727309	execute1: Rework data paths for mfspr and mtspr Data being written to an SPR by mtspr now comes in to execute2 via ex1.write_spr_data (renamed from ex1.ramspr_odd_data) rather than ex1.e.write_data. This eliminates the need for the main result mux in execute1 to be able to pass the c_in value through. For mfspr, the no-op behaviour is obtained by selecting ex1.write_spr_data as spr_result in execute2. We already had ex1.write_spr_data being set from c_in, so no new logic is required there. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 months ago
Paul Mackerras	fc3ff2d340	logical: Use sub_select rather than insn_type to select logical op Also select the RS passthrough in the logical unit by default for mfspr, which is needed for the no-op SPRs and the no-op behaviour of privileged mfspr to unimplemented SPRs. For slow SPRs the RS behaviour gets passed through from execute1 to execute2 and replaced by the correct result in execute2's result mux. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 months ago
Paul Mackerras	54173a0677	decode: Move result_sel and subresult_sel into main decode table Instead of working out result_sel and subresult_sel in decode2 from the insn_type, they now come directly from the main decode table in decode1. This reduces the need for distinct insn_type values and should enable us to avoid expanding insn_type beyond 6 bits. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 months ago
Paul Mackerras	8bfce4890b	predecode: Add some more comments No code change. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 months ago
Paul Mackerras	0f8c4afc52	openocd: Update arty config for newer openocd versions Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 months ago
Paul Mackerras	0bf1dcedbd	acorn-cle-215: Implement SMP and enable FPU and BTC The four LEDs on the Acorn-CLE-215 (Nitefury) board become run lights for the first four CPUs. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 months ago
Paul Mackerras	ce5a967ac2	soc: Allow for up to 1GB of DRAM in address decoding The Acorn-CLE-215 board has 1GB of DRAM. Without this, the top 512MB of DRAM is not accessible. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 months ago
Paul Mackerras	4282d37741	FPU: Faster method for testing for 1-bits at right end of R At various points we need to set the X bit if any bit of R which would be shifted out by a right shift of N bits is a 1. We can do this by computing R | -R, which contains a 1 in the position of the right-most 1-bit in R and in all positions to the left, and zeroes to the right. That means we can test for the least-significant N bits being non-zero by testing whether bit N-1 of (R | -R) is a 1. Doing this uses fewer LUTs and has better timing than the old method of generating a mask, ANDing it with R, and testing whether the result is non-zero. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 months ago
Paul Mackerras	04b0c901e0	dcache: Simplify expression for read enable of cache RAM The path from execute_to_loadstore.valid through to the read enable of the cache RAM has shown up as a critical path. In fact we can simplify this by always asserting read enable when not stalled. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 months ago
Paul Mackerras	8605dcb4f1	decode2: Use register addresses from decode1 rather than recomputing them Currently, decode2 computes register addresses from the input_reg fields in the decode table entry and the instruction word. This duplicates a computation that decode1 has already done based on the insn_code value. Instead of doing this redundant computation, just use the register addresses supplied by decode1. This means that the decode_input_reg_* functions merely compute whether the register operand is used or not. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 months ago
Paul Mackerras	e14712e45c	core: Simplify operand presentation for hash instructions This removes the cases in the decode stages which allowed the C register address to come from the RB field for the hash instructions (hashst[p], hashchk[p]), and generated a negative immediate value for the B operand. The motivation is to simplify the logic for the C register address. Instead the unusual construction of the address for the hash instructions is handled in the loadstore1_in process, and the hash computation uses the A and B operands rather than A and C. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 months ago
Paul Mackerras	dc9d351833	Merge pull request #444 from paulusmack/master Miscellaneous improvements	4 months ago
Paul Mackerras	de2e8f81ee	decode: Execute cpabort as a no-op It seems that the Linux kernel executes cpabort on any CPU that implements ISA v3.1 or later, despite cpabort being optional. To cope with this, implement cpabort as a no-op. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 months ago
Paul Mackerras	b65dde1a95	arty a7: Display run status of two CPUs on LEDs 6 and 7 The run status LED is off when the core is held in reset (e.g. when the second core hasn't been started yet). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 months ago
Paul Mackerras	51dd7f578f	countbits: Move more popcount calculation before the clock edge Popcount takes two cycles to execute. The computation of the final popcount value in the second cycle has shown up as a critical path on the Artix-7, so move one stage of the summation back into the first cycle. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 months ago
Paul Mackerras	b14dd43ce6	Merge pull request #443 from paulusmack/compliance More architecture compliance improvements: LPCR, [U]SIER[23], [U]MMCR3, HMER, HMEER. Remove HFSCR and associated logic.	4 months ago
Paul Mackerras	361a01259c	Merge pull request #441 from paulusmack/dcache This reworks the dcache to try and simplify the logic and alleviate some of the paths that have been showing up as critical paths in synthesis. An example is a dependency of the req_is_hit signal on the wishbone ack, which this series removes. Overall this seems to have reduced LUT usage and improved timing.	4 months ago
Paul Mackerras	7e544c1fb8	Merge pull request #442 from paulusmack/fpu This reworks the FPU logic to try and get closer to the point where the big state machine could be converted into microcode. This means that as far as possible the state machine should just set control lines, ideally with as little conditional logic in each state as possible, and that anything that is considered data should be manipulated outside of the state machine. This also improves architecture compliance in the area of exception handling, and alleviates some critical paths.	4 months ago
Paul Mackerras	8f7326a824	core: Implement various SPRs which read zero and ignore writes This implements [U]SIER2, [U]SIER3, [U]MMCR3, HMER and HMEER as SPRs which return zero when read, and ignore writes. The zero value is provided via the slow SPR read multiplexer. To avoid increasing the size of the selector from 4 bits to 5, the (implementation specific) LOG_ADDR and LOG_DATA SPRs now share a single selector value. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 months ago
Paul Mackerras	1da8476cf9	dcache: Simplify forwarding of load data while reloading a cache line This removes a dependency of req_is_hit and similar signals on the wishbone ack input, by removing use_forward_rl, and making idx_reload not dependent on wr_row_match and wishbone_in.ack. Previously if a load in r0 hit the doubleword being supplied from memory, that was treated as a hit and the data was forwarded via a multiplexer associated with the cache RAM. Now it is called a miss and completed by the logic in the RELOAD_WAIT_ACK state of the state machine. The only downside is that now the selection of data source in the dcache_fast_hit process depends on req_is_hit rather than r1.full. Overall this change seems to reduce the number of LUTs, and make timing easier on the ECP-5. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 months ago
Paul Mackerras	c938246cc8	dcache: Simplify addressing of the dcache TLB Instead of having TLB invalidation and TLB load requests come through the dcache main path, these operations are now done in one cycle entirely based on signals from the MMU, and don't involve the TLB read path or the dcache state machine at all. So that we know which way of the TLB to affect for invalidations, loadstore1 now sends down a "TLB probe" operation for tlbie instructions which goes through the dcache pipeline and sets the r1.tlb_hit_* fields which are used in the subsequent invalidation operation from the MMU (if it is a single-page invalidation). TLB load operations write to the way identified by r1.victim_way, which was set on the TLB miss that triggered the TLB reload. Since we are writing just one way of the TLB tags now, rather than writing all ways with one way's value changed, we now pad each way to a multiple of 8 bits so that byte write-enables can be used to select which way gets written. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 months ago
Paul Mackerras	1b6ee631bc	core: Implement LPCR register This implements the LPCR (Logical Partition Control Register) with 5 read/write bits. The other 59 bits are read-only; two (HR and UPRT) read as 1 and the rest as 0. The bits that are implemented are: * HAIL - enables taking interrupts with relocation on * LD - enables large decrementer mode * HEIC - disables external interrupts when set * LPES - controls how external interrupts are delivered * HVICE - does nothing at present since there is no source of Hypervisor Virtualization Interrupts. This also fixes a bug where MSR[RI] was getting cleared by the delivery of hypervisor interrupts. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 months ago
Paul Mackerras	63fff5e05c	core: Remove HFSCR and Hypervisor Facility Unavailable interrupt logic HFSCR is associated with the LPAR (Logical Partitioning) feature, which is not required for SFFS designs, so remove it and the associated logic. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 months ago
Paul Mackerras	5168242cd5	dcache: Rework forwarding data paths This rearranges the multiplexing of cache read data with forwarded store data with the aim of shortening the path from the req_hit_ways signal to the r1.data_out register. The forwarding decisions are now made for each way independently and the results then combined according to which way detected a cache hit. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 months ago
Paul Mackerras	4278387b21	dcache: Simplify reservation logic With some slight rearrangement of the state machine in the dcache_slow process, we can remove one of the two comparators that detect writes by other entities to the reservation granule. The state machine now sets the wishbone cyc signal on the transition from IDLE to DO_STCX state. Once we see the wishbone stall signal at 0, we consider we have the wishbone and we can assert stb to do the write provided that the stcx is to the reservation address and we haven't seen another write to the reservation granule. We keep the comparator that compares the snoop address delayed by one cycle, in order to make timing easier, and the one (or more) cycle delay between cyc and stb covers that one cycle delay in the kill_rsrv signal. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 months ago
Paul Mackerras	26507450b7	dcache: Remove reset on read port of cache tag RAM The reset was added originally to reduce metavalue warnings in simulation, is not necessary for correct operation, and showed up as a critical path in synthesis for the Xilinx Artix-7. Remove it when doing synthesis; for simulation we set the value read to X rather than 0 in order to catch any use of the previously reset value. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 months ago
Paul Mackerras	9645ab6e1f	dcache: Rework forwarding and same-page logic This gets rid of some largish comparators in the dcache_request process by matching index and way that hit in the cache tags instead of comparing tag values. That is, some tag comparisons can be replaced by seeing if both tags hit in the same cache way. When reloading a cache line, we now set it valid at the beginning of the reload, so that we get hits to compare. While the reload is still occurring, accesses to doublewords that haven't yet been read are indicated with req_is_hit = 0 and req_hit_reload = 1 (i.e. are considered to be a miss, at least for now). For the comparison of whether a subsequent access is to the same page as stores already being performed, in virtual mode (TLB being used) we now compare the way and index of the hit in the TLB, and in real mode we compare the effective address. If any new entry has been loaded into the TLB since the access we're comparing against, then it is considered to be a different page. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 months ago
Paul Mackerras	2529bb66ad	dcache: Implement dcbz to non-cacheable memory properly A dcbz operation to memory that is mapped as non-cacheable in the page tables doesn't cause an alignment interrupt, but neither was it implemented properly in the dcache. It does do 8 writes to memory but it also creates a zero-filled line in the cache. This fixes it so that dcbz to memory mapped non-cacheable doesn't write the cache tag or set any line valid. We now have r1.reloading which is 1 only in RELOAD_WAIT_ACK state, but only if the memory is cacheable and therefore the cache should be updated (i.e. it is zero in RELOAD_WAIT_ACK state if we are doing a non-cacheable dcbz). We can now also remove the code in loadstore1 that checks for non-cacheable dcbz, which only triggered when doing dcbz in real mode to an address in the Cxxxxxxx range. Also remove some unused variables and signals. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 months ago
Paul Mackerras	ec323897e3	dcache: Use expanded per-way TLB and cache tag hit information Rather than combining the results of the per-way comparators into an encoded 'hit_way' variable, use the individual results directly using AND-OR type networks where possible, in order to reduce utilization and improve timing. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 months ago
Paul Mackerras	3268ef717c	FPU: Make opsel_a a function of just the state This adds some extra states and transitions so that opsel_a becomes a function only of the current state. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 months ago
Paul Mackerras	73505b1626	FPU: Provide a separate path for transferring A/B/C to R The timing path from r.a.class to result showed up as a critical path on the Artix-7, apparently because of transfers of A, B or C to R in special cases (e.g. NaN inputs) and the fsel instruction. To alleviate this, we provide a path via the miscellaneous value multiplexer from A, B and C to R, selected via opsel_R = RES_MISC and misc_sel = 111. A new selector opsel_sel selects which of A, B or C to transfer, using the same encoding as opsel_a. This new selector is now also used for the result class when rcls_op = RCLS_SEL and for the result sign when rsgn_op = RSGN_SEL. This reduces the number of things that opsel_a depends on and eases timing in the main adder path. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 months ago
Paul Mackerras	b63773f6e9	FPU: Move computation of main adder inputs out of the state machine Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 months ago
Paul Mackerras	b4aae8511d	FPU: Move special case handling to a separate process This creates a new fpu_specialcases process that handles most of the logic that was previously in the DO_NAN_INF and DO_ZERO_DEN states. What remains of those states, i.e. the handling of denormalized inputs, is in a new DO_SPECIAL state. The state machine goes into DO_SPECIAL state after IDLE for any arithmetic operation where an input is a NaN, infinity, zero or denormalized value. Doing this means that the rest of the state machine won't try to start any computation which would need to be overridden by the logic to produce the result value selected by the fpu_specialcases process. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 months ago
Paul Mackerras	b1bd2aa865	FPU: Make set_r independent of multiply_to_f.valid Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 months ago
Paul Mackerras	fcfdbc449c	FPU: Move condition register calculations to an explicit data path Instead of calculating v.cr_result in the state machine, we now have the state machine set a 'cr_op' variable which then controls what computation the CR data path does to set v.cr_result. The CR data path also handles updating the XERC result bits for integer operations (division and modulus). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 months ago
Paul Mackerras	bbc485f336	FPU: Rework inputs to the main adder With this, the A input no longer has R as an option but now takes the rounding constants and the low-order bits of P (used as an adjustment in the square root algorithm). The B input has either R or zero. Both inputs can be optionally inverted for subtraction. The select inputs to the multiplexers now have 3 bits in opsel_a and 1 bit in opsel_b. The states which need R to be set now explicitly have set_r := 1 even though that is the default, essentially for documentation reasons. Similarly some states set opsel_b <= BIN_R even though that is the default. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 months ago
Paul Mackerras	0e7c11a0e4	FPU: Move result_class logic outside of state machine The various states choose one of four operations (including no-op) to be done on result_class. Some operations have side-effects on arith_done or FPSCR. The DO_NAN_INF and DO_ZERO_DEN states still set result_class directly since their logic is expected to move out to a separate process later. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 months ago
Paul Mackerras	5f0b2d433d	FPU: Simplify calculation of result_class For the various arithmetic operators, we only get to the DO_* states when the inputs are finite (not zero, infinity or NaN), so we can replace setting of v.result_class to r.a.class or r.b.class with an overall setting of it to FINITE in cycle 1 of all those operations. Also, integer division doesn't need to set the result class since the result is integer. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 months ago
Paul Mackerras	70819c4c39	FPU: Do renormalization from DO_ZERO_DEN state Instead of having the various DO_* states (DO_FMUL, DO_FDIV, etc.) handle checking for denormalized inputs, we now have DO_ZERO_DEN state check for denormalized inputs and branch to RENORM_{A,B,C} to handle them. This also meant some changes were needed in how fsqrt and frsqrte handled inputs with odd exponent. The DO_FSQRT and DO_FRSQRTE states were very similar and have been combined into one. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 months ago
Paul Mackerras	8648ddb64f	FPU: Eliminate EXC_RESULT state This lets us remove r.opsel_a and is a step towards moving the handling of exceptional cases out to a separate process. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 months ago
Paul Mackerras	850b87c83f	FPU: Get rid of r.madd_cmp and r.exp_cmp This saves a few LUTs and simplifies the code a little. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 months ago
Paul Mackerras	ba2add029a	FPU: Remove need to set opsel_a one cycle ahead Most states set opsel_a directly to select the operand for the A input of the main adder. The exception is the EXC_RESULT state, which uses r.opsel_a set by the previous cycle to indicate which input operand to use as the result. In order to make timing, ensure that the controls that select the inputs to the main adder (opsel_*, etc.) don't depend on any complicated functions of the data (such as px_nz, pcmpb_eq, pcmpb_lt, etc.), but are as far as possible constant for each state. There is now a control called set_r for whether the result is written to r.r, which enables us to avoid setting opsel_b or opsel_r conditionally in some cases. Also, to avoid a data-dependent setting of msel_2 in IDIV_DODIV state, the IDIV_NR1 and IDIV_NR2 states have been reworked so that completion of the required number of iterations is checked in IDIV_NR1 state, and at that point, if the inverse estimate is < 0.5, we go to IDIV_USE0_5 state in order to use 0.5 as the estimate. This means that in the normal case, the inverse estimate is already in Y when we get to IDIV_DODIV state. IDIV_USE0_5 has been reworked to put R (which will contain 0.5) into Y as the inverse estimate. That means that IDIV_DODIV state doesn't have any data-dependent logic to put either P or R into Y. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 months ago
Paul Mackerras	2731384a4b	FPU: Reduce misc_sel to 3 bits Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 months ago
Paul Mackerras	cf866ce910	FPU: Simplify logic for setting r.x Since r.x is mostly set from the value in r.r and only once from anything else (r.b.mantissa), move the check to before the input multiplexer for the main adder, so it works on r.r rather than whatever is selected by r.opsel_a. For the case in DO_FRSP where we have B selected by r.opsel_a, we add a new state so that we now get B into R and then check the low bits of R. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 months ago
Paul Mackerras	4e5f856c55	FPU: Factor out some of the common elements of the DO_* states Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 months ago
Paul Mackerras	2422585e14	FPU: Reduce use of r.insn inside the state machine Instead use things derived from the instruction in the first cycle, such as r.is_multiply, r.is_addition, etc. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 months ago
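
The bit trick described in commit 4282d37741 (FPU: Faster method for testing for 1-bits at right end of R) can be checked with a small software model. This is only a sketch in Python, not the project's VHDL; the 64-bit width and the function names are assumptions made for illustration.

```python
# Software model of the "R | -R" test from commit 4282d37741.
# WIDTH and both function names are illustrative assumptions,
# not taken from the microwatt sources.
WIDTH = 64
MASK = (1 << WIDTH) - 1


def low_bits_nonzero_mask(r: int, n: int) -> bool:
    """Old method: build a mask of the low n bits, AND it with R, test non-zero."""
    return (r & ((1 << n) - 1)) != 0


def low_bits_nonzero_fast(r: int, n: int) -> bool:
    """New method: test bit n-1 of (R | -R). Requires 1 <= n <= WIDTH."""
    # R | -R has 1s at the lowest set bit of R and everywhere to its left,
    # and 0s to its right. Masking keeps the result WIDTH bits wide.
    smear = (r | -r) & MASK
    return ((smear >> (n - 1)) & 1) == 1
```

Both functions should agree for every R that fits in WIDTH bits and every N from 1 to WIDTH; the second form needs no per-N mask, which is presumably where the LUT saving mentioned in the commit message comes from.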

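The doorbell routing described in commit d2bf3f3580 (each core sends its hypervisor doorbell messages to the soc level, which ORs together the bits for each CPU) can be pictured with the following sketch. It is a Python model under stated assumptions, not the actual soc logic: the representation of each core's output as a bitmask with one bit per target CPU, and the name `combine_doorbells`, are hypothetical.

```python
from typing import List


def combine_doorbells(msgs_from_cores: List[int], ncpus: int) -> List[bool]:
    """Model of the soc-level OR of hypervisor doorbell messages.

    msgs_from_cores[i] is an assumed bitmask from core i, where bit j set
    means core i is sending a doorbell to CPU j. The return value has one
    entry per CPU: True if any core targeted that CPU.
    """
    combined = 0
    for mask in msgs_from_cores:
        combined |= mask  # OR together the bits from every core
    return [bool((combined >> cpu) & 1) for cpu in range(ncpus)]
```

For example, with four CPUs, if core 0 targets CPU 1 and core 2 targets CPUs 1 and 3, `combine_doorbells([0b0010, 0, 0b1010, 0], 4)` reports a doorbell for CPUs 1 and 3 only.
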
1460 Commits (master)