microwatt

Commit Graph

Author	SHA1	Message	Date
Paul Mackerras	73cc5167ec	Use FPU for division instructions if we have an FPU - Arrange for XER to be written for OE=1 forms - Arrange for condition codes to be set for RC=1 forms (including correct handling for 32-bit mode) - Don't instantiate the divider if we have an FPU. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	34330552e8	FPU: Add logic for 32-bit integer division Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	a95f8aab38	FPU: Add integer division logic to FPU This adds logic to the FPU to accomplish 64-bit integer divisions. No instruction actually uses this yet. The algorithm used is to obtain an estimate of the reciprocal of the divisor using the lookup table and refine it by one to three iterations of the Newton-Raphson algorithm (the number of iterations depends on the number of significant bits in the dividend). Then the reciprocal is multiplied by the dividend to get the quotient estimate. The remainder is calculated as dividend - quotient * divisor. If the remainder is greater than or equal to the divisor, the quotient is incremented, or if a modulo operation is being done, the divisor is subtracted from the remainder. The inverse estimate after refinement is good enough that the quotient estimate is always equal to or one less than the true quotient. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	23d5c4edc5	FPU: Convert internal R, A, B, and C registers to 8.56 format This changes the representation of the R, A, B and C registers in the FPU from 10.54 format (10 bits to the left of the binary point and 54 bits to the right) to 8.56 format, to match the representation used in the P and Y registers and the multiplier operands. This eliminates the need for shifting when R, A, B or C is an input to the multiplier and will make it easier to implement integer division in the FPU. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	d1850fea29	Track hazards explicitly for XER overflow bits This provides a mechanism for tracking updates to the XER overflow bits (SO, OV, OV32) and stalling instructions which need current values of those bits (mfxer, integer compare instructions, integer Rc=1 instructions, addex) or which write carry bits (since all the XER common bits are written together, if we are writing CA/CA32 we need up-to-date values of SO/OV/OV32). This will enable updates to SO/OV/OV32 to be done at other places besides the ex1 stage. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	7c240a664b	fetch1: Fix debug stop again This fixes a bug which prevents the core from stopping properly. The same bug was previously fixed in commit `e41cb01bca` ("fetch1: Fix debug stop", 2020-12-19) and reintroduced by commit `0fb207be60` ("fetch1: Implement a simple branch target cache", 2020-12-19). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	e598c2aef8	control: Reimplement serialization using tags This lets us get rid of r_int and its 'outstanding' counter. We now test more directly for excess completions by checking that we don't get duplicate completions for the same tag. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	2da08bcf2e	decode1: Remove stash buffer Now that the timing of the busy signal from decode2 doesn't depend on register numbers or downstream instruction completion, we no longer need the stash buffer on the output of decode1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	2f45e545ed	decode2: Rework to make the stall_out signal come from a register At present the busy/stall signal going to decode1 depends on whether control thinks it can issue the current instruction, and that depends on completion and bypass signals coming from execute1 and writeback. To improve the timing of stall_out, this rearranges decode2 so that stall_out is asserted when we have a valid instruction that couldn't be issued in the previous cycle. This means that decode1 could give us a new instruction when we haven't issued the previous instruction. This in turn means that we can only use d_in in the first cycle of processing an instruction. After the first cycle, we get register addresses etc. from dc2 rather than d_in. Then, to avoid the need to read register operands from register_file in each cycle until the instruction issues, we bring the bypass path for data being written to the register file into decode2 explicitly rather than having it in register_file. A new process called decode2_addrs does the process of calling decode_input_reg_* and decode_output_reg and sets up the register file addresses. This was split out (and decode_input_reg_* reworked) to try to reduce the number of passes through the decode2_1 process that need to be done in simulation. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	c9e838b656	Remove support for lq, stq, lqarx and stqcx. They are optional in SFFS (scalar fixed-point and floating-point subset), are not needed for running Linux, and add complexity, so remove them. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	0bd1e24024	decode2: Rename 'r' to 'dc2' Also get rid of a couple of unused variables. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	ebe1caab85	decode1: Reduce number of single-issue instructions This reduces the set of instructions marked as single-issue to just attn and mtspr to "slow" SPRs (those that are not stored in the register file). The instructions that were previously single-issue are: isync, dcbf, dcbst, dcbt, dcbtst, eieio, icbi, mfmsr, mtmsr, mtmsrd, mfspr to slow SPRs, sync, tlbsync and wait. The synchronization instructions are mostly no-ops anyway due to the in-order nature of the core, and the cache-management instructions are unimplemented (except for icbi). The MSR ops don't need to be single-issue due to the in-order core and the fact that MSR updates are effective on the following instruction. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	9a8a8e50f8	FPU: Add stage-2 stall ability to FPU This makes the FPU able to stall other units at execute stage 2 and be stalled by other units (specifically the LSU). This means that the completion and writeback for an instruction can now end up being deferred until the second cycle of a following instruction, i.e. the cycle when the state machine has gone through IDLE state into one of the DO_* states, which means we need to latch the destination FPR number, CR mask, etc. from the previous instruction so that we present the correct information to writeback. The advantage of this is that we can get rid of the in_progress signal from the LSU. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	ef122868d5	Do CR0 setting for Rc=1 instructions in execute2 instead of writeback This lets us forward the CR0 result to following instructions that use CR, meaning they get to issue one cycle earlier. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	e030a500e8	Allow integer instructions and load/store instructions to execute together Execute1 and loadstore1 now send each other stall signals that indicate that a valid instruction in stage 2 can't complete in this cycle, and hence any valid instruction in stage 1 in the other unit can't move to stage 2. With this in place, an ALU instruction can move into stage 1 while a LSU instruction is in stage 2. Since the FPU doesn't yet have a way to stall completion, we can't yet start FPU instructions while any LSU or ALU instruction is in progress. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	4b6148ada6	Add a bypass path from the execute2 stage This enables some instructions to issue earlier and thus improves performance, at the cost of some extra multiplexers in decode2. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	3510071d9a	Add a second execute stage to the pipeline This adds a second execute stage to the pipeline, in order to match up the length of the pipeline through loadstore and dcache with the length through execute1. This will ultimately enable us to get rid of the 1-cycle bubble that we currently have when issuing ALU instructions after one or more LSU instructions. Most ALU instructions execute in the first stage, except for count-zeroes and popcount instructions (which take two cycles and do some of their work in the second stage) and mfspr/mtspr to "slow" SPRs (TB, DEC, PVR, LOGA/LOGD, CFAR). Multiply and divide/mod instructions take several cycles but the instruction stays in the first stage (ex1) and ex1.busy is asserted until the operation is complete. There is currently a bypass from the first stage but not the second stage. Performance is down somewhat because of that and because this doesn't yet eliminate the bubble between LSU and ALU instructions. The forwarding of XER common bits has been changed somewhat because now there is another pipeline stage between ex1 and the committed state in cr_file. The simplest thing for now is to record the last value written and use that, unless there has been a flush, in which case the committed state (obtained via e_in.xerc) is used. Note that this fixes what was previously a benign bug in control.vhdl, where it was possible for control to forget an instruction's dependency on a value from a previous instruction (a GPR or the CR) if this instruction writes the value and the instruction gets to the point where it could issue but is blocked by the busy signal from execute1. In that situation, control may incorrectly not indicate that a bypass should be used. That didn't matter previously because, for ALU and FPU instructions, there was only one previous instruction in flight and once the current instruction could issue, the previous instruction was completing and the correct value would be obtained from register_file or cr_file. For loadstore instructions there could be two being executed, but because there are no bypass paths, failing to indicate use of a bypass path is fine. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	521a5403a9	execute1: Rename 'r' to 'ex1' Maybe this will give us slightly better names in critical path reports and the like. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	813e2317bf	execute1: Restructure to separate out execution of side effects We now have a record that represents the actions taken in executing an instruction, and a process that computes that for the incoming instruction. We no longer have 'current' or 'r.cur_instr', instead things like the destination register are put into r.e in the first cycle of an instruction and not reinitialized in subsequent busy cycles. For mfspr and mtspr, we now decode "slow" SPR numbers (those SPRs that are not stored in the register file) to a new "spr_selector" record in decode1 (excluding those in the loadstore unit). With this, the result for mfspr is determined in the data path. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Iago Caran Aquino	de1bf10114	tests/pmu: Add load/store completed, instruction count and cycle count tests Signed-off-by: Iago Caran Aquino <iago.caran@gmail.com>	4 years ago
Paul Mackerras	204fedc63f	Move XER low bits out of register file Besides the overflow and status carry bits, XER has 18 bits which need to retain the value written by mtxer (in case software wants to emulate the move-assist instructions (lswi, lswx, stswi, stswx)). Until now these bits (and others) have been stored in the GPR file as a "fast" SPR, but this causes complications because XER is not really a fast SPR. Instead, we now store these 18 bits in the 'ctrl' signal, which exists in execute1. This will enable us to simplify the data path in future, and has the added bonus that with a little bit of plumbing, we can get the full XER value printed when dumping registers at the end of a simulation. Therefore this changes scripts/run_test.sh to remove the greps which exclude XER from the comparison of actual and expected register results. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	bdd4d04162	Simplify flow control in the dcache and loadstore units Simplify the flow control by stalling the whole upstream pipeline when a stage can't proceed, instead of trying to let each stage progress independently when it can. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	35e0dbed34	Merge pull request #353 from tianrui-wei/master fix: fix icache_tb not finishing correctly	4 years ago
Michael Neuling	cd52390bf1	Merge pull request #373 from antonblanchard/icache-insn-u-state icache: Don't output X on i_out.insn	4 years ago
Michael Neuling	b983d5080e	Merge pull request #376 from antonblanchard/loadstore-init loadstore1: reduce U state being output	4 years ago
Michael Neuling	d4db331467	Merge pull request #374 from antonblanchard/icache-unused-sig core: Remove unused icache_inv signal	4 years ago
Michael Neuling	ee5e3778ed	Merge pull request #364 from shenki/readme-updates Readme updates	4 years ago
Michael Neuling	c43692f4c7	Merge pull request #372 from antonblanchard/dcache-unused-sig dcache: remove unused do_write signal	4 years ago
Michael Neuling	956df2c863	Merge pull request #371 from antonblanchard/unused-sig execute1: sub_mux_sel and result_mux_sel are unused	4 years ago
Michael Neuling	3627f102db	Merge pull request #370 from antonblanchard/divider-init divider: Fix d_out.overflow U state issue	4 years ago
Paul Mackerras	6e1e763c02	Merge pull request #368 from antonblanchard/icache-pmu-events icache: Hook up PMU events	4 years ago
Anton Blanchard	1047239a37	Merge pull request #377 from antonblanchard/fpu-init fpu: Reduce uninitialised signals	4 years ago
Anton Blanchard	9d35340bb1	fpu: Reduce uninitialised signals Reduce uninitialised signals coming out of the FPU. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	4 years ago
Michael Neuling	b82eea5933	Merge pull request #366 from antonblanchard/hello-world-bss Zero BSS in hello world test	4 years ago
Anton Blanchard	d3aff67fa7	Merge pull request #375 from antonblanchard/core_debug-init core_debug: Initialise gspr_index	4 years ago
Anton Blanchard	b47b71821e	loadstore1: reduce U state being output While these signals should only be read when valid is true, they are only a small number of bits and we want to reduce the amount of U/X state bouncing around the chip. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	4 years ago
Anton Blanchard	71d4b5ed20	core_debug: Initialise gspr_index Another case of U state being driven out of a module. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	4 years ago
Anton Blanchard	a527d9b959	core: Remove unused icache_inv signal Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	4 years ago
Anton Blanchard	e7f0a7c7ac	icache: Don't output X on i_out.insn decode1 has a lot of logic that uses i_out.insn without first looking at i_out.valid. Play it safe and never output X state. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	4 years ago
Anton Blanchard	39220be311	dcache: remove unused do_write signal Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	4 years ago
Anton Blanchard	843361f2be	execute1: sub_mux_sel and result_mux_sel are unused Remove them. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	4 years ago
Anton Blanchard	d3a7517318	divider: Fix d_out.overflow U state issue While we should only look at this when d_out.valid = 1, we may as well remove some U state across interfaces. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	4 years ago
Anton Blanchard	1ff852b012	Merge pull request #369 from antonblanchard/loadstore-pmu-init loadstore1: Initialise PMU events	4 years ago
Anton Blanchard	e2438071a1	loadstore1: Initialise PMU events The loadstore1 PMU events are U state until a load and a store completes. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	4 years ago
Anton Blanchard	b7c4d3c5c3	Merge pull request #367 from antonblanchard/fpu-typo fpu: Fix capitalisation of Execute1ToFPUType	4 years ago
Anton Blanchard	f06abb67ad	icache: Hook up PMU events We weren't connecting the icache PMU events up. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	4 years ago
Anton Blanchard	64d2def0c6	fpu: Fix capitalisation of Execute1ToFPUType While this is not an issue in VHDL, I noticed this when running a script over the source and we may as well fix it. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	4 years ago
Anton Blanchard	ff442d1bdb	Zero BSS in hello world test While trying to reduce U/X state issues, I noticed that our BSS is not being initialised in the hello world test. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	4 years ago
Anton Blanchard	b8fc5636a4	Merge pull request #365 from antonblanchard/less-fpga-init Remove some FPGA style signal inits	4 years ago
Anton Blanchard	ebdddcc402	Remove some FPGA style signal inits These don't work on the ASIC flow, so remove them and initialise them explicitly where required. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	4 years ago
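
Commit a95f8aab38 above describes the FPU's integer division scheme in prose: a table-based reciprocal estimate, one to three Newton-Raphson refinements, a multiply to get the quotient estimate, and a single correction step. The following Python sketch models that flow for unsigned 64-bit operands. It is not taken from the microwatt sources: the table size, the 128-bit fixed-point scale `F`, the iteration thresholds, and the names `RECIP_TABLE`, `reciprocal` and `divmod_nr` are all assumptions chosen for illustration, and division by zero is simply rejected.

```python
# Illustrative model only; widths, table size and thresholds are assumptions,
# not values from the microwatt FPU.
F = 128  # fixed-point scale: a reciprocal is held as roughly 2**F / divisor

# Hypothetical 1024-entry lookup table, indexed by the top 11 bits of the
# normalized divisor (leading 1 included). Each entry slightly underestimates
# 2**22 / top, so the initial reciprocal never exceeds the true value.
RECIP_TABLE = [(1 << 22) // (top + 1) for top in range(1024, 2048)]


def reciprocal(divisor, iterations):
    """Return an underestimate of 2**F / divisor."""
    n = divisor.bit_length()
    # Take the top 11 bits of the divisor, padding with zeros if it is short.
    top = divisor >> (n - 11) if n >= 11 else divisor << (11 - n)
    x = RECIP_TABLE[top - 1024] << (F - n - 11)
    for _ in range(iterations):
        # Newton-Raphson step x <- x * (2 - divisor * x), in fixed point.
        # Each step roughly doubles the number of correct bits and, with
        # truncation, keeps x at or below the true reciprocal.
        x = (x * ((2 << F) - divisor * x)) >> F
    return x


def divmod_nr(dividend, divisor):
    """Unsigned 64-bit divide and modulo via a refined reciprocal."""
    if not (0 <= dividend < 1 << 64 and 0 < divisor < 1 << 64):
        raise ValueError("operands must be unsigned 64-bit, divisor non-zero")
    # Fewer significant bits in the dividend need fewer refinements
    # (thresholds here are conservative assumptions).
    bits = dividend.bit_length()
    iterations = 1 if bits <= 16 else 2 if bits <= 32 else 3
    quotient = (dividend * reciprocal(divisor, iterations)) >> F
    remainder = dividend - quotient * divisor
    # The estimate is either exact or one too small, so one fix-up suffices.
    if remainder >= divisor:
        quotient += 1
        remainder -= divisor
    return quotient, remainder
```

For example, `divmod_nr(100, 7)` returns `(14, 2)`. Because every step rounds down, the quotient estimate can only fall short of the true quotient, which is why the correction only ever needs to increment.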


1196 Commits (09965b91024faceab13446ab6801fdb2614da71e)
