microwatt

Commit Graph

Author	SHA1	Message	Date
Michael Neuling	caf458be37	Metavalue cleanup for common.vhdl This affects other files which have been included here. Signed-off-by: Michael Neuling <mikey@neuling.org>	2 years ago
Paul Mackerras	d6121cd636	Use register addresses from decode1 for dependency tracking This improves timing a little because the register addresses now come directly from a latch instead of being calculated by decode_input_reg_*. The asserts that check that the two are the same are now in decode2 rather than register_file. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	d0f319290f	Restore debug access to SPRs This provides access to the SPRs via the JTAG DMI interface. For now they are still accessed as if they were GPR/FPRs using the same numbering as before (GPRs at 0 - 0x1f, SPRs at 0x20 - 0x2d, FPRs at 0x40 - 0x5f). For XER, debug reads now report the full value, not just the bits that were previously stored in the register file. The "slow" SPR mux is not used for debug reads. Decode2 determines on each cycle whether a debug SPR access will happen next cycle, based on whether there is a request and whether the current instruction accesses the SPR RAM. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	fdb3ef6874	Finish off taking SPRs out of register file With this, the register file now contains 64 entries, for 32 GPRs and 32 FPRs, rather than the 128 it had previously. Several things get simplified - decode1 no longer has to work out the ispr{1,2,o} values, decode_input_reg_{a,b,c} no longer have the t = SPR case, etc. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	337b104250	Move LR, CTR and TAR out of the register file By putting CTR on the odd side and LR and TAR on the even side, we can read and write CTR for bdnz-style instructions in parallel with reading LR or TAR for indirect branches and writing LR for branches with LK=1. Thus we don't need to double up any of these instructions, giving a simplification in decode2. We now have logic for printing LR and CTR at the end of a simulation in execute1, in addition to the similar logic in register_file and cr_file. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	bc4d02cb0d	Start removing SPRs from register file This starts the process of removing SPRs from the register file by moving SRR0/1, SPRG0-3, HSRR0/1 and HSPRG0/1 out of the register file and putting them into execute1. They are stored in a pair of small RAM arrays, referred to as "even" and "odd". The reason for having two arrays is so that two values can be read and written in each cycle. For example, SRR0 and SRR1 can be written in parallel by an interrupt and read in parallel by the rfid instruction. The addresses in the RAM which will be accessed are determined in the decode2 stage. We have one write address for both sides, but two read addresses, since in future we will want to be able to read CTR at the same time as either LR or TAR. We now have a connection from writeback to execute1 which carries the partial SRR1 value for an interrupt. SRR0 comes from the execute pipeline; we no longer need to carry instruction addresses along the LSU and FPU pipelines. Since SRR0 and SRR1 can be written in the same cycle now, we don't need the little state machine in writeback any more. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	d1850fea29	Track hazards explicitly for XER overflow bits This provides a mechanism for tracking updates to the XER overflow bits (SO, OV, OV32) and stalling instructions which need current values of those bits (mfxer, integer compare instructions, integer Rc=1 instructions, addex) or which write carry bits (since all the XER common bits are written together, if we are writing CA/CA32 we need up-to-date values of SO/OV/OV32). This will enable updates to SO/OV/OV32 to be done at other places besides the ex1 stage. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	e598c2aef8	control: Reimplement serialization using tags This lets us get rid of r_int and its 'outstanding' counter. We now test more directly for excess completions by checking that we don't get duplicate completions for the same tag. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	2f45e545ed	decode2: Rework to make the stall_out signal come from a register At present the busy/stall signal going to decode1 depends on whether control thinks it can issue the current instruction, and that depends on completion and bypass signals coming from execute1 and writeback. To improve the timing of stall_out, this rearranges decode2 so that stall_out is asserted when we have a valid instruction that couldn't be issued in the previous cycle. This means that decode1 could give us a new instruction when we haven't issued the previous instruction. This in turn means that we can only use d_in in the first cycle of processing an instruction. After the first cycle, we get register addresses etc. from dc2 rather than d_in. Then, to avoid the need to read register operands from register_file in each cycle until the instruction issues, we bring the bypass path for data being written to the register file into decode2 explicitly rather than having it in register_file. A new process called decode2_addrs does the process of calling decode_input_reg_* and decode_output_reg and sets up the register file addresses. This was split out (and decode_input_reg_* reworked) to try to reduce the number of passes through the decode2_1 process that need to be done in simulation. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	c9e838b656	Remove support for lq, stq, lqarx and stqcx. They are optional in SFFS (scalar fixed-point and floating-point subset), are not needed for running Linux, and add complexity, so remove them. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	0bd1e24024	decode2: Rename 'r' to 'dc2' Also get rid of a couple of unused variables. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	4b6148ada6	Add a bypass path from the execute2 stage This enables some instructions to issue earlier and thus improves performance, at the cost of some extra multiplexers in decode2. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	3510071d9a	Add a second execute stage to the pipeline This adds a second execute stage to the pipeline, in order to match up the length of the pipeline through loadstore and dcache with the length through execute1. This will ultimately enable us to get rid of the 1-cycle bubble that we currently have when issuing ALU instructions after one or more LSU instructions. Most ALU instructions execute in the first stage, except for count-zeroes and popcount instructions (which take two cycles and do some of their work in the second stage) and mfspr/mtspr to "slow" SPRs (TB, DEC, PVR, LOGA/LOGD, CFAR). Multiply and divide/mod instructions take several cycles but the instruction stays in the first stage (ex1) and ex1.busy is asserted until the operation is complete. There is currently a bypass from the first stage but not the second stage. Performance is down somewhat because of that and because this doesn't yet eliminate the bubble between LSU and ALU instructions. The forwarding of XER common bits has been changed somewhat because now there is another pipeline stage between ex1 and the committed state in cr_file. The simplest thing for now is to record the last value written and use that, unless there has been a flush, in which case the committed state (obtained via e_in.xerc) is used. Note that this fixes what was previously a benign bug in control.vhdl, where it was possible for control to forget an instruction's dependency on a value from a previous instruction (a GPR or the CR) if this instruction writes the value and the instruction gets to the point where it could issue but is blocked by the busy signal from execute1. In that situation, control may incorrectly not indicate that a bypass should be used. That didn't matter previously because, for ALU and FPU instructions, there was only one previous instruction in flight and once the current instruction could issue, the previous instruction was completing and the correct value would be obtained from register_file or cr_file. For loadstore instructions there could be two being executed, but because there are no bypass paths, failing to indicate use of a bypass path is fine. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	813e2317bf	execute1: Restructure to separate out execution of side effects We now have a record that represents the actions taken in executing an instruction, and a process that computes that for the incoming instruction. We no longer have 'current' or 'r.cur_instr', instead things like the destination register are put into r.e in the first cycle of an instruction and not reinitialized in subsequent busy cycles. For mfspr and mtspr, we now decode "slow" SPR numbers (those SPRs that are not stored in the register file) to a new "spr_selector" record in decode1 (excluding those in the loadstore unit). With this, the result for mfspr is determined in the data path. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	2491aa7fc5	core: Make popcnt* take two cycles This moves the calculation of the result for popcnt* into the countbits unit, renamed from countzero, so that we can take two cycles to get the result. The motivation for this is that the popcnt* calculation was showing up as a critical path. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	17fd069640	core: Allow multiple loadstore instructions to be in flight The idea here is that we can have multiple instructions in progress at the same time as long as they all go to the same unit, because that unit will keep them in order. If we get an instruction for a different unit, we wait for all the previous instructions to finish before executing it. Since the loadstore unit is the only one that is currently pipelined, this boils down to saying that loadstore instructions can go ahead while l_in.in_progress = 1 but other instructions have to wait until it is 0. This gives a 2% increase on coremark performance on the Arty A7-100 (from ~190 to ~194). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	4fd8d9509c	execute1: Move CR result to data path process Also work out in decode2 whether the instruction sets the XER common bits. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	ae2afeca5c	core: Track CR hazards and bypasses using tags Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	d290d2a9bb	core: Restore bypass path from execute1 This changes the bypass path. Previously it went from after execute1's output to after decode2's output. Now it goes from before execute1's output register to before decode2's output register. The reason is that the new path will be simpler to manage when there are possibly multiple instructions in flight. This means that the bypassing can be managed inside decode2 and control. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	c0b45e153b	core: Track GPR hazards using tags that propagate through the pipelines This changes the way GPR hazards are detected and tracked. Instead of having a model of the pipeline in gpr_hazard.vhdl, which has to mirror the behaviour of the real pipeline exactly, we now assign a 2-bit tag to each instruction and record which GSPR the instruction writes. Subsequent instructions that need to use the GSPR get the tag number and stall until the value with that tag is being written back to the register file. For now, the forwarding paths are disabled. That gives about an 8% reduction in coremark performance. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	a1d7b54f76	core: Crack branches that update both CTR and LR This uses the instruction doubling machinery to convert conditional branch instructions that update both CTR and LR (e.g., bdnzl, bdnzlrl) into two instructions, of which the first updates CTR and determines whether the branch is taken, and the second updates LR and does the redirect if necessary. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	4c61a71a62	core: Crack update-form loads into two internal ops This uses the instruction-doubling machinery to send load with update instructions down to loadstore1 as two separate ops, rather than one op with two destinations. This will help to simplify the value tracking mechanisms. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	b0510fd1bb	core: Reorganize execute1 This breaks up the enormous if .. elsif .. case .. elsif statement in execute1 in order to try to make it simpler and more understandable. We now have decode2 deciding whether the instruction has a value to be written back to a register (GPR, GSPR, FPR, etc.) rather than individual cases in execute1 setting result_en. The computation of the data to be written back is now independent of detection of various exception conditions. We now have an if block determining if any exception condition exists which prevents the next instruction from being executed, then the case statement which performs actions such as setting carry/overflow bits, determining if a trap exception exists, doing branches, etc., then an if statement for all the r.busy = 1 cases (continuing execution of an instruction which was started in a previous cycle, or writing SRR1 for an interrupt). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	658feabfd4	core: Make result multiplexing explicit This adds an explicit multiplexer feeding v.e.write_data in execute1, with the select lines determined in the previous cycle based on the insn_type. Similarly, for multiply and divide instructions, there is now an explicit multiplexer. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	89a67a18d0	decode: Add a facility field to the instruction decode tables This makes it simpler to work out when to deliver a FPU unavailable interrupt. This also means we can get rid of the OP_FPLOAD and OP_FPSTORE insn_type values. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	4b2c23703c	core: Implement quadword loads and stores This implements the lq, stq, lqarx and stqcx. instructions. These instructions all access two consecutive GPRs; for example the "lq %r6,0(%r3)" instruction will load the doubleword at the address in R3 into R7 and the doubleword at address R3 + 8 into R6. To cope with having two GPR sources or destinations, the instruction gets repeated at the decode2 stage, that is, for each lq/stq/lqarx/stqcx. coming in from decode1, two instructions get sent out to execute1. For these instructions, the RS or RT register gets modified on one of the iterations by setting the LSB of the register number. In LE mode, the first iteration uses RS|1 or RT|1 and the second iteration uses RS or RT. In BE mode, this is done the other way around. In order for decode2 to know what endianness is currently in use, we pass the big_endian flag down from icache through decode1 to decode2. This is always in sync with what execute1 is using because only rfid or an interrupt can change MSR[LE], and those operations all cause a flush and redirect. There is now an extra column in the decode tables in decode1 to indicate whether the instruction needs to be repeated. Decode1 also enforces the rule that lq with RA = RT and lqarx with RA = RT or RB = RT are illegal. Decode2 now passes a 'repeat' flag and a 'second' flag to execute1, and execute1 passes them on to loadstore1. The 'repeat' flag is set for both iterations of a repeated instruction, and 'second' is set on the second iteration. Execute1 does not take asynchronous or trace interrupts on the second iteration of a repeated instruction. Loadstore1 uses 'next_addr' for the second iteration of a repeated load/store so that we access the second doubleword of the memory operand. Thus loadstore1 accesses the doublewords in increasing memory order. For 16-byte loads this means that the first iteration writes GPR RT|1. It is possible that RA = RT|1 (this is a legal but non-preferred form), meaning that if the memory operand was misaligned, the first iteration would overwrite RA but then the second iteration might take a page fault, leading to corrupted state. To avoid that possibility, 16-byte loads in LE mode take an alignment interrupt if the operand is not 16-byte aligned. (This is the case anyway for lqarx, and we enforce it for lq as well.) Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	e6a5f237bc	FPU: Implement fmul[s] This implements the fmul and fmuls instructions. For fmul[s] with denormalized operands we normalize the inputs before doing the multiplication, to eliminate the need for doing count-leading-zeroes on P. This adds 3 or 5 cycles to the execution time when one or both operands are denormalized. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	b628af6176	FPU: Implement fmr and related instructions This implements fmr, fneg, fabs, fnabs and fcpsgn and adds tests for them. This adds logic to unpack and repack floating-point data from the 64-bit packed form (as stored in memory and the register file) into the unpacked form in the fpr_reg_type record. This is not strictly necessary for fmr et al., but will be useful for when we do actual arithmetic. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	856e9e955f	core: Add framework for an FPU This adds the skeleton of a floating-point unit and implements the mffs and mtfsf instructions. Execute1 sends FP instructions to the FPU and receives busy, exception, FP interrupt and illegal interrupt signals from it. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	45cd8f4fc3	core: Add support for floating-point loads and stores This extends the register file so it can hold FPR values, and implements the FP loads and stores that do not require conversion between single and double precision. We now have the FP, FE0 and FE1 bits in MSR. FP loads and stores cause a FP unavailable interrupt if MSR[FP] = 0. The FPU facilities are optional and their presence is controlled by the HAS_FPU generic passed down from the top-level board file. It defaults to true for all except the A7-35 boards. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	6a80825e70	decode1: Avoid overriding fields of v.decode in decode1 In the cases where we need to override the values from the decode ROMs, we now do that overriding after the clock edge (eating into decode2's cycle) rather than before. This helps timing a little. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	290b05f97d	core: Implement the maddhd, maddhdu and maddld instructions These instructions use major opcode 4 and have a third GPR input operand, so we need a decode table for major opcode 4 and some plumbing to get the RC register operand read. The multiply-add instructions use the same insn_type_t values as the regular multiply instructions, and we distinguish in execute1 by looking at the major opcode. This turns out to be convenient because we don't have to add any cases in the code that handles the output of the multiplier, and it frees up some insn_type_t values. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	893d2bc6a2	core: Don't generate logic for log data when LOG_LENGTH = 0 This adds "if LOG_LENGTH > 0 generate" to the places in the core where log output data is latched, so that when LOG_LENGTH = 0 we don't create the logic to collect the data which won't be stored. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	74062195ca	execute1: Do forwarding of the CR result to the next instruction This adds a path to allow the CR result of one instruction to be forwarded to the next instruction, so that sequences such as cmp; bc can avoid having a 1-cycle bubble. Forwarding is not available for dot-form (Rc=1) instructions, since the CR result for them is calculated in writeback. The decode.output_cr field is used to identify those instructions that compute the CR result in execute1. For some reason, the multiply instructions incorrectly had output_cr = 1 in the decode tables. This fixes that. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	6687aae4d6	core: Implement a simple branch predictor This implements a simple branch predictor in the decode1 stage. If it sees that the instruction is b or bc and the branch is predicted to be taken, it sends a flush and redirect upstream (to icache and fetch1) to redirect fetching to the branch target. The prediction is sent downstream with the branch instruction, and execute1 now only sends a flush/redirect upstream if the prediction was wrong. Unconditional branches are always predicted to be taken, and conditional branches are predicted to be taken if and only if the offset is negative. Branches that take the branch address from a register (bclr, bcctr) are predicted not taken, as we don't have any way to predict the branch address. Since we can now have a mflr being executed immediately after a bl or bcl, we now track the update to LR in the hazard tracker, using the second write register field that is used to track RA updates for update-form loads and stores. For those branches that update LR but don't write any other result (i.e. that don't decrement CTR), we now write back LR in the same cycle as the instruction rather than taking a second cycle for the LR writeback. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	65a36cc0fc	decode: Work out ispr1/ispr2 in parallel with decode ROM lookup This makes the logic that calculates which SPRs are being accessed work in parallel with the instruction decode ROM lookup instead of being dependent on the opcode found in the decode ROM. The reason for doing that is that the path from icache through the decode ROM to the ispr1/ispr2 fields has become a critical path. Thus we are now using only a very partial decode of the instruction word in the logic for ispr1/ispr2, and we therefore can no longer rely on them being zero in all cases where no SPR is being accessed. Instead, decode2 now ignores ispr1/ispr2 in all cases except when the relevant decode.input_reg_a/b or decode.output_reg_a is set to SPR. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	6701e7346b	core: Use a busy signal rather than a stall This changes the instruction dependency tracking so that we can generate a "busy" signal from execute1 and loadstore1 which comes along one cycle later than the current "stall" signal. This will enable us to signal busy cycles only when we need to from loadstore1. The "busy" signal from execute1/loadstore1 indicates "I didn't take the thing you gave me on this cycle", as distinct from the previous stall signal which meant "I took that but don't give me anything next cycle". That means that decode2 proactively gives execute1 a new instruction as soon as it has taken the previous one (assuming there is a valid instruction available from decode1), and that then sits in decode2's output until execute1 can take it. So instructions are issued by decode2 somewhat earlier than they used to be. Decode2 now only signals a stall upstream when its output buffer is full, meaning that we can fill up bubbles in the upstream pipe while a long instruction is executing. This gives a small boost in performance. This also adds dependency tracking for rA updates by update-form load/store instructions. The GPR and CR hazard detection machinery now has one extra stage, which may not be strictly necessary. Some of the code now really only applies to PIPELINE_DEPTH=1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	49a4d9f67a	Add core logging This logs 256 bits of data per cycle to a ring buffer in BRAM. The data collected can be read out through 2 new SPRs or through the debug interface. The new SPRs are LOG_ADDR (724) and LOG_DATA (725). LOG_ADDR contains the buffer write pointer in the upper 32 bits (in units of entries, i.e. 32 bytes) and the read pointer in the lower 32 bits (in units of doublewords, i.e. 8 bytes). Reading LOG_DATA gives the doubleword from the buffer at the read pointer and increments the read pointer. Setting bit 31 of LOG_ADDR inhibits the trace log system from writing to the log buffer, so the contents are stable and can be read. There are two new debug addresses which function similarly to the LOG_ADDR and LOG_DATA SPRs. The log is frozen while either or both of the LOG_ADDR SPR bit 31 or the debug LOG_ADDR register bit 31 are set. The buffer defaults to 2048 entries, i.e. 64kB. The size is set by the LOG_LENGTH generic on the core_debug module. Software can determine the length of the buffer because the length is ORed into the buffer write pointer in the upper 32 bits of LOG_ADDR. Hence the length of the buffer can be calculated as 1 << (31 - clz(LOG_ADDR)). There is a program to format the log entries in a somewhat readable fashion in scripts/fmt_log/fmt_log.c. The log_entry struct in that file describes the layout of the bits in the log entries. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	afa82bea9c	decode2: Reformat to 4-space indentation Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	4a4a98d4b9	core: Do addpcis using the main adder (#189) By adding logic to decode2 to be able to send the instruction address down the A input, and making CONST_DX_HI (renamed to CONST_DXHI4) add 4 to the immediate value (easy since the bottom 16 bits were zero), we can do addpcis using the main adder. This reduces the width of the result mux and frees up one value in insn_type_t, since we can now use OP_ADD for addpcis. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Shawn Anastasio	e606772aeb	Implement the addpcis instruction This commit adds support for the addpcis instruction from ISA 3.0. A new input_reg_b_t type, CONST_DX_HI, was added to support the shifted immediate value used in DX-Form instructions. Signed-off-by: Shawn Anastasio <shawn@anastas.io>	5 years ago
Paul Mackerras	dd2e71930c	debug: Provide a way to examine GPRs, fast SPRs and MSR This provides commands on the debug interface to read the value of the MSR or any of the 64 GSPR register file entries. The GSPR values are read using the B port of the register file in a cycle when decode2 is not using it. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	167e37d667	Plumb insn_type through to loadstore1 In preparation for adding a TLB to the dcache, this plumbs the insn_type from execute1 through to loadstore1, so that we can have other operations besides loads and stores (e.g. tlbie) going to loadstore1 and thence to the dcache. This also plumbs the unit field of the decode ROM from decode2 through to execute1 to simplify the logic around which ops need to go to loadstore1. The load and store data formatting are now not conditional on the op being OP_LOAD or OP_STORE. This eliminates the inferred latches clocked by each of the bits of r.op that we were getting previously. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	5d85ede97d	dcache: Implement load-reserve and store-conditional instructions This involves plumbing the (existing) 'reserve' and 'rc' bits in the decode tables down to dcache, and 'rc' and 'store_done' bits from dcache to writeback. It turns out that we had 'RC' set in the 'rc' column for several ordinary stores and for the attn instruction. This corrects them to 'NONE', and sets the 'rc' column to 'ONE' for the conditional stores. In writeback we now have logic to set CR0 when the input from dcache has rc = 1. In dcache we have the reservation itself, which has a valid bit and the address down to cache line granularity. We don't currently store the reservation length. For a store conditional which fails, we set a 'cancel_store' signal which inhibits the write to the cache and prevents the state machine from starting a bus cycle or going to the STORE_WAIT_ACK state. Instead we set r1.stcx_fail which causes the instruction to complete in the next cycle with rc=1 and store_done=0. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Anton Blanchard	098f10136d	Fix a Diamond issue in decode2 By using a temporary we avoid a build issue in Diamond. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Paul Mackerras	5422007f83	Plumb loadstore1 input from execute1 not decode2 This allows us to use the bypass at the input of execute1 for the address and data operands for loadstore1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	b14d982011	execute: Implement bypass from output of execute1 to input This enables back-to-back execution of integer instructions where the first instruction writes a GPR and the second reads the same GPR. This is done with a set of multiplexers at the start of execute1 which enable any of the three input operands to be taken from the output of execute1 (i.e. r.e.write_data) rather than the input from decode2 (i.e. e_in.read_data[123]). This also requires changes to the hazard detection and handling. Decode2 generates a signal indicating that the GPR being written is available for bypass, which is true for instructions that are executed in execute1 (rather than loadstore1/dcache). The gpr_hazard module stores this "bypassable" bit, and if the same GPR needs to be read by a subsequent instruction, it outputs a "use_bypass" signal rather than generating a stall. The use_bypass signal is then latched at the output of decode2 and passed down to execute1 to control the input multiplexer. At the moment there is no bypass on the inputs to loadstore1, but that is OK because all load and store instructions are marked as single-issue. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	c9a2076dd3	execute1: Remember dest GPR, RC, OE, XER for slow operations For multiply and divide operations, execute1 now records the destination GPR number, RC and OE from the instruction, and the XER value. This means that the multiply and divide units don't need to record those values and then send them back to execute1. This makes the interface to those units a bit simpler. They simply report an overflow signal along with the result value, and execute1 takes care of updating XER if necessary. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	39d18d2738	Make divider hang off the side of execute1 With this, the divider is a unit that execute1 sends operands to and which sends its results back to execute1, which then sends them to writeback. Execute1 now sends a stall signal when it gets a divide or modulus instruction until it gets a valid signal back from the divider. Divide and modulus instructions are no longer marked as single-issue. The data formatting step that used to be done in decode2 for div and mod instructions is now done in execute1. We also do the absolute value operation in that same cycle instead of taking an extra cycle inside the divider for signed operations with a negative operand. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	2167186b5f	Make multiplier hang off the side of execute1 With this, the multiplier isn't a separate pipe that decode2 issues instructions to, but rather is a unit that execute1 sends operands to and which sends the result back to execute1, which then sends it to writeback. Execute1 now sends a stall signal when it gets a multiply instruction until it gets a valid signal back from the multiplier. This all means that we no longer need to mark the multiply instructions as single-issue. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
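
The LOG_ADDR layout described in commit 49a4d9f67a (buffer write pointer with the buffer length ORed in, held in the upper 32 bits; read pointer in the lower 32 bits) can be unpacked roughly as follows. This is a minimal Python sketch, not code from the repository. The names `clz64` and `decode_log_addr` are hypothetical, `clz` is assumed to count leading zeros over the full 64-bit value, and the length is assumed to be a power of two greater than the write pointer.

```python
def clz64(value):
    """Count leading zero bits of a 64-bit value (hypothetical helper)."""
    return 64 - (value & 0xFFFFFFFFFFFFFFFF).bit_length()


def decode_log_addr(log_addr):
    """Split a LOG_ADDR value into (length, write_ptr, read_word).

    Assumes the layout given in the commit message: the upper 32 bits hold
    the write pointer (in 32-byte entries) with the buffer length ORed in,
    and the lower 32 bits hold the read pointer (in 8-byte doublewords).
    """
    log_addr &= 0xFFFFFFFFFFFFFFFF
    upper = log_addr >> 32
    if upper == 0:
        raise ValueError("upper word is zero, so there is no length bit to find")
    # Formula quoted in the commit message: 1 << (31 - clz(LOG_ADDR)).
    length = 1 << (31 - clz64(log_addr))
    # Assumption: length is a power of two, so masking removes the length bit.
    write_ptr = upper & (length - 1)
    # Raw lower word; the freeze bit mentioned in the commit is not separated
    # out here because this sketch does not assume a bit-numbering convention.
    read_word = log_addr & 0xFFFFFFFF
    return length, write_ptr, read_word
```

For the default 2048-entry buffer, `decode_log_addr(((2048 | 5) << 32) | 16)` returns `(2048, 5, 16)`.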

84 Commits (d2cc31cde41428d46fd8bbe91d94ee4c520e93ba)
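
The static prediction rule stated in commit 6687aae4d6 (unconditional branches predicted taken, conditional branches predicted taken only when the offset is negative, register-target branches predicted not taken) can be sketched as below. This is an illustrative Python model under stated assumptions, not the VHDL in decode1. The function name `predict_taken` and the use of mnemonic family strings are hypothetical.

```python
def predict_taken(family, offset=0):
    """Return True if a branch would be predicted taken under the stated rule.

    family: one of "b", "bc", "bclr", "bcctr" (hypothetical encoding of the
            branch family; the real predictor works on instruction bits).
    offset: signed branch displacement, only consulted for "bc".
    """
    if family == "b":
        # Unconditional branches are always predicted taken.
        return True
    if family == "bc":
        # Conditional branches: taken if and only if the offset is negative.
        return offset < 0
    if family in ("bclr", "bcctr"):
        # Target comes from a register, so there is no address to predict.
        return False
    raise ValueError(f"unknown branch family: {family!r}")
```

A backward conditional branch such as a loop-closing `bc` with offset -16 is predicted taken, while the same instruction with offset +16 is not.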