microwatt

Commit Graph

Author	SHA1	Message	Date
Paul Mackerras	722f239c02	Reimplement quadword loads and stores This adds implementations of lq, plq, stq, pstq, lqarx and stqcx. Because register file addresses are now computed in decode1 before we have the decode table entry for the instruction, we have to check the icode directly to know when to read register RS|1 before RS (i.e. for stq and stqcx in LE mode, but not pstq). For the second instance of the instruction, loadstore1 uses the EA from the first instance + 8. It generates an alignment interrupt for unaligned lqarx and stqcx and for lq in LE mode with an unaligned address. (The reason for the latter case is that it writes RT|1 before RT, and if we have RA = RT|1 and the second instance traps, we will have overwritten RA.) Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 weeks ago
Paul Mackerras	fa9df33f7e	Implement cfuged, pdepd and pextd This implements the cfuged, pdepd and pextd instructions in a new unit called bit_sorter (so called because cfuged and pextd can be viewed as sorting the bits of the mask). The cnt* instructions and the popcnt* instructions now use the same OP_COUNTB insn_type so as to free up an insn_type value to use for the new instructions. The new instructions are implemented using a slow and simple algorithm that takes 64 cycles to compute the result. The ex1 stage is stalled while this happens, as for a 64-bit multiply, or for a divide when there is no FPU. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	1 month ago
Paul Mackerras	d7d7a3afd4	Implement VRSAVE SPR VRSAVE is a 32-bit software-use SPR accessible in user mode. It is stored in the SPR RAM. The value read from the RAM is trimmed to 32 bits at the ramspr_read process. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	1 month ago
Paul Mackerras	d112a7ad94	Implement scv and rfscv The main quirk here is that scv sets LR and CTR instead of SRR0 and SRR1, and likewise rfscv uses LR and CTR. Also, scv uses a set of 128 interrupt vectors starting at 0x17000. Fortunately, the layout of the SPR RAM was already such that LR and CTR were in the even and odd halves respectively at the same index, so reading or writing LR and CTR instead of SRR0 and SRR1 is quite easy. Use of scv is subject to an FSCR bit but not an HFSCR bit. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	1 month ago
Paul Mackerras	a88fa9c459	Implement DSCR The DSCR (Data Stream Control Register) is a user-accessible SPR that controls aspects of data prefetching. It has 25 bits of state defined in the ISA. This implements the register as 25 read/write bits that do nothing, since we don't have any prefetching. The DSCR is accessible at two SPR numbers, 3 (unprivileged) and 17 (privileged). Access via these SPR numbers is controlled by an FSCR bit and an HFSCR bit. The FSCR bit controls access via SPR 3 in user mode. The HFSCR bit controls access via SPR 3 in user mode and either SPR number in privileged non-hypervisor mode, but since we don't implement privileged non-hypervisor mode, it does essentially the same thing as the FSCR bit. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	1 month ago
Paul Mackerras	205c0e2c78	Implement the wait instruction This implements the behaviour of the 'wait 0' instruction of pausing execution of instructions until an exception arises. The exceptions that terminate a wait are a pending trace exception, external interrupt request, PMU interrupt request, or decrementer negative exception. These exception conditions terminate a wait even if not enabled to generate an interrupt (e.g. if MSR[EE] is zero). This is implemented by having execute1 assert its busy_out signal while the wait state exists. The wait state is set by the completion of the wait instruction and cleared by a pending exception. If the WC operand of the wait instruction is non-zero, indicating wait for reservation loss or wait for a short period, then the wait instruction does not wait, but just acts as a no-op. In order to make space in the insn_type_t type without going over 64 elements, this combines OP_DCBT and OP_ICBT into a single OP_XCBT, since they were both no-ops (except for their influence on how SRR1 is set on a trace interrupt, where they were identical). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	1 month ago
Paul Mackerras	7bc7f335f1	Implement CTRL register The CTRL register has a single bit called RUN. It has some unusual behaviours: - It can only be written via SPR number 152, which is privileged - It can only be read via SPR number 136, which is non-privileged - Reading in problem state (user mode) returns the RUN bit in bit 0, but reading in privileged state (hypervisor mode) returns the RUN bit in bits 0 and 15. - Reading SPR 152 in problem state causes a HEAI (illegal instruction) interrupt, but reading in privileged state is a no-op; this is the same as for an unimplemented SPR. The RUN bit goes to the PMU and is also plumbed out to drive a LED on the Arty board. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	1 month ago
Paul Mackerras	ff0744b795	execute1: Make CFAR able to be written using mtspr and read using DMI debug mtspr to CFAR is currently a no-op, which is not what should happen. Make it set the contents of CFAR. Also provide access to CFAR via the DMI debug interface as register 0x31. Fixes: `c2da82764f` ("core: Implement CFAR register", 2020-06-15) Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	1 month ago
Paul Mackerras	d2777dd1dd	Generate Hypervisor Emulation Assistance Interrupt for illegal instructions This implements the HEIR register (Hypervisor Emulation Instruction Register) and arranges for an illegal instruction to cause a Hypervisor Emulation Assistance Interrupt (HEAI) at vector 0xE40, and set HEIR to the illegal instruction. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	1 month ago
Paul Mackerras	e3f4ccedec	Implement facility unavailable and hypervisor facility unavailable interrupts This adds the FSCR and HFSCR registers and implements the associated behaviours of taking a facility unavailable or hypervisor facility unavailable interrupt if certain actions are attempted while the relevant [H]FSCR bit is zero. At present, two FSCR enable bits and three HFSCR enable bits are implemented. FSCR has bits for prefixed instructions and accesses to the TAR register, and HFSCR has those plus a bit that enables access to floating-point registers and instructions. FSCR and HFSCR can be accessed through the debug interface using register addresses 0x2e and 0x2f. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	1 month ago
Paul Mackerras	12a3d76217	Implement hrfid and make MSR[HV] always 1 Implementations without hypervisor/LPAR support are permitted by the architecture, but should have MSR[HV] forced to be 1 at all times, not 0, and should implement various instructions and registers that are only accessible in hypervisor mode. This commit implements MSR[HV] as a constant 1 bit and adds the hrfid instruction, which behaves exactly the same as rfid except that it reads HSRR0/1 instead of SRR0/1. We already have HSRR0/1 and HSPRG0/1 implemented. When HV=1, Linux expects external interrupts to arrive as hypervisor interrupts, so this adds support for hypervisor interrupts (i.e., those that set HSRR0/1) and makes the external interrupt be a hypervisor interrupt. (If we had an LPCR register, the LPES bit would control this, but we don't.) The xics test is updated to read HSRR0/1 after an external interrupt. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	1 month ago
Paul Mackerras	6ef9395f10	Remove vestiges of the short (16-bit) multiplier option (#432) These aren't needed, and should have been removed in `d1e8e62fee` ("Remove option for "short" 16x16 bit multiplier", 2022-07-19), but were missed. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	1 month ago
Paul Mackerras	1c4b5def36	Improve timing of redirect_nia going from writeback to fetch1 This gets rid of the adder in writeback that computes redirect_nia. Instead, the main adder in the ALU is used to compute the branch target for relative branches. We now decode b and bc differently depending on the AA field, generating INSN_brel, INSN_babs, INSN_bcrel or INSN_bcabs as appropriate. Each one has a separate entry in the decode table in decode1; the *rel versions use CIA as the A input. The bclr/bcctr/bctar and rfid instructions now select ramspr_result for the main result mux to get the redirect address into ex1.e.write_data. For branches which are predicted taken but not actually taken, we need to redirect to the following instruction. We also need to do that for isync. We do this in the execute2 stage since whether or not to do it depends on the branch result. The next_nia computation is moved to the execute2 stage and comes in via a new leg on the secondary result multiplexer, making next_nia available ultimately in ex2.e.write_data. This also means that the next_nia leg of the primary result multiplexer is gone. Incrementing last_nia by 4 for sc (so that SRR0 points to the following instruction) is also moved to execute2. Writing CIA+4 to LR was previously done through the main result multiplexer. Now it comes in explicitly in the ramspr write logic. Overall this removes the br_offset and abs_br fields and the logic to add br_offset and next_nia, and one leg of the primary result multiplexer, at the cost of a few extra control signals between execute1 and execute2 and some multiplexing for the ramspr write side and an extra input on the secondary result multiplexer. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	1 year ago
Paul Mackerras	b50170cd1d	Implement byte reversal instructions This implements the byte-reverse halfword, word and doubleword instructions: brh, brw, and brd. These instructions were added to the ISA in version 3.1. They use a new OP_BREV insn_type value. The logic for these instructions is implemented in logical.vhdl. In order to avoid going over 64 insn_type values, OP_AND and OP_OR were combined into OP_LOGIC, which is like OP_AND except that the RS input can be inverted as well as the RB input. The various forms of OR instruction are then implemented using the identity a OR b = NOT (NOT a AND NOT b) The 'is_signed' field of the instruction decode table is used to indicate that RS should be inverted. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	1 year ago
Paul Mackerras	fd8c0000c0	Implement set[n]bc[r] instructions This implements the setbc, setnbc, setbcr and setnbcr instructions. Because the insn_type_t type already has 64 elements, this uses the existing OP_SETB for the new instructions, and has execute1 compute different results depending on bits 6-9 of the instruction. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	1 year ago
Paul Mackerras	c4492c843a	Implement interrupts for prefixed instructions This arranges to generate an illegal instruction type program interrupt for illegal prefixed instructions, that is, those where the suffix is not a legal value given the prefix, or the prefix has a reserved value in the subtype field. This implementation doesn't generate an interrupt for the invalid 8LS:D and MLS:D instruction forms where R = 1 and RA != 0. (In those cases it uses (RA) as the addend, i.e. it ignores the R bit.) This detects the case where the address of an instruction prefix is equal mod 64 to 60, and generates an alignment interrupt in that case. This also arranges to set bit 34 of SRR1 when an interrupt occurs due to a prefixed instruction, for those interrupts where that is required (i.e. trace, alignment, floating-point unavailable, data storage, data segment, and most cases of program interrupt). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	7af0e001ad	Move insn_codes for mcrfs, mtfsb0/1 and mtfsfi This moves the insn_code values for mcrfs, mtfsb0/1 and mtfsfi into the region used for floating-point instructions. This means that in no-FPU implementations, they will get turned into illegal instructions in predecode. We then don't need the code in execute1 that makes FP instructions illegal in no-FPU implementations. We also remove the NONE value for unit_t, since it was only ever used with insn_type = OP_ILLEGAL, and the check for unit = NONE was redundant with the check for insn_type = OP_ILLEGAL. Thus the check for unit = NONE is no longer needed and is removed here. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	6fa468ca3d	execute1: Reduce metavalue warnings Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	b0aa5340b8	execute1: Make it clear that divide logic is not included when HAS_FPU=true This adds a "not HAS_FPU" condition in a few places to make it obvious that logic to interface to the divide unit is not included when we have an FPU. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	d1e8e62fee	Remove option for "short" 16x16 bit multiplier Now that we have a 33 bit x 33 bit signed multiplier in execute1, there is really no need for the 16 bit multiplier. The coremark results are just as good without it as with it. This removes the option for the sake of simplicity. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	e02d8060ed	Change the multiplier interface to support signed multipliers This adds an 'is_signed' signal to MultiplyInputType to indicate whether the data1 and data2 fields are to be interpreted as signed or unsigned numbers. The 'not_result' field is replaced by a 'subtract' field which provides a more intuitive interface for requesting that the product be subtracted from the addend rather than added, i.e. subtract = 1 gives C - A * B, vs. subtract = 0 giving C + A * B. (Previously the users of the multipliers got the same effect by complementing the addend and setting not_result = 1.) The is_32bit field is removed because it is no longer used now that we have a separate 32-bit multiplier. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	595a758400	execute1: Add a pipelined 33-bit signed multiplier This adds a pipelined 33-bit by 33-bit signed multiplier with one cycle latency to the execute pipeline, and uses it for the mullw, mulhw and mulhwu instructions. Because it has one cycle of latency we can assume that its result is available in the second execute stage without needing to add busy logic to the second stage. This adds both a generic version of the multiplier and a Xilinx-specific version using four DSP slices of the Artix-7. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	6db626d245	icache: Log 36 bits of instruction rather than 32 This expands the field in the log buffer that stores the instruction fetched from the icache to 36 bits, so that we get the insn_code and illegal instruction indication. To do this, we reclaim 3 unused bits from execute1's portion and one other unused bit (previously just set to 0 in core.vhdl). This also alters the trigger behaviour to stop after one quarter of the log buffer has been filled with samples after the trigger, or 256 entries, whichever is less. This is to ensure that the trigger event doesn't get overwritten when the log buffer is small. This updates fmt_log to the new log format. Valid instructions are printed as a decimal insn_code value followed by the bottom 26 bits of the instruction. Illegal instructions are printed as "ill" followed by the full 32 bits of the instruction. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	26dc1e879c	Eliminate use of primary opcode outside of decode1 This changes code that previously looked at the primary opcode (bits 26 to 31) of the instruction to use other methods, in places other than in stage0 of decode1. * Extend rc_t to have a new value, RCOE, indicating that the instruction has both Rc and OE bits. * Decode2 now tells execute1 whether the instruction has a third operand, used for distinguishing between multiply and multiply-add instructions. * The invert_a field of the decode ROM is overloaded for load/store instructions to indicate cache-inhibited loads and stores. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	932da4c114	FPU: Simplify IDLE state code Do more decoding of the instruction ahead of the IDLE state processing so that the IDLE state code becomes much simpler. To make the decoding easier, we now use four insn_type_t codes for floating-point operations rather than two. This also rearranges the insn_type_t values a little to get the 4 FP opcode values to differ only in the bottom 2 bits, and put OP_DIV, OP_DIVE and OP_MOD next to them. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	7a60c118ed	loadstore1: Simplify address generation in OP_FETCH_FAILED case Instead of having a multiplexer in loadstore1 in order to be able to put the instruction address into v.addr, we now set decode.input_reg_a to CIA in the decode table entry for OP_FETCH_FAILED. That means that the operand selection machinery in decode2 will supply the instruction address to loadstore1 on the lv.addr1 input and no special case is needed in loadstore1. This saves a few LUTs (~40 on the Artix-7). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	939c7e39dd	execute1: Fix trace interrupt on sc instruction This fixes a bug which causes a trace interrupt to store the wrong value in SRR0 in the case where the instruction that has just completed is followed by a sc (system call) instruction. What happens is that first the traced instruction sets ex1.trace_next. Then, when the sc instruction following it comes in, the execute1_actions process sets v.e.last_nia to next_nia because it is an sc instruction, even though it is not going to be executed -- we are going to take the trace interrupt instead. Then when the trace interrupt is taken, we incorrectly set SRR0 to the incremented address (the address of the instruction following the sc). To fix this, we have execute1_actions set a new flag if the current instruction is sc, and only set v.e.last_nia to next_nia if we actually execute the sc (in the "if go = '1'" case). Fixes: `813e2317bf` ("execute1: Restructure to separate out execution of side effects", 2022-06-18) Reported-by: Anton Blanchard <anton@linux.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Michael Neuling	e440db13d7	Metavalue cleanup for execute1.vhdl Signed-off-by: Michael Neuling <mikey@neuling.org>	3 years ago
Michael Neuling	caf458be37	Metavalue cleanup for common.vhdl This affects other files which have been included here. Signed-off-by: Michael Neuling <mikey@neuling.org>	3 years ago
Paul Mackerras	d0f319290f	Restore debug access to SPRs This provides access to the SPRs via the JTAG DMI interface. For now they are still accessed as if they were GPR/FPRs using the same numbering as before (GPRs at 0 - 0x1f, SPRs at 0x20 - 0x2d, FPRs at 0x40 - 0x5f). For XER, debug reads now report the full value, not just the bits that were previously stored in the register file. The "slow" SPR mux is not used for debug reads. Decode2 determines on each cycle whether a debug SPR access will happen next cycle, based on whether there is a request and whether the current instruction accesses the SPR RAM. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	fdb3ef6874	Finish off taking SPRs out of register file With this, the register file now contains 64 entries, for 32 GPRs and 32 FPRs, rather than the 128 it had previously. Several things get simplified - decode1 no longer has to work out the ispr{1,2,o} values, decode_input_reg_{a,b,c} no longer have the t = SPR case, etc. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	337b104250	Move LR, CTR and TAR out of the register file By putting CTR on the odd side and LR and TAR on the even side, we can read and write CTR for bdnz-style instructions in parallel with reading LR or TAR for indirect branches and writing LR for branches with LK=1. Thus we don't need to double up any of these instructions, giving a simplification in decode2. We now have logic for printing LR and CTR at the end of a simulation in execute1, in addition to the similar logic in register_file and cr_file. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	bc4d02cb0d	Start removing SPRs from register file This starts the process of removing SPRs from the register file by moving SRR0/1, SPRG0-3, HSRR0/1 and HSPRG0/1 out of the register file and putting them into execute1. They are stored in a pair of small RAM arrays, referred to as "even" and "odd". The reason for having two arrays is so that two values can be read and written in each cycle. For example, SRR0 and SRR1 can be written in parallel by an interrupt and read in parallel by the rfid instruction. The addresses in the RAM which will be accessed are determined in the decode2 stage. We have one write address for both sides, but two read addresses, since in future we will want to be able to read CTR at the same time as either LR or TAR. We now have a connection from writeback to execute1 which carries the partial SRR1 value for an interrupt. SRR0 comes from the execute pipeline; we no longer need to carry instruction addresses along the LSU and FPU pipelines. Since SRR0 and SRR1 can be written in the same cycle now, we don't need the little state machine in writeback any more. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	73cc5167ec	Use FPU for division instructions if we have an FPU - Arrange for XER to be written for OE=1 forms - Arrange for condition codes to be set for RC=1 forms (including correct handling for 32-bit mode) - Don't instantiate the divider if we have an FPU. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	a95f8aab38	FPU: Add integer division logic to FPU This adds logic to the FPU to accomplish 64-bit integer divisions. No instruction actually uses this yet. The algorithm used is to obtain an estimate of the reciprocal of the divisor using the lookup table and refine it by one to three iterations of the Newton-Raphson algorithm (the number of iterations depends on the number of significant bits in the dividend). Then the reciprocal is multiplied by the dividend to get the quotient estimate. The remainder is calculated as dividend - quotient * divisor. If the remainder is greater than or equal to the divisor, the quotient is incremented, or if a modulo operation is being done, the divisor is subtracted from the remainder. The inverse estimate after refinement is good enough that the quotient estimate is always equal to or one less than the true quotient. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	d1850fea29	Track hazards explicitly for XER overflow bits This provides a mechanism for tracking updates to the XER overflow bits (SO, OV, OV32) and stalling instructions which need current values of those bits (mfxer, integer compare instructions, integer Rc=1 instructions, addex) or which write carry bits (since all the XER common bits are written together, if we are writing CA/CA32 we need up-to-date values of SO/OV/OV32). This will enable updates to SO/OV/OV32 to be done at other places besides the ex1 stage. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	9a8a8e50f8	FPU: Add stage-2 stall ability to FPU This makes the FPU able to stall other units at execute stage 2 and be stalled by other units (specifically the LSU). This means that the completion and writeback for an instruction can now end up being deferred until the second cycle of a following instruction, i.e. the cycle when the state machine has gone through IDLE state into one of the DO_* states, which means we need to latch the destination FPR number, CR mask, etc. from the previous instruction so that we present the correct information to writeback. The advantage of this is that we can get rid of the in_progress signal from the LSU. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	ef122868d5	Do CR0 setting for Rc=1 instructions in execute2 instead of writeback This lets us forward the CR0 result to following instructions that use CR, meaning they get to issue one cycle earlier. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	e030a500e8	Allow integer instructions and load/store instructions to execute together Execute1 and loadstore1 now send each other stall signals that indicate that a valid instruction in stage 2 can't complete in this cycle, and hence any valid instruction in stage 1 in the other unit can't move to stage 2. With this in place, an ALU instruction can move into stage 1 while a LSU instruction is in stage 2. Since the FPU doesn't yet have a way to stall completion, we can't yet start FPU instructions while any LSU or ALU instruction is in progress. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	4b6148ada6	Add a bypass path from the execute2 stage This enables some instructions to issue earlier and thus improves performance, at the cost of some extra multiplexers in decode2. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	3510071d9a	Add a second execute stage to the pipeline This adds a second execute stage to the pipeline, in order to match up the length of the pipeline through loadstore and dcache with the length through execute1. This will ultimately enable us to get rid of the 1-cycle bubble that we currently have when issuing ALU instructions after one or more LSU instructions. Most ALU instructions execute in the first stage, except for count-zeroes and popcount instructions (which take two cycles and do some of their work in the second stage) and mfspr/mtspr to "slow" SPRs (TB, DEC, PVR, LOGA/LOGD, CFAR). Multiply and divide/mod instructions take several cycles but the instruction stays in the first stage (ex1) and ex1.busy is asserted until the operation is complete. There is currently a bypass from the first stage but not the second stage. Performance is down somewhat because of that and because this doesn't yet eliminate the bubble between LSU and ALU instructions. The forwarding of XER common bits has been changed somewhat because now there is another pipeline stage between ex1 and the committed state in cr_file. The simplest thing for now is to record the last value written and use that, unless there has been a flush, in which case the committed state (obtained via e_in.xerc) is used. Note that this fixes what was previously a benign bug in control.vhdl, where it was possible for control to forget an instruction's dependency on a value from a previous instruction (a GPR or the CR) if this instruction writes the value and the instruction gets to the point where it could issue but is blocked by the busy signal from execute1. In that situation, control may incorrectly not indicate that a bypass should be used. That didn't matter previously because, for ALU and FPU instructions, there was only one previous instruction in flight and once the current instruction could issue, the previous instruction was completing and the correct value would be obtained from register_file or cr_file. For loadstore instructions there could be two being executed, but because there are no bypass paths, failing to indicate use of a bypass path is fine. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	521a5403a9	execute1: Rename 'r' to 'ex1' Maybe this will give us slightly better names in critical path reports and the like. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	813e2317bf	execute1: Restructure to separate out execution of side effects We now have a record that represents the actions taken in executing an instruction, and a process that computes that for the incoming instruction. We no longer have 'current' or 'r.cur_instr', instead things like the destination register are put into r.e in the first cycle of an instruction and not reinitialized in subsequent busy cycles. For mfspr and mtspr, we now decode "slow" SPR numbers (those SPRs that are not stored in the register file) to a new "spr_selector" record in decode1 (excluding those in the loadstore unit). With this, the result for mfspr is determined in the data path. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	204fedc63f	Move XER low bits out of register file Besides the overflow and status carry bits, XER has 18 bits which need to retain the value written by mtxer (in case software wants to emulate the move-assist instructions (lswi, lswx, stswi, stswx)). Until now these bits (and others) have been stored in the GPR file as a "fast" SPR, but this causes complications because XER is not really a fast SPR. Instead, we now store these 18 bits in the 'ctrl' signal, which exists in execute1. This will enable us to simplify the data path in future, and has the added bonus that with a little bit of plumbing, we can get the full XER value printed when dumping registers at the end of a simulation. Therefore this changes scripts/run_test.sh to remove the greps which exclude XER from the comparison of actual and expected register results. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Anton Blanchard	843361f2be	execute1: sub_mux_sel and result_mux_sel are unused Remove them. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	3 years ago
Anton Blanchard	a750365ffa	Remove some FPGA style signal inits These don't work on the ASIC flow, so remove them and initialise them explicitly where required. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	3 years ago
Paul Mackerras	2491aa7fc5	core: Make popcnt* take two cycles This moves the calculation of the result for popcnt* into the countbits unit, renamed from countzero, so that we can take two cycles to get the result. The motivation for this is that the popcnt* calculation was showing up as a critical path. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	734e4c4a52	core: Add a short multiplier This adds an optional 16 bit x 16 bit signed multiplier and uses it for multiply instructions that return the low 64 bits of the product (mull[dw][o] and mulli, but not maddld) when the operands are both in the range -2^15 .. 2^15 - 1. The "short" 16-bit multiplier produces its result combinatorially, so a multiply that uses it executes in one cycle. This improves the coremark result by about 4%, since coremark does quite a lot of multiplies and they almost all have operands that fit into 16 bits. The presence of the short multiplier is controlled by a generic at the execute1, SOC, core and top levels. For now, it defaults to off for all platforms, and can be enabled using the --has_short_mult flag to fusesoc. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	a68921edca	core: Fix mcrxrx, addpcis and bpermd - mcrxrx put the bits in the wrong order - addpcis was setting CR0 if the instruction bit 0 = 1, which it shouldn't - bpermd was producing 0 always and additionally had the wrong bit numbering Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	65c43b488b	PMU: Add several more events This implements most of the architected PMU events. The ones missing are mostly the ones that depend on which level of the cache hierarchy data is fetched from. The events implemented here, and their raw event codes, are: Floating-point operation completed (100f4) Load completed (100fc) Store completed (200f0) Icache miss (200fc) ITLB miss (100f6) ITLB miss resolved (400fc) Dcache load miss (400f0) Dcache load miss resolved (300f8) Dcache store miss (300f0) DTLB miss (300fc) DTLB miss resolved (200f6) No instruction available and none being executed (100f8) Instruction dispatched (200f2, 300f2, 400f2) Taken branch instruction completed (200fa) Branch mispredicted (400f6) External interrupt taken (200f8) Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
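
Commit fa9df33f7e above describes cfuged, pdepd and pextd as being computed by a "slow and simple algorithm" in the bit_sorter unit. As a rough behavioural illustration only, the following Python sketch models what the three instructions compute, handling one mask bit per loop iteration. It is not derived from the VHDL; the function names, the `MASK64` constant and the LSB-first loop order are assumptions made for this example.

```python
# Behavioural sketch of the three bit-sorting instructions (not the VHDL).
# All names here are hypothetical; bit 0 means the least significant bit.
MASK64 = (1 << 64) - 1


def pextd(rs: int, mask: int) -> int:
    """Gather the bits of rs selected by mask into the low-order result bits."""
    result = 0
    out_pos = 0
    for i in range(64):  # one mask bit per iteration
        if (mask >> i) & 1:
            result |= ((rs >> i) & 1) << out_pos
            out_pos += 1
    return result


def pdepd(rs: int, mask: int) -> int:
    """Scatter the low-order bits of rs to the positions selected by mask."""
    result = 0
    in_pos = 0
    for i in range(64):
        if (mask >> i) & 1:
            result |= ((rs >> in_pos) & 1) << i
            in_pos += 1
    return result


def cfuged(rs: int, mask: int) -> int:
    """Centrifuge: bits under mask 1s go low, bits under mask 0s go high."""
    mask &= MASK64
    ones = pextd(rs, mask)
    zeros = pextd(rs, ~mask & MASK64)
    n_ones = bin(mask).count("1")
    return ((zeros << n_ones) | ones) & MASK64


if __name__ == "__main__":
    rs, mask = 0b10110010, 0b11110000
    print(bin(pextd(rs, mask)), bin(pdepd(0b1011, mask)), bin(cfuged(rs, mask)))
```

In this model pdepd undoes pextd on the masked bits, so `pdepd(pextd(x, m), m) == x & m` for any 64-bit `x` and `m`.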


200 Commits (722f239c025e55bb45e477ca70f8f6500d7801b8)
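
Commit a95f8aab38 in the log above explains that the FPU's integer division produces a quotient estimate that is always equal to, or one less than, the true quotient, and then corrects it using the remainder. The Python sketch below illustrates only that final correction step for unsigned operands. The reciprocal lookup table and Newton-Raphson refinement are not modelled, the name `finish_division` is hypothetical, and the function updates both quotient and remainder for convenience, whereas the commit message describes adjusting whichever one the operation needs.

```python
# Sketch of the correction step described in commit a95f8aab38 (unsigned case).
# Assumption: q_est is either the true quotient or exactly one less than it.
def finish_division(dividend: int, divisor: int, q_est: int) -> tuple[int, int]:
    """Return (quotient, remainder) given a quotient estimate."""
    remainder = dividend - q_est * divisor
    if remainder >= divisor:
        # The estimate was one too small: bump the quotient and
        # take one divisor off the remainder.
        q_est += 1
        remainder -= divisor
    return q_est, remainder


if __name__ == "__main__":
    dividend, divisor = 1_000_003, 97
    true_q = dividend // divisor
    for estimate in (true_q, true_q - 1):
        # Both admissible estimates converge on the same answer.
        assert finish_division(dividend, divisor, estimate) == divmod(dividend, divisor)
```

A single conditional correction suffices only because of the stated guarantee on the estimate; a less accurate estimate would need a loop instead.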