microwatt

Commit Graph

Author	SHA1	Message	Date
Paul Mackerras	3510071d9a	Add a second execute stage to the pipeline This adds a second execute stage to the pipeline, in order to match up the length of the pipeline through loadstore and dcache with the length through execute1. This will ultimately enable us to get rid of the 1-cycle bubble that we currently have when issuing ALU instructions after one or more LSU instructions. Most ALU instructions execute in the first stage, except for count-zeroes and popcount instructions (which take two cycles and do some of their work in the second stage) and mfspr/mtspr to "slow" SPRs (TB, DEC, PVR, LOGA/LOGD, CFAR). Multiply and divide/mod instructions take several cycles but the instruction stays in the first stage (ex1) and ex1.busy is asserted until the operation is complete. There is currently a bypass from the first stage but not the second stage. Performance is down somewhat because of that and because this doesn't yet eliminate the bubble between LSU and ALU instructions. The forwarding of XER common bits has been changed somewhat because now there is another pipeline stage between ex1 and the committed state in cr_file. The simplest thing for now is to record the last value written and use that, unless there has been a flush, in which case the committed state (obtained via e_in.xerc) is used. Note that this fixes what was previously a benign bug in control.vhdl, where it was possible for control to forget an instruction's dependency on a value from a previous instruction (a GPR or the CR) if this instruction writes the value and the instruction gets to the point where it could issue but is blocked by the busy signal from execute1. In that situation, control may incorrectly not indicate that a bypass should be used. That didn't matter previously because, for ALU and FPU instructions, there was only one previous instruction in flight and once the current instruction could issue, the previous instruction was completing and the correct value would be obtained from register_file or cr_file. For loadstore instructions there could be two being executed, but because there are no bypass paths, failing to indicate use of a bypass path is fine. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Anton Blanchard	d3a7517318	divider: Fix d_out.overflow U state issue While we should only look at this when d_out.valid = 1, we may as well remove some U state across interfaces. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	2 years ago
Anton Blanchard	601f3211be	Reformat divider Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	4 years ago
Paul Mackerras	c9a2076dd3	execute1: Remember dest GPR, RC, OE, XER for slow operations For multiply and divide operations, execute1 now records the destination GPR number, RC and OE from the instruction, and the XER value. This means that the multiply and divide units don't need to record those values and then send them back to execute1. This makes the interface to those units a bit simpler. They simply report an overflow signal along with the result value, and execute1 takes care of updating XER if necessary. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	39d18d2738	Make divider hang off the side of execute1 With this, the divider is a unit that execute1 sends operands to and which sends its results back to execute1, which then sends them to writeback. Execute1 now sends a stall signal when it gets a divide or modulus instruction until it gets a valid signal back from the divider. Divide and modulus instructions are no longer marked as single-issue. The data formatting step that used to be done in decode2 for div and mod instructions is now done in execute1. We also do the absolute value operation in that same cycle instead of taking an extra cycle inside the divider for signed operations with a negative operand. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Anton Blanchard	f37ef56d79	Remove unused signal Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Paul Mackerras	5a0458dec1	divider: Fix overflow calculation We were signalling overflow when neg_result=1 but the result was zero. Fix this. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Benjamin Herrenschmidt	501b6daf9b	Add basic XER support The carry is currently internal to execute1. We don't handle any of the other XER fields. This creates a type called "xer_common_t" that contains the commonly used XER bits (CA, CA32, SO, OV, OV32). The value is stored in the CR file (though it could be a separate module). The rest of the bits will be implemented as a separate SPR and the two parts reconciled in mfspr/mtspr in later commits. We always read XER in decode2 (there is little point not to) and send it down all pipeline branches as it will be needed in writeback for all types of instructions when CR0:SO needs to be updated (such forms exist for all pipeline branches even if we don't yet implement them). To avoid having to track XER hazards, we forward it back in EX1. This assumes that other pipeline branches that can modify it (mult and div) are running single issue for now. One additional hazard to beware of is an XER:SO modifying instruction in EX1 followed immediately by a store conditional. Due to our writeback latency, the store will go down the LSU with the previous XER value, thus the stcx. will set CR0:SO using an obsolete SO value. I doubt there exists any code relying on this behaviour being correct but we should account for it regardless, possibly by ensuring that stcx. remains single issue initially, or later by adding some minimal tracking or moving the LSU into the same pipeline as execute. Missing some obscure XER affecting instructions like addex or mcrxrx. [paulus@ozlabs.org - fix CA32 and OV32 for OP_ADD, fix order of arguments to set_ov] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	374f4c536d	writeback: Do data formatting and condition recording in writeback This adds code to writeback to format data and test the result against zero for the purpose of setting CR0. The data formatter is able to shift and mask by bytes and do byte reversal and sign extension. It can also put together bytes from two input doublewords to support unaligned loads (including unaligned byte-reversed loads). The data formatter starts with an 8:1 multiplexer that is able to direct any byte of the input to any byte of the output. This lets us rotate the data and simultaneously byte-reverse it. The rotated/reversed data goes to a register for the unaligned cases that overlap two doublewords. Then there is per-byte logic that does trimming, sign extension, and splicing together bytes from a previous input doubleword (stored in data_latched) and the current doubleword. Finally the 64-bit result is tested to set CR0 if rc = 1. This removes the RC logic from the execute2, multiply and divide units, and the shift/mask/byte-reverse/sign-extend logic from loadstore2. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	82c19d4e7a	divider: Reduce delay in detecting 32-bit overflow Timing analysis showed that even with the output register, timing was still a bit tight in the output stage, where the carry has to propagate all the way through the 64-bit negater, and we were then testing the top 33 bits to determine if a 32-bit operation had overflowed. Instead of detecting overflow at the end, we watch for any 1 bits getting shifted into the top 32 bits of the quotient register as we are doing the division. That is relatively easy to do and simplifies the output stage. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	c7025f9f28	divider: Add an output register This puts the output of the divider through a register. With the addition of the logic to detect overflow, the combinatorial output logic of the divider was becoming a critical path. Adding the output register adds a cycle to the latency of the divider but helps make timing at 100MHz on the A7-100. This also makes the valid, write_reg_enable and write_cr_enable fields of the output be registered, which eliminates warnings about register/latch pins with no clock. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	d4f51e08c8	divider: Return 0 for invalid and overflow cases, like P9 does This adds logic to detect the cases where the quotient of the division overflows the range of the output representation, and return all zeroes in those cases, which is what POWER9 does. To do this, we extend the dividend register by 1 bit and we do an extra step in the division process to get a 2^64 bit of the quotient, which ends up in the 'overflow' signal. This catches all the cases where dividend >= 2^64 * divisor, including the case where divisor = 0, and the divde/divdeu cases where |RA| >= |RB|. Then, in the output stage, we also check that the result fits in the representable range, which depends on whether the division is a signed division or not, and whether it is a 32-bit or 64-bit division. If dividend >= 2^64 or the result doesn't fit in the representable range, write_data is set to 0 and write_cr_data to 0x20000000 (i.e. cr0.eq = 1). POWER9 sets the top 32 bits of the result to zero for 32-bit signed divisions, and sets CR0 when RC=1 according to the 64-bit value (i.e. CR0.LT is always 0 for 32-bit signed divisions, even if the 32-bit result is negative). However, modsw with a negative result sets the top 32 bits to all 1s. We follow suit. This updates divider_tb to check the invalid cases as well as the valid case. This also fixes a small bug where the reset signal for the divider was driven from rst when it should have been driven from core_rst. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	25b9450475	divider: Do absolute-value ops in divider instead of decode This moves the negation of negative operands for signed divide and modulus operations out of the decode2 stage and into the divider. If either of the operands for a signed divide or modulus operation is negative, the divider now takes an extra cycle to negate the operands that are negative. The interface to the divider now has an 'is_signed' signal rather than a 'neg_result' signal, and the dividend and divisor can be negative, so divider_tb had to be updated for the new interface. The reason for doing this is that one of the worst timing violations on the Arty A7-100 at 100MHz involved the carry chain in the adders that did the negation of the dividend and divisor in the decode stage. Moving the negations to a separate cycle fixes that and also seems to reduce the total number of slice LUTs used. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	e6536d4b8b	divider: Always compute result/sresult/d_out.write_reg_data These are intended to be combinatorial. The previous code was giving warnings in vivado about registers/latches with no clock defined. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	a01ffaeb64	Speed up the divider a little This looks for cases where the next 8 bits of the quotient are obviously going to be zero, because the top 72 bits of the 128-bit dividend register are all zero. In those cases we shift 8 zero bits into the quotient and increase count by 8. We only do this if count < 56. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	d5bc6c8824	Add a divider unit and a testbench for it This adds a divider unit, connected to the core in much the same way that the multiplier unit is connected. The division algorithm is very simple-minded, taking 64 clock cycles for any division (even 32-bit division instructions). The decoding is simplified by making use of regularities in the instruction encoding for div* and mod* instructions. Instead of having PPC_* encodings from the first-stage decoder for each of the different div* and mod* instructions, we now just have PPC_DIV and PPC_MOD, and the inputs to the divider that indicate what sort of division operation to do are derived from instruction word bits. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
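
The divider commits listed above (d5bc6c8824, a01ffaeb64, d4f51e08c8) describe a bit-serial shift-and-subtract divider, a shortcut that shifts in 8 zero quotient bits at once, and an extra step that produces a 2^64 quotient bit used as the overflow signal. The following is a minimal behavioural sketch of those three ideas in Python, not a transcription of divider.vhdl. The function name, the 129-bit register width, the exact zero-skip threshold and the returned tuple are assumptions made for illustration, and only unsigned operands are modelled.

```python
MASK64 = (1 << 64) - 1
MASK129 = (1 << 129) - 1


def divide_unsigned(dividend, divisor, extended=False, skip_zeros=True):
    """Model an unsigned 64-bit shift-and-subtract divide.

    Returns (quotient, remainder, overflow, steps). All names and widths
    here are hypothetical; they are not taken from the VHDL source.
    """
    # Extended divides (divdeu-style) place the dividend in the upper half.
    dend = (dividend << 64) if extended else dividend
    quot = 0
    count = 0   # number of quotient bits produced so far
    steps = 0   # number of modelled clock cycles
    # 65 quotient bits: the first one is the 2^64 bit, i.e. the overflow.
    while count < 65:
        steps += 1
        if (dend >> 64) >= divisor:
            # Subtract the divisor from the upper part, then shift in a 1.
            dend = ((dend - (divisor << 64)) << 1) & MASK129
            quot = (quot << 1) | 1
            count += 1
        elif skip_zeros and count < 56 and (dend >> 57) == 0:
            # The next 8 quotient bits must be zero: shift them in at once.
            dend <<= 8
            quot <<= 8
            count += 8
        else:
            dend = (dend << 1) & MASK129
            quot <<= 1
            count += 1
    overflow = bool((quot >> 64) & 1)
    if overflow:
        # Overflow and divide-by-zero cases report zero, as the log states.
        return 0, 0, True, steps
    # After 65 shifts the remainder sits above bit 64 of the register.
    return quot & MASK64, dend >> 65, False, steps


if __name__ == "__main__":
    print(divide_unsigned(100, 7))
    print(divide_unsigned(100, 7, skip_zeros=False))
    print(divide_unsigned(1, 0))
```

In this model the overflow bit is set exactly when dividend >= 2^64 * divisor, which covers a zero divisor and the extended case where the first operand is not smaller than the second. Passing skip_zeros=False always takes 65 steps, so comparing the step counts shows what the shortcut saves for small dividends.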

16 Commits (d0f319290fd22724a06b6db628aa7ee3458ca1bc)