microwatt

Commit Graph

Author	SHA1	Message	Date
Paul Mackerras	1037c6aa2e	core: Implement mtmsr instruction This is like mtmsrd except it only alters the lower 32 bits of the MSR. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	dc1544db69	FPU: Implement floating multiply-add instructions This implements fmadd, fmsub, fnmadd, fnmsub and their single-precision counterparts. The single-precision versions operate the same as the double-precision versions until the final rounding and overflow/underflow steps. This adds an S register to store the low bits of the product. S shifts into R on left shifts, and can be negated, but doesn't do any other arithmetic. This adds a test for the double-precision versions of these instructions. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	c083b9507d	FPU: Implement ftdiv and ftsqrt Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	c350bc1f25	FPU: Implement fsqrt[s] and add a test for fsqrt This implements the floating square-root calculation using a table lookup of the inverse square root approximation, followed by three iterations of Goldschmidt's algorithm, which gives estimates of both sqrt(FRB) and 1/sqrt(FRB). Then the residual is calculated as FRB - R * R and that is multiplied by the 1/sqrt(FRB) estimate to get an adjustment to R. The residual and the adjustment can be negative, and since we have an unsigned multiplier, the upper bits can be wrong. In practice the adjustment fits into an 8-bit signed value, and the bottom 8 bits of the adjustment product are correct, so we sign-extend them, divide by 4 (because R is in 10.54 format) and add them to R. Finally the residual is calculated again and compared to 2*R+1 to see if a final increment is needed. Then the result is rounded and written back. This implements fsqrts as fsqrt, but with rounding to single precision and underflow/overflow calculation using the single-precision exponent range. This could be optimized later. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	394f993e75	FPU: Implement frsqrte[s] and a test for frsqrte This implements frsqrte by table lookup. We first normalize the input if necessary and adjust so that the exponent is even, giving us a mantissa value in the range [1.0, 4.0), which is then used to look up an entry in a 768-entry table. The 768 entries are appended to the table for reciprocal estimates, giving a table of 1024 entries in total. frsqrtes is implemented identically to frsqrte. The estimate supplied is accurate to 1 part in 1024 or better. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	49f3d1e77a	FPU: Implement fcmpu and fcmpo Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	4cd9301da6	FPU: Implement fsel Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	4ad5ab9203	FPU: Implement fre[s] This just returns the value from the inverse lookup table. The result is accurate to better than one part in 512 (the architecture requires 1/256). This also adds a simple test, which relies on the particular values in the inverse lookup table, so it is not a general test. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	9cce936251	FPU: Implement fdiv[s] This implements floating-point division A/B by a process that starts with normalizing both inputs if necessary. Then an estimate of 1/B from a lookup table is refined by 3 Newton-Raphson iterations and then multiplied by A to get a quotient. The remainder is calculated as A - R * B (where R is the result, i.e. the quotient) and the remainder is compared to 0 and to B to see whether the quotient needs to be incremented by 1. The calculations of 1 / B are done with 56 fraction bits and intermediate results are truncated rather than rounded, meaning that the final estimate of 1 / B is always correct or a little bit low, never too high, and thus the calculated quotient is correct or 1 unit too low. Doing the estimate of 1 / B with sufficient precision that the quotient is always correct to the last bit without needing any adjustment would require many more bits of precision. This implements fdivs by computing a double-precision quotient and then rounding it to single precision. It would be possible to optimize this by e.g. doing only 2 iterations of Newton-Raphson and then doing the remainder calculation and adjustment at single precision rather than double precision. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	e6a5f237bc	FPU: Implement fmul[s] This implements the fmul and fmuls instructions. For fmul[s] with denormalized operands we normalize the inputs before doing the multiplication, to eliminate the need for doing count-leading-zeroes on P. This adds 3 or 5 cycles to the execution time when one or both operands are denormalized. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	86b826cd7e	FPU: Implement fadd[s] and fsub[s] and add tests for them Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	4807d0bdb6	FPU: Implement fmrgew and fmrgow and add tests for them Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	0ad2aa3014	FPU: Implement floating round-to-integer instructions This implements frin, friz, frip and frim, and adds tests for them. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	03d1aa968a	FPU: Implement floating convert to integer instructions This implements fctiw, fctiwz, fctiwu, fctiwuz, fctid, fctidz, fctidu and fctiduz, and adds tests for them. There are some subtleties around the setting of the inexact (XX) and invalid conversion (VXCVI) flags in the FPSCR. If the rounded value ends up being out of range, we need to set VXCVI and not XX. For a conversion to unsigned word or doubleword of a negative value that rounds to zero, we need to set XX and not VXCVI. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	34b5d4a7b5	FPU: Implement the frsp instruction This brings in the invalid exception for the case of frsp with a signalling NaN as input, and the need to be able to convert a signalling NaN to a quiet NaN. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	9e8fb293ed	FPU: Implement floating convert from integer instructions This implements fcfid, fcfidu, fcfids and fcfidus, which convert 64-bit integer values in an FPR into a floating-point value. This brings in a lot of the datapath that will be needed in future, including the shifter, adder, mask generator and count-leading-zeroes logic, along with the machinery for rounding to single-precision or double-precision, detecting inexact results, signalling inexact-result exceptions, and updating result flags in the FPSCR. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	b628af6176	FPU: Implement fmr and related instructions This implements fmr, fneg, fabs, fnabs and fcpsgn and adds tests for them. This adds logic to unpack and repack floating-point data from the 64-bit packed form (as stored in memory and the register file) into the unpacked form in the fpr_reg_type record. This is not strictly necessary for fmr et al., but will be useful for when we do actual arithmetic. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	fc2968f132	FPU: Implement remaining FPSCR-related instructions This implements mcrfs, mtfsfi, mtfsb0/1, mffscr, mffscrn, mffscrni and mffsl. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	856e9e955f	core: Add framework for an FPU This adds the skeleton of a floating-point unit and implements the mffs and mtfsf instructions. Execute1 sends FP instructions to the FPU and receives busy, exception, FP interrupt and illegal interrupt signals from it. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	9d285a265c	core: Add support for single-precision FP loads and stores This adds code to loadstore1 to convert between single-precision and double-precision formats, and implements the lfs* and stfs* instructions. The conversion processes are described in Power ISA v3.1 Book 1 sections 4.6.2 and 4.6.3. These conversions take one cycle, so lfs* and stfs* are one cycle slower than lfd* and stfd*. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	45cd8f4fc3	core: Add support for floating-point loads and stores This extends the register file so it can hold FPR values, and implements the FP loads and stores that do not require conversion between single and double precision. We now have the FP, FE0 and FE1 bits in MSR. FP loads and stores cause a FP unavailable interrupt if MSR[FP] = 0. The FPU facilities are optional and their presence is controlled by the HAS_FPU generic passed down from the top-level board file. It defaults to true for all except the A7-35 boards. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	b589d2d472	execute1: Implement trace interrupts Trace interrupts occur when the MSR[TE] field is non-zero and an instruction other than rfid has been successfully completed. A trace interrupt occurs before the next instruction is executed or any asynchronous interrupt is taken. Since the trace interrupt is defined to set SRR1 bits depending on whether the traced instruction is a load or an instruction treated as a load, or a store or an instruction treated as a store, we need to make sure the treated-as-a-load instructions (icbi, icbt, dcbt, dcbst, dcbf) and the treated-as-a-store instructions (dcbtst, dcbz) have the correct opcodes in decode1. Several of them were previously marked as OP_NOP. We don't yet implement the SIAR or SDAR registers, which should be set by trace interrupts. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	6a80825e70	decode1: Avoid overriding fields of v.decode in decode1 In the cases where we need to override the values from the decode ROMs, we now do that overriding after the clock edge (eating into decode2's cycle) rather than before. This helps timing a little. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	83816cb9e3	core: Implement BCD Assist instructions addg6s, cdtbcd, cbcdtod To avoid adding too much logic, this moves the adder used by OP_ADD out of the case statement in execute1.vhdl so that the result can be used by OP_ADDG6S as well. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	7052ceef4a	core: Implement the wait instruction as a no-op Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	7246bd6f67	core: Implement the reserved no-op instructions These are no-ops that are reserved for future use as performance hints, so we just need to treat them as no-ops. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	5fafdc56ef	core: Implement the addex instruction The addex instruction is like adde but uses the XER[OV] bit for the carry in and out rather than XER[CA]. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	1a7aebeef8	Add random number generator and implement the darn instruction This adds a true random number generator for the Xilinx FPGAs which uses a set of chaotic ring oscillators to generate random bits and then passes them through a Linear Hybrid Cellular Automaton (LHCA) to remove bias, as described in "High Speed True Random Number Generators in Xilinx FPGAs" by Catalin Baetoniu of Xilinx Inc., in: https://pdfs.semanticscholar.org/83ac/9e9c1bb3dad5180654984604c8d5d8137412.pdf This requires adding a .xdc file to tell vivado that the combinatorial loops that form the ring oscillators are intentional. The same code should work on other FPGAs as well if their tools can be told to accept the combinatorial loops. For simulation, the random.vhdl module gets compiled in, which uses the pseudorand() function to generate random numbers. Synthesis using yosys uses nonrandom.vhdl, which always signals an error, causing darn to return 0xffff_ffff_ffff_ffff. This adds an implementation of the darn instruction. Darn can return either raw or conditioned random numbers. On Xilinx FPGAs, reading a raw random number gives the output of the ring oscillators, and reading a conditioned random number gives the output of the LHCA. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	290b05f97d	core: Implement the maddhd, maddhdu and maddld instructions These instructions use major opcode 4 and have a third GPR input operand, so we need a decode table for major opcode 4 and some plumbing to get the RC register operand read. The multiply-add instructions use the same insn_type_t values as the regular multiply instructions, and we distinguish in execute1 by looking at the major opcode. This turns out to be convenient because we don't have to add any cases in the code that handles the output of the multiplier, and it frees up some insn_type_t values. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	8edfbf638b	core: Implement the cmpeqb and cmprb instructions Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	b739372f7e	core: Implement the bpermd instruction Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	cce34039c3	core: Implement the setb instruction Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	fa77a6f683	core: Implement the mcrxrx instruction This also removes OP_MCRXR, as the mcrxr instruction was removed in version 3.0B of the Power ISA, having been phased-out for the server architecture since v2.02. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	0fb8967290	core: Implement the TAR register and the bctar instruction Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	36297d35f8	decode1: Fix formatting Commit `d5c8c33bae` ("decode1: Reformat to 4-space indentation") resulted in some rows of major_decode_rom_array being misaligned. This fixes it. No code change. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	893d2bc6a2	core: Don't generate logic for log data when LOG_LENGTH = 0 This adds "if LOG_LENGTH > 0 generate" to the places in the core where log output data is latched, so that when LOG_LENGTH = 0 we don't create the logic to collect the data which won't be stored. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	74062195ca	execute1: Do forwarding of the CR result to the next instruction This adds a path to allow the CR result of one instruction to be forwarded to the next instruction, so that sequences such as cmp; bc can avoid having a 1-cycle bubble. Forwarding is not available for dot-form (Rc=1) instructions, since the CR result for them is calculated in writeback. The decode.output_cr field is used to identify those instructions that compute the CR result in execute1. For some reason, the multiply instructions incorrectly had output_cr = 1 in the decode tables. This fixes that. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	c2da82764f	core: Implement CFAR register This implements the CFAR SPR as a slow SPR stored in 'ctrl'. Taken branches and rfid update it to the address of the branch or rfid instruction. To simplify the logic, this makes rfid use the branch logic to generate its redirect (requiring SRR0 to come in to execute1 on the B input and SRR1 on the A input), and the masking of the bottom 2 bits of NIA is moved to fetch1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	6687aae4d6	core: Implement a simple branch predictor This implements a simple branch predictor in the decode1 stage. If it sees that the instruction is b or bc and the branch is predicted to be taken, it sends a flush and redirect upstream (to icache and fetch1) to redirect fetching to the branch target. The prediction is sent downstream with the branch instruction, and execute1 now only sends a flush/redirect upstream if the prediction was wrong. Unconditional branches are always predicted to be taken, and conditional branches are predicted to be taken if and only if the offset is negative. Branches that take the branch address from a register (bclr, bcctr) are predicted not taken, as we don't have any way to predict the branch address. Since we can now have a mflr being executed immediately after a bl or bcl, we now track the update to LR in the hazard tracker, using the second write register field that is used to track RA updates for update-form loads and stores. For those branches that update LR but don't write any other result (i.e. that don't decrement CTR), we now write back LR in the same cycle as the instruction rather than taking a second cycle for the LR writeback. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	09ae2ce58d	decode1: Improve timing for slow SPR decode path This makes the logic that works out decode.unit and decode.sgl_pipe for mtspr/mfspr to/from slow SPRs detect the fact that the instruction is mtspr/mfspr based on a match with the instruction word rather than looking at v.decode.insn_type. This improves timing substantially, as the ROM lookup to get v.decode is relatively slow. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	b3799c432b	decode1: Add a stash buffer to the output This means that the busy signal from execute1 (which can be driven combinatorially from mmu or dcache) now stops at decode1 and doesn't go on to icache or fetch1. This helps with timing. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	65a36cc0fc	decode: Work out ispr1/ispr2 in parallel with decode ROM lookup This makes the logic that calculates which SPRs are being accessed work in parallel with the instruction decode ROM lookup instead of being dependent on the opcode found in the decode ROM. The reason for doing that is that the path from icache through the decode ROM to the ispr1/ispr2 fields has become a critical path. Thus we are now using only a very partial decode of the instruction word in the logic for ispr1/ispr2, and we therefore can no longer rely on them being zero in all cases where no SPR is being accessed. Instead, decode2 now ignores ispr1/ispr2 in all cases except when the relevant decode.input_reg_a/b or decode.output_reg_a is set to SPR. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	b5a7dbb78d	core: Remove fetch2 pipeline stage The fetch2 stage existed primarily to provide a stash buffer for the output of icache when a stall occurred. However, we can get the same effect -- of having the input to decode1 stay unchanged on a stall cycle -- by using the read enable of the BRAMs in icache, and by adding logic to keep the outputs unchanged on a clock cycle when stall_in = 1. This reduces branch and interrupt latency by one cycle. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	49a4d9f67a	Add core logging This logs 256 bits of data per cycle to a ring buffer in BRAM. The data collected can be read out through 2 new SPRs or through the debug interface. The new SPRs are LOG_ADDR (724) and LOG_DATA (725). LOG_ADDR contains the buffer write pointer in the upper 32 bits (in units of entries, i.e. 32 bytes) and the read pointer in the lower 32 bits (in units of doublewords, i.e. 8 bytes). Reading LOG_DATA gives the doubleword from the buffer at the read pointer and increments the read pointer. Setting bit 31 of LOG_ADDR inhibits the trace log system from writing to the log buffer, so the contents are stable and can be read. There are two new debug addresses which function similarly to the LOG_ADDR and LOG_DATA SPRs. The log is frozen while either or both of the LOG_ADDR SPR bit 31 or the debug LOG_ADDR register bit 31 are set. The buffer defaults to 2048 entries, i.e. 64kB. The size is set by the LOG_LENGTH generic on the core_debug module. Software can determine the length of the buffer because the length is ORed into the buffer write pointer in the upper 32 bits of LOG_ADDR. Hence the length of the buffer can be calculated as 1 << (31 - clz(LOG_ADDR)). There is a program to format the log entries in a somewhat readable fashion in scripts/fmt_log/fmt_log.c. The log_entry struct in that file describes the layout of the bits in the log entries. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	d5c8c33bae	decode1: Reformat to 4-space indentation Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	af909840e6	decode1: Make ld/std and lwa not be single-issue These were missed earlier when the single-issue flag was turned off on the other loads and stores by commit `1a244d3470` ("Remove single-issue constraint for most loads and stores"). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	4a4a98d4b9	core: Do addpcis using the main adder (#189) By adding logic to decode2 to be able to send the instruction address down the A input, and making CONST_DX_HI (renamed to CONST_DXHI4) add 4 to the immediate value (easy since the bottom 16 bits were zero), we can do addpcis using the main adder. This reduces the width of the result mux and frees up one value in insn_type_t, since we can now use OP_ADD for addpcis. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Shawn Anastasio	e606772aeb	Implement the addpcis instruction This commit adds support for the addpcis instruction from ISA 3.0. A new input_reg_b_t type, CONST_DX_HI, was added to support the shifted immediate value used in DX-Form instructions. Signed-off-by: Shawn Anastasio <shawn@anastas.io>	5 years ago
Paul Mackerras	2843c99a71	MMU: Implement reading of the process table This adds the PID register and repurposes SPR 720 as the PRTBL register, which points to the base of the process table. There doesn't seem to be any point to implementing the partition table given that we don't have hypervisor mode. The MMU caches entry 0 of the process table internally (in pgtbl3) plus the entry indexed by the value in the PID register (pgtbl0). Both caches are invalidated by a tlbie[l] with RIC=2 or by a move to PRTBL. The pgtbl0 cache is invalidated by a move to PID. The dTLB and iTLB are cleared by a move to either PRTBL or PID. Which of the two page table root pointers is used (pgtbl0 or pgtbl3) depends on the MSB of the address being translated. Since the segment checking ensures that address(63) = address(62), this is sufficient to map quadrants 0 and 3. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	a658766fcf	Implement slbia as a dTLB/iTLB flush Slbia (with IH=7) is used in the Linux kernel to flush the ERATs (our iTLB/dTLB), so make it do that. This moves the logic to work out whether to flush a single entry or the whole TLB from dcache and icache into mmu. We now invalidate all dTLB and iTLB entries when the AP (actual pagesize) field of RB is non-zero on a tlbie[l], as well as when IS is non-zero. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
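
The static prediction rule described in commit 6687aae4d6 (unconditional branches taken, conditional branches taken only when the offset is negative, register-indirect branches not taken) can be illustrated with a small software model. This is only a sketch in Python, not the VHDL in decode1; the names `BranchKind` and `predict_taken` are hypothetical and chosen for this example.

```python
from enum import Enum


class BranchKind(Enum):
    """Hypothetical classification of branch instructions for this sketch."""
    UNCONDITIONAL = "b"        # b, bl
    CONDITIONAL = "bc"         # bc, bcl (target is PC-relative or absolute)
    REGISTER = "bclr/bcctr"    # target comes from LR or CTR


def predict_taken(kind: BranchKind, offset: int) -> bool:
    """Return the static prediction described in the commit message.

    offset is the signed branch displacement; it is only consulted
    for conditional branches.
    """
    if kind is BranchKind.UNCONDITIONAL:
        return True
    if kind is BranchKind.CONDITIONAL:
        # Backward branches (typically loop back-edges) are predicted taken.
        return offset < 0
    # Register-indirect branches: the target address is unknown at decode,
    # so they are predicted not taken.
    return False
```

For example, `predict_taken(BranchKind.CONDITIONAL, -16)` is true for a loop back-edge, while the same call with a positive offset is false.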


118 Commits (c87b883a82746dae88b431b4b4046fc43a9448ac)
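
Commit 49a4d9f67a states that the log buffer length can be recovered from LOG_ADDR as 1 << (31 - clz(LOG_ADDR)), because the length is ORed into the write pointer held in the upper 32 bits. The sketch below works through that arithmetic in Python. It assumes that clz counts leading zeros over the full 64-bit register value and that the length is a power of two larger than any write pointer; the helper names `clz64` and `log_buffer_length` are hypothetical.

```python
def clz64(value: int) -> int:
    """Count leading zeros of a 64-bit unsigned value."""
    if not 0 <= value < (1 << 64):
        raise ValueError("value must fit in 64 bits")
    return 64 - value.bit_length()


def log_buffer_length(log_addr: int) -> int:
    """Recover the buffer length (in entries) from a LOG_ADDR value.

    Assumes the length bit is the highest set bit of the upper
    32 bits, as the commit message describes.
    """
    if (log_addr >> 32) == 0:
        raise ValueError("upper 32 bits are zero; no length bit present")
    return 1 << (31 - clz64(log_addr))


# Illustrative value only: length 2048, write pointer 5, read pointer 0x10.
example = ((2048 | 5) << 32) | 0x10
```

With the default of 2048 entries, the length bit sits at bit 43 of the 64-bit value, so clz is 20 and the expression yields 1 << 11 = 2048, independent of the write and read pointer values.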