microwatt/loadstore1.vhdl

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

library work;
use work.common.all;
use work.helpers.all;

-- 2 cycle LSU
-- We calculate the address in the first cycle

entity loadstore1 is
    port (
        clk   : in std_ulogic;

        l_in  : in Execute1ToLoadstore1Type;

        l_out : out Loadstore1ToDcacheType
        );
end loadstore1;

architecture behave of loadstore1 is
    signal r, rin : Loadstore1ToDcacheType;
    signal lsu_sum : std_ulogic_vector(63 downto 0);
begin
    -- Calculate the address in the first cycle
    lsu_sum <= std_ulogic_vector(unsigned(l_in.addr1) + unsigned(l_in.addr2))
               when l_in.valid = '1' else (others => '0');

    loadstore1_0: process(clk)
    begin
        if rising_edge(clk) then
            r <= rin;
        end if;
    end process;

    loadstore1_1: process(all)
        variable v : Loadstore1ToDcacheType;
        variable brev_lenm1 : unsigned(2 downto 0);
        variable byte_offset : unsigned(2 downto 0);
        variable j : integer;
        variable k : unsigned(2 downto 0);
    begin
        v := r;

        v.valid := l_in.valid;
        v.load := l_in.load;
        v.write_reg := l_in.write_reg;
        v.length := l_in.length;
        v.byte_reverse := l_in.byte_reverse;
        v.sign_extend := l_in.sign_extend;
        v.update := l_in.update;
        v.update_reg := l_in.update_reg;
        v.xerc := l_in.xerc;
        v.reserve := l_in.reserve;
        v.rc := l_in.rc;

        -- XXX Temporary hack. Mark the op as non-cacheable if bits 31:28 of
        -- the address are 0xc, i.e. the low 32 bits are of the form 0xc-------
        --
        -- This will have to be replaced by a combination of implementing the
        -- proper HV CI load/store instructions and having an MMU to get the I
        -- bit otherwise.
        if lsu_sum(31 downto 28) = "1100" then
            v.nc := '1';
        else
            v.nc := '0';
        end if;

        -- XXX Do length_to_sel here?

        -- Do byte reversing and rotating for stores in the first cycle
        if v.load = '0' then
            byte_offset := unsigned(lsu_sum(2 downto 0));
            brev_lenm1 := "000";
            if l_in.byte_reverse = '1' then
                brev_lenm1 := unsigned(l_in.length(2 downto 0)) - 1;
            end if;
            for i in 0 to 7 loop
                k := (to_unsigned(i, 3) xor brev_lenm1) + byte_offset;
                j := to_integer(k) * 8;
                v.data(j + 7 downto j) := l_in.data(i * 8 + 7 downto i * 8);
            end loop;
        end if;
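
        -- Worked example of the byte mapping above (illustrative only; it
        -- assumes l_in.length holds the access size in bytes, so that a size
        -- of 8 truncates to "000" and brev_lenm1 wraps around to "111").
        --
        -- Input byte i lands in output byte
        --     k = ((i xor brev_lenm1) + byte_offset) mod 8
        --
        -- 4-byte byte-reversed store, byte_offset = 0 (brev_lenm1 = "011"):
        --     i : 0 1 2 3 4 5 6 7
        --     k : 3 2 1 0 7 6 5 4
        -- The low four bytes are swapped end for end.
        --
        -- Non-reversed store, byte_offset = 2 (brev_lenm1 = "000"):
        --     i : 0 1 2 3 4 5 6 7
        --     k : 2 3 4 5 6 7 0 1
        -- The data is rotated left by two bytes, so input bytes 6 and 7
        -- wrap around into output bytes 0 and 1.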

        v.addr := lsu_sum;

        -- Update registers
        rin <= v;

        -- Update outputs
        l_out <= r;

        -- Unregistered (combinational) output of the low-order address bits
        -- (latched in dcache)
        l_out.early_low_addr <= lsu_sum(11 downto 0);
        l_out.early_valid <= l_in.valid;
    end process;
end;