library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

library work;
Move iTLB from icache to fetch1
This moves the address translation step for instruction fetches one
cycle earlier, so that it now happens in the fetch1 stage. There is
now a 2-entry mini translation cache ("ERAT", or effective to real
address translation cache) which operates on the output of the
multiplexer that selects the instruction address for the next cycle.
The ERAT consists of two effective address registers and two
corresponding real address registers. They store the page number part
of the addresses for a 4kB page size, which is the smallest page size
supported by the architecture.
If the effective address doesn't match either of the EA registers, and
address translation is enabled, then i_out.req goes low for two cycles
while the iTLB is looked up. Experimentally, this delay results in a
0.1% drop in coremark performance; allowing two cycles for the lookup
results in better timing. The result from the iTLB is placed into the
least recently used ERAT entry and then used to translate the address
as normal. If address translation is not enabled then the EA is used
directly as the real address.
The iTLB structure is the same as it was before; direct mapped,
indexed using a hashed EA.
The "fetch failed" signal, which indicates a TLB miss or protection
violation, is now generated in fetch1 and passed through icache.
When it is asserted, fetch1 goes into a stalled state until a PTE
arrives from the MMU (which gets put into both the iTLB and the ERAT),
or an interrupt or redirect occurs.
Any TLB invalidations from the MMU invalidate the whole ERAT.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
1 year ago
use work.utils.all;
use work.common.all;

entity fetch1 is
    generic(
        RESET_ADDRESS : std_logic_vector(63 downto 0) := (others => '0');
fetch1: Implement a simple branch target cache
This implements a cache in fetch1, where each entry stores the address
of a simple branch instruction (b or bc) and the target of the branch.
When fetching sequentially, if the address being fetched matches the
cache entry, then fetching will be redirected to the branch target.
The cache has 1024 entries and is direct-mapped, i.e. indexed by bits
11..2 of the NIA.
The bus from execute1 now carries information about taken and
not-taken simple branches, which fetch1 uses to update the cache.
The cache entry is updated for both taken and not-taken branches, with
the valid bit being set if the branch was taken and cleared if the
branch was not taken.
If fetching is redirected to the branch target then that goes down the
pipe as a predicted-taken branch, and decode1 does not do any static
branch prediction. If fetching is not redirected, then the next
instruction goes down the pipe as normal and decode1 does its static
branch prediction.
In order to make timing, the lookup of the cache is pipelined, so on
each cycle the cache entry for the current NIA + 8 is read. This
means that after a redirect (from decode1 or execute1), only the third
and subsequent sequentially-fetched instructions will be able to be
predicted.
This improves the coremark value on the Arty A7-100 from about 180 to
about 190 (more than 5%).
The BTC is optional. Builds for the Artix 7 35-T part have it off by
default because the extra ~1420 LUTs it takes mean that the design
doesn't fit on the Arty A7-35 board.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
        ALT_RESET_ADDRESS : std_logic_vector(63 downto 0) := (others => '0');
        TLB_SIZE : positive := 64; -- L1 ITLB number of entries (direct mapped)
        HAS_BTC : boolean := true
        );
    port(
        clk : in std_ulogic;
        rst : in std_ulogic;

        -- Control inputs:
        stall_in : in std_ulogic;
        flush_in : in std_ulogic;
        inval_btc : in std_ulogic;
        stop_in : in std_ulogic;
        alt_reset_in : in std_ulogic;
        m_in : in MmuToITLBType;

        -- redirect from writeback unit
        w_in : in WritebackToFetch1Type;

        -- redirect from decode1
        d_in : in Decode1ToFetch1Type;

        -- Request to icache
        i_out : out Fetch1ToIcacheType;

        -- outputs to logger
        log_out : out std_ulogic_vector(42 downto 0)
        );
end entity fetch1;

architecture behaviour of fetch1 is
    type reg_internal_t is record
        mode_32bit: std_ulogic;
        rd_is_niap4: std_ulogic;
        tlbcheck: std_ulogic;
        tlbstall: std_ulogic;
        next_nia: std_ulogic_vector(63 downto 0);
    end record;

    -- Mini effective to real translation cache
    type erat_t is record
        epn0: std_ulogic_vector(63 - MIN_LG_PGSZ downto 0);
        epn1: std_ulogic_vector(63 - MIN_LG_PGSZ downto 0);
        rpn0: std_ulogic_vector(REAL_ADDR_BITS - MIN_LG_PGSZ - 1 downto 0);
        rpn1: std_ulogic_vector(REAL_ADDR_BITS - MIN_LG_PGSZ - 1 downto 0);
        priv0: std_ulogic;
        priv1: std_ulogic;
        valid: std_ulogic_vector(1 downto 0);
        mru: std_ulogic; -- '1' => entry 1 most recently used
    end record;
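
    -- Editorial sketch, not part of the original source: one way the
    -- two-entry ERAT match described in the "Move iTLB from icache to
    -- fetch1" commit message could be written. The function name
    -- erat_match_sketch is hypothetical and nothing in this file calls it;
    -- the design's actual lookup logic is not shown in this excerpt.
    -- It compares the page-number part of an effective address against
    -- both EA registers and returns one hit bit per entry.
    function erat_match_sketch(e: erat_t; ea: std_ulogic_vector(63 downto 0))
        return std_ulogic_vector is
        variable hit : std_ulogic_vector(1 downto 0) := "00";
    begin
        -- Entry 0 hits if it is valid and its effective page number matches
        if e.valid(0) = '1' and e.epn0 = ea(63 downto MIN_LG_PGSZ) then
            hit(0) := '1';
        end if;
        -- Entry 1 likewise
        if e.valid(1) = '1' and e.epn1 = ea(63 downto MIN_LG_PGSZ) then
            hit(1) := '1';
        end if;
        return hit;
    end;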

    signal r, r_next : Fetch1ToIcacheType;
    signal r_int, r_next_int : reg_internal_t;
    signal advance_nia : std_ulogic;
    signal log_nia : std_ulogic_vector(42 downto 0);
    signal erat : erat_t;
    signal erat_hit : std_ulogic;
    signal erat_sel : std_ulogic;

    constant BTC_ADDR_BITS : integer := 10;
    constant BTC_TAG_BITS : integer := 62 - BTC_ADDR_BITS;
    constant BTC_TARGET_BITS : integer := 62;
    constant BTC_SIZE : integer := 2 ** BTC_ADDR_BITS;
    constant BTC_WIDTH : integer := BTC_TAG_BITS + BTC_TARGET_BITS + 2;
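    -- Worked sizing example (editorial note, derived only from the
    -- constants above). With BTC_ADDR_BITS = 10:
    --   BTC_TAG_BITS = 62 - 10     = 52
    --   BTC_SIZE     = 2 ** 10     = 1024 entries
    --   BTC_WIDTH    = 52 + 62 + 2 = 116 bits per entry
    -- so the whole array holds 1024 * 116 = 118784 bits.
    -- The commit message says the cache is indexed by NIA bits 11..2.
    -- Assuming the tag is the rest of the word address, it would cover
    -- NIA bits 63..12. The commit message mentions a valid bit; the use
    -- of the second extra bit is not shown in this excerpt.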
    type btc_mem_type is array (0 to BTC_SIZE - 1) of std_ulogic_vector(BTC_WIDTH - 1 downto 0);

    signal btc_rd_addr : unsigned(BTC_ADDR_BITS - 1 downto 0);
    signal btc_rd_data : std_ulogic_vector(BTC_WIDTH - 1 downto 0) := (others => '0');
    signal btc_rd_valid : std_ulogic := '0';

    -- L1 ITLB.
    constant TLB_BITS : natural := log2(TLB_SIZE);
    constant TLB_EA_TAG_BITS : natural := 64 - (MIN_LG_PGSZ + TLB_BITS);
    constant TLB_PTE_BITS : natural := 64;

    subtype tlb_index_t is integer range 0 to TLB_SIZE - 1;
    type tlb_valids_t is array(tlb_index_t) of std_ulogic;
    subtype tlb_tag_t is std_ulogic_vector(TLB_EA_TAG_BITS - 1 downto 0);
    type tlb_tags_t is array(tlb_index_t) of tlb_tag_t;
    subtype tlb_pte_t is std_ulogic_vector(TLB_PTE_BITS - 1 downto 0);
    type tlb_ptes_t is array(tlb_index_t) of tlb_pte_t;

    signal itlb_valids : tlb_valids_t;
    signal itlb_tags : tlb_tags_t;
    signal itlb_ptes : tlb_ptes_t;

    -- Values read from above arrays on a clock edge
    signal itlb_valid : std_ulogic;
    signal itlb_ttag : tlb_tag_t;
    signal itlb_pte : tlb_pte_t;
    signal itlb_hit : std_ulogic;

    -- Simple hash for direct-mapped TLB index
    function hash_ea(addr: std_ulogic_vector(63 downto 0)) return std_ulogic_vector is
        variable hash : std_ulogic_vector(TLB_BITS - 1 downto 0);
    begin
        hash := addr(MIN_LG_PGSZ + TLB_BITS - 1 downto MIN_LG_PGSZ)
                xor addr(MIN_LG_PGSZ + 2 * TLB_BITS - 1 downto MIN_LG_PGSZ + TLB_BITS)
                xor addr(MIN_LG_PGSZ + 3 * TLB_BITS - 1 downto MIN_LG_PGSZ + 2 * TLB_BITS);
        return hash;
    end;
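    -- Worked example for hash_ea (editorial note). It assumes
    -- MIN_LG_PGSZ = 12, matching the 4kB minimum page size named in the
    -- commit message, and the default TLB_SIZE = 64, so TLB_BITS = 6.
    -- Under those assumptions the hash reduces to
    --   addr(17 downto 12) xor addr(23 downto 18) xor addr(29 downto 24)
    -- For addr = x"0000000012345000" the three 6-bit fields are
    --   addr(17 downto 12) = "000101"
    --   addr(23 downto 18) = "001101"
    --   addr(29 downto 24) = "010010"
    -- and their xor is "011010", i.e. TLB index 26.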

begin

    regs : process(clk)
    begin
        if rising_edge(clk) then
            log_nia <= r.nia(63) & r.nia(43 downto 2);
            if r /= r_next and advance_nia = '1' then
                report "fetch1 rst:" & std_ulogic'image(rst) &
                    " IR:" & std_ulogic'image(r_next.virt_mode) &
                    " P:" & std_ulogic'image(r_next.priv_mode) &
                    " E:" & std_ulogic'image(r_next.big_endian) &
                    " 32:" & std_ulogic'image(r_next_int.mode_32bit) &
                    " I:" & std_ulogic'image(w_in.interrupt) &
                    " R:" & std_ulogic'image(w_in.redirect) & std_ulogic'image(d_in.redirect) &
                    " S:" & std_ulogic'image(stall_in) &
                    " T:" & std_ulogic'image(stop_in) &
                    " nia:" & to_hstring(r_next.nia) &
|
|
|
" req:" & std_ulogic'image(r_next.req) &
|
|
|
|
" FF:" & std_ulogic'image(r_next.fetch_fail);
|
|
|
|
end if;
|
|
|
|
if advance_nia = '1' then
|
|
|
|
r <= r_next;
|
|
|
|
r_int <= r_next_int;
|
|
|
|
end if;
|
|
|
|
-- always send the up-to-date stop mark and req
|
|
|
|
r.stop_mark <= stop_in;
|
|
|
|
r.req <= r_next.req;
|
|
|
|
r.fetch_fail <= r_next.fetch_fail;
|
|
|
|
r_int.tlbcheck <= r_next_int.tlbcheck;
|
|
|
|
r_int.tlbstall <= r_next_int.tlbstall;
|
|
|
|
end if;
|
|
|
|
end process;
|
|
|
|
log_out <= log_nia;
|
|
|
|
|
|
|
|
btc : if HAS_BTC generate
|
|
|
|
signal btc_memory : btc_mem_type;
|
|
|
|
attribute ram_style : string;
|
|
|
|
attribute ram_style of btc_memory : signal is "block";
|
|
|
|
|
|
|
|
signal btc_valids : std_ulogic_vector(BTC_SIZE - 1 downto 0);
|
|
|
|
-- attribute ram_style of btc_valids : signal is "distributed";
|
|
|
|
|
|
|
|
signal btc_wr : std_ulogic;
|
|
|
|
signal btc_wr_data : std_ulogic_vector(BTC_WIDTH - 1 downto 0);
|
|
|
|
signal btc_wr_addr : std_ulogic_vector(BTC_ADDR_BITS - 1 downto 0);
|
|
|
|
begin
|
|
|
|
btc_wr_data <= w_in.br_taken &
|
|
|
|
r.virt_mode &
|
|
|
|
w_in.br_nia(63 downto BTC_ADDR_BITS + 2) &
|
|
|
|
w_in.redirect_nia(63 downto 2);
|
|
|
|
btc_wr_addr <= w_in.br_nia(BTC_ADDR_BITS + 1 downto 2);
|
|
|
|
btc_wr <= w_in.br_last;
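-- Layout of one BTC entry as assembled above, most significant field first:
--   w_in.br_taken                              1 bit    branch was taken
--   r.virt_mode                                1 bit    fetch translation mode at write time
--   w_in.br_nia(63 downto BTC_ADDR_BITS + 2)   tag      upper bits of the branch address
--   w_in.redirect_nia(63 downto 2)             62 bits  branch target, word aligned
-- Illustration only, assuming BTC_ADDR_BITS = 10 (a 1024-entry cache indexed
-- by NIA bits 11..2): the tag is 52 bits and an entry is 1 + 1 + 52 + 62 = 116
-- bits. BTC_WIDTH itself is declared elsewhere and is not shown here.
-- Note that btc_valids is set on every write; whether the branch was taken
-- is carried by the top bit of the entry.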
|
|
|
|
|
|
|
|
btc_ram : process(clk)
|
|
|
|
variable raddr : unsigned(BTC_ADDR_BITS - 1 downto 0);
|
|
|
|
begin
|
|
|
|
if rising_edge(clk) then
|
|
|
|
if advance_nia = '1' then
|
|
|
|
if is_X(btc_rd_addr) then
|
|
|
|
btc_rd_data <= (others => 'X');
|
|
|
|
btc_rd_valid <= 'X';
|
|
|
|
else
|
|
|
|
btc_rd_data <= btc_memory(to_integer(btc_rd_addr));
|
|
|
|
btc_rd_valid <= btc_valids(to_integer(btc_rd_addr));
|
|
|
|
end if;
|
|
|
|
end if;
|
|
|
|
if btc_wr = '1' then
|
|
|
|
assert not is_X(btc_wr_addr) report "Writing to unknown address" severity FAILURE;
|
|
|
|
btc_memory(to_integer(unsigned(btc_wr_addr))) <= btc_wr_data;
|
|
|
|
end if;
|
|
|
|
if inval_btc = '1' or rst = '1' then
|
|
|
|
btc_valids <= (others => '0');
|
|
|
|
elsif btc_wr = '1' then
|
|
|
|
assert not is_X(btc_wr_addr) report "Writing to unknown address" severity FAILURE;
|
|
|
|
btc_valids(to_integer(unsigned(btc_wr_addr))) <= '1';
|
|
|
|
end if;
|
|
|
|
end if;
|
|
|
|
end process;
|
|
|
|
end generate;
|
|
|
|
|
|
|
|
erat_sync : process(clk)
|
|
|
|
begin
|
|
|
|
if rising_edge(clk) then
|
|
|
|
if rst /= '0' or m_in.tlbie = '1' then
|
|
|
|
erat.valid <= "00";
|
|
|
|
erat.mru <= '0';
|
|
|
|
else
|
|
|
|
if erat_hit = '1' then
|
|
|
|
erat.mru <= erat_sel;
|
|
|
|
end if;
|
|
|
|
if m_in.tlbld = '1' then
|
|
|
|
erat.epn0 <= m_in.addr(63 downto MIN_LG_PGSZ);
|
|
|
|
erat.rpn0 <= m_in.pte(REAL_ADDR_BITS-1 downto MIN_LG_PGSZ);
|
|
|
|
erat.priv0 <= m_in.pte(3);
|
|
|
|
erat.valid(0) <= '1';
|
|
|
|
erat.valid(1) <= '0';
|
|
|
|
erat.mru <= '0';
|
|
|
|
elsif r_int.tlbcheck = '1' and itlb_hit = '1' then
|
|
|
|
if erat.mru = '0' then
|
|
|
|
erat.epn1 <= r.nia(63 downto MIN_LG_PGSZ);
|
|
|
|
erat.rpn1 <= itlb_pte(REAL_ADDR_BITS-1 downto MIN_LG_PGSZ);
|
|
|
|
erat.priv1 <= itlb_pte(3);
|
|
|
|
erat.valid(1) <= '1';
|
|
|
|
else
|
|
|
|
erat.epn0 <= r.nia(63 downto MIN_LG_PGSZ);
|
|
|
|
erat.rpn0 <= itlb_pte(REAL_ADDR_BITS-1 downto MIN_LG_PGSZ);
|
|
|
|
erat.priv0 <= itlb_pte(3);
|
|
|
|
erat.valid(0) <= '1';
|
|
|
|
end if;
|
|
|
|
erat.mru <= not erat.mru;
|
|
|
|
end if;
|
|
|
|
end if;
|
|
|
|
end if;
|
|
|
|
end process;
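-- Replacement behaviour of the two-entry ERAT, traced by hand from the
-- process above (an illustrative sequence, not a captured simulation):
--   event in order               valid(1) valid(0)  mru   entry written
--   rst or m_in.tlbie               '0'      '0'    '0'   none
--   m_in.tlbld                      '0'      '1'    '0'   0 (entry 1 invalidated)
--   iTLB hit while mru = '0'        '1'      '1'    '1'   1
--   iTLB hit while mru = '1'        '1'      '1'    '0'   0
-- A refill from the iTLB therefore always replaces the entry that is not
-- marked most recently used. When erat_hit and a refill coincide, the
-- later assignment to erat.mru in the process is the one that takes effect.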
|
|
|
|
|
|
|
|
-- Read TLB using the NIA for the next cycle
|
|
|
|
itlb_read : process(clk)
|
|
|
|
variable tlb_req_index : std_ulogic_vector(TLB_BITS - 1 downto 0);
|
|
|
|
begin
|
|
|
|
if rising_edge(clk) then
|
|
|
|
if advance_nia = '1' then
|
|
|
|
tlb_req_index := hash_ea(r_next.nia);
|
|
|
|
if is_X(tlb_req_index) then
|
|
|
|
itlb_pte <= (others => 'X');
|
|
|
|
itlb_ttag <= (others => 'X');
|
|
|
|
itlb_valid <= 'X';
|
|
|
|
else
|
|
|
|
itlb_pte <= itlb_ptes(to_integer(unsigned(tlb_req_index)));
|
|
|
|
itlb_ttag <= itlb_tags(to_integer(unsigned(tlb_req_index)));
|
|
|
|
itlb_valid <= itlb_valids(to_integer(unsigned(tlb_req_index)));
|
|
|
|
end if;
|
|
|
|
end if;
|
|
|
|
end if;
|
|
|
|
end process;
|
|
|
|
|
|
|
|
-- TLB hit detection
|
|
|
|
itlb_lookup : process(all)
|
|
|
|
begin
|
|
|
|
itlb_hit <= '0';
|
|
|
|
if itlb_ttag = r.nia(63 downto MIN_LG_PGSZ + TLB_BITS) then
|
|
|
|
itlb_hit <= itlb_valid;
|
|
|
|
end if;
|
|
|
|
end process;
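-- Why the tag compare above is sufficient: the index is the XOR of three
-- page-number fields, and the two upper fields lie inside the stored tag
-- (bits 63 downto MIN_LG_PGSZ + TLB_BITS). Given a matching tag and the
-- index, the low field addr(MIN_LG_PGSZ + TLB_BITS - 1 downto MIN_LG_PGSZ)
-- is fully determined, so tag and index together identify a single page.
-- Illustration only, assuming TLB_BITS = 6 and MIN_LG_PGSZ = 12: the tag
-- is addr(63 downto 18), i.e. 46 bits.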
|
|
|
|
|
|
|
|
-- iTLB update
|
|
|
|
itlb_update: process(clk)
|
|
|
|
variable wr_index : std_ulogic_vector(TLB_BITS - 1 downto 0);
|
|
|
|
begin
|
|
|
|
if rising_edge(clk) then
|
|
|
|
wr_index := hash_ea(m_in.addr);
|
|
|
|
if rst = '1' or (m_in.tlbie = '1' and m_in.doall = '1') then
|
|
|
|
-- clear all valid bits
|
|
|
|
for i in tlb_index_t loop
|
|
|
|
itlb_valids(i) <= '0';
|
|
|
|
end loop;
|
|
|
|
elsif m_in.tlbie = '1' then
|
|
|
|
assert not is_X(wr_index) report "iTLB index invalid on write" severity FAILURE;
|
|
|
|
-- clear entry regardless of hit or miss
|
|
|
|
itlb_valids(to_integer(unsigned(wr_index))) <= '0';
|
|
|
|
elsif m_in.tlbld = '1' then
|
|
|
|
assert not is_X(wr_index) report "iTLB index invalid on write" severity FAILURE;
|
|
|
|
itlb_tags(to_integer(unsigned(wr_index))) <= m_in.addr(63 downto MIN_LG_PGSZ + TLB_BITS);
|
|
|
|
itlb_ptes(to_integer(unsigned(wr_index))) <= m_in.pte;
|
|
|
|
itlb_valids(to_integer(unsigned(wr_index))) <= '1';
|
|
|
|
end if;
|
|
|
|
--ev.itlb_miss_resolved <= m_in.tlbld and not rst;
|
|
|
|
end if;
|
|
|
|
end process;
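-- Priority of the iTLB write operations above, highest first:
--   1. rst, or m_in.tlbie with m_in.doall: clear every valid bit
--   2. m_in.tlbie alone: clear the entry at hash_ea(m_in.addr), with no
--      tag compare
--   3. m_in.tlbld: write the tag and PTE at hash_ea(m_in.addr) and set its
--      valid bit
-- Because the iTLB is direct mapped, a load simply replaces whatever entry
-- previously occupied that index.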
|
|
|
|
|
|
|
|
comb : process(all)
|
|
|
|
variable v : Fetch1ToIcacheType;
|
|
|
|
variable v_int : reg_internal_t;
|
|
|
|
variable next_nia : std_ulogic_vector(63 downto 0);
|
|
|
|
variable m32 : std_ulogic;
|
|
|
|
variable ehit, esel : std_ulogic;
|
|
|
|
variable eaa_priv : std_ulogic;
|
|
|
|
begin
|
|
|
|
v := r;
|
|
|
|
v_int := r_int;
|
|
|
|
v.predicted := '0';
|
|
|
|
v.pred_ntaken := '0';
|
|
|
|
v.req := not stop_in;
|
|
|
|
v_int.tlbstall := r_int.tlbcheck;
|
|
|
|
v_int.tlbcheck := '0';
|
|
|
|
|
|
|
|
if r_int.tlbcheck = '1' and itlb_hit = '0' then
|
|
|
|
v.fetch_fail := '1';
|
|
|
|
end if;
|
|
|
|
|
|
|
|
-- Combinatorial computation of the CIA for the next cycle.
|
|
|
|
-- Needs to be simple so the result can be used for RAM
|
|
|
|
-- access in the icache and for the iTLB read in fetch1.
|
|
|
|
-- If we are stalled, this still advances, and the assumption
|
|
|
|
-- is that it will not be used.
|
|
|
|
m32 := r_int.mode_32bit;
|
|
|
|
if w_in.redirect = '1' then
|
|
|
|
next_nia := w_in.redirect_nia(63 downto 2) & "00";
|
|
|
|
m32 := w_in.mode_32bit;
|
|
|
|
v.virt_mode := w_in.virt_mode;
|
|
|
|
v.priv_mode := w_in.priv_mode;
|
|
|
|
v.big_endian := w_in.big_endian;
|
|
|
|
v_int.mode_32bit := w_in.mode_32bit;
|
|
|
|
v.fetch_fail := '0';
|
|
|
|
elsif d_in.redirect = '1' then
|
|
|
|
next_nia := d_in.redirect_nia(63 downto 2) & "00";
|
|
|
|
            v.fetch_fail := '0';
        elsif r_int.tlbstall = '1' then
            -- this case is needed so that the correct icache tags are read
            next_nia := r.nia;
        else
            next_nia := r_int.next_nia;
        end if;
        if m32 = '1' then
            next_nia(63 downto 32) := (others => '0');
        end if;
        v.nia := next_nia;

        v_int.next_nia := std_ulogic_vector(unsigned(next_nia) + 4);

        -- Use v_int.next_nia as the BTC read address before it gets possibly
        -- overridden with the reset or interrupt address or the predicted branch
        -- target address, in order to improve timing.  If it gets overridden then
        -- rd_is_niap4 gets cleared to indicate that the BTC data doesn't apply.
        btc_rd_addr <= unsigned(v_int.next_nia(BTC_ADDR_BITS + 1 downto 2));
        v_int.rd_is_niap4 := '1';
        -- If the last NIA value went down with a stop mark, it didn't get
        -- executed, and hence we shouldn't increment NIA.
        advance_nia <= rst or w_in.interrupt or w_in.redirect or d_in.redirect or
                       (not r.stop_mark and not (r.req and stall_in));
        -- reduce metavalue warnings in sim
        if is_X(rst) then
            advance_nia <= '1';
        end if;

        -- Translate next_nia to real if possible, otherwise we have to stall
        -- and look up the TLB.
        ehit := '0';
        esel := '0';
        eaa_priv := '1';
        if next_nia(63 downto MIN_LG_PGSZ) = erat.epn1 and erat.valid(1) = '1' then
            ehit := '1';
            esel := '1';
        end if;
        if next_nia(63 downto MIN_LG_PGSZ) = erat.epn0 and erat.valid(0) = '1' then
            ehit := '1';
        end if;
        if v.virt_mode = '0' then
            v.rpn := v.nia(REAL_ADDR_BITS - 1 downto MIN_LG_PGSZ);
            eaa_priv := '1';
        elsif esel = '1' then
            v.rpn := erat.rpn1;
            eaa_priv := erat.priv1;
        else
            v.rpn := erat.rpn0;
            eaa_priv := erat.priv0;
        end if;
        if advance_nia = '1' and ehit = '0' and v.virt_mode = '1' and
            r_int.tlbcheck = '0' and v.fetch_fail = '0' then
            v_int.tlbstall := '1';
            v_int.tlbcheck := '1';
        end if;
        if ehit = '1' or v.virt_mode = '0' then
            if eaa_priv = '1' and v.priv_mode = '0' then
                v.fetch_fail := '1';
            else
                v.fetch_fail := '0';
            end if;
        end if;
        erat_hit <= ehit and advance_nia;
        erat_sel <= esel;

        if rst /= '0' then
            if alt_reset_in = '1' then
                v_int.next_nia := ALT_RESET_ADDRESS;
            else
                v_int.next_nia := RESET_ADDRESS;
            end if;
        elsif w_in.interrupt = '1' then
            v_int.next_nia := 47x"0" & w_in.intr_vec(16 downto 2) & "00";
        end if;
        if rst /= '0' or w_in.interrupt = '1' then
            v.req := '0';
Add TLB to icache
This adds a direct-mapped TLB to the icache, with 64 entries by default.
Execute1 now sends a "virt_mode" signal from MSR[IR] to fetch1 along
with redirects to indicate whether instruction addresses should be
translated through the TLB, and fetch1 sends that on to icache.
Similarly a "priv_mode" signal is sent to indicate the privilege
mode for instruction fetches. This means that changes to MSR[IR]
or MSR[PR] don't take effect until the next redirect, meaning an
isync, rfid, branch, etc.
The icache uses a hash of the effective address (i.e. next instruction
address) to index the TLB. The hash is an XOR of three fields of the
address; with a 64-entry TLB, the fields are bits 12--17, 18--23 and
24--29 of the address. TLB invalidations simply invalidate the
indexed TLB entry without checking the contents.
If the icache detects a TLB miss with virt_mode=1, it will send a
fetch_failed indication through fetch2 to decode1, which will turn it
into a special OP_FETCH_FAILED opcode with unit=LDST. That will get
sent down to loadstore1 which will currently just raise an Instruction
Storage Interrupt (0x400) exception.
One bit in the PTE obtained from the TLB is used to check whether an
instruction access is allowed -- the privilege bit (bit 3). If bit 3
is 1 and priv_mode=0, then a fetch_failed indication is sent down to
fetch2 and to decode1, which generates an OP_FETCH_FAILED. Any PTEs
with PTE bit 0 (EAA[3]) clear or bit 8 (R) clear should not be put
into the iTLB since such PTEs would not allow execution by any
context.
Tlbie operations get sent from mmu to icache over a new connection.
Unfortunately the privileged instruction tests are broken for now.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
            v.virt_mode := '0';
            v.priv_mode := '1';
            v.big_endian := '0';
            v_int.mode_32bit := '0';
            v_int.rd_is_niap4 := '0';
            v_int.tlbstall := '0';
            v_int.tlbcheck := '0';
            v.fetch_fail := '0';
        end if;
        if v.fetch_fail = '1' then
            v_int.tlbstall := '1';
        end if;
        if v_int.tlbstall = '1' then
            v.req := '0';
        end if;

        -- If there is a valid entry in the BTC which corresponds to the next instruction,
        -- use that to predict the address of the instruction after that.
        -- (w_in.redirect = '0' and d_in.redirect = '0' and r_int.tlbstall = '0')
        -- implies v.nia = r_int.next_nia.
        -- r_int.rd_is_niap4 implies r_int.next_nia is the address used to read the BTC.
        if v.req = '1' and w_in.redirect = '0' and d_in.redirect = '0' and r_int.tlbstall = '0' and
            btc_rd_valid = '1' and r_int.rd_is_niap4 = '1' and
            btc_rd_data(BTC_WIDTH - 2) = r.virt_mode and
            btc_rd_data(BTC_WIDTH - 3 downto BTC_TARGET_BITS)
            = r_int.next_nia(BTC_TAG_BITS + BTC_ADDR_BITS + 1 downto BTC_ADDR_BITS + 2) then
            v.predicted := btc_rd_data(BTC_WIDTH - 1);
            v.pred_ntaken := not btc_rd_data(BTC_WIDTH - 1);
            if btc_rd_data(BTC_WIDTH - 1) = '1' then
                v_int.next_nia := btc_rd_data(BTC_TARGET_BITS - 1 downto 0) & "00";
                v_int.rd_is_niap4 := '0';
            end if;
        end if;

        r_next <= v;
        r_next_int <= v_int;

        -- Update outputs to the icache
        i_out <= r;
        i_out.next_nia <= next_nia;
        i_out.next_rpn <= v.rpn;

    end process;

end architecture behaviour;
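
-- Illustrative sketch only; not part of the fetch1 design above.
-- The "Add TLB to icache" annotation earlier in this listing describes the
-- TLB index as an XOR of three fields of the effective address (bits 12..17,
-- 18..23 and 24..29 for a 64-entry TLB).  The package below shows one way
-- that hash could be written as a standalone function.
-- Assumptions: the names fetch1_sketch and tlb_hash_sketch are hypothetical,
-- the TLB size is fixed at 64 entries, and bit numbers follow the
-- "63 downto 0" convention used for addresses in this file (bit 0 = LSB).
library ieee;
use ieee.std_logic_1164.all;

package fetch1_sketch is
    -- Returns a 6-bit index into a 64-entry direct-mapped TLB.
    function tlb_hash_sketch(ea : std_ulogic_vector(63 downto 0))
        return std_ulogic_vector;
end package fetch1_sketch;

package body fetch1_sketch is
    function tlb_hash_sketch(ea : std_ulogic_vector(63 downto 0))
        return std_ulogic_vector is
        variable hash : std_ulogic_vector(5 downto 0);
    begin
        -- XOR-fold 18 bits of the page number (above the 4kB page offset)
        -- into 6 bits, so that nearby pages spread across the TLB.
        hash := ea(17 downto 12) xor ea(23 downto 18) xor ea(29 downto 24);
        return hash;
    end function tlb_hash_sketch;
end package body fetch1_sketch;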