Prologue for large code model #22

<Uli>

I've now run into the first actual customer report of a problem due to a .toc section being more than 2 GB away from its associated .text section, so that the code to set up r2 in the global entry point prolog fails due to overflow. (This shows up in the context of the LLVM JIT, which has the problem that it allocates all sections independently of each other, without even trying to keep .toc close to .text. But even if that were improved, the customer still has a huge code base that might get close to 2 GB code size anyway. This is the particle detection software running at ATLAS/CMS at the LHC at CERN.)
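
To make the overflow concrete, here is a minimal sketch of the range check. It assumes the usual two-instruction global entry sequence (`addis r2,r12,(.TOC.-func)@ha` followed by `addi r2,r2,(.TOC.-func)@l`); the helper names `split_ha_lo` and `fits_addis_addi` are made up for illustration and are not taken from any linker.

```python
def split_ha_lo(offset):
    """Split a TOC offset into the (@ha, @l) halves of an addis/addi pair."""
    # Low half: the low 16 bits, sign-extended (addi sign-extends its immediate).
    lo = ((offset & 0xFFFF) ^ 0x8000) - 0x8000
    # High half: whatever remains, compensating for the sign of the low half.
    ha = (offset - lo) >> 16
    return ha, lo


def fits_addis_addi(offset):
    """True if the high half still fits the signed 16-bit field of addis."""
    ha, _ = split_ha_lo(offset)
    return -0x8000 <= ha <= 0x7FFF


near = 0x12345678      # .toc well within 2 GB of the entry point
far = 3 * 2**30        # hypothetical: .toc 3 GiB beyond the entry point

print(fits_addis_addi(near))   # in range
print(fits_addis_addi(far))    # out of range: this is the reported overflow
```

Under these assumptions the reachable window is slightly asymmetric (from -0x80008000 to 0x7FFF7FFF), because the low half is signed.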

Back when we designed the ABI, we accepted the 2 GB limitation for the small and medium code models, but planned on lifting the restriction for the large code model at some point; this was never actually implemented, however. I've now started to look into this missing piece, and there are a couple of questions that are coming up.

First of all, what is the optimal code sequence to do that? I see two options:

A.) Load a 64-bit TOC offset from memory (relative to r12):

        .quad .TOC.-func

    func:
        ld r2, -8(r12)
        add r2, r2, r12
        .localentry func, .-func

The advantage of A.) is the small size: in fact the code size remains unchanged (2 instructions), we just need an additional 8 bytes of data per global entry point. The disadvantage may be that the data quadword is intermixed with code so that this cache line will end up in both L1 data and L1 instruction caches. Also, the "ld" may incur a stall waiting on the data moving into the L1-D cache. [ This might be improved by not hardcoding the -8 offset but allowing grouping of those quadwords for multiple functions into a single cache line. However, it might get a bit tricky to ensure that the quadword remains in range for the ld ... ]
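
As a sanity check on sequence A, the following sketch models the two instructions with 64-bit wraparound arithmetic. The addresses and the `place_offset` helper are hypothetical; memory is modelled as a plain dictionary from address to doubleword value.

```python
MASK64 = (1 << 64) - 1


def place_offset(func, toc):
    """Model the `.quad .TOC.-func` stored in the 8 bytes before func."""
    return {(func - 8) & MASK64: (toc - func) & MASK64}


def sequence_a(r12, memory):
    """Model `ld r2,-8(r12)` followed by `add r2,r2,r12`."""
    r2 = memory[(r12 - 8) & MASK64]    # ld  r2, -8(r12)
    return (r2 + r12) & MASK64         # add r2, r2, r12


func = 0x0000_7000_0000_0000           # hypothetical entry address (held in r12)
toc_above = func + 5 * 2**30           # .TOC. 5 GiB above the function
toc_below = func - 5 * 2**30           # .TOC. 5 GiB below the function

for toc in (toc_above, toc_below):
    memory = place_offset(func, toc)
    assert sequence_a(func, memory) == toc
```

Both directions work because the stored offset is a full 64-bit two's-complement value, so no displacement is too large for the add.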

<Alan>

Sequence (A) is best, I think. (B) has too many dependent insns.

For linker optimisation, please put a new marker reloc on the first insn, R_PPC64_ENTRY perhaps, to let the linker know it should possibly do some editing. Otherwise the linker would need a special pass to inspect code at all function symbols, since sequence (A) doesn't have relocations on the entry point code. With a marker reloc the code editing becomes easy to implement, and reliable.
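
A rough sketch of the kind of editing such a marker could enable: at a marked entry point, replace the ld/add pair with an addis/addi pair when the TOC offset turns out to be in range. The function `relax_entry` and the representation of code as a list of instruction words are assumptions for illustration only; the editing policy is a guess at what a linker might do, not a description of an existing one.

```python
LD_R2_M8_R12 = 0xE84CFFF8     # ld    r2, -8(r12)
ADD_R2_R2_R12 = 0x7C426214    # add   r2, r2, r12
ADDIS_R2_R12 = 0x3C4C0000     # addis r2, r12, 0   (immediate filled in below)
ADDI_R2_R2 = 0x38420000       # addi  r2, r2, 0    (immediate filled in below)
BLR = 0x4E800020              # blr


def relax_entry(words, index, toc_offset):
    """Rewrite the ld/add pair at words[index] to addis/addi if possible.

    `index` is where the hypothetical marker reloc points. Returns True if
    the code was edited, False if it was left alone.
    """
    if words[index] != LD_R2_M8_R12 or words[index + 1] != ADD_R2_R2_R12:
        return False                       # not the sequence we expect
    lo = ((toc_offset & 0xFFFF) ^ 0x8000) - 0x8000
    ha = (toc_offset - lo) >> 16
    if not -0x8000 <= ha <= 0x7FFF:
        return False                       # too far: keep the 64-bit load
    words[index] = ADDIS_R2_R12 | (ha & 0xFFFF)
    words[index + 1] = ADDI_R2_R2 | (lo & 0xFFFF)
    return True


near_code = [LD_R2_M8_R12, ADD_R2_R2_R12, BLR]
far_code = [LD_R2_M8_R12, ADD_R2_R2_R12, BLR]
edited_near = relax_entry(near_code, 0, 0x12345678)
edited_far = relax_entry(far_code, 0, 3 * 2**30)
print([hex(w) for w in near_code], edited_near)
print([hex(w) for w in far_code], edited_far)
```

Without the marker, the linker would have to pattern-match these two words at every function symbol, which is the special pass the marker reloc avoids.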

<Mike>

I would think the way to go in most cases is to generate the small prologue and have the linker rewrite it to the big prologue.

Presumably we can't just rewrite existing prologues because we don't know that there is a DW available to hold the offset in front of the function, do we? However, we need to have the ability to rewrite medium code model code if we want to link libraries and/or objects generated with medium code model into a module that requires large code model?

To know that we do, presumably we have to indicate that a word is there... Either by putting a relocation on a word that can serve as a placeholder for the 64b offset, to know it's there, and we're not overwriting the blr of the previous function, or by putting some reserved label on it that serves as "convention". -- Maybe that's the relocation that Alan was referring to?

Finally, when we allocate such a word, the question might be whether it should really be more than a word, e.g., a cache line, giving a linker the ability to pack multiple words into one line.
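
If offsets were packed into a shared pool like this, each entry point's `ld` would still have to reach its slot. A minimal sketch of that check, assuming `ld` keeps its DS-form displacement (signed 16 bits, multiple of 4); the pool layout and the names `ld_can_reach` and `assign_slots` are hypothetical.

```python
def ld_can_reach(disp):
    """True if `ld rT, disp(rA)` can encode this displacement (DS-form)."""
    return -0x8000 <= disp <= 0x7FFC and disp % 4 == 0


def assign_slots(pool_base, entries):
    """Give each entry point an 8-byte slot in a pool starting at pool_base.

    Returns {entry_address: displacement}, with None where the slot would be
    out of range for a single ld relative to the entry address in r12.
    """
    result = {}
    for i, entry in enumerate(entries):
        disp = pool_base + 8 * i - entry
        result[entry] = disp if ld_can_reach(disp) else None
    return result


# Hypothetical layout: one pool at 0x10000 serving three functions.
slots = assign_slots(0x10000, [0x10100, 0x10400, 0x20000])
for entry, disp in slots.items():
    print(hex(entry), disp)
```

In this made-up layout the third function is almost 64 KB past the pool, so its slot is out of reach and it would need its own pool closer by.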


Previous comments on this issue in RTC:

Ian McIntosh Jan 19, 2016 1:02 PM
What is sequence (B) ?
Michael Gschwind Jan 19, 2016 1:49 PM
Sequence B. You don't want to know :-)
ori, shift, ori shift, ori shift... etc etc, ad nauseam.
Ian McIntosh Jan 19, 2016 3:15 PM
That was my guess, but I thought I'd check. Presumably the 4 ori instructions (or 1 oris and 3 ori, or 2 oris and 2 ori and 1 rldimi?) would have new RLDs, each specific to the part of the displacement that instruction was providing, with linker changes to handle those. I prefer (A).
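
The "part of the displacement" idea can be illustrated by cutting a 64-bit value into four 16-bit pieces and rebuilding it with shifts and ORs. This is only a sketch of the splitting arithmetic; it is not the actual sequence (B), which this thread does not spell out, and the helper names are made up.

```python
MASK64 = (1 << 64) - 1


def split16(value):
    """Return the four 16-bit pieces of a 64-bit value, most significant first."""
    value &= MASK64
    return [(value >> shift) & 0xFFFF for shift in (48, 32, 16, 0)]


def rebuild(parts):
    """Recombine the pieces the way a shift/or chain would."""
    acc = 0
    for part in parts:
        acc = ((acc << 16) | part) & MASK64
    return acc


offset = -(5 * 2**30)                 # hypothetical: .TOC. 5 GiB below func
parts = split16(offset)
print([hex(p) for p in parts])
assert rebuild(parts) == offset & MASK64
```

Each piece would need its own relocation type, and every step depends on the previous one, which is the chain of dependent instructions that counts against (B).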
