vtable layout
Jason Merrill
jason at cygnus.com
Tue Aug 31 19:34:46 UTC 1999
>>>>> Christophe de Dinechin <ddd at cup.hp.com> writes:
> You always require the thunk _generation_. All I am saying is that
> as long as you use any of the non-virtual bases vtables, I think that
> you don't need to go through the thunk. In other words, you pay the
> thunk run-time penalty only when you call the virtual function
> through the virtual base's vtable (you always pay the space penalty).
Hmm? Where do you do the adjustment from a non-virtual base, if not in the
thunk?
> In the diamond case where the actually called functions are on each
> side of the diamond, you can't in general generate an adjustment
> thunk that is close to the target, whichever method you chose. But
> maybe I missed something in your discussion. Did you find a trick I
> did not understand?
Brian's proposal was to allocate, in a virtual base's vtable, base
adjustment slots for all the virtual functions provided by that base. This
prevents us from ever having to generate a third-party thunk, at the cost
of doubling the size of virtual base secondary vtables. This is clearly a
tradeoff.
> Also note Jim's idea of predicating the adjustment, using the low
> bit of the function pointer. This would mean that the adjustment
> would probably cost much less than 3 cycles, with an extra cost at
> call site that we did not analyze yet.
I still don't see how call-site adjustment can work under this model; we
don't know how to find the adjustment at the call site.
> 1/ Misprediction penalty
> All I can say is that the hypothesis that the penalty is 2 cycles or
> less is way too optimistic (by at least a factor of an odd prime
> number, and even more on the first implementation. What? No, I did
> not say it!). But, as I said earlier, I don't think that's the major
> factor.
Is there a term for the case when the branch predictor correctly predicts a
branch but the pipeline stalls because the prefetcher assumed no branch?
That's what I read from Brian's message. I would expect the penalty for
that to be lower, especially since the pipeline hasn't had a chance to fill
after the indirect branch. But I'll admit I don't know much about these
issues.
> Regarding whether the second branch would be correctly predicted or
> not... The documentation I have is quite difficult to decipher, so
> I'm not too sure. My impression is that at least on one
> implementation, the branch would predict correctly and not cause an
> additional penalty.
What would be the excuse for mispredicting an unconditional forward
pc-relative branch?
> 2/ I-cache
> You considered a D-cache miss in my proposal. Fair enough. Just note
> that the memory access is in the vtable, which is frequently
> accessed. A D-cache miss is "unlikely", a page fault virtually ( :-)
> impossible. The same line will probably be reused at the next virtual
> call to a function of the same class.
Will it matter that the offset will be located before the function pointer
in the vtable? In other words, does a load cause the cache to load data
from both sides of the requested address or does it only load forward?
> On the other hand, an I-cache miss with a thunk model is very
> likely. The thunk is used for a single (class, member) combination
> (as opposed to the offset that depends on the class alone). What is
> close are probably thunks for the same member and different type,
> which would be reused only if I called the same member function with
> a different dynamic type.
> Last comment on the subject: you _really_ don't want a cache miss,
> and the I-cache is _really_ small.
I assume you are referring here to an I-cache miss on the branch to the
main function? If so, that's a question of...
> 3/ Locality
> In my proposal, the secondary entry point immediately precedes the
> function. Page faults, cache load and prefetching all benefit from
> this locality. Locality also exists for the data accesses, which are
> close to a location immediately accessed (the vtable).
> For thunks, this can only be guaranteed to some extent at the
> page-fault level. A cache line is probably too small for a cache load
> at the thunk address to also load any code for the function.
A typical thunk consists of

    add 4 to %rthis
    branch to function

which is 16 bytes, as you say. Are cache lines really so small that several
of these won't fit?
> 4/ Memory usage
> The memory usage is different. I think for a small number of thunks,
> my proposal is worse (since it uses 48 bytes for the secondary entry
> point). On the other hand, as the number of adjustments grows, it gets
> better, since it uses 4 bytes per adjustment rather than 16.
I assume you mean 8 bytes (64 bits). And it uses more than that; in cases
where we use extra thunks, you have to pad out the vtable so that the
offsets line up.
> 5/ Summarizing the cost
> Zeroing out what is common (the indirect branch and the possible
> I-cache miss on the target code), the penalties are something like:
> - P1 * A + P2 * B + C for my proposal.
> - P3 * A + P4 * B + P5 * D + E for thunks
For thunks after the first, that is. The first one will have no penalty.
> - P1, P2 are the probabilities that a L0 or L1 data cache miss
> occurs in my proposal (either at the time the load is made, or later,
> because of additional cache pressure)
> - P3, P4 are the probabilities that an L0 or L1 I-cache miss occurs
> for the thunk (or later, as above)
> I know for a fact that A and B are much larger than C, D and E. I
> also assume that P3 > P1 and P4 > P2, given both the cache locality
> and memory size.
I'm not so sure about that. Cache locality in your proposal depends on the
size of the D vtable (and any others between it and the vptr we're using),
and whether the cache loads backwards. With thunks, cache locality depends
on the number of thunks generated; in other words, the number of times the
same function appears in distinct non-virtual bases.
Jason