vtable layout
Jason Merrill
jason at cygnus.com
Tue Aug 31 19:34:46 UTC 1999
>>>>> Christophe de Dinechin <ddd at cup.hp.com> writes:
> You always require the thunk _generation_. All I am saying is that
> as long as you use any of the non-virtual bases vtables, I think that
> you don't need to go through the thunk. In other words, you pay the
> thunk run-time penalty only when you call the virtual function
> through the virtual base's vtable (you always pay the space penalty).
Hmm? Where do you do the adjustment from a non-virtual base, if not in the
thunk?
> In the diamond case where the actually called functions are on each
> side of the diamond, you can't in general generate an adjustment
> thunk that is close to the target, whichever method you chose. But
> maybe I missed something in your discussion. Did you find a trick I
> did not understand?
Brian's proposal was to allocate, in a virtual base's vtable, base
adjustment slots for all the virtual functions provided by that base. This
prevents us from ever having to generate a third-party thunk, at the cost
of doubling the size of virtual base secondary vtables. This is clearly a
tradeoff.
> Also note Jim's idea of predicating the adjustment, using the low
> bit of the function pointer. This would mean that the adjustment
> would probably cost much less than 3 cycles, with an extra cost at
> call site that we did not analyze yet.
I still don't see how call-site adjustment can work under this model; we
don't know how to find the adjustment at the call site.
> 1/ Misprediction penalty
> All I can say is that the hypothesis that the penalty is 2 cycles or
> less is way too optimistic (by at least a factor of an odd prime
> number, and even more on the first implementation. What? No, I did
> not say it!). But, as I said earlier, I don't think that's the major
> factor.
Is there a term for the case when the branch predictor correctly predicts a
branch but the pipeline stalls because the prefetcher assumed no branch?
That's what I read from Brian's message. I would expect the penalty for
that to be lower, especially since the pipeline hasn't had a chance to fill
after the indirect branch. But I'll admit I don't know much about these
issues.
> Regarding whether the second branch would be correctly predicted or
> not... The documentation I have is quite difficult to decipher, so
> I'm not too sure. My impression is that at least on one
> implementation, the branch would predict correctly and not cause an
> additional penalty.
What would be the excuse for mispredicting an unconditional forward
pc-relative branch?
> 2/ I-cache
> You considered a D-cache miss in my proposal. Fair enough. Just note
> that the memory access is in the vtable, which is frequently
> accessed. A D-cache miss is "unlikely", a page fault virtually ( :-)
> impossible. The same line will probably be reused at the next virtual
> call to a function of the same class.
Will it matter that the offset will be located before the function pointer
in the vtable? In other words, does a load cause the cache to load data
from both sides of the requested address or does it only load forward?
> On the other hand, an I-cache miss with a thunk model is very
> likely. The thunk is used for a single (class, member) combination
> (as opposed to the offset that depends on the class alone). What is
> close are probably thunks for the same member and different type,
> which would be reused only if I called the same member function with
> a different dynamic type.
> Last comment on the subject: you _really_ don't want a cache miss,
> and the I-cache is _really_ small.
I assume you are referring here to an I-cache miss on the branch to the
main function? If so, that's a question of...
> 3/ Locality
> In my proposal, the secondary entry point immediately precedes the
> function. Page faults, cache load and prefetching all benefit from
> this locality. Locality also exists for the data accesses, which are
> close to a location immediately accessed (the vtable).
> For thunks, this can only be guaranteed to some extent at the
> page-fault level. A cache line is probably too small for a cache load
> at the thunk address to also load any code for the function.
A typical thunk consists of

    add 4 to %rthis
    branch to function

which is 16 bytes, as you say. Are cache lines really so small that several
of these won't fit?
> 4/ Memory usage
> The memory usage is different. I think for a small number of thunks,
> my proposal is worse (since it uses 48 bytes for the secondary entry
> point). On the other hand, as the number of adjustments grows, it gets
> better, since it uses 4 bytes per adjustment rather than 16.
I assume you mean 8 bytes (64 bits). And it uses more than that; in cases
where we use extra thunks, you have to pad out the vtable so that the
offsets line up.
> 5/ Summarizing the cost
> Zeroing out what is common (the indirect branch and the possible
> I-cache miss on the target code), the penalties are something like:
> - P1 * A + P2 * B + C for my proposal.
> - P3 * A + P4 * B + P5 * D + E for thunks
For thunks after the first, that is. The first one will have no penalty.
> - P1, P2 are the probabilities that a L0 or L1 data cache miss
> occurs in my proposal (either at the time the load is made, or later,
> because of additional cache pressure)
> - P3, P4 are the probabilities that an L0 or L1 I-cache miss occurs
> for the thunk (or later, as above)
> I know for a fact that A and B are much larger than C, D and E. I
> also assume that P3 > P1 and P4 > P2, given both the cache locality
> and memory size.
I'm not so sure about that. Cache locality in your proposal depends on the
size of the D vtable (and any others between it and the vptr we're using),
and whether the cache loads backwards. With thunks, cache locality depends
on the number of thunks generated; in other words, the number of times the
same function appears in distinct non-virtual bases.
Jason