vtable layout

Christophe de Dinechin ddd at cup.hp.com
Tue Aug 31 18:10:46 UTC 1999


> Christophe:
>
> >In terms of performance, the impact is limited, because it will
> >occur only if you use an A* to call f() or g(). With a B*, a C* or a 
> >D*, the pair (vtable, offset) is unique. The same offset can be
> >reused for f() and g() and mean, in one case, "convert_to_X", in the 
> >other case, "convert_to_Y". Same thing for non-virtual inheritance. 
> >Last, the thunk generated in that case is no worse than the thunk 
> >that would be generated otherwise: we win in other cases, and don't 
> >lose in this one.
>
> I still don't fully understand.  What cases of virtual inheritance 
> will not require a thunk?  Is it when the virtual base appears only 
> once in the hierarchy (and so might as well not have been virtual)? 
> That is the only time when you can maintain the spatial relationship 
> between derived and virtual base vtables.

You always require the thunk _generation_. All I am saying is that
as long as you use any of the non-virtual bases' vtables, I think that
you don't need to go through the thunk. In other words, you pay the
thunk run-time penalty only when you call the virtual function
through the virtual base's vtable (you always pay the space penalty).


> And, as Jason points out, you are using the worst kind of thunk, it
> probably isn't even on the same page as any of the other code, never
> mind the same cache line.

In the diamond case, where the functions actually called are on each
side of the diamond, you can't in general generate an adjustment
thunk that is close to the target, whichever method you choose. But
maybe I missed something in your discussion. Did you find a trick I
did not understand?


> I interpret some of Christophe's
> earlier contributions to suggest that we are likely to have just
> suffered a mispredicted indirect branch, and in words stolen from
> Gulliver's Travels, which seems to have something to say about
> almost any situation, we may find the pipeline "lank as a bladder". 

This is indeed the appropriate interpretation :-) Some simulations  
we ran here showed 100% misprediction on the first branch in rather  
common cases.

Regarding whether the second branch would be correctly predicted or  
not... The documentation I have is quite difficult to decipher, so  
I'm not too sure. My impression is that at least on one  
implementation, the branch would predict correctly and not cause an  
additional penalty.


>
> Let's see how well I can summarize this for nonvirtual inheritance: 
>
> On the one hand we have Christophe's reach-back entry point which, 
> because of RAW dependencies, is intrinsically 3 cycles and may suffer 
> an extra D-cache miss, but which can always fall through.

Also note Jim's idea of predicating the adjustment using the low
bit of the function pointer. This would mean that the adjustment
would probably cost much less than 3 cycles, with an extra cost at
the call site that we have not analyzed yet.


> On the other hand we have the thunks we have been discussing, which 
> are one cycle but only one of them can fall through.  Others will have 
> a taken branch penalty which may or may not affect throughput.
>
> It looks to me that our performance is better in the fall-through case 
> and, as long as the penalty is 2 cycles or less, at least as good
> in the other cases, and we don't risk the extra D-cache miss, and we
> have avoided growing the vtables in a way that has a worst-case 2X
> expansion.

I think this is a slightly incomplete analysis. Unfortunately, it is  
unclear to me what exactly I can disclose here. Intel?


1/ Misprediction penalty

All I can say is that the hypothesis that the penalty is 2 cycles or  
less is way too optimistic (by at least a factor of an odd prime  
number, and even more on the first implementation. What? No, I did  
not say it!). But, as I said earlier, I don't think that's the major  
factor.


2/ I-cache

You considered a D-cache miss in my proposal. Fair enough. Just note  
that the memory access is in the vtable, which is frequently  
accessed. A D-cache miss is "unlikely", a page fault virtually ( :-)  
impossible. The same line will probably be reused at the next virtual  
call to a function of the same class.

On the other hand, an I-cache miss with a thunk model is very
likely. The thunk is used for a single (class, member) combination
(as opposed to the offset, which depends on the class alone). The
thunks located nearby are probably those for the same member and a
different type, which would be reused only if I called the same
member function with a different dynamic type.

Last comment on the subject: you _really_ don't want a cache miss,  
and the I-cache is _really_ small.


3/ Locality

In my proposal, the secondary entry point immediately precedes the  
function. Page faults, cache load and prefetching all benefit from  
this locality. Locality also exists for the data accesses, which are  
close to a location immediately accessed (the vtable).

For thunks, this can only be guaranteed to some extent at the  
page-fault level. A cache line is probably too small for a cache load  
at the thunk address to also load any code for the function.


4/ Memory usage

The memory usage is different. I think for a small number of
adjustments, my proposal is worse (since it uses 48 bytes for the
secondary entry point). On the other hand, as the number of
adjustments grows, it gets better, since it uses 4 bytes per
adjustment rather than 16.


5/ Summarizing the cost

Zeroing out what is common (the indirect branch and the possible  
I-cache miss on the target code), the penalties are something like:

- P1 * A + P2 * B + C for my proposal.

- P3 * A + P4 * B + P5 * D + E for thunks


The architectural variables A, B, C, D and E all depend on the IA64  
implementation being considered:

- A, B are the penalties for L0 cache miss and L1 cache miss

- C is the cost of the adjustment in my proposal. Currently, it is 3  
cycles, but could be less with Jim's idea.

- D is the cost of a resteer immediately following a mispredicted branch.

- E is the cost of a correctly predicted taken branch.


The P1, P2, P3, P4 and P5 are probabilities of something happening  
while executing C++ code:

- P1, P2 are the probabilities that an L0 or L1 data cache miss
occurs in my proposal (either at the time the load is made, or later,
because of additional cache pressure)

- P3, P4 are the probabilities that an L0 or L1 I-cache miss occurs
for the thunk (or later, as above)

- P5 is the probability that a non-zero resteer occurs following the  
mispredicted branch. (It may be zero, I'm not sure).

I know for a fact that A and B are much larger than C, D and E. I
also assume that P3 > P1 and P4 > P2, given both the cache locality
and memory size. Therefore, I believe that the dominant factor is (P3
- P1) * A + (P4 - P2) * B. Given the values for A, B, C, D and E on
current implementations, this factor remains dominant as long as
(P3-P1) or (P4-P2) is roughly above 15 to 20% (for a C of 3). I
believe that's the case, but this remains to be proven by a
simulation.


Best regards
Christophe



