Do we need to reopen B1?

Christophe de Dinechin ddd at cup.hp.com
Wed Mar 1 19:57:10 UTC 2000


Jim Dehnert wrote:
> 
> There are a number of things about the analysis here that are bothering
> me.  Please bear with me while I describe them, and let me know where
> I'm mixed up :).
> 

> >       for (i = 0; i < max; i++)
> >       {
> >               Shape *shape = shapes[i];
> >               shape->SetColor(red);
> >               shape->Scale(3.0);
> >               shape->Rotate(2.5);
> >               shape->Draw();
> >       }
> >
 
> First big problem.  In the case above (which I think of as the typical
> one -- certainly the one relevant to the cost of branch mispredicts),
> what is in common is the class containing the overridden function, i.e.
> the source class Shape for the this pointer.  What varies is the class
> where the final overrider resides, i.e. the target class of the this
> conversion.  If the target class is the same, there will only be I$
> misses the first time; if it varies, we'll get one at least the first
> time for each target class.  That will be true whichever implementation
> we use, since they all use different adjusting entries for different
> overriders.
> 
> Furthermore, there's a potential for a D$ miss each time the target
> class changes, since the vcall offset is coming from a different vtbl.
> That doesn't occur for the "AddAddAdd" implementation, though, which
> doesn't load anything from the vtbl.

AddAddAdd is always the best. No argument. If you can use AddAddAdd, use it.

If everything is in the cache, which is also a common occurrence (the code is
small enough, or the number of targets is small enough), then you are in the
'Best' column, which is 5 for thunks and 3 for the adjusting entry point, so the
adjusting entry point still wins. Actually, if you are really lucky and hammer
on that code a lot, you may even get 3 for the thunk solution (the document I
have strongly implies this is quite unlikely after an indirect branch).


The interesting case is obviously when the final overriding class changes. Say
that we have two possible final overriding classes, A and B, and that for this
particular pointer instance the calls resolve to A::SetColor, A::Scale, and then
B::Rotate, B::Draw.

With the thunks approach we have:

1/ Shape->A thunk for A::SetColor: One I-cache miss for the thunk (never seen
before), one I-cache miss for A::SetColor (the thunk is not close enough).

2/ Shape->A thunk for A::Scale: Same thing. There is no reason to expect that
any of the misses above brought in anything we need here.

3/ Shape->B thunk for B::Rotate: Idem.

4/ Shape->B thunk for B::Draw: Idem.

So we have two I-cache misses per call. If on the next iteration the dynamic
types are C and D, we will get the same misses again. In other words, we can
expect 8 I-cache misses per iteration in the worst case.
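
To make this concrete, here is a minimal sketch of what one of those thunks
does, written as ordinary C++ rather than compiler-generated code; the function
names and the 16-byte offset are made up for illustration:

    #include <cstddef>
    #include <cstdio>

    // Stand-in for the final overrider B::Rotate.
    void B_Rotate(void *this_B, double angle)
    {
        std::printf("B::Rotate on %p by %f\n", this_B, angle);
    }

    // The thunk: adjust `this` from the Shape subobject back to the B object
    // by a compile-time constant, then transfer to the overrider (a real
    // thunk would branch directly; here we simply call).  Each (source class,
    // overrider) pair gets its own thunk, which is why each one can cost its
    // own I-cache miss.
    const std::ptrdiff_t kShapeInB = 16;   // assumed offset of Shape within B

    void thunk_Shape_to_B_Rotate(void *this_Shape, double angle)
    {
        B_Rotate(static_cast<char *>(this_Shape) - kShapeInB, angle);
    }

    int main()
    {
        char object[64];                            // pretend B object
        thunk_Shape_to_B_Rotate(object + kShapeInB, 2.5);
        return 0;
    }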


With the adjustment approach, we have:

1/ Shape->A vcall offset D-cache miss, read from the Shape vtable. One I-cache
miss only, since we fall through.

2/ Shape->A vcall offset again, read from the A::Scale adjusting entry point
through the same Shape vtable. This one, I believe, is in the D-cache. We have
one I-cache miss.

3/ Shape->B vcall offset, relative to the Shape vtable, read before entering
B::Rotate: This one has no strong guarantee of being in the D-cache. However, we
can stuff something like 4 or 8 adjustments per cache line, so the chances that
reading the Shape->A vcall offset also brought in the Shape->B vcall offset are
non-zero. We have one I-cache miss.

4/ Shape->B vcall offset, relative to the Shape vtable, read before entering
B::Draw. No D-cache miss, likely I-cache miss.

Here, we would expect 4 I-cache misses and 2 D-cache misses. Please feel free to
re-run this scenario with A, B, C, D (no benefit of one vs. the other), A, A, A,
A (only one D-cache miss), etc.
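
For comparison, here is the same call sketched with an adjusting entry point.
This is only a model of the idea, not a proposed encoding: the vtable pointer is
passed as an explicit parameter where the real scheme would use the register
discussed below, and the adjustment is read from a vcall offset slot instead of
being baked into per-caller thunks.

    #include <cstddef>
    #include <cstdio>

    // Toy secondary (Shape-in-B) vtable: one vcall offset slot; the virtual
    // function pointers that would follow are omitted.
    struct ShapeVtbl {
        std::ptrdiff_t vcall_offset;   // how far to move `this` for B
    };

    // Shared body of B::Rotate.
    void B_Rotate_body(void *this_B, double angle)
    {
        std::printf("B::Rotate on %p by %f\n", this_B, angle);
    }

    // The adjusting entry point: one load from the vtable (the possible
    // D-cache miss), one add, then fall through into the body.  A single
    // entry point serves every caller, which is where the I-cache savings
    // come from.
    void B_Rotate_adjusting_entry(void *this_Shape, double angle,
                                  const ShapeVtbl *shape_vtbl)
    {
        void *adjusted =
            static_cast<char *>(this_Shape) + shape_vtbl->vcall_offset;
        B_Rotate_body(adjusted, angle);   // fall-through in generated code
    }

    int main()
    {
        char object[64];                   // pretend B object
        ShapeVtbl vtbl = { -16 };          // assumed Shape-to-B adjustment
        B_Rotate_adjusting_entry(object + 16, 2.5, &vtbl);
        return 0;
    }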


What I was trying to say is: I believe adjustment is better than thunks in the
'best' case, and when it degrades, it probably degrades more gracefully on the
typical code I can think of.



> 
> I think the "3 units of time" is wrong.  An implementation using a
> vcall offset has only a this pointer to work with.  It must load the
> vptr (2 cycles at least on most modern implementations I know of, with
> a cache hit, which it will always get because the caller just used it),
> add a displacement to the vcall offset (1 cycle), load the vcall offset
> (2 cycles plus possible D$ miss), and add the vcall offset (1 cycle).
> So I think we get at least 6 cycles for such an implementation.  If I
> assume that the usual override chain is short and non-virtual (and
> ultimately I think I may be able to use profiling to make it usually
> short), at 1 cycle per chain element the "AddAddAdd" method will be
> hard to beat.

First, the sequence you are proposing is not what I described. Loading the vptr
is unnecessary, because the correct vptr has just been loaded by the caller. As
I indicated in the initial description, this assumes we define at the ABI level
which register we load the vptr into (just as we do for the 'this' pointer).
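
To make the difference in operation count explicit, here are the two sequences
written out as a model, with assumed types and with the "vptr register" shown as
an explicit parameter:

    #include <cstddef>

    // The sequence you describe: only `this` is available, so the entry point
    // must reload the vptr before it can reach the vcall offset
    // (load vptr, add displacement, load offset, add offset).
    void *adjust_reloading_vptr(void *thiz, std::ptrdiff_t vcall_slot_disp)
    {
        char *vptr = *static_cast<char **>(thiz);                         // load
        std::ptrdiff_t off =
            *reinterpret_cast<std::ptrdiff_t *>(vptr + vcall_slot_disp);  // add, load
        return static_cast<char *>(thiz) + off;                           // add
    }

    // The sequence I described: the caller has just loaded the vptr to do the
    // virtual dispatch, and the ABI says in which register it stays, so the
    // entry point only does add, load, add.
    void *adjust_with_caller_vptr(void *thiz, char *vptr,
                                  std::ptrdiff_t vcall_slot_disp)
    {
        std::ptrdiff_t off =
            *reinterpret_cast<std::ptrdiff_t *>(vptr + vcall_slot_disp);  // add, load
        return static_cast<char *>(thiz) + off;                           // add
    }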

Second: Most of the timing information is, I believe, not disclosed by Intel for
Merced, and certainly not for McKinley. I stick to my relative values, but I
never said they were cycles, and I did say that I had subtracted some constant
costs that were assumed identical in all hypotheses (even if these costs are
cache-miss costs which vary from execution to execution, they would be identical
between the two models, such as an initial I-cache miss at the beginning of the
code).

Hint: an indirect branch is more likely to have a cost impact on a subsequent
branch than on a subsequent load. If you subtract the cost that is common to the
two, you may end up with 1 and 0, even though neither executes in 0 cycles.


> Next, let's think about "no D$ miss."  In order to have a single
> adjusting entry point, it must be able to always find the vcall offsets
> for a given target class/overrider (NOT for a given source class) at
> the same position relative to the vptr for the source class.  That
> means that all of the base classes of a given class with overriders
> must have the same layout of vcall offsets.

That's correct.

>  Aside from the
> implications of achieving this (which makes my head hurt),

You cannot, in general, achieve that without holes (unused entries). Holes
appear in particular if you have:

struct Left: B1, B2, B3, B4, ... B99 {};
struct Right: C1 {};
struct Derived: Primary, Left, Right {};

In that case, with a dumb algorithm, there will be 98 unused vcall entries in
the vtable for Right: the shared vcall-offset layout makes the Right fragment
reserve slots for the many bases under Left even though it only needs its own.
In practice, I don't think this is a real problem (compared to the overall cost
of the vtable in the above case, assuming we duplicate function descriptors,
etc.).
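
For concreteness, with a dumb one-slot-per-base assignment the shared fragment
would look something like this (slot count and names are purely illustrative):

    #include <cstddef>

    // Every secondary vtable of Derived shares this layout, so the fragment
    // used for Right reserves the slots assigned to the Left branch as well.
    struct SecondaryVtblFragment {
        std::ptrdiff_t vcall_offsets[99];   // only 1 used by Right/C1,
                                            // the other 98 are the holes
        // ... virtual function pointers follow ...
    };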


> >
> > As a reminder, the numbers I gave were  the following (I added memory usage):
> >
> >                         Best    Typical Worst I-mem   D-mem
> >         Thunks          5       9       20    16*F*C  0
> >         AddAddAdd       0+N/M   0+N/M   9+N/M 16*F*C  0
> >         Adjust          3       3       18    48*F    8*C+Pad
> >         Adjust/D-miss   8       8       21    48*F    8*C+Pad
> >
> > C: number of secondary bases requiring adjustments. F: Number of virtual
> > functions overriden in current class.
> 
> The I-mem case (16*F*C) isn't quite right.  Instead of F*C entries, you
> need what I'll call F', meaning that you count all of the overridden
> instances of f's, leaving out those with the same adjustment for the
> same f.  The point is that just because C requires an adjustment for
> one f, it might not require one for all the others (e.g. because some
> of them don't define all of the overridden functions).

That's a more precise count. It doesn't change the fact that 16 > 8 and that F'
is in all likelihood > 1 (at least for the pathological cases we are dealing
with).


> I suspect that they
> don't matter much.  It looks to me like "AddAddAdd" is so much better
> that the alternative isn't worth the complications it introduces.

I don't plan to replace AddAddAdd. I plan to replace thunks as the "backup"
solution when AddAddAdd becomes too costly.

What I'm suggesting is:

1/ If you have fewer than (say) 5 different displacements and no multiple
inheritance, use "AddAddAdd" (not required by the ABI).

2/ When you have more than 5 different displacements and no virtual base, add a
vcall entry in each secondary vtable which requires a displacement (required by
the ABI) and emit an adjusting entry point rather than emitting thunks (not
required by the ABI). We could add a rule saying that we do that only if this
does not involve unreasonable padding.

3/ In all other cases (in particular in the presence of virtual bases), emit
thunks (hierarchies illustrating cases 2 and 3 are sketched below).
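
For concreteness, these made-up hierarchies show where the displacements come
from in cases 2 and 3; the names are illustrative only:

    // Case 2: non-virtual multiple inheritance.  A call to g() through a Y*
    // that resolves to D::g needs a constant this-adjustment from the Y
    // subobject to D; that is what the vcall entry and the adjusting entry
    // point (or, today, a thunk) would handle.
    struct X { virtual void g() {} };
    struct Y { virtual void g() {} };
    struct D : X, Y { virtual void g() {} };

    // Case 3: a virtual base.  The adjustment from the V subobject to E can
    // depend on the most derived object, so this case stays with thunks.
    struct V { virtual void h() {} };
    struct E : virtual V { virtual void h() {} };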


Regards
Christophe



