Do we need to reopen B1?
Jim Dehnert
dehnert at baalbek.engr.sgi.com
Wed Mar 1 09:22:24 UTC 2000
There are a number of things about the analysis here that are bothering
me. Please bear with me while I describe them, and let me know where
I'm mixed up :).
> From: Christophe de Dinechin <ddd at cup.hp.com>
>
> > If so, I agree that this is a feasible design. I'm still unconvinced that
> > it's a big win; since you're allocating the convert_to_foo slots at the
> > other end of the vtable from the function pointer, I don't see how you can
> > expect d-cache locality.
>
> I do not expect perfect D-cache locality (which is why I thought it useful to
> specify the cost of a D-cache miss in a previous e-mail). I'm still looking for
> a better way to allocate it. Getting better cache locality was the reason I
> was considering duplicating that (as you do for vcall offsets) and putting it
> next to the vtable entry.
>
> On the other hand, a frequent case where you do get cache locality is if you
> call different virtual functions on the same pointer in succession, and if the
> functions are overridden in the same class. Say:
>
> for (i = 0; i < max; i++)
> {
>     Shape *shape = shapes[i];
>     shape->SetColor(red);
>     shape->Scale(3.0);
>     shape->Rotate(2.5);
>     shape->Draw();
> }
>
> This code seems fairly reasonable. In that case, you can expect to pay a cache
> miss only for the first call. Note that this does not apply if g is called
> through f, since in that case we use the non-adjusting entry point anyway. On
> the other hand, on that same code, there is a high probability of an I-cache
> miss for each function, and of a double I-cache miss for thunks. So thunks hit
> their worst-case scenario, whereas adjusting entry points remain in their
> 'typical, no D-cache miss' scenario. Here, we get 9 vs. 3 "abstract units of
> time".
First big problem. In the case above (which I think of as the typical
one -- certainly the one relevant to the cost of branch mispredicts),
what is in common is the class containing the overridden function, i.e.
the source class Shape for the this pointer. What varies is the class
where the final overrider resides, i.e. the target class of the this
conversion. If the target class is always the same, there will be I$
misses only the first time; if it varies, we'll get at least one the
first time for each target class. That will be true whichever implementation
we use, since they all use different adjusting entries for different
overriders.
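To make that concrete, here is a hypothetical hierarchy (names invented
for illustration) where the source class of the call is fixed but the
target class of the this adjustment varies:

    // Shape is a secondary base here, so each class's override of
    // Draw needs its own this-adjusting entry.
    struct Refcounted { virtual ~Refcounted() {} long refs; };
    struct Shape      { virtual void Draw() = 0; virtual ~Shape() {} };
    struct Circle : Refcounted, Shape {   // target class #1
        void Draw() override {}
    };
    struct Square : Refcounted, Shape {   // target class #2
        void Draw() override {}
    };

Walking a Shape*[] that mixes Circles and Squares lands in a different
overrider, and a different adjusting entry, whenever the dynamic type
changes, so each distinct target class costs at least one I$ miss.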
Furthermore, there's a potential for a D$ miss each time the target
class changes, since the vcall offset is coming from a different vtbl.
That doesn't occur for the "AddAddAdd" implementation, though, which
doesn't load anything from the vtbl.
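For reference, the way I picture an "AddAddAdd" entry, emulated in C++
(a sketch with a made-up offset, not any particular compiler's output):

    #include <cstdio>

    // The adjustment is a compile-time constant baked into the entry
    // point, so dispatch does a constant add and never touches the vtbl.
    constexpr long kShapeOffsetInCircle = 16;  // hypothetical constant

    void Circle_Draw(void* full_this) {        // the real function body
        std::printf("Draw %p\n", full_this);
    }

    // Adjusting entry for calls through the Shape secondary base:
    // one statically known subtraction (1 cycle), then the body.
    void Circle_Draw_from_Shape(void* shape_this) {
        Circle_Draw(static_cast<char*>(shape_this) - kShapeOffsetInCircle);
    }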
I think the "3 units of time" is wrong. An implementation using a
vcall offset has only a this pointer to work with. It must load the
vptr (2 cycles at least on most modern implementations I know of, with
a cache hit, which it will always get because the caller just used it),
add the displacement of the vcall offset slot to the vptr (1 cycle),
load the vcall offset (2 cycles plus a possible D$ miss), and add the
vcall offset to this (1 cycle).
So I think we get at least 6 cycles for such an implementation. If I
assume that the usual override chain is short and non-virtual (and
ultimately I think I may be able to use profiling to make it usually
short), at 1 cycle per chain element the "AddAddAdd" method will be
hard to beat.
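Spelled out as a C++ emulation (the slot position is made up), the
dependent chain looks like this:

    #include <cstddef>

    // Every step depends on the result of the previous one, which is
    // where the 6-cycle minimum comes from.
    constexpr std::ptrdiff_t kVcallSlot = -24;  // hypothetical slot position

    void* adjust_this(void* this_ptr) {
        char* vptr = *static_cast<char**>(this_ptr);      // load vptr: 2 cycles
        char* slot = vptr + kVcallSlot;                   // add displacement: 1 cycle
        std::ptrdiff_t off =
            *reinterpret_cast<std::ptrdiff_t*>(slot);     // load offset: 2 cycles + D$ miss?
        return static_cast<char*>(this_ptr) + off;        // add to this: 1 cycle
    }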
Next, let's think about "no D$ miss." In order to use a single
adjusting entry point, that entry point must always be able to find the
vcall offsets for a given target class/overrider (NOT for a given
source class) at the same position relative to the vptr for the source
class. That
means that all of the base classes of a given class with overriders
must have the same layout of vcall offsets. Aside from the
implications of achieving this (which makes my head hurt), it means in
particular that they can't be with the function "pointers", which are
at different offsets in different bases. So I don't think "no D$ miss"
is actually a likely scenario.
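Pictorially (displacements made up), the requirement is:

    vptr for A-in-D --> [vcall offsets at displacement -24] [A's fn "pointers"]
    vptr for B-in-D --> [vcall offsets at displacement -24] [B's fn "pointers"]
                                         ^
                        must be identical for every base, which is
                        exactly what keeps the offsets away from the
                        function "pointers".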
> Summary:
>
> + The best case is clearly the 'AddAddAdd' for a limited number of offsets. No
> argument.
>
> + The second is the adjusting entry point with no D-cache miss. This scenario is
> between 1 and 3 times as bad as the previous one, depending on the number of
> adds.
>
> + In third position, there is a speed-wise tie between thunks and the
> adjusting entry point with a D-cache miss, but there I believe the adjusting
> entry point wins memory-wise, except in pathological cases where padding would
> dominate. Speed-wise, this scenario is about 3 times as bad as scenario 2.
>
>
> As a reminder, the numbers I gave were the following (I added memory usage):
>
>                  Best    Typical  Worst   I-mem    D-mem
> Thunks             5        9      20     16*F*C   0
> AddAddAdd        0+N/M    0+N/M   9+N/M   16*F*C   0
> Adjust             3        3      18     48*F     8*C+Pad
> Adjust/D-miss      8        8      21     48*F     8*C+Pad
>
> C: number of secondary bases requiring adjustments. F: number of virtual
> functions overridden in the current class.
The I-mem case (16*F*C) isn't quite right. Instead of F*C entries, you
need what I'll call F': count all of the overridden instances of the
f's, leaving out duplicates with the same adjustment for the same f.
The point is that just because one of the C secondary bases requires an
adjustment for one f, it need not require one for all the others (e.g.
because some of the bases don't declare all of the overridden
functions).
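A hypothetical example of the overcount:

    // B2 is a secondary base requiring adjustment (C = 1) and D
    // overrides two functions (F = 2), so F*C predicts 2 adjusting
    // entries.  But B2 never declares g, so only f gets a B2-adjusted
    // instance: F' = 1.
    struct B1 { virtual void f() {} virtual void g() {} virtual ~B1() {} };
    struct B2 { virtual void f() {} virtual ~B2() {} };
    struct D : B1, B2 {
        void f() override {}  // B2's copy needs its own adjusting entry
        void g() override {}  // no B2 instance of g exists
    };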
I don't fully understand some of the other entries, but beyond the
above comment about the best case being at least 6 cycles, I suspect they
don't matter much. It looks to me like "AddAddAdd" is so much better
that the alternative isn't worth the complications it introduces.
But of course I may be misinterpreting things badly -- please comment
if so.
To answer the question of the Subject: I haven't reopened issue B-1,
and think that before doing so we should see both a precise proposal
for how an alternate implementation would lay out the vcall offsets,
and a cleaned-up cost analysis based on the new proposal.
Jim
-    Jim Dehnert    dehnert at sgi.com    (650)933-4272