Do we need to reopen B1?

Fri Feb 25 02:10:36 UTC 2000

Jason Merrill wrote:
> 
> >>>>> Christophe de Dinechin <ddd at cup.hp.com> writes:
> 
>  > In the absence of the "adjusting entry points" I proposed, you have to
>  > use thunks. Within a same class, of course, you can use the 'AddAddAdd'
>  > scheme, since you know the different offsets at compile time for the
>  > various subobjects that compose the class. However, a derived class
>  > cannot share this, because its vtable may very well be emitted in a
>  > different .o.
> 
> I strongly disagree.  This is the primary goal of my design.

Rereading through the e-mail and the writeup, I finally get it. But this was
totally obscured to me by the existing wording... So I guess Mark was not the
only one who got confused:

First:

	For each secondary vtable from a non-virtual base class 'B'
	which defines f, an additional entry point is generated which
	performs the constant adjustment from A* to B*. 

So, the intent is indeed that this is a real entry point, with a name and so on,
and that it actually converts from _ B* to A* _, correct? And this entry point
had a name that you can use to refer to it from other vtables, including in
other translation units, correct? So: we need to specify that name. Also, let's
no longer ever call them 'thunk', since to me a thunk is something that gets
emitted with the vtable, not with the target function.

I was also really sent off track by the note:

	Note that the ABI only specifies the multiple entry points;
	how those entry points are provided is unspecified. An existing
	compiler which uses thunks could be converted to use this ABI by
	only adding support for the vcall offsets. A more efficient
	implementation would be to emit all of the thunks immediately
	before the non-adjusting entry point to the function. Another
	might use predication rather than branches to reach the main
	function. Another might emit a new copy of the function for
	each entry point; this is a quality of implementation issue. 

Let me propose a few changes here:

	Note that the ABI only specifies the name and existence of
	multiple entry points. How those entry points are implemented
	is unspecified, as long as they are emitted with the corresponding
	virtual table. An existing compiler which uses thunks could be
	converted to use this ABI by only adding support for the vcall
	offsets (in the case of virtual inheritance), and by creating
	a named entry point and a thunk for each possible adjustment
	from a secondary base to the derived class (in the case of
	multiple, non-virtual inheritance).

	A more efficient implementation would be to emit all of the
	adjusting entry points immediately before the non-adjusting
	entry point to the function. Such thunks can also be combined
	in the form of multiple 'Add' instructions that fall through
	to the main entry point. Another implementation yet is to use
	predication rather than branches to reach the main function
	[Note: I'm not sure this actually works without a calling
	convention on predicates...]. Another might emit a new copy
	of the function for each entry point. Selecting one of these
	implementations is a quality of implementation issue.

Does this wording sound better?

> 
>  > For instance, consider:
> 
>  >      struct A { virtual a(); int aa; };
>  >      struct B { virtual b(); int bb; };
>  >      struct C { virtual c(); int cc; };
> 
>  >      struct D : A, B { virtual b(); int dd; }
>  >      struct E : D { /* Does not override b() */ }
> 
>  > The primary vtable for D points to D::b, no adjustment required. The
>  > secondary vtable for B in D points to D::b with, say, an 'Add'
>  > adjustment, which would be of -16 (from the B* to the D*).
> 
>  > When you emit the vtable for E, the good news is that the adjustment is
>  > the same (-16 from B* to D*). The bad news is that you don't know how to
>  > locate the thunk that adds -16 (unless we all agree on the algorithm for
>  > emitting this kind of "AddAddAdd" thunk, which we did not.)
> 
> Oh, come on.  Yes, we still need to define what entry point symbols will
> be emitted with D::b, and how they are mangled, so that derived classes can
> refer to them.  But we certainly can do that.

Of course, we can, which is why I said "unless we all agree [...], which we did
not"... To me, the initial wording just said: "Emit the thunk if you want".

[Egg on face: with that new understanding, what I wrote for shared libraries was
really n'importe quoi]

>  > So my proposal is simply that, for the non-virtual multiple inheritance
>  > case, we have a this-pointer adjustment offset in each secondary vtable,
>  > that adjusts form the secondary vtable class to the function's target
>  > class.
> 
>  > The algorithm I proposed to allocate these offsets was the following:
> 
>  > - While defining class C, we allocate a single offset "convert_to_C",
>  > that converts from the class of the secondary vtable in which it is
>  > stored to the C class.
> 
> We allocate this in each secondary vtable where we refer to a function
> overridden in C, correct?

Yes.

> 
> This changes the size of the secondary vtables, but we've agreed that
> doesn't matter.
> 
>  > - The convert_to_C in the C primary vtable is theoretically present, but
>  > its value is zero, so it may be omitted.
> 
>  > - For all other classes, the offset of 'convert_to_C' relative to the
>  > secondary vptr is constant. A naive algorithm to ensure that is that
>  > this offset is the first negative offset not used by a 'convert_to_X' in
>  > any of the bases. I acknoledge this generally may padding in the
>  > vtables, but I did not see a better way to do it, short of using a
>  > double indirection.
> 
> "in any of the bases where we want to allocate this value".  OK.

Correct.

> 
>  > This 'convert_to_C' is used by an adjusting entry point emitted
>  > immediatly prior to the main, non-adjusting entry point, which computes
>  > this += convert_to_C. If I use this secondary entry point at all, I know
>  > that the final overrider is C::f, and that the call was in the form
>  > bptr->f(), where bptr is a non-primary non-virtual base of C.
> 
>  > The same 'convert_to_C' value is shared by different virtual functions
>                            ^^^^^ you mean vptr offset?

Yes.

> 
>  > overriden in C, no matter from which base they are overriden (the
>  > interesting case, of course, being when they are overriden from multiple
>  > bases). The reason is that if C::f override B::f, the delta is the same
>  > than if C::g overrides B::g. It may be different than the delta between
>  > C::g and B2::g, but then I am using B2's secondary vtable, which
>  > contains the B2->C conversion.
> 
> OK.
> 
>  > This scheme has the following benefits:
> 
>  > - For each virtual function, you emit a secondary entry point with a
>  > known size.  The best possible size is 48 bytes, unless some magic
>  > predication thingie I did not think of can reduce that. No matter how
>  > many secondary vbases or how many derived classes reuse that function,
>  > the size is 48.
> 
> You don't need it at all if none of your bases need an adjustment.

Yes. But in that case, you don't need it in any of our schemes. I'm trying to
contrast this with the other thunk-generation models. 48 bytes is the cost of 3
of our adds, or of 3 "normal" thunks. So this scheme becomes beneficial
memory-wise for more than 3 bases with adjustment.

> 
>  > - For each secondary base, you add 8 (or more if padding) bytes to the
>  > table.
> 
> "each secondary base which refers to a funtion overridden in C".  Yes.
> 
>  > - This scheme works accross shared libraries, since the thunk used is
>  > always "tied" to the function.
> 
> This is also true for my design, as mentioned above.

Now I understand this.

> 
> As you say above, this only applies to non-virtual inheritance.  This
> scheme could be combined with the existing vcall offsets for virtual bases
> such that you would get one additional entry point per virtual base that
> refers to the function, or two if non-virtual bases of that virtual base
> also refer to the function.  Is this what you had in mind?

I believe the scheme _has to_ be combined with the vcall offsets. It doesn't
work for virtual inheritance. That's why I said earlier the two schemes were not
opposed. So, yes, that's definitely what I have in mind.

> 
> If so, I agree that this is a feasible design.  I'm still unconvinced that
> it's a big win; since you're allocating the convert_to_foo slots at the
> other end of the vtable from the function pointer, I don't see how you can
> expect d-cache locality.

I do not expect perfect D-cache locality (which is why I thought useful to
specify in a previous e-mail the cost of a D-cache miss). I'm still looking for
a better way to allocate it. Getting a better cache locality was the reason I
was considering duplicating that (as you do for vcall offsets) and putting it
next to the vtable entry.

On the other hand, a frequent case where you do get cache locality is if you
call different virtual functions for the same pointer in succession, and if the
functions are overriden in the same class. Say:

	for (i = 0; i < max; i++)
	{
		Shape *shape = shapes[i];
		shape->SetColor(red);
		shape->Scale(3.0);
		shape->Rotate(2.5);
		shape->Draw();
	}

This code seems fairly reasonable. In that case, you can expect to pay a cache
miss only for the first call. Note that this does not apply if g is called
through f, since in that case we use the non-adjusting entry-point anyway. On
the other hand, on that same code, there a high probability of I-cache miss for
each function, and of double I-cache miss for thunks. So thunks go to their
worst scenario, whereas adjusting entry points remain in their 'typical, no
D-cache miss' scenario. Here, we get 9 vs. 3 "abstract units of time".

I would not say that this is an incredible win. Yet, this is a win for some
common cases, it's not extraordinarily complicated, it was documented in my
initial proposal, and it got dropped, I believe, more by distraction than on
purpose (at least from my point of view.) We had found with 'AddAddAdd' an even
better way to deal with the most frequent case, but dropping the next case most
frequent case was not necessary.

Summary:

+ The best case is clearly the 'AddAddAdd' for a limited number of offsets. No
argument.

+ The second is adjusting entry point with no D-cache miss. This scenario is
between 1 and 3 times as bad as the previous one, depending on the number of
adds.

+ For third position, there is a tie speed-wise between the thunks and the
adjusting entry point with D-cache miss, but then I believe the adjusting
entry-point wins memory-wise except in pathological cases where padding would
dominate. Speed-wise, this scenario is about 3 times as bad as scenario 2.

As a reminder, the numbers I gave were  the following (I added memory usage):

                        Best    Typical Worst	I-mem	D-mem
        Thunks          5       9       20	16*F*C	0
        AddAddAdd       0+N/M   0+N/M   9+N/M	16*F*C	0
        Adjust          3       3       18	48*F	8*C+Pad
        Adjust/D-miss   8       8       21	48*F	8*C+Pad

C: number of secondary bases requiring adjustments. F: Number of virtual
functions overriden in current class.

Regards
Christophe