Virtual function calls

Thu Jun 24 00:31:27 UTC 1999

Folks,

Here is a summary of the virtual function call mechanism.

OPEN ISSUES THAT BELONG TO THIS DISCUSSION (Coming later)

1/ Keeping all of a class in a single load module. The vtable  
contains the target address and one copy of the target GP. This  
implies that it is not in text, and that it is generated by dld.

2/ Detailed layout of the virtual table

3/ How can we share class offsets?

1/ Scope and "State of the Art"

The following proposal applies only to calls to virtual functions  
when a this pointer adjustment is required from a base class to a  
derived class. Essentially, this means multiple inheritance, and the  
existence of two or more virtual table pointers (vptr) in the  
complete object. The multiple vptrs are required so that the layout  
of all bases is unchanged in the complete object. There will be one  
additional vptr for each base class which already required a vptr,  
but cannot be placed in the whole object so that it shares its vptr  
with the whole object. Note: when the vptr is shared, the base class  
is said to be the "primary base class", and there is only one such  
class.

For the primary base class, no pointer adjustment is needed. For all  
other bases, a pointer to the whole object is not a pointer to the  
base class, so whenever a pointer to the base class is needed,  
adjustment will occur.

In particular, when calling a virtual function, one does not know in  
advance in which class the function was actually defined. Depending  
on the actual class of the object pointed to, pointer adjustment may  
be needed or not, and the pointer adjustment value may vary from  
class to class. The existing solution is to have the vtable point not  
to the function itself, but to a "thunk" which does pointer  
adjustment when needed, and then jumps to the actual function.  
Another possibility is to have an offset in the vtable, which is used  
by the called function. However, more often than not, this implies  
adding zero.

Virtual bases make things slightly more complicated. In that case,  
the data layout is such that there is only one instance of the  
virtual base in the whole object. Therefore, the offset from a this  
pointer to a same virtual base may change along the inheritance tree.  
This is solved by placing an offset in the virtual table, which is  
used to adjust the this pointer to the virtual base.

2/ Proposal and Rationale

My proposal is to replace thunks with offsets, with two additional tricks:
- Give a virtual function two entry points, so as to bypass the  
adjustment when it's known to be zero
- Moving the adjustment at call-site, where it can be scheduled more  
easily, using a "reasonable" value, so that the adjustment is  
bypassed even more often.

The thunks are believed to cost more on IA64 than they would on  
other platforms. The reason is that they are small islands of code  
spread throughout the code, where you cannot guarantee any cache  
locality. Since they immediately follow an indirect branch, chances  
are we will always encounter both a branch misprediction and a  
I-cache miss in a row.

On the other hand, a virtual function call starts by reading the  
virtual function address. Reading the offset immediately thereafter  
should almost never cause a D-cache miss (cache locality should be  
good). More often than not, no adjustment is needed, or the  
adjustment will be done at call site correctly. In the worst case  
scenario, we perform two adjustments, one static at call site, and  
one dynamic in the callee, but this case should be really infrequent.

3/ New Calling Convention

The new calling convention requires that the 'this' pointer on entry  
points to the class for which the virtual function is just defined.  
That is, for A::f(), the pointer is an A* when the main entry of the  
function is reached. If the actual pointer is not an A*, then an  
adjusting entry point is used, which immediately precedes the  
function.

In the following, we will assume the following examples:
	struct A { virtual void f(); };
	struct B { virtual void g(); };
	struct C: A, B { }
	struct D : C { virtual void f(); virtual void g(); }
	struct E: Other, C { virtual void f(); virtual void g(); }
	struct F: D, E { virtual void f(); }
	void call_Cf(C *c) { c->f(); }
	void call_Cg(C *c) { c->g(); }
	void call_Df(D* d) { d->f(); }
	void call_Dg(D* d) { d->g(); }
	void call_Ef(E* e) { e->f(); }
	void call_Eg(E* e) { e->g(); }
	void call_Ff(F *ff) { ff->f(); }
	void call_Fg(F *ff) { ff->g(); }		// Invalid:  
ambiguous

a) Call site:
The caller performs adjustment to match the class of the last  
overrider of the given function.

+ call_Cf will assume that the pointer needs to be cast to an A*,  
since C::f is actually A::f. Since A is the primary base class, no  
adjustment is done at call site.

+ call_Cg is similar, but assumes that the actual type is a B*, and  
performs the adjustment, since B is not the primary base class.

+ call_Df and call_Dg will assume that the pointer needs to be cast  
to a D*, which is where D::f is defined. No adjustment is performed  
at call site.

b) Callee

+ A::f and B::g are defined in classes where there is a single vptr.  
They don't define a secondary entry point. Because of call-site  
conventions, they expect to always be called with the correct type.

+ D::f is defined in a class where there is more than one vptr, so  
it needs a secondary entry point and an entry 'convert_to_D' in the  
vtable. That's because it can be potentially called with either an A*  
or a B*. There are two vtables, one for A in D, one for B in D. The  
D::f entry in A in D points to the non-adjusting entry point, since A  
shares its vptr.

+ D::g requires a secondary entry point, that will read the same  
offset 'convert_to_D' from the vtable.

+ E also will require a 'convert_to_E' entry in the vtable, but this  
time, the vtable for A in C will have to point to an adjusting entry  
point, since A no longer shares the vptr with E (assuming Other has  
a vptr). This vtable is also the vtable of C in E.

c) Offsets in the vtable
Offsets have to be placed in the vtable at a position which does not  
conflict with any offset in the inheritance tree.

convert_to_D and convert_to_E are likely to be at the same offset in  
the vtable. This is not a problem, even if D and E are used in the  
same class, such as F, because this is the same offset in different  
vtables.

- call_Fg is invalid, because it is ambiguous.

- A notation such as ((E*) ff)->g() can be used to disambiguate, but  
in that case, we don't use the same vtable (either the E in F or D  
in F vtable). The E in F vtable uses that offset as 'convert_to_E',  
whereas the D in F vtable uses that offset as 'convert_to_D'.

- Similarly, call_Cf called with an F object will actually be called  
with the E in F or D in F, which disambiguates which C is actually  
used. The actual C* passed will have been adjusted by the caller  
unambiguously, or the call will be invalid.

- For functions overriden in F, an entry 'convert_to_F' is created  
anyway. This entry will not overlap with either convert_to_E or  
convert_to_D.

The fact that an offset is reserved does not mean that it is  
actually used. A vtable need to contain the offset only if it refers  
to a function that will use it. An offset of 0 is not needed, since  
the function pointer will point to the non-adjusting entry point in  
that case.

4/ Cases where adjustment is performed

+ For call_Cf: No adjustment is done at call site. No adjustment is  
done at callee site if the dynamic type is C,  or D, or D in F (that  
is, F casted to an E).

+ For call_Cg: Adjustment to B* is done at call-site. No further  
adjustment is needed if the dynamic type is C, D, or D in F. On the  
other hand, a second adjustment may happen for an E or E in F,  
because C is not their primary base.

In other words, adjustment is made only when necessary, and at a  
place where it is better scheduled than with thunks. The only bad  
case is double adjustment for call_Cg called with an E*. This case  
can probably be considered rare enough, compared to calls such as  
call_Cg called with a C*, where we now actually do the adjustment at  
the call-site.

4/ Comparing the code trails

Currently, the sequence for a virtual function call in a shared  
library will look as follows. I'm assuming +DD64, there would be some  
additional addp4 in +DD32. The trail below is the dynamic execution  
sequence. In bold and between #if/#endif, the affected code.

	// Compute the address of the vptr in the object, from the  
this pointer
	// Optional, since vptroffset is often 0. This also adjusts  
to the class
	// of the final overrider
	addi		Rthis=vptroffset_of_final_overrider,Rthis
	;;
	// Load the vptr in a register
	ld8		Rvptr=[Rthis]
	;;
	// Add the offset to get to the function descriptor pointer  
in the vtable
	// Never zero, this instruction is always generated
	addi		Rfndescr=fndescroffset,Rvptr
	;;
	// (Assuming inlined stub) Load the function address and new GP
	ld8		Rfnaddr=[Rfndescr],8
	;;
	// Load the new GP
	ld8		GP=[Rfndescr]
	mov		BRn=Rfnaddr
	;;
	// Perform the actual branch to the target

	// For prediction to occur correctly on McKinley,
	// at least 5 cycles between the mov to BR above and the use
	// of the BR. Up to 16 bundles...
	// This is very unlikely to be possible in most actual code, so...
	// ... Branch misprediction almost always
	// ... followed by I-Cache miss almost always if jumping to  
a thunk
	br.call	B0=BRn

#if OLD_ADJUST
thunk_A::f_from_a_B:
	// If the 'adjustment_from_B_to_A is the 'adjustment_to_A' above,
	// then in the new case, the vtable directly points to A::f
	addi		Rthis,adjustment_from_B_to_A

	// In most cases, we can probably generate a PC-relative  
branch here
	// It is unclear whether we would correctly predict that branch
	// (since it is assumed that we arrive here immediately following
	// a misprediction at call site)
	br		A::f
#endif // OLD_ADJUST

 // This occurs less often than OLD_ADJUST
// (it does not happen when call-site adjustment is correct)
#if NEW_ADJUST
adjusting_entry_A::f
	// Can't be executed in less than 3 cycles?
	addi		Rvptr=class_adjustment_offset,Rvptr
	;;
	// This loads data which is close to the fn descriptor,
	// so it's likely to be in the D-cache
	ld8		Rvptr=[Rvptr]
	;;
	add		Rthis=Rthis,Rvptr
#endif

A::f:
	alloc	...

Christophe