Ooooops (Virtual functions)

Christophe de Dinechin ddd at cup.hp.com
Thu Feb 24 19:58:31 UTC 2000


Folks,


After discussing with Jason, I realize I definitely made a mistake in reading his
writeup: we killed my virtual call proposal without my noticing...

So, let me re-state the rationale and the proposed implementation, and I'll try
to illustrate all cases with pseudo-code. Mark will be our sanity-checker by
implementing it in his spare time, OK, Mark? ;-)


METHOD 1: THUNKS

A virtual function call going through a thunk will have the following code
trail:

		br	b0=(b6)				// indirect branch to the thunk
	thunk:	addi	r32=r32,this_delta	// adjust 'this' (in r32)
		br	target
	target:	// function body

We can reasonably assume that:
1/ The (b6) target is always mispredicted.
2/ The thunk is not in the I-cache, because it depends on the (function, class)
pair and is therefore called rather infrequently.
3/ The thunk is more than 64 or 128 bytes away from the target (at 16 bytes per
bundle, this becomes true as soon as you have more than 4 thunks).
4/ The target code may or may not be in the I-cache.

For obvious reasons, I will subtract the cost of the first branch misprediction
in all the costs below, since this cost is shared between all implementations.

I'm trying to be vague enough not to disclose proprietary information. Let's
make the hypothetical hypothesis that, maybe, the cost of this sequence would be
between 5 and 20 cycles, with a 'typical' value at, say, 9 cycles. That's not
cheap.


METHOD 2: Add sequences

My proposal was to try to keep the thunks close to the 'target' label, and avoid
branches, so that we can consider for instance:

		br	b0=(b6)
	thunk3:	addi	r32=r32,this_delta3-this_delta2;;
	thunk2:	addi	r32=r32,this_delta2-this_delta1;;
	thunk1:	addi	r32=r32,this_delta1;;
	target:	// function body

This case would have a hypothetical cost between 0+N/M and 9+N/M, where N is the
index of the thunk entered and M is the number of bundles executed per cycle.
For small values of N and/or large values of M, this becomes significantly
better. One important factor is that if there is an I-cache miss, at least the
prefetch will start loading the body of the function. So the typical value this
time would be 0+N/M, where I assume that the function and the thunks are indeed
in the I-cache.


METHOD 3: Adjusting entry point

If the number of thunks created this way becomes too large for N/M to be really
smaller than 5, I proposed the following alternative:

		br	b0=(b6)
	common_thunk:
		add	r9=vptr,offset;;	// address of the this_delta slot
		ld8	r9=[r9];;		// load this_delta
		add	r32=r9,r32		// adjust 'this'
	target:	// function body

This sequence would read the 'this_delta' value from the secondary vtable
corresponding to the type of the pointer being used. In the secondary vtable
B-in-C of any non-primary base B of class C, there would be a single such
'this_delta' converting from a B* to a C*. The constraint is that all such
values be stored at a fixed offset relative to their secondary vtable pointer;
see the end of this message.

From a cache perspective, it is important to note that the value loaded from the
secondary vtable is shared among virtual functions. That is, whether you call
ptr->f() or ptr->g(), you read the same value, as long as f() and g() are
overridden in the same class. Because of that, I believe the likelihood of a
D-cache miss on the load is significantly smaller than the likelihood of an
I-cache miss in the original sequence. I'll add below the cases with and without
a D-cache miss.

Without a D-cache miss, the cost of that sequence is anywhere between 3 and 12
cycles, with a sweet spot around 3 in the absence of an I-cache miss. With a
D-cache miss, it can be anywhere between 8 and 21 cycles, again with the sweet
spot closer to 8 than to 21.


To summarize:
			Best	Typical	Worst
	Thunks		5	9	20
	AddAddAdd	0+N/M	0+N/M	9+N/M
	Adjust		3	3	18
	Adjust/D-miss	8	8	21

You will notice that in the absence of a D-cache miss, which I think we can
justify, this sequence is significantly better. It may get slightly worse than
the thunk, though, if you are really unlucky (that is, you manage to get a
worst-case D-cache miss and a worst-case I-cache miss in a row).

All this is, of course, purely hypothetical.


So the really good question is: can we allocate the 'this_delta' entries in the
secondary vtable so that they are at a constant offset in each secondary vtable?
My initial proposal was to allocate them at negative offsets from the secondary
vptr (where we also allocate the virtual base offsets), and to just take min(all
existing offsets) - 4. This has the obvious drawback that it introduces padding,
but it should work. Alternatively, we can just extend the vcall offsets to the
non-virtual inheritance case.


Regards
Christophe


Jason Merrill wrote:
> 
> >>>>> Christophe de Dinechin <ddd at cup.hp.com> writes:
> 
>  > Now, Jim, I just realized that for some reason, in the writeup, the
>  > vcall offsets are restricted for "virtual bases" cases. I am unable to
>  > implement the "single adjusting entry point" optimization if this is the
>  > case. I fixed it in the attached writeup, but I may be wrong.
> 
>  > Jason's writeup to B-1 says that you have to implement the vcall offsets
>  > even if you don't use them. I don't think we agreed to get rid of them
>  > (or is this something else important I missed?)
> 
> Apparently so.  Your earlier response to Mark reflects your original
> proposal, but what we eventually decided on was significantly different.
> The vcall offsets are allocated in the secondary vtable for a virtual base,
> one for each virtual function.  Converting from a nonvirtual base uses an
> additional entry point.
> 
> From my writeup in the issues list:
> 
> We have decided that for virtual functions not inherited from a virtual
> base, regular thunks will work fine, since we can emit them immediately
> before the function to avoid the indirect branch penalty; we will use
> offsets in the vtable for functions that come from a virtual base, because
> it is impossible to predict what the offset between the current class and
> its virtual base will be in classes derived from the current class.
> 
> According to the issues list, we accepted my design at the 12/2 meeting.
> 
> Jason




More information about the cxx-abi-dev mailing list