Why did we decouple the data plane and the control plane in the past design? There should be a good reason for that. Does this proposal violate that reason, and does that cause any problems?
Yichi Zhang
Nov 3, 2021
Approver
The current workflow generally works well; the proposal targets a very specific scenario: small and cheap bundles. For large bundles it is not feasible to assume that all the data can be transferred over the control plane in one gRPC message, due to memory constraints and efficiency concerns.
The internal Dataflow streaming service uses a single proto for input and output data, and this leads to bundles that can't be processed because they are too big.
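As a rough illustration of the runner-side decision this implies (a Python sketch; the function names, the embedded_inputs field, and the 1 MB cutoff are assumptions, not the actual Fn API surface):

    def start_bundle(control_client, data_client, descriptor_id, encoded_inputs,
                     embed_limit_bytes=1 << 20):
        # encoded_inputs: transform_id -> list of already-encoded element chunks.
        total = sum(len(c) for chunks in encoded_inputs.values() for c in chunks)
        if total <= embed_limit_bytes:
            # Tiny bundle: one control-plane round trip, no data-plane traffic.
            control_client.send_process_bundle_request(
                descriptor_id, embedded_inputs=encoded_inputs)
        else:
            # Large bundle: keep today's behavior and stream over the data
            # plane, which avoids holding everything in one gRPC message.
            control_client.send_process_bundle_request(descriptor_id)
            for transform_id, chunks in encoded_inputs.items():
                data_client.send(transform_id, chunks, is_last=True)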
Luke Cwik
Nov 3, 2021
Approver
Is there a concern that these larger messages over the control plane create bottlenecks for smaller messages like splitting and progress reports?
Any recommendations on the maximum amount of data to put in a ProcessBundleRequest/ProcessBundleResponse?
Rui Wang
Nov 3, 2021
Approver
+1. With some detailed numbers/macro benchmarks, we could better understand the problem.
Yichi Zhang
Nov 4, 2021
Approver
That's a good question. I think the maximum amount of data can be selected conservatively, for example 1 MB (needs some further experimenting), or maybe made configurable. If the larger messages on the control plane are hitting gRPC flow control and blocking splitting and progress reports, then I think the job's bottleneck would already have existed on data channel transmission, given the smaller volume and size of the splitting and progress report RPCs.
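A tiny Python sketch of the configurable cap idea (the option name and the 1 MB default are illustrative assumptions, not part of the proposal):

    # Hypothetical knob for how many encoded bytes may be embedded in a single
    # ProcessBundleRequest/ProcessBundleResponse.
    DEFAULT_EMBEDDED_ELEMENTS_LIMIT_BYTES = 1 << 20  # 1 MB, chosen conservatively

    def embedded_elements_limit(pipeline_options):
        # "embedded_elements_limit" is an illustrative option name.
        return pipeline_options.get("embedded_elements_limit",
                                    DEFAULT_EMBEDDED_ELEMENTS_LIMIT_BYTES)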
Brian Hulette
Nov 3, 2021
Approver
Could this waiting negatively affect the current workflow?
Rui Wang
Nov 3, 2021
Approver
I can see one issue: in the past design the SDK prepares the UDF while the runner side prepares the data, which hides the latency of the SDK's UDF preparation.
The worst-case scenario is that the SDK side can no longer prepare the UDF early enough to hide that latency, while the runner side still needs to spend some amount of time getting the data in place (and then send both in one request).
Yichi Zhang
Nov 3, 2021
Approver
I think today runners already use size-based buffering on the data plane (without data buffering we would end up sending one gRPC message per element). So the wait already existed; the proposal merely shifts when the runner sends the ProcessBundleRequest. And yes, this may delay SDK UDF preparation, but normally that preparation is quite cheap because the cached bundle handler and transform runner are reused. Regardless, the wait time should be short.
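For context on why the SDK-side preparation is cheap, here is a rough Python sketch of the caching pattern described above (names are illustrative, not the actual SDK harness API):

    # Processors are cached per bundle descriptor and reused, so only the
    # first bundle for a given descriptor pays the setup cost.
    _processor_cache = {}  # descriptor_id -> list of idle bundle processors

    def get_bundle_processor(descriptor_id, create_fn):
        idle = _processor_cache.setdefault(descriptor_id, [])
        return idle.pop() if idle else create_fn(descriptor_id)

    def release_bundle_processor(descriptor_id, processor):
        processor.reset()  # hypothetical: clear per-bundle state before reuse
        _processor_cache.setdefault(descriptor_id, []).append(processor)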
Robert Burke
Nov 3, 2021
Approver
Do we have an idea of what that size limit should be? 4MB? 64MB?
Yichi Zhang
Nov 3, 2021
Approver
That's yet to be benchmarked; I've experimented conservatively with the current data plane flush limit of 1 MB.
reuven lax
Nov 3, 2021
Approver
Why is it all or nothing? If the bundle is too large, can we just split it and include a smaller bundle in the response?
Robert Burke
Nov 3, 2021
Approver
Would that add much value vs the "need to now merge these streams" complexity the runner would need to deal with? And the similar complexity for the SDK? Knowing it's all or nothing feels like it would be much easier to debug.
reuven lax
Nov 3, 2021
Approver
Yeah, but it could lead to falling-off-the-cliff syndrome: one extra byte pushes you onto the slower path. When possible, graceful degradation is preferable.
Luke Cwik
Nov 3, 2021
Approver
Larger bundles spend most of their time outside of passing data between two threads, which is the largest reason for this performance improvement.
That being said, it would be pretty easy to say that the ProcessBundleRequest contains the "prefix" of the inbound logical streams that would otherwise exist on the data channel.
Similarly, the ProcessBundleResponse would represent the "suffix" of the outbound logical streams that would exist on the data channel.
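To make the prefix/suffix idea concrete, a minimal Python sketch of how an SDK could reassemble one inbound logical stream (field and method names such as embedded_inputs and data_channel.read are assumptions, not the actual Fn API):

    def inbound_elements(process_bundle_request, data_channel, transform_id):
        # Prefix: elements embedded directly in the control-plane request.
        for chunk in process_bundle_request.embedded_inputs.get(transform_id, []):
            yield chunk
        # Remainder (possibly empty): the usual data-plane stream, which still
        # ends with its normal is_last marker.
        for chunk in data_channel.read(transform_id):
            yield chunk

The outbound direction would mirror this, with the ProcessBundleResponse carrying the suffix of each outbound stream.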
Robert Bradshaw
Nov 3, 2021
Approver
Note that the runner typically starts consuming data from the data plane before the response comes back (e.g. to handle large bundles). This doesn't significantly change the proposal though, assuming it can stop waiting right away given an appropriate ProcessBundleResponse.
Luke Cwik
Nov 3, 2021
Approver
Yes, the runner seems to be stuck with the additional coordination on the SDK outbound channel.
From a performance perspective, if you split these two changes, how much did the ProcessBundleRequest change improve performance versus the ProcessBundleResponse change?
Robert Burke
Nov 3, 2021
Approver
+1 to splitting. The runner can always know whether it needs to send data over the data channel and can avoid doing so when possible, but it must always be prepared for data from the SDK, which may not arrive. It's not clear to me there's as much benefit on the Response side as there is on the Request side.
Robert Bradshaw
Nov 3, 2021
Approver
I think it needs to be split (e.g. the thread about capabilities below), but there should be similar benefits on each side (2 RPCs to 1, and possibly less flushing of a multiplexed data channel).
Yichi Zhang
Nov 3, 2021
Approver
Yet I think it should be possible for the runner to wait on the ProcessBundleResponse and then just go ahead and mark the data channel as completed, avoiding the wait for the terminal last-data message. That way we can avoid the additional coordination on the SDK outbound channel. I haven't really tested the performance improvement on the output path, though.
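A minimal Python sketch of that runner-side completion handling (method and field names are hypothetical):

    def finish_bundle(instruction_id, control_client, outbound_streams):
        # Block until the SDK's ProcessBundleResponse for this bundle arrives.
        response = control_client.wait_for_response(instruction_id)
        # Treat any embedded elements as the tail of the SDK's outbound streams.
        for transform_id, chunks in response.embedded_outputs.items():
            outbound_streams.append(transform_id, chunks)
        # The response is only sent after the SDK has produced everything for
        # this bundle, so remaining streams can be marked complete without
        # waiting for per-stream terminal (is_last) data messages.
        outbound_streams.mark_complete(instruction_id)
        return response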
Robert Bradshaw
Nov 3, 2021
Approver
Do we also need a runner capability to allow the SDK to send data back to the runner in the process bundle response?
Luke Cwik
Nov 3, 2021
Approver
A runner capability seems necessary.
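A sketch of what that handshake could look like on the SDK side (the capability string and method names are placeholders, not registered Beam URNs):

    # The SDK only embeds outbound elements in the ProcessBundleResponse when
    # the runner has advertised that it can consume them.
    RUNNER_ACCEPTS_EMBEDDED_RESPONSE_DATA = "example:runner:embedded_elements_in_response:v1"

    def emit_outputs(runner_capabilities, outputs, response_builder, data_client):
        if RUNNER_ACCEPTS_EMBEDDED_RESPONSE_DATA in runner_capabilities:
            response_builder.set_embedded_outputs(outputs)  # hypothetical setter
        else:
            # Fall back to the data channel for runners without the capability.
            for transform_id, chunks in outputs.items():
                data_client.send(transform_id, chunks, is_last=True)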
Yichi Zhang
Nov 3, 2021
Approver
Good point.
Outline
[Proposal] Optional elements embedding in Fn API process bundle request and response
    Overview
    Current inefficiency in tiny and cheap bundles
    Proposal
    Compatibility