Doesn't BigQuery (and many other services) have a RESOURCE_URI which could be used as a label instead?
This would mean that we could have fewer MonitoringInfoSpecs and break them down into request counts and latency metrics.
Any runner that understands the URI could parse it into a proper form like BQ table/dataset/project. This would also extend to Azure/AWS/...
Also, read/write are vague, and many operations require a composite set of API calls to describe the entirety of an interaction, so should one of the labels be the API name? For example, a new BQ write connector is in the works where a write is a combination of a "create stream" call + 1 or more "append" calls + 1 or more "flush" calls. The "create stream" happens sometimes and is saved across bundles, the "append" happens once per element, and the "flush" happens at most once per bundle (it can be aggregated across bundles). All 3 of these interactions are considered a write, but the create/flush are amortized over many elements.
Alex Amato
May 6, 2020
Approver
As for using a single URI, please see the "Using a single Identifier for each GCP IO’s MonitoringInfo Labels" section.
As for calling multiple APIs, I was planning on focusing specifically on the calls used to do batch writes and batch reads, since throughput, errors over time, and latency are most meaningful for those.
I do see your point though, and this is a good question. Should we create separate metrics for those separate calls?
One approach I have seen is to have a method label on the metric. So if you had 3 calls (open, write, and flush), the method could be added as a label.
But that would mean a system building a UI would need to deal with 3 metrics. It may be better to just make sure the errors from the open and flush are exposed well through other means, such as visible error logs.
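To make the method-label idea concrete, here is a minimal sketch. The URN string, label names, and the incrementRequestCount helper are all made up for illustration; they are not an existing Beam API.

import java.util.HashMap;
import java.util.Map;

/** Illustrative only: one request-count metric, with the API call recorded as a METHOD label. */
public class MethodLabelSketch {

  // Stand-in for however the SDK harness would turn a (urn, labels) pair into a MonitoringInfo.
  static void incrementRequestCount(String urn, Map<String, String> labels) {
    System.out.println(urn + " " + labels + " += 1");
  }

  public static void main(String[] args) {
    Map<String, String> labels = new HashMap<>();
    labels.put("SERVICE", "BigQuery");
    labels.put("METHOD", "AppendRows"); // could equally be "CreateWriteStream" or "FlushRows"
    labels.put("STATUS", "ok");
    incrementRequestCount("beam:metric:io:api_request_count:v1", labels);
  }
}

A UI that only cares about overall write traffic can sum across METHOD values, while one that cares about open/flush failures can filter on them.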
Alex Amato
May 8, 2020
Approver
Had a thought for a case like that: maybe we should collect request counts for those separate-but-related API requests for reads/writes, but only collect latency for the batchRead/batchWrite API calls?
I think every service call should have a distinct count + latency metric
Daniel Mills
May 9, 2020
Approver
Adding Method as a label on both count and latency seems like a reasonable approach for these sorts of IOs
Alex Amato
May 11, 2020
Approver
As for the question about using GCP Resource Names: I took a look into it and don't think it's a good idea. Please see the reasoning in the "Using the GCP Resource Name..." section.
I've added the method label for each of the APIs.
I also added a BIGQUERY_QUERY_NAME label and suggested an update to BigQueryIO.read() to allow specifying the query label.
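For illustration, naming a query from the pipeline author's side might look something like the sketch below. withQueryName(...) is a hypothetical method standing in for whatever the BigQueryIO.read() change ends up being, and the query/table names are placeholders.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.PCollection;

public class QueryNameSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();
    // withQueryName(...) does not exist today; it only illustrates how a
    // BIGQUERY_QUERY_NAME label value could be supplied by the user.
    PCollection<TableRow> rows =
        pipeline.apply(
            BigQueryIO.readTableRows()
                .fromQuery("SELECT id, score FROM my_dataset.my_table")
                .withQueryName("daily_score_export"));
    pipeline.run();
  }
}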
Alex Amato
May 13, 2020
Approver
PTAL at the current proposal. I am suggesting we use a single resource label, built either by concatenating the others or by using whatever standard exists today as params to the API.
Luke Cwik
May 29, 2020
Approver
Are you saying that we should use strings like OK/INVALID_ARGUMENT/FAILED_PRECONDITION or the HTTP equivalent 200/300/400/403/...?
A lot more APIs map to the HTTP status codes than to the GCP status codes.
Alex Amato
Aug 8, 2020
Approver
I am saying to use the human-readable string names, so that these can be shown to a user. Plus, gRPC-based APIs don't offer the HTTP code, so we should avoid that.
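For reference, the canonical human-readable name is easy to get from the standard io.grpc types; a small sketch (how and where the value gets attached as a status label is left out):

import io.grpc.Status;
import io.grpc.StatusRuntimeException;

public class CanonicalStatusSketch {
  /** Returns e.g. "INVALID_ARGUMENT" or "FAILED_PRECONDITION"; successful calls would just record "OK". */
  static String canonicalStatus(StatusRuntimeException e) {
    Status.Code code = e.getStatus().getCode();
    return code.name();
  }
}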
Luke Cwik
May 29, 2020
Approver
Why IO and not just API_REQUEST_COUNT?
Ditto for the URN and the request latencies?
Alex Amato
Aug 11, 2020
Approver
API_REQUEST_COUNT is fine, DONE
Robert Bradshaw
May 6, 2020
Approver
Little here seems specific to BigQuery, or even GCP. Instead, a generic RPC_REQUEST_COUNT and RPC_REQUEST_LATENCY type could be defined, and we could have a label to denote that this relates to BigQuery. (Otherwise I see a huge amount of copy-paste as one tries to emulate this pattern for other sources, sinks, and services.)
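A rough sketch of what that could look like as data, with the per-service details pushed into label values rather than into separate MonitoringInfoSpecs. The URN and label names here are examples, not a finalized spec.

import com.google.common.collect.ImmutableMap;
import java.util.Map;

/** Illustrative only: one generic request-count metric, distinguished purely by labels. */
public class GenericRequestCountSketch {
  static final String URN = "beam:metric:io:api_request_count:v1"; // example URN

  static final Map<String, String> BIGQUERY_WRITE =
      ImmutableMap.of(
          "SERVICE", "BigQuery",
          "METHOD", "jobs.insert",
          "RESOURCE", "my-project/my_dataset/my_table",
          "STATUS", "OK");

  static final Map<String, String> S3_WRITE =
      ImmutableMap.of(
          "SERVICE", "S3",
          "METHOD", "PutObject",
          "RESOURCE", "s3://my-bucket/my-key",
          "STATUS", "OK");
}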
Luke Cwik
May 6, 2020
Approver
I made the same comment just above: we should be using a resource URI + method name as labels. Most services have well-defined resource URIs for things like S3 paths/BigQuery tables/...
Alex Amato
May 6, 2020
Approver
I added a section about this to the Alternatives Considered. PTAL and let me know what you think. There was some thought around this in the past.
Robert Bradshaw
May 7, 2020
Approver
Thanks for adding the section. Added some comments there. I still prefer consolidation.
Alex Amato
May 11, 2020
Approver
I don't really recommend it, as we are struggling to find a common single-identifier convention to use for all of these services. I looked into a couple of approaches.
Most of these services have something we could use, but in some cases we would be making one up. And what they are doing now may be replaced in the future by GCP Resource Names, which are not yet widely adopted in GCP.
Additionally, Google Monitoring will require parsed names to match other GCP metric formats, so RHs will need to parse these out anyway in order to report them to metric collection services such as Google Monitoring and get the proper metric breakdowns.
I did add a GENERAL section, where you can use a single resource label, which would be good for a custom/non-GCP IO, such as KafkaIO, to record metrics on these RPCs.
Robert Bradshaw
May 12, 2020
Approver
Adding a GENERAL that is a sibling to BIGQUERY, DATASTORE, ... but represents many possible sources as distinct as the GCP ones doesn't feel like the right design. Kafka should be a sibling of PubSub, not a "nephew."
In general, I think the most flexible is a system that allows one to forward, query, filter, and aggregate metrics as generically as possible, while a system that knows/cares about the particular differences between the various IO metrics can interact with the various subtypes individually. (One way to phrase this is that there's a generic interface here, and various concrete subtypes.) I do not think it makes sense to try to parse a "resource identifier" to determine, e.g., if one is looking at a Bigtable or Datastore metric (e.g. the "type" of IO in question could be a label). But it would be useful to be able to query "what resources are the slowest/returning the most errors/etc" without writing a query (let's imagine one can write SQL against the set of counters) that is a bunch of "union" statements of nearly, but not quite, identical selects, and needs to get modified every time a new source should be considered.
Maybe the set of precise labels could vary from source to source, but the common labels would remain the same.
PTAL and LMK what you think :). I am proposing always using a single resource label, plus optional additional ones, so an RH could choose whether it just wants to deal with the single label or use the parsed labels.
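A small sketch of what that shape could look like for a single BigQuery write metric; the label names and the resource string format are illustrative, not finalized.

import com.google.common.collect.ImmutableMap;
import java.util.Map;

/** Illustrative: RESOURCE is always present; the parsed-out labels are optional extras. */
public class ResourceLabelSketch {
  static final Map<String, String> BIGQUERY_WRITE =
      ImmutableMap.<String, String>builder()
          // Always present: one fully qualified resource identifier.
          .put(
              "RESOURCE",
              "//bigquery.googleapis.com/projects/my-project/datasets/my_dataset/tables/my_table")
          // Optional: the same information pre-parsed, for consumers that don't want to parse strings.
          .put("BIGQUERY_PROJECT_ID", "my-project")
          .put("BIGQUERY_DATASET", "my_dataset")
          .put("BIGQUERY_TABLE", "my_table")
          .build();
}

A consumer that only wants generic queries ("which resources are slowest?") reads RESOURCE; one that wants GCP-specific breakdowns reads the parsed labels.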
Udi Meiri
May 13, 2020
Approver
Is a process metric also a non-bundle metric?
Alex Amato
May 14, 2020
Approver
Yes, a process metric is the same thing as a non-bundle metric. Any preference on the terminology here :)? I can update the doc to try and use the same term everywhere.
Luke Cwik
May 29, 2020
Approver
I understand the reasoning and it would make sense in my opinion to only add the necessary concepts that the resource namespace doesn't support (e.g. SPANNER_TABLE) and the resource identifier would still be the spanner database.
Having each SDK own the set of labels to integrate with a specific monitoring system could lead to a huge proliferation of labels, and if the monitoring system changes, we won't be updating old SDKs and will still have to build a translation system into a runner to update old labels to new labels. Starting off with a resource id -> specific monitoring system translation would make more sense in my opinion.
Alex Amato
Aug 13, 2020
Approver
The SDKs should not be picking any label format; I don't mean to imply this. I only wish to prevent RunnerHarnesses from writing string parsing code. So I am requesting we use the RESOURCE as a fully qualified name, plus additional labels which have the same information parsed out.
LMK if you think this is okay. If the redundancy is an issue, then I think we can remove all the additional labels and only use a RESOURCE label.
Yes, this is intentionally redundant, but having to write this string parsing can be burdensome and was an issue in the past before Dataflow built a structured metric format, so we have avoided that approach. In the past metrics were named with very large strings that always needed parsing to make sense of their parts.
Luke Cwik
Aug 13, 2020
Approver
String parsing is an issue if it happens often enough to cause a perf issue, or if it is not well defined, leading to parsing complexity. These resource strings in the current format should be parseable with any regex library that can pull out groups.
Also, duplicating the information leads to different perf problems (additional storage and more bytes being passed around).
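For what it's worth, a sketch of that kind of parsing with the standard regex library; the resource string format is the illustrative one used elsewhere in this discussion.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResourceParseSketch {
  // Pulls project/dataset/table out of a fully qualified BigQuery table resource string.
  private static final Pattern BQ_TABLE =
      Pattern.compile("projects/([^/]+)/datasets/([^/]+)/tables/([^/]+)");

  public static void main(String[] args) {
    String resource =
        "//bigquery.googleapis.com/projects/my-project/datasets/my_dataset/tables/my_table";
    Matcher m = BQ_TABLE.matcher(resource);
    if (m.find()) {
      String project = m.group(1); // "my-project"
      String dataset = m.group(2); // "my_dataset"
      String table = m.group(3); // "my_table"
      System.out.println(project + " / " + dataset + " / " + table);
    }
  }
}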
Luke Cwik
May 29, 2020
Approver
Any reason to define our own way for the resource identifier instead of using the names that GCP uses for these?
Now that we have a bit more clarification, I've updated this section to include the recommended formats where available; for the remainder I invented some in a similar format. I think we can use those.
It would be good to use a made-up namespace to prevent any future conflicts if an official one is introduced in the future.
Alex Amato
May 4, 2020
Approver
If we want to discourage use of this API, we could relocate it, e.g. MetricsInternal.counter(..).
Open to suggestions here. Is there some way to make it available only to IO libraries, rather than to general user code?
Daniel Mills
May 5, 2020
Approver
Possible alternative: we could add explicit methods for the new metric types like GcpMetrics.BigQueryRequestCounts(String status, String table, String dataset, String project)
Luke Cwik
May 6, 2020
Approver
I think creating a "service call" object that supports metrics makes sense, and you can have it be a try-with-resources kind of thing: try (ServiceCall call = serviceMetric.newCall()) { status = MakeAnRpc(); call.withStatus(status); }
This would wrap the count/timing metrics into one.
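A rough sketch of such a wrapper, assuming it owns both the timer and the status; the class and method names are made up, and the actual metric updates are stubbed out with prints.

import java.util.concurrent.TimeUnit;

/** Illustrative sketch of the try-with-resources idea; not an existing Beam class. */
public class ServiceMetricSketch {

  static class ServiceCall implements AutoCloseable {
    private final long startNanos = System.nanoTime();
    private String status = "UNKNOWN";

    void withStatus(String status) {
      this.status = status;
    }

    @Override
    public void close() {
      long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
      // Stand-ins for incrementing the request-count metric and recording latency,
      // both labeled with the final status.
      System.out.println("api_request_count{status=" + status + "} += 1");
      System.out.println("api_request_latency_ms{status=" + status + "} record " + elapsedMillis);
    }
  }

  static ServiceCall newCall() {
    return new ServiceCall();
  }

  public static void main(String[] args) {
    try (ServiceCall call = newCall()) {
      // status = makeAnRpc();
      call.withStatus("OK");
    }
  }
}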
Daniel Mills
May 6, 2020
Approver
I like this, although it may need to support direct access too for IOs that use async RPCs.
Luke Cwik
May 7, 2020
Approver
Yeah, async will have to follow another pattern.
Alex Amato
May 11, 2020
Approver
Incorporated this into the example. Still an open question as to what might happen inside ServiceCall, which may need to take a counter, and create other counters from it (by only changing the status label).
Luke Cwik
Sep 14, 2020
Approver
Should the name somehow include what the metrics are scoped to (like ProcessBundleProgressRequest does)? (e.g. ProcessMonitoringInfosRequest/HarnessMonitoringInfosRequest/?)
Alex Amato
Sep 14, 2020
Approver
HarnessMonitoringInfosRequest sounds good to me
Luke Cwik
May 6, 2020
Approver
It is how it is built today but does not necessarily need to be this way.
We could pull metrics periodically from the SDK and it would look like:

MonitoringInfosRequest {}

MonitoringInfosResponse {
  ActiveBundle {
    string id = 1;
    map<string, bytes> monitoring_data = 2;
  }

  // Metrics associated with a specific bundle.
  ActiveBundle active_bundles = 1;

  // Metrics associated with the process.
  map<string, bytes> monitoring_data = 2;
}

Attempted metrics would come from the above, while committed metrics would come from the ProcessBundleResponse.
Alex Amato
May 9, 2020
Approver
I see, you are suggesting that the runner harness would poll these periodically. (I was suggesting adding code that would live in the SDKs to directly export to these services.)
Makes sense though; does this offer much more than what we get out of the ProcessBundleProgressResponses today?
I guess this would make the SDK harness do all of the pre-aggregation before shipping to the metric collection service, instead of the runner harness doing the aggregation. Do I understand correctly?
Luke Cwik
May 11, 2020
Approver
Two benefits I see are:
* Allows for metrics that are not related to a bundle (CPU, memory, ...)
* Gets rid of a nuisance where we request progress for a bundle after it has completed, due to timing
MetricsPusher was going down the route of having the runner do the aggregation and send the metrics directly from the runner. Another approach was to have the program submitting the job use the job API to get the post-aggregated metrics and send them to a metrics exporter.
Do you believe that the metrics exporter would be able to handle the load from a large Dataflow job if each worker was sending metrics?
Alex Amato
May 11, 2020
Approver
I incorporated this idea into the proposal, suggesting we have non-bundle metrics, which we will use for these GCP IOs.
David Yan
May 14, 2020
Approver
I think for a Pub/Sub read, a subscription is created automatically if just the topic is given, so it's possible to use just the subscription. However, in that case the subscription is not likely meaningful to the user.
Alex Amato
May 15, 2020
Approver
Correct. The auto-generated subscription is not the most meaningful, though it will contain a substring with the topic id.
It's important to collect both, as they can be quite different when specified, even across different GCP projects.
Document outline: Apache Beam FN API: GCP IO Debuggability Metrics
1 Goals
1.1 Non-goals
2 Background
2.1 Limitations in User Metrics API
2.2 Internal SDK Harness Metric APIs
3 Proposal: Introduce MonitoringInfoSpecs
4 Proposal: BigQueryIO allows you to name a query
5 Proposal: Metric Reporting API Changes
6 Proposal: SDKHs to Report non-bundle metrics.
7 Alternatives Considered
7.1 Using a sampling technique to record RPC times
7.2 Export metrics from SDK directly (OpenCensus)
7.3 Using a single Identifier for each GCP IO’s MonitoringInfo Labels
7.4 Using the GCP Resource Name format for an identifier
8 Future Support
8.1 Report active bundle metrics for failed bundles on ProcessBundleResponse
8.2 Report MonitoringInfos for Failed Bundles on ProcessBundleResponse
8.3 All Bundle Metrics are also available as non-bundle/process metrics.
9 Open Questions
9.1 Use best effort reporting semantics for failed bundles?