Doesn't BigQuery (and many other services) have a RESOURCE_URI which could be used as a label instead?
This would mean that we could have fewer MonitoringInfoSpecs and break them down into request counts and latency metrics.
Any runner that understands the URI could parse it into a proper form like BQ table/dataset/project. This would also extend to Azure/AWS/...
Also, read/write are vague, and many operations require a composite set of API calls to describe the entirety of an interaction, so should one of the labels be the API name? For example, a new BQ write connector is in the works where a write is a combination of a "create stream" call + 1 or more "append" calls + 1 or more "flush" calls. The "create stream" happens sometimes and is saved across bundles, the "append" happens once per element, and the "flush" happens at most once per bundle (it can be aggregated across bundles). All 3 of these interactions are considered a write, but the create/flush are amortized over many elements.
Alex Amato
May 6, 2020
Approver
As for using a single URI, please see the "Using a single Identifier for each GCP IO’s MonitoringInfo Labels" section.
As for calling multiple APIs, I was planning on focusing specifically on the calls used to do batch writes and batch reads, since throughput, errors over time, and latency are most meaningful for those.
I do see your point though, and this is a good question. Should we create separate metrics for those separate calls?
One approach I have seen is to have a method label on the metric. So if you had 3 calls (open, write, and flush), the method could be added as a label.
But that would mean a system building a UI would need to deal with 3 metrics. It may be better to just make sure the errors from the open and flush are exposed well through other means, such as visible error logs.
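To make the method-label idea concrete, here is a minimal sketch. The URN string, label names, and the incrementRequestCount helper are all made up for illustration; they are not an existing Beam API.

import java.util.HashMap;
import java.util.Map;

/** Illustrative only: one request-count metric, with the API call recorded as a METHOD label. */
public class MethodLabelSketch {

  // Stand-in for however the SDK harness would turn a (urn, labels) pair into a MonitoringInfo.
  static void incrementRequestCount(String urn, Map<String, String> labels) {
    System.out.println(urn + " " + labels + " += 1");
  }

  public static void main(String[] args) {
    Map<String, String> labels = new HashMap<>();
    labels.put("SERVICE", "BigQuery");
    labels.put("METHOD", "AppendRows"); // could equally be "CreateWriteStream" or "FlushRows"
    labels.put("STATUS", "ok");
    incrementRequestCount("beam:metric:io:api_request_count:v1", labels);
  }
}

A UI that only cares about overall write traffic can sum across METHOD values, while one that cares about open/flush failures can filter on them.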
Alex Amato
May 8, 2020
Approver
Had a thought for a case like that: maybe we should collect request counts for those separate-but-related API requests for reads/writes, but only collect latency for the batchRead/batchWrite API calls?
I think every service call should have a distinct count + latency metric
Daniel Mills
May 9, 2020
Approver
Adding Method as a label on both count and latency seems like a reasonable approach for these sorts of IOs
Alex Amato
May 11, 2020
Approver
As for the question about using GCP Resource Names: I took a look into it and don't think it's a good idea. Please see the reasoning in the "Using the GCP Resource Name..." section.
I've added the method label for each of the APIs.
I also added a BIGQUERY_QUERY_NAME label and suggested an update to BigQueryIO.read() to allow specifying the query label.
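For illustration, naming a query from the pipeline author's side might look something like the sketch below. withQueryName(...) is a hypothetical method standing in for whatever the BigQueryIO.read() change ends up being, and the query/table names are placeholders.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.PCollection;

public class QueryNameSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();
    // withQueryName(...) does not exist today; it only illustrates how a
    // BIGQUERY_QUERY_NAME label value could be supplied by the user.
    PCollection<TableRow> rows =
        pipeline.apply(
            BigQueryIO.readTableRows()
                .fromQuery("SELECT id, score FROM my_dataset.my_table")
                .withQueryName("daily_score_export"));
    pipeline.run();
  }
}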
Alex Amato
May 13, 2020
Approver
PTAL at the current proposal. I am suggesting we use a single resource label, built either by concatenating the others or by using whatever standard exists today as params to the API.
Luke Cwik
May 29, 2020
Approver
Are you saying that we should use strings like OK/INVALID_ARGUMENT/FAILED_PRECONDITION or the HTTP equivalent 200/300/400/403/...?
A lot more APIs map to the HTTP status codes than to the GCP status codes.
Alex Amato
Aug 8, 2020
Approver
I am saying to use the human-readable string names, so that these can be shown to a user. Plus, gRPC-based APIs don't offer the HTTP code, so we should avoid that.
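For reference, the canonical human-readable name is easy to get from the standard io.grpc types; a small sketch (how and where the value gets attached as a status label is left out):

import io.grpc.Status;
import io.grpc.StatusRuntimeException;

public class CanonicalStatusSketch {
  /** Returns e.g. "INVALID_ARGUMENT" or "FAILED_PRECONDITION"; successful calls would just record "OK". */
  static String canonicalStatus(StatusRuntimeException e) {
    Status.Code code = e.getStatus().getCode();
    return code.name();
  }
}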
Luke Cwik
May 29, 2020
Approver
Why IO and not just API_REQUEST_COUNT?
Ditto for the URN and the request latencies?
Alex Amato
Aug 11, 2020
Approver
API_REQUEST_COUNT is fine, DONE
Robert Bradshaw
May 6, 2020
Approver
Little here seems specific to BigQuery, or even GCP. Instead, a generic RPC_REQUEST_COUNT and RPC_REQUEST_LATENCY type could be defined, and we could have a label to denote that this relates to BigQuery. (Otherwise I see a huge amount of copy-paste as one tries to emulate this pattern for other sources, sinks, and services.)
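A rough sketch of what that could look like as data, with the per-service details pushed into label values rather than into separate MonitoringInfoSpecs. The URN and label names here are examples, not a finalized spec.

import com.google.common.collect.ImmutableMap;
import java.util.Map;

/** Illustrative only: one generic request-count metric, distinguished purely by labels. */
public class GenericRequestCountSketch {
  static final String URN = "beam:metric:io:api_request_count:v1"; // example URN

  static final Map<String, String> BIGQUERY_WRITE =
      ImmutableMap.of(
          "SERVICE", "BigQuery",
          "METHOD", "jobs.insert",
          "RESOURCE", "my-project/my_dataset/my_table",
          "STATUS", "OK");

  static final Map<String, String> S3_WRITE =
      ImmutableMap.of(
          "SERVICE", "S3",
          "METHOD", "PutObject",
          "RESOURCE", "s3://my-bucket/my-key",
          "STATUS", "OK");
}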
Luke Cwik
May 6, 2020
Approver
I made the same comment just above: we should be using a resource URI + method name as labels. Most services have well-defined resource URIs for things like S3 paths/BigQuery tables/...
Alex Amato
May 6, 2020
Approver
I added a section about this to the Alternatives Considered. PTAL and let me know what you think. There was some thought around this in the past.
Robert Bradshaw
May 7, 2020
Approver
Thanks for adding the section. Added some comments there. I still prefer consolidation.
Alex Amato
May 11, 2020
Approver
I don't really recommend it, as we are struggling to find a common single-identifier convention to use for all of these services. I looked into a couple of approaches.
Most of these services have something we could use, but in some cases we would be making one up. And what they are doing now may be replaced in the future by GCP Resource Names, which are not yet widely adopted in GCP.
Additionally, Google Monitoring will require parsed names to match other GCP metric formats, so RHs will need to parse these out anyway in order to report them to metric collection services such as Google Monitoring and get the proper metric breakdowns.
I did add a GENERAL section, where you can use a single resource label, which would be good for a custom/non-GCP IO, such as KafkaIO, to record metrics on these RPCs.
Robert Bradshaw
May 12, 2020
Approver
Adding a GENERAL that is a sibling to BIGQUERY, DATASTORE, ... but represents many possible sources as distinct as the GCP ones doesn't feel like the right design. Kafka should be a sibling of PubSub, not a "nephew."
In general, I think the most flexible is a system that allows one to forward, query, filter, and aggregate metrics as generically as possible, while a system that knows/cares about the particular differences between the various IO metrics can interact with the various subtypes individually. (One way to phrase this is that there's a generic interface here, and various concrete subtypes.) I do not think it makes sense to try to parse a "resource identifier" to determine, e.g., if one is looking at a Bigtable or Datastore metric (e.g. the "type" of IO in question could be a label). But it would be useful to be able to query "what resources are the slowest/returning the most errors/etc" without writing a query (let's imagine one can write SQL against the set of counters) that is a bunch of "union" statements of nearly, but not quite, identical selects, and needs to get modified every time a new source should be considered.
Maybe the set of precise labels could vary from source to source, but the common labels would remain the same.
PTAL and LMK what you think :). I am proposing always using a single resource label, plus optional additional ones, so an RH could choose whether it just wants to deal with the single label or use the parsed labels.
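A small sketch of what that shape could look like for a single BigQuery write metric; the label names and the resource string format are illustrative, not finalized.

import com.google.common.collect.ImmutableMap;
import java.util.Map;

/** Illustrative: RESOURCE is always present; the parsed-out labels are optional extras. */
public class ResourceLabelSketch {
  static final Map<String, String> BIGQUERY_WRITE =
      ImmutableMap.<String, String>builder()
          // Always present: one fully qualified resource identifier.
          .put(
              "RESOURCE",
              "//bigquery.googleapis.com/projects/my-project/datasets/my_dataset/tables/my_table")
          // Optional: the same information pre-parsed, for consumers that don't want to parse strings.
          .put("BIGQUERY_PROJECT_ID", "my-project")
          .put("BIGQUERY_DATASET", "my_dataset")
          .put("BIGQUERY_TABLE", "my_table")
          .build();
}

A consumer that only wants generic queries ("which resources are slowest?") reads RESOURCE; one that wants GCP-specific breakdowns reads the parsed labels.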
Udi Meiri
May 13, 2020
Approver
Is a process metric also a non-bundle metric?
Alex Amato
May 14, 2020
Approver
Yes, a process metric is the same thing as a non-bundle metric. Any preference on the terminology here :)? I can update the doc to try and use the same term everywhere.
Luke Cwik
May 29, 2020
Approver
I understand the reasoning and it would make sense in my opinion to only add the necessary concepts that the resource namespace doesn't support (e.g. SPANNER_TABLE) and the resource identifier would still be the spanner database.
Having each SDK own the set of labels to integrate with a specific monitoring system could lead to a huge proliferation of labels, and if the monitoring system changes, we won't be updating old SDKs and will still have to build a translation system into a runner to update old labels to new labels. Starting off with a resource id -> specific monitoring system translation would make more sense in my opinion.
Alex Amato
Aug 13, 2020
Approver
The SDKs should not be picking any label format; I don't mean to imply this. I only wish to prevent RunnerHarnesses from writing string parsing code. So I am requesting we use the RESOURCE as a fully qualified name, plus additional labels which have the same information parsed out.
LMK if you think this is okay. If the redundancy is an issue, then I think we can remove all the additional labels and only use a RESOURCE label.
Yes, this is intentionally redundant, but having to write this string parsing can be burdensome and was an issue in the past before Dataflow built a structured metric format, so we have avoided that approach. In the past metrics were named with very large strings that always needed parsing to make sense of their parts.
Luke Cwik
Aug 13, 2020
Approver
String parsing is an issue if it happens often enough to cause a perf issue, or if it is not well defined, leading to parsing complexity. These resource strings in the current format should be parseable with any regex library that can pull out groups.
Also, duplicating the information leads to different perf problems (additional storage and more bytes being passed around).
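For what it's worth, a sketch of that kind of parsing with the standard regex library; the resource string format is the illustrative one used elsewhere in this discussion.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResourceParseSketch {
  // Pulls project/dataset/table out of a fully qualified BigQuery table resource string.
  private static final Pattern BQ_TABLE =
      Pattern.compile("projects/([^/]+)/datasets/([^/]+)/tables/([^/]+)");

  public static void main(String[] args) {
    String resource =
        "//bigquery.googleapis.com/projects/my-project/datasets/my_dataset/tables/my_table";
    Matcher m = BQ_TABLE.matcher(resource);
    if (m.find()) {
      String project = m.group(1); // "my-project"
      String dataset = m.group(2); // "my_dataset"
      String table = m.group(3); // "my_table"
      System.out.println(project + " / " + dataset + " / " + table);
    }
  }
}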
Luke Cwik
May 29, 2020
Approver
Any reason to define our own way for the resource identifier instead of using the names that GCP uses for these?
Now that we have a bit more clarification, I've updated this section to include the recommended formats where available; for the remainder I invented some in a similar format. I think we can use those.
It would be good to use a made-up namespace to prevent any future conflicts if an official one is introduced in the future.
Alex Amato
May 4, 2020
Approver
If we want to discourage use of this API, we could relocate it, e.g. MetricsInternal.counter(..).
Open to suggestions here. Is there some way to make it available only to IO libraries, rather than to general user code?
Daniel Mills
May 5, 2020
Approver
Possible alternative: we could add explicit methods for the new metric types like GcpMetrics.BigQueryRequestCounts(String status, String table, String dataset, String project)
Luke Cwik
May 6, 2020
Approver
I think creating a "service call" object that supports metrics makes sense, and you can have it be a try-with-resources kind of thing: try (ServiceCall call = serviceMetric.newCall()) { status = MakeAnRpc(); call.withStatus(status); }
This would wrap the count/timing metrics into one.
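A rough sketch of such a wrapper, assuming it owns both the timer and the status; the class and method names are made up, and the actual metric updates are stubbed out with prints.

import java.util.concurrent.TimeUnit;

/** Illustrative sketch of the try-with-resources idea; not an existing Beam class. */
public class ServiceMetricSketch {

  static class ServiceCall implements AutoCloseable {
    private final long startNanos = System.nanoTime();
    private String status = "UNKNOWN";

    void withStatus(String status) {
      this.status = status;
    }

    @Override
    public void close() {
      long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
      // Stand-ins for incrementing the request-count metric and recording latency,
      // both labeled with the final status.
      System.out.println("api_request_count{status=" + status + "} += 1");
      System.out.println("api_request_latency_ms{status=" + status + "} record " + elapsedMillis);
    }
  }

  static ServiceCall newCall() {
    return new ServiceCall();
  }

  public static void main(String[] args) {
    try (ServiceCall call = newCall()) {
      // status = makeAnRpc();
      call.withStatus("OK");
    }
  }
}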
Daniel Mills
May 6, 2020
Approver
I like this, although it may need to support direct access too for IOs that use async RPCs.
Luke Cwik
May 7, 2020
Approver
Yeah, async will have to follow another pattern.
Alex Amato
May 11, 2020
Approver
Incorporated this into the example. Still an open question as to what might happen inside ServiceCall, which may need to take a counter, and create other counters from it (by only changing the status label).
Luke Cwik
Sep 14, 2020
Approver
Should the name somehow include what the metrics are scoped to (like ProcessBundleProgressRequest does)? (e.g. ProcessMonitoringInfosRequest/HarnessMonitoringInfosRequest/?)
Alex Amato
Sep 14, 2020
Approver
HarnessMonitoringInfosRequest sounds good to me
Luke Cwik
May 6, 2020
Approver
It is how it is built today but does not necessarily need to be this way.
We could pull metrics periodically from the SDK and it would look like:

MonitoringInfosRequest {}

MonitoringInfosResponse {
  ActiveBundle {
    string id = 1;
    map<string, bytes> monitoring_data = 2;
  }

  // Metrics associated with a specific bundle.
  ActiveBundle active_bundles = 1;

  // Metrics associated with the process.
  map<string, bytes> monitoring_data = 2;
}

Attempted metrics would come from the above, while committed metrics would come from the ProcessBundleResponse.
Alex Amato
May 9, 2020
Approver
I see, you are suggesting that the runner harness would poll these periodically. (I was suggesting adding code that would live in the SDKs to directly export to these services.)
Makes sense though; does this offer much more than what we get out of the ProcessBundleProgressResponses today?
I guess this would make the SDK harness do all of the pre-aggregation before shipping to the metric collection service, instead of the runner harness doing the aggregation. Do I understand correctly?
Luke Cwik
May 11, 2020
Approver
Two benefits I see are:
* Allows for metrics that are not related to a bundle (CPU, memory, ...)
* Gets rid of a nuisance where we request progress for a bundle after it has completed, due to timing
MetricsPusher was going down the route of having the runner do the aggregation and send the metrics directly from the runner. Another approach was to have the program submitting the job use the job API to get the post-aggregated metrics and send them to a metrics exporter.
Do you believe that the metrics exporter would be able to handle the load from a large Dataflow job if each worker was sending metrics?
Alex Amato
May 11, 2020
Approver
I incorporated this idea into the proposal, suggesting we have non-bundle metrics, which we will use for these GCP IOs.
David Yan
May 14, 2020
Approver
I think for a Pub/Sub read, a subscription is created automatically if just the topic is given, so it's possible to use just the subscription. However, in that case the subscription is not likely meaningful to the user.
Alex Amato
May 15, 2020
Approver
Correct. The auto-generated subscription is not the most meaningful, though it will contain a substring with the topic id.
It's important to collect both, as they can be quite different when specified, even across different GCP projects.
Document outline: Apache Beam FN API: GCP IO Debuggability Metrics
1 Goals
1.1 Non-goals
2 Background
2.1 Limitations in User Metrics API
2.2 Internal SDK Harness Metric APIs
3 Proposal: Introduce MonitoringInfoSpecs
4 Proposal: BigQueryIO allows you to name a query
5 Proposal: Metric Reporting API Changes
6 Proposal: SDKHs to Report non-bundle metrics.
7 Alternatives Considered
7.1 Using a sampling technique to record RPC times
7.2 Export metrics from SDK directly (OpenCensus)
7.3 Using a single Identifier for each GCP IO’s MonitoringInfo Labels
7.4 Using the GCP Resource Name format for an identifier
8 Future Support
8.1 Report active bundle metrics for failed bundles on ProcessBundleResponse
8.2 Report MonitoringInfos for Failed Bundles on ProcessBundleResponse
8.3 All Bundle Metrics are also available as non-bundle/process metrics.
9 Open Questions
9.1 Use best effort reporting semantics for failed bundles?