-
Notifications
You must be signed in to change notification settings - Fork 830
Add GC duration histogram #1854
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: gniadeck <77535280+gniadeck@users.noreply.github.com>
Signed-off-by: gniadeck <77535280+gniadeck@users.noreply.github.com>
Signed-off-by: gniadeck <77535280+gniadeck@users.noreply.github.com>
Signed-off-by: gniadeck <77535280+gniadeck@users.noreply.github.com>
Signed-off-by: gniadeck <77535280+gniadeck@users.noreply.github.com>
Signed-off-by: gniadeck <77535280+gniadeck@users.noreply.github.com>
Signed-off-by: gniadeck <77535280+gniadeck@users.noreply.github.com>
Signed-off-by: gniadeck <77535280+gniadeck@users.noreply.github.com>
…OptIn metric property Signed-off-by: gniadeck <77535280+gniadeck@users.noreply.github.com>
Signed-off-by: Gregor Zeitlinger <gregor.zeitlinger@grafana.com>
…o list Replace the boolean property with a comma-separated list of OTel metric names, giving users fine-grained control over which metrics use OpenTelemetry Semantic Conventions. Use "*" to enable all metrics. Signed-off-by: Gregor Zeitlinger <gregor.zeitlinger@grafana.com>
| public class JvmGarbageCollectorMetrics { | ||
|
|
||
| private static final String JVM_GC_COLLECTION_SECONDS = "jvm_gc_collection_seconds"; | ||
| private static final String JVM_GC_DURATION = "jvm.gc.duration"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we be consistent with the way the other metric is defined?
| private static final String JVM_GC_DURATION = "jvm.gc.duration"; | |
| private static final String JVM_GC_DURATION = "jvm_gc_duration"; |
| if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION.equals( | ||
| notification.getType())) { | ||
| return; | ||
| } | ||
|
|
||
| GarbageCollectionNotificationInfo info = | ||
| GarbageCollectionNotificationInfo.from( | ||
| (CompositeData) notification.getUserData()); | ||
|
|
||
| observe(gcDurationHistogram, info); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try/catch here to avoid crashing the JVM is the notification listener throws something
| if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION.equals( | |
| notification.getType())) { | |
| return; | |
| } | |
| GarbageCollectionNotificationInfo info = | |
| GarbageCollectionNotificationInfo.from( | |
| (CompositeData) notification.getUserData()); | |
| observe(gcDurationHistogram, info); | |
| try { | |
| if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION.equals( | |
| notification.getType())) { | |
| return; | |
| } | |
| GarbageCollectionNotificationInfo info = | |
| GarbageCollectionNotificationInfo.from( | |
| (CompositeData) notification.getUserData()); | |
| observe(gcDurationHistogram, info); | |
| } catch (Exception e) { | |
| logger.warning( | |
| "Exception while processing garbage collection notification: " + e.getMessage()); | |
| } |
| .name(JVM_GC_DURATION) | ||
| .unit(Unit.SECONDS) | ||
| .help("Duration of JVM garbage collection actions.") | ||
| .labelNames("jvm.gc.action", "jvm.gc.name", "jvm.gc.cause") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
jvm.gc.cause is still experimental, should we hold off on that one until it stabilizes?
| registerNotificationListener(gcDurationHistogram); | ||
| } | ||
|
|
||
| private void registerNotificationListener(Histogram gcDurationHistogram) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
taking a look at the OTel implementation, they also track the listeners and do a cleanup, not sure if it makes sense for us to implement something like that, by making JvmMetrics an AutoClosable. Looks like it ends up requiring a lot of changes, not sure if its worth it
Another thing they do is check to see if the class exists first , which I think would be good to do here too
| } | ||
|
|
||
| @Test | ||
| public void testNonOtelMetricsAbsentWhenUseOtelEnabled() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| public void testNonOtelMetricsAbsentWhenUseOtelEnabled() { | |
| void testNonOtelMetricsAbsentWhenUseOtelEnabled() { |
|
|
||
| @Test | ||
| @SuppressWarnings("rawtypes") | ||
| public void testGCDurationHistogramLabels() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| public void testGCDurationHistogramLabels() throws Exception { | |
| void testGCDurationHistogramLabels() throws Exception { |
|
@zeitlinger for my use case this would work fine, although i have some doubts about this implementation as a goal solution - in my PR implementing this metric was just a few lines of code, and now i’d need to import (and in the future update) i like this approach from an architectural standpoint tho, nice to see that client_java is so easily extensible with otel work :) |
Yes, for this particular use case it's only a few lines - but you also get other metrics - and JFR support, etc. I don't think we want to double maintain this if we can avoid it.
we could package the otel sdk and exporter in an "otel support" pom - so you'd only have to import opentelemetry-runtime-telemetry-java8 - would that help? This would also take care of potential conflicts that might arise because the exporter imports this project as well - although it's not a problem in the demo:
There's always going to be an overhead in OTel because it's an abstraction - but OTel is basically just delegating to this project back to make the actual processing of the metrics. The OTel SDK is also very performance sensitive - and currently working in that area.
💯 |
Supercedes #1738
introduces a Prometheus histogram metric
jvm_gc_duration_secondsto record JVM garbage collection pause durations. existing metrics provide limited visibility into long GC pauses, making it difficult to detect latency spikes. this change leverages the already registered GC notifications (as used inJvmMemoryPoolAllocationMetrics) to capture pause durations without additional instrumentation.the histogram uses
0.01, 0.1, 1, 10buckets and includes labels forgc name,action, andcause, enabling detailed monitoring of both short and long GC pauses. this addresses the lack of visibility highlighted in community discussions such as this one. buckets are also defined according to the opentelemetry semantic conventions specExample result: