@zeitlinger
Member

Supersedes #1738

Introduces a Prometheus histogram metric, jvm_gc_duration_seconds, that records JVM garbage collection pause durations. Existing metrics give limited visibility into long GC pauses, which makes latency spikes hard to detect. This change uses the GC notifications that are already registered (as in JvmMemoryPoolAllocationMetrics) to capture pause durations without additional instrumentation.

The histogram uses the buckets 0.01, 0.1, 1, and 10 (in seconds) and includes labels for GC name, action, and cause, enabling detailed monitoring of both short and long GC pauses. This addresses the lack of visibility highlighted in community discussions such as this one. The buckets follow the OpenTelemetry semantic conventions spec.

Example result:

# HELP jvm_gc_duration_seconds JVM GC pause duration histogram.
# TYPE jvm_gc_duration_seconds histogram
jvm_gc_duration_seconds_bucket{action="end of minor GC",cause="G1 Evacuation Pause",gc="G1 Young Generation",le="0.01"} 806
jvm_gc_duration_seconds_bucket{action="end of minor GC",cause="G1 Evacuation Pause",gc="G1 Young Generation",le="0.1"} 806
jvm_gc_duration_seconds_bucket{action="end of minor GC",cause="G1 Evacuation Pause",gc="G1 Young Generation",le="1.0"} 806
jvm_gc_duration_seconds_bucket{action="end of minor GC",cause="G1 Evacuation Pause",gc="G1 Young Generation",le="10.0"} 806
jvm_gc_duration_seconds_bucket{action="end of minor GC",cause="G1 Evacuation Pause",gc="G1 Young Generation",le="+Inf"} 806
jvm_gc_duration_seconds_count{action="end of minor GC",cause="G1 Evacuation Pause",gc="G1 Young Generation"} 806
jvm_gc_duration_seconds_sum{action="end of minor GC",cause="G1 Evacuation Pause",gc="G1 Young Generation"} 0.7360000000000005
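
To make the bucket semantics in the output above concrete, here is a minimal sketch of cumulative `le` bucket counting with the bounds from the description. It uses only the JDK; the class name `GcDurationBuckets` and its methods are hypothetical and are not part of client_java, whose Histogram does considerably more.

```java
import java.util.Arrays;

/** Hypothetical, simplified cumulative histogram; not the client_java implementation. */
final class GcDurationBuckets {
    // Upper bounds in seconds, as listed in the PR description.
    static final double[] UPPER_BOUNDS = {0.01, 0.1, 1.0, 10.0};

    // counts[i] holds observations <= UPPER_BOUNDS[i]; the last slot is the +Inf bucket.
    private final long[] counts = new long[UPPER_BOUNDS.length + 1];
    private double sum;

    void observe(double seconds) {
        for (int i = 0; i < UPPER_BOUNDS.length; i++) {
            if (seconds <= UPPER_BOUNDS[i]) {
                counts[i]++; // buckets are cumulative, so one value can hit several
            }
        }
        counts[UPPER_BOUNDS.length]++; // +Inf counts every observation
        sum += seconds;
    }

    long[] cumulativeCounts() {
        return counts.clone();
    }

    double sum() {
        return sum;
    }

    public static void main(String[] args) {
        GcDurationBuckets histogram = new GcDurationBuckets();
        histogram.observe(0.002); // short young-generation pause
        histogram.observe(0.05);
        histogram.observe(2.5);   // long pause, only visible in le="10.0" and +Inf
        System.out.println(Arrays.toString(histogram.cumulativeCounts())); // [1, 2, 2, 3, 3]
    }
}
```

In the example result, every bucket shows 806 because all recorded pauses were at most 0.01 seconds; a long pause would show up as a gap between the lower and higher buckets.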

gniadeck and others added 11 commits February 6, 2026 17:06
Signed-off-by: gniadeck <77535280+gniadeck@users.noreply.github.com>
…OptIn metric property

Signed-off-by: gniadeck <77535280+gniadeck@users.noreply.github.com>
Signed-off-by: Gregor Zeitlinger <gregor.zeitlinger@grafana.com>
…o list

Replace the boolean property with a comma-separated list of OTel metric
names, giving users fine-grained control over which metrics use
OpenTelemetry Semantic Conventions. Use "*" to enable all metrics.

Signed-off-by: Gregor Zeitlinger <gregor.zeitlinger@grafana.com>
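
For illustration, the opt-in list described in the commit message above could be parsed roughly as follows. This is a sketch under assumptions: the class `OtelOptIn` and its methods are hypothetical names, and the real property name and parsing code in the PR are not shown in this conversation.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical parser for a comma-separated opt-in list of OTel metric names. */
final class OtelOptIn {
    private final Set<String> names;

    private OtelOptIn(Set<String> names) {
        this.names = names;
    }

    /** Parses values such as "jvm.gc.duration, jvm.memory.used" or "*". */
    static OtelOptIn parse(String property) {
        if (property == null || property.trim().isEmpty()) {
            return new OtelOptIn(Collections.emptySet()); // nothing opted in
        }
        Set<String> parsed =
            Arrays.stream(property.split(","))
                .map(String::trim)
                .filter(name -> !name.isEmpty())
                .collect(Collectors.toSet());
        return new OtelOptIn(parsed);
    }

    /** Returns true if the metric is listed explicitly or the wildcard "*" is present. */
    boolean isEnabled(String otelMetricName) {
        return names.contains("*") || names.contains(otelMetricName);
    }

    public static void main(String[] args) {
        OtelOptIn optIn = OtelOptIn.parse("jvm.gc.duration");
        System.out.println(optIn.isEnabled("jvm.gc.duration")); // true
        System.out.println(optIn.isEnabled("jvm.memory.used")); // false
    }
}
```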
@zeitlinger zeitlinger self-assigned this Feb 6, 2026
Signed-off-by: Gregor Zeitlinger <gregor.zeitlinger@grafana.com>
public class JvmGarbageCollectorMetrics {

private static final String JVM_GC_COLLECTION_SECONDS = "jvm_gc_collection_seconds";
private static final String JVM_GC_DURATION = "jvm.gc.duration";
Collaborator

should we be consistent with the way the other metric is defined?

Suggested change
- private static final String JVM_GC_DURATION = "jvm.gc.duration";
+ private static final String JVM_GC_DURATION = "jvm_gc_duration";
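
The two spellings are related: the dotted form is the OTel metric name, and the underscore form plus a unit suffix is what appears in the example output (`jvm_gc_duration_seconds`). A minimal sketch of that mapping, assuming a simple dot-to-underscore rule; the helper `toPrometheusName` is hypothetical and is not the library's actual name sanitization:

```java
/** Hypothetical helper showing how an OTel-style name maps to the exposed name. */
final class MetricNames {
    static String toPrometheusName(String otelName, String unit) {
        String base = otelName.replace('.', '_');
        // Append the unit only if it is present and not already part of the name.
        if (unit == null || unit.isEmpty() || base.endsWith("_" + unit)) {
            return base;
        }
        return base + "_" + unit;
    }

    public static void main(String[] args) {
        System.out.println(toPrometheusName("jvm.gc.duration", "seconds")); // jvm_gc_duration_seconds
    }
}
```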

Comment on lines +115 to +124
if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION.equals(
notification.getType())) {
return;
}

GarbageCollectionNotificationInfo info =
GarbageCollectionNotificationInfo.from(
(CompositeData) notification.getUserData());

observe(gcDurationHistogram, info);
Collaborator

Add a try/catch here to avoid crashing the JVM if the notification listener throws something.

Suggested change
-    if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION.equals(
-        notification.getType())) {
-      return;
-    }
-    GarbageCollectionNotificationInfo info =
-        GarbageCollectionNotificationInfo.from(
-            (CompositeData) notification.getUserData());
-    observe(gcDurationHistogram, info);
+    try {
+      if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION.equals(
+          notification.getType())) {
+        return;
+      }
+      GarbageCollectionNotificationInfo info =
+          GarbageCollectionNotificationInfo.from(
+              (CompositeData) notification.getUserData());
+      observe(gcDurationHistogram, info);
+    } catch (Exception e) {
+      logger.warning(
+          "Exception while processing garbage collection notification: " + e.getMessage());
+    }
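
As a self-contained illustration of the suggestion above, the following sketch wraps the whole handler in a try/catch and registers it on every GC bean that emits notifications. It uses only JDK classes; `GuardedGcListener`, `observedSeconds`, and `register` are hypothetical names, and a plain list stands in for the histogram used in the PR.

```java
import com.sun.management.GarbageCollectionNotificationInfo;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.logging.Logger;
import javax.management.Notification;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;
import javax.management.openmbean.CompositeData;

/** Hypothetical listener that never lets an exception escape into the notification thread. */
final class GuardedGcListener implements NotificationListener {
    private static final Logger logger = Logger.getLogger(GuardedGcListener.class.getName());

    // Stand-in for the histogram: collected GC durations in seconds.
    final List<Double> observedSeconds = new CopyOnWriteArrayList<>();

    @Override
    public void handleNotification(Notification notification, Object handback) {
        try {
            if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION.equals(
                notification.getType())) {
                return; // not a GC notification
            }
            GarbageCollectionNotificationInfo info =
                GarbageCollectionNotificationInfo.from((CompositeData) notification.getUserData());
            // GcInfo.getDuration() is in milliseconds; convert to seconds.
            observedSeconds.add(info.getGcInfo().getDuration() / 1000.0);
        } catch (Exception e) {
            logger.warning(
                "Exception while processing garbage collection notification: " + e.getMessage());
        }
    }

    /** Registers the listener on all GC beans that support notifications; returns how many. */
    static int register(NotificationListener listener) {
        int registered = 0;
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            if (bean instanceof NotificationEmitter) {
                ((NotificationEmitter) bean).addNotificationListener(listener, null, null);
                registered++;
            }
        }
        return registered;
    }

    public static void main(String[] args) {
        GuardedGcListener listener = new GuardedGcListener();
        System.out.println("Registered on " + register(listener) + " GC bean(s)");
    }
}
```

Note that the catch covers `Exception` only, matching the suggestion; an `Error` thrown inside the handler would still propagate.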

.name(JVM_GC_DURATION)
.unit(Unit.SECONDS)
.help("Duration of JVM garbage collection actions.")
.labelNames("jvm.gc.action", "jvm.gc.name", "jvm.gc.cause")
Collaborator

jvm.gc.cause is still experimental, should we hold off on that one until it stabilizes?

registerNotificationListener(gcDurationHistogram);
}

private void registerNotificationListener(Histogram gcDurationHistogram) {
Collaborator

Taking a look at the OTel implementation, they also track the listeners and clean them up. I'm not sure whether it makes sense for us to implement something like that by making JvmMetrics an AutoCloseable. It looks like it would require a lot of changes, so I'm not sure it's worth it.

Another thing they do is check whether the class exists first, which I think would be good to do here too.
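
Both ideas from this comment can be sketched with JDK classes only. This is an illustration under assumptions, not the OTel implementation and not a proposal for the actual JvmMetrics API: `GcListenerRegistration`, `isClassPresent`, and `size` are hypothetical names.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;
import javax.management.ListenerNotFoundException;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;

/** Hypothetical handle that tracks registered GC listeners and removes them on close. */
final class GcListenerRegistration implements AutoCloseable {
    private final List<Runnable> cleanups = new ArrayList<>();

    /** Returns true if the class can be loaded, without initializing it. */
    static boolean isClassPresent(String className) {
        try {
            Class.forName(className, false, GcListenerRegistration.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    static GcListenerRegistration register(NotificationListener listener) {
        GcListenerRegistration registration = new GcListenerRegistration();
        // Skip registration entirely on JVMs that do not ship the notification class.
        if (!isClassPresent("com.sun.management.GarbageCollectionNotificationInfo")) {
            return registration;
        }
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            if (bean instanceof NotificationEmitter) {
                NotificationEmitter emitter = (NotificationEmitter) bean;
                emitter.addNotificationListener(listener, null, null);
                registration.cleanups.add(() -> {
                    try {
                        emitter.removeNotificationListener(listener);
                    } catch (ListenerNotFoundException ignored) {
                        // Already removed; nothing to clean up.
                    }
                });
            }
        }
        return registration;
    }

    /** Number of listeners still registered through this handle. */
    int size() {
        return cleanups.size();
    }

    @Override
    public void close() {
        cleanups.forEach(Runnable::run);
        cleanups.clear();
    }

    public static void main(String[] args) {
        try (GcListenerRegistration registration = register((notification, handback) -> { })) {
            System.out.println("Tracking " + registration.size() + " listener(s)");
        }
    }
}
```

The existence check uses `Class.forName` with `initialize` set to `false`, so probing for the class does not run its static initializers.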

}

@Test
public void testNonOtelMetricsAbsentWhenUseOtelEnabled() {
Collaborator

Suggested change
- public void testNonOtelMetricsAbsentWhenUseOtelEnabled() {
+ void testNonOtelMetricsAbsentWhenUseOtelEnabled() {


@Test
@SuppressWarnings("rawtypes")
public void testGCDurationHistogramLabels() throws Exception {
Collaborator

Suggested change
- public void testGCDurationHistogramLabels() throws Exception {
+ void testGCDurationHistogramLabels() throws Exception {

@zeitlinger
Member Author

Thinking about this more, I created #1861 as an alternative that doesn't duplicate the OTel instrumentation work.

Would that work for you @gniadeck ?

@zeitlinger zeitlinger marked this pull request as draft February 9, 2026 16:21
@gniadeck
gniadeck commented Feb 9, 2026

@zeitlinger for my use case this would work fine, although i have some doubts about this implementation as a goal solution - in my PR implementing this metric was just a few lines of code, and now i’d need to import (and in the future update) opentelemetry-sdk, opentelemetry-exporter-prometheus, opentelemetry-runtime-telemetry-java-8 (according to the demo), just to get better GC visibility. i haven’t tested the performance, so i’m not sure what the footprint would be, but it could be a no-go for some low-latency projects that choose prometheus for its speed.

i like this approach from an architectural standpoint tho, nice to see that client_java is so easily extensible with otel work :)

@zeitlinger
Member Author

now i’d need to import...

Yes, for this particular use case it's only a few lines - but you also get other metrics - and JFR support, etc.

I don't think we want to double maintain this if we can avoid it.

(and in the future update) opentelemetry-sdk, opentelemetry-exporter-prometheus, opentelemetry-runtime-telemetry-java-8 (according to the demo)

we could package the otel sdk and exporter in an "otel support" pom - so you'd only have to import opentelemetry-runtime-telemetry-java8 - would that help?

This would also take care of potential conflicts that might arise because the exporter imports this project as well - although it's not a problem in the demo:

[INFO] --- dependency:3.10.0:tree (default-cli) @ example-otel-jvm-runtime-metrics ---
[INFO] io.prometheus:example-otel-jvm-runtime-metrics:jar:1.5.0-SNAPSHOT
[INFO] +- io.prometheus:prometheus-metrics-core:jar:1.5.0-SNAPSHOT:compile
[INFO] |  +- io.prometheus:prometheus-metrics-model:jar:1.5.0-SNAPSHOT:compile
[INFO] |  +- io.prometheus:prometheus-metrics-config:jar:1.5.0-SNAPSHOT:compile
[INFO] |  \- io.prometheus:prometheus-metrics-tracer-initializer:jar:1.5.0-SNAPSHOT:compile
[INFO] |     +- io.prometheus:prometheus-metrics-tracer-common:jar:1.5.0-SNAPSHOT:compile
[INFO] |     +- io.prometheus:prometheus-metrics-tracer-otel:jar:1.5.0-SNAPSHOT:compile
[INFO] |     \- io.prometheus:prometheus-metrics-tracer-otel-agent:jar:1.5.0-SNAPSHOT:compile
[INFO] +- io.prometheus:prometheus-metrics-exporter-httpserver:jar:1.5.0-SNAPSHOT:compile
[INFO] |  \- io.prometheus:prometheus-metrics-exporter-common:jar:1.5.0-SNAPSHOT:compile
[INFO] |     +- io.prometheus:prometheus-metrics-exposition-textformats:jar:1.5.0-SNAPSHOT:compile
[INFO] |     \- io.prometheus:prometheus-metrics-exposition-formats:jar:1.5.0-SNAPSHOT:runtime <--- here
[INFO] +- io.opentelemetry:opentelemetry-sdk:jar:1.58.0:compile
[INFO] |  +- io.opentelemetry:opentelemetry-api:jar:1.58.0:compile
[INFO] |  |  \- io.opentelemetry:opentelemetry-context:jar:1.58.0:compile
[INFO] |  |     \- io.opentelemetry:opentelemetry-common:jar:1.58.0:compile
[INFO] |  +- io.opentelemetry:opentelemetry-sdk-common:jar:1.58.0:compile
[INFO] |  +- io.opentelemetry:opentelemetry-sdk-trace:jar:1.58.0:compile
[INFO] |  +- io.opentelemetry:opentelemetry-sdk-metrics:jar:1.58.0:compile
[INFO] |  \- io.opentelemetry:opentelemetry-sdk-logs:jar:1.58.0:compile
[INFO] +- io.opentelemetry:opentelemetry-exporter-prometheus:jar:1.58.0-alpha:compile
[INFO] |  +- io.opentelemetry:opentelemetry-exporter-common:jar:1.58.0:runtime
[INFO] |  +- io.opentelemetry:opentelemetry-sdk-extension-autoconfigure-spi:jar:1.58.0:runtime
[INFO] |  \- io.prometheus:prometheus-metrics-exposition-formats-no-protobuf:jar:1.3.10:runtime  <--- here

i haven’t tested the performance, so i’m not sure what the footprint would be, but it could be a no-go for some low-latency projects that choose prometheus for its speed.

There's always going to be some overhead in OTel because it's an abstraction, but OTel is basically just delegating back to this project to do the actual processing of the metrics.

The OTel SDK is also very performance sensitive, and work in that area is ongoing.

i like this approach from an architectural standpoint tho, nice to see that client_java is so easily extensible with otel work :)

💯
