Datadog, Inc. announced the general availability of its new GPU Monitoring solution on April 24, 2026, marking a significant expansion of its observability platform into the specialized infrastructure powering artificial intelligence. The launch addresses the growing complexity and high operational costs of scaling generative AI models and machine learning workloads across hybrid and multi-cloud environments. By providing granular visibility into hardware performance, Datadog aims to help engineering and finance teams optimize their significant investments in compute resources.
The new GPU Monitoring tool provides real-time metrics on critical hardware health and performance indicators, including GPU utilization, memory allocation, power consumption, and thermal levels. These insights are integrated directly into the Datadog platform, allowing users to correlate GPU performance with application-level metrics and logs. The service supports a wide range of hardware, including the latest NVIDIA Blackwell and Hopper architectures, as well as AMD Instinct accelerators. It is designed to work seamlessly across major cloud providers—Amazon Web Services, Microsoft Azure, and Google Cloud Platform—and on-premises data centers.
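To make the idea of correlating hardware and application telemetry concrete, here is a minimal sketch of what a tagged GPU metric sample might look like. The field names, tag keys, and values are illustrative assumptions, not Datadog's actual metric schema; shared tags (team, model, cloud) are what would let a platform join GPU readings with application metrics and logs.

```python
from dataclasses import dataclass, field

@dataclass
class GpuSample:
    """Hypothetical tagged GPU telemetry sample (illustrative, not Datadog's schema)."""
    gpu_id: str
    utilization_pct: float   # 0-100, fraction of the interval the GPU was busy
    memory_used_gib: float   # device memory currently allocated
    power_watts: float       # instantaneous power draw
    temperature_c: float     # die temperature
    tags: dict = field(default_factory=dict)  # shared tags enable correlation

# Example reading; the tag values are made up for illustration.
sample = GpuSample(
    gpu_id="gpu-0",
    utilization_pct=87.5,
    memory_used_gib=61.2,
    power_watts=415.0,
    temperature_c=68.0,
    tags={"team": "ml-platform", "model": "chat-v2", "cloud": "aws"},
)
```

Because the same `team` and `model` tags would appear on application-level metrics, a query can group or filter both data sets along identical dimensions.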
A central focus of the launch is the management of cloud expenditures. As organizations increase their reliance on high-end chips like the NVIDIA H100 and B200, the associated costs have become a primary concern for Chief Technology Officers. Datadog’s solution includes cost-attribution features that allow companies to track GPU spend by team, project, or specific model. This level of detail is intended to prevent idle waste, where expensive GPU resources are provisioned but not actively processing workloads, a common inefficiency in large-scale AI training and inference.
Yrieix Garnier, Vice President of Product at Datadog, stated that the complexity of AI infrastructure requires a new level of transparency. Garnier noted that while many organizations have successfully moved AI prototypes into production, they often struggle with the black-box nature of GPU resource consumption. The new monitoring capabilities are intended to provide the telemetry needed to ensure that AI applications remain performant and cost-effective at scale. The platform also includes automated alerting for hardware failures or performance bottlenecks that could disrupt critical AI services.
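The automated alerting described above can be thought of as threshold checks over the same telemetry. The sketch below is a hypothetical evaluator with made-up thresholds and metric names; real monitors would be configured in the platform, not hand-coded like this.

```python
# Illustrative alert thresholds (assumptions, not Datadog defaults).
THRESHOLDS = {
    "temperature_c": 85.0,    # sustained thermals above this risk throttling
    "utilization_pct": 5.0,   # below this, flag possibly idle (wasted) capacity
}

def evaluate_alerts(reading: dict) -> list:
    """Return alert messages for a single GPU reading outside the configured bounds."""
    alerts = []
    if reading.get("temperature_c", 0.0) > THRESHOLDS["temperature_c"]:
        alerts.append(
            f"gpu {reading['gpu_id']}: temperature {reading['temperature_c']}C over limit"
        )
    if reading.get("utilization_pct", 100.0) < THRESHOLDS["utilization_pct"]:
        alerts.append(
            f"gpu {reading['gpu_id']}: utilization {reading['utilization_pct']}% suggests idle hardware"
        )
    return alerts

hot_idle = {"gpu_id": "gpu-3", "temperature_c": 91.0, "utilization_pct": 2.0}
healthy = {"gpu_id": "gpu-0", "temperature_c": 68.0, "utilization_pct": 87.5}
fired = evaluate_alerts(hot_idle)   # both conditions trip for this reading
```

In practice a monitor would also require the condition to hold over a window (to avoid flapping) and route the alert to an on-call rotation.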
This release complements Datadog’s existing AI-focused offerings, including LLM Observability and the Bits AI assistant. By adding hardware-level monitoring, Datadog now provides an end-to-end view of the AI stack, from the underlying silicon to the final model output. The company confirmed that GPU Monitoring is available immediately as part of its infrastructure monitoring suite, with pricing based on the number of monitored GPU instances.