30 Nov 2023

The IBM approach to reliable quantum computing

As we transition into the era of quantum utility, the expectations for reliability in quantum computing are evolving.

With this shift, the need for our services to produce predictable, reliable results is self-evident. We’re proud of the significant strides in reliability that our team has made over the past year, and we’re excited to share our journey and the promise of even greater reliability in the years to come.

To support Qiskit Runtime workloads, we currently operate 34 Kubernetes clusters distributed across 10 datacenters, with a total of 49 terabytes of physical memory and over 10,000 compute cores, running microservices written in C, C++, Go, JavaScript, Python, and Rust. This infrastructure and microservice architecture handles one billion hardware circuits per day across 2,192 qubits. It is the backbone of our commitment to reliability and the foundation that we will continue to evolve.

A high-level view of our software architecture.

Our team of developers has been working to tighten the internal feedback loops between our quantum systems and the teams responsible for their innovation and maintenance. This effort has been pivotal in our progress. By measuring and raising awareness about the key reliability metrics, we've created a rigorous cycle of continuous improvement.

We track two key reliability metrics against Qiskit Runtime workloads: the success rate of jobs, and the percentage of execution time spent on successful jobs. By segmenting these metrics (by program, environment, quantum backend, or processor family), we give individual developer teams actionable insights. This granularity has been crucial to measuring the impact of specific issues and prioritizing development efforts in response.
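
To make the two metrics concrete, here is a minimal Python sketch of how they could be computed and segmented. It is an illustration, not our production pipeline: the `JobRecord` fields, the `reliability_by` helper, and the sample backend names are all hypothetical, and it assumes that "percentage of execution time spent on successful jobs" means the execution time of successful jobs divided by the execution time of all jobs in the segment.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRecord:
    """Hypothetical record of one finished job; field names are illustrative."""
    program: str
    backend: str
    succeeded: bool
    execution_seconds: float


def reliability_by(jobs, key):
    """Group jobs by the attribute `key` and return
    {segment: (success_rate, successful_time_fraction)}."""
    groups = defaultdict(list)
    for job in jobs:
        groups[getattr(job, key)].append(job)

    metrics = {}
    for segment, members in groups.items():
        successes = [j for j in members if j.succeeded]
        total_time = sum(j.execution_seconds for j in members)
        good_time = sum(j.execution_seconds for j in successes)
        # Metric 1: share of jobs in this segment that succeeded.
        success_rate = len(successes) / len(members)
        # Metric 2: share of execution time spent on successful jobs
        # (assumed definition; guarded against zero total time).
        time_fraction = good_time / total_time if total_time > 0 else 0.0
        metrics[segment] = (success_rate, time_fraction)
    return metrics


# Made-up sample data, for illustration only.
jobs = [
    JobRecord("sampler", "backend_a", True, 30.0),
    JobRecord("sampler", "backend_a", False, 10.0),
    JobRecord("estimator", "backend_b", True, 20.0),
    JobRecord("estimator", "backend_a", True, 60.0),
]

by_backend = reliability_by(jobs, "backend")
by_program = reliability_by(jobs, "program")
```

Calling the same helper with a different key ("program" instead of "backend") slices the same job records along another dimension, which is the kind of segmentation that lets one team see only the failures relevant to it. The two metrics can also diverge: in the sample data, `backend_a` succeeds on two of three jobs but spends 90% of its execution time on successful ones, because the failed job was short.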

We continue to invest in the observability of our quantum systems, which has made us better at anticipating, preventing, and resolving issues as they arise. We’ve integrated more researchers and developers into our observability efforts, enabling them to monitor for reliability issues and alert the necessary teams when things go awry. This has not only improved our incident response but has also fostered a culture of reflection and learning, where writing detailed incident reports has become a standard practice, and every challenge is an opportunity to improve our systems.

Today, we’re celebrating a milestone: we’ve improved our reliability by an order of magnitude over the past year. This is a testament to our ability to identify and prioritize problems and tackle them systematically. We've addressed the low-hanging fruit that once caused quantum jobs to fail and are now setting our sights on achieving another order of magnitude of improvement next year. Reliability is now a driving force behind long-term architectural changes in our software stack and infrastructure.

The success rate of quantum jobs over the past year.

Moreover, we’ve tightened feedback with our users. When jobs fail, we’re providing increasingly granular error codes with actionable error messages, guiding users through troubleshooting and problem resolution wherever possible. This not only improves user experience but also further aids our internal observability processes.

As we look ahead, we want to make a promise: when you see an error message for your quantum job, especially an “Internal Error,” know that it triggers an immediate response from our engineering team. There’s a pager going off, and an engineer is actively engaged in triage and root cause analysis behind that error. We are more deeply aware of your user experience than ever before.

This level of dedication is what drives us forward and what will continue to elevate the reliability of IBM Quantum’s offerings. The progress we've made this year is just the beginning. We are committed to pursuing ever-higher levels of reliability, ensuring that as quantum computing enters an era of utility, IBM Quantum is ready to support utility-scale workloads.