IBM at ICSE 2026
- Rio de Janeiro, Brazil
About
IBM is proud to sponsor ICSE 2026, the IEEE/ACM International Conference on Software Engineering. ICSE is the premier software engineering conference. It will be held April 12-18, 2026, in Rio de Janeiro, Brazil, with the core conference days running from Wednesday, April 15, to Friday, April 17.
ICSE provides a forum where researchers, practitioners, and educators gather to present and discuss research results, innovations, trends, experiences, and issues in the field of software engineering.
IBM Booth Schedule
Visit us at booth #8 on Wednesday, Thursday, and Friday from 8:00 am to 5:00 pm.
- ALICE: Agentic Logic for Incident and Code bug Elimination
- ASTER: AI-powered automated test generation at multiple levels
- Business Rules Discovery
- Functional Testing
- iSWE: IBM Software Engineering Agent for automated code remediation (Martin Hirzel)
- PL/I to Java: LLM-Assisted Translation of PL/I Macro Procedures to Java (Takaaki Tateishi)
- Path Guider
- ScarfBench: Enterprise Java framework migration benchmark
- VerSE: Verifiable and composable software engineering
View the agenda below for our conference presentation schedule:
Agenda
- Description:
Systems incorporating large language models (LLMs) as a component are known to be sensitive (i.e., non-robust) to minor input variations that do not change the meaning of the input; such sensitivity may reduce the system’s usefulness. Here, we present a framework to evaluate robustness of systems using COBOL code as input; our application is translation between COBOL and Java programming languages, but the approach extends to other tasks such as code generation or explanation. Targeting robustness of systems with COBOL as input is essential yet challenging. Many business-critical applications are written in COBOL, yet these are typically proprietary legacy applications and their code is unavailable to LLMs for training. We develop a library of COBOL paragraph and full-program perturbation methods, and create variant-expanded versions of a benchmark dataset of examples for a specific task. The robustness of the LLM-based system is evaluated by measuring changes in values of individual and aggregate metrics calculated on the system’s outputs. Finally, we present a series of dynamic table and chart visualization dashboards that assist in debugging the system’s outputs, and monitoring and understanding root causes of the system’s sensitivity to input variation. These tools can be further used to improve the system by, for instance, indicating variations that should be handled by pre-processing steps.
Authors:
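For illustration only, here is a minimal sketch (not taken from the paper or its benchmark) of the kind of meaning-preserving perturbation such a library could apply to COBOL source; the snippet and rename rule are hypothetical:

```python
import re

def rename_data_item(cobol_source: str, old_name: str, new_name: str) -> str:
    """Consistently rename a COBOL data item; whole-word, case-insensitive,
    so the program's meaning is unchanged while its surface form varies."""
    return re.sub(rf"\b{re.escape(old_name)}\b", new_name, cobol_source, flags=re.IGNORECASE)

original = """
       01  WS-TOTAL       PIC 9(7)V99 VALUE ZERO.
       PROCEDURE DIVISION.
           ADD IN-AMOUNT TO WS-TOTAL.
           DISPLAY WS-TOTAL.
"""

variant = rename_data_item(original, "WS-TOTAL", "WS-GRAND-TOTAL")
# The original and the variant should translate to equivalent Java; a robustness
# evaluation compares the system's outputs (and metrics) across both inputs.
```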
- Description:
As REST APIs have become widespread in modern web services, comprehensive testing of these APIs is increasingly crucial. Because of the vast search space of operations, parameters, and parameter values, along with their dependencies and constraints, current testing tools often achieve low code coverage, resulting in suboptimal fault detection. To address this limitation, we present AutoRestTest, a novel tool that integrates the Semantic Property Dependency Graph (SPDG) with Multi-Agent Reinforcement Learning (MARL) and large language models (LLMs) for effective REST API testing. AutoRestTest determines operation-dependent parameters using the SPDG and employs five specialized agents (operation, parameter, value, dependency, and header) to identify dependencies of operations and generate operation sequences, parameter combinations, and values. Through an intuitive command-line interface, users can easily configure and monitor tests with successful operation count, unique server errors detected, and time elapsed. Upon completion, AutoRestTest generates a detailed report highlighting errors detected and operations exercised.
Authors: Tyler Stennett (non-IBM), Myeongsoo Kim (non-IBM), Principal Research Scientist, Alessandro Orso (non-IBM)
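As a rough illustration of a dependency graph between REST operations, the toy sketch below uses hypothetical endpoints and fields; it is not AutoRestTest's actual implementation:

```python
# Toy dependency edges: a producer operation's response field feeds a consumer
# operation's parameter (the endpoints and field names here are hypothetical).
spdg = {
    ("POST /users", "response.id"): [("GET /users/{id}", "path.id"),
                                     ("POST /orders", "body.userId")],
}

def ordered_calls(graph):
    """Derive a call order in which producers are exercised before consumers,
    so generated values can flow along the dependency edges."""
    producers = [src for (src, _field) in graph]
    consumers = [op for edges in graph.values() for (op, _param) in edges]
    return producers + [op for op in consumers if op not in producers]

print(ordered_calls(spdg))
# ['POST /users', 'GET /users/{id}', 'POST /orders']
```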
- Description:
Interaction with geospatial data requires domain-specific knowledge and tools to access and process data from remote sensing, surveys, and Internet-of-Things objects. To overcome the data processing challenge, we propose an agentic framework based on prompting a large language model to initiate geospatial processing, such as filtering vector data and automatically discovering and cropping raster data. Furthermore, the agentic framework can access a Geospatial Foundation Model library in real time and run fine-tuning/inference for the area of interest. The generated geospatial images are returned to the user through the prompt, and integration of a visual language model enables image captioning and description in order to better describe the geospatial data. The proposed framework orchestrates multiple agents in the background to seamlessly retrieve vector and raster data for the area of interest and distill complex data into readily usable information.
Authors: Leonardo Tizzei (IBM); Research Data Scientist; Scientist, Geospatial Analytics; Ildar Khabibrakhmanov (IBM); Maciel Zortea (IBM); Hiyam Debary (IBM); and 6 more
- Description:
We introduce a comprehensive validation framework for LLM-based agentic systems that provides systematic diagnosis and improvement of reliability failures. The framework includes fifteen failure-detection tools and two root-cause analysis modules that jointly uncover weaknesses across input handling, prompt design, and output generation. It integrates lightweight rule-based checks with LLM-as-a-judge assessments to support structured incident detection, classification, and repair. We applied the framework to IBM CUGA, evaluating its performance on the AppWorld and WebArena benchmarks. The analysis revealed recurrent planner misalignments, schema violations, brittle prompt dependencies, and more. Based on these insights, we refined both prompting and coding strategies, maintaining CUGA’s benchmark results while enabling mid-sized models such as Llama 4 and Mistral Medium to achieve notable accuracy gains, substantially narrowing the gap with frontier models. Beyond quantitative validation, we conducted an exploratory study that fed the framework’s diagnostic outputs and agent description into an LLM for self-reflection and prioritization. This interactive analysis produced actionable insights on recurring failure patterns and focus areas for improvement, demonstrating how validation itself can evolve into an agentic, dialogue-driven process. These results show a path toward scalable quality assurance and adaptive validation in production agentic systems, offering a foundation for more robust, interpretable, and self-improving agentic architectures.
Authors: AI Research Scientist, Sergey Zeltyn (IBM), AI Research Scientist, Liane Galanti (IBM), Avi Yaeli (IBM), Segev Shlomov (IBM)
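A minimal sketch of what a lightweight rule-based output check might look like, assuming a hypothetical tool-call schema; it is not a component of the framework described above:

```python
import json

REQUIRED_FIELDS = {"action", "arguments"}  # hypothetical tool-call schema

def check_tool_call(raw_output: str) -> list[str]:
    """Lightweight rule-based check: flag malformed JSON and missing fields
    before any more expensive LLM-as-a-judge assessment is invoked."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    missing = REQUIRED_FIELDS - payload.keys()
    return [f"schema violation: missing {sorted(missing)}"] if missing else []

print(check_tool_call('{"action": "search"}'))
# ["schema violation: missing ['arguments']"]
```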
- Description:
AI-powered coding assistants are becoming ubiquitous, intimately embedded in the software development process. The rise of large language models (LLMs) has accelerated the development of automated techniques and tools for supporting various software engineering tasks, e.g., program understanding, code generation, software testing, and program repair. As CodeLLMs are being employed toward automating these tasks, one question that arises, especially in enterprise settings, is whether these coding assistants and the code LLMs that power them are ready for real-world projects and enterprise use cases. In this paper we survey 57 developers from different domains and with varying software engineering skills about their experience with AI coding assistants and CodeLLMs. In parallel, we reviewed 35 user surveys on the usage, experience, and expectations of professionals and students using AI coding assistants and CodeLLMs. Based on our study findings and analysis of existing surveys, we discuss the requirements for AI-powered coding assistants.
Authors: Michele Merler (IBM), Staff Research Scientist, Rahul Krishna (IBM), Tin Kam Ho (IBM), Raju Pavuluri (IBM), Maja Vukovic (IBM)
- Description:
A recent research trend advocating for smaller, specialized code LLMs in agentic frameworks in conjunction with frontier ones has sparked interest in developing efficient strategies for multi-task learning while balancing performance, resource constraints, and deployment costs of such models. We investigate optimal approaches for creating small, multi-task code LLMs by comparing data mixing versus model merging strategies. We conduct extensive experiments across two model families (Qwen Coder and DeepSeek Coder) at two scales (2B and 7B parameters), fine-tuning them for code generation and code summarization tasks. Our evaluation on HumanEval, MBPP, and Code-to-Test (CodeXGlue) benchmarks reveals that model merging achieves the best overall performance at larger scale across model families, retaining 96% of specialized model performance on code generation tasks while maintaining summarization capabilities. Notably, merged models can even surpass individually fine-tuned models, with our best configuration of Qwen Coder 2.5 7B model achieving 92.7% Pass@1 on HumanEval compared to 90.9% for its task-specific fine-tuned equivalent. At smaller scale we find instead data mixing to be a preferred strategy to obtain a capable multi-task model. We further introduce a weight analysis technique to understand how different tasks affect model parameters and their implications for merging strategies. The results suggest that careful merging and mixing strategies can effectively combine task-specific capabilities without significant performance degradation, making them ideal for resource-constrained deployment scenarios.
Authors: Mingzhi Zhu (non-IBM), Michele Merler (IBM), Stacy Patterson (non-IBM), Raju Pavuluri (IBM), Rahul Krishna (IBM), Boris Sobolev (non-IBM)
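The sketch below illustrates the simplest form of model merging, uniform weight averaging of two fine-tuned checkpoints that share an architecture; the checkpoint file names are hypothetical, and the paper's best configurations may use different schemes:

```python
import torch

def average_merge(path_a: str, path_b: str, out_path: str) -> None:
    """Uniformly average two task-specific checkpoints into one multi-task model."""
    sd_a = torch.load(path_a, map_location="cpu")
    sd_b = torch.load(path_b, map_location="cpu")
    merged = {name: (sd_a[name] + sd_b[name]) / 2 for name in sd_a}
    torch.save(merged, out_path)

# Hypothetical checkpoints fine-tuned for code generation and code summarization:
# average_merge("codegen.pt", "summarize.pt", "multitask.pt")
```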
- Description:
Automated test generation (ATG), which aims to reduce the cost of manual test suite development, has been investigated for decades and has produced countless techniques based on a variety of approaches: symbolic analysis, search-based, random and adaptive-random, learning-based, and, most recently, large-language-models-based approaches. However, despite this large body of research, there is still a gap in our understanding of the characteristics of developer-written tests and, consequently, in our assessment of how well ATG techniques and tools can generate realistic and representative tests. To bridge this gap, we conducted an extensive empirical study of developer-written tests for Java applications, covering 1.7 million test cases. Our study is the first of its kind in studying aspects of developer-written tests that are mostly neglected in the existing literature, such as test scope, complexity of test fixtures and assertions, types of inputs, and use of mocking. Based on the characterization, we then compare existing tests with those generated by two state-of-the-art ATG tools. Our results highlight that a vast majority of developer-written tests exhibit characteristics and complexity that are beyond the capabilities of current ATG tools. Finally, based on the insights gained from the study, we identify promising research directions that can help bridge the gap between current tool capabilities and more effective tool support for developer testing practices. We hope that this work can set the stage for new advances in the field and bring ATG tools closer to generating the types of tests developers write.
Authors: Staff Research Scientist, Tyler Stennett (IBM), Raju Pavuluri (IBM), Nate Levin (non-IBM), Alessandro Orso (non-IBM), Principal Research Scientist
- Description:
Automated generation of test input data is a major technical challenge in software engineering research. We are developing a tool for automated test input data generation for COBOL, a language widely used in enterprise applications. As in prior studies, we employ a hybrid approach that combines symbolic execution and constraint solving. In addition, to accommodate the various data representations in COBOL, we introduce a byte-level data representation into the memory model used by our tool. However, this increases constraint complexity and size, leading to degraded solver performance. To address this issue, we propose transforming constraint logic formulae through value concretization. Our evaluation shows that this technique effectively alleviates the performance problem.
Authors: Toshiaki Yasue (IBM); Kohichi Ono (IBM); Senior Research Scientist; STSM - AI4Z, Software Innovation Lab; Software Innovation Lab Japan Lead, STSM, Agents and Automation
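To illustrate value concretization, the sketch below pins one symbolic variable to a concrete value before solving, using the Z3 Python bindings; the constraint models a simplified, hypothetical COBOL field rather than the tool's byte-level memory model:

```python
from z3 import Int, IntVal, Solver, And, substitute, sat

# Two symbolic digits of a COBOL PIC 99 field (simplified to integers).
d1, d2 = Int("d1"), Int("d2")
value = d1 * 10 + d2
path_condition = And(0 <= d1, d1 <= 9, 0 <= d2, d2 <= 9, value > 40)

# Value concretization: fix one symbolic variable to a concrete value so the
# remaining formula handed to the solver is smaller and cheaper to solve.
concretized = substitute(path_condition, (d1, IntVal(5)))

solver = Solver()
solver.add(concretized)
if solver.check() == sat:
    print(solver.model())  # e.g. d2 = 0, yielding test input "50"
```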
- Description:
Like traditional software, AI agents are prone to failure; specifically, they can enter ‘repetitive futile cycles’ — loops of unproductive behavior that are particularly difficult to detect. This paper introduces the concept of futile cycles and distinguishes them from productive cycles in agent execution trajectories. We propose unsupervised approaches for detecting futile cycles that leverage both structural and semantic representations of agent trajectories, evaluated on a large dataset of trajectories for a LangGraph-based stock market multi-agent application. Our hybrid approach achieves an F1 score of (precision: , recall: ), significantly outperforming individual structural (F1: ) and semantic (F1: ) methods.
Authors: Research Software Engineer; AI for IT Automation; Divya Pathak (IBM); Research Scientist; Mudit Verma (IBM); Senior Technical Staff Member & Senior Manager, AI for IT Automation
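A minimal, purely structural illustration of spotting a repetitive futile cycle in an agent trajectory; the trajectory format and threshold are hypothetical, not the paper's method:

```python
from collections import Counter

def detect_futile_cycle(trajectory, min_repeats: int = 3):
    """Structural signal for a futile cycle: the same (node, action, observation)
    step recurring many times without producing anything new."""
    steps = Counter(trajectory)
    return [step for step, count in steps.items() if count >= min_repeats]

# Hypothetical trajectory: the agent keeps re-issuing the same lookup and
# getting the same observation back.
trace = [("planner", "lookup:IBM", "price=210"),
         ("planner", "lookup:IBM", "price=210"),
         ("planner", "lookup:IBM", "price=210")]
print(detect_futile_cycle(trace))
```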
- Description:
A software engineering issue (SWE issue) is easier to resolve when accompanied by a reproduction test. Unfortunately, most issues do not come with functioning reproduction tests, so this paper explores how to generate them automatically. The main difficulty is that the code to be tested is either missing or wrong, as evidenced by the existence of the issue in the first place. This has held back test generation for this scenario: without the correct code to execute, it is difficult to leverage execution feedback to generate good tests. This paper introduces novel ideas for leveraging execution feedback despite this problem, implemented in a new reproduction test generator called e-Otter++. Experiments show that e-Otter++ represents a leap ahead of the state of the art for this problem, generating tests with an average fail-to-pass rate of 63% on the TDD-Bench Verified benchmark.
Authors: Toufique Ahmed (IBM), Senior Research Engineer - AI for Code, Avi Shinnar (IBM), Martin Hirzel (IBM)
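For context, the fail-to-pass criterion can be sketched as follows, assuming hypothetical pre-fix and post-fix checkouts and pytest as the test runner; this is not e-Otter++'s implementation:

```python
import subprocess

def is_fail_to_pass(test_path: str, repo_before: str, repo_after: str) -> bool:
    """A reproduction test is useful when it fails on the buggy revision and
    passes once the issue is resolved (the fail-to-pass criterion)."""
    def run(repo: str) -> int:
        return subprocess.run(["python", "-m", "pytest", test_path],
                              cwd=repo, capture_output=True).returncode
    return run(repo_before) != 0 and run(repo_after) == 0

# Hypothetical checkouts of the same project before and after the fix:
# print(is_fail_to_pass("tests/test_issue_123.py", "checkout_buggy", "checkout_fixed"))
```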
- Description:
Artificial Intelligence (AI) applications, such as Large Language Models, are primarily driven and executed by Graphics Processing Units (GPUs).
These GPU programs (kernels) consume substantial amounts of energy, yet software developers often lack the hardware expertise and ad hoc knowledge required to optimize for power efficiency. We propose FlipFlop, a framework using static code analysis to predict energy consumption and recommend Pareto-optimal thread block configurations considering both power consumption and execution time. Our framework requires no runtime execution and analyzes PTX code, a low-level instruction set for CUDA-enabled GPUs. It is validated across a diverse set of GPUs and kernels, including multi-head attention, convolution, and matrix multiplication. FlipFlop achieves 83% accuracy in identifying locally optimal energy-efficient configurations, while also minimizing developer effort by reducing the optimization search space by 93.4%. For multi-head attention kernels, it yields up to 79% energy savings and 106% throughput gains relative to NVIDIA's occupancy heuristic. By integrating static analysis with real-time monitoring and providing explainable optimization guidance, FlipFlop empowers developers to create sustainable, high-performance GPU software that minimizes environmental and computational costs.
Authors: Saurabhsingh Rajput (non-IBM); Alex Brandt (non-IBM); Senior Research Scientist, Manager; Tushar Sharma (non-IBM)
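A small sketch of selecting Pareto-optimal thread block configurations over predicted energy and execution time; the candidate configurations and numbers below are hypothetical, purely for illustration:

```python
def pareto_front(configs):
    """Keep only thread-block configurations not dominated in both predicted
    energy and predicted execution time."""
    front = []
    for name, energy, time in configs:
        dominated = any(e <= energy and t <= time and (e, t) != (energy, time)
                        for _, e, t in configs)
        if not dominated:
            front.append((name, energy, time))
    return front

# Hypothetical static-analysis predictions (joules, milliseconds) per block size:
candidates = [("128x1", 3.2, 12.8), ("256x1", 2.9, 12.5), ("512x1", 3.5, 11.0)]
print(pareto_front(candidates))  # 256x1 and 512x1 survive; 128x1 is dominated
```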
- Description:
Enterprise applications are typically tested at multiple levels, with service-level testing playing an important role in validating application functionality. Existing service-level testing tools, especially for RESTful APIs, often employ fuzzing and/or depend on OpenAPI specifications, which are not readily available in real-world enterprise codebases. Moreover, these tools are limited in their ability to generate functional tests that effectively exercise meaningful scenarios. In this work, we present SAINT, a novel white-box testing approach for service-level testing of enterprise Java applications. SAINT combines static analysis, large language models (LLMs), and LLM-based agents to automatically generate endpoint and scenario-based tests. The approach builds two key models: an endpoint model, capturing syntactic and semantic information about service endpoints, and an operation dependency graph, capturing inter-endpoint ordering constraints. SAINT then employs LLM-based agents to generate tests. Endpoint-focused tests aim to maximize code and database interaction coverage. Scenario-based tests are synthesized by extracting application use cases from code and refining them into executable tests via the planning, action, and reflection phases of the agentic loop. We evaluated SAINT on eight Java applications, including a proprietary enterprise application. Our results illustrate the effectiveness of SAINT in coverage, fault detection, and scenario generation. Moreover, a developer survey provides strong endorsement of the scenario-based tests generated by SAINT. Overall, our work shows that combining static analysis with agentic LLM workflows enables more effective, functional, and developer-aligned service-level test generation.
Authors: Staff Research Scientist, Raju Pavuluri (IBM), Ruikai Huang (non-IBM), Tyler Stennett (non-IBM), Rahul Krishna (IBM), Alessandro Orso (non-IBM), and 1 more
- Description:
As Large Language Models (LLMs) continue to advance in their ability to process natural and programming languages, they show promise in complex translation tasks across domains with strict compliance requirements. However, ensuring consistency in legally critical domains remains challenging due to inherent limitations, such as natural language ambiguity and the tendency to generate hallucinations. This paper explores an agentic approach that leverages LLMs for legal-critical software development. We use U.S. federal tax preparation software as a representative case study, where natural language tax code must be precisely translated into executable logic, ensuring high fidelity to regulatory requirements.
A fundamental challenge in developing legally critical software from specifications is the generation of test cases, which suffers from the oracle problem. Determining the correct output for a given scenario often requires interpreting legal statutes through a dialogic process, necessitating input from legal experts or independent reviewers. Prior research has proposed metamorphic testing as a potential solution by evaluating equivalence across similarly situated individuals. This paper adopts and extends this approach by introducing LLM agents specialized in generating metamorphic test cases. A key innovation of our work is a higher-order generalization of metamorphic tests, motivated by our case study of tax preparation software, wherein system outputs are analyzed across comparative shifts among similar individuals. Since manually generating such higher-order relations is tedious and error-prone, our agentic paradigm is well-suited for automating test case generation.
We design and implement cohorts of LLM-based agents, each simulating roles in real-world software development teams handling legal documents. Our framework includes a metamorphic testing agent that provides counterexamples while translating tax code into executable software logic. Our findings indicate that our agentic approach, utilizing smaller language models (e.g., GPT-4o-mini), outperforms frontier models (e.g., GPT-4o and Claude-3.5) in complex tax code generation scenarios, achieving a worst-case pass rate of 45% compared to 9%-15%. Furthermore, our evaluations reveal that incorporating higher-order metamorphic testing improves the pass rate in the most challenging scenarios by up to 50%. Our results present a compelling case for using agentic LLM-driven methodologies to generate robust and trustworthy legal-critical software from natural language specifications.
Authors: Sina Khiabani (non-IBM); Ashutosh Trivedi (non-IBM); STSM - AI4Z, Software Innovation Lab; Saeid Tizpaz-niari (non-IBM)
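As an illustration of a metamorphic relation for tax software, the sketch below checks that raising only wage income never lowers the tax owed; both the relation and the toy tax function are hypothetical, not drawn from the paper:

```python
def metamorphic_monotonic_income(compute_tax, base_return: dict) -> bool:
    """Illustrative metamorphic relation: with everything else held fixed,
    raising wage income should never lower the tax owed. `compute_tax` is the
    generated tax logic under test; no exact oracle value is needed."""
    shifted = dict(base_return, wages=base_return["wages"] + 1_000)
    return compute_tax(shifted) >= compute_tax(base_return)

def toy_tax(r: dict) -> float:
    """Toy stand-in for generated tax logic: 10% of wages above a fixed deduction."""
    return max(0, r["wages"] - 10_000) * 0.10

print(metamorphic_monotonic_income(toy_tax, {"wages": 50_000, "filing_status": "single"}))  # True
```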
- Description:
Modernizing legacy enterprise systems often involves translating PL/I programs into modern languages such as Java. This task becomes significantly more complex when PL/I macro procedures are involved. PL/I macro procedures are essentially string-manipulating programs that generate PL/I code, which makes automated translation more complex. Recently, large language models (LLMs) have been explored for automated code translation. However, LLM-based code translation struggles to translate PL/I macro procedures into Java programs that reproduce the behavior of the plain PL/I code generated by the original macro procedures.
This paper proposes a novel method called templatization, which uses symbolic execution to generate code templates (code with named placeholders) as an intermediate representation. By symbolically executing macro procedures and generating code templates, our approach enables LLMs to generate readable and maintainable Java code. Our preliminary experiment on ten PL/I macro procedures shows that LLM-based translation through templatization successfully generates Java programs that reproduce the behavior of the macro-generated PL/I programs.
Authors: Takaaki Tateishi (IBM), Yasu Katsuno (IBM)
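A minimal illustration of a code template with named placeholders, of the kind templatization could produce as an intermediate representation; the Java snippet and bindings below are hypothetical, not output of the paper's symbolic-execution step:

```python
from string import Template

# A code template: Java code with named placeholders left for macro-dependent parts.
java_template = Template("""
public static ${javaType} ${methodName}(${javaType} input) {
    return input * ${factor};
}
""")

# Bindings recovered for one concrete macro expansion:
print(java_template.substitute(javaType="int", methodName="scaleAmount", factor="100"))
```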
- Description:
Terraform is a popular Infrastructure-as-Code (IaC) tool for managing multi-cloud environments, but its providers evolve rapidly, introducing frequent breaking changes. These frequent updates pose migration challenges due to fragmented documentation and limited support, leading to delayed upgrades and accumulated technical debt. We present TerraMod, a framework that automates Terraform configuration migration across provider versions by leveraging external knowledge sources: changelogs, API schemas, and deprecation links. Evaluated on real-world breaking changes from the AWS Provider, TerraMod significantly reduces manual effort and mitigates technical lag. We plan to release the dataset upon publication.
Authors: Research Software Engineer, Pooja Aggarwal (IBM), Brent Paulovicks (IBM), Prateeti Mohapatra (IBM), Rong Lee (IBM), Vadim Sheinin (IBM)
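A tiny sketch of the kind of attribute-rename migration rule that could be derived from a provider changelog; the rule and configuration below are hypothetical, not taken from TerraMod or the AWS Provider:

```python
import re

# Hypothetical rename rule of the kind a changelog might describe.
RENAMES = {"legacy_acl": "acl_policy"}

def migrate(tf_source: str) -> str:
    """Apply attribute-rename rules to a Terraform configuration."""
    for old, new in RENAMES.items():
        tf_source = re.sub(rf"\b{old}\b", new, tf_source)
    return tf_source

config = 'resource "aws_s3_bucket" "logs" {\n  legacy_acl = "private"\n}\n'
print(migrate(config))
```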
Careers
More events
IBM at ICSE 2025
- Ottawa, Ontario, Canada
IBM at ECTC 2026
- Orlando, FL, USA
IBM at Open Source Summit NA 2026
- Minneapolis, MN, USA


