arXiv Query: search_query=all:"software testing"&id_list=&start=0&max_results=10

Feed

Robust Mutation Analysis of Quantum Programs Under Noise
Mutation analysis has long been used in classical software testing and has recently been adopted for assessing the robustness of quantum software testing techniques. However, existing studies assume ideal, noiseless execution, overlooking the impact of quantum hardware noise. In this paper, we present an empirical study of noise-aware mutation analysis for quantum programs. We analyze how noise affects mutant detection using 41 quantum programs, executed on noiseless and noisy simulators emulating three IBM devices with different noise profiles. We compare several distance metrics and thresholding strategies to evaluate mutant detection under realistic noise. Our results show that noise significantly alters the behavioral distance between programs and mutants, making equivalent mutants harder to distinguish from real faults. Density-matrix metrics achieve the best discrimination, with misclassification rates up to 16.77%, but are not accessible on real hardware. Among practical alternatives…
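The abstract does not spell out which distance metrics it compares, but a common practical choice for sampled measurement outcomes is the total variation distance between output distributions, with a threshold deciding whether a mutant is "killed". A minimal sketch of that thresholding idea (the function names and threshold value are illustrative, not the paper's):

```python
from collections import Counter

def tvd(counts_a, counts_b):
    """Total variation distance between two measurement-outcome count maps."""
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    outcomes = set(counts_a) | set(counts_b)
    return 0.5 * sum(abs(counts_a.get(o, 0) / n_a - counts_b.get(o, 0) / n_b)
                     for o in outcomes)

def mutant_killed(original_counts, mutant_counts, threshold=0.1):
    """Flag a mutant as detected when its output distribution deviates
    from the original program's by more than the threshold."""
    return tvd(original_counts, mutant_counts) > threshold

base = Counter({"00": 500, "11": 500})
# Identical distributions: distance 0, mutant survives.
print(mutant_killed(base, Counter({"00": 500, "11": 500})))  # False
# A strongly shifted distribution exceeds the threshold and is killed.
print(mutant_killed(base, Counter({"00": 900, "11": 100})))  # True
```

Under hardware noise, even an unmutated program drifts from its ideal distribution, which is exactly why the threshold choice becomes delicate.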
1 day ago
VISOR: A Vision-Language Model-based Test Oracle for Testing Robot
Testing robots requires assessing whether they perform their intended tasks correctly, dependably, and with high quality, a challenge known as the test oracle problem in software testing. Traditionally, this assessment relies on task-specific symbolic oracles for task correctness and on human manual evaluation of robot behavior, which is time-consuming, subjective, and error-prone. To address this, we propose VISOR, a Vision-Language Model (VLM)-based approach for automated test oracle assessment that eliminates the need for expensive human evaluations. VISOR performs automated evaluation of task correctness and quality, addressing the limitations of existing symbolic test oracles, which are task-specific and provide pass/fail judgments without explicitly quantifying task quality. Given the inherent uncertainty in VLMs, VISOR also explicitly quantifies its own uncertainty during test assessments. We evaluated VISOR using two VLMs, i.e., GPT and Gemini, across four robotic tasks on over…
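The abstract does not say how VISOR quantifies its uncertainty; one simple, commonly used scheme is to sample the model's verdict several times and treat disagreement as uncertainty. A rough sketch under that assumption (the stub below stands in for a real VLM call, which would receive camera frames and a task description):

```python
def assess_with_uncertainty(query_vlm, prompt, n_samples=5):
    """Query a (stubbed) VLM several times and report the majority verdict
    together with an agreement-based uncertainty estimate."""
    verdicts = [query_vlm(prompt) for _ in range(n_samples)]
    passed = verdicts.count("pass")
    verdict = "pass" if passed > n_samples / 2 else "fail"
    agreement = max(passed, n_samples - passed) / n_samples
    return verdict, 1.0 - agreement  # 0.0 means all samples agreed

# Stub responses; a production oracle would call GPT or Gemini here.
responses = iter(["pass", "pass", "fail", "pass", "pass"])
verdict, uncertainty = assess_with_uncertainty(
    lambda p: next(responses), "did the arm stack the blocks?")
print(verdict, round(uncertainty, 2))  # pass 0.2
```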
3 days ago
Text-to-CAD Evaluation with CADTests
Text-to-CAD has recently emerged as an important task with the potential to substantially accelerate design workflows. Despite its significance, there has been surprisingly little work on Text-to-CAD evaluation, and assessing CAD model generation performance remains a considerable challenge. In this work, we introduce a new evaluation perspective for Text-to-CAD based on automated testing. We propose CADTestBench, the first test-based benchmark for Text-to-CAD, based on CADTests: executable software tests that verify whether a generated CAD model satisfies the geometric and topological requirements of the input prompt. Using CADTestBench, we conduct comprehensive benchmarking of recent Text-to-CAD methods and further demonstrate that CADTests can also guide CAD model generation, yielding simple baselines that surpass the performance of current methods. CADTestBench code and data are available at GitHub and Hugging Face dataset.
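The abstract's notion of an executable test over geometric and topological requirements can be illustrated with a toy example; the `BoxModel` type and the test function below are hypothetical stand-ins (a real CADTest would inspect a B-rep kernel's geometry instead):

```python
from dataclasses import dataclass

@dataclass
class BoxModel:
    # Minimal stand-in for a generated CAD model.
    width: float
    height: float
    depth: float
    faces: int

def cad_test_rectangular_box(model, w, h, d, tol=1e-6):
    """Check geometric (dimensions) and topological (face count)
    requirements implied by a prompt like 'a 2x3x4 box'."""
    dims_ok = (abs(model.width - w) < tol and abs(model.height - h) < tol
               and abs(model.depth - d) < tol)
    topo_ok = model.faces == 6  # a rectangular box has exactly 6 faces
    return dims_ok and topo_ok

print(cad_test_rectangular_box(BoxModel(2.0, 3.0, 4.0, faces=6), 2, 3, 4))  # True
print(cad_test_rectangular_box(BoxModel(2.0, 3.0, 4.0, faces=7), 2, 3, 4))  # False
```

Such a pass/fail check can also serve as a search signal during generation, which is how the abstract's "CADTests can also guide CAD model generation" claim would plug in.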
6 days ago
System Test Generation for Virtual Reality Applications using Scenario Models
Virtual Reality (VR) applications are increasingly being integrated across a wide range of domains, including surgical training and industrial marketing. However, the long-term adoption and maintenance of VR applications remain limited, particularly due to the lack of effective, systematic, and reproducible software testing approaches tailored to their unique characteristics. To address this issue, we introduce UltraInstinctVR, a novel testing approach for VR applications. Relying on predefined VR models (scenarios), it automates the generation and execution of concrete VR system tests. In our empirical evaluation, we compare UltraInstinctVR with state-of-the-art automated VR testing approaches in terms of coverage and failure detection on 10 open-source VR applications. The results show that UltraInstinctVR outperforms existing automated tools for detecting unique failures and provides valuable insights for identifying real-world bugs in VR applications.
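The core idea of expanding a predefined scenario model into concrete system tests can be sketched generically; the scenario structure below is hypothetical and only illustrates the expansion step, not UltraInstinctVR's actual model format:

```python
import itertools

# Hypothetical scenario model: abstract steps, each with one parameter domain.
scenario = [
    {"action": "teleport", "zone": ["lobby", "lab"]},
    {"action": "grab", "object": ["lever", "button"]},
]

def generate_tests(model):
    """Expand each abstract step's parameter domain into concrete
    executable test sequences (full cartesian product of choices)."""
    domains = []
    for step in model:
        params = {k: v for k, v in step.items() if k != "action"}
        key, values = next(iter(params.items()))  # assumes one parameter per step
        domains.append([(step["action"], key, v) for v in values])
    return [list(combo) for combo in itertools.product(*domains)]

tests = generate_tests(scenario)
print(len(tests))  # 4 concrete system tests
```

A real tool would then execute each sequence against the running VR application and monitor for failures, which the toy above does not attempt.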
6 days ago
Empirical Derivations from an Evolving Test Suite
The paper presents a longitudinal empirical analysis of the automated, continuous, and virtualization-based software test suite of the NetBSD operating system. The longitudinal period observed spans from the initial roll-out of the test suite in the early 2010s to late 2025. According to the results, the test suite has grown continuously, currently covering over ten thousand individual test cases. Failed test cases exhibit overall stability, although there have been shorter periods marked with more frequent failures. A similar observation applies to build failures, failures of the test suite to complete, and installation failures, all of which are also captured by NetBSD's testing framework. Finally, code churn and kernel modifications do not provide longitudinally consistent statistical explanations for the failures. Although some periods exhibit larger effects, including particularly with respect to the kernel modifications, the effects are small on average. Even though only in a…
6 months ago
Exploring the Output of Software Testing Tools through a Visual Comparative Analysis
Software testing is a fundamental process of software development, and prior work has shown that visualizations of test results support testers' decision-making. However, Human-Computer Interaction research on software testing has yet to explore and understand the shared interface elements and patterns in visualization of testing outputs. To address this, we conducted a visual comparative analysis of the output of 50 software testing tools and harnesses (44 with CLI output, 6 with GUI output) across four popular programming languages. Our analysis reveals the common interface elements in software testing tools, how these tools display and visualize test results, as well as the specific make-up of the output. Our findings provide insight into how visual testing output is formatted and how colour is used across both CLI and GUI environments, identifying trends that can be applied by developers of testing tools.
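The kind of CLI output pattern the study surveys, a coloured PASS/FAIL tag per test, is easy to demonstrate with standard ANSI SGR escape codes (the function below is a generic illustration, not taken from any of the surveyed tools):

```python
GREEN, RED, RESET = "\033[32m", "\033[31m", "\033[0m"

def format_result(name, passed, use_colour=True):
    """Render one test result the way many CLI runners do: a coloured
    PASS/FAIL tag followed by the test name. Colour is optional so the
    output stays readable when piped to a file."""
    tag = "PASS" if passed else "FAIL"
    if use_colour:
        tag = (GREEN if passed else RED) + tag + RESET
    return f"[{tag}] {name}"

print(format_result("test_login", True, use_colour=False))   # [PASS] test_login
print(format_result("test_logout", False, use_colour=False)) # [FAIL] test_logout
```

The `use_colour` toggle mirrors a common convention in real runners, which disable colour when stdout is not a terminal.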
9 days ago
Randomized and Diverse Input State Generation for Quantum Program Testing
With the accelerating development of quantum technologies and their growing computational potential, quantum systems are being adapted for simulations and other critical tasks across diverse domains, making the reliability of the corresponding quantum software an essential concern. Although recent efforts have started to incorporate quantum-specific properties such as magnitude, phase, and entanglement into software testing in the form of input-coverage criteria, the unique structure of the quantum state space demands more comprehensive testing. In particular, the notion of complete state-space exploration has so far received little attention. To address this gap, we propose a framework for evaluating test circuit generators with respect to their coverage of the quantum state space. Our contribution is threefold: we develop a set of diversity scores that capture both local and global indicators of the extent to which the state space is explored; we propose a test circuit generator…
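The abstract's diversity scores are not defined there; one plausible global indicator is the mean pairwise infidelity among sampled states, which grows as the sampled states spread over the state space. A single-qubit sketch under that assumption (the sampling and scoring choices here are illustrative, not the paper's):

```python
import cmath
import itertools
import math
import random

def random_qubit_state(rng):
    """Random single-qubit pure state, uniform on the Bloch sphere,
    as a pair of complex amplitudes (alpha, beta)."""
    theta = math.acos(1 - 2 * rng.random())
    phi = 2 * math.pi * rng.random()
    return (math.cos(theta / 2), cmath.exp(1j * phi) * math.sin(theta / 2))

def diversity(states):
    """Global diversity indicator: mean pairwise infidelity 1 - |<a|b>|^2.
    Higher values mean the sample spreads more over the state space."""
    pairs = list(itertools.combinations(states, 2))
    infids = [1 - abs(a[0].conjugate() * b[0] + a[1].conjugate() * b[1]) ** 2
              for a, b in pairs]
    return sum(infids) / len(infids)

rng = random.Random(0)
spread = diversity([random_qubit_state(rng) for _ in range(50)])
clustered = diversity([(1.0, 0.0)] * 10)  # every state is |0>
print(round(clustered, 3))  # 0.0
print(spread > clustered)   # True
```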
9 days ago
Toward Automated Virtual Electronic Control Unit (ECU) Twins for Shift-Left Automotive Software Testing
Automotive software increasingly outpaces hardware availability, forcing late integration and expensive hardware-in-the-loop (HiL) bottlenecks. The InnoRegioChallenge project investigated whether a virtual test and integration environment can reproduce electronic control unit (ECU) behavior early enough to run real software binaries before physical hardware exists. We report a prototype that generates instruction-accurate processor models in SystemC/TLM 2.0 using an agentic, feedback-driven workflow coupled to a reference simulator via the GNU Debugger (GDB). The results indicate that the most critical technical risk -- CPU behavioral fidelity -- can be reduced through automated differential testing and iterative model correction. We summarize the architecture, the agentic modeling loop, and project outcomes, and we discuss the technical approach in a manner consistent with the reported qualitative findings. While cloud-scale deployment and full toolchain integration remain future work…
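The differential-testing loop the abstract describes amounts to stepping the reference simulator and the generated model in lock step and comparing architectural state after each instruction. A stubbed sketch of that loop (the `StubCPU` class is a toy stand-in for the real GDB-coupled simulators):

```python
def differential_step_test(reference, candidate, program, max_steps=1000):
    """Lock-step differential test: execute one instruction on both the
    reference simulator and the candidate model, comparing architectural
    state after every step. Returns the index of the first diverging
    step, or None if the run matches to completion."""
    for step in range(max_steps):
        ref_state = reference.step(program)
        cand_state = candidate.step(program)
        if ref_state != cand_state:
            return step  # feed this divergence back into model correction
        if ref_state.get("halted"):
            return None
    return None

class StubCPU:
    # Toy stand-in for a GDB-driven simulator: one counter register.
    def __init__(self, bug_at=None):
        self.pc, self.bug_at = 0, bug_at
    def step(self, program):
        self.pc += 1
        r0 = self.pc if self.pc != self.bug_at else -1  # injected fault
        return {"pc": self.pc, "r0": r0, "halted": self.pc >= len(program)}

print(differential_step_test(StubCPU(), StubCPU(), ["nop"] * 5))          # None
print(differential_step_test(StubCPU(), StubCPU(bug_at=3), ["nop"] * 5))  # 2
```

In the agentic workflow, the returned divergence point would drive the next iteration of model correction.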
3 months ago
FeedbackLLM: Metadata driven Multi-Agentic Language Agnostic Test Case Generator with Evolving prompt and Coverage Feedback
Traditional approaches to test case generation often involve manual effort and incur significant computational overhead. Additionally, these approaches are not scalable, and hence, unsuitable for complex software systems. Recently, Large Language Models (LLMs) have been applied to software testing. However, single-shot prompt engineering-based approaches tend to hallucinate and generate redundant test cases, resulting in fewer covered branches. To handle the above-mentioned limitations, in this paper, we propose FeedbackLLM, a novel automated language-agnostic test case generation framework based on a tightly coupled two-stage approach. In the first stage, FeedbackLLM extracts the input constraints by parsing source code and generates the possible test cases. The quality of the test cases is evaluated in the second stage by the following two specialized LLM feedback agents: (i) Line Feedback Agent: extracts the metadata related to missed line executions and (ii) Branch Feedback Agent: extracts the…
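The overall shape of such a coverage-feedback loop, regenerate tests with the missed lines and branches folded back into the prompt, can be sketched with stubs; `generate_tests` and `run_coverage` below are hypothetical stand-ins for the LLM call and the coverage tool, and their signatures are illustrative only:

```python
def generate_with_feedback(generate_tests, run_coverage, rounds=3):
    """Iteratively regenerate tests, feeding missed lines and branches
    back to the generator until coverage is complete or rounds run out."""
    feedback, tests = "", []
    for _ in range(rounds):
        tests = generate_tests(feedback)
        missed_lines, missed_branches = run_coverage(tests)
        if not missed_lines and not missed_branches:
            break
        feedback = (f"cover lines {sorted(missed_lines)} "
                    f"and branches {sorted(missed_branches)}")
    return tests

# Stubs: the generator only finds the edge case once it sees feedback.
def fake_generate(feedback):
    return ["t_basic"] if not feedback else ["t_basic", "t_edge"]

def fake_coverage(tests):
    return (set(), set()) if "t_edge" in tests else ({7}, {"if:7"})

print(generate_with_feedback(fake_generate, fake_coverage))  # ['t_basic', 't_edge']
```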
12 days ago
ClozeMaster: Fuzzing Rust Compiler by Harnessing LLMs for Infilling Masked Real Programs
Ensuring the reliability of the Rust compiler is of paramount importance, given the increasing adoption of Rust for critical systems development due to its emphasis on memory and thread safety. However, generating valid test programs for the Rust compiler poses significant challenges, given Rust's complex syntax and strict requirements. With the growing popularity of large language models (LLMs), much research in software testing has explored using LLMs to generate test cases. Still, directly using LLMs to generate Rust programs often results in a large number of invalid test cases. Existing studies have indicated that test cases triggering historical compiler bugs can assist in software testing. Our investigation into Rust compiler bug issues supports this observation. Inspired by existing work and our empirical research, we introduce a bracket-based masking and filling strategy called clozeMask. The clozeMask strategy involves extracting test code from historical issue reports, identifying…
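The bracket-based masking step can be illustrated with a toy masker that finds balanced `{...}` blocks in a seed program and blanks their bodies for an LLM to infill. This is a sketch in the spirit of the clozeMask idea, not the paper's actual tool:

```python
def cloze_mask(code, mask="<MASK>", max_masks=2):
    """Locate balanced top-level {...} blocks via a bracket stack and
    replace each block's body with a mask token for LLM infilling."""
    spans, stack = [], []
    for i, ch in enumerate(code):
        if ch == "{":
            stack.append(i)
        elif ch == "}" and stack:
            start = stack.pop()
            if not stack:  # only record top-level blocks
                spans.append((start, i))
    masked = code
    # Replace from right to left so earlier spans keep their indices.
    for start, end in reversed(spans[:max_masks]):
        masked = masked[:start + 1] + mask + masked[end:]
    return masked

seed = "fn main() { let x = 1; } fn f() { x } "
print(cloze_mask(seed).strip())
# fn main() {<MASK>} fn f() {<MASK>}
```

The LLM's fillings are then compiled; programs that the compiler rejects unexpectedly, or that crash it, become fuzzing findings.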
13 days ago