Feed for arXiv Query: search_query=all:"software testing"&id_list=&start=0&max_results=10

arXiv Query: search_query=all:"software testing"&id_list=&start=0&max_results=10

https://arxiv.org/api/query?search_query=all:%22software+testing%22&start=0&max_results=10&id_list=
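The feed URL above can be rebuilt from its parameters. The following is a minimal sketch using only the Python standard library; the function name `build_query_url` and its defaults are illustrative assumptions, not part of the arXiv API. The colon is kept unescaped so the result matches the URL as shown.

```python
from urllib.parse import urlencode

# Base endpoint taken from the feed URL above.
ARXIV_API = "https://arxiv.org/api/query"


def build_query_url(search_query: str, start: int = 0, max_results: int = 10) -> str:
    """Build an arXiv API query URL (hypothetical helper, for illustration only)."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
        "id_list": "",  # left empty, as in the feed URL
    }
    # safe=":" keeps the field prefix "all:" readable; spaces become "+",
    # and double quotes become %22.
    return ARXIV_API + "?" + urlencode(params, safe=":")


if __name__ == "__main__":
    print(build_query_url('all:"software testing"'))
```

Changing `start` and `max_results` pages through the same result set.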

Feed

LLMCFG-TGen: Using LLM-Generated Control Flow Graphs to Automatically Create Test Cases from Use Cases

Appropriate test-case generation is critical in software testing and significantly impacts testing quality. Requirements-Based Test Generation (RBTG) derives test cases from software requirements to verify whether system behavior aligns with user needs and expectations. Requirements are often documented in Natural Language (NL), with use-case descriptions being a popular method for capturing functional behaviors and interaction flows in a structured, readable form. Recently, Large Language Models (LLMs) have shown strong potential for automating test generation from NL requirements. However, existing LLM-based approaches often fail to ensure comprehensive and non-redundant coverage, and may not adequately capture complex conditional logic, leading to incomplete test cases. To address these limitations, we propose an end-to-end approach called Test Generation based on LLM-generated Control Flow Graphs (LLMCFG-TGen), which generates test cases from NL use-case descriptions. It consists o

7 months ago

How Much Static Structure Do Code Agents Need? A Study of Deterministic Anchoring

LLM-based code agents navigate repositories through keyword search but miss the structural relationships, such as call graphs, inheritance hierarchies, and configuration dependencies, that define how software actually works. This makes agent navigation stochastic and difficult to reproduce across runs. We investigate whether lightweight static analysis can provide deterministic anchors for these agents: stable structural facts injected as plain-text comments that constrain probabilistic exploration and make navigation more predictable. Starting from a strong baseline, Codex from OpenAI, we systematically inject varying granularities of structural annotations and measure their effects on localization, trajectory behavior, and run-to-run stability. Our study identifies what we call the deterministic anchoring effect: static structure helps less by making agents "smarter" and more by making their navigation disciplined and reproducible. Three observations support this finding: (1) Anchori

3 days ago

To Run or Not to Run: Analyzing the Cost-Effectiveness of Code Execution in LLM-Based Program Repair

LLM-based agents for program repair are increasingly built on a "generate-run-revise" paradigm, iteratively executing tests to evaluate and refine patches. This execution-based approach has become standard practice in state-of-the-art systems. However, executions can be time-consuming and expensive, yet their impact on these agents remains underexplored. In this paper, we conduct a two-stage empirical study over execution behavior in LLM-based program repair. To characterize execution behavior at scale, we first analyze 7,745 agent traces from SWE-bench leaderboard submissions. Second, we evaluate 3,000 end-to-end repair attempts across 200 SWE-bench instances and three agents (Claude Code, Codex, and the open-source OpenCode) under four execution paradigms, which allows for a fine-grained comparison of performance and cost. Our analysis reveals three key observations: (1) Code execution is used across all agents and models analyzed, with an average of 8.8 test runs per task. Execution

3 days ago

ISTQB Certifications Under the Lens: Their Contributions to the Software-Testing Profession; and AI-assisted Synthesis of Practitioners' Endorsements and Criticisms

Context: The International Software Testing Qualifications Board (ISTQB) certification dominates global software testing practice, with 1.2+ million certifications issued across 130+ countries. Yet it remains contested: practitioners value it for career advancement and shared terminology, while others criticize it as overly theoretical and somewhat disconnected from real-world testing practices. Objective: This study investigates the perceived value and critique of ISTQB certifications, the most widely recognized testing qualifications worldwide. Method: We conducted an AI-assisted Multivocal Literature Review (MLR), combining academic and grey literature to synthesize practitioner endorsements (RQ1) and criticisms (RQ2). ChatGPT's deep research capability was employed under continuous human oversight, with QA strategies ensuring transparency and reliability. As another analysis, we asked a panel of four independent experts to evaluate the precision of endorsements and fairness of crit

3 months ago

Humor in Software Testing Education

Software testing is often perceived as monotonous, which can negatively influence students' emotional engagement with testing. While prior work suggests that humor can increase engagement in professional software development contexts, we know little about humor's effect in software testing education. This paper explores how humorous elements in software testing assignments affect students' emotional engagement, sense of belonging, and creative thinking. We introduced humor in introductory software testing courses at universities in Canada and Germany, and conducted a mixed-methods study with students. Our results show that humor had a strong positive influence on students' experiences of software testing. Students perceived testing as more engaging and less monotonous, felt more comfortable and accepted in class, and reported increased creative thinking about testing tasks. These effects were particularly strong for female students, especially with respect to sense of belonging. Our fi

9 days ago

Learning Critical Testing Literacy Through Puzzles: an Experience Report

In this paper, we report our experiences and takeaways from workshops using puzzles to learn CTL. Background: Software testing is important yet difficult to teach. We introduced a BoK of puzzle-based learning activities to teach CTL, based on a model of critical tester's cognition, leading to the pedagogical framework P4TEST. We conducted thirteen workshops with students, testers, teachers, and primary school pupils to assess puzzle-based teaching of critical testing literacy. Experience: Across eleven workshops, we used a semi-structured approach, varying puzzles, materials, and timing. In two additional workshops, we introduced workbooks and think-aloud sessions to gather more data on the learning experience. Observations: Participants consistently perceived themselves as experimenting while solving puzzles. Students tended to converge on solutions, while professionals continued exploring. Emotions were visible in behaviour but hard to surface through written reflection alone. Think-

10 days ago

AI-Driven Test Case Generation from Natural Language Requirements: A Survey of Techniques and Research Gaps

Software testing is critical for verifying that systems meet specified requirements, yet remains among the most time-consuming and expensive activities in development. Requirements-based test generation allows test cases to be derived early from requirements artifacts, but generating them directly from natural language is challenging due to inherent ambiguity and imprecision. Recent advances in AI, natural language processing (NLP), and large language models (LLMs) have made automating this pipeline increasingly feasible, while introducing new risks including hallucination, reduced traceability, and inconsistent evaluation. This survey addresses four research questions: what AI and NLP techniques have been proposed for generating test cases from natural language requirements; what tools and frameworks support these approaches; how generated test cases are evaluated; and what research gaps remain. Following Kitchenham and Charters' systematic review guidelines, we searched major scholar

24 days ago

Deja Vu at Scale: Paraphrase-Robust Detection of Duplicate Gherkin Steps in Behaviour-Driven Software Testing with Sentence-Transformer Embeddings and a 1.1M-Step Open Benchmark

Context. Behaviour-Driven Development (BDD) suites in Gherkin accumulate step-text duplication with documented maintenance cost. Prior detectors either require runnable tests or are single-organisation, leaving a gap: a static, paraphrase-robust, step-level detector and a public benchmark to calibrate it. Objective. We release (i) the largest cross-organisational BDD step corpus to date, (ii) a labelled pair-level calibration benchmark, and (iii) a four-strategy detector with a consolidation-savings model linking clusters to ISO/IEC 25010 maintainability sub-characteristics. Method. The corpus contains 347 public GitHub repositories, 23,667 .feature files, and 1,113,616 Gherkin steps, SPDX-tagged. The detector layers exact hashing, normalised Levenshtein, sentence-transformer cosine, and a Levenshtein-banded hybrid. Calibration uses 1,020 manually labelled step pairs under a released rubric (60-pair overlap, Fleiss kappa = 0.84). We report precision, recall, and F1 with bootstrap 95% C

2 months ago
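The abstract above names exact hashing and normalised Levenshtein as the first two detector layers. The sketch below illustrates those two generic techniques only; it is not the paper's implementation. The whitespace and case normalisation, the division by the longer string's length, and the 0.9 threshold are all assumptions chosen for illustration.

```python
import hashlib


def step_hash(step: str) -> str:
    """Hash a step after lowercasing and collapsing whitespace (assumed normalisation)."""
    canonical = " ".join(step.lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance, computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalised_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: one minus distance over the longer length (assumed scheme)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Layered check: exact hash first, then edit-distance similarity.

    The threshold is a hypothetical value, not one reported in the abstract.
    """
    if step_hash(a) == step_hash(b):
        return True
    return normalised_similarity(a, b) >= threshold
```

Running the cheap hash comparison first means the quadratic edit-distance computation is only needed for pairs that are not already identical after normalisation.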

Characterizing Tests in IoT Software: Practices, Challenges and Opportunities

The Internet of Things (IoT) is experiencing rapid growth. Smart devices are emerging in smart homes and industrial applications, performing mission-critical tasks. Bugs in IoT software can lead to severe consequences. For example, a buggy smart lock can allow unauthorized access to a private property. Testing is a primary practice to expose software bugs and ensure software quality. However, little is known about how IoT software is tested. To bridge this gap, we conducted the first empirical study on test cases in open-source IoT software. Specifically, we evaluated the effectiveness of test cases in IoT software, explored the challenges inherent in testing IoT software, and analyzed the usage of mock objects. Our results indicate that while IoT software often contains a considerable number of tests, their effectiveness remains limited. We identified the primary challenges in testing IoT software as managing complex interactions with various external dependencies, such as other netwo

18 days ago

When Generic Prompt Improvements Hurt: Evaluation-Driven Iteration for LLM Applications

Evaluating Large Language Model (LLM) applications differs from conventional software testing because outputs are probabilistic, semantically variable, and sensitive to prompt and model changes. This technical report proposes the Minimum Viable Evaluation Suite (MVES), an audit-oriented structure for application-level LLM evaluation. MVES links application categories to failure modes, metrics, required artifacts, and validation evidence across general LLM applications, retrieval-augmented systems, and agentic workflows. We pair the framework with a reproducible local evaluation harness covering structured extraction, RAG citation/content-compliance, and instruction-following checks. Using Ollama with Llama 3 8B Instruct and Qwen 2.5 7B Instruct, we evaluate five prompt conditions over expanded 30-case-per-suite ablations. The results show that, in the tested local conditions, generic prompt additions do not produce monotonic improvements: stronger output-contract prompts improve strict

5 months ago