Our engineering team ran a twelve-week pilot to evaluate how autonomous AI agents can assist mainframe developers. The objective was to increase developer throughput while keeping guardrails around source code confidentiality. We are sharing the results that matter most to practitioners. The findings highlight the conditions under which AI support was decisive, along with the safety controls that prevented regressions.
Study Design
We instrumented the Zcrafter CLI with lightweight analytics that record anonymized task signatures. The pilot group covered three enterprise teams with a mix of COBOL, JCL, and REXX workloads. Each environment remained isolated, and agents never saw customer data: prompts were abstracted into structural metadata such as paragraph names, dependency graphs, and error codes.
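As a rough illustration of what that abstraction can look like, the sketch below reduces a COBOL member to paragraph names, PERFORM edges, and observed error codes while discarding the source text. The `TaskSignature` shape and `abstract_member` helper are invented for this post, not part of the Zcrafter CLI, and the regexes are deliberately simplified.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple

# Illustrative only: reduce a COBOL member to structural metadata before
# anything reaches an agent. Raw source text never leaves this function.
PARA_DEF = re.compile(r"^\s{0,7}([A-Z0-9][A-Z0-9-]*)\.\s*$")   # paragraph label on its own line
PERFORM = re.compile(r"\bPERFORM\s+([A-Z0-9][A-Z0-9-]*)")

@dataclass
class TaskSignature:
    member: str
    paragraphs: List[str] = field(default_factory=list)
    dependencies: List[Tuple[str, str]] = field(default_factory=list)  # (caller, target)
    error_codes: List[str] = field(default_factory=list)               # e.g. runtime abend codes

def abstract_member(member: str, source: str, error_codes: List[str]) -> TaskSignature:
    """Keep paragraph names, PERFORM edges, and error codes; drop everything else."""
    sig = TaskSignature(member=member, error_codes=list(error_codes))
    current = None
    for line in source.upper().splitlines():
        m = PARA_DEF.match(line)
        if m:
            current = m.group(1)
            sig.paragraphs.append(current)
            continue
        for target in PERFORM.findall(line):
            if current is not None:
                sig.dependencies.append((current, target))
    return sig
```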
Developers opted in when they started a lint session or scheduled autonomous monitoring. The agent proposed edits, supplied rationale, and awaited explicit approval. We logged timestamps, the number of prompts required per task, and whether the human accepted or amended the recommendation.
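For concreteness, the record below shows the kind of fields that logging implies; the names are illustrative rather than the pilot's actual telemetry schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Illustrative per-task record: timestamps, prompt count, and the human decision.
@dataclass
class TaskRecord:
    session_id: str
    task_type: str                  # e.g. "lint" or "monitor"
    started_at: datetime
    finished_at: Optional[datetime]
    prompt_count: int               # prompts the agent needed for this task
    outcome: str                    # "accepted", "amended", or "rejected"

record = TaskRecord(
    session_id="pilot-team-a-0042",
    task_type="lint",
    started_at=datetime.now(timezone.utc),
    finished_at=None,
    prompt_count=3,
    outcome="amended",
)
```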
Key Observations
Runtime Diagnostics Improve First-Pass Fixes
When an agent captured JCL execution traces and summarized abends, human operators resolved issues 28 percent faster on average across nine pilot jobs.
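The summarization step can be as simple as tallying abend codes out of the job output before handing a digest to the operator. The sketch below is a minimal example; message formats vary by shop, so the single pattern shown is not a complete catalog.

```python
import re
from collections import Counter

# Minimal abend tally: scan a job log for system (Sxxx) and user (Unnnn) abend codes.
ABEND = re.compile(r"ABEND[= ]*(S[0-9A-F]{3}|U\d{4})", re.IGNORECASE)

def summarize_abends(job_log: str) -> str:
    codes = Counter(code.upper() for code in ABEND.findall(job_log))
    if not codes:
        return "No abends detected in this execution trace."
    lines = [f"{code}: {count} occurrence(s)" for code, count in codes.most_common()]
    return "Abend summary:\n" + "\n".join(lines)

print(summarize_abends("IEF450I PAYROLL1 STEP020 - ABEND=S0C7 REASON=00000000"))
```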
Dataset Hygiene Through Assisted Reviews
Guided COBOL linting reduced duplicate paragraphs and unreachable PERFORM targets by 42 percent in an 18,000-line code base without exposing proprietary business logic.
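One of the underlying checks can be sketched as a simple pass over paragraph labels and PERFORM references. The function below flags paragraphs that are never referenced; it deliberately ignores fall-through, THRU ranges, and ALTER, which a production linter must handle, so treat it as a teaching example rather than the pilot's linter.

```python
import re
from typing import List, Set

# Naive check: paragraphs defined but never named by a PERFORM or GO TO.
PARA_DEF = re.compile(r"^\s{0,7}([A-Z0-9][A-Z0-9-]*)\.\s*$")
REFERENCE = re.compile(r"\b(?:PERFORM|GO\s+TO)\s+([A-Z0-9][A-Z0-9-]*)")

def unreferenced_paragraphs(source: str) -> List[str]:
    defined: List[str] = []
    referenced: Set[str] = set()
    for line in source.upper().splitlines():
        m = PARA_DEF.match(line)
        if m:
            defined.append(m.group(1))
        else:
            referenced.update(REFERENCE.findall(line))
    # The first paragraph is the entry point and is reached by fall-through.
    return [p for p in defined[1:] if p not in referenced]
```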
Risk Controls Remain Essential
Every automated recommendation passed through a human approval queue. The queue rejected 11 percent of suggestions, usually where legacy business rules were implicit and underdocumented.
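The gate itself does not need to be elaborate. The sketch below captures the property that mattered in the pilot: nothing is applied without an explicit decision, and every rejection carries a reason that can be analyzed later. The class and function names are illustrative.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional

class Decision(Enum):
    APPROVED = "approved"
    REJECTED = "rejected"

@dataclass
class Suggestion:
    member: str
    description: str
    decision: Optional[Decision] = None   # None means "not yet reviewed"
    reason: str = ""

def review(suggestion: Suggestion, approve: bool, reason: str = "") -> Suggestion:
    suggestion.decision = Decision.APPROVED if approve else Decision.REJECTED
    suggestion.reason = reason
    return suggestion

def apply_approved(queue: List[Suggestion], apply_fn: Callable[[Suggestion], None]) -> None:
    for s in queue:
        if s.decision is Decision.APPROVED:   # unreviewed and rejected items are never applied
            apply_fn(s)
```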
What Worked
The most consistent gains came from pairing a developer with an agent that handled the mechanical steps: dataset discovery, syntax linting, and regression detection. The human stayed focused on business rules. Repetition-heavy procedures such as member renames or JCL overrides benefited from templated prompts that enforced standards. Teams remarked that knowledge sharing accelerated because the agent narrated each recommendation with references to manuals or existing code.
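A templated prompt for one of those repetition-heavy tasks might look like the sketch below. The wording, placeholders, and standard identifier are invented for illustration; the point is that the template is fixed centrally rather than typed ad hoc.

```python
from string import Template

# Illustrative template for a narrow JCL override request.
JCL_OVERRIDE_PROMPT = Template(
    "You are assisting with a JCL override.\n"
    "Job: $job_name  Step: $step_name\n"
    "Change only the DD statement $dd_name to reference dataset $new_dataset.\n"
    "Do not modify any other statements. Follow site naming standard $standard_id.\n"
    "Return the proposed statement with a one-line rationale and wait for approval."
)

prompt = JCL_OVERRIDE_PROMPT.substitute(
    job_name="PAYROLL1",
    step_name="STEP020",
    dd_name="INFILE",
    new_dataset="TEST.PAYROLL.INPUT",
    standard_id="STD-114",
)
```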
What Required Caution
The agent occasionally overcorrected long-standing conventions that were undocumented yet deliberate. Human oversight caught these cases. We also observed that prompt drift can occur when operators tweak instructions mid-session. Establishing a centrally reviewed prompt library reduced unexpected phrasing and kept latency predictable.
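One lightweight way to enforce such a library is to refuse prompts whose text no longer matches the reviewed copy. The loader below is a sketch under assumed conventions (JSON files with a stored checksum); it is not how Zcrafter manages prompts.

```python
import hashlib
import json
from pathlib import Path

# Sketch: prompts live in a reviewed library; locally edited copies are rejected.
def load_prompt(library_dir: Path, name: str) -> str:
    entry = json.loads((library_dir / f"{name}.json").read_text(encoding="utf-8"))
    body = entry["body"]
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    if digest != entry["sha256"]:
        raise ValueError(
            f"Prompt '{name}' does not match its reviewed checksum; refusing drifted copy."
        )
    return body
```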
Adoption Checklist
If you are evaluating a similar rollout, start with the following checklist compiled from the pilot.
- Inventory your existing automation scripts and identify steps that already follow deterministic workflows.
- Limit the pilot scope to non-production datasets or test LPARs paired with a playback environment.
- Establish a red-team review of masking rules to ensure sensitive account data never leaves the mainframe boundary.
- Log every agent action with session-level provenance so auditors can replay the decision path (a logging sketch follows this list).
- Encourage engineers to annotate accepted or rejected suggestions. These annotations sharpen future prompts without storing the underlying source code.
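For the provenance item, an append-only journal is often enough to make replay possible. The sketch below assumes a JSON Lines file and invented field names; adapt it to whatever audit store your shop already trusts.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Sketch: append every agent action to a write-once journal for later replay.
def record_action(journal: Path, session_id: str, action: str, detail: dict) -> None:
    entry = {
        "session_id": session_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,    # e.g. "proposed_edit", "human_approved", "human_rejected"
        "detail": detail,    # structural metadata only, never raw source
    }
    with journal.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
```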
Looking Ahead
The next phase focuses on automated generation of regression proofs. We want each accepted suggestion to emit a durable audit artifact that explains the rationale line by line. This will help compliance teams verify that generative assistance stayed within policy. We will publish follow-up notes once the results are validated.
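As a working draft of what such an artifact could contain, the structure below pairs each changed line with its rationale and records which policy checks passed. The schema is a sketch we are still iterating on, not a finalized format.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List

@dataclass
class LineRationale:
    line_ref: str        # structural reference (paragraph or statement), not source text
    change: str          # what the suggestion changed
    rationale: str       # why, with a citation to a manual or existing code

@dataclass
class AuditArtifact:
    session_id: str
    suggestion_id: str
    approved_by: str
    policy_checks: List[str] = field(default_factory=list)
    rationale: List[LineRationale] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)
```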