The review involves 25 heavy users of AI tools inside the Pentagon, with officials comparing how competing models respond to identical tasks and how results change when prompts are tailored to each system. Early findings show wide variation across models, suggesting that performance is not simply a question of raw capability but also of how each tool handles military language, classified workflows, coding tasks, analytical reasoning and operational planning.
The exercise reflects the urgency created by the department’s move to phase out Claude after a dispute with Anthropic over permissible military uses. Anthropic has resisted unrestricted deployment of its systems, maintaining that its models should not be used for mass domestic surveillance or fully autonomous weapons. Defence officials have argued that military users need dependable access to advanced AI tools for lawful missions, including intelligence work, software development, simulation, logistics and cyber operations.
Claude had gained a strong position inside parts of the national security apparatus because of its ability to process long documents, assist with code, summarise intelligence material and support planning tasks. Its removal is proving more complex than a routine software switch. Pentagon users have built workflows around the model, while contractors and technical teams face the burden of testing, certification, integration and retraining before any replacement can be trusted in sensitive environments.
The department’s current testing process is therefore being watched closely by defence contractors, cloud providers and AI firms competing for a larger role in government systems. OpenAI, Google, xAI and other model developers are likely to benefit from any opening created by Anthropic’s difficulties, but the Pentagon’s own findings may determine whether one provider dominates or whether the department moves towards a multi-model architecture that reduces dependence on a single vendor.
Officials are examining whether different models should be assigned to different functions rather than choosing one universal successor. A model that performs strongly on coding may not be the best option for intelligence synthesis, while a tool that handles document analysis effectively may be weaker on structured operational tasks. The fact that altered prompts can improve results also points to a growing need for specialised prompt engineering and model governance inside military units.
The dispute has also sharpened a broader policy question: how much control should private AI companies retain after their models enter national security systems? Anthropic’s position has drawn support from those who argue that safeguards are necessary as AI becomes more capable. Critics inside the defence establishment contend that a private company should not be able to restrict lawful military use once its technology is embedded in government operations.
The Pentagon’s designation of Anthropic as a supply-chain risk has added another layer of tension. Such labels carry heavy consequences for contractors, cloud partners and agencies that rely on approved technology stacks. Companies working on defence projects must now assess whether Claude can remain in any workflow tied to Pentagon contracts, even where its use is indirect or embedded through third-party platforms.
Cloud infrastructure firms also face a delicate balancing act. Anthropic’s models are available through major commercial platforms, and enterprise customers outside defence continue to use Claude for business and software tasks. The Pentagon’s stance, however, has created a separate compliance burden for defence-linked customers, especially those operating across both civilian and national security markets.
The testing programme could also influence how future AI contracts are written. Procurement officials are expected to demand clearer terms on acceptable use, model access, auditability, data handling and continuity of service. The department is likely to prefer suppliers that can meet strict security requirements while accepting government-defined mission needs, particularly as AI systems move deeper into classified and operational settings.
For the AI industry, the episode marks a shift from experimental adoption to strategic dependence. Defence users are no longer treating large language models as peripheral productivity tools. They are becoming part of analytical, engineering and planning workflows that affect speed, staffing and decision support. That makes reliability, political resilience and contractual clarity as important as benchmark performance.